
Apache Spark has the following advantages over Hadoop:

(1) Spark is up to 100 times faster than Hadoop for running programs.

(2) Spark can directly use the /gscratch filesystem on hyak, so it also saves the time Hadoop spends loading input data into HDFS and copying output data out of HDFS.

(3) Spark needs a quarter of the disk space, since Hadoop requires 4 copies of the data: the original copy on the native file system /gscratch and 3 copies on HDFS. (HDFS keeps three copies because it assumes a disk is likely to fail while the program is running.)

(4) Spark can be used from interactive Scala (spark-shell) and Python (pyspark) shells. Hadoop has an HDFS shell for HDFS file operations, but it does not have a shell for a standard programming language.

Below is an outline of steps for running Spark on hyak using an interactive Scala or Python Spark Shell.

Install Java 8
Install Java 8, then add the line below to your .bashrc. Change the right-hand side to match the location where you installed the JDK:
export JAVA_HOME=/gscratch/abc/xyz/javastuff/jdk1.8.0_45
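To confirm that the shell picks up this Java, a quick check such as the one below can help (assuming JAVA_HOME points at the JDK you installed):
source ~/.bashrc              # pick up the new JAVA_HOME
$JAVA_HOME/bin/java -version  # should report a 1.8.0_xx version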
Install Spark
From the hyak login node, get an interactive build node:
 srun -p build --time=2:00:00 --mem=100G --pty /bin/bash
mkdir sparkstuff
cd sparkstuff
wget http://www.us.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
tar -xvf spark-1.5.2-bin-hadoop2.6.tgz
exit
Start a multi-node hyak interactive session
From the hyak login node, get an interactive hyak session with 2 nodes and 28 cores per node for 12 hours:
srun -p hpc -A hpc --time=12:00:00 --nodes=2 --ntasks-per-node 28 --mem=248G --pty bash -l


Configure the Spark Cluster
scontrol show hostnames
Note the two nodes assigned to you, e.g. n0020 and n0021.
cd sparkstuff/spark-1.5.2-bin-hadoop2.6/conf
Edit the file 'slaves' and enter the two lines below (modify them to list the two nodes reported by the scontrol command above):
n0020
n0021
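If you prefer not to edit the file by hand, the node list can also be written directly from the same scontrol command, e.g. (a sketch, assuming Spark was unpacked under ~/sparkstuff as above):
scontrol show hostnames > ~/sparkstuff/spark-1.5.2-bin-hadoop2.6/conf/slaves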
Start the Spark Cluster
cd
cd sparkstuff/spark-1.5.2-bin-hadoop2.6
./sbin/start-all.sh
The above command will print an IP and a PORT number. Use them to replace IP and PORT in the commands below.
Start the Scala or Python Spark Shell
Start either the Scala or the Python shell.
The command below starts the Scala Spark shell:
./bin/spark-shell --master spark://IP:PORT
The command below starts the Python Spark shell:
./bin/pyspark --master spark://IP:PORT
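For illustration, if the master were running on node n0020 with Spark's default standalone port 7077, the Python shell command would look like the line below (the node name and port here are only examples; use the values printed for your cluster):
./bin/pyspark --master spark://n0020:7077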
Run programs in the Scala or Python Spark Shell
Use the Scala Spark shell to run the Scala examples, and the Python Spark shell to run the Python examples, from the quick-start link below.
Note that the README.md file used in those examples is a plain text file on the regular hyak file system.
http://spark.apache.org/docs/latest/quick-start.html
Exit the Spark Shell
Use quit() for the Python Spark shell and :quit (or Ctrl-D) for the Scala Spark shell.
Stop the Spark Cluster
cd
cd sparkstuff/spark-1.5.2-bin-hadoop2.6
./sbin/stop-all.sh
Exit the hyak interactive session
exit

Further Reading:

More details are at the link below. It also shows how you can submit jobs to the cluster instead of running them in the Spark shell.

http://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster
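For example, the Pi program bundled with the Spark download could be submitted to the running cluster roughly as below (a sketch, assuming the examples directory shipped with your download; replace IP and PORT with your master's values as before):
./bin/spark-submit --master spark://IP:PORT examples/src/main/python/pi.py 100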
