Apache Spark has the following advantages over Hadoop:
(1) Spark is up to 100 times faster than Hadoop for running programs.
(2) Spark can directly use the /gscratch filesystem on hyak. This also saves time compared to Hadoop, which must first load input data into HDFS and then copy output data back out of HDFS.
(3) Spark needs about a quarter of the disk space, since Hadoop requires 4 copies of the data: one original copy on the native /gscratch filesystem and 3 replicas on HDFS. (HDFS keeps three replicas because it assumes a disk is likely to fail while the program is running.)
(4) Spark can be used interactively from Scala (spark-shell) and Python (pyspark) shells. Hadoop has an HDFS shell for HDFS file operations, but it does not have a shell for a standard programming language.
Below is an outline of the steps for running Spark on hyak using an interactive Scala or Python Spark shell.
Install Java 8
Install Java 8 and put the line below in your .bashrc, changing the right-hand side appropriately for your installation.
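A typical .bashrc entry might look like the following; the install path shown is an assumption, so adjust it to wherever you installed Java 8:

```shell
# Assumed install path -- change to where you installed Java 8
export JAVA_HOME=$HOME/java/jdk1.8.0
export PATH=$JAVA_HOME/bin:$PATH
```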
From the hyak login node, get an interactive build node:
qsub -I -q build
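If the Spark archive is not already on hyak, it can be downloaded first; the Apache archive URL below is assumed for the 1.5.2 / Hadoop 2.6 prebuilt package:

```shell
# Download the prebuilt Spark 1.5.2 package for Hadoop 2.6 (URL assumed)
wget https://archive.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
```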
tar -xvf spark-1.5.2-bin-hadoop2.6.tgz
Start a multi-node hyak interactive session
From the hyak login node, get an interactive hyak session with 2 nodes for 12 hours:
qsub -I -l nodes=2:ppn=16,walltime=12:00:00
Configure the Spark Cluster
Note the two nodes assigned to your session, e.g. n0020 and n0021 (cat $PBS_NODEFILE lists them).
In the Spark conf directory, edit the file 'slaves' and list your two nodes from the cat command above, one per line.
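For the example nodes above, the conf/slaves file would contain the following (substitute your own node names):

```
n0020
n0021
```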
Start the Spark Cluster
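A minimal sketch of starting the cluster from the Spark installation directory, assuming the standalone scripts that ship with Spark (start-all.sh launches the master on the current node and a worker on each node listed in conf/slaves):

```shell
# Start the standalone master and the workers listed in conf/slaves
./sbin/start-all.sh
```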
The command above prints the master URL in the form spark://IP:PORT. Use that IP and PORT in the commands below.
Start the Scala or Python Spark Shell
Start either the Scala or the Python shell.
The command below starts a Scala Spark shell.
./bin/spark-shell --master spark://IP:PORT
The command below starts a Python Spark shell.
./bin/pyspark --master spark://IP:PORT
Run programs in the Scala or Python Spark Shell
Use the Scala Spark shell to run the Scala examples, and the Python Spark shell to run the Python examples, in the quick-start.html link below.
Note that the README.md file used in the example below is a plain text file on the regular hyak file system.
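For instance, the first quick-start exercise in the Scala shell counts the lines of README.md; sc is the SparkContext that the shell creates for you (the exact counts depend on your Spark version's README.md):

```scala
// Load README.md from the current directory as an RDD of lines
val textFile = sc.textFile("README.md")

// Count the total number of lines
textFile.count()

// Count only the lines that mention "Spark"
textFile.filter(line => line.contains("Spark")).count()
```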
Exit the Spark Shell
Use quit() for the Python Spark shell and exit for the Scala Spark shell.
Stop the Spark Cluster
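Assuming the cluster was started with the standalone scripts, the matching stop command, run from the Spark installation directory, is:

```shell
# Stop the workers and the master started by start-all.sh
./sbin/stop-all.sh
```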
Exit the hyak interactive session
More details are at the link below, which also shows how to submit jobs to the cluster instead of running them in the Spark shell.