Name: SparkSlurm
Owner: Duke Center for Genomic and Computational Biology
Description: Notes and scripts for running Spark on a slurm cluster.
Created: 2016-05-27 18:53:52.0
Updated: 2016-05-27 18:53:52.0
Pushed: 2016-06-01 17:08:42.0
Homepage: null
Size: 51
Language: null
Running Spark on a Slurm cluster. This works by running Spark in standalone mode: see the standalone docs at https://spark.apache.org/docs/latest/spark-standalone.html.
Steps:
1. Download Spark and set SPARK_HOME:
       export SPARK_HOME=.../spark-2.3.0-bin-hadoop2.7
2. Edit spark.sbatch, replacing spark-2.3.0-bin-hadoop2.7 with the version of Spark you downloaded. You may also edit it to increase the number of nodes/cpus/memory used by your Spark cluster (a minimal sketch of such a script appears after these steps).
3. Start an sbatch job that will run your cluster. This will start up multiple nodes and continue running until you scancel this job:
       sbatch spark.sbatch
4. Check the slurm-*.out file created by this job. The top line should contain the Spark master address, e.g. spark://<nodename>:7077. This needs to be passed in to your Spark commands:
       export SPARK_MASTER=spark://<nodename>:7077
5. Run an example job against the cluster:
       $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $SPARK_HOME/examples/src/main/python/pi.py
6. When you are done, shut down the cluster:
       scancel <JOBID>

To run the same example locally instead of on the cluster, omit --master (spark-submit then defaults to local mode):
    $SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/python/pi.py
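The spark.sbatch script itself is not shown above. As a rough sketch of what such a script can look like, assuming a standalone master started on the first allocated node and one worker per node (the node count, memory, and port here are placeholders to adjust for your cluster):

    #!/bin/bash
    #SBATCH --job-name=spark-cluster
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=32G

    # The batch script itself runs on the first allocated node, so start
    # the standalone master here and print its address as the first line
    # of the slurm-*.out log.
    MASTER_HOST=$(hostname -f)
    echo "spark://${MASTER_HOST}:7077"
    $SPARK_HOME/sbin/start-master.sh --host "$MASTER_HOST" --port 7077

    # Launch one worker per allocated node, pointed at the master.
    # srun blocks here, keeping the cluster alive until you scancel the job.
    srun --ntasks="$SLURM_NNODES" --ntasks-per-node=1 \
        "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker \
        "spark://${MASTER_HOST}:7077"

With the master address echoed on the first line as above, SPARK_MASTER can be set straight from the log (substitute the real job id):

    export SPARK_MASTER=$(head -1 slurm-<JOBID>.out)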
If you see kryoserializer errors where file chunks were too big to be passed around, fix this by creating a config file from the bundled template:
    cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
Then add the following to the end of $SPARK_HOME/conf/spark-defaults.conf:
    spark.kryoserializer.buffer.max 1g
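The same setting can also be supplied per job via spark-submit's --conf flag instead of editing the shared config file, for example when rerunning the pi example with the larger buffer:

    $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER \
        --conf spark.kryoserializer.buffer.max=1g \
        $SPARK_HOME/examples/src/main/python/pi.py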