FredHutch/au-slurm-package

Name: au-slurm-package

Owner: Fred Hutchinson Cancer Research Center

Description: The various configs and tools we use on the GenomeDK cluster

Forked from: runefriborg/au-slurm-package

Created: 2015-12-10 17:09:25.0

Updated: 2015-12-10 17:09:26.0

Pushed: 2015-12-08 14:41:56.0

Homepage: null

Size: 474 KB

Language: Python


README

au-slurm-package

The various configs and tools we use on the GenomeDK cluster

Folder overview
folder          install location        note
------          ----------------        ----
config/         /opt/slurm/etc/         Our configuration files (see further notes before installing)
scripts/        /opt/slurm/scripts/     This folder holds the various prolog/epilog scripts
replacements/   /opt/slurm/bin/         Replacements for most of our old tools
tools/          /opt/slurm/bin/         New tools to make things nicer for the user
support-bin/    various                 A few supporting programs
init.d/         /etc/init.d/            Simple script for starting/stopping slurm
ganglia/        ...                     Ganglia gmetric script and web scripts
Config files

You need to pay attention to what you are installing on what machines here. slurmdbd.conf contains the user/pass for the database and should only be installed on the controller machine(s).

slurm.conf

This is the main config file that is needed on all machines. Theoretically you could probably get away with a smaller version on compute nodes and frontends, but the files are compared via hashing by default so just install identical configs everywhere.

Must be readable by all users.
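
For orientation, the scripts and plugins described below are wired up through a handful of slurm.conf options. The option names are standard SLURM parameters; the paths shown as output below are illustrative guesses based on the install locations above, not a copy of our actual config:

grep -E '^(Prolog|Epilog|TaskProlog|JobSubmitPlugins|AccountingStorageType)=' /opt/slurm/etc/slurm.conf
# Prolog=/opt/slurm/scripts/slurm-prolog
# Epilog=/opt/slurm/scripts/slurm-epilog
# TaskProlog=/opt/slurm/scripts/slurm-task-prolog
# JobSubmitPlugins=lua
# AccountingStorageType=accounting_storage/slurmdbd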

job_submit.lua

This script is only needed on the controller, but is not sensitive so you can install it everywhere if that is easier. It does two things for our setup:

  1. If a job is submitted to the express queue, it is also added to the normal queue.
  2. If no specific stdout/stderr file names have been requested, we set the output to jobname-jobid.out instead of the standard slurm-jobid.out (see the example below).

Must be readable by the user that slurmctld runs under.
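
A quick way to see the effect from the user's side (align.sh and the job name here are made up; the plugin code itself is not shown in this README):

sbatch -p express --job-name=align align.sh
# With no -o/--output given, stdout ends up in align-<jobid>.out rather than
# the usual slurm-<jobid>.out, and squeue should list the job as eligible for
# both the express and normal partitions.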

slurmdbd.conf

This is the config for the accounting module. Since it has the password and user for the database it is important that it is not accessible to regular users.

Must be readable by the user that slurmdbd runs under.
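
Since the file only needs to be readable by the slurmdbd user, a typical way to lock it down looks like this (assuming slurmdbd runs as a dedicated slurm user; adjust the owner to match your setup):

chown slurm:slurm /opt/slurm/etc/slurmdbd.conf
chmod 600 /opt/slurm/etc/slurmdbd.conf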

cgroup.conf

We configure cgroups to constrain cores.
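
A minimal cgroup.conf along those lines could look like the excerpt below. The release-agent directory matches the symlink setup described in the cgroups section further down, but treat the exact contents as a sketch rather than a copy of our production file:

cat /opt/slurm/etc/cgroup.conf
# CgroupAutomount=yes
# CgroupReleaseAgentDir="/opt/slurm/scripts/cgroup"
# ConstrainCores=yes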

Scripts
slurm-prolog & slurm-epilog

These are the standard prolog and epilog scripts that run, with root permissions, before and after a job. By default slurm runs the epilog on all nodes involved in a job at the end of the job, as expected. I found the prolog behaviour surprising, though: it only runs on a node when the job actually starts something on that node. That means that with a script like this:

#!/bin/bash
#SBATCH -n 32
echo nothing
sleep 1000
srun hostname

The prolog will run immediately on one node; the other nodes will only run it when srun starts, leaving 1000 seconds during which the user either can't ssh in or can ssh in to a node that hasn't been set up. To change this we have set PrologFlags=Alloc in slurm.conf, which ensures that the prolog is run on all machines as soon as they are allocated to a job.
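
After restarting slurmctld you can confirm the flag took effect with scontrol; the output line below is what we expect to see, not captured output:

scontrol show config | grep -i prologflags
# PrologFlags             = Alloc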

The scripts themselves are pretty simple. We create job-specific folders, make sure our audit service is running, and call bash-login-update to open for ssh connections from the user. The epilog then closes for ssh connections from the user (disconnecting them and deleting all their /tmp data), deletes the job-specific folders, and runs a sanity check to make sure the node is still healthy.

Must be present on all compute-nodes.
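
For a feel of the shape of these scripts without opening scripts/slurm-prolog, here is a heavily simplified sketch. The scratch path, the audit service name, and the bash-login-update invocation are placeholders, not the real script:

#!/bin/bash
# create the job-specific folder (actual path is a placeholder)
mkdir -p "/scratch/${SLURM_JOB_ID}"
# make sure the audit service is running (service name is a placeholder)
service auditd status >/dev/null 2>&1 || service auditd start
# open the node for ssh connections from the job owner; the real
# bash-login-update interface may differ from this guess
bash-login-update open "${SLURM_JOB_USER}"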

slurm-task-prolog

The task prolog runs as the user, just before the user's script; it sets a few environment variables for compatibility with the old Torque system.

Must be present on all compute-nodes.
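
The mechanism is worth noting: slurmstepd reads the task prolog's stdout, and lines of the form export NAME=value are injected into the task's environment. The PBS_* names below are illustrative Torque-style variables, not necessarily the exact set this script exports:

#!/bin/bash
# printed "export ..." lines become environment variables in the user's task
echo "export PBS_JOBID=${SLURM_JOB_ID}"
echo "export PBS_O_WORKDIR=${SLURM_SUBMIT_DIR}"
echo "export PBS_QUEUE=${SLURM_JOB_PARTITION}"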

controller-prolog & slurm-remote-prolog

We don't want a node to take a job and then immediately fail. This should probably be avoidable by putting a sanity check in the regular prolog script, but we couldn't get that to work, so we went for another solution.

When the controller has found a suitable set of nodes to run a job, it calls the controller-prolog. The controller-prolog script then connects to all the proposed nodes and has them run a sanity check (the slurm-remote-prolog). If any of the nodes fail, the proposed set of nodes is discarded and the job goes back in the queue.

The remote prolog must be present on all compute-nodes; the controller prolog only needs to be on the controller.
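
The obvious hook for this on the controller is the PrologSlurmctld option in slurm.conf; this README doesn't spell that out, so take the wiring as an assumption. The core of the idea is just a loop like this sketch (ssh options and error handling simplified):

#!/bin/bash
# run the sanity check on every node proposed for the job; exiting non-zero
# makes the controller abandon this allocation instead of starting the job
# on a possibly broken node
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    ssh -o BatchMode=yes "$node" /opt/slurm/scripts/slurm-remote-prolog || exit 1
done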

Tools

We have only needed one completely new tool so far. jobinfo collects the most useful fields from sacct (and sstat for running jobs) and presents them in a format that is easier to read and grep.

It takes sacct's very wide format, with multiple entries per job, like this:

   JobID        JobName    Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  ...
   ------------ ---------- ---------- ---------- -------------- -------------- ...
   219304               94 express,n+                                          ...
   219304.batch      batch               314132K         s01n36              0 ...

And converts it into something like this:

Name                : 94
User                : qianyuxx
Partition           : express,normal
Nodes               : s01n36
...
Max Mem used        : 3.54M (s01n36)
Max Disk Write      : 348.00M (s01n36)
Max Disk Read       : 348.00M (s01n36)
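
Under the hood this boils down to asking sacct (and sstat for running jobs) for a fixed list of fields and reformatting them. The exact field list jobinfo requests isn't shown here, so the one below is a plausible subset rather than the real thing:

sacct -j 219304 --parsable2 \
      --format=JobID,JobName,User,Partition,NodeList,MaxVMSize,MaxRSS,MaxDiskRead,MaxDiskWrite,State,Elapsed
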
Misc

The slurm_ld.conf file is put into /etc/ld.so.conf.d/ to make sure the binaries can find the libraries they need.
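
In practice that means something like the following, assuming the libraries live under the /opt/slurm prefix used elsewhere in this README (the actual path inside slurm_ld.conf is not shown here):

# hypothetical contents of /etc/ld.so.conf.d/slurm_ld.conf
echo '/opt/slurm/lib' > /etc/ld.so.conf.d/slurm_ld.conf
# rebuild the linker cache so the new path is picked up
ldconfig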

cgroups

Enabling cgroups means that whenever a job is started it is allocated a set of cores, and every subprocess of the job is bound to those cores. This means a bad job can push the load average of a machine to 100 with no discernible impact on the other jobs.

The install procedure is to put the cgroup.conf file next to the slurm.conf file on all compute nodes and to install the slurm release_common script where your CgroupReleaseAgentDir variable points. Finally, create symlinked aliases of the script for each subsystem, like this:

cd /opt/slurm/scripts/cgroup/
for subsystem in blkio cpuacct cpuset freezer memory; do
    ln -s release_common release_$subsystem 
done
simplified-install.sh

This is a slightly simplified and cleaned-up version of our install script; it probably has a few missing or broken things in it.

init.d/slurm

Very primitive script for starting and stopping slurm - no proper header, no status function, and it can probably wait forever when shutting down. It automatically finds out which services need to run on the machine (which might be none if the machine is only used for submitting jobs).

There is no difference between starting and restarting; the slurm daemons figure out on their own whether they need to replace an old process.
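
Usage is the usual init-script pattern; the script itself works out which of the daemons (typically slurmctld, slurmd and/or slurmdbd) apply to the host it runs on:

/etc/init.d/slurm start     # also covers "restart", given the behaviour above
/etc/init.d/slurm stop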

Ganglia

It is very simple to configure if you already have a running Ganglia monitoring system.

  1. Edit the constants in ganglia/gmetric/slurm-gmetric and start it from a host with access to the slurm executables (a sketch of the kind of gmetric call this involves is shown after these steps).

  2. Copy the files from ganglia/www to your ganglia web installation (3.6.0) and point your browser to http://ganglia-installation/slurm.php
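
For reference, a script like slurm-gmetric boils down to periodically pushing numbers into Ganglia with gmetric. The metric below (a count of running jobs) is an illustrative example, not necessarily one the script actually reports:

gmetric --name=slurm_running_jobs --type=uint32 --units=jobs \
        --value="$(squeue -h -t RUNNING | wc -l)"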

