JuliaParallel/ClusterManagers.jl

Name: ClusterManagers.jl

Owner: Parallel Julia

Created: 2013-07-08 09:32:59.0

Updated: 2017-11-14 22:54:01.0

Pushed: 2017-12-31 16:14:20.0

Size: 71

Language: Julia

README

ClusterManagers

Support for different job queue systems commonly used on compute clusters.

Currently supported job queue systems

| Job queue system | Command to add processors |
| ---------------- | ------------------------- |
| Sun Grid Engine | addprocs_sge(np::Integer, queue="") or addprocs(SGEManager(np, queue)) |
| PBS | addprocs_pbs(np::Integer, queue="") or addprocs(PBSManager(np, queue)) |
| Scyld | addprocs_scyld(np::Integer) or addprocs(ScyldManager(np)) |
| HTCondor | addprocs_htc(np::Integer) or addprocs(HTCManager(np)) |
| Slurm | addprocs_slurm(np::Integer; kwargs...) or addprocs(SlurmManager(np); kwargs...) |
| Local manager with CPU affinity setting | addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...) |
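
For example, requesting workers through PBS might look like the sketch below; the worker count and the queue name "batch" are placeholders for whatever your site actually provides:

```julia
using ClusterManagers

# Request 8 workers from PBS, on a (hypothetical) queue named "batch".
addprocs_pbs(8, "batch")

# Equivalent form using the manager type directly:
# addprocs(PBSManager(8, "batch"))
```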

You can also write your own custom cluster manager; see the instructions in the Julia manual.
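
For orientation only, the sketch below shows the general shape of such a manager: it subtypes ClusterManager and implements the launch and manage methods from the Distributed standard library, loosely following the LocalManager example in the Julia manual. Everything here is an illustrative assumption rather than part of ClusterManagers.jl: the name MiniManager, the choice to start plain local workers, and the use of the internal Distributed.write_cookie helper.

```julia
using Distributed
import Distributed: launch, manage

# Hypothetical manager that simply starts `np` local worker processes.
# A real cluster manager would submit the worker command to its queue system.
struct MiniManager <: Distributed.ClusterManager
    np::Int
end

function launch(manager::MiniManager, params::Dict, launched::Array, c::Condition)
    exename  = params[:exename]
    exeflags = params[:exeflags]
    for _ in 1:manager.np
        # Start a worker; `--worker` makes it read the cluster cookie from stdin
        # and print its listen address on stdout.
        io = open(detach(`$exename $exeflags --worker`), "r+")
        Distributed.write_cookie(io)      # internal helper: sends the cookie to the worker
        wconfig = Distributed.WorkerConfig()
        wconfig.io = io.out               # the master reads the worker's host:port from here
        push!(launched, wconfig)
    end
    notify(c)
end

# React to worker lifecycle events (:register, :interrupt, :deregister, ...).
manage(manager::MiniManager, id::Integer, config::Distributed.WorkerConfig, op::Symbol) = nothing

# Usage: addprocs(MiniManager(2))
```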

Slurm: a simple example
```julia
using ClusterManagers

# Arguments to the Slurm srun(1) command can be given as keyword
# arguments to addprocs.  The argument name and value are translated to
# an srun(1) command line argument as follows:
# 1) If the length of the argument is 1 => "-arg value",
#    e.g. t="0:1:0" => "-t 0:1:0"
# 2) If the length of the argument is > 1 => "--arg=value",
#    e.g. time="0:1:0" => "--time=0:1:0"
# 3) If the value is the empty string, it becomes a flag,
#    e.g. exclusive="" => "--exclusive"
# 4) Underscores ("_") in the argument name are replaced with dashes ("-"),
#    e.g. mem_per_cpu=100 => "--mem-per-cpu=100"
addprocs(SlurmManager(2), partition="debug", t="00:5:00")

hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end

# The Slurm resource allocation is released when all the workers have exited.
for i in workers()
    rmprocs(i)
end
```

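As a further illustration of these translation rules, the call below combines all four cases; the partition name, time limit, and memory value are hypothetical and must match what your Slurm installation actually provides:

```julia
using Distributed       # provides addprocs on Julia >= 0.7
using ClusterManagers

# partition="gpu"  => --partition=gpu    (multi-character name  => "--arg=value")
# t="00:30:00"     => -t 00:30:00        (single-character name => "-arg value")
# exclusive=""     => --exclusive        (empty string          => bare flag)
# mem_per_cpu="2G" => --mem-per-cpu=2G   (underscores           => dashes)
addprocs(SlurmManager(4), partition="gpu", t="00:30:00", exclusive="", mem_per_cpu="2G")
```
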
SGE - a simple interactive example
```julia
julia> using ClusterManagers

julia> ClusterManagers.addprocs_sge(5)
job id is 961, waiting for job to start .
5-element Array{Any,1}:
2
3
4
5
6

julia> @parallel for i=1:5
       run(`hostname`)
       end

julia>  From worker 2:  compute-6
        From worker 4:  compute-6
        From worker 5:  compute-6
        From worker 6:  compute-6
        From worker 3:  compute-6
```
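
To gather the hostnames into a variable on the master instead of printing them on the workers, and to release the SGE job afterwards, something along these lines works (a small sketch using only functionality shown elsewhere in this README):

```julia
using Distributed       # provides addprocs, workers, rmprocs, @spawnat on Julia >= 0.7
using ClusterManagers

addprocs_sge(5)

# Fetch each worker's hostname back to the master process.
hostnames = [fetch(@spawnat w gethostname()) for w in workers()]

# Remove the workers, which releases the SGE job.
rmprocs(workers())
```
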
SGE - an example with resource list

Some clusters require the user to specify a list of required resources. For example, it may be necessary to specify how much memory will be needed by the job - see this issue.

```julia
julia> using ClusterManagers

julia> addprocs_sge(5,res_list="h_vmem=4G,tmem=4G")
job id is 9827051, waiting for job to start ........
5-element Array{Int64,1}:
22
23
24
25
26

julia> pmap(x->run(`hostname`),workers());

julia>  From worker 26: lum-7-2.local
        From worker 23: pace-6-10.local
        From worker 22: chong-207-10.local
        From worker 24: pace-6-11.local
        From worker 25: cheech-207-16.local
```
Using LocalAffinityManager (for pinning local workers to specific cores)

Workers started through the LocalAffinityManager run on the local machine and are pinned to specific CPU cores. In the constructor shown in the table above, np is the number of workers to start, affinities is an optional list of CPU IDs to pin the workers to (one worker per entry), and mode (default BALANCED) controls how the workers are distributed over the available cores when affinities is not given.
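
A minimal usage sketch follows; the CPU IDs and worker count are hypothetical, CPU_CORES is the Julia 0.6 name (Sys.CPU_THREADS on later versions), and the qualified ClusterManagers.BALANCED is used in case the enum value is not exported:

```julia
using Distributed       # provides addprocs on Julia >= 0.7
using ClusterManagers

# Pin one worker to each of four explicitly chosen cores.
addprocs(LocalAffinityManager(affinities=[0, 1, 2, 3]))

# Or start 4 workers and let the manager spread them over the available cores.
# addprocs(LocalAffinityManager(np=4, mode=ClusterManagers.BALANCED))
```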

Using ElasticManager (dynamically adding workers to a cluster)

The ElasticManager is useful in scenarios where we want to dynamically add workers to a cluster. It achieves this by listening on a known port on the master. The launched workers connect to this port and publish their own host/port information for other workers to connect to.

Usage

On the master, you need to instantiate an instance of ElasticManager. The constructors defined are:

```julia
ElasticManager(;addr=IPv4("127.0.0.1"), port=9009, cookie=nothing, topology=:all_to_all)
ElasticManager(port) = ElasticManager(;port=port)
ElasticManager(addr, port) = ElasticManager(;addr=addr, port=port)
ElasticManager(addr, port, cookie) = ElasticManager(;addr=addr, port=port, cookie=cookie)
```

On the worker, you need to call ClusterManagers.elastic_worker with the addr/port that the master is listening on and the same cookie. elastic_worker is defined as:

```julia
ClusterManagers.elastic_worker(cookie, addr="127.0.0.1", port=9009; stdout_to_master=true)
```

For example, on the master:

```julia
using ClusterManagers
ElasticManager(cookie="foobar")
```

and launch each worker locally as echo "using ClusterManagers; ClusterManagers.elastic_worker(\"foobar\")" | julia &

or if you want a REPL on the worker, you can start a julia process normally and manually enter

```julia
using ClusterManagers
@schedule ClusterManagers.elastic_worker("foobar", "addr_of_master", port_of_master; stdout_to_master=false)
```

The above will yield back the REPL prompt and also display any printed output locally.
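
To launch several elastic workers at once from Julia itself rather than from the shell, a loop along these lines can be used; the cookie, address, port, and worker count below are hypothetical:

```julia
# Spawn four detached julia processes, each connecting back to the master's
# ElasticManager port as an elastic worker.
worker_code = """using ClusterManagers; ClusterManagers.elastic_worker("foobar", "192.168.1.10", 9009)"""
for _ in 1:4
    run(`julia -e $worker_code`; wait=false)
end
```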


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.