Name: expert-enigma
Owner: Fred Hutchinson Cancer Research Center
Description: debugging rmpi problem
Created: 2017-08-24 16:47:59.0
Updated: 2017-08-24 16:48:16.0
Pushed: 2017-08-24 18:30:04.0
Homepage: null
Size: 11
Language: R
These two “master” examples demonstrate a difference in behavior
between the Rmpi functions mpi.bcast() and mpi.bcast.Robj2slave().
The former causes the script to hang with the slaves running at 100% CPU.
The latter seems to work.
Start an interactive job on multiple nodes:
srun -N 4 -n 4 --pty /bin/bash -i
Load R (3.3.3) via ml:
ml R
Run master.R via mpirun:
of2[~/Work/]: mpirun -n 1 ./master.R
"universe is: 4"
"starting 4 slaves"
"giving 25000 iterations per slave"
"creating mpi machine"
slaves are spawned successfully. 0 failed.
er (rank 0, comm 1) of size 5 is running on: gizmof2
e1 (rank 1, comm 1) of size 5 is running on: gizmof3
e2 (rank 2, comm 1) of size 5 is running on: gizmof4
e3 (rank 3, comm 1) of size 5 is running on: gizmof5
e4 (rank 4, comm 1) of size 5 is running on: gizmof6
"broadcasting iters_per_slave"
"broadcasting functions"
"exec-ing command"
"pi is: 3.14192"
"Done"
1
This works. The broken one is run similarly, but you will see different behavior:
of2[~/Work/]: mpirun -n 1 ./master-broken.R
"universe is: 4"
"starting 4 slaves"
"giving 25000 iterations per slave"
"creating mpi machine"
slaves are spawned successfully. 0 failed.
er (rank 0, comm 1) of size 5 is running on: gizmof2
e1 (rank 1, comm 1) of size 5 is running on: gizmof3
e2 (rank 2, comm 1) of size 5 is running on: gizmof4
e3 (rank 3, comm 1) of size 5 is running on: gizmof5
e4 (rank 4, comm 1) of size 5 is running on: gizmof6
"broadcasting iters_per_slave"
25000
"broadcasting functions"
"exec-ing command"
The job will hang at this point. If you check the slaves, you will see
R processes running at 100% CPU. The intermediate log files produced by
Rmpi will only show:
of2[~/Work/]: cat gizmof5.8740+1.3709.log
st: gizmof5 Rank(ID): 3 of Size: 5 on comm 1
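One way to check the slaves for busy R processes is to filter ps output on each node. The following is only a sketch: busy_r_procs is a hypothetical helper name, the 90% default threshold is arbitrary, and it assumes the worker processes show up in ps with the command name R.

```shell
# Hypothetical helper (not part of this repository).
# Reads `ps -eo pid,pcpu,comm` output on stdin and prints "PID CPU%" for
# every process whose command name is exactly R and whose CPU usage is at
# or above the threshold (first argument, default 90).
# Assumption: the slave processes appear in ps with the command name "R".
busy_r_procs() {
    threshold="${1:-90}"
    # The ps header line is skipped implicitly because its third field is
    # "COMMAND", not "R".
    awk -v t="$threshold" '$3 == "R" && $2 + 0 >= t + 0 { print $1, $2 }'
}
```

It could be used from the master node as, for example, `ssh gizmof5 ps -eo pid,pcpu,comm | busy_r_procs 90`, assuming ssh access to the compute nodes is available.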
Two ^C in succession will kill the job.
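When reproducing the hang repeatedly, it may be more convenient to put a time limit on the run than to interrupt it by hand. The sketch below assumes GNU coreutils timeout is available; run_with_limit is a hypothetical wrapper name, and the 10 second grace period before SIGKILL is an arbitrary choice. Whether the spawned slaves are also cleaned up when mpirun is terminated this way has not been verified here.

```shell
# Hypothetical wrapper (not part of this repository).
# Runs a command under a time limit using GNU coreutils `timeout`.
# `timeout` sends SIGTERM when the limit is reached and, because of
# --kill-after, SIGKILL 10 seconds later if the command is still alive.
# It exits with 124 when the limit was hit, or 137 if SIGKILL was needed.
run_with_limit() {
    limit="$1"
    shift
    timeout --kill-after=10 "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "time limit ($limit) reached, command stopped: $*" >&2
    fi
    return "$rc"
}
```

For example, `run_with_limit 120 mpirun -n 1 ./master-broken.R` would give the broken script two minutes before stopping it, and a non-zero exit status then indicates that it did not finish on its own.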