FredHutch/expert-enigma

Name: expert-enigma

Owner: Fred Hutchinson Cancer Research Center

Description: debugging rmpi problem

Created: 2017-08-24 16:47:59.0

Updated: 2017-08-24 16:48:16.0

Pushed: 2017-08-24 18:30:04.0

Homepage: null

Size: 11

Language: R

README

Breaking things with R

These two “master” examples (master.R and master-broken.R) demonstrate a difference in behavior between the mpi.bcast() and mpi.bcast.Robj2slave() functions. Using the former causes the script to hang, with the slave R processes running at 100% CPU. The latter seems to work.
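The working pattern can be sketched as follows. This is only an illustration: master.R is not reproduced in this README, so the worker function count_hits(), the helper estimate_pi(), and the driver run_master() are hypothetical names, and the Monte Carlo method is an assumption. One point that is general MPI behavior: mpi.bcast() is a collective call that every rank in the communicator has to make, whereas mpi.bcast.Robj2slave() also instructs the spawned slaves to receive and assign the object. Whether that mismatch is the actual cause of the hang is not confirmed here.

```r
# Sketch only: names below are hypothetical, not taken from master.R.

# Hypothetical per-slave task: count random points that fall inside the
# unit quarter circle.
count_hits <- function(n) {
  x <- runif(n)
  y <- runif(n)
  sum(x * x + y * y <= 1)
}

# Combine the per-slave hit counts into an estimate of pi.
estimate_pi <- function(hits, iters_per_slave) {
  4 * sum(hits) / (length(hits) * iters_per_slave)
}

# Master-side driver. It needs Rmpi and an MPI launcher, so it is only
# defined here and never called automatically.
run_master <- function(iters_per_slave = 25000) {
  library(Rmpi)
  n_slaves <- mpi.universe.size()
  mpi.spawn.Rslaves(nslaves = n_slaves)

  # mpi.bcast.Robj2slave() sends each object to every slave and assigns it
  # there under the same name; the slaves need no matching call of their own.
  mpi.bcast.Robj2slave(iters_per_slave)
  mpi.bcast.Robj2slave(count_hits)

  # Evaluate the expression on every slave and collect the results.
  hits <- mpi.remote.exec(count_hits(iters_per_slave))

  mpi.close.Rslaves()
  estimate_pi(unlist(hits), iters_per_slave)
}
```

If this were saved as a script, calling run_master() at the end and launching it with mpirun -n 1 would mirror the steps below.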

To run:

Start an interactive job on multiple nodes:

srun -N 4 -n 4 --pty /bin/bash -i

Load R (3.3.3) via ml:

ml R

Run master.R via mpirun:

of2[~/Work/]: mpirun -n 1 ./master.R 
"universe is:  4"
"starting  4  slaves"
"giving 25000 iterations per slave"
"creating mpi machine"
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: gizmof2 
slave1 (rank 1, comm 1) of size 5 is running on: gizmof3 
slave2 (rank 2, comm 1) of size 5 is running on: gizmof4 
slave3 (rank 3, comm 1) of size 5 is running on: gizmof5 
slave4 (rank 4, comm 1) of size 5 is running on: gizmof6 
"broadcasting iters_per_slave"
"broadcasting functions"
"exec-ing command"
"pi is: 3.14192"
"Done"
1
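The log lines above suggest an even split of work: a universe of 4 with 25000 iterations per slave. A serial baseline that mimics this split can help confirm the worker logic independently of MPI. This is a sketch under assumptions: the Monte Carlo method, the total of 100000 iterations (inferred as 4 * 25000), and the names total_iters and count_hits are not taken from master.R.

```r
# Assumed values: 4 matches "universe is: 4"; the total is inferred,
# not stated in the README.
n_slaves <- 4
total_iters <- 100000
iters_per_slave <- total_iters %/% n_slaves  # 25000

# Hypothetical worker: count random points inside the unit quarter circle.
count_hits <- function(n) {
  x <- runif(n)
  y <- runif(n)
  sum(x^2 + y^2 <= 1)
}

set.seed(42)  # fixed seed so the baseline is repeatable

# Run the worker once per simulated slave, serially, in a single R process.
hits <- vapply(seq_len(n_slaves),
               function(i) count_hits(iters_per_slave),
               integer(1))

pi_estimate <- 4 * sum(hits) / total_iters
print(paste("pi is:", pi_estimate))
```

If the serial result is close to pi but the MPI run hangs, the problem more likely lies in the communication step than in the worker function.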

This works. The broken one is run similarly, but you will see different behavior:

of2[~/Work/]: mpirun -n 1 ./master-broken.R 
"universe is:  4"
"starting  4  slaves"
"giving 25000 iterations per slave"
"creating mpi machine"
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: gizmof2 
slave1 (rank 1, comm 1) of size 5 is running on: gizmof3 
slave2 (rank 2, comm 1) of size 5 is running on: gizmof4 
slave3 (rank 3, comm 1) of size 5 is running on: gizmof5 
slave4 (rank 4, comm 1) of size 5 is running on: gizmof6 
"broadcasting iters_per_slave"
25000
"broadcasting functions"
"exec-ing command"

The job will hang at this point. If you check the slave nodes, you will see R processes running at 100% CPU. The intermediate log files produced by Rmpi will only show:

of2[~/Work/]: cat gizmof5.8740+1.3709.log 
Host: gizmof5   Rank(ID): 3   of Size: 5 on comm 1 

Two ^C in quick succession will kill the job.

