Name: scr
Owner: Lawrence Livermore National Laboratory
Description: SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
Created: 2012-12-29 20:48:21.0
Updated: 2018-03-27 06:00:39.0
Pushed: 2018-02-02 14:44:57.0
Homepage: http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi
Size: 304618
Language: C
The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing and restarting large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.
Detailed usage is provided at SCR.ReadTheDocs.io.
SCR uses the CMake build system and we recommend out-of-source builds.
    git clone git@github.com:llnl/scr.git
    mkdir build
    mkdir install
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
    make
    make install
    make test
Some useful CMake command line options:
-DCMAKE_INSTALL_PREFIX=[path]: Place to install the SCR library
-DCMAKE_BUILD_TYPE=[Debug/Release]: Build with debugging or optimizations
-DBUILD_PDSH=[OFF/ON]: CMake can automatically download and build the PDSH dependency
-DWITH_PDSH_PREFIX=[path to PDSH]: Path to an existing PDSH installation (should not be used with BUILD_PDSH)
-DWITH_DTCMP_PREFIX=[path to DTCMP]
-DWITH_YOGRT_PREFIX=[path to YOGRT]
-DSCR_ASYNC_API=[CRAY_DW/INTEL_CPPR/IBM_BBAPI/NONE]
-DSCR_RESOURCE_MANAGER=[SLURM/APRUN/PMIX/LSF/NONE]
SCR searches the following locations in the following order for a parameter value, taking the first value it finds.
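The "first value found wins" rule can be illustrated with a minimal C sketch: an ordered array of sources is scanned, and the search stops at the first source that returns a value. The names `kv_pair`, `param_source`, `env_get`, `table_get`, and `param_lookup` are hypothetical and are not part of the SCR API; the sketch makes no claim about which locations SCR actually consults or how it implements the search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One key/value pair in an in-memory table (hypothetical type). */
typedef struct {
    const char *key;
    const char *value;
} kv_pair;

/* A parameter source: a lookup callback plus its private context. */
typedef struct {
    const char *(*get)(const char *name, const void *ctx);
    const void *ctx;
} param_source;

/* Source backed by the process environment; ctx is unused. */
const char *env_get(const char *name, const void *ctx)
{
    (void)ctx;
    return getenv(name);
}

/* Source backed by a kv_pair array terminated by an entry with a NULL key. */
const char *table_get(const char *name, const void *ctx)
{
    const kv_pair *p;
    for (p = (const kv_pair *)ctx; p != NULL && p->key != NULL; p++) {
        if (strcmp(p->key, name) == 0) {
            return p->value;
        }
    }
    return NULL;
}

/* Try each source in array order; return the first value found, or NULL. */
const char *param_lookup(const param_source *sources, size_t count,
                         const char *name)
{
    size_t i;
    assert(sources != NULL && name != NULL);
    for (i = 0; i < count; i++) {
        const char *value = sources[i].get(name, sources[i].ctx);
        if (value != NULL) {
            return value;
        }
    }
    return NULL;
}
```

Placing `{ env_get, NULL }` ahead of a table-backed source in the array would let an environment variable override the table, since earlier entries shadow later ones.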
To find a user configuration file, SCR looks for a file named .scrconf in the prefix directory (note the leading dot).
Alternatively, one may specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time.
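The choice between the default file and the SCR_CONF_FILE override can be sketched in C as follows. The function `user_conf_path` is a hypothetical helper, not part of the SCR API. It takes the value of SCR_CONF_FILE as an argument so that the logic stays independent of the process environment, and it assumes that an empty value should be treated the same as an unset variable, which the text above does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the user configuration file path into out (capacity size).
 * conf_file_env is the value of SCR_CONF_FILE, or NULL if it is unset.
 * Returns 0 on success, -1 if the path does not fit in out.
 */
int user_conf_path(const char *conf_file_env, const char *prefix,
                   char *out, size_t size)
{
    int n;

    assert(prefix != NULL && out != NULL && size > 0);

    if (conf_file_env != NULL && strlen(conf_file_env) > 0) {
        /* An explicit override names the file and its location. */
        n = snprintf(out, size, "%s", conf_file_env);
    } else {
        /* Default: .scrconf (leading dot) in the prefix directory. */
        n = snprintf(out, size, "%s/.scrconf", prefix);
    }
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

A caller would typically pass `getenv("SCR_CONF_FILE")` as the first argument and the prefix directory as the second.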
This repository includes some example configuration files (scr.conf.template, scr.user.conf.template, and examples/test.conf).
Numerous people have contributed to the SCR project.
To reference SCR in a publication, please cite the following paper:
Additional information and research publications can be found here:
http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi