LLNL/scr

Name: scr

Owner: Lawrence Livermore National Laboratory

Description: SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

Created: 2012-12-29 20:48:21.0

Updated: 2018-03-27 06:00:39.0

Pushed: 2018-02-02 14:44:57.0

Homepage: http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

Size: 304618

Language: C

README

Scalable Checkpoint / Restart (SCR) Library

The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing and restarting large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.

Detailed usage is provided at SCR.ReadTheDocs.io.

Quickstart

SCR uses the CMake build system and we recommend out-of-source builds.

git clone git@github.com:llnl/scr.git
mkdir build
mkdir install

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
make
make install
make test

Some useful CMake command line options:

Dependencies

Configuration Files

SCR searches the following locations, in order, for a parameter value and uses the first value it finds.

  1. Environment variables,
  2. User configuration file,
  3. System configuration file,
  4. Compile-time constants.
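
The search order above can be sketched in C as a first-match lookup. This is a minimal illustration, not SCR's actual implementation: the names kv_pair, kv_find, and lookup_param are hypothetical, and small in-memory tables stand in for the parsed user file, system file, and compile-time constants.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One key/value entry. Hypothetical stand-in for a parsed config line
 * or a compile-time constant. */
typedef struct {
    const char *key;
    const char *value;
} kv_pair;

/* Linear search of a small table; returns NULL when the key is absent. */
static const char *kv_find(const kv_pair *table, size_t count, const char *key)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].key, key) == 0) {
            return table[i].value;
        }
    }
    return NULL;
}

/* Return the first value found, checking sources in priority order:
 * 1. environment, 2. user config, 3. system config, 4. defaults.
 * Returns NULL when no source defines the parameter. */
const char *lookup_param(const char *name,
                         const kv_pair *user, size_t n_user,
                         const kv_pair *sys, size_t n_sys,
                         const kv_pair *defaults, size_t n_defaults)
{
    const char *value = getenv(name);
    if (value != NULL) {
        return value;
    }
    value = kv_find(user, n_user, name);
    if (value != NULL) {
        return value;
    }
    value = kv_find(sys, n_sys, name);
    if (value != NULL) {
        return value;
    }
    return kv_find(defaults, n_defaults, name);
}
```

Because the environment is checked first, a value exported in the job script would win over the same parameter in either configuration file, which makes per-run overrides possible without editing any file.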

To find a user configuration file, SCR looks for a file named .scrconf in the prefix directory (note the leading dot). Alternatively, one may specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time. This repository includes some example configuration files (scr.conf.template, scr.user.conf.template, and examples/test.conf).
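
The rule for locating the user configuration file can be sketched as follows. This is an illustrative sketch under stated assumptions, not SCR's own code: the function names resolve_user_conf and user_conf_path are hypothetical, and the caller is assumed to already know the prefix directory. Only the SCR_CONF_FILE variable and the .scrconf file name come from the text above.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Build the user config path into out.
 * override:   value of SCR_CONF_FILE, or NULL when it is not set
 * prefix_dir: the prefix directory, used only when override is NULL
 * Returns 0 on success, -1 on bad arguments or if the path is truncated. */
int resolve_user_conf(const char *override, const char *prefix_dir,
                      char *out, size_t out_size)
{
    int n;

    if (out == NULL || out_size == 0) {
        return -1;
    }
    if (override != NULL) {
        /* An explicit file name takes precedence over the default. */
        n = snprintf(out, out_size, "%s", override);
    } else if (prefix_dir != NULL) {
        /* Default: a file named .scrconf inside the prefix directory. */
        n = snprintf(out, out_size, "%s/.scrconf", prefix_dir);
    } else {
        return -1;
    }
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Convenience wrapper that reads SCR_CONF_FILE from the environment. */
int user_conf_path(const char *prefix_dir, char *out, size_t out_size)
{
    return resolve_user_conf(getenv("SCR_CONF_FILE"), prefix_dir,
                             out, out_size);
}
```

Splitting the logic from the getenv call keeps the path rule easy to exercise in isolation, for example resolve_user_conf(NULL, "/tmp/run", buf, sizeof buf) fills buf with /tmp/run/.scrconf.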

Authors

Numerous people have contributed to the SCR project.

To reference SCR in a publication, please cite the following paper:

Additional information and research publications can be found here:

http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

