FredHutch/slurm-pwalk

Name: slurm-pwalk

Owner: Fred Hutchinson Cancer Research Center

Description: Our wrappers around pwalk to run it across nodes in a slurm-managed cluster

Created: 2016-12-20 00:21:00

Updated: 2016-12-20 01:05:38

Pushed: 2017-12-22 20:26:28

Size: 32 KB

Language: Shell


README

slurm-pwalk

Use pwalk to walk big file systems and store metadata in a database.

Currently slurm is supported for job scheduling and PostgreSQL for storage.

Requirements
How to use
  1. Clone the repo
  2. Install the requirements (and ensure the paths in storcrawldb.config are accurate)
  3. Create the directories before_scripts.d and after_scripts.d
  4. Create a PostgreSQL database
  5. Edit storcrawldb.config and storcrawldb.postgresql_functions to suit your environment
  6. Ensure your account can run pwalk, sbatch, psql, csvquote, and awk
  7. Run storcrawldb.sh --action start < folderlist (see the examples below)
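
A rough sketch of steps 1 through 6, assuming a database named storcrawl (the database name and paths are assumptions; adjust them for your site):

    # steps 1 and 3: clone the repo and create the local script directories
    git clone https://github.com/FredHutch/slurm-pwalk.git
    cd slurm-pwalk
    mkdir -p before_scripts.d after_scripts.d

    # step 4: create the PostgreSQL database (name is an assumption;
    # match whatever storcrawldb.config expects)
    createdb storcrawl

    # step 6: confirm the required tools are on your PATH
    for cmd in pwalk sbatch psql csvquote awk; do
      command -v "$cmd" >/dev/null || echo "missing: $cmd"
    done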

Note: each line of folderlist has the format owner,path (the owner may be blank or omitted). All folders in the list are crawled, and any listed subfolder is excluded from its parent folder's crawl, which allows different ownership per subtree.
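
For example, a folderlist could look like the following (the owners and paths here are made up):

    labalpha,/data/labalpha
    ,/data/shared
    labbeta,/data/labalpha/beta_project

The third line is crawled as its own entry (owned by labbeta) and is excluded from the /data/labalpha crawl. The crawl is then started with:

    ./storcrawldb.sh --action start < folderlist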

Recommendations
Details

Each crawl auto-generates a TAG from the current timestamp, to the minute. This TAG is appended to table names and is how you identify each crawl. The TAG can also be specified on the command line, which is required when deleting scans, printing logs, generating reports, etc.

Output is written to two directories, which must be on cluster-shared storage. The metadata goes into csv_<TAG> and the job output goes into output_<TAG>.
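
For instance, a crawl whose TAG came out as 201712221030 (a made-up value) would leave directories like these on the shared storage:

    csv_201712221030/      # pwalk metadata CSV files, later loaded into PostgreSQL
    output_201712221030/   # slurm job output from the crawl jobs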

Custom local scripts are run from several directories and have access to the current TAG. Examples of things done in these scripts: syncing UID/GID names from the local system into the DB for use in queries and views (and removing those tables later), creating data and output directory symlinks, etc.
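
As one hedged illustration, a before_scripts.d script for the UID/GID sync might look like the sketch below. It assumes the current TAG reaches the script through a STORCRAWL_TAG environment variable and that the csv_<TAG> directory convention above applies; both are assumptions, so check the scripts in the repo for the real interface.

    #!/bin/bash
    # Sketch: dump local UID/GID -> name mappings for this crawl so a later
    # step can load them into PostgreSQL for use in queries and views.
    # STORCRAWL_TAG and the csv_<TAG> output location are assumptions.
    set -euo pipefail

    tag="${STORCRAWL_TAG:?TAG not set}"
    outdir="csv_${tag}"

    # getent consults whatever NSS sources the node uses (files, LDAP, ...)
    getent passwd | awk -F: '{ print $3 "," $1 }' > "${outdir}/uid_names.csv"
    getent group  | awk -F: '{ print $3 "," $1 }' > "${outdir}/gid_names.csv"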

A brief log is kept and a report is generated at the end of the scan. Both are available through storcrawldb.sh.

Note on our storage_chargeback_ownership setup

We have a private repo on GitHub that allows collaboration on specifying ownership of folders. The storcrawl system will include some auxiliary scripts that use this repo to supply ownership information while the pwalk output is copied into the database.

To use this, you will need to set up a Deploy Key with GitHub. I found this guide most helpful: https://www.justinsilver.com/technology/github-multiple-repository-ssh-deploy-keys/
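
The pattern that article describes is a per-repository host alias in ~/.ssh/config pointing at the deploy key; the alias and key file name below are placeholders:

    # ~/.ssh/config
    Host github-chargeback
        HostName github.com
        User git
        IdentityFile ~/.ssh/chargeback_deploy_key
        IdentitiesOnly yes

The private repo is then cloned through the alias, e.g. git clone git@github-chargeback:FredHutch/<ownership-repo>.git.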


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.