Name: slurm-pwalk
Owner: Fred Hutchinson Cancer Research Center
Description: Our wrappers around pwalk to run it across nodes in a slurm-managed cluster
Created: 2016-12-20 00:21:00.0
Updated: 2016-12-20 01:05:38.0
Pushed: 2017-12-22 20:26:28.0
Homepage: null
Size: 32
Language: Shell
Use pwalk to walk big file systems and store metadata in a database.
Currently slurm is supported for job scheduling and PostgreSQL for storage.
storcrawldb.sh --action start < folderlist
Note: the format for folderlist is: owner,path — the owner field may be blank or omitted. Every folder in the list is crawled, and listed subfolders are excluded from their parent folder's crawl (allowing different ownership per subtree).
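For illustration (the owner and paths below are hypothetical, not from this repo), a folderlist might look like:

```
jdoe,/fh/fast/doe_lab
,/fh/fast/shared
/fh/scratch
```

Here /fh/fast/shared and /fh/scratch have no owner; if one listed folder were a subfolder of another, it would be excluded from the parent's crawl and crawled on its own.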
Each crawl auto-generates a TAG from the current timestamp, to the minute. This TAG is appended to table names and is how you identify each crawl. The TAG can also be specified on the command line, which is required to delete scans, print logs, generate reports, etc.
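As a sketch of the idea (the exact timestamp format and table name are assumptions, not taken from the storcrawldb.sh source), a minute-resolution TAG and a name derived from it might be built like this:

```shell
# Minute-resolution timestamp tag, e.g. 201612200021 (format is an assumption).
TAG=$(date +%Y%m%d%H%M)

# The TAG is appended to table and directory names (hypothetical table name):
echo "table: file_${TAG}"
echo "dir:   csv_${TAG}"
```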
Output goes to two directories, which must be on cluster-shared storage: the metadata goes into csv_<TAG> and the job output goes into output_<TAG>.
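As a sketch of that layout (the tag value here is only an example), the two per-crawl directories could be created like this:

```shell
# Both directories must live on storage shared by all cluster nodes.
TAG=201612200021                       # example tag: timestamp to the minute
mkdir -p "csv_${TAG}" "output_${TAG}"  # pwalk metadata CSVs vs. slurm job output
```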
Custom local scripts are run from several directories and have access to the current TAG. Examples of tasks handled in these scripts: syncing UID/GID names from the local system into the database for use in queries and views (and removing those tables later), creating data and output directory symlinks, etc.
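A minimal sketch of the UID/GID sync idea, assuming the hook script dumps local name mappings to CSV for loading into per-crawl tables (file names and the CSV layout here are assumptions, not this repo's actual hooks):

```shell
# Dump local UID->name and GID->name mappings as "id,name" CSV rows,
# tagged with the current crawl's TAG for later loading into the DB.
TAG=201612200021   # example tag; the real hooks receive the current TAG
getent passwd | awk -F: -v OFS=, '{print $3, $1}' > "uid_names_${TAG}.csv"
getent group  | awk -F: -v OFS=, '{print $3, $1}' > "gid_names_${TAG}.csv"
```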
A brief log is kept and a report is generated at the end of the scan; both are available through storcrawldb.sh.
We have a private repo on GitHub that allows collaboration on specifying folder ownership. The storcrawl system will include some auxiliary scripts that use this repo to supply ownership information when copying pwalk output into the database.
To use this, you will need to set up a Deploy Key with GitHub. I found this guide most helpful: https://www.justinsilver.com/technology/github-multiple-repository-ssh-deploy-keys/
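The approach in that guide boils down to a per-repository SSH host alias. A sketch, assuming a key file and host alias chosen for illustration (the org/repo name is hypothetical):

```
# ~/.ssh/config -- one host alias per deploy key
Host github-folder-owners
    HostName github.com
    User git
    IdentityFile ~/.ssh/folder_owners_deploy_key
    IdentitiesOnly yes
```

The repo is then cloned through the alias, e.g. `git clone git@github-folder-owners:yourorg/folder-owners.git`, so git picks the matching deploy key.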