Name: fs-reporting
Owner: Wellcome Trust Sanger Institute - Human Genetics Informatics
Created: 2018-03-19 12:20:31.0
Updated: 2018-03-22 16:27:38.0
Pushed: 2018-03-22 16:27:37.0
Size: 50
Language: Shell
Generate reports and visualisations of filesystem usage, grouped by:

* Unix group
* User
* Filetype:
  * index: `*.{crai,bai,sai,fai,csi}`
  * compressed: `*.{bzip2,gz,tgz,zip,xz,bgz,bcf}`
  * uncompressed: `{README,*.{sam,fasta,fastq,fa,fq,vcf,csv,tsv,txt,text}}`
  * checkpoint: `*jobstate.context`
  * log: `*.{log,stdout,stderr,o,out,e,err}`
  * temp: `*{tmp,temp}*`

Specifically, we are interested in `ctime`.
File paths in the mpistat data are base64 encoded. For aggregating by
filetype, the data must be preclassified:

```shell
zcat lustre01.dat.gz | ./classify-filetype.sh
```

This appends a field to the mpistat data containing the filetype and
streams it to stdout.
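The classifier itself is repo-specific; the following is only a minimal sketch of the idea, assuming the base64-encoded path is the first tab-delimited field and using a simplified subset of the patterns above. The `classify` and `append_filetype` helper names are illustrative, not part of the actual script.

```shell
# Sketch only: append a filetype field to each mpistat-style record.
# Assumes the base64-encoded path is the first tab-delimited field.
classify() {
  case "$1" in
    *.cram)                  echo "cram" ;;
    *.bam)                   echo "bam" ;;
    *.crai|*.bai|*.sai)      echo "index" ;;
    *.gz|*.tgz|*.zip|*.xz)   echo "compressed" ;;
    *.log|*.stdout|*.stderr) echo "log" ;;
    *tmp*|*temp*)            echo "temp" ;;
    *)                       echo "other" ;;
  esac
}

append_filetype() {
  while IFS=$'\t' read -r encoded rest; do
    path="$(printf '%s' "$encoded" | base64 --decode)"
    printf '%s\t%s\t%s\n' "$encoded" "$rest" "$(classify "$path")"
  done
}

# Example: one record with a base64-encoded path and a size field
printf '%s\t%s\n' "$(printf '%s' '/data/sample.cram' | base64)" 2048 \
| append_filetype
```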
mpistat data can be aggregated using `aggregate-mpistat.sh`, taking
uncompressed data from stdin:

```shell
zcat lustre01.dat.gz lustre02.dat.gz | ./aggregate-mpistat.sh
```
The aggregation is output to stdout as tab-delimited data with the
following fields:

* Organisational tag: `group` or `user`
* Filetype tag: `all`, `cram`, `bam`, `index`, `compressed`,
  `uncompressed`, `checkpoint`, `log`, `temp` or `other`
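Conceptually, the aggregation is a group-by over the organisation and filetype fields. A minimal sketch of that idea in awk, with assumed field positions (size in field 2, Unix group ID in field 4, the appended filetype last); the real `aggregate-mpistat.sh` produces the full field set described above:

```shell
# Sketch only: sum file sizes per (gid, filetype) over tab-delimited
# records, with size assumed in field 2, gid in field 4, filetype last.
printf 'cGF0aA==\t100\t10\t500\tcram\ncGF0aA==\t50\t10\t500\tcram\n' \
| awk -F'\t' -v OFS='\t' '
    { size[$4 OFS $NF] += $2 }
    END { for (k in size) print k, size[k] }
  '
```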
The aggregation script takes three optional, positional arguments,
including the filetype tag (defaulting to `all`) and the filesystem
type (defaulting to `lustre`). Note that if the input data has not
been preclassified by filetype, the aggregation step will fail. The
filesystem types, which define the cost per terabyte-year, are
enumerated in `fs-cost.map`. Note also that, for shared filesystems,
it may be worth filtering the mpistat data before classifying and/or
aggregating it; for example, by group.
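Such a filter can be a simple awk predicate inserted before the classification step. A sketch, assuming the Unix group ID is the fourth tab-delimited field of the mpistat records (the field position is an assumption):

```shell
# Sketch only: keep records belonging to Unix group 1234; the gid is
# assumed to be the fourth tab-delimited field of the mpistat data.
printf 'cGF0aA==\t100\t10\t1234\ncGF0aA==\t50\t10\t5678\n' \
| awk -F'\t' '$4 == "1234"'
```

In the real pipeline this filter would sit between `zcat` and `./classify-filetype.sh`.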
Aggregated data can be mapped to PI by running it through
`map-to-pi.sh`. This script strips out any `user` records and replaces
`group` records with an appropriate `pi` record, using
`gid-pi_uid.map`:

```shell
zcat foo.dat.gz | ./aggregate-mpistat.sh | ./map-to-pi.sh
```

Note that this will not aggregate records by PI; it's strictly a
mapping operation. It defines a third organisational tag, `pi`, in the
aggregated output, where the organisation ID (third field) is the PI's
Unix user ID (per the mapping definition).
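A minimal sketch of this kind of mapping, assuming `gid-pi_uid.map` is tab-delimited `gid`/`PI uid` pairs, the organisational tag sits in field 2 and the organisation ID in field 3 (the ID's position comes from the note above; the tag's position is an assumption):

```shell
# Sketch only: drop "user" records and rewrite "group" records as "pi"
# records via a gid -> PI uid map. Field layout is assumed: tag in
# field 2, organisation ID in field 3.
map="$(mktemp)"
printf '500\t9001\n' > "$map"   # gid 500 belongs to the PI with uid 9001

printf 'lustre\tgroup\t500\t150\nlustre\tuser\t10\t60\n' \
| awk -F'\t' -v OFS='\t' '
    NR == FNR { pi[$1] = $2; next }    # load the gid -> PI uid map
    $2 == "user" { next }              # strip user records
    $2 == "group" && ($3 in pi) { $2 = "pi"; $3 = pi[$3] }
    { print }
  ' "$map" -

rm -f "$map"
```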
An additional mapping step can then be applied, which maps Unix user
and group IDs to their human-readable counterparts, defined in
`uid-user.map` and `gid-group.map`, respectively. Note that an inner
join (in relational algebra terms) is performed, so any records in the
aggregated data without a corresponding mapping will be lost; this can
be beneficial when looking at subsets of data. The mapping can be
performed using the `map-to-readable.sh` convenience script, which
reads from stdin and writes to stdout:

```shell
cat lustre01-logs lustre01-logs-by_pi | ./map-to-readable.sh
```
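The join semantics can be sketched with the standard `join` utility; the file contents and field layout below are illustrative only, with the Unix ID assumed to be the first field on both sides:

```shell
# Sketch only: inner join of aggregated records against a uid -> name
# map; records without a mapping (uid 99 here) are dropped.
data="$(mktemp)"; names="$(mktemp)"
printf '10\t60\n99\t5\n' | sort > "$data"    # uid<TAB>cost
printf '10\talice\n'     | sort > "$names"   # uid<TAB>username

join -t "$(printf '\t')" "$data" "$names"    # uid 99 does not survive

rm -f "$data" "$names"
```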
Final aggregation/merging can be done using the `merge-aggregates.sh`
script:

```shell
./merge-aggregates.sh lustre01-all lustre01-cram lustre01-cram-by_pi
```

This will produce the output data that drives report generation.
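The merge step is not specified in detail here; conceptually, it re-sums matching records across the input files. A sketch of that idea, assuming the final field is the summable quantity and everything before it forms the record's key:

```shell
# Sketch only: merge two aggregated files by re-summing the last field
# over identical key prefixes (all fields but the last).
a="$(mktemp)"; b="$(mktemp)"
printf 'lustre\tgroup\t500\t150\n' > "$a"
printf 'lustre\tgroup\t500\t25\n'  > "$b"

cat "$a" "$b" \
| awk -F'\t' -v OFS='\t' '
    { key = $1; for (i = 2; i < NF; i++) key = key OFS $i
      total[key] += $NF }
    END { for (k in total) print k, total[k] }
  '

rm -f "$a" "$b"
```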
Once the completely aggregated output has been produced, it can be run
through `generate-assets.R`, which will generate the tables, plots and
LaTeX source for the final report. The script takes three positional
arguments. It will produce assets named
`OUTPUT_DIR/FILESYSTEM-ORG_TAG.EXT` (for example,
`/path/to/assets/lustre-pi.pdf` for the plot of Lustre data usage by
PI), as well as `OUTPUT_DIR/report.tex`, which makes use of these.
(Note that the PI assets are unconstrained, but the `group` and `user`
assets will be limited to the top 10, by cost.)
The final report can be compiled using LaTeX. For example:

```shell
cd /path/to/assets
latexmk -pdf report.tex
```
To compile the aggregated data into the final report (i.e., to run the
complete pipeline, as outlined above), a convenience script is
available that will submit the pipeline to an LSF cluster:

```
submit-pipeline.sh [--output FILENAME]
                   [--work-dir DIRECTORY]
                   [--bootstrap SCRIPT]
                   [--base TIME]
                   [--email ADDRESS]
                   [--lustre INPUT_DATA]
                   [--nfs INPUT_DATA]
                   [--warehouse INPUT_DATA]
                   [--irods INPUT_DATA]
                   [--lsf-STEP OPTION...]
```
Taking the following options:

Option                   | Behaviour
------------------------ | ---------
`--output FILENAME`      | Write the report to `FILENAME`, defaulting to `$(pwd)/report.pdf`
`--work-dir DIRECTORY`   | Use `DIRECTORY` for the pipeline's working files, defaulting to the current working directory
`--bootstrap SCRIPT`     | Source `SCRIPT` at the beginning of each job in the pipeline
`--base TIME`            | Set the base time to `TIME`, defaulting to the current system time
`--email ADDRESS`        | E-mail address to which the completed report is sent; can be specified multiple times
`--lustre INPUT_DATA`    | `INPUT_DATA` for a Lustre filesystem; can be specified multiple times
`--nfs INPUT_DATA`       | `INPUT_DATA` for an NFS filesystem; can be specified multiple times
`--warehouse INPUT_DATA` | `INPUT_DATA` for a warehouse filesystem; can be specified multiple times
`--irods INPUT_DATA`     | `INPUT_DATA` for an iRODS filesystem; can be specified multiple times
`--lsf-STEP OPTION...`   | Provide LSF `OPTION`s to the `STEP` job submission; can be specified multiple times
Note that at least one `--lustre`, `--nfs`, `--warehouse` or `--irods`
option must be specified, with its `INPUT_DATA` readable from the
cluster nodes. In addition to the final report, its source aggregated
data will be compressed alongside it, with the extension `.dat.gz`.
Otherwise, the working directory and its contents will be deleted upon
successful completion; as such, do not set the output or any logging
to be written inside the working directory.
The following pipeline `STEP`s are available:

* `foo`: Do something…