Name: coverage-depth
Owner: Hammer Lab
Description: Generate genomic-coverage-depth histograms using Apache Spark
Created: 2017-06-05 17:20:18
Updated: 2017-11-27 18:57:47
Pushed: 2018-01-13 20:53:41
Size: 491
Language: Scala
Analyze coverage in a BAM file or files, optionally intersected with an “interval file” (e.g. an exome capture kit .bed).
CoverageDepth
This tool computes coverage-depth statistics about one or two sets of reads (e.g. .bams), optionally taking an intervals file (e.g. a .bed denoting “targeted loci” of some upstream analysis, e.g. whole-exome sequencing) and generating coverage-depth statistics for on-target loci, off-target loci, and in total.
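For concreteness, an intervals file here is ordinary BED: tab-separated contig, start, and end columns (the coordinates below are illustrative, not from a real capture kit):

```
chr1	12080	12251
chr1	12595	12802
chr2	6869	7089
```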
When run on two samples with an intervals file, it can plot the fraction of the targeted loci that were covered at ≥X depth in one sample and ≥Y depth in the other, for all (X, Y):
After setting $COVERAGE_JAR to point to a coverage-depth assembly JAR:

```
$SPARK_HOME/bin/spark-submit \
  --properties-file $spark_props \
  --class org.hammerlab.coverage.Main \
  $COVERAGE_JAR \
  --intervals-file $intervals \
  --out $out_dir \
  $normal $tumor
```
In the above, you'll want to fill in:

- $spark_props: path to a Spark properties file (a sample is sketched below)
- $intervals: optional path to e.g. a .bed file, for viewing on-/off-target stats
- $normal/$tumor: paths to .bams (or .adam alignment records); a single .bam can also be passed, resulting in a different, simpler 1-dimensional histogram output
- $out_dir: output directory
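For reference, a minimal $spark_props file might look something like this (values are illustrative, not recommendations; any standard Spark properties apply):

```
# example Spark properties file ($spark_props); tune for your cluster
spark.master              yarn
spark.executor.instances  16
spark.executor.memory     8g
spark.driver.memory       8g
spark.eventLog.enabled    true
```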
A full list of arguments/options can be found by running with -h:

```
$SPARK_HOME/bin/spark-submit \
  --class org.hammerlab.coverage.Main \
  $COVERAGE_JAR \
  -h

 PATHS                             : Paths to sets of reads: FILE1 FILE2 FILE3
 --dir (-d) PATH                   : When set, relative paths will be prefixed with this path (default: None)
 --force (-f)                      : Write result files even if they already exist (default: false)
 --include-duplicates              : Include reads marked as duplicates (default: false)
 --include-failed-quality-checks   : Include reads that failed vendor quality checks (default: false)
 --include-single-end              : Include single-end reads (default: false)
 --interval-partition-bytes (-b) N : Number of bytes per chunk of input interval-file (default: 1048576)
 --intervals-file (-i) PATH        : Intervals file or capture kit; print stats for loci matching this intervals file,
                                     not matching, and total. (default: None)
 --loci VAL                        : If set, loci to include. Either 'all' or 'contig[:start[-end]],contig[:start[-end]],...' (default: None)
 --loci-file VAL                   : Path to file giving loci to include. (default: None)
 --min-alignment-quality INT       : Minimum read mapping quality for a read (Phred-scaled) (default: None)
 --no-sequence-dictionary          : If set, get contigs and lengths directly from reads instead of from sequence
                                     dictionary. (default: false)
 --only-mapped-reads               : Include only mapped reads (default: false)
 --out (-o) DIR                    : Directory to write results to
 --persist-distributions (-v)      : When set, persist full PDF and CDF of coverage-depth histogram (default: false)
 --persist-joint-histogram (-jh)   : When set, save the computed joint-histogram; if one already exists, skip reading
                                     it, recompute it, and overwrite it (default: false)
 --sample-names STRING[]           : name1,...,nameN
 --split-size VAL                  : Maximum HDFS split size (default: None)
 -h (-help, --help, -?)            : Print help (default: true)
 --print_metrics                   : Print metrics to the log on completion (default: false)
```
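For example, a hypothetical single-sample run restricted to one contig, persisting the full distributions (paths and contig name are illustrative):

```
$SPARK_HOME/bin/spark-submit \
  --class org.hammerlab.coverage.Main \
  $COVERAGE_JAR \
  --loci chr20 \
  --persist-distributions \
  --out $out_dir \
  $normal
```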
This tool writes out a directory with a few files of note; see this test-data for a live example:
- misc: plaintext file with high-level stats
- cdf.csv: CSV with stats about the number of loci with “normal” depth ≥X and “tumor” depth ≥Y, for (X, Y) filtered to (a relatively dense set of) “round numbers”
- pdf.csv: same as above, but stats are about loci with depth ==X and ==Y, respectively
- pdf/cdf: when run with the --persist-distributions (-v) flag, the unfiltered “pdf” and “cdf” above are written out as sharded CSVs

The plot.js script in this repo can be used to consume the cdf.csv produced above and send it to plot.ly:
```
cd src/main/js/plots
npm install

# pipe cdf.csv to plot.js;
# the $out argument should be the output directory from above
cat $out/cdf.csv | node plot.js
```
If $out is in a gcloud bucket (gs://...), use gsutil to pipe the file to the plot script:

```
gsutil cat $out/cdf.csv | node plot.js
```
Either way, this generates an interactive 2D-histogram like the one shown above.
Running on an ephemeral Google Cloud Dataproc cluster is easy and cheap (~$0.02/cpu-hr using predominantly pre-emptible nodes, as of this writing).
You'll want to install the gcloud command-line utility and then follow the steps below.
scripts/run-on-gcloud
This script uses hammerlab/dataproc to set up a cluster, run one CoverageDepth app, then tear down the cluster; set-up and tear-down typically add just a couple of minutes to the overall run-time.
```
scripts/run-on-gcloud -h
usage: dataproc [-h] [--cluster CLUSTER] [--timestamp-cluster-name]
                [--cores CORES] [--properties PROPS_FILES] [--jar JAR]
                [--main MAIN] [--machine-type MACHINE_TYPE] [--dry-run]
                [--job-only]

Run a Spark job on an ephemeral dataproc cluster

optional arguments:
  -h, --help            show this help message and exit
  --cluster CLUSTER     Name of the dataproc cluster to use; defaults to
                        $CLUSTER env var
  --timestamp-cluster-name, -t
                        When true, append "-<TIMESTAMP>" to the dataproc
                        cluster name
  --cores CORES, -c CORES
                        Number of CPU cores to use (default: 200)
  --properties PROPS_FILES, -p PROPS_FILES
                        Comma-separated list of Spark properties files; merged
                        with $SPARK_PROPS_FILES env var
  --jar JAR             URI of main app JAR; defaults to JAR env var
  --main MAIN, -m MAIN  JAR main class; defaults to MAIN env var
  --machine-type MACHINE_TYPE
                        Machine type to use (default: n1-standard-4)
  --dry-run, -n         When set, print some of the parsed and inferred
                        arguments and exit without running any dataproc
                        commands
  --job-only, -j        When set, skip cluster setup/teardown commands; just
                        run a job
```
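For instance, to preview what a larger, timestamped-cluster run would do without executing any dataproc commands (flags as documented above; the core count is illustrative):

```
scripts/run-on-gcloud \
  --cores 400 \
  --timestamp-cluster-name \
  --dry-run
```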
It sets $CLUSTER, $MAIN, and $JAR by default:

```
export JAR=gs://hammerlab-lib/coverage-depth-707fccc.jar
export MAIN=org.hammerlab.coverage.Main
export CLUSTER=coverage-depth
```
You can also run the cluster-creation, job-submission, and cluster-deletion commands manually, e.g. with 51 4-core nodes (2 reserved and 49 pre-emptible), pointing at a GCloud bucket with your data:
```
gcloud dataproc clusters create coverage-depth \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4 \
    --num-workers 2 \
    --num-preemptible-workers 49
```
```
gcloud dataproc jobs submit spark \
    --cluster coverage-depth \
    --class org.hammerlab.coverage.Main \
    --jars gs://hammerlab-lib/coverage-depth-707fccc.jar \
    -- \
    --intervals-file <path to .bed> \
    --out <out directory> \
    <path to normal .bam> \
    <path to tumor .bam>
```
This uses a coverage-depth JAR that's already on GCloud storage, so no bandwidth or time is spent uploading a JAR.
You may wish to include some Spark configs in either the cluster-creation step (to set defaults across multiple jobs that may be run before the cluster is torn down):
```
--properties spark:spark.speculation=true,spark:spark.speculation.interval=1000,spark:spark.speculation.multiplier=1.3,spark:spark.yarn.maxAppAttempts=1,spark:spark.eventLog.enabled=true,spark:spark.eventLog.dir=hdfs:///user/spark/eventlog
```
or in the job-creation step:
```
--properties spark.speculation=true,spark.speculation.interval=1000,spark.speculation.multiplier=1.3,spark.yarn.maxAppAttempts=1,spark.eventLog.enabled=true,spark.eventLog.dir=hdfs:///user/spark/eventlog
```
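Putting that together with the job-submission command above (same flags; only a couple of the properties shown, for brevity):

```
gcloud dataproc jobs submit spark \
    --cluster coverage-depth \
    --class org.hammerlab.coverage.Main \
    --jars gs://hammerlab-lib/coverage-depth-707fccc.jar \
    --properties spark.speculation=true,spark.yarn.maxAppAttempts=1 \
    -- \
    --out <out directory> \
    <path to .bam>
```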
When you're finished, delete the cluster:

```
gcloud dataproc clusters delete coverage-depth
```
Alternatively, you can just resize it down to the minimum 2 reserved nodes:
```
gcloud dataproc clusters update coverage-depth --num-preemptible-workers 0
```
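The same command scales it back up later, e.g. to the original 49 pre-emptible workers:

```
gcloud dataproc clusters update coverage-depth --num-preemptible-workers 49
```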
Download a pre-built assembly JAR, and set $COVERAGE_JAR to point to it:

```
wget https://oss.sonatype.org/content/repositories/snapshots/org/hammerlab/coverage-depth_2.11/1.0.0-SNAPSHOT/coverage-depth_2.11-1.0.0-SNAPSHOT-assembly.jar
export COVERAGE_JAR=$PWD/coverage-depth_2.11-1.0.0-SNAPSHOT-assembly.jar
```
or clone and build it yourself:

```
git clone git@github.com:hammerlab/coverage-depth.git
cd coverage-depth
sbt assembly
export COVERAGE_JAR=target/scala-2.11/coverage-depth-assembly-1.0.0-SNAPSHOT.jar
```
coverage-depth runs on Apache Spark:

- set $SPARK_HOME to the Spark installation directory
- coverage-depth currently builds against Spark 2.1.0, but some other versions may also work
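If you don't already have a Spark installation, one way to get one is the 2.1.0 binary distribution from the Apache archive (this particular URL and Hadoop flavor are one plausible choice, not a requirement):

```
# fetch and unpack a Spark 2.1.0 binary distribution, then point $SPARK_HOME at it
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
tar -xzf spark-2.1.0-bin-hadoop2.7.tgz
export SPARK_HOME=$PWD/spark-2.1.0-bin-hadoop2.7
```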