hammerlab/coverage-depth

Name: coverage-depth

Owner: Hammer Lab

Description: Generate genomic-coverage-depth histograms using Apache Spark

Created: 2017-06-05 17:20:18.0

Updated: 2017-11-27 18:57:47.0

Pushed: 2018-01-13 20:53:41.0

Homepage: null

Size: 491 KB

Language: Scala

README

coverage-depth

Analyze coverage in a BAM file or files, optionally intersected with an “interval file” (e.g. an exome capture kit .bed).


CoverageDepth

This tool computes coverage-depth statistics over one or two sets of reads (e.g. .bams), optionally taking an intervals file (e.g. a .bed denoting the “targeted loci” of some upstream analysis, such as whole-exome sequencing) and reporting statistics for on-target loci, off-target loci, and all loci combined.

When run on two samples with an intervals file, it can plot the fraction of targeted loci covered at ≥X depth in one sample and ≥Y depth in the other, for all (X, Y):

3-D plot preview

Running Locally

After setting $COVERAGE_JAR to point to a coverage-depth assembly JAR:

$SPARK_HOME/bin/spark-submit \
    --properties-file $spark_props \
    --class org.hammerlab.coverage.Main \
    $COVERAGE_JAR \
    --intervals-file $intervals \
    --out $out_dir \
    $normal $tumor

In the above, you'll want to fill in:

$SPARK_HOME: the root of your Spark installation
$spark_props: a Spark properties file
$COVERAGE_JAR: the coverage-depth assembly JAR (see Local Installation below)
$intervals: an intervals (.bed) file
$out_dir: the directory to write results to
$normal and $tumor: the .bams to analyze
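For concreteness, a local run might look like the following; all of the paths and file names here are hypothetical placeholders:

# Hypothetical example; substitute your own Spark installation, JAR, and input files
export SPARK_HOME=/opt/spark
export COVERAGE_JAR=$PWD/coverage-depth-assembly-1.0.0-SNAPSHOT.jar

$SPARK_HOME/bin/spark-submit \
    --properties-file conf/spark.properties \
    --class org.hammerlab.coverage.Main \
    $COVERAGE_JAR \
    --intervals-file capture-kit.bed \
    --out ./coverage-out \
    normal.bam tumor.bam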

A full list of arguments/options can be found by running with -h:

$SPARK_HOME/bin/spark-submit \
    --class org.hammerlab.coverage.Main \
    $COVERAGE_JAR \
    -h

 PATHS                            : Paths to sets of reads: FILE1 FILE2 FILE3
 --dir (-d) PATH                  : When set, relative paths will be prefixed with this path (default: None)
 --force (-f)                     : Write result files even if they already exist (default: false)
 --include-duplicates             : Include reads marked as duplicates (default: false)
 --include-failed-quality-checks  : Include reads that failed vendor quality checks (default: false)
 --include-single-end             : Include single-end reads (default: false)
 --interval-partition-bytes (-b) N : Number of bytes per chunk of input interval-file (default: 1048576)
 --intervals-file (-i) PATH       : Intervals file or capture kit; print stats for loci matching this intervals file, not matching, and total.
                                    (default: None)
 --loci VAL                       : If set, loci to include. Either 'all' or 'contig[:start[-end]],contig[:start[-end]],…' (default: None)
 --loci-file VAL                  : Path to file giving loci to include. (default: None)
 --min-alignment-quality INT      : Minimum read mapping quality for a read (Phred-scaled) (default: None)
 --no-sequence-dictionary         : If set, get contigs and lengths directly from reads instead of from sequence dictionary. (default: false)
 --only-mapped-reads              : Include only mapped reads (default: false)
 --out (-o) DIR                   : Directory to write results to
 --persist-distributions (-v)     : When set, persist full PDF and CDF of coverage-depth histogram (default: false)
 --persist-joint-histogram (-jh)  : When set, save the computed joint-histogram; if one already exists, skip reading it, recompute it, and overwrite
                                    it (default: false)
 --sample-names STRING[]          : name1,…,nameN
 --split-size VAL                 : Maximum HDFS split size (default: None)
 -h (-help, --help, -?)           : Print help (default: true)
 -print_metrics                   : Print metrics to the log on completion (default: false)
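For example, a run that restricts to chromosome 20, includes duplicate reads, and persists the full depth distributions might look like this; the input paths are hypothetical:

# Hypothetical example combining several of the options above
$SPARK_HOME/bin/spark-submit \
    --class org.hammerlab.coverage.Main \
    $COVERAGE_JAR \
    --intervals-file capture-kit.bed \
    --loci 20 \
    --include-duplicates \
    --persist-distributions \
    --out ./coverage-out \
    normal.bam tumor.bam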
Output

This tool writes out a directory with a few files of note; see the test data in this repo for a live example.
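For instance, after a run finishes you can poke around the output directory; cdf.csv (consumed by the plotting script below) is one of the files written there, though the full set of outputs isn't listed here:

# Inspect the output directory passed via --out; cdf.csv feeds the plotting step below
ls $out
head -n 5 $out/cdf.csv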

Plotting

The plot.js script in this repo can be used to consume the cdf.csv produced above and send it to plot.ly:

# Install JS dependencies
cd src/main/js/plots
npm install

# Pipe cdf.csv to plot.js
# $out argument should be the output directory from above
cat $out/cdf.csv | node plot.js

If $out is in a gcloud bucket (gs://…), use gsutil to pipe the file to the plot script:

gsutil cat $out/cdf.csv | node plot.js

This generates an interactive 2D-histogram like the one shown above.

Running on GCloud

Running on an ephemeral Google Cloud Dataproc cluster is easy and cheap (~$0.02/cpu-hr using predominantly pre-emptible nodes, as of this writing).

You'll want to install the gcloud command-line utility and then follow the steps below.
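If you haven't set up gcloud before, a minimal one-time configuration might look like this; the project ID is a placeholder:

# One-time gcloud setup; replace the project ID with your own
gcloud auth login
gcloud config set project my-project-id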

scripts/run-on-gcloud

This script uses hammerlab/dataproc to set up a cluster, run one CoverageDepth app, then tear down the cluster; set-up and tear-down typically add just a couple of minutes to the overall run-time.

scripts/run-on-gcloud -h
usage: dataproc [-h] [--cluster CLUSTER] [--timestamp-cluster-name]
                [--cores CORES] [--properties PROPS_FILES] [--jar JAR]
                [--main MAIN] [--machine-type MACHINE_TYPE] [--dry-run]
                [--job-only]

Run a Spark job on an ephemeral dataproc cluster

optional arguments:
  -h, --help            show this help message and exit
  --cluster CLUSTER     Name of the dataproc cluster to use; defaults to
                        $CLUSTER env var
  --timestamp-cluster-name, -t
                        When true, append "-<TIMESTAMP>" to the dataproc
                        cluster name
  --cores CORES, -c CORES
                        Number of CPU cores to use (default: 200)
  --properties PROPS_FILES, -p PROPS_FILES
                        Comma-separated list of Spark properties files; merged
                        with $SPARK_PROPS_FILES env var
  --jar JAR             URI of main app JAR; defaults to JAR env var
  --main MAIN, -m MAIN  JAR main class; defaults to MAIN env var
  --machine-type MACHINE_TYPE
                        Machine type to use (default: n1-standard-4)
  --dry-run, -n         When set, print some of the parsed and inferred
                        arguments and exit without running any dataproc
                        commands
  --job-only, -j        When set, skip cluster setup/teardown commands; just
                        run a job

It sets $CLUSTER, $MAIN, and $JAR by default:

export JAR=gs://hammerlab-lib/coverage-depth-707fccc.jar
export MAIN=org.hammerlab.coverage.Main
export CLUSTER=coverage-depth
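With those defaults in place, you can sanity-check the cluster and job configuration with --dry-run before spending any money; this example only uses flags documented in the help output above:

# Print the parsed/inferred arguments without running any dataproc commands
scripts/run-on-gcloud --dry-run --cores 400 --timestamp-cluster-name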
Manually

You can also run the cluster-creation, job-submission, and cluster-deletion commands manually:

Create a cluster

e.g. with 51 4-core nodes (2 reserved and 49 pre-emptible), pointing at a GCloud bucket with your data:

gcloud dataproc clusters create coverage-depth \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4 \
    --num-workers 2 \
    --num-preemptible-workers 49
Submit a job
gcloud dataproc jobs submit spark \
--cluster coverage-depth \
--class org.hammerlab.coverage.Main \
--jars gs://hammerlab-lib/coverage-depth-707fccc.jar \
-- \
--intervals-file <path to .bed> \
--out <out directory> \
<path to normal .bam> \
<path to tumor .bam>

This uses a coverage-depth JAR that's already on GCloud storage, so that no bandwidth- or time-cost is incurred uploading a JAR.
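If you've built your own assembly JAR (see Local Installation below), you can copy it to a bucket you control and point --jars at that instead; the bucket name here is a placeholder:

# Upload a locally-built assembly JAR to your own bucket (bucket name is a placeholder)
gsutil cp $COVERAGE_JAR gs://my-bucket/jars/coverage-depth-assembly.jar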

Optional: extra Spark configs

You may wish to include some Spark configs, either in the cluster-creation step (to set defaults for all jobs run before the cluster is torn down):

--properties spark:spark.speculation=true,spark:spark.speculation.interval=1000,spark:spark.speculation.multiplier=1.3,spark:spark.yarn.maxAppAttempts=1,spark:spark.eventLog.enabled=true,spark:spark.eventLog.dir=hdfs:///user/spark/eventlog

or in the job-creation step:

--properties spark.speculation=true,spark.speculation.interval=1000,spark.speculation.multiplier=1.3,spark.yarn.maxAppAttempts=1,spark.eventLog.enabled=true,spark.eventLog.dir=hdfs:///user/spark/eventlog
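Putting it together, a sketch of the job-submission command from above with a couple of these properties folded in (input paths are placeholders, and the property list is abbreviated):

# Job submission with Spark properties included; paths are placeholders
gcloud dataproc jobs submit spark \
    --cluster coverage-depth \
    --class org.hammerlab.coverage.Main \
    --jars gs://hammerlab-lib/coverage-depth-707fccc.jar \
    --properties spark.speculation=true,spark.eventLog.enabled=true \
    -- \
    --intervals-file gs://my-bucket/capture-kit.bed \
    --out gs://my-bucket/coverage-out \
    gs://my-bucket/normal.bam gs://my-bucket/tumor.bam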
Tear down the cluster
gcloud dataproc clusters delete coverage-depth

Alternatively, you can just resize it down to the minimum 2 reserved nodes:

gcloud dataproc clusters update coverage-depth --num-preemptible-workers 0
Local Installation

Download a pre-built assembly JAR and set $COVERAGE_JAR to point to it:

wget https://oss.sonatype.org/content/repositories/snapshots/org/hammerlab/coverage-depth_2.11/1.0.0-SNAPSHOT/coverage-depth_2.11-1.0.0-SNAPSHOT-assembly.jar
export COVERAGE_JAR=$PWD/coverage-depth_2.11-1.0.0-SNAPSHOT-assembly.jar

or clone and build it yourself:

git clone git@github.com:hammerlab/coverage-depth.git
cd coverage-depth
sbt assembly
export COVERAGE_JAR=target/scala-2.11/coverage-depth-assembly-1.0.0-SNAPSHOT.jar
Spark Installation

coverage-depth runs on Apache Spark.

coverage-depth currently builds against Spark 2.1.0, but some other versions will likely also work.
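If you need a Spark installation, one option is to grab a pre-built 2.1.0 distribution and point $SPARK_HOME at it; the archive URL and Hadoop build below are assumptions, so adjust them to taste:

# Download a pre-built Spark 2.1.0 (URL and Hadoop build are assumptions)
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
tar -xzf spark-2.1.0-bin-hadoop2.7.tgz
export SPARK_HOME=$PWD/spark-2.1.0-bin-hadoop2.7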


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.