CD2H gitForager

hammerlab/adam

Name: adam

Description: A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.

Created: 2014-02-04 03:57:44.0

Updated: 2017-02-07 14:14:34.0

Pushed: 2018-01-13 20:53:34.0

Homepage:

Size: 7147

Language: Scala

GitHub Committers

User	Most Recent Commit	# Commits

Other Committers

User	Email	Most Recent Commit	# Commits

README

ADAM

This fork of bigdatagenomics/adam:

reduces the repo to the adam-core module,
ports it to SBT (using hammerlab/sbt-parent to simplify build/packaging configuration),
is continuously-built and test-coverage-measured by TravisCI and Coveralls (resp.; see badges above), and
is occasionally published to Maven Central as org.hammerlab.adam:core.

The code (and README, from this point on) is essentially unchanged from upstream, which is periodically merged in.

Introduction

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark and Apache Parquet. Apache 2 licensed. Some quick links:

Why ADAM?

Over the last decade, DNA and RNA sequencing has evolved from an expensive, labor intensive method to a cheap commodity. The consequence of this is generation of massive amounts of genomic and transcriptomic data. Typically, tools to process and interpret these data are developed at academic labs, with a focus on excellence of the results generated, not on scalability and interoperability. A typical sequencing pipeline consists of a string of tools going from quality control, mapping, mapped read preprocessing, to variant calling or quantification, depending on the application at hand. Concretely, this usually means that such a pipeline is a string of tools, glued together by scripts or workflow engines, with data written to files in each step. This approach entails three main bottlenecks:

scaling the pipeline comes down to scaling each of the individual tools,
the stability of the pipeline heavily depends on the consistency of the intermediate file formats, and
writing to and reading from disk is a major slow-down.

We propose here a transformative solution for these problems, by replacing ad-hoc pipelines by the ADAM framework, developed in the Apache Spark ecosystem. ADAM provides specialized file formats for the standard data structures used in genomics analysis: mapped reads (typically stored as .bam files), representation of genomic regions (.bed files), and variants (.vcf files), using Avro and Parquet. This allows to use the in-memory cluster computing functionality of Apache Spark, ensuring efficient and fault-tolerant distribution based on data parallelism, without the intermediate disk operations required in classical distributed approaches.

Furthermore, the ADAM-Spark approach comes with an additional benefit. Typically, the endpoint of a sequencing pipeline is a file with processed data for a single sample: e.g. variants for DNA sequencing, read counts for RNA sequencing, etc. The real endpoint. however, of a sequencing experiment initiated by an investigator is interpretation of these data in a certain context. This usually translates into (statistical) analysis of multiple samples, connection with (clinical) metadata, interactive visualization, using data science tools such as R, Python, Tableau and Spotfire. In addition to scalable distributed processing, Spark also allows such interactive data analysis in the form of analysis notebooks (Spark Notebook or Zeppelin), or direct connection to the data in R and Python.

Hello World: Counting K-mers

Here's an example ADAM CLI command that will count 10-mers in a test .sam file that lives in this repository:

am-submit count_kmers /tmp/small.adam /tmp/kmers.adam 10
ad /tmp/kmers.adam/part-*
TGGCACT,1)
CGATTTT,1)
CAGCCTT,1)
GCTGTAT,1)
TAAGGTT,1)
CAGGACT,1)
GTCCCTC,1)
TTTGAAT,1)
GACGTGG,1)
TCCCTGT,1)

More than K-mer Counting

ADAM does much more than just k-mer counting. Running the ADAM CLI without arguments or with --help will display available commands, e.g.

am-submit

   e         888~-_          e             e    e
  d8b        888   \        d8b           d8b  d8b
 /Y88b       888    |      /Y88b         d888bdY88b
/  Y88b      888    |     /  Y88b       / Y88Y Y888b
____Y88b     888   /     /____Y88b     /   YY   Y888b
     Y88b    888_-~     /      Y88b   /          Y888b

e: adam-submit [<spark-args> --] <adam-args>

se one of the following commands:

 ACTIONS
      countKmers : Counts the k-mers/q-mers from a read dataset.
countContigKmers : Counts the k-mers/q-mers from a read dataset.
       transform : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
ransformFeatures : Convert a file with sequence features into corresponding ADAM format and vice versa
     mergeShards : Merges the shards of a file
  reads2coverage : Calculate the coverage from a given ADAM file

ERSION OPERATIONS
        vcf2adam : Convert a VCF file to the corresponding ADAM format
        adam2vcf : Convert an ADAM variant to the VCF ADAM format
      fasta2adam : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences.
      adam2fasta : Convert ADAM nucleotide contig fragments to FASTA files
      adam2fastq : Convert BAM to FASTQ files
 fragments2reads : Convert alignment records into fragment records.
 reads2fragments : Convert alignment records into fragment records.

T
           print : Print an ADAM formatted file
        flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat)
            view : View certain reads from an alignment-record file.

You can learn more about a command, by calling it without arguments or with --help, e.g.

am-submit transform --help
UT                                                           : The ADAM, BAM or SAM file to apply the transforms to
PUT                                                          : Location to write the transformed data in ADAM/Parquet format
d_md_tags VAL                                                : Add MD Tags to reads based on the FASTA (or equivalent) file passed to this option.
igned_read_predicate                                         : Only load aligned reads. Only works for Parquet files.
che                                                          : Cache data to avoid recomputing between stages.
alesce N                                                     : Set the number of partitions written to the ADAM output directory
ncat VAL                                                     : Concatenate this file with <INPUT> and write the result to <OUTPUT>
fer_merging                                                  : Defers merging single file output
mp_observations VAL                                          : Local path to dump BQSR observations to. Outputs CSV format.
rce_load_bam                                                 : Forces Transform to load from BAM/SAM.
rce_load_fastq                                               : Forces Transform to load from unpaired FASTQ.
rce_load_ifastq                                              : Forces Transform to load from interleaved FASTQ.
rce_load_parquet                                             : Forces Transform to load from Parquet.
rce_shuffle_coalesce                                         : Even if the repartitioned RDD has fewer partitions, force a shuffle.
(-help, --help, -?)                                          : Print help
own_indels VAL                                               : VCF file including locations of known INDELs. If none is provided, default
                                                               consensus model will be used.
own_snps VAL                                                 : Sites-only VCF giving location of known SNPs
mit_projection                                               : Only project necessary fields. Only works for Parquet files.
g_odds_threshold N                                           : The log-odds threshold for accepting a realignment. Default value is 5.0.
rk_duplicate_reads                                           : Mark duplicate reads
x_consensus_number N                                         : The maximum number of consensus to try realigning a target region to. Default
                                                               value is 30.
x_indel_size N                                               : The maximum length of an INDEL to realign to. Default value is 500.
x_target_size N                                              : The maximum length of a target region to attempt realigning. Default length is
                                                               3000.
_tag_fragment_size N                                         : When adding MD tags to reads, load the reference in fragments of this size.
_tag_overwrite                                               : When adding MD tags to reads, overwrite existing incorrect tags.
ired_fastq VAL                                               : When converting two (paired) FASTQ files to ADAM, pass the path to the second file
                                                               here.
rquet_block_size N                                           : Parquet block size (default = 128mb)
rquet_compression_codec [UNCOMPRESSED | SNAPPY | GZIP | LZO] : Parquet compression codec
rquet_disable_dictionary                                     : Disable dictionary encoding
rquet_logging_level VAL                                      : Parquet logging level (default = severe)
rquet_page_size N                                            : Parquet page size (default = 1mb)
int_metrics                                                  : Print metrics to the log on completion
align_indels                                                 : Locally realign indels present in reads.
calibrate_base_qualities                                     : Recalibrate the base quality scores (ILLUMINA only)
cord_group VAL                                               : Set converted FASTQs' record-group names to this value; if empty-string is passed,
                                                               use the basename of the input file, minus the extension.
partition N                                                  : Set the number of partitions to map data to
ngle                                                         : Saves OUTPUT as single file
rt_fastq_output                                              : Sets whether to sort the FASTQ output, if saving as FASTQ. False by default.
                                                               Ignored if not saving as FASTQ.
rt_lexicographically                                         : Sort the reads lexicographically by contig name, instead of by index.
rt_reads                                                     : Sort the reads by referenceId and read position
orage_level VAL                                              : Set the storage level to use for caching.
ringency VAL                                                 : Stringency level for various checks; can be SILENT, LENIENT, or STRICT. Defaults
                                                               to LENIENT

The ADAM transform command allows you to mark duplicates, run base quality score recalibration (BQSR) and other pre-processing steps on your data.

Getting Started

Installation

Binary Distributions

Bundled release binaries can be found on our releases page.

Building from Source

You will need to have Apache Maven version 3.1.1 or later installed in order to build ADAM.

Note: The default configuration is for Hadoop 2.7.3. If building against a different version of Hadoop, please pass -Dhadoop.version=<HADOOP_VERSION> to the Maven command. ADAM will cross-build for both Spark 1.x and 2.x, but builds by default against Spark 1.6.3. To build for Spark 2, run the ./scripts/move_to_spark2.sh script.

t clone https://github.com/bigdatagenomics/adam.git
 adam
port MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
n clean package -DskipTests

Outputs


O] ------------------------------------------------------------------------
O] BUILD SUCCESS
O] ------------------------------------------------------------------------
O] Total time: 9.647s
O] Finished at: Thu May 23 15:50:42 PDT 2013
O] Final Memory: 19M/81M
O] ------------------------------------------------------------------------

You might want to take a peek at the scripts/jenkins-test script and give it a run. It will fetch a mouse chromosome, encode it to ADAM reads and pileups, run flagstat, etc. We use this script to test that ADAM is working correctly.

Installing Spark

You'll need to have a Spark release on your system and the $SPARK_HOME environment variable pointing at it; prebuilt binaries can be downloaded from the Spark website. Currently, our continuous builds default to Spark 1.6.1 built against Hadoop 2.6, but any more recent Spark distribution should also work.

Helpful Aliases

You might want to add the following to your .bashrc to make running ADAM easier:

s adam-submit="${ADAM_HOME}/bin/adam-submit"
s adam-shell="${ADAM_HOME}/bin/adam-shell"

$ADAM_HOME should be the path to a binary release or a clone of this repository on your local filesystem.

These aliases call scripts that wrap the spark-submit and spark-shell commands to set up ADAM. Once they are in place, you can run adam by simply typing adam-submit at the command line, as demonstrated above.

Running ADAM

Now you can try running some simple ADAM commands:

`transform`

Make your first .adam file like this:

-submit transform $ADAM_HOME/adam-core/src/test/resources/small.sam /tmp/small.adam

If you didn't obtain your copy of adam from github, you can grab small.sam here.

`flagstat`

Once you have data converted to ADAM, you can gather statistics from the ADAM file using flagstat. This command will output stats identically to the samtools flagstat command.

If you followed along above, now try gathering some statistics:

am-submit flagstat /tmp/small.adam
 0 in total (QC-passed reads + QC-failed reads)
0 primary duplicates
0 primary duplicates - both read and mate mapped
0 primary duplicates - only read mapped
0 primary duplicates - cross chromosome
0 secondary duplicates
0 secondary duplicates - both read and mate mapped
0 secondary duplicates - only read mapped
0 secondary duplicates - cross chromosome
 0 mapped (100.00%:0.00%)
0 paired in sequencing
0 read1
0 read2
0 properly paired (0.00%:0.00%)
0 with itself and mate mapped
0 singletons (0.00%:0.00%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)

In practice, you'll find that the ADAM flagstat command takes orders of magnitude less time than samtools to compute these statistics. For example, on a MacBook Pro flagstat NA12878_chr20.bam took 17 seconds to run while samtools flagstat NA12878_chr20.bam took 55 seconds. On larger files, the difference in speed is even more dramatic. ADAM is faster because it's multi-threaded and distributed and uses a columnar storage format (with a projected schema that only materializes the read flags instead of the whole read).

`adam-shell`

The adam-shell command opens an interpreter that you can run ad-hoc ADAM commands in.

For example, the following code snippet will generate a result similar to the k-mer-counting example above, but with the k-mers sorted in descending order of their number of occurrences. To use this, save the code snippet as kmer.scala and run adam-shell -i kmer.scala.

kmer.scala

rt org.bdgenomics.adam.rdd.ADAMContext._
rt org.bdgenomics.adam.projections.{ AlignmentRecordField, Projection }

oad alignments from disk
reads = sc.loadAlignments(
data/NA21144.chrom11.ILLUMINA.adam",
ojection = Some(
Projection(
  AlignmentRecordField.sequence,
  AlignmentRecordField.readMapped,
  AlignmentRecordField.mapq
)



enerate, count and sort 21-mers
kmers =
ads
.rdd
.flatMap(_.getSequence.sliding(21).map(k => (k, 1L)))
.reduceByKey(_ + _)
.map(_.swap)
.sortByKey(ascending = false)

rint the top 10 most common 21-mers
s.take(10).foreach(println)

adam-shell -i kmer.scala

am-shell -i kmer.scala

771,TTTTTTTTTTTTTTTTTTTTT)
17,ACACACACACACACACACACA)
23,TGTGTGTGTGTGTGTGTGTGT)
74,CACACACACACACACACACAC)
95,GTGTGTGTGTGTGTGTGTGTG)
97,TAATCCCAGCACTTTGGGAGG)
81,AATCCCAGCACTTTGGGAGGC)
75,TGTAATCCCAGCACTTTGGGA)
84,CCTCCCAAAGTGCTGGGATTA)

Running on a cluster

The adam-submit and adam-shell commands can also be used to submit ADAM jobs to a Spark cluster, or to run ADAM interactively. Cluster mode can be enabled by passing the same flags you'd pass to Spark, e.g. --master yarn --deploy-mode client.

Under the Hood

ADAM relies on several open-source technologies to make genomic analyses fast and massively parallelizable?

Apache Spark

Apache Spark allows developers to write algorithms in succinct code that can run fast locally, on an in-house cluster or on Amazon, Google or Microsoft clouds.

Apache Parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Parquet compresses legacy genomic formats using standard columnar techniques (e.g. RLE, dictionary encoding). ADAM files are typically ~20% smaller than compressed BAM files.
Parquet integrates with:
- Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
- Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
- Data models: Avro, Thrift, ProtocolBuffers, POJOs
Parquet is simply a file format which makes it easy to sync and share data using tools like distcp, rsync, etc
Parquet provides a command-line tool, parquet.hadoop.PrintFooter, which reports useful compression statistics

In the counting k-mers example above, you can see there is a defined predicate and projection. The predicate allows rapid filtering of rows while a projection allows you to efficiently materialize only specific columns for analysis. For this k-mer counting example, we filter out any records that are not mapped or have a MAPQ less than 20 using a predicate and only materialize the Sequence, ReadMapped flag and MAPQ columns and skip over all other fields like Reference or Start position, e.g.

Sequence| ReadMapped | MAPQ | ~~Reference~~ | ~~Start~~ | … ——–|————|——|———–|——-|——- ~~GGTCCAT~~ | ~~false~~ | - | ~~chrom1~~ | - | … TACTGAA | true | 30 | ~~chrom1~~ | ~~34232~~ | … ~~TTGAATG~~ | ~~true~~ | ~~17~~ | ~~chrom1~~ | ~~309403~~ | …

Apache Avro

Apache Avro is a data serialization system.
All Big Data Genomics schemas are published at https://github.com/bigdatagenomics/bdg-formats.
Having explicit schemas and self-describing data makes integrating, sharing and evolving formats easier.

Our Avro schemas are directly converted into source code using Avro tools. Avro supports a number of computer languages. ADAM uses Java; you could just as easily use this Avro IDL description as the basis for a Python project. Avro currently supports c, c++, csharp, java, javascript, php, python and ruby.

Downstream Applications

There are a number of projects built on ADAM, e.g.

avocado Variant caller
cannoli Pipe API wrappers for bioinformatics tools
cs-bwamem Scalable alignment using bwamem
eggo Parquet-formatted public 'omics datasets
gnocchi Genotype-phenotype analysis
guacamole Variant caller from Hammer Lab at Mt. Sinai
mango Scalable genome visualization
quinine Read and variant quality control metrics
rice RNA quantification pipeline
snappea Parallel alignment using SNAP

License

ADAM is released under an Apache 2.0 license.

Citing ADAM

ADAM has been described in two manuscripts. The first, a tech report, came out in 2013 and described the rationale behind using schemas for genomics, and presented an early implementation of some of the preprocessing algorithms. To cite this paper, please cite:

hreport{massie13,
tle={{ADAM}: Genomics Formats and Processing Patterns for Cloud Scale Computing},
thor={Massie, Matt and Nothaft, Frank and Hartl, Christopher and Kozanitis, Christos and Schumacher, Andr{\'e} and Joseph, Anthony D and Patterson, David A},
ar={2013},
stitution={UCB/EECS-2013-207, EECS Department, University of California, Berkeley}

The second, a conference paper, appeared in the SIGMOD 2015 Industrial Track. This paper described how ADAM's design was influenced by database systems, expanded upon the concept of a stack architecture for scientific analyses, presented more results comparing ADAM to state-of-the-art single node genomics tools, and demonstrated how the architecture generalized beyond genomics. To cite this paper, please cite:

roceedings{nothaft15,
tle={Rethinking Data-Intensive Science Using Scalable Analytics Systems},
thor={Nothaft, Frank A and Massie, Matt and Danford, Timothy and Zhang, Zhao and Laserson, Uri and Yeksigian, Carl and Kottalam, Jey and Ahuja, Arun and Hammerbacher, Jeff and Linderman, Michael and Franklin, Michael and Joseph, Anthony D. and Patterson, David A.},
oktitle={Proceedings of the 2015 International Conference on Management of Data (SIGMOD '15)},
ar={2015},
ganization={ACM}

We prefer that you cite both papers, but if you can only cite one paper, we prefer that you cite the SIGMOD 2015 manuscript.

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.