aigenomics/gwarehouse

Name: gwarehouse

Owner: Open Computing Platform for Bioinformatics

Description: Genome Warehouse

Created: 2017-07-13 15:03:31.0

Updated: 2017-07-13 15:03:31.0

Pushed: 2017-07-17 00:35:02.0

Homepage: null

Size: 2

Language: null

README

Genome Warehouse

A Key Technology for “Precision Medicine”

Background

The concept of a Genome Warehouse was discussed, possibly for the first time, in 2012 [1]. While the basic requirements for a Genome Warehouse have not changed much, several technology changes have happened since then.

The three biggest changes in this area are:

Given the above changes, it's a good time to think again about how we should build a Genome Warehouse.

High Level Requirements

At a high level, a Genome Warehouse requires (a) pipelines for processing genome sequence data and (b) a database for storing genome sequence data and the data produced by subsequent analysis (e.g., variant discovery).

A typical pipeline for DNA sequencing is as follows:

  1. Sequencing: A sequencer (e.g., Illumina) generates raw sequence data from DNA samples (format: FASTQ [1]).
  2. Alignment: An aligner (e.g., BWA-MEM [2, 3, 4]) aligns reads by finding the positions in the reference genome that the reads are most likely to have come from (format: SAM, BAM [5], CRAM [6, 7]).
  3. Variant Discovery: A caller finds nucleotides that differ from the reference genome at given positions in an individual genome or transcriptome (format: VCF [8]).

In addition to the above major steps, pre-processing and filtering steps such as duplicate elimination are needed.
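The three major steps can be read as a chain in which each stage consumes a format that the previous stage produces. The following is a minimal sketch of that idea, not part of the project itself: the `Stage` class, the `PIPELINE` list, and the `check_chain` helper are hypothetical names introduced here for illustration, and only the stage names and file formats come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline step, described only by the formats it reads and writes."""
    name: str
    consumes: frozenset  # formats accepted as input
    produces: frozenset  # formats that may be emitted


# Alignment output may be stored in any of these three formats.
ALIGNED = frozenset({"SAM", "BAM", "CRAM"})

# Hypothetical description of the three major steps listed above.
PIPELINE = [
    Stage("sequencing", frozenset({"DNA sample"}), frozenset({"FASTQ"})),
    Stage("alignment", frozenset({"FASTQ"}), ALIGNED),
    Stage("variant discovery", ALIGNED, frozenset({"VCF"})),
]


def check_chain(stages):
    """Return True if each stage can read at least one format the previous stage writes."""
    return all(
        bool(prev.produces & nxt.consumes)
        for prev, nxt in zip(stages, stages[1:])
    )


print(check_chain(PIPELINE))
```

A pre-processing step such as duplicate elimination would fit into this sketch as an extra `Stage` that both consumes and produces one of the aligned formats, assuming it runs after alignment.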

Some of the stages can take longer than 10 hours with a single thread, and they achieve higher throughput with parallelization.
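One common way to parallelize such a stage is scatter-gather: split the input into independent chunks, process the chunks concurrently, then combine the results. The sketch below illustrates only the pattern. The `scatter_gather` helper and the toy `count_gc` work function are hypothetical stand-ins, and a thread pool is assumed here for simplicity; a real CPU-bound stage written in Python would more likely use a process pool or launch external tools.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(chunk):
    """Toy per-chunk work (hypothetical): count G and C bases in one read."""
    return sum(1 for base in chunk if base in "GC")


def scatter_gather(chunks, work, max_workers=4):
    """Apply `work` to every chunk concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, chunks))


reads = ["ACGT", "GGCC", "ATAT"]
per_chunk = scatter_gather(reads, count_gc)
total_gc = sum(per_chunk)
print(per_chunk, total_gc)
```

Because `pool.map` returns results in the order of its inputs, the gather step can combine them without tracking which worker handled which chunk.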

The goal of a Genome Warehouse is to process a large number of DNA samples (say, more than one million) and store the generated data in a queryable format. For a human genome sample, a sequencer can produce one billion short reads of 200-1000 bases each, totaling 0.5-1 TB of data. More than one DNA sample can be collected from one person at different points in time.
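To give a sense of scale, the figures above can be multiplied out. This is a back-of-the-envelope sketch that assumes exactly one million samples, decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), and no compression or replication; the function name is hypothetical.

```python
TB = 10**12  # bytes, decimal terabyte
PB = 10**15  # bytes, decimal petabyte


def total_storage_bytes(samples, bytes_per_sample):
    """Raw storage needed if every sample takes `bytes_per_sample` bytes."""
    return samples * bytes_per_sample


samples = 1_000_000                               # "more than one million" in the text
low = total_storage_bytes(samples, TB // 2)       # 0.5 TB per sample
high = total_storage_bytes(samples, 1 * TB)       # 1 TB per sample

print(low / PB, high / PB)  # 500.0 1000.0
```

Under these assumptions the raw sequence data alone comes to roughly 500-1000 PB, before counting aligned files, variant calls, or repeated samples from the same person.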

References

Background:

High Level Requirements:


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.