CD2H gitForager

Paradigm4/variant_warehouse

Name: variant_warehouse

Owner: Paradigm4 Labs

Description: Examples for analyzing Genomic Variant data in SciDB

Created: 2014-11-04 19:48:19.0

Updated: 2017-08-04 12:33:18.0

Pushed: 2017-03-27 16:44:21.0

Homepage:

Size: 13549

Language: HTML

GitHub Committers

User	Most Recent Commit	# Commits
rvernica	2017-03-24 05:16:08.0	1
Timothy Danford	2014-12-05 06:34:08.0	1
Chris Beaumont	2014-11-23 19:25:35.0	1
Alex Poliakov	2018-03-12 22:12:52.0	127
Paradigm4Labs	2015-05-20 03:34:34.0	1
Jonathan Rivers	2015-05-11 21:27:28.0	31
mingshengzhangp4	2016-01-18 21:36:02.0	56
Kriti Sen Sharma	2018-02-22 17:17:24.0	7

Other Committers

User	Email	Most Recent Commit	# Commits
apoliakov	apoliakov@kali.local	2015-03-24 18:21:50.0	2
mingsheng zhang	mingshengzhangp4@paradigm4.com	2015-10-07 15:38:53.0	1
scidb	scidb@ip-10-95-163-155.ec2.internal	2015-09-10 00:23:14.0	2
SciDB user	scidb@salty1.local.paradigm4.com	2015-10-06 00:07:49.0	2

README

Genomic Variant Data Warehouse

This repository collects functions for loading and processing variant datasets, along with other tools for exploring publicly available variant datasets. A few of the scripts are still prototypes, but they can be adapted quickly to your particular use case.

The base directory (variant_warehouse) contains examples of loading and processing genomic variant datasets in SciDB, currently built around the 1000 Genomes dataset (http://www.1000genomes.org).

Part of the original prototype was adapted from scidb-genotypes by Douglas Slotta (NCBI, http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/). See: https://github.com/slottad/scidb-genotypes

These scripts were written for SciDB 14.12 or newer. They are designed to scale, so they run faster on larger clusters. The load_tools plugin is required for the vast majority of the examples. See: www.github.com/paradigm4/load_tools

Data Loaders for Various Common Datasets

load_gene_37: simple gene symbols and positions according to GRCh37
load_1000g: the 1000 Genomes Project phase 3: http://www.1000genomes.org/
load_esp: the Exome Sequencing Project: http://evs.gs.washington.edu/EVS/
load_dbnsfpv2.9: the dbNSFP Project (version 2.9): https://sites.google.com/site/jpopgen/dbNSFP
load_dbnsfpv3: the dbNSFP Project (version 3.0)
load_gvcf: for the Broad's GVCF format: https://www.broadinstitute.org/gatk/guide/article?id=4017

The demonstration code listed below covers common variant-processing use cases.

Use Case Demonstration

rmarkdown/vcf_toolkit.R

A set of example queries in R against 1000 Genomes and ESP data. Includes sample lookups, allele counts, a PCA plot, and range joins.
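
To illustrate what a range join means here, the following is a minimal Python sketch, not the repository's R or SciDB code. It pairs each variant with every gene whose interval contains the variant's position. The record types, the function name range_join, and the toy genes and coordinates are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical record types; the real arrays in SciDB have their own schemas.
Gene = namedtuple("Gene", "symbol chrom start end")
Variant = namedtuple("Variant", "chrom pos ref alt")


def range_join(variants, genes):
    """Pair each variant with every gene whose [start, end] interval contains it."""
    # Group genes by chromosome and sort each group by start position.
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene.chrom, []).append(gene)
    starts = {}
    for chrom, chrom_genes in by_chrom.items():
        chrom_genes.sort(key=lambda g: g.start)
        starts[chrom] = [g.start for g in chrom_genes]

    hits = []
    for variant in variants:
        chrom_genes = by_chrom.get(variant.chrom, [])
        # Only genes starting at or before the variant position can contain it.
        upper = bisect_right(starts.get(variant.chrom, []), variant.pos)
        for gene in chrom_genes[:upper]:
            if gene.end >= variant.pos:
                hits.append((variant, gene.symbol))
    return hits


# Toy data (made up, not real gene coordinates).
genes = [
    Gene("GENE_A", "1", 100, 200),
    Gene("GENE_B", "1", 150, 300),
    Gene("GENE_C", "2", 100, 200),
]
variants = [
    Variant("1", 160, "A", "G"),
    Variant("1", 250, "C", "T"),
    Variant("2", 50, "G", "A"),
]

for variant, symbol in range_join(variants, genes):
    print(variant.chrom, variant.pos, symbol)
```

The sketch scans all candidate genes that start before the variant, which is fine for small inputs; a database such as SciDB would evaluate the same join in parallel across the cluster.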

rmarkdown/vcf_toolkit.Rmd

A set of example queries against 1000 Genomes and ESP data, written in R Markdown.

jupyterNotebook/vcf_toolkit.ipynb

A set of example queries against 1000 Genomes data in a Jupyter notebook.

example_afl_scripts

Some sample queries in AFL, including a grouped allele count and a join of ESP and 1000 Genomes.
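
As a rough illustration of a grouped allele count, the Python sketch below tallies non-reference alleles per variant and population. It is not the AFL from this directory. It assumes diploid genotypes coded as allele indices (0 for the reference allele, 1 or higher for an alternate allele); the function name, sample IDs, variant IDs, and population assignments are hypothetical.

```python
from collections import Counter


def grouped_allele_counts(genotypes, sample_population):
    """Count non-reference alleles per (variant, population).

    genotypes maps (variant_id, sample_id) to a pair of allele indices,
    where 0 is the reference allele and any other value is an alternate.
    sample_population maps sample_id to a population label.
    """
    counts = Counter()
    for (variant_id, sample_id), (allele_1, allele_2) in genotypes.items():
        population = sample_population[sample_id]
        # Each non-zero allele index contributes one alternate allele.
        counts[(variant_id, population)] += (allele_1 != 0) + (allele_2 != 0)
    return counts


# Toy data (made up): three samples in two populations, two variants.
sample_population = {"s1": "EUR", "s2": "EUR", "s3": "AFR"}
genotypes = {
    ("var1", "s1"): (0, 1),
    ("var1", "s2"): (1, 1),
    ("var1", "s3"): (0, 0),
    ("var2", "s1"): (1, 0),
}

allele_counts = grouped_allele_counts(genotypes, sample_population)
for (variant_id, population), count in sorted(allele_counts.items()):
    print(variant_id, population, count)
```

Groups with no alternate alleles still appear with a count of 0, which keeps a per-population table complete when the result is plotted.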

shiny_browser

A variant browser app that computes allele counts grouped by major population and makes an interactive plot.

shiny_tcga_dbnsfp

An app that plots TCGA alteration frequencies filtered by dbNSFP scores and by clinical keywords. TCGA data must be loaded before you can run it; you can use the AMI, for example.

AMI

Some examples are included in the Bioinformatics AMI, last updated June 2015. Instructions for using it are here: http://www.paradigm4.com/try_scidb/

Spark Benchmark

The benchmark comprises common genomic processing queries that highlight the differences between SciDB and Spark-ADAM. The code for the Spark benchmark is located in variant_warehouse/spark_benchmark.

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.