Paradigm4/variant_warehouse

Name: variant_warehouse

Owner: Paradigm4 Labs

Description: Examples for analyzing Genomic Variant data in SciDB

Created: 2014-11-04 19:48:19.0

Updated: 2017-08-04 12:33:18.0

Pushed: 2017-03-27 16:44:21.0

Homepage:

Size: 13549

Language: HTML

GitHub Committers

User | Most Recent Commit | # Commits
rvernica | 2017-03-24 05:16:08.0 | 1
Timothy Danford | 2014-12-05 06:34:08.0 | 1
Chris Beaumont | 2014-11-23 19:25:35.0 | 1
Alex Poliakov | 2018-03-12 22:12:52.0 | 127
Paradigm4Labs | 2015-05-20 03:34:34.0 | 1
Jonathan Rivers | 2015-05-11 21:27:28.0 | 31
mingshengzhangp4 | 2016-01-18 21:36:02.0 | 56
Kriti Sen Sharma | 2018-02-22 17:17:24.0 | 7

Other Committers

User | Email | Most Recent Commit | # Commits
apoliakov | apoliakov@kali.local | 2015-03-24 18:21:50.0 | 2
mingsheng zhang | mingshengzhangp4@paradigm4.com | 2015-10-07 15:38:53.0 | 1
scidb | scidb@ip-10-95-163-155.ec2.internal | 2015-09-10 00:23:14.0 | 2
SciDB user | scidb@salty1.local.paradigm4.com | 2015-10-06 00:07:49.0 | 2

README

Genomic Variant Data Warehouse

This repository organizes functions for loading and processing variant datasets, along with other tools that facilitate exploration of publicly available variant datasets. A few of the scripts are still prototypes, but they can be adapted quickly to your particular use case.

The base directory (variant_warehouse) contains examples of loading and processing genomic variant datasets in SciDB, currently built around the 1000 Genomes dataset (http://www.1000genomes.org).

Part of the original prototype was adapted from scidb-genotypes by Douglas Slotta (NCBI) (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/). See: https://github.com/slottad/scidb-genotypes

These scripts were created for SciDB 14.12 or newer. They are designed for scalability, so the larger the cluster, the faster they will run. The load_tools plugin is required for the vast majority of the examples. See: www.github.com/paradigm4/load_tools

Data Loaders for Various Common Datasets

Below are examples of demonstration code for variant processing use cases.

Use Case Demonstration
rmarkdown/vcf_toolkit.R

A set of example queries against 1000 Genomes and ESP data, written in R. Includes sample lookups, allele counts, a PCA plot, and range joins.
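To illustrate what a range join means in this context, here is a minimal Python sketch that pairs each variant with the regions whose coordinates contain its position. It is not taken from vcf_toolkit.R and does not use SciDB; the `Variant` and `Region` types, the gene names, and the coordinates are all hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


class Region(NamedTuple):
    name: str
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def range_join(variants, regions):
    """Return (variant, region) pairs where the variant lies inside the region."""
    # Index regions by chromosome so each variant is only compared
    # against regions on its own chromosome.
    by_chrom = {}
    for region in regions:
        by_chrom.setdefault(region.chrom, []).append(region)

    pairs = []
    for variant in variants:
        for region in by_chrom.get(variant.chrom, []):
            if region.start <= variant.pos <= region.end:
                pairs.append((variant, region))
    return pairs


# Hypothetical data, for illustration only.
variants = [
    Variant("1", 150, "A", "G"),
    Variant("1", 900, "C", "T"),
    Variant("2", 150, "G", "A"),
]
regions = [
    Region("GENE_A", "1", 100, 200),
    Region("GENE_B", "2", 100, 200),
]
matches = range_join(variants, regions)
```

The nested loop is fine for a toy example; at the scale of 1000 Genomes, the comparison would be pushed down into the database rather than done in client code.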

rmarkdown/vcf_toolkit.Rmd

A set of example queries against 1000 Genomes and ESP data, written in R Markdown.

jupyterNotebook/vcf_toolkit.ipynb

A set of example queries against 1000 Genomes data, in a Jupyter notebook.

example_afl_scripts

Some sample queries in AFL, including grouped allele count and a join of ESP and 1000 Genomes.
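For readers unfamiliar with the idea, a grouped allele count can be sketched in plain Python as below. This is not the AFL from example_afl_scripts; it assumes VCF-style genotype strings such as `0|1` for a single variant, and the sample IDs and population assignments are hypothetical.

```python
from collections import Counter


def grouped_allele_count(genotypes, sample_population):
    """Count alternate alleles per population for a single variant.

    genotypes: sample ID -> VCF-style genotype string, e.g. "0|1" or "1/1".
    sample_population: sample ID -> population label.
    """
    counts = Counter()
    for sample, genotype in genotypes.items():
        population = sample_population[sample]
        # Treat phased ("|") and unphased ("/") separators the same way.
        alleles = genotype.replace("|", "/").split("/")
        # "0" is the reference allele and "." is a missing call;
        # everything else counts as an alternate allele.
        counts[population] += sum(1 for a in alleles if a not in ("0", "."))
    return dict(counts)


# Hypothetical samples and genotypes, for illustration only.
sample_population = {"S1": "EUR", "S2": "EUR", "S3": "AFR"}
genotypes = {"S1": "0|1", "S2": "1|1", "S3": "0/0"}
result = grouped_allele_count(genotypes, sample_population)
```

Populations with no alternate alleles still appear in the result with a count of zero, which keeps downstream plots from silently dropping a group.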

shiny_browser

A variant browser app that computes allele counts grouped by major population and makes an interactive plot.

shiny_tcga_dbnsfp

An app that plots TCGA alteration frequencies filtered by dbNSFP scores and clinical keywords. TCGA data must be loaded before you can run it; one option is to use the AMI.

AMI

Some examples are shown in the Bioinformatics AMI (last updated June 2015). Instructions for it are here: http://www.paradigm4.com/try_scidb/

Spark Benchmark

The benchmark comprises common genomic processing queries that highlight the differences between SciDB and Spark-Adam. The code for the Spark benchmark is located in variant_warehouse/spark_benchmark.

