SciLifeLab/joint_variant_calling

Name: joint_variant_calling

Owner: Science For Life Laboratory

Description: null

Forked from: vezzi/joint_variant_calling

Created: 2016-09-22 09:37:16.0

Updated: 2016-09-22 09:37:18.0

Pushed: 2017-03-15 14:49:17.0

Homepage: null

Size: 339

Language: Shell

README

joint_variant_calling

Python package to run joint calling on a population at NGI (National Genomics Infrastructure) Sweden. The script joint_variant_calling.py implements the GATK workflow described in https://www.broadinstitute.org/gatk/guide/article?id=3893

With the option --mixed-positions, the VQSR step is executed following the best practice described in http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr in order to avoid the problem with MIXED positions (i.e., positions where an indel overlaps a SNP). This is the mode used to run the SweGen dataset.

Samples to be joint called can be specified in two ways:

If run like

it creates the following folder structure:
- `00_intervals`: optional, see the Intervals section
- `00_samples.txt`: samples that are processed (i.e., samples that are joint called)
- `01_CombineGVCFs`: step one is CombineGVCFs, batching gvcf files together
- `02_GenotypeGVCFs`: then run GenotypeGVCFs
- `03_CatVariants`: merge the gvcfs into one in case the computation has been spread out over intervals (see the Intervals section)
- `04_SelectVariants`: extract SNPs and INDELs and run the evaluation tool from GATK to prepare VQSR
- `05_VariantRecalibrator`: first step of VQSR
- `06_ApplyRecalibration`: second step of VQSR

If run like

it creates the following folder structure

Folders 01, 02, …, 06 are all organised in the same way:

Folder 01_CombineGVCFs contains an extra sub-folder:

If run like

it resumes the joint calling, adding the new samples and recomputing only the last batch (if needed) and the new batches (in `01_CombineGVCFs`).
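The resume behaviour described above can be pictured with a short sketch. It is not the code of joint_variant_calling.py: it assumes that batches have a fixed size and are filled in order, so only the last existing batch can be incomplete. The function `plan_resume` and its arguments are hypothetical names introduced for illustration.

```python
def plan_resume(existing_batches, new_samples, batch_size):
    """Return (batches, indices_to_recompute) after adding new samples.

    Assumption: batches are filled in order, so only the last existing
    batch can be incomplete and may need recomputing.
    """
    batches = [list(b) for b in existing_batches]
    to_recompute = []
    pending = list(new_samples)

    # Top up the last batch if it still has room; it must then be recomputed.
    if batches and pending and len(batches[-1]) < batch_size:
        room = batch_size - len(batches[-1])
        batches[-1].extend(pending[:room])
        pending = pending[room:]
        to_recompute.append(len(batches) - 1)

    # Any remaining new samples go into brand-new batches.
    for start in range(0, len(pending), batch_size):
        batches.append(pending[start:start + batch_size])
        to_recompute.append(len(batches) - 1)

    return batches, to_recompute


# Hypothetical sample names: 7 samples already processed, 2 added later.
existing = [["s1", "s2", "s3", "s4"], ["s5", "s6", "s7"]]
batches, redo = plan_resume(existing, ["s8", "s9"], batch_size=4)
print(redo)  # indices of the batches that would be recomputed
```

In this toy case the first batch is left untouched, the incomplete second batch is topped up and recomputed, and one new batch is created.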


Intervals
The most time-consuming steps of the workflow are `01_CombineGVCFs` and `02_GenotypeGVCFs`. These two steps can be parallelised by running the commands on non-overlapping sections of the genome.
For this purpose a utility script is provided:

The file intervals/human_g1k_v37.dict can be found in this repo. If run in a directory (e.g. 00_intervals), the script creates the interval files (.intervals). This folder needs to be provided in the config.yaml file. Running the example command will create 4 blocks: the first 3 of 1 Gbp and the last one of ~200 Mbp.
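To illustrate the idea of cutting the genome into non-overlapping blocks, here is a minimal sketch. It is not the utility script from this repo, and how that script places block boundaries is not documented here: the sketch assumes blocks of a fixed number of bases that may split a contig, and it uses a toy sequence dictionary instead of human_g1k_v37.dict. The functions `read_contigs` and `make_blocks` are hypothetical names.

```python
def read_contigs(dict_text):
    """Parse the @SQ lines of a sequence dictionary into (name, length) pairs."""
    contigs = []
    for line in dict_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields after the record type look like TAG:VALUE, e.g. SN:1 or LN:250.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs.append((tags["SN"], int(tags["LN"])))
    return contigs


def make_blocks(contigs, block_size):
    """Cut the genome into consecutive blocks of at most block_size bases.

    Each block is a list of 1-based, inclusive 'contig:start-end' strings.
    Assumption: a block may end in the middle of a contig.
    """
    blocks, current, room = [], [], block_size
    for name, length in contigs:
        start = 1
        while start <= length:
            end = min(length, start + room - 1)
            current.append(f"{name}:{start}-{end}")
            room -= end - start + 1
            start = end + 1
            if room == 0:
                # The block is full: close it and open a new one.
                blocks.append(current)
                current, room = [], block_size
    if current:
        blocks.append(current)
    return blocks


# Toy dictionary with two short contigs (illustrative values only).
toy_dict = "@HD\tVN:1.4\n@SQ\tSN:1\tLN:250\n@SQ\tSN:2\tLN:180\n"
blocks = make_blocks(read_contigs(toy_dict), block_size=200)
for index, block in enumerate(blocks):
    print(index, block)
```

With 430 bases in total and a block size of 200, the toy genome ends up in three blocks, the last one smaller than the others, which mirrors the full-size blocks plus a shorter final block mentioned above.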

The example directory in this repo contains a dry run of joint_variant_calling.py on the intervals generated by this command, on 7 samples, in batches of 4 (i.e., this creates two batches, one of 4 and one of 3 samples).
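The batch arithmetic of the example can be checked with a few lines of Python. This is only an illustration of fixed-size batching, assuming samples are batched in list order; `make_batches` and the sample names are hypothetical and not taken from the package.

```python
def make_batches(samples, batch_size):
    """Split a list of samples into consecutive batches of at most batch_size."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


# Seven placeholder sample names, batched four at a time.
samples = [f"sample_{n}" for n in range(1, 8)]
batches = make_batches(samples, batch_size=4)
print([len(batch) for batch in batches])  # [4, 3]
```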

