Name: joint_variant_calling
Owner: Science For Life Laboratory
Description: null
Forked from: vezzi/joint_variant_calling
Created: 2016-09-22 09:37:16.0
Updated: 2016-09-22 09:37:18.0
Pushed: 2017-03-15 14:49:17.0
Homepage: null
Size: 339
Language: Shell
Python package to run joint calling on a population at NGI (National Genomics Infrastructure) Sweden.
The script `joint_variant_calling.py` implements the GATK workflow described in
https://www.broadinstitute.org/gatk/guide/article?id=3893
With the option `--mixed-positions`, the VQSR step is executed following the best practice described in
http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr
in order to avoid the problem with MIXED positions (i.e., positions where an indel overlaps a SNP). This is the mode used to run the SweGen dataset.
Samples to be joint-called can be specified in two ways:
When run without `--mixed-positions`, it creates the following folder structure:
- `00_intervals`: optional, see the Intervals section
- `00_samples.txt`: samples that are processed (i.e., samples that are joint-called)
- `01_CombineGVCFs`: step one is CombineGVCFs, batching gVCF files together
- `02_GenotypeGVCFs`: then run GenotypeGVCFs
- `03_CatVariants`: merge the per-interval variant files into one in case the computation has been spread out over intervals (see the Intervals section)
- `04_SelectVariants`: extract SNPs and INDELs and run the evaluation tool from GATK to prepare VQSR
- `05_VariantRecalibrator`: first step of VQSR
- `06_ApplyRecalibration`: second step of VQSR
When run with `--mixed-positions`, it creates the following folder structure:
- `00_intervals`: optional, see the Intervals section
- `00_samples.txt`: samples that are processed (i.e., samples that are joint-called)
- `01_CombineGVCFs`: step one is CombineGVCFs, batching gVCF files together
- `02_GenotypeGVCFs`: then run GenotypeGVCFs
- `03_CatVariants`: merge the per-interval variant files into one in case the computation has been spread out over intervals (see the Intervals section)
- `04_VQSR`: VQSR step executed as explained in http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr

Folders 01, 02, …, 06 are all organised in the same way:
- `sbatch`: sbatch files to be submitted to the Slurm queue (Uppmax assumed)
- `std_err`: standard error output
- `std_out`: standard output
- `VCF`: contains the gVCF or VCF files (in general the result files; in the case of `05_VariantRecalibrator` it contains the recalibration tables)

Folder `01_CombineGVCFs` contains an extra sub-folder:
- `batches`: used to restart joint calling if new samples are available [UNSTABLE: UNDER TEST]

When joint calling is restarted on an existing run, it resumes the joint calling, adding the new samples and recomputing only the last batch (if needed) and the new batches (in `01_CombineGVCFs`).
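The resume behaviour described above can be illustrated with a minimal sketch. This is not the code used by `joint_variant_calling.py`: the function name `plan_resume`, the sample names, and the rule that a partially filled last batch is topped up before new batches are opened are all assumptions made for illustration.

```python
def plan_resume(existing_batches, new_samples, batch_size):
    """Return (updated_batches, indices_of_batches_to_recompute).

    Hypothetical helper: assumes the last batch is topped up when it has
    fewer than batch_size samples, and that all earlier batches are reused.
    """
    batches = [list(b) for b in existing_batches]  # do not mutate the input
    pending = list(new_samples)
    recompute = []

    # Top up the last batch if it still has room (the "if needed" case).
    if batches and pending and len(batches[-1]) < batch_size:
        room = batch_size - len(batches[-1])
        batches[-1].extend(pending[:room])
        pending = pending[room:]
        recompute.append(len(batches) - 1)

    # Remaining new samples go into brand-new batches.
    for i in range(0, len(pending), batch_size):
        batches.append(pending[i:i + batch_size])
        recompute.append(len(batches) - 1)

    return batches, recompute


# 7 hypothetical samples already called in batches of 4, then 3 new samples.
existing = [["s1", "s2", "s3", "s4"], ["s5", "s6", "s7"]]
batches, recompute = plan_resume(existing, ["s8", "s9", "s10"], 4)
print(recompute)  # batch 0 is untouched and can be reused
```

Under these assumptions only the topped-up last batch and the newly created batch need a new CombineGVCFs run.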
Intervals
The most time-consuming steps of the workflow are `01_CombineGVCFs` and `02_GenotypeGVCFs`. These two steps can be parallelised by running the commands on non-overlapping sections of the genome.
For this purpose a utility script is provided.
The file `intervals/human_g1k_v37.dict` can be found in this repo.
If run in a directory (e.g. `00_intervals`), the script creates the interval files (`.intervals`). This folder needs to be provided in the `config.yaml` file. Running the example command will create 4 blocks: the first 3 of 1 Gbp and the last one of ~200 Mbp.
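To make the idea of fixed-size, non-overlapping blocks concrete, here is a minimal sketch of how a sequence dictionary could be cut into blocks. It is not the utility script from this repo: the function names `read_contigs` and `split_into_blocks` are hypothetical, and the assumption that a block may end in the middle of a contig is inferred only from the fact that the first blocks are exactly 1 Gbp. The sketch relies on the `@SQ` lines of a `.dict` file carrying `SN:` (name) and `LN:` (length) fields, and writes intervals as `contig:start-end` with 1-based inclusive coordinates.

```python
def read_contigs(dict_lines):
    """Extract (name, length) pairs from the @SQ lines of a sequence dictionary."""
    contigs = []
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue
        # Each tab-separated field after "@SQ" has the form TAG:VALUE.
        fields = dict(
            f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:] if ":" in f
        )
        contigs.append((fields["SN"], int(fields["LN"])))
    return contigs


def split_into_blocks(contigs, block_size):
    """Cut the genome into consecutive blocks of block_size bases.

    Assumption: contigs may be split across blocks; the last block
    holds whatever is left over.
    """
    blocks = [[]]
    room = block_size  # bases still available in the current block
    for name, length in contigs:
        start = 1
        while start <= length:
            if room == 0:
                blocks.append([])
                room = block_size
            end = min(length, start + room - 1)
            blocks[-1].append(f"{name}:{start}-{end}")
            room -= end - start + 1
            start = end + 1
    return [b for b in blocks if b]


# Tiny made-up dictionary: two contigs of 250 and 100 bases, blocks of 100 bases.
toy_dict = [
    "@HD\tVN:1.4\tSO:unsorted",
    "@SQ\tSN:1\tLN:250\tUR:file:/toy.fasta",
    "@SQ\tSN:2\tLN:100",
]
toy_blocks = split_into_blocks(read_contigs(toy_dict), 100)
for number, block in enumerate(toy_blocks, start=1):
    print(number, block)
```

With these toy numbers (350 bases, blocks of 100) the result has the same shape as the example described above: three full blocks followed by a shorter last one.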
The `example` directory in this repo contains a dry run of `joint_variant_calling.py` on intervals generated by this command, with 7 samples in batches of 4 (i.e., this creates two batches, one of 4 and one of 3 samples).
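The batch arithmetic in the example (7 samples, batches of 4) can be reproduced with a short sketch. The function name `make_batches` and the sample names are hypothetical, and splitting the samples in their listed order is an assumption rather than a description of what `joint_variant_calling.py` does.

```python
def make_batches(samples, batch_size):
    """Split samples into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


# Seven hypothetical sample names, batched four at a time.
example_samples = [f"sample_{n}" for n in range(1, 8)]
example_batches = make_batches(example_samples, 4)
print([len(batch) for batch in example_batches])  # [4, 3]
```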