verilylifesciences/genomewarp

Name: genomewarp

Owner: Verily Life Sciences

Description: GenomeWarp translates genetic variants from one genome assembly version to another.

Created: 2016-08-23 16:18:49.0

Updated: 2017-12-30 00:31:59.0

Pushed: 2018-01-08 17:33:40.0

Homepage:

Size: 167

Language: Java

README

Disclaimer

This is not an official Verily product.

GenomeWarp

GenomeWarp is a command-line tool that translates genetic variants in confidently-called genomic regions from one genome assembly version to another, such as from GRCh37 to GRCh38.

Purpose

The goal of GenomeWarp is to translate the variation within a set of regions deemed “confidently-called” in one genome assembly to another genome assembly. In cases where a VCF file represents “all variation in an individual with respect to the genome assembly against which the VCF was generated”, GenomeWarp can be used to transform that data to the analogous set of all variation in the individual with respect to a new genome assembly.

This is semantically different from existing tools that support the translation of VCF files from one assembly to another, including:

These tools operate only on the sites present in an input VCF, and return the representation of those sites in a new genome assembly. This does not capture all variation, however. Consider an individual who has sequence reads that indicate they match the GRCh37 reference genome assembly at position GRCh37.chr1:169,519,049 (i.e. the individual's genotype is T/T). Because the individual is homozygous reference at that site, there will be no variation present in their VCF file created on GRCh37. However, the analogous position on the updated GRCh38 reference genome assembly, position GRCh38.chr1:169,549,811, has the reference base C. Consequently, if the individual's read data were analyzed on GRCh38, they would be identified as homozygous for a C->T SNP. Because this site is not present in the input GRCh37 VCF, it is never added when creating a GRCh38 VCF by these other tools.
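The reference-change case above can be sketched in a few lines of Java. This is only an illustration of the reasoning, not GenomeWarp's implementation: the class and method names are hypothetical, and the sketch assumes the query and target positions lie on the same strand.

```java
import java.util.Optional;

/** Hypothetical illustration; not part of the GenomeWarp code base. */
final class ReferenceChangeSketch {

    /**
     * For a site where the sample is homozygous for the query reference base,
     * returns the SNP implied on the target assembly, or empty if the two
     * reference bases agree. Assumes both positions are on the same strand.
     */
    static Optional<String> warpHomRefSite(char queryRef, char targetRef) {
        if (queryRef == targetRef) {
            // Still homozygous reference on the target; nothing to report.
            return Optional.empty();
        }
        // The sample's bases are unchanged; only the reference differs, so the
        // sample becomes homozygous for the alternate allele (genotype 1/1).
        return Optional.of("REF=" + targetRef + " ALT=" + queryRef + " GT=1/1");
    }

    public static void main(String[] args) {
        // GRCh37.chr1:169,519,049 has reference T; GRCh38.chr1:169,549,811 has reference C.
        System.out.println(warpHomRefSite('T', 'C'));
    }
}
```

A tool that only lifts over the records already present in the query VCF never evaluates this site, which is why the implied C->T SNP is lost.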

Nomenclature and format definitions

In the below descriptions, the genome assembly on which the confidently-called regions and variants are given is denoted the “query” assembly. The genome assembly onto which the user wishes to warp the variants and regions is denoted the “target” assembly.

File formats:

Inputs

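To warp variants from a query assembly to a target assembly, five inputs are required (the corresponding command-line option is given in parentheses):

  1. A chain file describing the mapping from the query assembly to the target assembly (lift_over_chain_path).
  2. A VCF file of variants called on the query assembly (raw_query_vcf).
  3. A BED file of confidently-called regions on the query assembly (raw_query_bed).
  4. A FASTA file of the query assembly (ref_query_fasta).
  5. A FASTA file of the target assembly (ref_target_fasta).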

The BED file of confidently-called regions can be created by emitting an all-sites VCF file (when calling variants with the GATK) and filtering the homozygous-reference calls at a desired quality threshold.
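As a rough sketch of that filtering step, the Java below keeps per-site calls that meet a quality threshold and merges adjacent positions into BED intervals. It is a simplified illustration under stated assumptions: the SiteCall type, the choice of quality value, and the threshold are hypothetical, VCF parsing is omitted, and the input is assumed to be sorted by chromosome and position.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not part of the GenomeWarp code base. */
final class ConfidentRegionsSketch {

    /** One all-sites call: chromosome, 1-based position, and a quality value. */
    static final class SiteCall {
        final String chrom;
        final long pos;
        final double quality;

        SiteCall(String chrom, long pos, double quality) {
            this.chrom = chrom;
            this.pos = pos;
            this.quality = quality;
        }
    }

    /**
     * Keeps calls with quality >= minQuality and merges runs of adjacent
     * positions into BED lines (0-based start, half-open end).
     */
    static List<String> toBed(List<SiteCall> sortedCalls, double minQuality) {
        List<String> bed = new ArrayList<>();
        String chrom = null;
        long start = -1;
        long end = -1;
        for (SiteCall call : sortedCalls) {
            if (call.quality < minQuality) {
                continue; // low-confidence site: leaves a gap in the regions
            }
            long s = call.pos - 1; // 1-based VCF position to 0-based BED start
            if (chrom != null && chrom.equals(call.chrom) && s <= end) {
                end = Math.max(end, s + 1); // extend the open interval
            } else {
                if (chrom != null) {
                    bed.add(chrom + "\t" + start + "\t" + end);
                }
                chrom = call.chrom;
                start = s;
                end = s + 1;
            }
        }
        if (chrom != null) {
            bed.add(chrom + "\t" + start + "\t" + end); // flush the last interval
        }
        return bed;
    }
}
```

In practice the quality value would come from whichever field the variant caller reports for reference calls, and the threshold is a user choice.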

Common file downloads

FASTA files can be downloaded from NCBI or the UCSC Genome Browser, among other places. See the NCBI How To for details.

Common chain file downloads are available from the UCSC Genome Browser at the URLs

http://hgdownload.cse.ucsc.edu/goldenpath/${QUERY}/liftOver/${QUERY}To${TARGET^}.over.chain.gz

for the appropriate definitions of QUERY and TARGET. For example, the chain to transform from hg19 to hg38 is http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz.
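The URL pattern can also be built programmatically. The helper below is a small sketch with a hypothetical class and method name; it mirrors the shell expansion ${TARGET^}, which upper-cases only the first character of the target name.

```java
import java.util.Locale;

/** Hypothetical helper; not part of the GenomeWarp code base. */
final class ChainUrlSketch {

    /** Builds the UCSC liftOver chain URL for the given query and target assembly names. */
    static String chainUrl(String query, String target) {
        if (query.isEmpty() || target.isEmpty()) {
            throw new IllegalArgumentException("assembly names must be non-empty");
        }
        // Equivalent of bash's ${TARGET^}: capitalize the first character only.
        String capitalizedTarget =
            target.substring(0, 1).toUpperCase(Locale.ROOT) + target.substring(1);
        return "http://hgdownload.cse.ucsc.edu/goldenpath/" + query
            + "/liftOver/" + query + "To" + capitalizedTarget + ".over.chain.gz";
    }

    public static void main(String[] args) {
        System.out.println(chainUrl("hg19", "hg38"));
    }
}
```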

Building this project

  1. git clone this repository.
  2. Use a recent version of Apache Maven (e.g., version 3.3.3) to build this code:

     mvn package

Running the GenomeWarp tool

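Once all five input files are available, performing the transformation involves running a single Java program. The driver is GenomeWarpSerial.java, and it generates two output files: a VCF file of the warped variants (output_variants_file) and a BED file of the confidently-called regions on the target assembly (output_regions_file).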

The program is executed as follows:

java -jar target/verilylifesciences-genomewarp-1.0.0-runnable.jar \
  --lift_over_chain_path "${chain}" \
  --raw_query_vcf "${queryvcf}" \
  --raw_query_bed "${querybed}" \
  --ref_query_fasta "${queryfasta}" \
  --ref_target_fasta "${targetfasta}" \
  --work_dir "${workdir}" \
  --output_variants_file "${targetvcf}" \
  --output_regions_file "${targetbed}"
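Before launching a run, it can be useful to confirm that every input file is in place. The following is a minimal pre-flight sketch, not part of GenomeWarp itself: the class and method names are hypothetical, and the map keys are simply labels chosen to match the option names above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check; not part of the GenomeWarp code base. */
final class InputPreflightSketch {

    /** Returns the labels of inputs that are not readable regular files. */
    static List<String> missingInputs(Map<String, Path> inputs) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Path> entry : inputs.entrySet()) {
            Path path = entry.getValue();
            if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Expects the five input paths in the order listed below.
        String[] labels = {
            "lift_over_chain_path", "raw_query_vcf", "raw_query_bed",
            "ref_query_fasta", "ref_target_fasta"
        };
        if (args.length != labels.length) {
            System.err.println("usage: InputPreflightSketch chain vcf bed queryFasta targetFasta");
            return;
        }
        Map<String, Path> inputs = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++) {
            inputs.put(labels[i], Paths.get(args[i]));
        }
        System.out.println("Missing inputs: " + missingInputs(inputs));
    }
}
```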

While the program runs, logging statements report its progress. GenomeWarp should convert a single-sample VCF containing millions of variants genome-wide in under 30 minutes.

Notes

There are multiple reasons why a confidently-called region in the query assembly (and any variants therein) may not appear in the target assembly. GenomeWarp is deliberately conservative in tricky cases, preferring to omit a confidently-called region and its constituent variants if there is not an unambiguous mapping. The guarantee GenomeWarp provides is that all confidently-called regions in the target assembly faithfully reproduce the same haplotypes as were provided in the query assembly (i.e., GenomeWarp gives 100% specificity at a possible sacrifice to sensitivity).

GenomeWarp currently handles variant-only VCF files (i.e. gVCFs are not supported). A gVCF can be processed using the workaround described here.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.