LANL-Bioinformatics/DETEQT

Name: DETEQT

Owner: LANL-Bioinformatics

Description: Diagnostic targeted sequencing adjudication

Created: 2018-02-22 16:55:47.0

Updated: 2018-04-02 16:50:13.0

Pushed: 2018-04-02 16:50:11.0

Homepage: https://chienchilo.bitbucket.io/targetedNGS/

Size: 19058

Language: Perl

README

DETECT – Diagnostic targETEd sequenCing adjudicaTion

Pipeline for assay design evaluation.

This tool has been designed to be robust enough to handle a range of assay designs. Therefore, no major assumptions are made about the input reads, except that they represent amplicons from a multiplexed targeted amplification reaction and that the reference comprises only the target regions in the assay. The idea is to survey the reads and determine whether each reference sequence, or target, is present or absent. This means that we only deal with reads that map to the reference and ignore reads that do not, under the assumption that if the target is amplified it will be clearly present. False positives are the primary issue with these assays, due to sample bleed (induced by low diversity, barcodes, or flow cell oversaturation) and low-abundance run-to-run contamination.

The concept is to estimate the overall quality of the reads mapping to the reference by reducing four mapping metrics to a single calculated value, compare that value to the depth of mapping, and apply thresholds to separate positive, negative, and indeterminate results. The metrics used (per sample, per reference) are mean base quality, mean mapping quality, linear coverage, and identity. Linear coverage and identity already lie in the range 0 to 1. To scale the mean qualities, we divide the mean base quality by 37, the expected value for Illumina systems, and divide the mean mapping quality by 60, the expected value for BWA. This brings all measures to the range 0 to 1, except for base quality, which is automatically reduced to 1 should the scaled value be higher. Thus, the formula is:

Quality Calculation = linear coverage * identity * (uBaseQ / 37) * (uMapQ / 60)
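As a minimal sketch of the formula above (not the pipeline's own code), the calculation can be written as a small Python function. The function and parameter names are hypothetical; the defaults 37 and 60 and the capping of the base quality term at 1 follow the description in the text.

```python
def quality_calculation(coverage, identity, mean_base_q, mean_map_q,
                        expected_base_q=37.0, expected_map_q=60.0):
    """Reduce four per-sample, per-target mapping metrics to one score."""
    # Scale mean base quality by its expected value and cap the term at 1,
    # as described in the text.
    base_q_term = min(mean_base_q / expected_base_q, 1.0)
    # Scale mean mapping quality by its expected value (60 for BWA).
    map_q_term = mean_map_q / expected_map_q
    return coverage * identity * base_q_term * map_q_term


# Hypothetical target: full coverage, 99% identity, slightly reduced base quality.
score = quality_calculation(coverage=1.0, identity=0.99,
                            mean_base_q=36.0, mean_map_q=60.0)
print(round(score, 4))  # 0.9632
```

Because the four terms are multiplied, a shortfall in any single metric lowers the whole score, which is what makes the value stringent.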

This calculation is designed to be simple enough to compute quickly in a script and stringent enough to catch most indicators of a false positive, including reduced base quality from sample bleed, reduced mapping quality from mismapped reads, reduced identity from a divergent strain, and any combination that could indicate anything other than a true positive (see attached mapping stats graphs). However, this calculation alone does not constitute an analytical model because it does not consider abundance. As with GOTTCHA, we rely on read depth as an indicator of abundance, not on the number or percentage of reads. We consider a depth of at least 1E3 together with a quality calculation of at least 0.95^4 (0.8145) to be a positive (see example results graph). These thresholds are not set in stone. We are looking into less subjective ways to set them, but for now they seem to be very robust for all of our assays, especially when low-diversity amplicon sequencing is expected.
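The positive call described above can be sketched as a two-threshold check. This is an illustration only: the names are hypothetical, and because the text does not spell out how negative and indeterminate results are separated, the sketch only answers whether a target meets both thresholds.

```python
# Thresholds from the text: depth of at least 1E3 and quality of at least 0.95^4.
Q_CUTOFF = 0.95 ** 4        # about 0.8145
DEPTH_CUTOFF = 1000.0


def is_positive(depth_mean, quality,
                depth_cutoff=DEPTH_CUTOFF, q_cutoff=Q_CUTOFF):
    """Return True when both the depth and the quality threshold are met."""
    return depth_mean >= depth_cutoff and quality >= q_cutoff


print(round(Q_CUTOFF, 4))          # 0.8145
print(is_positive(1500.0, 0.90))   # True: deep and high quality
print(is_positive(500.0, 0.99))    # False: high quality but too shallow
print(is_positive(5000.0, 0.50))   # False: deep but low quality
```

The value 0.95^4 corresponds to each of the four scaled metrics sitting at 0.95.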

Installing DETECT

Download the latest version of DETECT from GitHub, or use git clone from the command line.

git clone https://github.com/LANL-Bioinformatics/DETECT.git

cd into the DETECT directory and run the installation script

cd DETECT
./INSTALL.sh

Sudo privileges are not needed for installation. The installation script downloads the required dependencies, so an internet connection is needed. A log of the installation is written to install.log.

Dependencies

Running DETECT requires the following dependencies, which should be in your path. All of the dependencies will be installed by INSTALL.sh.

Programming/Scripting languages
Unix
Third-party software/packages
R packages

(Optional)

Perl modules
Python package
Running DETECT
Usage:
    DETECT --ref ref.fa --indir input_dir --samples sample_description_file
Inputs: (required)
    --ref            Reference FASTA file
    --indir          Directory with input FASTQ files
    --samples        Sample description file, tab-delimited or xlsx format (header: #SampleID Files)
General:
    --help           This help
    --version        Print version and exit
    --quite          No screen output (default OFF)
Outputs:
    --outdir         Output directory
    --prefix         Filename output prefix [auto] (default '')
    --force          Force overwriting existing output folder (default OFF)
Alignment:
    --align_options  BWA mem options in quotation marks (ex: "-x ont2d" for Oxford Nanopore 2D reads)
Computation:
    --cpus           Number of CPUs to use [0=all] (default '4')
    --mode           Paired-End (PE) or Single-End (SE) (default PE)
    --q_cutoff       Quality Calculation cutoff (default 0.8145)
    --depth_cutoff   Depth of coverage cutoff (default 1000)

Expected value options    Expected value for the respective quality metric
    --expectedCoverage  (1)
    --expectedIdentity  (1)
    --expectedBaseQ     (37)
    --expectedMapQ      (60)

Weight options            Weight for the respective metric (sum=1) [double]
    --coverageWeight    (0.25)
    --identityWeight    (0.25)
    --baseqWeight       (0.25)
    --mapqWeight        (0.25)

sample_description_file: a tab-delimited file with the header #SampleID Files. In the Files column, paired-end FASTQ files are separated by a comma, and all FASTQ files should be located in the input directory (--indir). For example:

#SampleID  Files
Dengue     sample.1.fq,sample.2.fq
Flu        flu.1.fq,flu.2.fq
Ebola      ebola.1.fq,ebola.2.fq
MERS       mers.1.fq,mers.2.fq
SARS       sars.1.fq,sars.2.fq
Zika       zika.1.fq,zika.2.fq
Rota       rota.1.fq,rota.2.fq
HIV        hiv.1.fq,hiv.2.fq
Hanta      hanta.1.fq,hanta.2.fq
HCV        hcv.1.fq,hcv.2.fq
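To make the file layout concrete, here is a small sketch of how such a sample description file could be parsed. It is not the pipeline's own parser. The function name and the sample IDs in the example are hypothetical, and only the tab-delimited form is handled, not xlsx.

```python
import csv
import io


def read_sample_descriptions(handle):
    """Return {sample_id: [fastq_file, ...]} from a tab-delimited file."""
    samples = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or incomplete lines and the '#SampleID<TAB>Files' header.
        if len(row) < 2 or not row[0].strip() or row[0].startswith("#"):
            continue
        # Files for one sample are separated by commas.
        files = [name.strip() for name in row[1].split(",") if name.strip()]
        samples[row[0].strip()] = files
    return samples


# Illustrative input; the sample IDs are made up for this example.
example = (
    "#SampleID\tFiles\n"
    "ebola\tebola.1.fq,ebola.2.fq\n"
    "hiv\thiv.1.fq,hiv.2.fq\n"
)
samples = read_sample_descriptions(io.StringIO(example))
print(samples["ebola"])  # ['ebola.1.fq', 'ebola.2.fq']
```

In paired-end mode each sample is expected to list two files. A single-end sample would list one.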
Test
cd test
./runTest.sh
Outputs (--outdir)

|-- mapping
|-- reports
|   |-- prefix_quality_report.html
|   |-- prefix_quality_report.png
|   |-- prefix_sample_plot.html
|   |-- prefix_sample_plot.png
|   |-- prefix_target_plot.html
|   `-- prefix_target_plot.png
|-- stats
|   |-- prefix.mapping_stats.txt
|   |-- prefix.report.txt
|   `-- prefix.run_stats.txt
`-- prefix.log

mapping: a directory containing, for every sample, the BAM file of reads mapped to the reference, the BAM index, and log files.

reports: a directory containing the report HTML and PNG files and a log file.

stats: a directory containing the mapping statistics as tab-delimited tables (see the table descriptions below).

prefix.log: a record of all the commands and scripts that were run as part of the pipeline, and any errors they produced.

Quality report plot

Interactive quality report (HTML) and static quality plot (PNG).

Sample plot

Interactive sample plot (HTML) and static sample plot (PNG).

Target plot

Interactive target plot (HTML) and static target plot (PNG).

mapping_stats tab-delimited table:

Column | Description
--- | ---
SampleID | Sample name
Target | Target reference ID
Length | Target reference sequence length
Quality_Calculation | Coverage * Identity * (BaseQ_mean / 37) * (MapQ_mean / 60)
Depth_Mean | Target reference average coverage depth
Depth_RMS | Target reference coverage depth root mean square
Depth_StdDev | Target reference coverage depth standard deviation
Depth_SNR | Depth_Mean / Depth_StdDev
Coverage | Target reference linear coverage
Match_Bases | Matched bases count
Mismatch_Bases | Mismatched bases count
Total_Bases | Match_Bases + Mismatch_Bases
Identity | Match_Bases / (Match_Bases + Mismatch_Bases)
BaseQ_mean | Mapped reads, all bases, average quality
BaseQ_RMS | Mapped reads, all bases, quality root mean square
BaseQ_StdDev | Mapped reads, all bases, quality standard deviation
BaseQ_SNR | BaseQ_mean / BaseQ_StdDev
Match_BaseQ_mean | Matched bases average quality
Match_BaseQ_RMS | Matched bases quality root mean square
Match_BaseQ_StdDev | Matched bases quality standard deviation
Match_BaseQ_SNR | Match_BaseQ_mean / Match_BaseQ_StdDev
Mismatch_BaseQ_mean | Mismatched bases average quality
Mismatch_BaseQ_RMS | Mismatched bases quality root mean square
Mismatch_BaseQ_StdDev | Mismatched bases quality standard deviation
Mismatch_BaseQ_SNR | Mismatch_BaseQ_mean / Mismatch_BaseQ_StdDev
MapQ_mean | Mapping quality average
MapQ_StdDev | Mapping quality standard deviation
MapQ_RMS | Mapping quality root mean square
MapQ_SNR | MapQ_mean / MapQ_StdDev
Mapped_Reads | Target reference mapped reads count
Fraction_Reads | Target reference mapped reads count / Total mapped reads
Determination | Based on Quality_Calculation and Depth_Mean (see make determination calls below)
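Several columns in this table are simple summaries of the same underlying values (mean, root mean square, standard deviation, and their ratio). The sketch below shows how the depth columns and Identity relate to each other. It is illustrative only: the function names are hypothetical, and the use of the population standard deviation is an assumption, since the table does not say which form the pipeline uses.

```python
import math


def depth_summary(depths):
    """Summarize per-base coverage depths for one target."""
    n = len(depths)
    mean = sum(depths) / n
    rms = math.sqrt(sum(d * d for d in depths) / n)
    # Population standard deviation (assumption; see the note above).
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / n)
    # SNR is defined in the table as mean / standard deviation.
    snr = mean / std if std > 0 else float("inf")
    return {"Depth_Mean": mean, "Depth_RMS": rms,
            "Depth_StdDev": std, "Depth_SNR": snr}


def identity(match_bases, mismatch_bases):
    """Identity = Match_Bases / (Match_Bases + Mismatch_Bases)."""
    total = match_bases + mismatch_bases
    return match_bases / total if total else 0.0


stats = depth_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["Depth_Mean"], stats["Depth_StdDev"], stats["Depth_SNR"])  # 5.0 2.0 2.5
print(identity(990, 10))  # 0.99
```

A high SNR indicates even coverage across the target, while a low SNR points to coverage concentrated in a few positions.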

report.txt tab-delimited table:

Column | Description
--- | ---
SampleID | Sample name
Target | Target reference ID
Determination | Based on Quality_Calculation and Depth_Mean (see make determination calls below)
Depth_Mean | Target reference average coverage depth
Quality_Calculation | Coverage * Identity * (BaseQ_mean / 37) * (MapQ_mean / 60)

run_stats tab-delimited table:

Column | Description
--- | ---
SampleID | Sample name
Prefilter_Reads | Number of raw reads
Unmapped_Reads | Number of unmapped reads
Percent_Unmapped_Reads | Unmapped_Reads / Prefilter_Reads
Mapped_Singlets | Number of reads where only one read of the pair mapped
Percent_Mapped_Singlets | Mapped_Singlets / Prefilter_Reads
Postfilter_Reads | Number of properly paired reads
Discarded_Reads | Prefilter_Reads - Postfilter_Reads
Discarded_Percent | Discarded_Reads / Prefilter_Reads
Percent_Run | Prefilter_Reads / sum(Prefilter_Reads)
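The derived columns in this table follow directly from the read counts. As a rough sketch with hypothetical names, they could be computed as below. The ratios are returned as fractions exactly as the table defines them; whether the pipeline multiplies them by 100 before writing the file is not stated, so that step is left out.

```python
def run_stats(prefilter, postfilter, unmapped, singlets, run_total):
    """Derive the ratio columns of the run_stats table for one sample.

    run_total is sum(Prefilter_Reads) over all samples in the run.
    """
    if prefilter <= 0 or run_total <= 0:
        raise ValueError("read counts must be positive")
    discarded = prefilter - postfilter
    return {
        "Percent_Unmapped_Reads": unmapped / prefilter,
        "Percent_Mapped_Singlets": singlets / prefilter,
        "Discarded_Reads": discarded,
        "Discarded_Percent": discarded / prefilter,
        "Percent_Run": prefilter / run_total,
    }


# Made-up counts for one sample in a run of 4000 raw reads.
row = run_stats(prefilter=1000, postfilter=900, unmapped=60,
                singlets=40, run_total=4000)
print(row["Discarded_Reads"], row["Discarded_Percent"], row["Percent_Run"])  # 100 0.1 0.25
```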

DETECT R Shiny app

The app interactively visualizes the mapping_stats output file.

Usage:
Rscript ShinyApp/app.R outdir/stats/prefix.mapping_stats.txt Quality_Calculation_cutoff Depth_of_coverage_cutoff port

    Default Value
    Quality_Calculation_cutoff: 0.95^4
    Depth_of_coverage_cutoff: 1000
    port: 3838  (R Shiny Server port)

Rscript ShinyApp/app.R ShinyApp/DETECT_02222017.mapping_stats.txt

live demo: The DETECT output visualization R Shiny app on shinyapps.io.

To host the app behind Apache, add the following settings to the Apache configuration file:

ProxyPass /shiny/websocket  ws://localhost:3838/websocket
ProxyPassReverse /shiny/websocket  ws://localhost:3838/websocket
ProxyPass /shiny/ http://localhost:3838/
ProxyPassReverse /shiny/ http://localhost:3838/
Removing DETECT

To remove DETECT, delete (rm -rf) the DETECT folder. This also removes any packages that were downloaded into that folder.

PseudoCode

Shell pipe
Switch to scripting language
Contact Info
Citations

If you use DETECT, please cite the following papers:


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.