hms-dbmi/dropEst

Name: dropEst

Owner: Harvard Medical School - Department of Biomedical Informatics

Description: Pipeline for initial analysis of droplet-based single-cell RNA-seq data

Created: 2017-07-17 20:28:48.0

Updated: 2017-12-08 02:14:28.0

Pushed: 2018-01-15 11:13:22.0

Homepage: null

Size: 40650

Language: C++

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

dropEst - Pipeline

Pipeline for estimating molecular count matrices for droplet-based single-cell RNA-seq measurements. Implements methods, described in this paper.

Table of contents
General processing steps
  1. dropTag: extraction of cell barcodes and UMIs from the library. Result: demultiplexed .fastq.gz files, which should be aligned to the reference.
  2. Alignment of the demultiplexed files to reference genome. Result: .bam files with the alignment.
  3. dropEst: building count matrix and estimation of some statistics, necessary for quality control. Result: .rds file with the count matrix and statistics. Optionally: count matrix in MatrixMarket format.
  4. dropReport - Generating report on library quality.
Setup
System requirements
Installation

Install R packages:

all.packages(c("Rcpp","RcppEigen", "RInside", "Matrix"))

Clone this repository:

clone https://github.com/hms-dbmi/dropEst.git

Build:

ropEst
e . && make
Troubleshooting

If cmake can't find one of the libraries, or you want to use some specific versions, which are currently not in the default path, use corresponding cmake variables:

These variables should be set to the path to the installed library. It can be done either by using command line options: cmake -D R_ROOT="path_to_r" or by adding the variable declaration to the beginning of CMakeLists.txt: set(R_ROOT path_to_r).

In case you have some issues with the linker for specific library, please try to build this library manually with the version of compiler, which you're going to use for dropEst build.

dropTag
droptag -- generate tagged fastq files for alignment

Example command:

optag -c config.xml [-S] reads1.fastq reads2.fastq [...]

Positional arguments of the dropTag phase contain paths to the read files, obtained with a sequencer. These files can be either in .fastq of .fastq.gz format. The file order depends on the type of used protocol. Below is the instruction on the usage for currently supported protocols.

Protocols
inDrop v1 & v2

Example config file is located at “dropEst/configs/indrop_v1_2.xml“.
Example command:

optag -c ./configs/indrop_v1_2.xml barcode_reads.fastq gene_reads.fastq
inDrop v3

If a file with library tags provided, option “-t” is required.

Example config file is located at “dropEst/configs/indrop_v1_2.xml“.
Example command:

optag -c dropEst/configs/indrop_v3.xml [-S] [-t library_tag] barcode1_reads.fastq barcode2_reads.fastq gene_reads.fastq [library_tags.fastq]
10x

Example config file is located at “dropEst/configs/10x.xml“.
Example command:

optag -c dropEst/configs/10x.xml [-S] lib_tag_reads.fastq barcode_reads.fastq gene_reads.fastq

While dropTag provides way to demultiplex 10x data, Cell Ranger is still recommended tool for this. dropEst phase can be ran on the Cell Ranger demultiplexed .bam file to obtain data in the format, optimized for the subsequent analysis.

NOTE. Sometimes 10x CellRanger isn't able to determine gene, from which a read originated. In this cases it fills gene info with a list of possible genes, separated by semicolon (e.g. 'ENSG00000255508;ENSG00000254772'). These genes must be filtered out prior to further analysis:

er <- readRDS('./cell.counts.rds')
- holder$cm[grep("^[^;]+$", rownames(holder$cm)),]
Command line arguments for dropTag

Please, use ./droptag -h for additional help.

Alignment

dropTag writes the tagged reads into multiple files. All these files must be aligned to reference, and all bam files with the alignments must be provided as input for the dropEst stage. In the paper we used TopHat2 aligner, however any RNA-seq aligners (i.e. Kallisto or STAR) can be used.

Alignment with TopHat
  1. Install bowtie index for the sequenced organism.
  2. Download corresponding gene annotation in the gtf format.
  3. Run TopHat2 for each file:
    at2 -p number_of_threads --no-coverage-search -g 1 -G genes.gtf -o output_dir Bowtie2Index/genome reads.fastq.gz
    
  4. The result needed for the count estimation is ./output_dir/accepted_hits.bam.
Alignment with Kallisto

Prior to v0.42.4, Kallisto didn't use BAM flags to mark primary/secondary alignments. Only versions $\geq$ 0.42.4 are supported.

  1. Download transcript sequences in .fasta format (i.e. Ensembl genes).
  2. Build Kallisto index: kallisto index -i genes.fa.gz.
  3. Run kallisto quant --pseudobam --single -i genes.index -o out -l mean_length -s std_length reads.fastq.gz. Here, mean_length is mean read length and std_length is standard deviataion of read length. You should specify values, according to the experiment design.
dropEst
dropest: estimate molecular counts per cell

This phase requires aligned .bam files as input and uses them to estimate count matrix. These files must contain information about cell barcode and UMI for each read (reads, which don't contain such information are ignored). Two possible ways to encode this information are acceptable:

  1. Encode it in read names (as a result of dropTag phase).
  2. Use .bam tags (i.e. output of 10x Cell Ranger). Tag names can be configured in config.xml file.

Count matrix estimation also requires information about the source gene for the reads. It can be provided in two ways:

  1. Use gene annotation in either .bad or .gtf format. To provide such file, “-g” option should be used.
  2. Use .bam tags. Tag name can be configured in config.xml file.

Another crucial moment in estimation of count matrix is correction of cell barcode errors. Most protocols provide the list of real barcodes, which simplifies the task of correction. If such file is available, path to the file should be specified in the config.xml file (Estimation/Merge/barcodes_file). This can dramatically increase quality of the result. Lists for inDrop protocols can be found at dropEst/data/barcodes/. Two algorithms of barcode correction are available:

  1. Simple, “-m” option. This algorithm is recommended in the case, where barcodes list is supplied.
  2. Precise, “-M” option. Doesn't requires list of real barcodes to obtain high-quality results, however has significantly lower performance.

Example command:

est [options] [-m] -g ./hg38/genes.gtf -c ./config.xml ./alignment.*/accepted_hits.bam
Usage of tagged bam files (e.g. 10x, Drop-seq) as input

Some protocols provide pipelines, which create .bam files with information about CB, UMI and gene. To use these files as input, specify “-f” option. Example:

est [options] -f [-g ./genes.gtf] -c ./config.xml ./pipeline_res_*.bam

If “-g” option is provided, genes are parsed from the gtf, and information about genes from the bam file is ignored.

To specify corresponding .bam tag names, use “Estimation/BamTags” section in the config (see configs/config_desc.xml).

Usage of pseudoaligners

Pseudoaligners, such as Kallisto store gene / transcript names in the field of chromosome name. To parse such files, use “-P” option. Example:

pest [options] -P -c ./config.xml ./kallisto_res_*.bam
Count intronic / exonic reads only

One feature of the pipeline is the ability to count only UMIs, which reads touch only specific parts of a genome. Option “-L” allows to specify all acceptable types of regions:

Thus, to count all UMIs with exonic or not annotated reads, use “-L eE”. Default value: “-L eEBA”.

Example commands:

The pipeline can determine genome regions either using .gtf annotation file or using .bam tags, i.e. for CellRanger output (see Estimation/BamTags/Type in configs/config_desc.xml). If .gtf file isn't provided and .bam file doesn't containt annotation tags, all reads with not empty gene tag are considered as exonic.

Velocyto integration

For some purposes (i.e. velocyto) it can be useful to look separately at the fraction of intronic and exonic UMIs. Option “-V” allows to output three separate count matrices, each of which contains only UMIs of a specific type: intronic, exonic or exon/intron spanning. These matrices are stored in the separate file “cell.counts.matrices.rds”.

Command line arguments for dropEst
Output

Result of this phase is cell.counts.rds file with the next fields:

To additionaly print the file with count matrix in MatrixMarket format use “-w” option.

dropReport

To run the report you have to install:

To run report use:

opReport.Rsc cell.counts.rds

To see full list of options use:

opReport.Rsc -h
Troubleshooting

If you get the error “pandoc version 1.12.3 or higher is required and was not found”, try to set path in the corresponding environment variable: export RSTUDIO_PANDOC=/path/to/pandoc.

dropEstR package

This package implements UMI errors corrections and low-quality cells filtration, which are not performed during dropEst phase.

To install the package, use:

ools::install_github('hms-dbmi/dropEst/dropestr', dependencies = T)

Package content:

Additional notes

Description of the fields for the config file is provided in dropEst/configs/config_desc.xml.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.