Name: dropEst
Owner: Harvard Medical School - Department of Biomedical Informatics
Description: Pipeline for initial analysis of droplet-based single-cell RNA-seq data
Created: 2017-07-17 20:28:48.0
Updated: 2017-12-08 02:14:28.0
Pushed: 2018-01-15 11:13:22.0
Homepage: null
Size: 40650
Language: C++
GitHub Committers
User | Most Recent Commit | # Commits |
---|
Other Committers
User | Most Recent Commit | # Commits |
---|
Pipeline for estimating molecular count matrices for droplet-based single-cell RNA-seq measurements. Implements methods, described in this paper.
libbamtools-dev
cmake -D BAMTOOLS_ROOT=/home/username/bamtools .
)Install R packages:
all.packages(c("Rcpp","RcppEigen", "RInside", "Matrix"))
Clone this repository:
clone https://github.com/hms-dbmi/dropEst.git
Build:
ropEst
e . && make
If cmake
can't find one of the libraries, or you want to use some specific versions, which are currently not in the default path, use corresponding cmake variables:
cat(R.home())
in R.These variables should be set to the path to the installed library. It can be done either by using command line options: cmake -D R_ROOT="path_to_r"
or by adding the variable declaration to the beginning of CMakeLists.txt: set(R_ROOT path_to_r)
.
In case you have some issues with the linker for specific library, please try to build this library manually with the version of compiler, which you're going to use for dropEst build.
droptag -- generate tagged fastq files for alignment
Example command:
optag -c config.xml [-S] reads1.fastq reads2.fastq [...]
Positional arguments of the dropTag phase contain paths to the read files, obtained with a sequencer. These files can be either in .fastq of .fastq.gz format. The file order depends on the type of used protocol. Below is the instruction on the usage for currently supported protocols.
Example config file is located at “dropEst/configs/indrop_v1_2.xml“.
Example command:
optag -c ./configs/indrop_v1_2.xml barcode_reads.fastq gene_reads.fastq
If a file with library tags provided, option “-t” is required.
Example config file is located at “dropEst/configs/indrop_v1_2.xml“.
Example command:
optag -c dropEst/configs/indrop_v3.xml [-S] [-t library_tag] barcode1_reads.fastq barcode2_reads.fastq gene_reads.fastq [library_tags.fastq]
Example config file is located at “dropEst/configs/10x.xml“.
Example command:
optag -c dropEst/configs/10x.xml [-S] lib_tag_reads.fastq barcode_reads.fastq gene_reads.fastq
While dropTag provides way to demultiplex 10x data, Cell Ranger is still recommended tool for this. dropEst phase can be ran on the Cell Ranger demultiplexed .bam file to obtain data in the format, optimized for the subsequent analysis.
NOTE. Sometimes 10x CellRanger isn't able to determine gene, from which a read originated. In this cases it fills gene info with a list of possible genes, separated by semicolon (e.g. 'ENSG00000255508;ENSG00000254772'). These genes must be filtered out prior to further analysis:
er <- readRDS('./cell.counts.rds')
- holder$cm[grep("^[^;]+$", rownames(holder$cm)),]
Please, use ./droptag -h
for additional help.
dropTag writes the tagged reads into multiple files. All these files must be aligned to reference, and all bam files with the alignments must be provided as input for the dropEst stage. In the paper we used TopHat2 aligner, however any RNA-seq aligners (i.e. Kallisto or STAR) can be used.
at2 -p number_of_threads --no-coverage-search -g 1 -G genes.gtf -o output_dir Bowtie2Index/genome reads.fastq.gz
Prior to v0.42.4, Kallisto didn't use BAM flags to mark primary/secondary alignments. Only versions $\geq$ 0.42.4 are supported.
kallisto index -i genes.fa.gz
.kallisto quant --pseudobam --single -i genes.index -o out -l mean_length -s std_length reads.fastq.gz
. Here, mean_length is mean read length and std_length is standard deviataion of read length. You should specify values, according to the experiment design.dropest: estimate molecular counts per cell
This phase requires aligned .bam files as input and uses them to estimate count matrix. These files must contain information about cell barcode and UMI for each read (reads, which don't contain such information are ignored). Two possible ways to encode this information are acceptable:
Count matrix estimation also requires information about the source gene for the reads. It can be provided in two ways:
Another crucial moment in estimation of count matrix is correction of cell barcode errors. Most protocols provide the list of real barcodes, which simplifies the task of correction. If such file is available, path to the file should be specified in the config.xml file (Estimation/Merge/barcodes_file). This can dramatically increase quality of the result. Lists for inDrop protocols can be found at dropEst/data/barcodes/. Two algorithms of barcode correction are available:
Example command:
est [options] [-m] -g ./hg38/genes.gtf -c ./config.xml ./alignment.*/accepted_hits.bam
Some protocols provide pipelines, which create .bam files with information about CB, UMI and gene. To use these files as input, specify “-f” option. Example:
est [options] -f [-g ./genes.gtf] -c ./config.xml ./pipeline_res_*.bam
If “-g” option is provided, genes are parsed from the gtf, and information about genes from the bam file is ignored.
To specify corresponding .bam tag names, use “Estimation/BamTags” section in the config (see configs/config_desc.xml).
Pseudoaligners, such as Kallisto store gene / transcript names in the field of chromosome name. To parse such files, use “-P” option. Example:
pest [options] -P -c ./config.xml ./kallisto_res_*.bam
One feature of the pipeline is the ability to count only UMIs, which reads touch only specific parts of a genome. Option “-L” allows to specify all acceptable types of regions:
Thus, to count all UMIs with exonic or not annotated reads, use “-L eE”. Default value: “-L eEBA”.
Example commands:
est [-f] [-g ./genes.gtf] -L i -c ./config.xml ./alignment_*.bam
est [-f] [-g ./genes.gtf] -L i -c ./config.xml ./alignment_*.bam
est [-f] [-g ./genes.gtf] -L BA -c ./config.xml ./alignment_*.bam
The pipeline can determine genome regions either using .gtf annotation file or using .bam tags, i.e. for CellRanger output (see Estimation/BamTags/Type in configs/config_desc.xml). If .gtf file isn't provided and .bam file doesn't containt annotation tags, all reads with not empty gene tag are considered as exonic.
For some purposes (i.e. velocyto) it can be useful to look separately at the fraction of intronic and exonic UMIs. Option “-V” allows to output three separate count matrices, each of which contains only UMIs of a specific type: intronic, exonic or exon/intron spanning. These matrices are stored in the separate file “cell.counts.matrices.rds”.
Result of this phase is cell.counts.rds file with the next fields:
To additionaly print the file with count matrix in MatrixMarket format use “-w” option.
To run the report you have to install:
dependencies = T
).all.packages(c("optparse","rmarkdown"))
To run report use:
opReport.Rsc cell.counts.rds
To see full list of options use:
opReport.Rsc -h
If you get the error “pandoc version 1.12.3 or higher is required and was not found”, try to set path in the corresponding environment variable: export RSTUDIO_PANDOC=/path/to/pandoc
.
This package implements UMI errors corrections and low-quality cells filtration, which are not performed during dropEst phase.
To install the package, use:
ools::install_github('hms-dbmi/dropEst/dropestr', dependencies = T)
Package content:
Description of the fields for the config file is provided in dropEst/configs/config_desc.xml.