Name: prohlatype
Owner: Hammer Lab
Description: Probabilistic HLA typing
Created: 2016-04-29 06:22:05.0
Updated: 2018-01-13 00:11:16.0
Pushed: 2018-01-15 06:35:40.0
Size: 22246
Language: OCaml
Paper: Prohlatype: A Probabilistic Framework for HLA Typing
This project provides a set of tools to calculate the full posterior distribution of HLA types given read data.
Instead of:
| A1 | A2 | B1 | B2 | C1 | C2 | Reads | Objective |
|---|---|---|---|---|---|---|---|
| A*31:01 | A*02:01 | B*45:01 | B*15:03 | C*16:01 | C*02:10 | 538.0 | 513.79 |
one can calculate:
| Allele 1 | Allele 2 | Log P | P |
|---|---|---|---|
| A02:05:01:01 | A30:114 | -23046.81 | 0.5000 |
| A02:05:01:01 | A30:01:01 | -23046.81 | 0.5000 |
| A02:05:01:01 | A30:106 | -23103.15 | 0.0000 |
| A02:05:01:02 | A30:114 | -23146.35 | 0.0000 |
| … | | | |
| B07:36 | B57:03:01:02 | -13717.33 | 0.5000 |
| B07:36 | B57:03:01:01 | -13717.33 | 0.5000 |
| B07:36 | B57:03:03 | -13804.74 | 0.0000 |
| B27:157 | B57:03:01:02 | -13816.17 | 0.0000 |
| … | | | |
| C06:103 | C18:10 | -11936.35 | 0.3338 |
| C06:103 | C18:02 | -11936.36 | 0.3331 |
| C06:103 | C18:01 | -11936.36 | 0.3331 |
| C15:102 | C18:02 | -11951.72 | 0.0000 |
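To illustrate how the Log P and P columns could relate, here is a minimal Python sketch. It assumes that P is the exponentiated Log P normalized over the candidate pairs of a locus. The README does not state this, so treat it as an assumption. The function name `normalize_log_likelihoods` is hypothetical, and only the four A-locus rows listed above are used (the elided rows are ignored).

```python
import math

def normalize_log_likelihoods(log_ps):
    """Turn log-likelihoods into probabilities that sum to 1 (log-sum-exp trick)."""
    m = max(log_ps)
    # Subtract the maximum first: math.exp(-23046.81) on its own underflows to 0.0.
    weights = [math.exp(lp - m) for lp in log_ps]
    total = sum(weights)
    return [w / total for w in weights]

# Log P values of the four A-locus allele pairs shown in the table.
a_locus_log_p = [-23046.81, -23046.81, -23103.15, -23146.35]
a_locus_p = normalize_log_likelihoods(a_locus_log_p)
print([f"{p:.4f}" for p in a_locus_p])
# prints ['0.5000', '0.5000', '0.0000', '0.0000']
```

Under this assumption the two top pairs split the probability evenly, which agrees with the P column for the A locus.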
If you are running on Linux, standalone binaries are available with each release.
Use the linked Docker image.
Build the software from source:
a. Install opam.
b. Make sure that the opam packages are up to date:
$ opam update
c. Make sure that you're on the relevant compiler:
$ opam switch 4.05.0
$ eval `opam config env`
d. Get source:
$ git clone https://github.com/hammerlab/prohlatype.git prohlatype
$ cd prohlatype
e. Install the dependent packages:
$ make setup
f. Build the programs (afterwards they'll be in _build/default/src/apps):
$ make
Get the IMGT/HLA database:
$ git clone https://github.com/ANHIG/IMGTHLA.git imgthla
Create an imputed HLA reference sequence via align2fasta.
This step makes sure that all alleles have sequence information that spans
the entire locus. This way, reads that originate from a region for which
we normally do not have sequence information will still align (in the
next filtering step), albeit poorly:
$ align2fasta path-to-imgthla/alignments -o imputed_hla_class_I.fasta
This step needs to be performed only once per IMGT version.
Run $ align2fasta --help for further information.
Filter your data against the reference by first aligning it. For example:
$ bwa mem imputed_hla_class_I.fasta ${SAMPLE}.fastq | \
samtools view -F 4 -bT imputed_hla_class_I.fasta -o ${SAMPLE}.bam
Although the algorithms here are fundamentally alignment-based, they are too slow to run on all sequences. Sequences that do not originate from the HLA region would just act as background noise.
Then convert the aligned reads back to FASTQ:
$ samtools fastq ${SAMPLE}.bam > ${SAMPLE}_filtered.fastq
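The filtering relies on the -F 4 option in the samtools view command above, which excludes every read whose SAM flag has bit 0x4 ("segment unmapped") set. The short Python sketch below mimics that test. The function name `passes_filter` and the example flag values are illustrative only and are not taken from real data.

```python
UNMAPPED = 0x4  # SAM flag bit meaning "segment unmapped"

def passes_filter(flag, exclude=UNMAPPED):
    """Mimic `samtools view -F <exclude>`: keep a read only if no excluded bit is set."""
    return (flag & exclude) == 0

# Illustrative flag values: 0 (mapped, forward), 16 (mapped, reverse strand),
# 4 (unmapped), 77 (paired, unmapped, mate unmapped, first in pair).
for flag in (0, 16, 4, 77):
    print(flag, "kept" if passes_filter(flag) else "dropped")
# prints:
# 0 kept
# 16 kept
# 4 dropped
# 77 dropped
```

Only the reads that pass this test, that is, reads aligning somewhere on the imputed HLA reference, end up in the filtered FASTQ.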
Infer types (see $ multi_par --help for further details):
$ multi_par path-to-imgthla/alignments ${SAMPLE}_filtered.fastq -o ${SAMPLE}_output.tsv
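When several samples have to go through the filter, convert and infer steps, it can help to generate the commands rather than type them. The following is a rough Python sketch that only assembles the three command lines shown above and does not run them. The helper name `build_commands` and the sample name `sample1` are hypothetical, and the file naming simply follows the ${SAMPLE} pattern used in this README.

```python
import shlex

def build_commands(sample, alignments_dir, reference="imputed_hla_class_I.fasta"):
    """Return the shell commands for one sample, in the order they should run."""
    ref = shlex.quote(reference)
    fastq = shlex.quote(f"{sample}.fastq")
    bam = shlex.quote(f"{sample}.bam")
    filtered = shlex.quote(f"{sample}_filtered.fastq")
    output = shlex.quote(f"{sample}_output.tsv")
    return [
        # Align against the imputed reference and keep only the mapped reads.
        f"bwa mem {ref} {fastq} | samtools view -F 4 -bT {ref} -o {bam}",
        # Convert the retained reads back to FASTQ.
        f"samtools fastq {bam} > {filtered}",
        # Infer HLA types from the filtered reads.
        f"multi_par {shlex.quote(alignments_dir)} {filtered} -o {output}",
    ]

# Hypothetical sample name; print the commands for inspection (a dry run).
for command in build_commands("sample1", "path-to-imgthla/alignments"):
    print(command)
```

Printing the commands first makes it easy to check paths before handing them to a shell or a job scheduler.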