FredHutch/paladin

Name: paladin

Owner: Fred Hutchinson Cancer Research Center

Description: Protein Alignment and Detection Interface

Forked from: twestbrookunh/paladin

Created: 2017-10-30 17:59:04.0

Updated: 2017-10-30 17:59:06.0

Pushed: 2017-11-07 17:22:26.0

Size: 17694

Language: C

README

PALADIN

Protein ALignment And Detection INterface

PALADIN is a protein sequence alignment tool designed for the accurate functional characterization of metagenomes.

PALADIN is based on BWA and aligns sequences by read mapping with the Burrows-Wheeler transform (BWT). Unlike BWA, however, PALADIN aligns in protein space. During the index phase, it processes the reference genome's nucleotide sequences and a GTF/GFF annotation containing CDS entries, first converting these transcripts into the corresponding protein sequences, then creating the BWT and suffix array from these proteins. The translation step is skipped when a protein reference file (e.g., UniProt) is provided for mapping. During the alignment phase, it attempts to find ORFs in the read sequences, converts these to protein sequences, and aligns them to the reference protein sequences.
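To make the alignment-phase idea concrete, here is a minimal Python sketch of finding candidate ORFs in a read and translating them to protein. It is an illustration only, not PALADIN's implementation: the function names, the standard genetic code, the six-frame scan, and the "stretch between stop codons" definition of an ORF are all assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard table; PALADIN's own ORF detection and translation may differ).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA to protein; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def candidate_orfs(read, min_len=10):
    """Collect stop-free peptide stretches from all six reading frames.

    Hypothetical helper: an 'ORF' here is any run of at least min_len
    residues between stop codons.
    """
    read = read.upper()
    orfs = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            protein = translate(strand[frame:])
            orfs.extend(p for p in protein.split("*") if len(p) >= min_len)
    return orfs
```

Each peptide returned by `candidate_orfs` is the kind of sequence that would then be aligned against the protein reference.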

PALADIN currently only supports single-end reads (or reads merged with FLASH, PEAR, abyss-mergepairs), and BWA-MEM based alignment. It makes use of many BWA parameters and is therefore compatible with many of its command line arguments.

PALADIN may output a standard SAM file, or a text file containing a UniProt-generated functional profile. This text file may be used for all downstream characterizations.

INSTALLATION

Dependencies

git clone https://github.com/twestbrookunh/paladin.git
cd paladin/
make
PATH=$PATH:$(pwd)

Docker

Alternatively, you can use Paladin with the Docker image hosted at https://quay.io/repository/fhcrc-microbiome/paladin. This image can be downloaded with the command docker pull quay.io/fhcrc-microbiome/paladin. Tags are used to pin releases, e.g. v1.4.0--1 is the image pinned to the v1.4.0 version of Paladin.

SAMPLE COMMANDS

Download and prepare UniProt Swiss-Prot index files.

paladin prepare -r1

Download and prepare UniProt UniRef90 index files.

paladin prepare -r2

Index a UniProt (or another protein) FASTA file, if not using the automated prepare command.

paladin index -r3 uniprot_sprot.fasta.gz

Align a set of reads using 4 threads. Send the full UniProt report to paladin_uniprot.tsv.

paladin align -t 4 -o paladin index input.fastq.gz

Align a set of reads using 4 threads. Produce a BAM file.

paladin align -t 4 index input.fastq.gz | samtools view -Sb - > test.bam

Align a set of reads, preferring higher quality mappings over number of proteins detected.

paladin align -T 20 -o paladin index input.fastq.gz

Align a set of reads, report secondary alignments, and generate UniProt report for both primary and secondary alignments.

paladin align -a -o paladin index input.fastq.gz
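When secondary alignments are reported, it can help to check how many records of each kind ended up in the SAM output. The following is a rough Python sketch that tallies records by the standard SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name is hypothetical, and the sketch assumes plain-text SAM lines rather than BAM.

```python
def tally_sam_records(lines):
    """Count unmapped, primary, secondary and supplementary SAM records.

    `lines` is any iterable of SAM text lines (for example an open file).
    Header lines starting with '@' and blank lines are skipped.
    """
    counts = {"unmapped": 0, "primary": 0, "secondary": 0, "supplementary": 0}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x4:
            counts["unmapped"] += 1
        elif flag & 0x100:
            counts["secondary"] += 1
        elif flag & 0x800:
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
    return counts
```

For instance, `tally_sam_records(open("out.sam"))` would summarize a SAM file saved from the align command (the file name here is a placeholder).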

If you're interested in trying this out on a small test file, try downloading this one, which is from a human lung metagenome study: http://www.ebi.ac.uk/ena/data/view/PRJNA71831

#install PALADIN as per above

curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR117/002/SRR1177122/SRR1177122.fastq.gz
paladin prepare -r1 #unless already done
paladin align -t 4 -o lungstudy uniprot_sprot.fasta.gz SRR1177122.fastq.gz

#look at report file, SAM, etc.
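If several read files need to be processed the same way, the align invocation can be assembled programmatically. The Python sketch below only builds the argument list, using the -t and -o options shown in the sample commands. The helper names and the choice to derive the output prefix from the read file name are assumptions for illustration.

```python
from pathlib import Path


def sample_prefix(reads):
    """Derive an output prefix by stripping common FASTQ extensions."""
    name = Path(reads).name
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


def build_align_command(index, reads, threads=4):
    """Return the argv list for one `paladin align` run (hypothetical helper)."""
    return [
        "paladin", "align",
        "-t", str(threads),
        "-o", sample_prefix(reads),
        index, reads,
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once `paladin` is on the PATH; building the list separately makes it easy to print and inspect the commands before running them.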

Wrapper script

This repo also contains a wrapper script (run.py) which is intended to make it easier to deploy Paladin on cloud or HPC computing resources (e.g. Slurm or AWS). The script is located in the PATH in the Docker image, and so you can run run.py -h to see the set of options for this script. In brief, users can specify the input URL, reference database path, and output folder location (any of which may be local paths, S3 buckets, or FTP). The run script will fetch the input data, run Paladin, wrap up the results into a single JSON output file, and copy the results to the specified output folder.

OUTPUT
  1. A SAM/BAM file that can be used for any downstream analyses.
  2. A tab delimited UniProt report file.
FORMAT

Count   Abundance Quality (Avg) Quality (Max) UniProtKB ID  Organism    Protein Names   Genes   Pathway Features    Gene Ontology   Reviewed Existence   Comments  Cross Reference (KEGG)  Cross Reference (GeneID)  Cross Reference (PATRIC)  Cross Reference (EnsemblBacteria)
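Because the report is tab-delimited with a header row, it can be loaded with standard tooling. Here is a minimal Python sketch that reads the report and ranks entries by the Abundance column. It assumes the first line of the file is the header shown above and that Abundance values are numeric; the function names are hypothetical.

```python
import csv


def load_report(path):
    """Read a tab-delimited report into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def top_by_abundance(rows, n=10, column="Abundance"):
    """Return the n rows with the highest value in `column`.

    Assumption: the column holds values that parse as floats.
    """
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)[:n]
```

A typical use would be `top_by_abundance(load_report("paladin_uniprot.tsv"), n=20)`, with the file name taken from the sample align command.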

[PALADIN Wiki]

Join the chat at https://gitter.im/twestbrookunh/paladin


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.