CD2H gitForager

biocore/microprot

Name: microprot

Owner: biocore

Description: structural annotation pipeline for microbial genomes and metagenomes

Created: 2016-04-26 17:31:52.0

Updated: 2017-10-25 04:45:30.0

Pushed: 2017-12-18 00:19:13.0

Homepage:

Size: 2314

Language: Python


README

microprot

microprot is coded in Python 3.x

Introduction

microprot clusters and annotates microbial metagenome sequences, with the ultimate goal of predicting the three-dimensional structure and function of the encoded proteins.

Install

Requirements

Some of the tools and databases we use were developed externally and cannot be installed automatically. Please download and install them yourself, then update the corresponding paths in paths.yml.

dbs

PDB100 from PISCES PDB culling server (updated weekly)
UniRef90 from EBI (the database is 10GB+; updated monthly)
Uniclust30 (14GB+; updated every 3 months)
PfamA for HH-suite (updated approx. every 6 months)

tools

Tools requiring manual installation are listed and linked below:

Naming conventions

Filenames

All filenames have the form GenomeID_GeneID_ResiduesFrom-ResiduesTo, and the files contain amino acid sequences.
For example, CP003179.1_3319 refers to gene 3319 from genome CP003179.1 (Sulfobacillus acidophilus DSM 10332), and CP003179.1_3319_1-60 refers to amino acids 1 to 60 of that gene.
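
The naming convention above can be unpacked programmatically. The following is a minimal sketch, not part of microprot: `parse_name` and `SeqName` are hypothetical names, and the pattern assumes that gene IDs are purely numeric (as in the examples) and that the residue range is optional.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: GenomeID_GeneID with an optional _From-To suffix.
# The genome ID may itself contain underscores; the gene ID is assumed numeric.
NAME_RE = re.compile(
    r"^(?P<genome>.+)_(?P<gene>\d+)(?:_(?P<start>\d+)-(?P<end>\d+))?$"
)


class SeqName(NamedTuple):
    genome_id: str
    gene_id: str
    start: Optional[int]  # first residue (1-based), None for a whole gene
    end: Optional[int]    # last residue (inclusive), None for a whole gene


def parse_name(name: str) -> SeqName:
    """Split a name such as CP003179.1_3319_1-60 into its parts."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a GenomeID_GeneID[_From-To] name: {name!r}")
    start = int(m["start"]) if m["start"] else None
    end = int(m["end"]) if m["end"] else None
    return SeqName(m["genome"], m["gene"], start, end)


print(parse_name("CP003179.1_3319"))
print(parse_name("CP003179.1_3319_1-60"))
```

Strip any file extension (for example .fasta or .out) before calling the helper, since the pattern expects the bare name.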

File extensions

a3m
An alignment file produced by HH-suite programs. The format is similar to FASTA, but the sequence rows contain additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in the HH-suite user guide (section 6.1).
out
HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters (hit list), as well as a set of pairwise sequence alignments. A detailed description can be found in the HH-suite user guide (section 5).
match
Internal microprot files listing the sub-sequences of the input sequence whose hits in the .out file meet the criteria defined in config.yml for any of E-value, P-value, Prob or minimum sequence length. Multiple hits are possible. The file is written in FASTA format.
non_match
All sub-sequences longer than the minimum sequence length that do not meet the criteria for .match. Internal microprot file.
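
To illustrate how the a3m format described above differs from FASTA, here is a minimal sketch that reduces a3m rows to equal-length aligned rows. It assumes the usual a3m convention from the HH-suite user guide: lowercase letters mark insertions relative to the query, `-` marks deletions, and `.` (when present) pads insert columns. The function name `a3m_to_aligned` is hypothetical and not part of microprot or HH-suite.

```python
from typing import Iterable, Iterator, Tuple


def a3m_to_aligned(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs with insert states removed."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Drop lowercase inserts and '.' padding; keep match states and '-'.
            chunks.append("".join(c for c in line if not c.islower() and c != "."))
    if header is not None:
        yield header, "".join(chunks)


example = [
    ">query",
    "ACDEFG",
    ">hit1",
    "AC-EaFG",
]
for name, seq in a3m_to_aligned(example):
    print(name, seq)
```

After the lowercase insert `a` is removed, both rows have the same length, which is what allows the file to be read as a column-wise alignment.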

Example

Gene CP00000.0_1 (CP00000.0_1.fasta), with 100 residues, is run through HHsearch, which returns two outputs: CP00000.0_1.out and CP00000.0_1.a3m. The sequence split parameters are:

prob: 90.0
fragment_length: 10

and the hit list portion of CP00000.0_1.out is:

[lines of input parameters summary]

 No Hit                            Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20   10-30    211-231 (260)
  2 BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55   33-88    28-83  (149)
  3 CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55   43-98    28-83  (149)

According to our criteria, hits 1 and 2 are matches (probability >= 90.0 and fragment length, taken from the Query HMM column, >= 10); hit 3 is not, because its probability (85.3) is below 90.0.
The CP00000.0_1.match file will therefore contain the sequences:

>CP00000.0_1_10-30
PLEEXAMPLEEXAMPL
>CP00000.0_1_33-88
PLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPL

and CP00000.0_1.non_match will contain the sequence:

>CP00000.0_1_89-100
PLEEXAMP

Sub-sequences CP00000.0_1_1-9 and CP00000.0_1_31-32 will be dropped from subsequent analyses, as they do not meet the minimum fragment length criterion.
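
The splitting logic of this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not microprot's actual implementation: `split_sequence` and `names` are hypothetical helpers, residue ranges are treated as 1-based and inclusive, and both cut-offs are applied as "greater than or equal to". Under these assumptions, the short gap between the two matches comes out as residues 31-32.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # 1-based, inclusive residue range


def split_sequence(
    seq_len: int,
    hits: List[Tuple[float, int, int]],
    min_prob: float,
    min_len: int,
) -> Tuple[List[Range], List[Range], List[Range]]:
    """Return (matches, non_matches, dropped) for one query sequence.

    hits holds (Prob, query_start, query_end) taken from the hit list.
    """
    # Hits that pass both the probability and the fragment length cut-offs.
    matches = sorted(
        (start, end)
        for prob, start, end in hits
        if prob >= min_prob and end - start + 1 >= min_len
    )
    non_matches: List[Range] = []
    dropped: List[Range] = []
    cursor = 1  # first residue not yet covered by a match
    # A sentinel just past the sequence end closes the final gap.
    for start, end in matches + [(seq_len + 1, seq_len + 1)]:
        if start > cursor:
            gap = (cursor, start - 1)
            if gap[1] - gap[0] + 1 >= min_len:
                non_matches.append(gap)
            else:
                dropped.append(gap)
        cursor = max(cursor, end + 1)
    return matches, non_matches, dropped


def names(gene: str, ranges: List[Range]) -> List[str]:
    """Format ranges using the GenomeID_GeneID_From-To convention."""
    return [f"{gene}_{start}-{end}" for start, end in ranges]


# (Prob, query start, query end) for the three hits in the example hit list.
hits = [(91.5, 10, 30), (90.3, 33, 88), (85.3, 43, 98)]
matches, non_matches, dropped = split_sequence(100, hits, min_prob=90.0, min_len=10)
print(names("CP00000.0_1", matches))      # sub-sequences for the .match file
print(names("CP00000.0_1", non_matches))  # sub-sequences for the .non_match file
print(names("CP00000.0_1", dropped))      # too short, excluded from later steps
```

The sketch tolerates overlapping matches by advancing the cursor with `max`, which is a design choice of this illustration; the source does not say how microprot treats overlapping hits.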

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.