biocore/microprot

Name: microprot

Owner: biocore

Description: structural annotation pipeline for microbial genomes and metagenomes

Created: 2016-04-26 17:31:52.0

Updated: 2017-10-25 04:45:30.0

Pushed: 2017-12-18 00:19:13.0

Homepage:

Size: 2314

Language: Python

README

microprot

microprot is written in Python 3.x.

Introduction

microprot clusters and annotates protein sequences from microbial genomes and metagenomes, with the ultimate goal of predicting the 3-dimensional structure and function of the proteins they encode.

Install
Requirements

Some of the tools and databases microprot uses were developed externally and cannot be installed automatically. Please download and install them yourself, then update the corresponding paths in paths.yml.

dbs
tools
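A quick check that every configured path exists can catch setup mistakes early. The sketch below assumes paths.yml has already been loaded into a dictionary with `dbs` and `tools` sections, each mapping a name to a filesystem path. That layout, the function name `find_missing_paths`, and the entry names in the example are assumptions for illustration, not taken from microprot.

```python
import os


def find_missing_paths(config):
    """Return 'section.name' labels whose configured path does not exist.

    `config` is assumed to look like {"dbs": {name: path}, "tools": {name: path}}.
    """
    missing = []
    for section in ("dbs", "tools"):
        # A missing or empty section is treated as having no entries.
        for name, path in (config.get(section) or {}).items():
            if not path or not os.path.exists(os.path.expanduser(str(path))):
                missing.append(f"{section}.{name}")
    return missing


# Hypothetical entries; replace with the names and paths from your paths.yml.
example_config = {
    "dbs": {"example_db": "/path/to/example_db"},
    "tools": {"example_tool": "/path/to/example_tool"},
}
for label in find_missing_paths(example_config):
    print("path not found:", label)
```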

Tools requiring manual installation are listed and linked below:

Naming conventions
Filenames

All filenames are in the form: GenomeID_GeneID_ResiduesFrom-ResiduesTo and contain amino acid sequences.
For example, CP003179.1_3319 means gene 3319 from genome CP003179.1 (Sulfobacillus acidophilus DSM 10332), and CP003179.1_3319_1-60 means amino acids 1 to 60 of that gene.
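The naming scheme can be parsed with a short regular expression. The following is a minimal sketch, not part of microprot: the names `SequenceName` and `parse_name` are hypothetical, and it assumes the gene ID is numeric and that the name is passed without a file extension.

```python
import re
from typing import NamedTuple, Optional


class SequenceName(NamedTuple):
    genome_id: str
    gene_id: str
    start: Optional[int]  # first residue (1-based), None for a whole gene
    end: Optional[int]    # last residue (inclusive), None for a whole gene


# GenomeID_GeneID with an optional _From-To suffix.
# The genome ID is matched lazily so that it may itself contain underscores.
_NAME = re.compile(
    r"^(?P<genome>.+?)_(?P<gene>\d+)(?:_(?P<start>\d+)-(?P<end>\d+))?$"
)


def parse_name(name):
    """Split a name such as 'CP003179.1_3319_1-60' into its parts."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"not a GenomeID_GeneID[_From-To] name: {name!r}")
    start = int(m.group("start")) if m.group("start") else None
    end = int(m.group("end")) if m.group("end") else None
    return SequenceName(m.group("genome"), m.group("gene"), start, end)


print(parse_name("CP003179.1_3319"))
print(parse_name("CP003179.1_3319_1-60"))
```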

File extensions
Example

Gene CP00000.0_1 (CP00000.0_1.fasta), with 100 residues, is run through HHsearch, which returns two outputs: CP00000.0_1.out and CP00000.0_1.a3m. The sequence split parameters are:

prob: 90.0
fragment_length: 10

and the hit list portion of CP00000.0_1.out is:

[lines of input parameters summary]

No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20   10-30    211-231 (260)
  2 BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55   33-88    28-83  (149)
  3 CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55   43-98    28-83  (149)
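Rows of a hit list like this one can be read by matching the numeric columns from the right-hand end of each line, which avoids having to parse the free-text hit description. The sketch below is illustrative only and is not microprot's parser: the names `Hit` and `parse_hit_list` are hypothetical, and it keeps just the probability, E-value and Query HMM range.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    prob: float
    evalue: float
    query_start: int
    query_end: int


# Anchored at the end of the line: the last columns are
# Prob, E-value, P-value, Score, SS, Cols, Query HMM, Template HMM.
_ROW = re.compile(
    r"(?<!\S)(?P<prob>\d+(?:\.\d+)?)\s+"  # Prob
    r"(?P<evalue>\S+)\s+"                 # E-value
    r"\S+\s+"                             # P-value
    r"\S+\s+"                             # Score
    r"\S+\s+"                             # SS
    r"\d+\s+"                             # Cols
    r"(?P<qs>\d+)-(?P<qe>\d+)\s+"         # Query HMM range
    r"\d+-\d+\s*\(\d+\)\s*$"              # Template HMM range and length
)


def parse_hit_list(lines):
    """Return a Hit for every line that looks like a hit row; skip the rest."""
    hits = []
    for line in lines:
        m = _ROW.search(line)
        if m:
            hits.append(Hit(float(m.group("prob")), float(m.group("evalue")),
                            int(m.group("qs")), int(m.group("qe"))))
    return hits


rows = [
    "ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20   10-30    211-231 (260)",
    "BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55   33-88    28-83  (149)",
    "CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55   43-98    28-83  (149)",
]
for hit in parse_hit_list(rows):
    print(hit)
```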

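According to our criteria, hits 1 and 2 are matches: each has probability >= 90.0 and a fragment length (from the Query HMM column) >= 10. Hit 3 is not a match because its probability (85.3) is below the threshold.
The CP00000.0_1.match file will therefore contain the sequences: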

>CP00000.0_1_10-30
EXAMPLEEXAMPLEEXAMPL
>CP00000.0_1_33-88
EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPL

and CP00000.0_1.non_match will contain the sequence:

>CP00000.0_1_89-100
EXAMPLEEXAMP

Sub-sequences CP00000.0_1_1-9 and CP00000.0_1_31-32 will be dropped from subsequent analyses, as they do not meet the minimum fragment length criterion.
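The splitting rule in this example can be sketched in a few lines. This is an illustration under stated assumptions, not microprot's actual code: the function names `split_query` and `fragment_name` and the `(prob, start, end)` tuple layout are hypothetical. It also assumes that fragment length is `end - start + 1` on the Query HMM range, and that uncovered regions of at least `min_len` residues become non-matches while shorter ones are dropped.

```python
def split_query(query_len, hits, min_prob=90.0, min_len=10):
    """Split a query of `query_len` residues into match, non-match and dropped ranges.

    `hits` is an iterable of (prob, start, end) with 1-based inclusive
    Query HMM coordinates. Returns three lists of (start, end) tuples.
    """
    # Hits that pass both the probability and the fragment-length threshold.
    matches = sorted(
        (start, end)
        for prob, start, end in hits
        if prob >= min_prob and end - start + 1 >= min_len
    )

    non_matches, dropped = [], []

    def classify_gap(start, end):
        # Uncovered regions are kept only if they are long enough.
        target = non_matches if end - start + 1 >= min_len else dropped
        target.append((start, end))

    cursor = 1  # first residue not yet assigned to any range
    for start, end in matches:
        if start > cursor:
            classify_gap(cursor, start - 1)
        cursor = max(cursor, end + 1)
    if cursor <= query_len:
        classify_gap(cursor, query_len)

    return matches, non_matches, dropped


def fragment_name(gene_name, start, end):
    """Build a name following the GenomeID_GeneID_From-To convention."""
    return f"{gene_name}_{start}-{end}"


# The three hits from the example hit list, for a 100-residue query.
example_hits = [(91.5, 10, 30), (90.3, 33, 88), (85.3, 43, 98)]
matches, non_matches, dropped = split_query(100, example_hits,
                                            min_prob=90.0, min_len=10)
print("match:    ", [fragment_name("CP00000.0_1", s, e) for s, e in matches])
print("non_match:", [fragment_name("CP00000.0_1", s, e) for s, e in non_matches])
print("dropped:  ", [fragment_name("CP00000.0_1", s, e) for s, e in dropped])
```

Overlapping matches are kept as they are in this sketch. A real pipeline would need an explicit policy for them.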


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.