hurwitzlab/prokka

Name: prokka

Owner: Hurwitz Lab

Description: Fork of Prokka (rapid prokaryotic genome annotation) to add --notbl2asn

Forked from: tseemann/prokka

Created: 2017-08-02 13:18:05.0

Updated: 2017-08-02 14:38:04.0

Pushed: 2017-08-02 14:31:11.0

Homepage:

Size: 239106

Language: Perl

README

Prokka: rapid prokaryotic genome annotation

Torsten Seemann (torsten.seemann@gmail.com) (@torstenseemann)

Introduction

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

Installation

Before the main install can begin you need to install some system packages:

Centos/Fedora/RHEL (RPM)

sudo yum install perl-Time-Piece perl-XML-Simple perl-Digest-MD5 git java perl-CPAN perl-Module-Build
sudo cpan -i Bio::Perl  # if you don't have Bioperl installed (it will be tedious)

Ubuntu/Debian/Mint (APT)

sudo apt-get install libdatetime-perl libxml-simple-perl libdigest-md5-perl git default-jre bioperl

Mac OS X

sudo cpan Time::Piece XML::Simple Digest::MD5 Bio::Perl

There are currently 3 ways to install the main Prokka software: Github, Tarball or Homebrew.

Github

Choose somewhere to put it, for example in your home directory (no root access required):

% cd $HOME

Clone the latest version of the repository:

% git clone https://github.com/tseemann/prokka.git
% ls prokka

Index the sequence databases

% prokka/bin/prokka --setupdb
Homebrew

Homebrew is a package manager which allows users to easily install complex software in their home directory. Instructions for installing it are available for Linux and Mac OS X.

Ensure you have brew installed:

% brew

Make sure you have the homebrew-science tap/channel enabled:

% brew tap homebrew/science
% brew update

Install Prokka and all its dependencies:

% brew install prokka --HEAD
Tarball

WARNING: this method gives you a very old version of Prokka. The brew or github methods are preferred!

Download the latest prokka-1.xx.tar.gz archive from http://www.bioinformatics.net.au/software.prokka.shtml

% wget http://www.vicbioinformatics.com/prokka-1.11.tar.gz

Choose somewhere to put it, for example in your home directory (no root access required):

% cd $HOME
% tar zxvf prokka-1.11.tar.gz
% ls prokka-1.11
Install dependencies

Prokka comes with many binaries for Linux and Mac OS X. It will always use your existing installed versions if they exist, but will use the included ones if that fails. For some older systems (eg. Centos 4.x) some of them won't work due to them being dynamically linked against new GLIBC libraries you don't have. You can consult the list of dependencies later in this document.

Choose a rRNA predictor
Option 1 - Don't use one

If Prokka can't find a predictor for rRNA features (either Barrnap or RNAmmer below) then it simply won't annotate any. Most people don't care that much about them anyway.

Option 2 - Barrnap

This was written by the author of Prokka and is recommended if you prefer speed over absolute accuracy. It uses the new multi-core NHMMER for DNA:DNA profile searches. Download it from https://github.com/tseemann/barrnap

Option 3 - RNAmmer

RNAmmer was written when HMMER 2.x was the latest release. Since then, HMMER 3.x has been released, and it uses the same executable binary names. Prokka needs HMMER3 and RNAmmer (and hence HMMER2), so you need to edit your RNAmmer script to explicitly point to your HMMER2 binary instead of using the HMMER3 binary, which is more likely to be first in your PATH.

Type which rnammer to find the script, and then edit it with your favourite editor. Find the following lines at the top:

if ( $uname eq "Linux" ) {
    # $HMMSEARCH_BINARY = "/usr/cbs/bio/bin/linux64/hmmsearch";    # OLD
    $HMMSEARCH_BINARY = "/path/to/my/hmmer-2.3.2/bin/hmmsearch"; # NEW (yours)

If you are using Mac OS X, you'll also have to change the "Linux" to "Darwin" too. As you can see, I have commented out the original part, and replaced it with the location of my HMMER2 hmmsearch tool, so it doesn't run the HMMER3 one. You need to ensure HMMER3 is in your PATH before the old HMMER2 too.
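
Before editing anything, it can help to see which rRNA predictor is actually reachable from your shell. The following is a minimal sketch, not part of Prokka: the function name `find_rrna_predictor` is made up for this example, and the order it checks (Barrnap first) is an assumption chosen to mirror the `--rnammer` option description, not a copy of Prokka's internal logic.

```shell
# Report which rRNA predictor is reachable through PATH.
# Prints "barrnap", "rnammer" or "none".
# (Hypothetical helper for illustration; Prokka does its own detection.)
find_rrna_predictor() {
    if command -v barrnap >/dev/null 2>&1; then
        echo barrnap
    elif command -v rnammer >/dev/null 2>&1; then
        echo rnammer
    else
        echo none
    fi
}
```

Running `find_rrna_predictor` prints one word. To see which `hmmsearch` would be picked up first, `command -v hmmsearch` prints its full path.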

Add to PATH

Add the following line to your $HOME/.bashrc file, or to /etc/profile.d/prokka.sh to make it available to all users:

export PATH=$PATH:$HOME/prokka-1.11/bin
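
If your startup file is sourced more than once, a plain append keeps adding the same directory to PATH. A small sketch of one way to avoid that is shown here; `path_append` is a hypothetical helper name, not something Prokka provides.

```shell
# Append a directory to PATH unless it is already listed.
# (Hypothetical helper for illustration.)
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="$PATH:$1" ;;         # not present, add it at the end
    esac
    export PATH
}

# Example call for the tarball install described above:
#   path_append "$HOME/prokka-1.11/bin"
```

Appending, rather than prepending, keeps the behaviour of the line above: anything already in your PATH still takes priority.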
Index the sequence databases
% prokka --setupdb
Test
Invoking Prokka
Beginner
# Vanilla (but with free toppings)
% prokka contigs.fa

# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
% cat PROKKA_yyyymmdd/*.txt
Moderate
# Choose the names of the output files
% prokka --outdir mydir --prefix mygenome contigs.fa

# Visualize it in Artemis
% art mydir/mygenome.gff
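
Once you pick `--outdir` and `--prefix` yourself, the output file names become predictable, so a script can check that a run produced what it should. The sketch below assumes the extensions listed under Output Files and checks only a subset of them (the tbl2asn-related files are left out, which is a choice made for this example). `missing_outputs` is a hypothetical helper, not a Prokka command.

```shell
# List expected Prokka output files that are absent from an output folder.
# $1 = output folder (--outdir), $2 = filename prefix (--prefix).
# Prints one missing path per line and returns non-zero if any is missing.
# (Hypothetical helper; the extension list is a subset chosen for this sketch.)
missing_outputs() {
    rc=0
    for ext in gff gbk fna faa ffn tsv txt log; do
        if [ ! -e "$1/$2.$ext" ]; then
            echo "$1/$2.$ext"
            rc=1
        fi
    done
    return $rc
}
```

For the run above, `missing_outputs mydir mygenome` prints nothing and succeeds when every checked file is present.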
Expert
# It's not just for bacteria, people
% prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC contigs.fa

# Search for my favourite gene
% exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less
Wizard
# Watch and learn
% prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa

# Check to see if anything went really wrong
% less mydir/EHEC_06072012.err

# Add final details using Sequin
% sequin mydir/EHEC_06072012.sqn
NCBI Genbank submitter
# Register your BioProject (e.g. PRJNA123456) and your locus_tag prefix (e.g. EHEC) first!
% prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa

# Check to see if anything went really wrong
% less PRJNA123456/EHEC-Chr1.err

# Add final details using Sequin
% sequin PRJNA123456/EHEC-Chr1.sqn
European Nucleotide Archive (ENA) submitter
# Register your BioProject (e.g. PRJEB12345) and your locus_tag prefix (e.g. EHEC) first!
% prokka --compliant --centre UoN --outdir PRJEB12345 --locustag EHEC --prefix EHEC-Chr1 contigs.fa

# Check to see if anything went really wrong
% less PRJEB12345/EHEC-Chr1.err

# Install and run Sanger Pathogen group's Prokka GFF3 to EMBL converter
# available from https://github.com/sanger-pathogens/gff3toembl
# Find the closest NCBI taxonomy id (e.g. 562 for Escherichia coli)
% gff3_to_embl -i "Submitter, A." \
    -m "Escherichia coli EHEC annotated using Prokka." \
    -g linear -c PROK -n 11 -f PRJEB12345/EHEC-Chr1.embl \
    "Escherichia coli" 562 PRJEB12345 "Escherichia coli strain EHEC" PRJEB12345/EHEC-Chr1.gff

# Download and run the EMBL validator prior to submitting the EMBL flat file
% curl -L -O ftp://ftp.ebi.ac.uk/pub/databases/ena/lib/embl-client.jar
% java -jar embl-client.jar -r PRJEB12345/EHEC-Chr1.embl

# Compress the file ready to upload to ENA, and calculate MD5 checksum
% gzip PRJEB12345/EHEC-Chr1.embl
% md5sum PRJEB12345/EHEC-Chr1.embl.gz
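
If you repeat the compress-and-checksum step for several files, it can be wrapped in a small function. This is only a sketch: `pack_for_ena` is a hypothetical name, and unlike the plain `gzip` call above it writes a separate `.gz` copy and leaves the original flat file in place.

```shell
# Compress a flat file for upload and print the MD5 of the compressed copy.
# $1 = path to the flat file. The original file is kept.
# (Hypothetical helper for illustration.)
pack_for_ena() {
    gzip -c "$1" > "$1.gz" && md5sum "$1.gz" | cut -d ' ' -f 1
}
```

For example, `pack_for_ena PRJEB12345/EHEC-Chr1.embl` creates `PRJEB12345/EHEC-Chr1.embl.gz` and prints only the checksum.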
Crazy Person
# No stinking Perl script is going to control me
% prokka \
    --outdir $HOME/genomes/Ec_POO247 --force \
    --prefix Ec_POO247 --addgenes --locustag ECPOOp \
    --increment 10 --gffver 2 --centre CDC  --compliant \
    --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
    --kingdom Bacteria --gcode 11 --usegenus \
    --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
    --evalue 1e-9 --rfam \
    plasmid-closed.fna
Output Files

| Extension | Description |
| --------- | ----------- |
| .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
| .gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
| .fna | Nucleotide FASTA file of the input contig sequences. |
| .faa | Protein FASTA file of the translated CDS sequences. |
| .ffn | Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA). |
| .sqn | An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
| .fsa | Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
| .tbl | Feature Table file, used by "tbl2asn" to create the .sqn file. |
| .err | Unacceptable annotations - the NCBI discrepancy report. |
| .log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
| .txt | Statistics relating to the annotated features found. |
| .tsv | Tab-separated file of all features: locus_tag, ftype, gene, EC_number, product. |
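
Because the .tsv file is plain tab-separated text, it is easy to summarise with standard tools. The sketch below counts features per type. It assumes the column order given in the table (ftype in the second column) and that the first row is a header line; check your own file before relying on either assumption. `count_ftypes` is a hypothetical helper name.

```shell
# Count features per type in a Prokka .tsv file.
# Assumes tab-separated columns with ftype in column 2 and one header row.
# Prints "ftype<TAB>count", sorted by ftype.
# (Hypothetical helper for illustration.)
count_ftypes() {
    awk -F '\t' 'NR > 1 { n[$2]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

For example, `count_ftypes mydir/mygenome.tsv` prints one line per feature type found in the file.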

Command line options
General:
  --help            This help
  --version         Print version and exit
  --docs            Show full manual/documentation
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
  --outdir [X]      Output folder [auto] (default '')
  --force           Force overwriting existing output folder (default OFF)
  --prefix [X]      Filename output prefix [auto] (default '')
  --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
  --locustag [X]    Locus tag prefix (default 'PROKKA')
  --increment [N]   Locus tag counter increment (default '1')
  --gffver [N]      GFF version (default '3')
  --compliant       Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]      Sequencing centre ID. (default '')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --gram [X]        Gram: -/neg +/pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    Fasta file of trusted proteins to first annotate from (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
Computation:
  --fast            Fast mode - skip CDS /product searching (default OFF)
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
Option: --rawproduct

Prokka annotates proteins by using sequence similarity to other proteins in its database, or the databases the user provides via --proteins. By default, Prokka tries to "clean" the /product names to ensure they are compliant with Genbank/ENA conventions.

Full details can be found in the cleanup_product() function in the prokka script. If you feel your annotations are being ruined, try using the --rawproduct option, and please file an issue if you find an example of where it is "behaving badly" and I will fix it.

Databases
The Core (BLAST+) Databases

Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchical approach to make it fast. A small, core set of well-characterized proteins is first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.

The initial core databases are derived from UniProtKB; there is one per "kingdom" supported. To qualify for inclusion, a protein must (1) be from Bacteria (or Archaea or Viruses); (2) not be a "Fragment" entry; and (3) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.

Making a Core Database

If you want to modify these core databases, the included script prokka-uniprot_to_fasta_db, along with the official uniprot_sprot.dat, can be used to generate a new database to put in /opt/prokka/db/kingdom/. If you add new ones, the command prokka --listdb will show you whether it has been detected properly.

The Genus Databases

If you enable --usegenus and also provide a Genus via --genus then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka --listdb to see what they are.

Adding a Genus Database

If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called prokka-genbank_to_fasta_db to help. For example, if you had four annotated “Coccus” genomes, you could do the following:

% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
% cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
% rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
% makeblastdb -dbtype prot -in Coccus
% mv Coccus.p* /path/to/prokka/db/genus/
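
The steps above rely on three external programs, so it can save time to confirm they are installed before starting. Here is a small sketch of such a check; `require_tools` is a hypothetical helper and not part of Prokka.

```shell
# Check that every named program is reachable through PATH.
# Prints the missing ones to stderr and returns non-zero if any is absent.
# (Hypothetical helper for illustration.)
require_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            rc=1
        fi
    done
    return $rc
}
```

For the genus database recipe, `require_tools prokka-genbank_to_fasta_db cd-hit makeblastdb` succeeds silently when all three are available.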
The HMM Databases

Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in /opt/prokka/db/hmm. Type prokka --listdb to confirm they are recognised.

FASTA database format

Prokka understands two annotation tag formats, a plain one and a detailed one.

The plain one is a standard FASTA-like line with the ID after the > sign, and the protein /product after the ID (the “description” part of the line):

>SeqID product

The detailed one consists of a special encoded three-part description line. The parts are the /EC_number, the /gene code, then the /product - and they are separated by a special “~~~” sequence:

>SeqID EC_number~~~gene~~~product

Here are some examples. Note that not all parts need to be present, but the “~~~” should still be there:

>YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
>YP_492697.1 ~~~traB~~~transfer complex protein TraB
MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
>YP_492694.1 ~~~~~~transposase
MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*
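
To check that your own header lines follow this layout, you can split them back into their parts. The sketch below is one way to do that with awk. It is not Prokka's own parser, and `parse_header` is a hypothetical name. It treats a description with exactly three "~~~"-separated parts as the detailed format and anything else as the plain format.

```shell
# Split a database header line into ID, EC_number, gene and product.
# Detailed headers use "~~~" between the three description parts;
# plain headers carry only the product after the ID.
# (Hypothetical helper for illustration, not Prokka's parser.)
parse_header() {
    printf '%s\n' "$1" | awk '{
        id = substr($1, 2)                 # drop the leading ">"
        desc = $0
        sub(/^[^ ]+ ?/, "", desc)          # keep everything after the ID
        n = split(desc, part, "~~~")
        if (n == 3) { ec = part[1]; gene = part[2]; product = part[3] }
        else        { ec = ""; gene = ""; product = desc }
        printf "id=%s\nEC_number=%s\ngene=%s\nproduct=%s\n", id, ec, gene, product
    }'
}
```

For example, `parse_header '>YP_492697.1 ~~~traB~~~transfer complex protein TraB'` reports an empty EC_number, the gene traB and the product on separate lines.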

The same description lines apply to HMM models, except the “NAME” and “DESC” fields are used:

NAME  PRK00001
ACC   PRK00001
DESC  2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase
LENG  284
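
To review what annotation each model in an HMM file would contribute, the NAME and DESC fields can be pulled out with awk. This is a minimal sketch that assumes each model has its NAME line before its DESC line, as in the excerpt above; `hmm_descriptions` is a hypothetical helper name.

```shell
# Print "NAME<TAB>DESC" for each model in an HMM file.
# Assumes NAME appears before DESC within each model.
# (Hypothetical helper for illustration.)
hmm_descriptions() {
    awk '$1 == "NAME" { name = $2 }
         $1 == "DESC" { desc = $0; sub(/^DESC[ \t]+/, "", desc); print name "\t" desc }' "$1"
}
```

Models without a DESC line produce no output, which makes them easy to spot by comparing against the list of NAME entries.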
FAQ
Bugs
Changes
Citation

Seemann T.
Prokka: rapid prokaryotic genome annotation
Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063

Dependencies
Mandatory
Recommended
Optional

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.