Name: wgs2ncbi
Owner: Naturalis Biodiversity Center
Description: Toolkit for preparing genomes for submission to NCBI
Created: 2013-08-06 13:15:44.0
Updated: 2017-10-17 16:16:48.0
Pushed: 2017-10-31 22:13:19.0
Size: 2101
Language: Perl
GitHub Committers
User | Most Recent Commit | # Commits |
---|---|---|
Kim Rutherford | 2015-11-17 01:46:48.0 | 1 |
Rutger Vos | 2018-02-27 14:49:15.0 | 168 |
The process of going from an annotated genome to a valid NCBI submission is somewhat cumbersome. “Boutique” genome projects might produce a scaffolded assembly in FASTA format and predicted genes in GFF3 tabular format (e.g. produced by the “maker” pipeline), but no convenient tools appear to exist to turn these results into the format that NCBI requires. This project remedies this by providing some Perl scripts (with no dependencies) to do the re-formatting. Also included is a shell script that chains the Perl scripts together and runs NCBI's tbl2asn on the result. This shell script is intended as an example and should be edited or copied to provide the right values.
NCBI requires that “whole genome shotgun” (WGS) genomes are submitted as .sqn files (under the new rules of Spring 2013, one file for each scaffold). A sqn file is a file in ASN.1 syntax that contains the sequence, its features, and the metadata about the submission (i.e. the authors, the publication title, the organism, etc.). sqn files are normally produced by the Sequin program, which has a graphical user interface and is therefore not practical for the potentially many thousands of files that comprise an entire genome. It is therefore preferable to use the command-line program tbl2asn, which takes a directory with FASTA files (.fsa), corresponding files with the gene features in tabular format (.tbl), and a submission template (template.sbt) to produce the sqn files. We therefore need to do some data processing to prepare the inputs for tbl2asn.
Since our starting material is one giant FASTA file that contains the scaffolds and a GFF file with the features, we need to do the following:
1. Create the submission template (template.sbt).
2. Explode the GFF3 annotation into one file per scaffold: wgs2ncbi prepare -conf <config.ini>
3. Split the FASTA file into chunks with matching feature tables: wgs2ncbi process -conf <config.ini>
4. Run tbl2asn on the result: wgs2ncbi convert -conf <config.ini>
5. Check the discrepancy report from tbl2asn and make updates to the protein names. If need be, keep re-running step 4 and updating the names file until there are no warnings in the discrepancy report.
6. Build the archive with wgs2ncbi compress -conf <config.ini>, and upload to NCBI. They will perform a contaminants screen. If there are suspicious sequences (e.g. untrimmed adaptors), mask these using the adaptors file.

In other words, the pipeline mostly consists of invocations of the wgs2ncbi script. Each invocation is followed by a verb (prepare, process, convert, compress), followed by a set of arguments that point to a configuration file, which in turn points to other files. By perusing the examples of these configuration files you should get a pretty good idea of how to prepare your own versions of these files. To be able to run the script, you will need to install it locally. One way to do that is as follows:
perl Makefile.PL
sudo make install
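Once installed, the wgs2ncbi steps can be chained from a small wrapper. The following is only a sketch, not the shell script shipped with the project: the function name run_steps and the WGS2NCBI override variable are hypothetical, and it assumes wgs2ncbi exits with a non-zero status when a step fails.

```shell
# Hypothetical override for the executable name; defaults to the installed script.
WGS2NCBI="${WGS2NCBI:-wgs2ncbi}"

# run_steps CONFIG VERB... : run the given wgs2ncbi verbs in order against one
# configuration file, stopping at the first step that fails.
run_steps() {
    conf="$1"
    shift
    for verb in "$@"; do
        echo "running: $verb" >&2
        "$WGS2NCBI" "$verb" -conf "$conf" || return 1
    done
}
```

A typical session would call `run_steps config.ini prepare process convert`, inspect the discrepancy report, and only then call `run_steps config.ini compress`.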
Here now follow more details about each of the steps of the pipeline:
GenBank provides a web form that produces the sbt file. This form needs to be filled out with the correct metadata, i.e. all the authors of the publication, the publication title, the organism, etc. The included template.sbt file contains an example. The form to create such files is here
The genome annotation file (GFF3 format) may have the following issues that may prevent quick lookups of features for a given scaffold:
To remedy this we “explode” the GFF3 file into separate files, one for each scaffold. This allows us to quickly find the annotations for a given scaffold (i.e. random access), and it lets us filter out any features we do not want. This is done using the following command:
wgs2ncbi prepare -conf <config.ini>
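To make the idea of “exploding” concrete, here is a rough sketch of the splitting step in plain shell and awk. It is not how wgs2ncbi itself is implemented; the function name explode_gff3 and the output naming (one SEQID.gff3 file per scaffold) are assumptions for illustration, and no feature filtering is done.

```shell
# explode_gff3 INPUT.gff3 OUTDIR : write one GFF3 file per scaffold, keyed on
# column 1 (the sequence ID). Assumes OUTDIR is empty or does not exist yet,
# because rows are appended to the per-scaffold files.
explode_gff3() {
    in="$1"
    outdir="$2"
    mkdir -p "$outdir"
    awk -F '\t' -v dir="$outdir" '
        /^#/ || NF < 9 { next }   # skip comments, directives and non-feature rows
        {
            out = dir "/" $1 ".gff3"
            print >> out
            close(out)            # keep the number of open file handles low
        }
    ' "$in"
}
```

Closing each file after every write is slow but avoids hitting the open-file limit when the assembly has many thousands of scaffolds.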
Once the annotations are exploded, we then need to take the big FASTA file and chop it up into multiple FASTA files and tbl files, which need to be written into an output folder. The naive behavior is to write each scaffold (and its features) to a separate FASTA file. This, however, may result in very many files. Therefore, you can provide a parameter to indicate that sequences and feature tables are lumped together with up to chunksize sequences per file, where chunksize may not exceed 10000 according to NCBI guidelines. The default for this is 5000.
wgs2ncbi process -conf <config.ini>
A word of caution: this script produces in some cases tens of thousands of files, each of which has a name that matches the first word in the FASTA definition line (so this should be a unique identifier!) and the *.fsa extension. Generally speaking you want to avoid having to look inside the folder that contains these files, because graphical interfaces (like Windows Explorer or the Mac Finder) have a hard time dealing with this. If you use the chunksize parameter (which is the default behaviour) the number of files will be a lot lower, and each will have a name matching combined_xxx-yyy.(fsa|tbl), where xxx and yyy are the start and end rank of the sequences in the file.
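The chunking and naming scheme described above can be illustrated with a short sketch. This is not the wgs2ncbi implementation: the function name chunk_fasta is hypothetical, ranks are assumed to be 1-based and unpadded, and only the .fsa side is shown (the matching .tbl files are left out).

```shell
# chunk_fasta INPUT.fasta OUTDIR [CHUNKSIZE] : split a multi-FASTA file into
# files of at most CHUNKSIZE sequences, named combined_START-END.fsa.
chunk_fasta() {
    in="$1"
    outdir="$2"
    size="${3:-5000}"
    mkdir -p "$outdir"
    total="$(grep -c '^>' "$in")"   # number of sequences, needed for the last END
    awk -v dir="$outdir" -v size="$size" -v total="$total" '
        /^>/ {
            n++
            if ((n - 1) % size == 0) {          # first sequence of a new chunk
                if (out != "") close(out)
                last = n + size - 1
                if (last > total) last = total
                out = dir "/combined_" n "-" last ".fsa"
            }
        }
        out != "" { print > out }
    ' "$in"
}
```

For example, `chunk_fasta scaffolds.fasta out 5000` on a file with 12000 scaffolds would write three files covering ranks 1-5000, 5001-10000 and 10001-12000.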
Once the submission template, the FASTA files, and the feature tables are produced, the tbl2asn program provided by NCBI needs to be run on the folder that contains these files. A typical invocation using the wrapper goes like this:
wgs2ncbi convert -conf <config.ini>
In other words, this command will run the tbl2asn
command for you with the right command line arguments (provided you have installed it on
your system and made sure it can be found). Pay attention to the
output as this is running, and inspect the discrepancy report.
The Discrepancy Report is an evaluation of a single or multiple ASN.1 files, looking for suspicious annotation or annotation discrepancies that NCBI staff has noticed commonly occur in genome submissions, both complete and incomplete (WGS). A few of the problems that this function was written to find include inconsistent locus_tag prefixes, missing protein_id's, missing gene features, and suspect product names. The function is available in specially configured Sequin, as an argument for tbl2asn, or with the command-line program asndisc.
If you have questions about the Discrepancy Report, please contact us by email at genomes@ncbi.nlm.nih.gov prior to sending us your submission. Source: https://www.ncbi.nlm.nih.gov/genbank/asndisc/
The report contains numerous informational messages, warnings, and fatal errors. The latter have to be resolved before your submission is accepted by NCBI. Most (ideally, all) of the fatal errors have to do with bad product names. As you scroll through the discrepancy report, there will be categories of problematic product names (e.g. where a product is called a 'gene', which would be incoherent). Below each category, all the instances of this problem are listed. Each instance will show the problematic description. For these you need to create a mapping from the problematic name to an acceptable one. Note that this conversion step may consequently be an iterative process: if the discrepancy report raises issues about names, you will need to address them and re-run the convert step.
Tip: Note that NCBI does accept .tar.gz archives, which means you can prepare your upload as follows:
wgs2ncbi compress -conf <config.ini>
Once you upload the archive, you will get a verdict from whoever is handling this submission at NCBI. It is possible that there will be stretches of sequence in your submission that NCBI will consider contaminants (based on a pipeline they run). One way to deal with those is to blank them out of the data, using a configuration file that specifies the coordinates of the stretches to replace with N characters. In addition, NCBI might have further issues with certain protein names, so you may have to update the names mapping file. Then rerun the convert step, rebuild the archive, and do another upload.
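As an illustration of what blanking out a stretch means, the sketch below masks one region of a sequence string with N characters. The function name mask_region and the coordinate convention (1-based, inclusive start and end) are assumptions made for this example; check the example configuration files for the convention wgs2ncbi actually expects.

```shell
# mask_region SEQUENCE START END : print SEQUENCE with positions START..END
# (assumed 1-based, inclusive) replaced by N.
mask_region() {
    printf '%s\n' "$1" | awk -v s="$2" -v e="$3" '{
        n = e - s + 1
        mask = sprintf("%" n "s", "")   # n spaces ...
        gsub(/ /, "N", mask)            # ... turned into n N characters
        print substr($0, 1, s - 1) mask substr($0, e + 1)
    }'
}
```

For instance, `mask_region ACGTACGT 3 5` prints `ACNNNCGT`: the sequence length is unchanged, so feature coordinates elsewhere on the scaffold stay valid.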