Name: genevalidator
Owner: wurmlab
Description: GeneValidator: Identify problems with predicted genes
Created: 2013-05-30 04:48:03.0
Updated: 2017-11-18 22:34:28.0
Pushed: 2017-09-12 15:00:28.0
Homepage: http://wurmlab.github.io/tools/genevalidator/
Size: 135563
Language: Ruby
GitHub Committers
User | Most Recent Commit | # Commits |
---|---|---|
Anurag Priyam | 2017-12-24 11:26:52.0 | 36 |
Yannick Wurm | 2015-07-15 10:46:53.0 | 8 |
Monica Dragan | 2015-04-23 12:53:57.0 | 127 |
Ismail Moghul | 2017-12-23 15:00:54.0 | 633 |
Velik123 | 2014-09-29 10:53:27.0 | 3 |
Other Committers
User | Email | Most Recent Commit | # Commits |
---|---|---|---|
Monica Dragan | moncia.dragan@cti.pub.ro | 2013-09-05 08:04:54.0 | 36 |
Monica Dragan | monica.dragan@cti.pub.ro | 2013-10-03 11:25:46.0 | 64 |
Monica Dragan | monique@monique-pc.(none) | 2014-03-25 15:15:28.0 | 1 |
monique | monique@work.(none) | 2014-09-02 03:16:01.0 | 23 |
GeneValidator helps identify problems with gene predictions and provides useful information extracted from analysing orthologs in BLAST databases. The results can be used by biocurators and researchers who need accurate gene predictions.
If you would like to use GeneValidator on a few sequences, see our online GeneValidator Web App - http://genevalidator.sbcs.qmul.ac.uk.
If you use GeneValidator in your work, please cite us as follows:
GeneValidator runs the following validations on all input sequences:
GeneValidator also runs two further validations on cDNA sequences:
Each analysis of each query returns a binary result (good vs. potential problem) according to a p-value or an empirically determined cutoff. The results for each query are combined into an overall quality score from 0 to 100.
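The exact formula behind the overall score is not given here. As a rough illustration only, the sketch below assumes the score is simply the percentage of validations that returned "good"; the helper name `overall_score` and the `:good` / `:problem` symbols are hypothetical and are not part of GeneValidator's API.

```ruby
# Hypothetical helper: combine binary validation results into a 0-100 score.
# ASSUMPTION: the score is the percentage of validations that passed.
# GeneValidator's real formula may weight or penalise results differently.
def overall_score(results)
  return 0 if results.empty?

  passed = results.count { |result| result == :good }
  (100.0 * passed / results.size).round
end

# Three of the four validations passed for this made-up query.
results = [:good, :good, :problem, :good]
puts overall_score(results)
```

Under this assumption, a query that passes every validation scores 100 and one that fails every validation scores 0.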
Please see here for more help with installing the prerequisites.
GeneValidator requires a protein BLAST database in order to fully analyse all sequences. The BLAST database needs to be set up with the `-parse_seqids` argument as follows:
    makeblastdb -in input_db -dbtype prot -parse_seqids
Simply run the following command in the terminal.
    gem install genevalidator
If that doesn't work, try `sudo gem install genevalidator` instead.
It is also possible to run from source. However, this is not recommended.
    # Clone the repository.
    git clone https://github.com/wurmlab/genevalidator.git

    # Move into the GeneValidator source directory.
    cd genevalidator

    # Install bundler
    gem install bundler

    # Use bundler to install dependencies
    bundle install

    # Optional: run tests, build documentation and build the gem from source
    bundle exec rake

    # Run GeneValidator.
    bundle exec genevalidator -h
    # Note that `bundle exec` executes GeneValidator in the context of the bundle

    # Alternatively, install GeneValidator as a gem
    bundle exec rake install
    genevalidator -h
Verify that GeneValidator is installed by running the following command in the terminal:
    genevalidator
You should see the following output.
    USAGE:
        genevalidator [OPTIONS] Input_File

    ARGUMENTS:
        Input_File: Path to the input fasta file containing the predicted sequences.

    OPTIONAL ARGUMENTS
        -v, --validations <String>        The Validations to be applied.
                                          Validation Options Available (separated by coma):
                                            all   = All validations (default),
                                            lenc  = Length validation by clusterization,
                                            lenr  = Length validation by ranking,
                                            merge = Analyse gene merge,
                                            dup   = Check for duplications,
                                            frame = Open reading frame (ORF) validation,
                                            orf   = Main ORF validation,
                                            align = Validating based on multiple alignment
        -d, --db [BLAST_DATABASE]         Path to the BLAST database
                                          GeneValidator also supports remote databases:
                                          e.g. genevalidator -d "swissprot -remote" Input_File
        -e, --extract_raw_seqs            Produces a fasta file of the raw sequences of all BLAST hits in the
                                          supplied BLAST output file. This fasta file can then be provided to
                                          GeneValidator with the "-r", "--raw_sequences" argument
        -j, --json_file [JSON_FILE]       Generate HTML report from a JSON file (or a subset of a JSON file)
                                          produced by GeneValidator
        -x [BLAST_XML_FILE],              Provide GeneValidator with a pre-computed BLAST XML output
            --blast_xml_file              file (BLAST -outfmt option 5).
        -t [BLAST_TABULAR_FILE],          Provide GeneValidator with a pre-computed BLAST tabular output
            --blast_tabular_file          file. (BLAST -outfmt option 6).
        -o [BLAST_TABULAR_OPTIONS],       Custom format used in BLAST -outfmt argument
            --blast_tabular_options       See BLAST+ manual pages for more details
        -n, --num_threads num_of_threads  Specify the number of processor threads to use when running
                                          BLAST and Mafft within GeneValidator.
        -r, --raw_sequences [raw_seq]     Supply a fasta file of the raw sequences of all BLAST hits present
                                          in the supplied BLAST XML or BLAST tabular file.
        -b, --binaries [binaries]         Path to BLAST and MAFFT bin folders (is added to $PATH variable)
                                          To be provided as follows:
                                          e.g. genevalidator -b /blast/bin/path/ -b /mafft/bin/path/
            --version                     The version of GeneValidator that you are running.
        -h, --help                        Show this screen.
Running GeneValidator without a database argument runs BLAST against the remote NCBI Swiss-Prot BLAST database. As such, this is only suitable for analyses of fewer than 10 sequences.
    genevalidator INPUT_FASTA_FILE
When a database is provided, GeneValidator runs BLAST (using an E-value cutoff of 1e-5) on each query against that BLAST database and then runs the validation analyses.
    genevalidator -d DATABASE_PATH -n NUM_THREADS INPUT_FASTA_FILE
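The choice between the two invocations above can be scripted. The following is a minimal sketch, not part of GeneValidator: the helper names `count_fasta_records` and `genevalidator_args` and the constant `REMOTE_LIMIT` are hypothetical, and the limit of 10 simply mirrors the guidance above. It only uses the `-d` and `-n` options listed in the help output.

```ruby
# ASSUMPTION: the remote Swiss-Prot mode is used only for fewer than 10 sequences.
REMOTE_LIMIT = 10

# Count FASTA records by counting header lines (lines starting with '>').
def count_fasta_records(fasta_text)
  fasta_text.each_line.count { |line| line.start_with?('>') }
end

# Hypothetical helper: build the argument list for a GeneValidator run.
# With a database path, always run against that database; without one,
# fall back to the remote default only for small inputs.
def genevalidator_args(fasta_path, n_records, db_path: nil, threads: 1)
  return ['genevalidator', '-d', db_path, '-n', threads.to_s, fasta_path] if db_path
  raise ArgumentError, 'provide a local BLAST database for larger inputs' if n_records >= REMOTE_LIMIT

  ['genevalidator', fasta_path]
end

fasta = ">seq1\nMKVLA\n>seq2\nMAAGT\n"
n_records = count_fasta_records(fasta)
puts genevalidator_args('INPUT_FASTA_FILE', n_records).join(' ')
puts genevalidator_args('INPUT_FASTA_FILE', 500, db_path: 'DATABASE_PATH', threads: 4).join(' ')
```

The resulting array can be passed to `system(*args)`, which avoids shell quoting issues with paths that contain spaces.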
At times, it may be more suitable to run the resource-heavy BLAST separately and then pass the BLAST output file to GeneValidator. This may be the case if you are analysing a large number of input sequences and would like to run the time- and resource-consuming BLAST process on a faster machine (e.g. a cluster).
GeneValidator supports the XML and tabular BLAST output formats.
    # Run BLAST (XML output)
    blast(p/x) -db DATABASE_PATH -num_threads NUM_THREADS -outfmt 5 -out BLAST_XML_FILE -query INPUT_FASTA_FILE

    # Optional: Generate a fasta file for the BLAST hits.
    # Note: this works best if you use the same database used to create the BLAST output file.
    genevalidator -d DATABASE_PATH -e -x BLAST_XML_FILE

    # Run GeneValidator
    # If you ran the previous command (i.e. if you produced a fasta file for the BLAST hits)
    genevalidator -n NUM_THREADS -x BLAST_XML_FILE -r RAW_SEQUENCES_FILE INPUT_FASTA_FILE

    # If you did not run the previous command (this will run the previous command for you)
    genevalidator -d DATABASE_PATH -n NUM_THREADS -x BLAST_XML_FILE INPUT_FASTA_FILE
This is the same, but using the BLAST tabular output.
    # Run BLAST (tabular output)
    blast(p/x) -db DATABASE_PATH -num_threads NUM_THREADS -outfmt '7 qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq' -out BLAST_TAB_FILE -query INPUT_FASTA_FILE

    # Optional: Generate a fasta file for the BLAST hits.
    # Note: this works best if you use the same database used to create the BLAST output file.
    genevalidator -d DATABASE_PATH -e -t BLAST_TAB_FILE -o 'qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq'

    # Run GeneValidator
    # If you ran the previous command (i.e. if you produced a fasta file for the BLAST hits)
    genevalidator -n NUM_THREADS -t BLAST_TAB_FILE -o 'qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq' -r RAW_SEQUENCES_FILE INPUT_FASTA_FILE

    # If you did not generate the BLAST hits fasta file (this will automatically run the previous command for you)
    genevalidator -d DATABASE_PATH -n NUM_THREADS -t BLAST_TAB_FILE -o 'qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq' INPUT_FASTA_FILE
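In the tabular workflow the same column list appears in both the BLAST `-outfmt` value and the GeneValidator `-o` value, so it is easy for the two to drift apart. The sketch below is one way to keep them in sync from a wrapper script; the helper names `blast_command` and `genevalidator_command` are hypothetical, and the flags are only those shown in the commands above.

```ruby
require 'shellwords'

# Column list copied from the commands above. Defining it once means the
# BLAST -outfmt value and the GeneValidator -o value always agree.
TABULAR_COLUMNS = 'qseqid sseqid sacc slen qstart qend sstart send length ' \
                  'qframe pident nident evalue qseq sseq'

# Hypothetical helper: argument list for the BLAST step (blastp or blastx).
def blast_command(program, db, threads, out_file, query)
  [program, '-db', db, '-num_threads', threads.to_s,
   '-outfmt', "7 #{TABULAR_COLUMNS}", '-out', out_file, '-query', query]
end

# Hypothetical helper: argument list for the GeneValidator step that reads
# the pre-computed tabular BLAST output.
def genevalidator_command(db, threads, tab_file, query)
  ['genevalidator', '-d', db, '-n', threads.to_s,
   '-t', tab_file, '-o', TABULAR_COLUMNS, query]
end

# Print shell-escaped versions of both commands for inspection.
puts Shellwords.join(blast_command('blastp', 'DATABASE_PATH', 4, 'BLAST_TAB_FILE', 'INPUT_FASTA_FILE'))
puts Shellwords.join(genevalidator_command('DATABASE_PATH', 4, 'BLAST_TAB_FILE', 'INPUT_FASTA_FILE'))
```

Building each command as an array and escaping it with `Shellwords.join` keeps the multi-word column list as a single argument, which is what the single quotes do in the shell commands above.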
GeneValidator presents its output in three formats.
Firstly, the output is produced as a colourful HTML file. This file is titled 'results.html' (found in the 'html' folder) and can be opened in a web browser (please use a supported browser - see Installation Requirements). This file contains all the results in an easy-to-view manner with graphical visualisations. See exemplar HTML output here (protein input data) and here (DNA input data).
The output is also produced in JSON. GeneValidator can regenerate the HTML report from any JSON file (or derived JSON file) that it previously generated. This means that you can use the JSON file in your own analysis pipelines and then use GeneValidator to produce the HTML output for the analysed JSON file.
Lastly, a tabular summary of the results is printed in the terminal to provide quick feedback. The terminal output can be piped to tools like `awk` and `sed` or redirected to a file for further processing.
There are numerous methods to analyse the JSON output, including the streamable JSON command line program or jq. The examples below use jq 1.5.
To install jq:
    # Ubuntu
    $ sudo apt-get install jq

    # brew / linuxbrew
    $ brew install jq

    # Requires jq 1.5
    # Extract sequences that have an overall score of 100
    $ cat INPUT_JSON_FILE | jq '.[] | select(.overall_score == 100)' > OUTPUT_JSON_FILE

    # Extract sequences that have an overall score of over 70
    $ cat INPUT_JSON_FILE | jq '.[] | select(.overall_score > 70)' > OUTPUT_JSON_FILE

    # Extract sequences that have more than 50 hits
    $ cat INPUT_JSON_FILE | jq '.[] | select(.no_hits > 50)' > OUTPUT_JSON_FILE

    # Sort the JSON based on the overall score (ascending - 0 to 100)
    $ cat INPUT_JSON_FILE | jq 'sort_by(.overall_score)' > OUTPUT_JSON_FILE

    # Sort the JSON based on the overall score (descending - 100 to 0)
    $ cat INPUT_JSON_FILE | jq 'sort_by(- .overall_score)' > OUTPUT_JSON_FILE

    # Remove the large graphs objects (note: these graphs objects are required if you wish to pass the JSON back into GeneValidator using the `-j` option - see below)
    $ cat INPUT_JSON_FILE | jq -r '[ .[] | del(.validations[].graphs) ]' > OUTPUT_JSON_FILE

    # Save JSON as CSV
    # Write header first
    $ cat INPUT_JSON_FILE | jq -r '.[0] | ["idx", "overall_score", "definition", "no_hits", .validations[].header ] | @csv' > OUTPUT_CSV_FILE
    # Write content to the same file
    $ cat INPUT_JSON_FILE | jq -r '.[] | [.idx, .overall_score, .definition, .no_hits, .validations[].print ] | @csv ' >> OUTPUT_CSV_FILE
The subsetted/sorted JSON file can then be passed back into GeneValidator (using the `-j` command line argument) to generate the HTML report for the sequences in the JSON file.
    genevalidator -j SORTED_JSON_FILE
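The same kind of subsetting and sorting can be done in Ruby with the standard `json` library. The sketch below is illustrative only: the helper names are hypothetical, the sample data is made up, and the field names (`idx`, `overall_score`, `no_hits`, `validations`, `graphs`) are taken from the jq examples rather than from a documented schema. It assumes the top level of the file is an array of per-query objects and handles `validations` being either an object or an array.

```ruby
require 'json'

# Made-up stand-in for a GeneValidator JSON file; real files contain more fields.
SAMPLE = <<~JSON
  [
    {"idx": 1, "overall_score": 100, "no_hits": 60, "validations": {"lenc": {"graphs": [1, 2]}}},
    {"idx": 2, "overall_score": 70, "no_hits": 12, "validations": {"lenc": {"graphs": [3]}}},
    {"idx": 3, "overall_score": 85, "no_hits": 55, "validations": {"lenc": {"graphs": []}}}
  ]
JSON

# Keep only queries whose overall score is strictly above min_score.
def high_scoring(records, min_score)
  records.select { |record| record['overall_score'] > min_score }
end

# Sort queries by overall score, highest first.
def sorted_desc(records)
  records.sort_by { |record| -record['overall_score'] }
end

# Return copies of the records with the large graphs objects removed.
# Note: per the jq example, graphs are needed to rebuild the HTML report with -j.
def without_graphs(records)
  records.map do |record|
    copy = JSON.parse(JSON.generate(record)) # deep copy so the input is untouched
    validations = copy['validations']
    entries = validations.is_a?(Hash) ? validations.values : Array(validations)
    entries.each { |validation| validation.delete('graphs') }
    copy
  end
end

records = JSON.parse(SAMPLE)
puts high_scoring(records, 70).map { |record| record['idx'] }.inspect
puts sorted_desc(records).map { |record| record['idx'] }.inspect
puts JSON.pretty_generate(without_graphs(records).first)
```

Writing the result of `JSON.generate(sorted_desc(records))` to a file produces a single JSON array.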
GeneValidatorApp - A Web App wrapper for GeneValidator.
GeneValidatorApp-API - An easy to use API for GeneValidatorApp to allow you to use GeneValidator within your web applications.