soedinglab/WIsH

Name: WIsH

Owner: Söding Lab

Description: Predict prokaryotic host for phage metagenomic sequences

Created: 2017-02-14 16:45:46.0

Updated: 2017-10-10 13:20:28.0

Pushed: 2018-01-15 17:58:59.0

Homepage: null

Size: 1677

Language: C++

README

LICENSE

WIsH is licensed under the General Public License (see the LICENSE file). Copyright Clovis Galiez (clovis.galiez@mpibpc.mpg.de).

What is this repository for?
How do I get set up?
Installation:

Docker:

Build the docker container:

```sh
cd /path/to/repository/of/WiSH
docker build -t wish .
```

To run WIsH from the container:

```
docker run -v /some/host/folder:/data wish <some WiSH commands>
```

Linux:

```sh
git clone https://github.com/soedinglab/WIsH.git
cd WIsH
cmake .
make
```

MacOS

Compiling with OpenMP support on MacOS requires a recent gcc compiler. You can get gcc from homebrew.

```sh
brew install gcc@6
export CC=gcc-6
export CXX=g++-6

git clone https://github.com/soedinglab/WIsH.git
cd WIsH
cmake .
make
```

Dependencies

To benefit from the parallelization of WIsH, you should have the OpenMP library installed. WIsH uses C++11. Both the model construction and the interaction prediction are parallelized: once you give WIsH the number N of threads to use (with the parameter "-t N"), it assigns one bacterial genome/model per thread.
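The one-genome-per-thread distribution can be pictured with a small Python sketch. It only illustrates the scheduling pattern and is not WIsH's C++/OpenMP code: `build_model` is a hypothetical stand-in for the real model construction, and `n_threads` plays the role of N in "-t N".

```python
from concurrent.futures import ThreadPoolExecutor


def build_model(genome_name):
    # Hypothetical stand-in for building one model from one genome.
    return f"{genome_name}.model"


def build_all(genome_names, n_threads):
    # Each genome is an independent task; at most n_threads run at once,
    # so threads beyond the number of genomes would simply stay idle.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns results in input order, whatever the finishing order.
        return list(pool.map(build_model, genome_names))


models = build_all(["hostA", "hostB", "hostC"], n_threads=2)
print(models)
```

Under this pattern, asking for more threads than there are genomes or models should bring no further speed-up, since the unit of work is a whole genome.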

Database configuration

You need two separate directories, each containing only sequence data in FASTA format. One should contain your potential host genomes (one bacterium per FASTA file); the other should contain the viral contigs/genomes (one virus per FASTA file).
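Because each directory must contain nothing but FASTA files, a quick check before building models can save a failed run. The following is a minimal sketch, not part of WIsH: the helper name `non_fasta_files` is hypothetical, and it only tests that the first non-blank line of each file starts with ">", which is a necessary but not sufficient sign of FASTA content.

```python
from pathlib import Path
import tempfile


def non_fasta_files(directory):
    """Return names of directory entries that do not look like FASTA files."""
    suspicious = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            suspicious.append(path.name)  # sub-directories are unexpected too
            continue
        with path.open(errors="replace") as handle:
            # First non-blank line, or "" for an empty file.
            first = next((line for line in handle if line.strip()), "")
        if not first.startswith(">"):
            suspicious.append(path.name)
    return suspicious


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "hostA.fa").write_text(">contig1\nACGT\n")
    Path(tmp, "notes.txt").write_text("not a sequence\n")
    print(non_fasta_files(tmp))
```

Running the same check on both the host directory and the viral directory should return an empty list before you proceed.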

Usage example

To run a prediction, you should proceed in two steps:

1 - Create the models from the bacterial genomes you stored in FASTA format in prokaryoteGenomesDir:

```sh
mkdir modelDir
./WIsH -c build -g prokaryoteGenomesDir -m modelDir
```

This will create a model in modelDir for every bacterial genome.

2 - Run the prediction on the viral sequences you stored in FASTA format in phageContigsDir:

```sh
mkdir outputResultDir
./WIsH -c predict -g phageContigsDir -m modelDir -r outputResultDir -b
```

This will output a file llikelihood.matrix containing a matrix of log-likelihoods (rows are bacteria and columns are viral contigs), and a "summary" file prediction.list containing, for every viral sequence, the host with the highest log-likelihood (-b option).
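To make the relation between the two files concrete, here is a minimal Python sketch that recomputes the best host per viral contig from a log-likelihood matrix. The layout is an assumption: a first line with the contig names, then one whitespace-separated row per bacterium starting with its name. The function name `best_hosts` and the numbers in the example are made up for illustration.

```python
def best_hosts(matrix_text):
    """Map each viral contig to its (host, log-likelihood) maximum."""
    lines = [line.split() for line in matrix_text.splitlines() if line.strip()]
    # Assumed layout: header of contig names, then "host value value ...".
    contigs, rows = lines[0], lines[1:]
    best = {}
    for row in rows:
        host, values = row[0], [float(v) for v in row[1:]]
        for contig, value in zip(contigs, values):
            # Keep the host with the highest log-likelihood for this contig.
            if contig not in best or value > best[contig][1]:
                best[contig] = (host, value)
    return best


# Illustrative values only, not real WIsH output.
example = """\
phage1 phage2
hostA -1.30 -1.45
hostB -1.42 -1.38
"""
print(best_hosts(example))
```

If the real file uses a different layout, only the parsing lines need to change; the column-wise maximum is the part that corresponds to the -b option.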

The files can be further analyzed with any text editor or with R:



```R
ll = read.table("outputResultDir/llikelihood.matrix")
predictions = read.table("outputResultDir/prediction.list")

# Show the number of viral contigs targeting every potential host:
table(predictions$V2)

# Show the histogram of the log-likelihoods for the best predictions:
hist(predictions$V3)
```

Getting the null parameters for new bacterial models

If you want to get p-values for your predictions, you need to know the null parameters for each new bacterial model. To get them, run the predictions on a large set of phage genomes that are known not to infect your bacterial model (let's call it the null set of phages) and use the resulting prediction likelihoods to fit the null-model parameters. You can use the script computeNullParameters.R, which takes the predictions on this null set of phages and creates a file containing the null parameters for every bacterial model. To get the p-values while predicting interactions, specify the options "-b -n nullParameters.tsv" in a prediction call.
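The idea of fitting null parameters and turning a score into a p-value can be sketched as follows. This is an illustration under a stated assumption, namely that the null log-likelihoods of one bacterial model are summarized by the mean and standard deviation of a normal distribution; the actual fitting done by computeNullParameters.R and the format of nullParameters.tsv are not shown in this document and may differ. The function names and the numbers are hypothetical.

```python
import math
import statistics


def fit_null(null_logliks):
    """Summarize null log-likelihoods by their mean and standard deviation."""
    return statistics.mean(null_logliks), statistics.stdev(null_logliks)


def p_value(loglik, mean, sd):
    """Upper-tail probability of a normal(mean, sd) at the observed value."""
    z = (loglik - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# Made-up log-likelihoods of one bacterial model on a null set of phages.
mean, sd = fit_null([-1.42, -1.40, -1.45, -1.41, -1.43])

# A score well above the null scores yields a small p-value.
print(p_value(-1.35, mean, sd))
```

A larger and more diverse null set gives more stable estimates of the two parameters, which is why the text above asks for a large set of non-infecting phages.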

Troubleshooting - Bug reports

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.