EnvGen/Binning

Name: Binning

Owner: Environmental Genomics Group SciLifeLab/KTH Stockholm

Description: Scripts required to calculate tetramer frequencies and create input files for ESOM. See: Dick, G.J., A. Andersson, B.J. Baker, S.S. Simmons, B.C. Thomas, A.P. Yelton, and J.F. Banfield (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10: R85

Created: 2013-11-27 09:59:58.0

Updated: 2014-09-09 14:50:18.0

Pushed: 2013-11-27 12:15:46.0

Homepage: http://genomebiology.com/2009/10/8/R85

Size: 126

Language: Perl

README

Binning

Scripts required to calculate tetramer frequencies and create input files for ESOM.
See: Dick, G.J., A. Andersson, B.J. Baker, S.S. Simmons, B.C. Thomas, A.P. Yelton, and J.F. Banfield (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10: R85
Open Access: http://genomebiology.com/2009/10/8/R85

How to ESOM?

These instructions are for ESOM-based binning. See http://databionic-esom.sourceforge.net/ for the software download and manual.

  1. Generate input files.
  2. Although it is not strictly necessary, we recommend adding some reference genomes, chosen based on your 16S/OTU analysis, as 'controls'. The idea is that, if the ESOM worked, each reference genome should form a bin by itself. You can do this by downloading genomes in fasta format from any public database, preferably a complete genome consisting of a single sequence.

  3. Use the esomWrapper.pl script to create the relevant input files for ESOM. To run this script, all of your sequence files (in fasta format) must be in the same folder and have the same extension. For example:
    perl esomWrapper.pl -path fasta_folder -ext fa
    For more help and examples, type:
    perl esomWrapper.pl -h

  4. The script will use the fasta files to produce the three tab-delimited files that ESOM requires:
    – Learn file = a table of tetranucleotide frequencies (.lrn)
    – Names file = a list of the names of each contig (.names)
    – Class file = a list of the class of each contig, which can be used to color data points, etc. (.cls)
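
To make the learn file more concrete, here is a minimal Perl sketch of how tetranucleotide frequencies could be computed for a single sequence. It is an illustration only and does not reproduce what esomWrapper.pl actually does (for example, any windowing, normalization, or reverse-complement handling it may apply). The function name tetramer_frequencies and the example sequence are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of esomWrapper.pl.
# Counts overlapping 4-mers in one sequence and returns a hash reference
# of relative frequencies. Windows containing anything other than
# A, C, G or T (for example N) are skipped.
sub tetramer_frequencies {
    my ($seq) = @_;
    $seq = uc $seq;
    my %count;
    my $total = 0;
    for my $i (0 .. length($seq) - 4) {
        my $kmer = substr($seq, $i, 4);
        next unless $kmer =~ /^[ACGT]{4}$/;
        $count{$kmer}++;
        $total++;
    }
    return {} unless $total;    # sequence too short or no valid windows
    my %freq = map { $_ => $count{$_} / $total } keys %count;
    return \%freq;
}

# Print one "tetramer<TAB>frequency" line per observed tetramer.
my $freq = tetramer_frequencies("ACGTACGTAC");
printf "%s\t%.4f\n", $_, $freq->{$_} for sort keys %$freq;
```

In a real learn file, each contig would contribute one row of such frequencies, with one column per tetramer.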

NOTE (class numbers): ESOM requires that you assign your sequences to classes. We generally assign class 0 to all the sequences that belong to your query (a metagenome, for example) and number all the others 1, 2, and so on. Think of these as your predefined bins: every sequence with the same class number is given the same color in the map.
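
The numbering convention in the note can be sketched in a few lines of Perl. This is a hypothetical illustration, not code from esomWrapper.pl: it assumes one class per input file, with the query file first and each reference genome after it, and the file names are made up.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of esomWrapper.pl.
# Maps each fasta file name to a class number: the query (metagenome)
# file gets class 0, and each reference genome gets 1, 2, ... in the
# order given. Assumes one class per file.
sub assign_classes {
    my ($query, @references) = @_;
    my %class_of = ($query => 0);
    my $next = 1;
    $class_of{$_} = $next++ for @references;
    return \%class_of;
}

# Made-up file names for illustration.
my $classes = assign_classes('metagenome.fa', 'ref_genomeA.fa', 'ref_genomeB.fa');
print "$_\t$classes->{$_}\n"
    for sort { $classes->{$a} <=> $classes->{$b} } keys %$classes;
```

Every contig from a given file would then carry that file's class number in the class file, so the contigs of each reference genome share one color in the map.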

Questions?

Sunit Jain, sunitj [AT] umich [DOT] edu


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.