soedinglab/MMseqs

Name: MMseqs

Owner: Söding Lab

Description: null

Created: 2015-05-04 12:00:32.0

Updated: 2018-01-03 19:41:46.0

Pushed: 2016-09-19 09:37:09.0

Homepage: null

Size: 565

Language: C++

README

PLEASE USE MMSEQS2 (THIS VERSION IS NO LONGER DEVELOPED)

Please use MMseqs2 instead of MMseqs. It is faster, more sensitive, clusters better, and is more user-friendly. This repository is no longer developed. You can find MMseqs2 here: https://github.com/soedinglab/MMseqs2

MMseqs

MMseqs (Many-against-Many sequence searching) is a software suite for very fast protein sequence searches and clustering of huge protein sequence data sets. MMseqs is around 1000 times faster than protein BLAST and sensitive enough to capture similarities down to less than 30% sequence identity.

Requirements

To compile from source, you will need:

Memory Requirements

When using MMseqs, the available memory limits the size of the database you can process. We recommend at least 128 GB of RAM, which is enough for databases of up to 50,000,000 entries.

You can calculate the memory requirement M in bytes for a database of N sequences with an average length L using the following formula:

    M = (4*N*L + 8*a^k) byte

MMseqs stores an index table and two auxiliary arrays, which have a total size of M bytes.

For a database containing N sequences with an average length L, the memory consumption of the index table is (4*N*L) bytes. Note that the memory consumption grows linearly with the number of sequences N in the database.

The two auxiliary arrays consume (8*a^k) bytes, where a is the size of the amino acid alphabet (usually 21, including the unknown amino acid X) and k is the k-mer size.
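As a rough illustration, the formula can be evaluated directly. The sketch below is not part of MMseqs: the function name `estimate_memory_bytes`, the average length of 350 residues, and the k-mer size of 7 are assumptions chosen only for the example.

```python
def estimate_memory_bytes(n_sequences: int, avg_length: int,
                          alphabet_size: int = 21, kmer_size: int = 7) -> int:
    """Evaluate M = 4*N*L + 8*a^k from the formula above.

    kmer_size=7 is an assumed example value, not a documented default.
    """
    index_table = 4 * n_sequences * avg_length         # grows linearly with N
    auxiliary_arrays = 8 * alphabet_size ** kmer_size   # independent of N
    return index_table + auxiliary_arrays


# Hypothetical database: 50 million sequences, average length 350 (assumed).
total = estimate_memory_bytes(50_000_000, 350)
print(f"{total} bytes = {total / 1024**3:.1f} GiB")
```

With these assumed values the index table accounts for 70 GB and the auxiliary arrays for roughly 14.4 GB, which fits within the recommended 128 GB.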

Installation

Cloning from Git

If you want to compile the most recent version, simply clone the git repository.

    git clone https://github.com/soedinglab/MMseqs.git

Compile

First, set environment variables:

    export MMDIR=$HOME/path/to/mmseqs/
    export PATH=$PATH:$MMDIR/bin

MMseqs uses ffindex, a fast and simple database for wrapping and accessing a huge number of small files. Set the environment variable LD_LIBRARY_PATH so that the required ffindex libraries can be found, then build ffindex:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MMDIR/lib/ffindex/src
    cd $MMDIR/lib/ffindex
    make

Then build the MMseqs binaries:

    cd $MMDIR/src
    make

MMseqs binaries are now located in $MMDIR/bin.

Overview of MMseqs

MMseqs contains six binaries. Three commands execute complete workflows that combine MMseqs core modules. The other three commands execute the single modules which are used by the workflows and are available for advanced users.

Workflows

FFindex Database Format

All modules take ffindex databases as input and produce ffindex databases as output. ffindex was developed to avoid drastically slowing down the file system when millions of files need to be written and accessed. It hides the files from the file system by storing them as unstructured data records in a single data file. In addition to this data file, an ffindex database includes an index file, which stores for each record a unique accession code, the start position of the record in the ffindex data file in bytes, and the length of the record. When a FASTA file with multiple sequences is converted into an ffindex database, the accession code is the ID of the sequence parsed from the header. If no ID can be identified, the accession code is the part of the header before the first blank space, without the leading > character.
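To make the data file and index file relationship concrete, here is a minimal sketch of an index lookup. It is not the ffindex implementation: it assumes the index is a text file with one tab-separated line per record (accession code, start offset, length, in that order), and the helper names `read_index` and `get_record` are hypothetical. Details of the real format, such as record terminators, may differ.

```python
import os
import tempfile


def read_index(index_path):
    """Map each accession code to its (offset, length) in the data file."""
    entries = {}
    with open(index_path, "r", encoding="ascii") as handle:
        for line in handle:
            if not line.strip():
                continue
            # Assumed column order: accession, start offset, record length.
            name, offset, length = line.rstrip("\n").split("\t")
            entries[name] = (int(offset), int(length))
    return entries


def get_record(data_path, entries, name):
    """Seek to a record's start position and read exactly `length` bytes."""
    offset, length = entries[name]
    with open(data_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Build a tiny two-record database so the sketch is self-contained.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "DB")
index_path = os.path.join(workdir, "DB.index")
records = {"seq1": b"MKTAYIAK\n", "seq2": b"GSHMLEDP\n"}

with open(data_path, "wb") as data, open(index_path, "w", encoding="ascii") as index:
    offset = 0
    for name, payload in records.items():
        data.write(payload)
        index.write(f"{name}\t{offset}\t{len(payload)}\n")
        offset += len(payload)

entries = read_index(index_path)
print(get_record(data_path, entries, "seq2"))
```

Because every record is located by a single seek into one data file, the file system only ever sees two files regardless of how many records the database holds.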

The binaries fasta2ffindex and ffindex2fasta, located in mmseqs/bin, convert to and from the ffindex database format. fasta2ffindex generates an ffindex database from a FASTA sequence database. ffindex2fasta converts an ffindex database to a FASTA-formatted text file: the headers are the ffindex accession codes preceded by >, each followed by the corresponding data record from the ffindex data file. However, for fast access to individual records in very large databases, it is advisable to use the ffindex database directly without converting. The binary ffindex_get (in $MMDIR/lib/ffindex/src/) provides direct access to the records stored in an ffindex database.
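The two conversion rules described above can be sketched in a few lines. This is an illustration only, not the code of fasta2ffindex or ffindex2fasta: `accession_from_header` implements just the fallback rule (header text before the first blank, without the leading >), not the ID parsing, and both function names are hypothetical.

```python
def accession_from_header(header: str) -> str:
    """Fallback rule: drop the leading '>' and keep the text before the first blank."""
    text = header[1:] if header.startswith(">") else header
    parts = text.split(maxsplit=1)
    return parts[0] if parts else ""


def to_fasta(records: dict) -> str:
    """Emit each record as '>accession' followed by its stored data."""
    lines = []
    for accession, data in records.items():
        lines.append(f">{accession}")
        lines.append(data.rstrip("\n"))
    return "\n".join(lines) + "\n"


# Hypothetical header and record, used only to exercise the two helpers.
accession = accession_from_header(">seq1 example protein")
print(to_fasta({accession: "MKTAYIAK\n"}), end="")
```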

How to cluster

Before clustering, convert your FASTA database into ffindex format:

    fasta2ffindex DB.fasta DB

For large input databases, make sure that the temporary folder tmp provides enough free space. For the disk space requirements, see the user guide.

    mkdir tmp
    mmseqs_cluster DB DB_clu tmp --cascaded

To generate a FASTA-style formatted output file from the ffindex output file, type:

    ffindex2fasta DB_clu DB_clu.fasta
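If you want to script these steps, a small driver can run them in order and stop at the first failure. The sketch below only wraps the three commands shown above; the function names `clustering_commands` and `run_clustering` are hypothetical, and it assumes the MMseqs binaries are on your PATH when `run_clustering` is called.

```python
import os
import subprocess


def clustering_commands(fasta, db, clusters, tmp_dir):
    """Return the argv lists for the convert, cluster, and export steps."""
    return [
        ["fasta2ffindex", fasta, db],
        ["mmseqs_cluster", db, clusters, tmp_dir, "--cascaded"],
        ["ffindex2fasta", clusters, clusters + ".fasta"],
    ]


def run_clustering(fasta, db, clusters, tmp_dir):
    """Create the temporary folder, then run each step, stopping on failure."""
    os.makedirs(tmp_dir, exist_ok=True)
    for argv in clustering_commands(fasta, db, clusters, tmp_dir):
        subprocess.run(argv, check=True)  # raises CalledProcessError on error


# Print the commands instead of running them, so this works without MMseqs.
for argv in clustering_commands("DB.fasta", "DB", "DB_clu", "tmp"):
    print(" ".join(argv))
```

Passing the arguments as lists rather than a single shell string avoids quoting problems when database paths contain spaces.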

To run the more sensitive cascaded clustering and convert the result into FASTA format, type:

    mmseqs_cluster DB DB_clu_s7 tmp --cascaded -s 7
    ffindex2fasta DB_clu_s7 DB_clu_s7.fasta

How to search

You can use the query database queryDB.fasta and the target database targetDB.fasta to test the search workflow. Before searching, you need to convert the database containing your query sequences (queryDB.fasta) and your target database (targetDB.fasta) into ffindex format:

    fasta2ffindex queryDB.fasta queryDB
    fasta2ffindex targetDB.fasta targetDB

This generates an ffindex data file and an ffindex index file for each input, e.g. queryDB and queryDB.index from queryDB.fasta. Then create a directory for temporary files:

    mkdir tmp

For large input databases, make sure that tmp provides enough free space. For the disk space requirements, see the user guide. To run the search, type:

    mmseqs_search queryDB targetDB outDB tmp

Then convert the result ffindex database into a FASTA formatted database:

    ffindex2fasta outDB outDB.fasta

License Terms
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
