datamade/ziphmm

Name: ziphmm

Owner: datamade

Description: Automatically exported from code.google.com/p/ziphmm

Created: 2015-11-26 18:32:08.0

Updated: 2015-11-26 18:32:10.0

Pushed: 2015-11-26 18:41:47.0

Homepage: null

Size: 1995

Language: C++


README

Build procedure:

To build and install the library, unpack the archive and execute the following commands in a terminal:

$ cd <path to library>/zipHMM-1.0.1/
zipHMM-1.0.1 $ cmake .
zipHMM-1.0.1 $ make
zipHMM-1.0.1 $ bin/calibrate
zipHMM-1.0.1 $ make test
zipHMM-1.0.1 $ make install

To build on OS X, the Accelerate framework is required (see https://developer.apple.com/performance/accelerateframework.html). It is included in the developer tools installed with Xcode (see https://developer.apple.com/xcode/).

To build on Linux, CMake must be able to find an installation of a BLAS implementation. For now the CMake script is set up to use ATLAS and to look for it at /com/extra/ATLAS/3.9.84. This will most likely not work on your machine, so you may have to change line 11 in zipHMM/CMakeLists.txt:

set(ATLAS_ROOT "/com/extra/ATLAS/3.9.84")

If you are using a BLAS implementation other than ATLAS, you will have to make a few additional simple changes in zipHMM/CMakeLists.txt - look at lines 12, 13, 32, 56, 76, 91, 106, 121, 159 and 192.

bin/calibrate finds the optimal number of threads to use in the parallelized algorithm and saves the number in a file (default is ~/.ziphmm.devices).

Getting started

Have a look at zipHMM/cpp_example.cpp and zipHMM/python_example.py and try running the following commands from the root directory.

zipHMM-1.0.1 $ bin/cpp_example

zipHMM-1.0.1 $ cd zipHMM/
zipHMM $ python python_example.py

Using the C++ library

The main class in the library is Forwarder (forwarder.hpp and forwarder.cpp). An object of this class represents an observed sequence that has been preprocessed so that the likelihood of the sequence under a given HMM (pi, A, B) can be obtained very fast. To build a new Forwarder object, just call the empty constructor:

Forwarder();

and to read in a sequence, call one of the three read_seq methods:

void Forwarder::read_seq(const std::string &seq_filename, const size_t alphabet_size,
                         std::vector<size_t> nStatesSave, const size_t min_no_eval = 1);
void Forwarder::read_seq(const std::string &seq_filename, const size_t alphabet_size,
                         const size_t no_states, const size_t min_no_eval);
void Forwarder::read_seq(const std::string &seq_filename, const size_t alphabet_size,
                         const size_t min_no_eval = 1);

Here seq_filename is the name of the file containing the observed sequence, alphabet_size is the size of the alphabet used in the observed sequence, nStatesSave is a vector listing the HMM state space sizes the user intends to run the Forwarder object on, and min_no_eval is a guess of the number of times the preprocessing will be reused (if unsure, leave it out and use the default value). If nStatesSave contains the vector (2, 4, 8), data structures giving the fastest evaluation of the forward algorithm for each of the HMM state space sizes 2, 4 and 8 will be built. If nStatesSave is left empty, a single data structure giving a very fast evaluation of the forward algorithm for all HMM state space sizes will be saved.

The second read_seq method is a convenient way to call the first with only one HMM size in nStatesSave. The third is a convenient way to call the first with an empty nStatesSave vector.
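
For illustration, here is a minimal C++ sketch (not taken from the library; it assumes Forwarder lives in the zipHMM namespace like Matrix, and uses the example.seq file and alphabet size 3 referenced elsewhere in this README) that builds a Forwarder for HMMs with 2, 4 and 8 states:

#include "forwarder.hpp"   // zipHMM::Forwarder
#include <cstddef>
#include <vector>

int main() {
    zipHMM::Forwarder f;                 // empty constructor
    std::vector<size_t> nStatesSave;     // HMM state space sizes we intend to use
    nStatesSave.push_back(2);
    nStatesSave.push_back(4);
    nStatesSave.push_back(8);
    // Preprocess example.seq (alphabet size 3), expecting the preprocessing
    // to be reused roughly 500 times.
    f.read_seq("example.seq", 3, nStatesSave, 500);
    return 0;
}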

After building a Forwarder object, it can be saved to disk using the method

void write_to_directory(const std::string &directory) const;

Here directory should be the path (relative to the root of the library) of the intended location of the data structure.

To read a previously saved data structure, one of the following two methods can be used:

void Forwarder::read_from_directory(const std::string &dirname);
void Forwarder::read_from_directory(const std::string &directory, const size_t no_states);

Using the first one, the entire data structure is rebuilt. Using the second one, only the data structure matching no_states is rebuilt, which will be faster in many cases. If you did not save a data structure for the size of your HMM, use the first method; the forward algorithm will figure out which of the saved data structures is best suited for your HMM.
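
Continuing the sketch above, saving and reloading might look like this (the directory name and the 4-state reload are only illustrations):

// Save the preprocessed data structure to disk.
f.write_to_directory("example_preprocessed");

// Later, or in another program: rebuild the entire data structure ...
zipHMM::Forwarder g;
g.read_from_directory("example_preprocessed");

// ... or rebuild only the data structure saved for 4-state HMMs.
zipHMM::Forwarder h;
h.read_from_directory("example_preprocessed", 4);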

Finally, to get the loglikelihood of the observed sequence in a specific model, one of the following methods is used:

double Forwarder::forward(const Matrix &pi, const Matrix &A, const Matrix &B) const;
double Forwarder::pthread_forward(const Matrix &pi, const Matrix &A, const Matrix &B,
                                  const std::string &device_filename = DEFAULT_DEVICE_FILENAME) const;

The second method is a parallelized version of the forward algorithm, whereas the first one is single-threaded. pi, A and B specify the HMM parameters; they can either be read from a file or built in C++ as described below in the section 'Encoding HMMs'. The parallelized version takes an additional filename as a parameter. This should be the path to the file created by the calibrate program, which finds the optimal number of threads to use in the parallelized forward algorithm. The default filename is ~/.ziphmm.devices; if you did not move the file, leave the device_filename parameter out.
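
As a sketch, assuming f is a Forwarder that has been preprocessed as above and pi, A and B are Matrix objects already read from a file or built in code (see 'Encoding HMMs' below):

// Single-threaded forward algorithm.
double loglik = f.forward(pi, A, B);

// Parallelized forward algorithm; reads the thread count written by
// bin/calibrate to the default ~/.ziphmm.devices.
double loglik_par = f.pthread_forward(pi, A, B);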

See zipHMM/cpp_example.cpp for a simple example.

Using the Python library

To use the Python library in another project, copy zipHMM/pyZipHMM.py and zipHMM/libpyZipHMM.so to the root of your project folder after building the library and import pyZipHMM. See zipHMM/python_example.py and zipHMM/python_test.py for details on how to use the library.

A Forwarder object can be constructed from an observed sequence as follows:

from pyZipHMM import *
f = Forwarder.fromSequence(seqFilename = "example.seq", alphabetSize = 3, minNoEvals = 500)

To save the data structure to disk, do as follows:

f.writeToDirectory("example_preprocessed")

To read a previously saved data structure from disk, use either of the two methods:

f = Forwarder.fromDirectory(directory = "../example_out")
f = Forwarder.fromDirectory(directory = "../example_out", nStates = 3)

Finally, to evaluate the loglikelihood of the sequence in a given model (matrices pi, A and B) use either of

loglikelihood = f.forward(pi, A, B)
loglikelihood = f.ptforward(pi, A, B)

where the second method is parallelized. The three matrices pi, A and B can be read from a file or built in Python as described below.

See zipHMM/python_example.py for an example.

Encoding HMMs

An HMM consists of three matrices: pi (the initial state probabilities), A (the transition probabilities) and B (the emission probabilities).

These three matrices can either be built in the code (in C++ or Python) or they can be encoded in a text file. The format of the text file is as follows:

HMM example


no_states
3
alphabet_size
4
pi
<initial probability for state 1>
<initial probability for state 2>
<initial probability for state 3>
A
0.1 0.2 0.7
0.3 0.4 0.3
0.5 0.5 0.0
B
0.1 0.2 0.3 0.4
0.2 0.3 0.4 0.1
0.3 0.4 0.1 0.2

To read and write HMMs from and to files in C++, use the methods

void read_HMM(Matrix &resultInitProbs, Matrix &resultTransProbs, Matrix &resultEmProbs, const std::string &filename);
void write_HMM(const Matrix &initProbs, const Matrix &transProbs, const Matrix &emProbs, const std::string &filename);
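
For example (a sketch: it assumes an HMM file in the format shown above, that the appropriate header is included, and that these functions live in the zipHMM namespace like the rest of the library):

zipHMM::Matrix pi, A, B;
zipHMM::read_HMM(pi, A, B, "example.hmm");   // fill pi, A and B from the text file
zipHMM::write_HMM(pi, A, B, "copy.hmm");     // write them back out in the same format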

To read and write HMMs from and to files in Python, use the functions

readHMM(filename) -> (pi, A, B)
writeHMM(pi, A, B, filename) -> None

To build a matrix in C++ do as illustrated in the following example:

C++ example
lude "matrix.hpp"

_t nRows = 3;
_t nCols = 4;
MM::Matrix m(3,4);
0) = 0.1;
1) = 0.2;

3) = 0.2;

To build a matrix in Python do as illustrated here:

Python example
import pyZipHMM

nRows = 3
nCols = 4
m = pyZipHMM.Matrix(nRows, nCols)
m[0,0] = 0.1
m[0,1] = 0.2
...
m[2,3] = 0.2

Encoding sequences

The alphabet of observables is encoded using integers: if the size of the alphabet is M, the observables are encoded as 0, 1, 2, …, M - 1. A sequence of observables is encoded in a text file with the individual observations separated by whitespace. See example.seq for an example.
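
For example, with an alphabet of size 3, a sequence of ten observations could be stored as the single line (purely illustrative data):

0 2 1 1 0 2 2 0 1 0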

Executables

calibrate

Usage: bin/calibrate

Finds the optimal number of threads to use in the parallelized version of the forward algorithm.

build_forwarder

Usage: bin/build_forwarder -s <sequence filename> -M <alphabet size> -o <output directory> [-N <number of states>]*

Builds a Forwarder object from the sequence in the file specified in <sequence filename> and writes it to the directory specified in <output directory>. <alphabet size> should be the size of the alphabet used in the observed sequence, and the file specified in <sequence filename> should contain a single line of whitespace-separated integers between 0 and <alphabet size> - 1. The list of HMM sizes to generate data structures for can be specified using the (repeatable) -N parameter.

Examples:

build_forwarder -s example.seq -M 3 -o example_out
build_forwarder -s example.seq -M 3 -o example_out -N 2
build_forwarder -s example.seq -M 3 -o example_out -N 2 -N 4 -N 8 -N 16
forward

Usage: bin/forward (-s <sequence filename> -m <HMM filename> [-e #expected forward calls] [-o <output directory>] ) | (-d <preprocessing directory> -m <HMM filename>) [-p]

Runs the forward algorithm and outputs the loglikelihood. This executable can be called in two different ways:

forward -s example.seq -m example.hmm -e 500 -o example_out
forward -d example_out/ -m example.hmm

In the first example the loglikelihood is evaluated based on the observed sequence in example.seq and the HMM specified in example.hmm. In the second example it is evaluated based on the previously saved data structure in example_out/ and the HMM specified in example.hmm. In both cases the -p flag can be added to use the parallelized version. In the first example the user can optionally save the data structure to e.g. example_out/ using the -o parameter:

forward -s example.seq -m example.hmm -e 500 -o example_out/
generate_hmm

Usage: bin/generate_hmm <number of states> <alphabet size> <HMM filename>

Generates a random HMM with <number of states> states and <alphabet size> observables, and saves it to <HMM filename>.

generate_seq

Usage: bin/generate_seq <HMM filename> <length> <observed sequence output filename> <state sequence output filename>

Given the HMM specified in <HMM filename>, runs the HMM for <length> iterations and saves the resulting sequence of observables to <observed sequence output filename> and the resulting sequence of hidden states to <state sequence output filename>.

Contact

If you encounter any problems or have questions about using this software, please contact

  Thomas Mailund : mailund@birc.au.dk.
