soedinglab/PEnG-motif

Name: PEnG-motif

Owner: Söding Lab

Description: PEnG-motif is an open-source software package for searching statistically overrepresented motifs (position specific weight matrices, PWMs) in a set of DNA sequences.

Created: 2016-11-09 07:46:59.0

Updated: 2017-11-24 06:21:50.0

Pushed: 2017-12-26 21:08:31.0

Homepage:

Size: 1627

Language: C++

README

PEnG-motif

© Johannes Soeding, Markus Meier, Christian Roth

PEnG-motif is an open-source software package for searching motifs (position specific weight matrices, PWMs) in a set of DNA sequences.

As the core algorithm operates on kmers, the runtime is practically independent of the number and size of the input sequences. This makes PEnG-motif suitable for de-novo motif discovery on large sequence sets.

Installation on macOS

The easiest way to get PEnG-motif on your Mac is to download our precompiled binaries. To do so, download peng_motif_macOS.zip from our latest release.

After unzipping, you will find a new binary named peng_motif. You can call it from the command line by its path (for example ./peng_motif from inside the unzipped folder), or simply as peng_motif if you moved it to a location in your shell path.

Installation on Linux
Requirements

To compile and run PEnG-motif, you need

Installation procedure

Download the source code archive from our latest release.

Unzip the source code, and navigate into the freshly unzipped folder in your terminal.

The following commands compile and install PEnG-motif:

INSTALL_DIR=/path/to/an/installation/directory/of/your/choice
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR
make && make install

You may want to add the path of $INSTALL_DIR/bin to your shell PATH, to simplify usage.

Optional requirements for running the scripts

The repository also contains helper scripts - if you want to use them, you also need:

Using PEnG-motif

PEnG-motif finds enriched sequence patterns in a FASTA file and writes the resulting motifs to an output file in MEME format.

If the parent directory of peng_motif is in your shell path, the simplest way to use PEnG-motif is:

peng_motif <path/to/input.fasta> -o output.meme

For a list of all available options, please see the output of peng_motif -h.

If you didn't put PEnG-motif in your shell path, you have to enter the full path to the peng_motif binary, e.g. /path/to/peng/peng_motif -h.
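To work with the result programmatically, the output file can be read back with a few lines of Python. The sketch below assumes that output.meme follows the usual minimal MEME motif layout (a "MOTIF <name>" line followed by a "letter-probability matrix: ... w= <width> ..." header and one row of probabilities per position). The function name read_meme is made up for this example and is not part of PEnG-motif.

```python
def read_meme(text):
    """Parse minimal MEME-format text into {motif_name: [[pA, pC, pG, pT], ...]}."""
    motifs = {}
    name = None
    lines = iter(text.splitlines())
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "MOTIF":
            # The first token after MOTIF is the motif identifier.
            name = fields[1]
        elif fields[0] == "letter-probability" and name is not None:
            # Assumes the usual "w= <width>" spelling with a space after "=".
            width = int(fields[fields.index("w=") + 1])
            rows = []
            while len(rows) < width:
                row = next(lines).split()
                if row:  # skip blank lines inside the matrix
                    rows.append([float(x) for x in row])
            motifs[name] = rows
    return motifs
```

A typical call would be read_meme(open("output.meme").read()), which returns one probability matrix per motif, with columns in the order A, C, G, T.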

Interpreting the output

The PEnG-motif algorithm runs in several phases; understanding the output printed to the console will help you interpret the final results.

Phase 1: Counting base patterns

In the first phase, the occurrences of all 4^(pattern_length) kmers are counted. The expected count of each kmer is calculated under the assumption of a homogeneous Markov model. From the observed and expected counts, a z-score is computed for each kmer.

pattern        observed      enrichment          zscore

CACTAG          890            2.40           27.00
ACC            2456            1.68           26.08

For each pattern that passes a predefined z-score threshold, the number of occurrences, the enrichment of observed over expected occurrences, and the calculated z-score are reported.
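Phase 1 can be pictured with the following minimal sketch. It is not PEnG-motif's implementation: the Markov order (first order here), the Poisson-style z-score (observed minus expected, divided by the square root of expected), and all function names are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def count_kmers(sequences, k):
    """Count all overlapping kmers that consist only of A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def expected_count(kmer, mono, di, n_positions):
    """Expected count under a first-order homogeneous Markov model (assumed order)."""
    prob = mono[kmer[0]] / sum(mono.values())
    for a, b in zip(kmer, kmer[1:]):
        # P(b | a), estimated from dinucleotide counts
        row_total = sum(di[a + c] for c in "ACGT")
        prob *= di[a + b] / row_total if row_total else 0.0
    return n_positions * prob

def kmer_statistics(sequences, k):
    """Return {kmer: (observed, enrichment, zscore)} for every observed kmer."""
    counts = count_kmers(sequences, k)
    mono = count_kmers(sequences, 1)
    di = count_kmers(sequences, 2)
    n_positions = sum(counts.values())
    stats = {}
    for kmer, observed in counts.items():
        expected = expected_count(kmer, mono, di, n_positions)
        # Poisson-style z-score; an assumption, the exact formula is not given here
        z = (observed - expected) / sqrt(expected)
        stats[kmer] = (observed, observed / expected, z)
    return stats

def enriched(stats, z_threshold):
    """Keep only the kmers whose z-score reaches the threshold."""
    return {k: v for k, v in stats.items() if v[2] >= z_threshold}
```

Because the statistics are computed on the kmer table rather than on the sequences themselves, everything after the counting step works on at most 4^(pattern_length) entries.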

Phase 2: Optimizing the base patterns

In the second phase, the base patterns selected in phase 1 are iteratively optimized by degenerating single nucleotides whenever the degenerated pattern achieves a better score. By default, we optimize a function based on the mutual information of observation and expectation.

CACTAG        890   2.40    -0.036173
CWCTAG       1687   1.89    -0.036744
optimization: CACTAG -> CWCTAG

For each iteration the current IUPAC pattern, the number of observed counts, the enrichment over the expected counts and the optimization score are printed. Once the optimization runs into a local optimum, the original base pattern and its optimal IUPAC pattern are reported.
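A degenerate IUPAC pattern stands for a set of plain kmers, so its count can be obtained by summing over that set. The sketch below shows this for the CACTAG -> CWCTAG step of the example output. The function names are invented for this example, and the count of 797 for CTCTAG is only inferred as 1687 - 890 under the assumption that counts add up; it is not a reported value.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(pattern):
    """List every plain kmer matched by an IUPAC pattern."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in pattern))]

def iupac_count(pattern, kmer_counts):
    """Count of an IUPAC pattern as the sum over the kmers it matches."""
    return sum(kmer_counts.get(kmer, 0) for kmer in expand(pattern))

# 890 is taken from the example output; 797 is inferred (see lead-in).
counts = {"CACTAG": 890, "CTCTAG": 797}
print(expand("CWCTAG"))
print(iupac_count("CWCTAG", counts))
```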

If the optimization runs into a pattern that has already been seen in a previous optimization, the optimization stops and the base pattern is removed.
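The optimization loop described above can be sketched as a greedy search. This is an illustration under assumptions, not PEnG-motif's code: the score used here (observed count times the log of the enrichment, maximized) is a simple stand-in for the mutual-information-based function, and the names optimize, stand_in_score, and seen are hypothetical.

```python
from itertools import product
from math import log

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(pattern):
    """List every plain kmer matched by an IUPAC pattern."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in pattern))]

def stand_in_score(pattern, observed, expected):
    """Stand-in objective (NOT the real one): observed * log(observed / expected)."""
    kmers = expand(pattern)
    obs = sum(observed.get(k, 0) for k in kmers)
    exp = sum(expected.get(k, 0.0) for k in kmers)
    if obs == 0 or exp == 0:
        return float("-inf")
    return obs * log(obs / exp)

def optimize(base, observed, expected, seen):
    """Greedily degenerate one position at a time until no candidate scores better.

    Returns the optimized IUPAC pattern, or None if the search reaches a
    pattern already visited by an earlier optimization (base pattern dropped).
    """
    current = base
    best = stand_in_score(base, observed, expected)
    visited = [base]
    while True:
        candidates = []
        for i, code in enumerate(current):
            for wider in IUPAC:
                # Only consider codes that are strictly more degenerate.
                if set(IUPAC[code]) < set(IUPAC[wider]):
                    cand = current[:i] + wider + current[i + 1:]
                    candidates.append((stand_in_score(cand, observed, expected), cand))
        if not candidates:
            break
        top_score, top = max(candidates)
        if top_score <= best:
            break  # local optimum reached
        if top in seen:
            return None
        current, best = top, top_score
        visited.append(top)
    seen.update(visited)
    return current
```

Sharing one seen set across all base patterns is what lets a later optimization detect that it is converging on a pattern that was already found.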

At the end of the optimization phase, all IUPAC patterns, their occurrences, enrichment over the expected counts, and the calculated z-scores are reported.

pattern        observed      enrichment          zscore

CWCTAG         1687            1.89           26.51
AYC            4376            1.55           29.29

Phase 3: Selection, PWM generation

In phase 3, only the best-scoring IUPAC patterns are retained and converted to PWMs.
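One plausible way to turn an IUPAC pattern into a PWM is to collect the counts of all kmers it matches, column by column, and normalize. The sketch below does exactly that; whether PEnG-motif builds its PWMs this way is an assumption, as are the pseudocount value and the function names. The kmer counts reuse the inferred split from the CWCTAG example (890 from the output, 797 inferred).

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(pattern):
    """List every plain kmer matched by an IUPAC pattern."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in pattern))]

def iupac_to_pwm(pattern, kmer_counts, pseudocount=1.0):
    """Build a PWM (list of {base: probability}) from the matching kmer counts."""
    columns = [{b: pseudocount for b in "ACGT"} for _ in pattern]
    for kmer in expand(pattern):
        for i, base in enumerate(kmer):
            columns[i][base] += kmer_counts.get(kmer, 0)
    # Normalize every column so that its probabilities sum to 1.
    return [{b: v / sum(col.values()) for b, v in col.items()} for col in columns]

pwm = iupac_to_pwm("CWCTAG", {"CACTAG": 890, "CTCTAG": 797})
```

The pseudocount keeps every probability above zero, which matters for any later step that takes logarithms or multiplies probabilities.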

Phase 4: EM-optimization and merging

In the final phase, an expectation-maximization (EM) algorithm sharpens the PWMs. PWMs with strong detectable overlaps are merged to form longer PWMs. The optimized PWMs are then written to the output file in MEME format.
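To give an idea of what "sharpening" means, here is a minimal EM sketch for a two-component mixture (motif versus uniform background) over all sequence windows. It is a textbook-style simplification and not PEnG-motif's actual EM: it treats overlapping windows as independent, does not scan the reverse strand, does not merge PWMs, and the parameters q, iterations, and pseudocount are assumed values. Sequences are expected to be uppercase and to contain only A, C, G, T.

```python
from math import prod

def em_sharpen(sequences, pwm, q=0.01, iterations=20, pseudocount=0.1):
    """Refine a PWM (list of {base: probability}) by mixture-model EM.

    q is the prior probability that a window is a motif occurrence.
    Returns the refined PWM and the re-estimated q.
    """
    w = len(pwm)
    windows = [s[i:i + w] for s in sequences for i in range(len(s) - w + 1)]
    for _ in range(iterations):
        counts = [{b: pseudocount for b in "ACGT"} for _ in range(w)]
        total_r = 0.0
        for win in windows:
            # E-step: responsibility of the motif component for this window
            p_motif = q * prod(pwm[i][b] for i, b in enumerate(win))
            p_bg = (1 - q) * 0.25 ** w
            r = p_motif / (p_motif + p_bg)
            total_r += r
            for i, b in enumerate(win):
                counts[i][b] += r
        # M-step: re-estimate the PWM columns and the motif prior
        pwm = [{b: c / sum(col.values()) for b, c in col.items()} for col in counts]
        q = total_r / len(windows)
    return pwm, q
```

Starting from a weak PWM around a planted motif, the columns typically become more peaked with each iteration, because windows that match well receive a growing share of the weight.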

Tips and tricks
Benchmarking

The shoot_peng.py Python script can be used to annotate motifs according to how well they distinguish the given sequences from randomly generated ones.
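The idea of scoring a motif by how well it separates real from random sequences can be illustrated with a generic stand-in. The sketch below is not what shoot_peng.py or BaMMmotif2 computes: it scores each sequence by its best log-odds window, creates random sequences by shuffling the input, and reports the area under the ROC curve. All function names and the uniform background are assumptions, and the PWM must not contain zero probabilities.

```python
import random
from math import log

def best_score(seq, pwm, bg=0.25):
    """Best log-odds score of any window of the sequence against the PWM."""
    w = len(pwm)
    return max(
        sum(log(pwm[i][b] / bg) for i, b in enumerate(seq[j:j + w]))
        for j in range(len(seq) - w + 1)
    )

def auroc(positives, negatives):
    """Probability that a positive scores higher than a negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def benchmark(sequences, pwm, seed=0):
    """AUROC of the input sequences against shuffled copies of themselves."""
    rng = random.Random(seed)
    shuffled = ["".join(rng.sample(s, len(s))) for s in sequences]
    return auroc(
        [best_score(s, pwm) for s in sequences],
        [best_score(s, pwm) for s in shuffled],
    )
```

A value near 0.5 means the motif does not separate the input from shuffled sequences, while a value near 1.0 means it does so almost perfectly.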

Benchmarking requires BaMMmotif2.

License

PEnG-motif can be modified and distributed under the GPL-3.0 license.

Acknowledgements

PEnG-motif uses shared code from BaMMmotif. Many thanks to the developers of BaMMmotif!


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.