Name: PEnG-motif
Owner: Söding Lab
Description: PEnG-motif is an open-source software package for searching statistically overrepresented motifs (position specific weight matrices, PWMs) in a set of DNA sequences.
Created: 2016-11-09 07:46:59.0
Updated: 2017-11-24 06:21:50.0
Pushed: 2017-12-26 21:08:31.0
Size: 1627
Language: C++
© Johannes Soeding, Markus Meier, Christian Roth
PEnG-motif is an open-source software package for searching motifs (position specific weight matrices, PWMs) in a set of DNA sequences.
As the core algorithm operates on kmers, the runtime is practically independent of the number and size of the input sequences. This makes PEnG-motif suitable for de-novo motif discovery on large sequence sets.
The easiest way to get PEnG-motif on your Mac is to download our precompiled binaries. If you choose that way, please download peng_motif_macOS.zip from our latest release.
After unzipping, you will find a new binary, peng_motif, which you can call from the command line by the path to the binary (e.g. ./peng_motif), or directly by typing peng_motif if you moved it to a location in your shell path.
To compile and run PEnG-motif, you need
Download the source code archive from our latest release.
Unzip the source code, and navigate into the freshly unzipped folder in your terminal.
This code will compile and install PEnG-motif:
INSTALL_DIR=/path/to/an/installation/directory/of/your/choice
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_DIR
make && make install
You may want to add $INSTALL_DIR/bin to your shell PATH to simplify usage.
The repository also contains helper scripts - if you want to use them, you also need:
PEnG-motif finds enriched sequence patterns in a FASTA file and writes the motifs to an output file in MEME format.
If the parent directory of peng_motif is in your shell path, the simplest way to use PEnG-motif is:
peng_motif <path/to/input.fasta> -o output.meme
For a list of all available options, please see the output of peng_motif -h.
If you didn't put PEnG-motif in your shell path, you have to enter the full path to the peng binary, e.g. /path/to/peng/peng_motif -h.
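If you drive PEnG-motif from a script, the call above can be wrapped in a few lines of Python. The following is a minimal sketch, not part of the repository: the function names build_peng_command and run_peng and the file names are hypothetical, and only options that appear in this README (the positional FASTA input, -o and --strand) are used.

```python
import shlex
import subprocess


def build_peng_command(fasta, output, binary="peng_motif", strand=None):
    """Assemble the argument list for a PEnG-motif run.

    Only options mentioned in this README are used; see `peng_motif -h`
    for the authoritative list.
    """
    cmd = [str(binary), str(fasta), "-o", str(output)]
    if strand is not None:
        cmd += ["--strand", strand]
    return cmd


def run_peng(fasta, output, **kwargs):
    """Run PEnG-motif and return what it printed to the console."""
    cmd = build_peng_command(fasta, output, **kwargs)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(build_peng_command("input.fasta", "output.meme")))
```

Pass binary="/path/to/peng/peng_motif" if the binary is not in your shell path.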
The PEnG-motif algorithm runs several phases; understanding the output printed to the console will help you interpret the final results.
In the first phase, the occurrences of all 4^(pattern_length) kmers are counted. The expected count of each pattern is calculated under the assumption of a homogeneous Markov model. From the observed and expected counts, a z-score is computed for each kmer.
pattern observed enrichment zscore
TAG 890 2.40 27.00
ACC 2456 1.68 26.08
For each pattern that passes a predefined z-score threshold, the number of occurrences, the enrichment of observed over expected occurrences, and the calculated z-score are reported.
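The three reported columns can be illustrated with a small Python sketch. It rests on assumptions that this README does not confirm: the background is an order-0 model estimated from the input itself, and the z-score uses a Poisson-style approximation, (observed - expected) / sqrt(expected). PEnG-motif's actual background order and z-score formula may differ, and all function names here are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_kmers(sequences, k):
    """Count every overlapping k-mer that consists only of A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N etc.
                counts[kmer] += 1
    return counts


def kmer_statistics(sequences, k):
    """Return {kmer: (observed, enrichment, zscore)} for all observed k-mers."""
    counts = count_kmers(sequences, k)
    n_positions = sum(counts.values())
    base_counts = Counter(b for s in sequences for b in s.upper() if b in "ACGT")
    total = sum(base_counts.values())
    if total == 0:
        return {}
    # Assumption: order-0 background, i.e. independent base frequencies.
    base_freq = {b: base_counts[b] / total for b in "ACGT"}
    table = {}
    for kmer, observed in counts.items():
        prob = 1.0
        for base in kmer:
            prob *= base_freq[base]
        expected = n_positions * prob
        # Assumption: Poisson-style z-score, not necessarily PEnG-motif's.
        zscore = (observed - expected) / sqrt(expected)
        table[kmer] = (observed, observed / expected, zscore)
    return table


if __name__ == "__main__":
    stats = kmer_statistics(["ACGTACGT", "TTACGTAA"], 3)
    for kmer, (obs, enr, z) in sorted(stats.items(), key=lambda kv: -kv[1][2]):
        print(f"{kmer} {obs} {enr:.2f} {z:.2f}")
```

Because the statistics are computed on the k-mer count table, later phases only need this table and not the sequences themselves.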
In the second phase the selected base patterns from phase 1 are iteratively optimized by degenerating single nucleotides if the degenerated pattern achieves a higher score. By default we optimize a function based on mutual information of observation and expectation.
TAG 890 2.40 -0.036173
TAG 1687 1.89 -0.036744
optimization: CACTAG -> CWCTAG
For each iteration the current IUPAC pattern, the number of observed counts, the enrichment over the expected counts and the optimization score are printed. Once the optimization runs into a local optimum, the original base pattern and its optimal IUPAC pattern are reported.
If the optimization runs into a pattern that has already been seen in a previous optimization, the optimization stops and the base pattern is removed.
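The optimization loop described above can be sketched as a greedy hill climb over IUPAC patterns. This is an illustration under stated assumptions, not PEnG-motif's implementation: the score used here is a z-score-like stand-in rather than the mutual-information-based default, and the names expand, score, neighbours and optimize are hypothetical. The dictionaries observed and expected map plain kmers to their counts.

```python
from itertools import product
from math import sqrt

# Standard IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def expand(pattern):
    """Yield all plain k-mers matched by an IUPAC pattern."""
    return ("".join(p) for p in product(*(IUPAC[c] for c in pattern)))


def score(pattern, observed, expected):
    """Stand-in score (z-score-like); NOT PEnG-motif's default objective."""
    obs = sum(observed.get(k, 0) for k in expand(pattern))
    exp = sum(expected.get(k, 0.0) for k in expand(pattern))
    return (obs - exp) / sqrt(exp) if exp > 0 else float("-inf")


def neighbours(pattern):
    """Yield patterns that add one nucleotide at exactly one position."""
    for i, code in enumerate(pattern):
        for base in "ACGT":
            if base not in IUPAC[code]:
                new_code = CODE_FOR[frozenset(IUPAC[code] + base)]
                yield pattern[:i] + new_code + pattern[i + 1:]


def optimize(base_pattern, observed, expected, seen):
    """Greedy optimization; returns None if a known pattern is reached."""
    current = base_pattern
    while True:
        if current in seen:
            return None  # an earlier optimization already visited this pattern
        seen.add(current)
        candidates = list(neighbours(current))
        if not candidates:
            return current  # fully degenerate pattern, nothing left to try
        best = max(candidates, key=lambda p: score(p, observed, expected))
        if score(best, observed, expected) <= score(current, observed, expected):
            return current  # local optimum
        current = best
```

The seen set is shared between calls, so a second base pattern that converges to an already visited IUPAC pattern yields None and can be dropped.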
At the end of the optimization phase, all IUPAC patterns, their occurrences, enrichment over the expected counts, and the calculated z-scores are reported.
pattern observed enrichment zscore
TAG 1687 1.89 26.51
AYC 4376 1.55 29.29
In phase 3, only the best-scoring IUPAC patterns are retained and converted to PWMs.
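One simple way to turn an IUPAC pattern into a PWM is sketched below. It assumes that each position spreads its probability evenly over the nucleotides its IUPAC code allows, smoothed by a pseudocount; how PEnG-motif actually initializes its PWMs is not stated here and may differ (for example, by weighting with kmer counts). The function name iupac_to_pwm and the pseudocount value are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_pwm(pattern, pseudocount=0.1):
    """Return one row of probabilities (order A, C, G, T) per position.

    Assumption: allowed nucleotides share the probability mass equally;
    the pseudocount keeps disallowed nucleotides from being exactly zero.
    """
    pwm = []
    for code in pattern:
        allowed = IUPAC[code]
        weights = [(1.0 if b in allowed else 0.0) + pseudocount for b in "ACGT"]
        total = sum(weights)
        pwm.append([w / total for w in weights])
    return pwm


if __name__ == "__main__":
    for row in iupac_to_pwm("CWCTAG"):
        print(" ".join(f"{p:.3f}" for p in row))
```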
In the final phase, an expectation-maximization algorithm sharpens the PWMs. PWMs that have strong detectable overlaps are merged to form longer PWMs. The resulting optimized PWMs are written to the output file in MEME format.
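For orientation, the sketch below renders PWMs in the minimal MEME motif format (version line, alphabet, background frequencies, then one letter-probability matrix per motif). It is an illustration only: the exact header fields and number formatting that PEnG-motif writes may differ, and the function name format_meme and the uniform default background are assumptions.

```python
def format_meme(motifs, background=None):
    """Render {name: pwm} as minimal MEME text.

    Each pwm is a list of rows, one per position, with the
    probabilities given in the order A, C, G, T.
    """
    if background is None:
        # Assumption: uniform background when none is supplied.
        background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    lines = [
        "MEME version 4",
        "",
        "ALPHABET= ACGT",
        "",
        "Background letter frequencies",
        " ".join(f"{b} {background[b]:.3f}" for b in "ACGT"),
        "",
    ]
    for name, pwm in motifs.items():
        lines.append(f"MOTIF {name}")
        lines.append(f"letter-probability matrix: alength= 4 w= {len(pwm)}")
        lines.extend(" ".join(f"{p:.6f}" for p in row) for row in pwm)
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    example = {"CW": [[0.0, 1.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.5]]}
    print(format_meme(example))
```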
Use --strand PLUS to avoid mixing the counts of reverse complemented base patterns.
The shoot_peng.py Python script can be used to annotate motifs according to how well they can distinguish the given sequences from randomly generated ones.
Benchmarking requires BaMMmotif2.
PEnG-motif can be modified and distributed under the GPL-3.0 License.
PEnG-motif uses shared code from BaMMmotif. Many thanks to the developers of BaMMmotif!