BilkentCompGen/repeatnet

Name: repeatnet

Owner: Bilkent Bioinformatics and Computational Genomics Group

Description: RepeatNet: an ab initio centromeric sequence detection algorithm

Created: 2014-10-31 10:01:06.0

Updated: 2017-08-04 08:47:26.0

Pushed: 2017-08-04 08:47:43.0

Homepage: null

Size: 21

Language: C

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

repeatnet

RepeatNet: an ab initio centromeric sequence detection algorithm

Additional tools that should be installed

Compilation:

gcc repeatnet.c -o repeatnet -O3 -lm

input: FASTA format only. The pairs of reads (preferably fosmid; if not plasmid) should be named as:

pair1.FORWARD.1 pair1.REVERSE.1

pair2.FORWARD.1 pair2.REVERSE.1

the “pair1/pair2” part doesn't matter, but FORWARD/REVERSE part is used to mark the pairs; of course the “pair1/pair2” part of the pairs must match.

then run : repeatnet -i test

this will generate a bunch of files: test.h11.dump, test.h11.names, test.h11.winlog

then

repeatnet -loadwin test.h11.winlog -m -a 100 -c 100 -compare

-a 100 and -c 100 removes the vertices with less than 100 occurances in the graph, the edges with <100 weight value (co-occurance frequency between pairs of vertics). Lower values will give more connected and noisy graphs, higher values will “clean up” the graph more.

this will generate a matrix and viz file. To visualize the graph, use the graphviz package:

neato -Tps -o test.h11.winlog.h11.cut100.e10000.merged.eps test.h11.winlog.h11.cut100.e10000.merged.viz

then to divide into connected components:

repeatnet -loadmatrix test.h11.winlog.h11.cut100.e10000.merged.matrix -loadnames test.h11.names -components

pick the largest component first (by file size; or any other interesting-looking one from the graph). in my test, it is component-30. then “encode” the kmers back into numbers:

for i in cut -f 1 component-30.txt; do repeatnet -encode $i; done > component-30.ids

this will calculate the hash values (or vertex id's) for the kmers.

then:

repeatnet -loadwin test.h11.winlog -loadnames test.h11.names -clones component-30.id

will generate a file component-30.ids.clones that will have names of the sequences (pairs) that are likely to contain satellite. This file might be redundant, so run a “sort -u " on it. Fetch the sequences back from your original fasta file, and run phrap first; then run TRF on the contigs.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.