TGAC/KAT

Name: KAT

Owner: Earlham Institute

Description: The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.

Created: 2013-10-03 08:59:47.0

Updated: 2018-01-12 23:21:00.0

Pushed: 2018-01-15 15:45:14.0

Homepage: http://www.earlham.ac.uk/kat-tools

Size: 40649

Language: C++


README

KAT - The K-mer Analysis Toolkit

KAT is a suite of tools that analyse jellyfish hashes or sequence files (fasta or fastq) using kmer counts. The following tools are currently available in KAT:

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.
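To make the term "K-mer spectrum" concrete, here is a minimal Python sketch that counts k-mers in a couple of toy sequences and then tabulates how many distinct k-mers occur at each multiplicity. It is an illustration only and not KAT's implementation (KAT relies on Jellyfish hashes); the function names `kmer_counts` and `kmer_spectrum` and the toy reads are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer (substring of length k) across the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence shorter than k yields an empty range and adds nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


reads = ["ACGTACGT", "CGTACGTA"]  # toy data, not real reads
counts = kmer_counts(reads, k=4)
spectrum = kmer_spectrum(counts)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

Peaks in such a spectrum are what the distribution analysis mentioned above operates on; this sketch ignores canonical (strand-independent) k-mers for brevity.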

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

Installation

There are two ways to install KAT from source: by cloning the git repository, or by downloading a distributable package. The latter method is generally recommended, as it reduces the number of installation steps and the number of dependencies required on your system.

When installing from a distributable package, first confirm that the dependencies are installed and configured:

In addition, KAT can only produce plots if one of the following plotting engines is installed:

Then proceed with the following steps:

Should you wish to install from a cloned git repository instead, do the following:

The configure script can take several options as arguments. One commonly modified option is `--prefix`, which will install KAT to a custom directory. By default this is “/usr/local”, so the KAT executable would be found at “/usr/local/bin” by default. In addition, some options specific to managing KAT dependencies located in non-standard locations are:

Type `./configure --help` for full details.

As already mentioned, KAT can also make plots, but it requires external software to do so. The preferred plotting engine is python3 with the numpy, scipy and matplotlib packages installed. The python installation must come with the python shared library; on debian systems you can install this with “sudo apt-get install python3-dev”. If you don't already have python3 installed on your system, we recommend installing anaconda3, as this contains everything you need. Alternatively, you can use gnuplot, although the python plotting method is preferred and produces nicer results.
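Before running configure, it can be handy to check whether the python3 interpreter on your PATH can see the three plotting packages. The following is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check does not verify that the python shared library is present.

```python
import importlib.util

# Packages the README lists as required for python-based plotting.
REQUIRED_PACKAGES = ("numpy", "scipy", "matplotlib")


def missing_packages(packages=REQUIRED_PACKAGES):
    """Return the packages that this interpreter cannot locate."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing plotting packages:", ", ".join(missing))
else:
    print("All plotting packages found.")
```

Run it with the same python3 that configure will detect, otherwise the result may not reflect what KAT sees.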

The plotting engine is chosen when the configure script runs: it selects the first engine detected in the following order: python, gnuplot, none. There is currently no way to select a plotting engine installed in a custom location, so the plotting system needs to be properly installed and configured on your system, i.e. python3 or gnuplot must be available on the PATH.
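The detection order described above can be mimicked with a short Python sketch. This only mirrors the stated order (python, gnuplot, none) by looking for executables on the PATH; it is not the logic of the actual configure script, which presumably performs further checks. The function name `detect_plotting_engine` and its `which` parameter are assumptions made for illustration.

```python
import shutil


def detect_plotting_engine(which=shutil.which):
    """Return the first engine found on PATH, in the order python, gnuplot, none."""
    if which("python3"):
        return "python"
    if which("gnuplot"):
        return "gnuplot"
    return "none"


print(detect_plotting_engine())
```

Passing a custom `which` function makes the order easy to test without changing the real PATH.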

If sphinx is installed and detected on your system, then html documentation and man pages are built automatically during the build process. If it is not detected, this step is skipped. Should you wish to create a PDF version of the manual, you can do so by typing `make pdf`; this is not executed by default.

Operating Instructions

After KAT has been installed, the `kat` executable should be available; it provides a number of subtools.

Running `kat --help` will bring up a list of available tools within kat. To get help on any of these subtools, simply type `kat <subtool> --help`. For example, `kat sect --help` will show details on how to use the sequence coverage estimator tool.

KAT supports file globbing for input, which is particularly useful when counting and analysing k-mers for paired-end files. For example, assuming you have two files, LIB_R1.fastq and LIB_R2.fastq, in the current directory, then `kat hist -C -m27 LIB_R?.fastq` will consume any files matching the pattern LIB_R?.fastq as input, i.e. LIB_R1.fastq and LIB_R2.fastq. The same result can be achieved by listing the files at the command line: `kat hist -C -m27 LIB_R1.fastq LIB_R2.fastq`
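To preview which files a pattern such as LIB_R?.fastq would pick up, shell-style matching can be reproduced with Python's standard `fnmatch` module. This is a sketch under the assumption that KAT's globbing follows the usual shell wildcard rules; the extra file names in the list are invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing; only the first two names come from the README.
files = ["LIB_R1.fastq", "LIB_R2.fastq", "LIB_R10.fastq", "OTHER_R1.fastq"]
pattern = "LIB_R?.fastq"  # '?' matches exactly one character

matched = sorted(name for name in files if fnmatchcase(name, pattern))
print(matched)
```

Because `?` stands for a single character, a name like LIB_R10.fastq is not selected by this pattern.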

Note that the KAT comp subtool takes two or three groups of inputs as positional arguments, so we need to distinguish between the file groups. This is achieved by surrounding each glob pattern or file list in single quotes. For example, assuming we have LIB1_R1.fastq, LIB1_R2.fastq, LIB2_R1.fastq and LIB2_R2.fastq in the current directory, and we want to compare LIB1 against LIB2, then instead of catting the files together we might run either `kat comp -C -D 'LIB1_R?.fastq' 'LIB2_R?.fastq'` or `kat comp -C -D 'LIB1_R1.fastq LIB1_R2.fastq' 'LIB2_R1.fastq LIB2_R2.fastq'`. Both commands do the same thing.
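The role of the single quotes can be illustrated in Python: `shlex.split` tokenises a command line the way a POSIX shell would, so each quoted group arrives as a single argument. The helper `expand_group` below is hypothetical and only sketches how a group string might be turned into a file list; it is not taken from KAT's source.

```python
import shlex
from fnmatch import fnmatchcase


def expand_group(group, available):
    """Expand one quoted group (patterns or names separated by spaces)."""
    files = []
    for token in group.split():
        files.extend(sorted(name for name in available if fnmatchcase(name, token)))
    return files


available = ["LIB1_R1.fastq", "LIB1_R2.fastq", "LIB2_R1.fastq", "LIB2_R2.fastq"]

with_globs = shlex.split("kat comp -C -D 'LIB1_R?.fastq' 'LIB2_R?.fastq'")
with_lists = shlex.split(
    "kat comp -C -D 'LIB1_R1.fastq LIB1_R2.fastq' 'LIB2_R1.fastq LIB2_R2.fastq'"
)

# The last two arguments are the input groups; the quotes keep each one whole.
groups_a = [expand_group(g, available) for g in with_globs[-2:]]
groups_b = [expand_group(g, available) for g in with_lists[-2:]]
print(groups_a == groups_b)
```

Without the quotes, the shell would split or expand the names itself and the boundary between the two groups would be lost.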

Licensing

GNU GPL V3. See COPYING file for more details.

Cite

If you use KAT in your work and wish to cite us please use the following citation:

Daniel Mapleson, Gonzalo Garcia Accinelli, George Kettleborough, Jonathan Wright, and Bernardo J. Clavijo. KAT: A K-mer Analysis Toolkit to quality control NGS datasets and genome assemblies. Bioinformatics, 2016. doi: 10.1093/bioinformatics/btw663

Authors

See AUTHORS file for more details.

Acknowledgements

We would also like to thank the authors of Jellyfish: https://github.com/gmarcais/Jellyfish; and SeqAn: http://www.seqan.de/. Both are embedded inside KAT.

