ratatosk is a library of luigi tasks, currently focused on, but not limited to, common bioinformatics tasks.


It is recommended that you first create a virtual environment in which to install the packages. Install virtualenvwrapper and use mkvirtualenv to create a virtual environment.


To install the development version of ratatosk, do

git clone
python setup.py develop

To begin with, you may need to install Tornado and Pygraphviz (see Luigi for further information).

The tests depend on the following software to run:

  1. bwa
  2. samtools
  3. GATK - set an environment variable GATK_HOME to point to your installation path
  4. picard - set an environment variable PICARD_HOME to point to your installation path
  5. cutadapt - install with pip install cutadapt
  6. fastqc
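
Before running the tests it can help to check that these dependencies are visible. The following is a minimal sketch using only the Python standard library. The function name check_test_dependencies is hypothetical (it is not part of ratatosk), and the executable names are assumed to match the program names in the list above.

```python
import os
import shutil

# Executables and environment variables taken from the list above.
# The executable names are assumed to match the program names.
EXECUTABLES = ["bwa", "samtools", "cutadapt", "fastqc"]
ENV_VARS = ["GATK_HOME", "PICARD_HOME"]


def check_test_dependencies(environ=None, which=shutil.which):
    """Return a list of human-readable problems (empty if all is well)."""
    environ = os.environ if environ is None else environ
    problems = []
    for exe in EXECUTABLES:
        if which(exe) is None:
            problems.append("executable not found on PATH: " + exe)
    for var in ENV_VARS:
        if not environ.get(var):
            problems.append("environment variable not set: " + var)
    return problems


if __name__ == "__main__":
    for problem in check_test_dependencies():
        print(problem)
```

The environ and which arguments exist only so the check can be exercised without touching the real environment.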

You also need to install the test data set:

git clone
python setup.py develop

Note that you must use develop.

Running the tests

Change to the luigi test directory (tests) and run

nosetests -v -s

To run a given task (e.g. TestLuigiWrappers.test_fastqln), do

nosetests -v -s

Task visualization and tabulation

By default, the tests use a local scheduler, implemented in luigi. For production purposes, there is also a central planner. Among other things, it allows for visualization of the task flow by using Tornado and Pygraphviz. Results are displayed at http://localhost:8081 and “collected” at http://localhost:8082/api/graph.

In addition, I have extended the luigi daemon and server code to generate a table representation of the tasks (in http://localhost:8083). The aim here would be to define a grouping function that groups task lists according to a given feature (e.g. sample, project).
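
A grouping function of the kind described could look roughly like this. It is a minimal sketch, not the code used by the daemon: the function name group_tasks and the dictionary-based task records (including the task and sample names) are hypothetical.

```python
from collections import defaultdict


def group_tasks(tasks, key):
    """Group task records by the value returned by key(task)."""
    groups = defaultdict(list)
    for task in tasks:
        groups[key(task)].append(task)
    return dict(groups)


# Hypothetical task records; the real scheduler stores richer objects.
tasks = [
    {"name": "BwaAln", "sample": "P001_101_index3", "status": "DONE"},
    {"name": "BwaSampe", "sample": "P001_101_index3", "status": "PENDING"},
    {"name": "BwaAln", "sample": "P001_102_index6", "status": "DONE"},
]

by_sample = group_tasks(tasks, key=lambda t: t["sample"])
for sample, members in sorted(by_sample.items()):
    print(sample, [t["name"] for t in members])
```

Passing a different key (for example one that returns a project name) regroups the same task list without changing the function.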

In order to view tasks, run

bin/ratatoskd &

in the background, set the PYTHONPATH to the current directory and run the tests:

PYTHONPATH=. nosetests -v -s

Examples in tests

NOTE: these are not yet real unit tests, since in some cases they depend on each other. See issues.

These examples are currently based on the tests in ratatosk.tests.test_wrapper.

Creating file links

The task ratatosk.lib.files.fastq.FastqFileLink creates a link from source to a target. The source in this case depends on an external task (ratatosk.lib.files.external.FastqFile), meaning this file was created by some outside process (e.g. a sequencing machine).

nosetests -v -s


A couple of comments are warranted. First, the boxes show tasks, where the FastqFile is an external task. The file it points to must exist before the task FastqFileLink executes. The color of the box indicates status; here, green means the task has completed successfully. Second, every task has its own set of options that can be passed via the command line or in the code. In the FastqFileLink task box we can see the options that were passed to the task. For instance, the option use_long_names=True prints complete task names, as shown above.
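
The rule that the external file must exist before the link is made can be sketched in plain Python. This is an illustration only, not the FastqFileLink implementation: link_file is a hypothetical helper, the file names are made up, and the use of a symbolic link is an assumption.

```python
import os
import tempfile


def link_file(source, target):
    """Create a symbolic link at target pointing to source.

    Mirrors the rule described above: the external source file must
    already exist, otherwise the link step refuses to run.
    """
    if not os.path.exists(source):
        raise FileNotFoundError("external file is missing: " + source)
    if not os.path.lexists(target):
        os.symlink(os.path.abspath(source), target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "sample_1.fastq.gz")
    target = os.path.join(tmp, "link_1.fastq.gz")
    open(source, "w").close()  # stand-in for a file from the sequencer
    link_file(source, target)
    linked = os.path.islink(target)
    try:
        link_file(os.path.join(tmp, "missing.fastq.gz"), target)
        refused = False
    except FileNotFoundError:
        refused = True
```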

Alignment with bwa sampe

Here's a more useful example: paired-end alignment using bwa.

nosetests -v -s


Wrapping up metrics tasks

The class subclasses ratatosk.job.JobWrapperTask, which can be used to require that several tasks have completed. Here I've used it to group picard metrics tasks:

nosetests -v -s


Here, I've set the option --use-long-names to False, which changes the output to show only the class names for each task. This example utilizes a configuration file that links tasks together. More about that in the next example.
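
The idea behind a wrapper task, that it produces nothing itself and is complete once everything it requires is complete, can be sketched without luigi. The classes below are simplified stand-ins written for this illustration; they are not luigi.Task or ratatosk.job.JobWrapperTask, and the name MetricsWrapper is hypothetical.

```python
class Task:
    """Minimal stand-in for a task with requirements (not luigi.Task)."""

    def __init__(self, name, done=False):
        self.name = name
        self.done = done

    def requires(self):
        return []

    def complete(self):
        return self.done


class WrapperTask(Task):
    """Has no output of its own; complete when all requirements are."""

    def __init__(self, name, children):
        super().__init__(name)
        self.children = children

    def requires(self):
        return list(self.children)

    def complete(self):
        return all(child.complete() for child in self.requires())


metrics = [Task("InsertMetrics", done=True), Task("HsMetrics", done=False)]
wrapper = WrapperTask("MetricsWrapper", metrics)
before = wrapper.complete()  # one requirement is still pending
metrics[1].done = True
after = wrapper.complete()   # all requirements are done now
```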

Working with parent tasks and configuration files

All tasks have a default requirement, which I call parent_task. In the current implementation, all tasks subclass ratatosk.job.JobTask, which provides a parent_task class variable. This variable can be changed, either at the command line (option --parent-task) or in a configuration file. The parent_task variable is a string representing a class in a python module, so it can point to any python class of your choice. In addition to the parent_task variable, JobTask provides variables _config_section and _config_subsection that point to sections and subsections in the config file, which should be in yaml format (see google app for nicely structured config files). All metrics functions share a default parent class, which can easily be modified in the config file to:

    targets: targets.interval_list
    baits: targets.interval_list

    parent_task: ratatosk.lib.align.BwaSampe

Note also that InputBamFile has been changed to depend on (default value is ratatosk.lib.files.external.BamFile).
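
Since parent_task is just a string naming a class, it has to be turned into the class itself at some point. A minimal sketch of such a lookup with the standard library is shown below. The helper name resolve_class is hypothetical, and a standard-library class stands in for a ratatosk task so the example runs without ratatosk installed.

```python
import importlib


def resolve_class(dotted_path):
    """Import and return the object named by 'package.module.Name'."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.Class', got: " + dotted_path)
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# A standard-library class stands in for a ratatosk task here.
cls = resolve_class("collections.OrderedDict")
print(cls.__name__)
```

A misspelled module raises ImportError and a misspelled class raises AttributeError, so a bad parent_task string fails early rather than at run time.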

Examples with

The installation procedure will install an executable script in your search path. The script collects all tasks currently available in the ratatosk modules:

usage: [-h] [--config-file CONFIG_FILE] [--dry-run] [--lock]
                       [--workers WORKERS] [--lock-pid-dir LOCK_PID_DIR]
                       [--scheduler-host SCHEDULER_HOST]
                       [--restart-from RESTART_FROM]
                       [--custom-config CUSTOM_CONFIG] [--print-config]
                       [--use-long-names] [--local-scheduler] [--restart]


To run a specific task, you use one of the positional arguments. In this way, it works much like a Makefile. A make command resolves dependencies based on the desired target file name, so you would do make target to generate target. With ratatosk, the target is passed via the --target option. For instance, to run BwaSampe you would do:

BwaSampe \
  --target target.bam \
  --config-file config/ratatosk.yaml

Here I've used a 'global' config file ratatosk.yaml. You actually don't need to pass it as in the example above as it's loaded by default.

The following examples assume you run the command from the ngs_test_data/data/projects/J.Doe_00_01 directory, and that ratatosk is installed at ~/opt.

Dry run

The --dry-run option will resolve dependencies but not actually run anything. In addition, it will print the tasks that will be called. By passing a target

RawIndelRealigner \
  --target P001_101_index3/P001_101_index3_TGACCA_L001.trimmed.sync.sort.merge.realign.bam \
  --custom-config ~/opt/ratatosk/examples/J.Doe_00_01.yaml --dry-run

we get the dependencies as specified in the config file:


The task RawIndelRealigner is defined in ratatosk.pipeline.haloplex and is a modified version of It is used for analysis of HaloPlex data.
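
What a dry run does, resolving the dependency chain and reporting it without executing anything, can be sketched as follows. This is a simplified model and not the luigi scheduler: the functions resolve and run, the task names and the dependency table are all hypothetical.

```python
def resolve(task, requires, seen=None, order=None):
    """Depth-first walk that lists dependencies before dependents."""
    seen = set() if seen is None else seen
    order = [] if order is None else order
    if task in seen:
        return order
    seen.add(task)
    for parent in requires.get(task, []):
        resolve(parent, requires, seen, order)
    order.append(task)
    return order


def run(task, requires, actions, dry_run=False):
    """Run tasks in dependency order, or only report them if dry_run."""
    executed = []
    for name in resolve(task, requires):
        if dry_run:
            print("would run:", name)
        else:
            actions[name]()
            executed.append(name)
    return executed


# Hypothetical dependency table; in ratatosk it follows from parent_task.
requires = {"Realign": ["Merge"], "Merge": ["Sort"], "Sort": []}
log = []
actions = {name: (lambda n=name: log.append(n)) for name in requires}

dry = run("Realign", requires, actions, dry_run=True)  # nothing executed
wet = run("Realign", requires, actions)                # executes all three
```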

Merging samples over several runs

Sample P001_101_index3 has data from two separate runs that should be merged. The class merges sample_run files and places the result in the sample directory. The implementation currently depends on the directory structure 'sample/fc1', 'sample/fc2' etc.

MergeSamFiles \
  --target P001_101_index3/P001_101_index3_TGACCA_L001.sort.merge.bam \
  --config-file ~/opt/ratatosk/examples/J.Doe_00_01.yaml

results in


Note that in this implementation the merged files end up directly in the sample directory (i.e. P001_101_index3).
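
To illustrate the dependence on the directory structure, here is a minimal sketch of how run-level files could be collected for a merged target. It is not the ratatosk implementation: find_run_files is a hypothetical helper, and it assumes that run-level files share the merged file's name minus a ".merge" label.

```python
import glob
import os
import tempfile


def find_run_files(target, merge_label=".merge"):
    """List run-level files that would be merged into target.

    Assumes the layout sample/<run>/<file> and that the merged file is
    named like the run-level files plus merge_label before the suffix.
    """
    sample_dir, filename = os.path.split(target)
    stem, ext = os.path.splitext(filename)
    if not stem.endswith(merge_label):
        raise ValueError("target lacks the merge label: " + target)
    source_name = stem[: len(stem) - len(merge_label)] + ext
    return sorted(glob.glob(os.path.join(sample_dir, "*", source_name)))


with tempfile.TemporaryDirectory() as tmp:
    for run_dir in ("fc1", "fc2"):
        os.makedirs(os.path.join(tmp, "sample", run_dir))
        open(os.path.join(tmp, "sample", run_dir, "reads.sort.bam"), "w").close()
    target = os.path.join(tmp, "sample", "reads.sort.merge.bam")
    found = [os.path.relpath(p, tmp) for p in find_run_files(target)]
```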

Adding adapter trimming

Changing the following configuration section (see J.Doe_00_01_trim.yaml):

    parent_task: ratatosk.lib.utils.cutadapt.CutadaptJobTask

    parent_task: ratatosk.lib.utils.misc.ResyncMatesJobTask

and running

MergeSamFiles \
    --target P001_101_index3/P001_101_index3_TGACCA_L001.trimmed.sync.sort.merge.bam \
    --config-file ~/opt/ratatosk/examples/J.Doe_00_01_trim.yaml

runs the same pipeline as before, but on adapter-trimmed data.


Extending workflows with subclassed tasks

It's dead simple to add tasks of a given type. Say you want to calculate hybrid selection on bam files that have and haven't had duplicates marked. By subclassing an existing task and giving the new class its own configuration file location, you can configure the new task to depend on whatever you want. I have added the following class:

class HsMetricsNonDup(HsMetrics):
    """Run on non-deduplicated data"""
    _config_subsection = "hs_metrics_non_dup"
    parent_task = luigi.Parameter(default="")

and a picard metrics wrapper task

class PicardMetricsNonDup(JobWrapperTask):
    """Runs hs metrics on both duplicated and de-duplicated data"""
    def requires(self):
        return [InsertMetrics( + str(InsertMetrics.target_suffix.default[0])),
                HsMetrics( + str(HsMetrics.target_suffix.default)),
                HsMetricsNonDup(target=rreplace(, str(DuplicationMetrics.label.default), "", 1) + str(HsMetrics.target_suffix.default)),
                AlignmentMetrics( + str(AlignmentMetrics.target_suffix.default))]
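
The rreplace call in the wrapper strips the duplication label from the target name, so that the non-deduplicated metrics are computed on the file from the step before duplicate marking. The sketch below shows the idea. The rreplace defined here is an assumed re-implementation (the real one lives in ratatosk.utils), and the label ".dup" and suffix ".hs_metrics" are assumed values chosen for illustration.

```python
def rreplace(text, old, new, count):
    """Replace the last count occurrences of old in text with new."""
    return new.join(text.rsplit(old, count))


target = "P001_101_index3_TGACCA_L001.sort.merge.dup"
dup_label = ".dup"       # assumed value of the duplication label
suffix = ".hs_metrics"   # assumed target suffix

dedup_target = target + suffix
nondup_target = rreplace(target, dup_label, "", 1) + suffix
print(dedup_target)
print(nondup_target)
```

Replacing from the right matters when the label text could also occur earlier in the file name; only the last occurrence is removed.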

The picard configuration section in the configuration file J.Doe_00_01_nondup.yaml now has a new subsection:


Running

PicardMetricsNonDup \
  --target P001_101_index3/P001_101_index3_TGACCA_L001.sort.merge.dup \
  --config-file ~/opt/ratatosk/examples/J.Doe_00_01_nondup.yaml

adds a hybrid selection calculation on the non-deduplicated bam file for sample P001_101_index3:


Best practice pipelines

The user can modify execution order of tasks by customising the parent_task attribute. However, some workflows should be immutable, thereby representing “standard” or “best-practice” pipelines. This is currently achieved by treating some tasks differently. For instance, when the task HaloPlex is called, the following code is executed in

if task == "HaloPlex":
    args = sys.argv[2:] + ['--config-file', config_dict['haloplex']]
    luigi.run(args, main_task_cls=ratatosk.pipeline.haloplex.HaloPlex)

where config_dict['haloplex'] points to predefined config files located in the ratatosk/config folder. Best practice pipeline classes are currently located in ratatosk.pipeline. For a pipeline to run, the final targets have to be calculated. This is currently done by providing a function in the configuration that the pipeline will load in the set_target_generator_function. For instance, the corresponding configuration section in the example configuration file J.Doe_00_01.yaml is

target_generator_function: test.site_functions.target_generator

In contrast to parent_task, there is no default function to fall back on, so not providing this function will result in an error.
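
For orientation, a target generator function might look something like the sketch below. Everything here is an assumption made for illustration: the directory layout indir/<sample>/<run>/, the tuple shape (sample, run, prefix) and the suffix "sort.bam". Only the use of the third tuple element to build target names follows the pipeline code in this section.

```python
import os
import tempfile


def target_generator(indir):
    """Return (sample, run, prefix) tuples for run directories in indir.

    Assumes the layout indir/<sample>/<run>/; prefix is the path
    without suffix that final target names are built from.
    """
    targets = []
    for sample in sorted(os.listdir(indir)):
        sample_dir = os.path.join(indir, sample)
        if not os.path.isdir(sample_dir):
            continue
        for run in sorted(os.listdir(sample_dir)):
            if os.path.isdir(os.path.join(sample_dir, run)):
                prefix = os.path.join(sample_dir, run, sample)
                targets.append((sample, run, prefix))
    return targets


with tempfile.TemporaryDirectory() as indir:
    os.makedirs(os.path.join(indir, "P001", "fc1"))
    os.makedirs(os.path.join(indir, "P001", "fc2"))
    final_target_suffix = "sort.bam"  # hypothetical suffix
    names = [
        "{}.{}".format(os.path.relpath(x[2], indir), final_target_suffix)
        for x in target_generator(indir)
    ]
```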

Incidentally, this demonstrates the boilerplate code needed to add a new predefined pipeline. In, add

'bestpractice' : os.path.join(ratatosk.__path__[0], os.pardir, "config", "bestpractice.yaml"),

and in ratatosk.pipeline.bestpractice

class BestPractice(PipelineTask):

    def requires(self):
        tgt_fun = self.set_target_generator_function()
        # Need to pass the class to tgt_fun
        target_list = tgt_fun(self.indir, ...)
        targets = ["...".format(x[2], self.final_target_suffix) for x in target_list]
        return [FinalTarget(target=tgt) for tgt in targets]

This feature is likely to change soon. Among other things, it would be nice to dynamically generate target names based on task labels.

If a pipeline config has been loaded, but the user nevertheless wants to change program options, the --custom-config flag can be used. Note then that updating parent_task is disabled so that program execution order cannot be changed - after all, it is a fixed pipeline. This allows for project-specific configuration files that contain metadata information about the project itself, as well as allowing for configurations of analysis options.
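
One way to picture this behaviour is a recursive dictionary update that skips parent_task keys. The sketch below is an illustration of that idea and not the ratatosk implementation; update_config, its allow_parent_task argument and the configuration values are hypothetical.

```python
import copy


def update_config(base, custom, allow_parent_task=False):
    """Return base updated with custom, recursing into nested dicts.

    With allow_parent_task=False, any 'parent_task' key in custom is
    ignored so the execution order of a fixed pipeline is kept.
    """
    merged = copy.deepcopy(base)
    for key, value in custom.items():
        if key == "parent_task" and not allow_parent_task:
            continue
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = update_config(merged[key], value, allow_parent_task)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical pipeline and project-specific configurations.
pipeline = {"bwa": {"sampe": {"parent_task": "pipeline.Fixed", "options": None}}}
custom = {"bwa": {"sampe": {"parent_task": "my.module.Other", "options": "-n 2"}}}
result = update_config(pipeline, custom)
```

The program options from the custom file are applied, while the parent_task of the predefined pipeline is left untouched.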

Basic align seqcap pipeline

Here is an example of a basic align seqcap pipeline.

AlignSeqcap --indir ~/opt/ngs_test_data/data/projects/J.Doe_00_01 \
    --custom-config ~/opt/ratatosk/examples/J.Doe_00_01.yaml


HaloPlex calling pipeline

Here's an example of a variant calling pipeline defined for analysis of HaloPlex data:

HaloPlex --indir ~/opt/ngs_test_data/data/projects/J.Doe_00_01 \
  --workers 4 --custom-config ~/opt/ratatosk/examples/J.Doe_00_01.yaml

resulting in


Blue boxes mean active processes (the command was run with --workers 4). Note that we need to know what labels are applied to the file name (see issues). In this iteration, for the predefined pipelines the file names have been hardcoded.


The implementation is still under heavy development and testing, so expect many changes in the near future.

Basic job task

ratatosk.job defines, among other things, a default shell job runner, which is a wrapper for running tasks in shell, and a base job task that subclasses luigi.Task. The base job task implements a couple of functions that are essential for general behaviour:

Program modules

ratatosk submodules are named after the application/program to be run (e.g. ratatosk.lib.align.bwa for bwa). For consistency, the modules should contain

  1. a job runner that subclasses ratatosk.job.DefaultShellJobRunner. The runner specifies how the program is run

  2. input file task(s) that subclass ratatosk.job.JobTask and that depend on external tasks in ratatosk.lib.files.external. The idea is that all acceptable file formats be defined as external inputs, and that parent tasks therefore must use one/any of these inputs

  3. a main job task that subclasses ratatosk.job.JobTask and has as default parent task one of the inputs (previous point). The _config_section should be set to the module name (e.g. bwa for ratatosk.lib.align.bwa). It should also return the job runner defined in 1.

  4. tasks that subclass the main job task. The _config_subsection should represent the task name in some way (e.g. aln for the bwa aln command)

  5. possibly wrapper tasks that group common tasks in a module
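
The job runner in point 1 is what turns a task's executable, subprogram, options and arguments into a shell command. A minimal sketch of that step is given below. It is not DefaultShellJobRunner: build_command and run_command are hypothetical helpers, and no shell quoting is attempted.

```python
import subprocess


def build_command(exe, main=None, opts=None, args=()):
    """Join executable, subprogram, options and arguments into one line."""
    parts = [exe]
    if main:
        parts.append(main)
    if opts:
        parts.append(opts)
    parts.extend(str(a) for a in args)
    return " ".join(parts)


def run_command(command):
    """Run the command line in a shell and return (exit code, stdout)."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout


# Only builds the string; bwa does not need to be installed for this.
cmd = build_command(
    "bwa", main="aln", opts="-t 4",
    args=["ref.fa", "reads.fastq.gz", ">", "reads.sai"],
)
print(cmd)

# A harmless command to show the runner itself.
code, out = run_command(build_command("echo", args=["hello"]))
```

Running through a shell is what makes a redirection such as ">" in the argument list work, at the cost of having to trust the strings that go into the command.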

Configuration parser

Python's standard configuration parser works on .ini files allowing section levels followed by customizations. It would be nice to have at least sections and subsections (python's ConfigObj does this), but since I prefer yaml files, I have implemented a config parser that enforces sections and subsections, treating everything below that level as lists/dicts/variables.
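
The section/subsection rule can be illustrated with a plain dictionary standing in for a parsed yaml file. This is a sketch only and not the ratatosk config parser: get_option and validate are hypothetical helpers, and the section names "picard" and "hs_metrics" are an assumed layout.

```python
def get_option(config, section, subsection, option, default=None):
    """Look up config[section][subsection][option] with a fallback."""
    if not isinstance(config.get(section), dict):
        return default
    sub = config[section].get(subsection)
    if not isinstance(sub, dict):
        return default
    return sub.get(option, default)


def validate(config):
    """Check that the two top levels are mappings (section/subsection)."""
    for section, subsections in config.items():
        if not isinstance(subsections, dict):
            raise ValueError("section %r must contain subsections" % section)
        for name, body in subsections.items():
            if not isinstance(body, dict):
                raise ValueError(
                    "subsection %r.%r must be a mapping" % (section, name)
                )


# Dict literal standing in for a parsed yaml file (assumed layout).
config = {
    "picard": {
        "hs_metrics": {
            "targets": "targets.interval_list",
            "baits": "targets.interval_list",
        }
    }
}

validate(config)
targets = get_option(config, "picard", "hs_metrics", "targets")
missing = get_option(config, "picard", "no_such_task", "targets", default="n/a")
```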

HOWTO: Adding task wrappers

In essence, ratatosk is a library of program wrappers. There are already a couple of wrappers available, but many more could easily be added. Here is a short HOWTO on how to add a wrapper module myprogram.

1. Create the file

Create the file (doh!), with at least the following imports:

import os
import luigi
from ratatosk.job import JobTask, DefaultShellJobRunner
from ratatosk.utils import rreplace

2. Add job runners

At the very least, there should exist the following:

class MyProgramJobRunner(DefaultShellJobRunner):
    pass

This is in part for consistency, in part in case the myprogram program group needs special handling of command construction (see e.g.

3. Add default inputs

There should be at least one input class that subclasses one of the ratatosk.lib.files.external classes. Mainly here for naming consistency.

class InputFastqFile(JobTask):
    _config_section = "myprogram"
    _config_subsection = "InputFastqFile"
    target = luigi.Parameter(default=None)
    parent_task = luigi.Parameter(default="ratatosk.lib.files.external.FastqFile")

    def requires(self):
        cls = self.set_parent_task()
        return cls(
    def output(self):
        return luigi.LocalTarget(
    def run(self):
        pass

4. Add wrapper tasks

Once steps 1-3 are done, tasks can be added. If the program has subprograms (e.g. bwa aln), it is advisable to create a generic 'top' job task. In any case, a task should at least consist of the following:

class MyProgram(JobTask):
    # Corresponding section and subsection in config file
    _config_section = "myprogram"
    _config_subsection = "myprogram_subsection"
    # Name of executable. This is a parameter so the user can specify
    # the version
    executable = luigi.Parameter(default="myprogram")
    # Name of sub_executable.
    sub_executable = luigi.Parameter(default="my_subprogram")
    # program options
    options = luigi.Parameter(default=None)
    parent_task = luigi.Parameter(default="myprogram.InputFastqFile")
    # Target and source suffixes are necessary for generating target
    # names
    target_suffix = luigi.Parameter(default=".sai")
    source_suffix = luigi.Parameter(default=".fastq.gz")
    # Add label if this task should add label to file name (e.g.
    # file.txt -> file.label.txt)
    label = luigi.Parameter(default="label")

    # Must be present
    def job_runner(self):
        return MyProgramJobRunner()

    # Here gather the *required* arguments to 'myprogram'. Often input
    # redirected to output suffices
    def args(self):
        return [self.input(), ">", self.output()]

    # The following functions are inherited from JobTask and changing
    # their behaviour is often not necessary

    # For single requirements, the BaseJobTask function often
    # suffices. For more complex requirements, a reimplementation is
    # needed. Idea is to generate the source name of the parent class
    # that was used to generate the target
    #def requires(self):
    #    cls = self.set_parent_task()
    #    source = self._make_source_file_name()
    #    return cls(target=source)

    #def exe(self):
    #    """Executable of this task"""
    #    return self.executable

    # Subprogram name, e.g. 'aln' in 'bwa aln'
    #def main(self):
    #    return self.sub_executable

    # Returns the options string. This may need a lot of tampering
    # with, see e.g. 'ratatosk.gatk.VariantEval' (but see also comment
    # in issues)
    #def opts(self):
    #    return self.options

    # Output = target
    #def output(self):
    #    return luigi.LocalTarget(

Note that in many cases you only have to reimplement job_runner and args, and in some cases the requires function.
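
The role of target_suffix, source_suffix and label in generating names can be sketched as a small function that derives the parent's file name from the requested target. This is a hypothetical re-implementation written for illustration; it is not the _make_source_file_name method from ratatosk.job, and the ".sort" label in the usage is an assumed value.

```python
def make_source_file_name(target, target_suffix, source_suffix, label=""):
    """Derive the parent's output name from this task's target name.

    Strips target_suffix (and the optional label before it) from the
    end of target and appends source_suffix.
    """
    if not target.endswith(target_suffix):
        raise ValueError(
            "target %r does not end with %r" % (target, target_suffix)
        )
    stem = target[: len(target) - len(target_suffix)]
    if label and stem.endswith(label):
        stem = stem[: len(stem) - len(label)]
    return stem + source_suffix


# Suffixes taken from the MyProgram defaults: .sai is made from .fastq.gz
sai_source = make_source_file_name("reads.sai", ".sai", ".fastq.gz")
# A task that adds a label: reads.sort.bam is made from reads.bam
bam_source = make_source_file_name("reads.sort.bam", ".bam", ".bam", label=".sort")
```

Walking this rule backwards from the final target is what lets a chain of tasks be resolved from a single file name, much as make does.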

To actually run the task, you need to import the module in your script, and luigi will automagically add the task MyProgram and its options.
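
Importing a module is enough because luigi keeps track of task classes as they are defined. The following is a simplified stand-in for that idea using only the standard library; it is not luigi's implementation, and the Task base class and get_task helper are hypothetical.

```python
class Task:
    """Base class that records every subclass when it is defined."""

    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Task.registry[cls.__name__] = cls


class MyProgram(Task):
    """Defining (or importing) the class is enough to register it."""


def get_task(name):
    """Look up a task class by name, as a command line script might."""
    try:
        return Task.registry[name]
    except KeyError:
        raise KeyError("unknown task: " + name) from None


found = get_task("MyProgram")
```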

TODO/future ideas/issues

See issue list for a complete list. Some of the most pressing issues to fix include

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.