AAFC-BICoE/blackbox-pipeline

Name: blackbox-pipeline

Owner: Biological Informatics CoE @ Agriculture and Agri-Food Canada

Description: Genome assembly and metadata collection

Created: 2016-02-23 14:58:34.0

Updated: 2016-03-04 23:24:17.0

Pushed: 2016-08-11 20:19:52.0

Homepage:

Size: 108

Language: Python

README

Blackbox Pipeline for genome assembly

Background
Introduction

This is an automated genome assembly pipeline designed to run best inside a Docker container. It gives researchers a standard, simplified workflow for microbial genome assembly while recording the executed commands and their parameters, program version numbers, and associated sample metadata, all of which are needed for experimental reproducibility and traceability.

This pipeline uses SPAdes as the genome assembler and therefore works best with bacterial or smaller fungal genomes (< 100 Mb). MiSeq raw reads and metadata files are required as input; archived files from BaseSpace (e.g. analysis_14348334_fastq.zip) can be used instead. In addition to the forward and reverse read files, the following files are required:

  1. GenerateFASTQRunStatistics.xml
  2. RunInfo.xml
  3. SampleSheet.csv

These files are located in the run's subfolder of the MiSeqOutput directory on the MiSeq onboard computer (e.g. 140922_M02466_0030_000000000-AARWU). The folder name consists of the date (140922), the MiSeq designation (M02466), and the flowcell number (000000000-AARWU). The fastq files are located in the ../MiSeqOutput/140922_M02466_0030_000000000-AARWU/Data/Intensities/BaseCalls folder.
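The folder naming convention can be illustrated with a short parser. This is only a sketch: `RunFolder` and `parse_run_folder` are hypothetical names, the date field is assumed to be in YYMMDD form, and the third underscore-separated field (0030 in the example) is not described in this README, so it is kept as an opaque string.

```python
from datetime import date, datetime
from typing import NamedTuple


class RunFolder(NamedTuple):
    """Fields of a MiSeq run folder name (hypothetical helper type)."""
    run_date: date
    instrument: str   # MiSeq designation, e.g. M02466
    third_field: str  # not described in the README; kept as-is
    flowcell: str     # flowcell number, e.g. 000000000-AARWU


def parse_run_folder(name: str) -> RunFolder:
    """Split a run folder name into its underscore-separated fields."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected run folder name: {name!r}")
    raw_date, instrument, third_field, flowcell = parts
    # Assumption: the date is encoded as YYMMDD (140922 -> 2014-09-22).
    run_date = datetime.strptime(raw_date, "%y%m%d").date()
    return RunFolder(run_date, instrument, third_field, flowcell)


run = parse_run_folder("140922_M02466_0030_000000000-AARWU")
print(run.run_date, run.instrument, run.flowcell)
```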

Copy the forward and reverse fastq files, GenerateFASTQRunStatistics.xml, RunInfo.xml and SampleSheet.csv to a different working location (e.g. ../Sequencing/user_name/project_name/genome_assembly/140922).
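The copy step could be scripted along the following lines. This is a minimal sketch rather than part of the pipeline: `stage_run_files` is a hypothetical helper, and the `*.fastq*` pattern assumes the reads are named with a .fastq or .fastq.gz extension.

```python
import shutil
from pathlib import Path

# The three metadata files listed above.
REQUIRED_METADATA = (
    "GenerateFASTQRunStatistics.xml",
    "RunInfo.xml",
    "SampleSheet.csv",
)


def stage_run_files(run_dir, work_dir):
    """Copy the metadata files and fastq reads of one run into work_dir."""
    run_dir, work_dir = Path(run_dir), Path(work_dir)

    # Fail early if any required metadata file is absent from the run folder.
    missing = [n for n in REQUIRED_METADATA if not (run_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing metadata files: {', '.join(missing)}")

    # Reads live under Data/Intensities/BaseCalls inside the run folder.
    basecalls = run_dir / "Data" / "Intensities" / "BaseCalls"
    fastqs = sorted(basecalls.glob("*.fastq*"))  # assumed naming pattern
    if not fastqs:
        raise FileNotFoundError(f"no fastq files found in {basecalls}")

    work_dir.mkdir(parents=True, exist_ok=True)
    sources = [run_dir / n for n in REQUIRED_METADATA] + fastqs
    # copy2 preserves timestamps, which can help when tracing a run later.
    return [Path(shutil.copy2(src, work_dir / src.name)) for src in sources]
```

Checking for the metadata files before copying anything avoids leaving a half-populated working directory behind.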

Contents

This pipeline includes a main script (MBBSpades) that executes the following helper modules:

Installation
Requirements
Docker
Outputs
  1. Assembled contigs are collected in the 'BestAssemblies' folder
  2. Reports in JSON format are located in the genome folder with the suffix _metadata.json
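After a run, the JSON reports can be gathered by their filename suffix. The sketch below assumes only what is stated above (files ending in _metadata.json); `load_metadata_reports` is a hypothetical helper, the internal structure of the reports is not assumed, and using the filename prefix as the key is an illustrative choice.

```python
import json
from pathlib import Path

SUFFIX = "_metadata.json"


def load_metadata_reports(root):
    """Return {filename prefix: parsed JSON} for every report under root."""
    reports = {}
    for path in sorted(Path(root).rglob(f"*{SUFFIX}")):
        with path.open(encoding="utf-8") as handle:
            # Key each report by the part of the filename before the suffix.
            reports[path.name[: -len(SUFFIX)]] = json.load(handle)
    return reports
```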
Usage
usage: MBBSpades [-h] [-v] [-n numreads] [-t threads] [-o] [-F]
                 [-d destinationfastq] [-m miSeqPath] [-f miseqfolder]
                 [-r1 readLengthForward] [-r2 readLengthReverse]
                 [-r referenceFilePath] [-k kmerRange]
                 [-c customSampleSheet] [-b] [--clade CLADE]
                 [--itsx ITSX] [--trimoff]
                 path

Assemble genomes from Illumina fastq files

positional arguments:
  path                  Specify path

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -n numreads           Specify the number of reads. Paired-reads: 2,
                        unpaired-reads: 1. Default is paired-end
  -t threads            Number of threads. Default is the number of cores in
                        the system
  -o, --offHours        Optionally run the off-hours module that will search
                        for MiSeq runs in progress, wait until the run is
                        complete, and assemble the run
  -F, --FastqCreation   Optionally run the fastq creation module that will
                        search for MiSeq runs in progress, run bcl2fastq to
                        create fastq files, and assemble the run
  -d destinationfastq   Optional folder path to store .fastq files created
                        using the fastqCreation module. Defaults to
                        path/miseqfolder
  -m miSeqPath          Path of the folder containing MiSeq run data folder
  -f miseqfolder        Name of the folder containing MiSeq run data
  -r1 readLengthForward
                        Length of forward reads to use. Can specify "full" to
                        take the full length of forward reads specified on the
                        SampleSheet. Defaults to full
  -r2 readLengthReverse
                        Length of reverse reads to use. Can specify "full" to
                        take the full length of reverse reads specified on the
                        SampleSheet. Defaults to full
  -r referenceFilePath  Provide the location of the folder containing the
                        pipeline accessory files (reference genomes, MLST
                        data, etc.)
  -k kmerRange          The range of kmers used in SPAdes assembly. Default is
                        21,33,55,77,99,127
  -c customSampleSheet  Path of folder containing a custom sample sheet and
                        name of sample sheet file e.g.
                        /home/name/folder/BackupSampleSheet.csv. Note that
                        this sheet must still have the same format as Illumina
                        SampleSheet.csv files
  -b, --basicAssembly   Performs a basic de novo assembly, and does not
                        collect metadata
  --clade CLADE         Specify HMM database for BUSCO
  --itsx ITSX           Specify comma-separated HMM database for ITSx
  --trimoff             Turn off trimming with bbduk
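As an illustration of how the options combine, the sketch below assembles an MBBSpades command line from a few of the flags listed above. `build_command` and `check_kmer_range` are hypothetical helpers, not part of the pipeline. The k-mer check reflects SPAdes' own documented constraints (odd values below 128, in ascending order); whether MBBSpades performs a similar check is not stated here.

```python
import shlex


def check_kmer_range(text):
    """Validate a comma-separated k-mer list such as '21,33,55'."""
    kmers = [int(k) for k in text.split(",")]
    for k in kmers:
        if k % 2 == 0 or k >= 128:
            raise ValueError(f"k-mer size must be odd and below 128: {k}")
    if kmers != sorted(set(kmers)):
        raise ValueError("k-mer sizes must be unique and ascending")
    return kmers


def build_command(path, threads=None, kmer_range=None, basic=False, trim=True):
    """Return an argument list suitable for subprocess.run (not executed here)."""
    cmd = ["MBBSpades"]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if kmer_range is not None:
        check_kmer_range(kmer_range)
        cmd += ["-k", kmer_range]
    if basic:
        cmd.append("-b")          # basic de novo assembly, no metadata
    if not trim:
        cmd.append("--trimoff")   # skip bbduk trimming
    cmd.append(path)              # positional argument goes last
    return cmd


# The path below is a made-up example location.
print(shlex.join(build_command("/data/140922", threads=8, kmer_range="21,33,55")))
# prints: MBBSpades -t 8 -k 21,33,55 /data/140922
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.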
