Name: blackbox-pipeline
Owner: Biological Informatics CoE @ Agriculture and Agri-Food Canada
Description: Genome assembly and metadata collection
Created: 2016-02-23 14:58:34.0
Updated: 2016-03-04 23:24:17.0
Pushed: 2016-08-11 20:19:52.0
Size: 108
Language: Python
This is an automated genome assembly pipeline designed to run best inside a Docker image. Its purpose is to give researchers a standard, simplified workflow for microbial genome assembly while keeping track of executed commands and their parameters, program version numbers and associated sample metadata, all of which are important for experimental reproducibility and traceability.
This pipeline uses SPAdes as the genome assembler and therefore works best with bacterial or smaller fungal genomes (< 100 Mb). MiSeq raw reads and metadata files are required as input. Alternatively, archived files from BaseSpace (e.g. analysis_14348334_fastq.zip) can be used as input. In addition to the forward and reverse read files, the following files are required: GenerateFASTQRunStatistics.xml, RunInfo.xml and SampleSheet.csv.
These files are located in the run's subfolder of the MiSeqOutput directory on the MiSeq onboard computer (e.g. 140922_M02466_0030_000000000-AARWU). The folder name consists of the date (140922), the MiSeq designation (M02466), and the flowcell number (000000000-AARWU). The fastq files are located in the ../MiSeqOutput/140922_M02466_0030_000000000-AARWU/Data/Intensities/BaseCalls folder.
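The folder naming scheme described above can be sketched in Python as follows. This is a minimal illustration, not part of the pipeline: `MiSeqRun` and `parse_run_folder` are hypothetical names, and the third field (0030) is not described in the text, so it is kept as an opaque string under the assumed name `run_counter`.

```python
from datetime import date, datetime
from typing import NamedTuple


class MiSeqRun(NamedTuple):
    run_date: date
    instrument: str   # the MiSeq designation, e.g. M02466
    run_counter: str  # assumption: this field is not named in the text
    flowcell: str


def parse_run_folder(name: str) -> MiSeqRun:
    """Split a MiSeq run folder name into its underscore-separated fields."""
    parts = name.split("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected run folder name: {name!r}")
    raw_date, instrument, run_counter, flowcell = parts
    # The date is written as YYMMDD, e.g. 140922 for 22 September 2014.
    run_date = datetime.strptime(raw_date, "%y%m%d").date()
    return MiSeqRun(run_date, instrument, run_counter, flowcell)


run = parse_run_folder("140922_M02466_0030_000000000-AARWU")
print(run.run_date.isoformat(), run.instrument, run.flowcell)
```

Parsing the name up front makes it easy to sort runs by date or to group them by instrument before staging them for assembly.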
Copy the forward and reverse fastq files, GenerateFASTQRunStatistics.xml, RunInfo.xml and SampleSheet.csv to a different working location (e.g. ../Sequencing/user_name/project_name/genome_assembly/140922).
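The copy step could be scripted along these lines. This is a sketch under stated assumptions rather than something shipped with the pipeline: `stage_run` is a hypothetical helper, the three metadata files are assumed to sit at the top level of the run folder, and the reads are assumed to match `*.fastq*` (plain or gzip-compressed).

```python
import shutil
from pathlib import Path
from typing import List

METADATA_FILES = (
    "GenerateFASTQRunStatistics.xml",
    "RunInfo.xml",
    "SampleSheet.csv",
)


def stage_run(run_dir: Path, work_dir: Path) -> List[Path]:
    """Copy fastq files and run metadata into work_dir; return the copies."""
    fastq_dir = run_dir / "Data" / "Intensities" / "BaseCalls"
    # Assumption: reads are named *.fastq or *.fastq.gz.
    sources = sorted(fastq_dir.glob("*.fastq*"))
    if not sources:
        raise FileNotFoundError(f"no fastq files found in {fastq_dir}")
    # Assumption: metadata files live at the top level of the run folder.
    sources += [run_dir / name for name in METADATA_FILES]
    missing = [str(p) for p in sources if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    work_dir.mkdir(parents=True, exist_ok=True)
    # copy2 preserves timestamps, which helps when tracing a run later.
    return [Path(shutil.copy2(src, work_dir)) for src in sources]
```

Checking for missing files before copying anything avoids leaving a half-populated working folder behind.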
This pipeline includes a main script (MBBSpades) that executes the following helper modules:
--clade is set to bacteria
usage: MBBSpades [-h] [-v] [-n numreads] [-t threads] [-o] [-F]
                 [-d destinationfastq] [-m miSeqPath] [-f miseqfolder]
                 [-r1 readLengthForward] [-r2 readLengthReverse]
                 [-r referenceFilePath] [-k kmerRange]
                 [-c customSampleSheet] [-b] [--clade CLADE]
                 [--itsx ITSX] [--trimoff]
                 path

Assemble genomes from Illumina fastq files

positional arguments:
  path                  Specify path

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -n numreads           Specify the number of reads. Paired-reads: 2,
                        unpaired-reads: 1. Default is paired-end
  -t threads            Number of threads. Default is the number of cores in
                        the system
  -o, --offHours        Optionally run the off-hours module that will search
                        for MiSeq runs in progress, wait until the run is
                        complete, and assemble the run
  -F, --FastqCreation   Optionally run the fastq creation module that will
                        search for MiSeq runs in progress, run bcl2fastq to
                        create fastq files, and assemble the run
  -d destinationfastq   Optional folder path to store .fastq files created
                        using the fastqCreation module. Defaults to
                        path/miseqfolder
  -m miSeqPath          Path of the folder containing MiSeq run data folder
  -f miseqfolder        Name of the folder containing MiSeq run data
  -r1 readLengthForward
                        Length of forward reads to use. Can specify "full" to
                        take the full length of forward reads specified on the
                        SampleSheet. Defaults to full
  -r2 readLengthReverse
                        Length of reverse reads to use. Can specify "full" to
                        take the full length of reverse reads specified on the
                        SampleSheet. Defaults to full
  -r referenceFilePath  Provide the location of the folder containing the
                        pipeline accessory files (reference genomes, MLST
                        data, etc.)
  -k kmerRange          The range of kmers used in SPAdes assembly. Default is
                        21,33,55,77,99,127
  -c customSampleSheet  Path of folder containing a custom sample sheet and
                        name of sample sheet file e.g.
                        /home/name/folder/BackupSampleSheet.csv. Note that
                        this sheet must still have the same format as Illumina
                        SampleSheet.csv files
  -b, --basicAssembly   Performs a basic de novo assembly, and does not
                        collect metadata
  --clade CLADE         Specify HMM database for BUSCO
  --itsx ITSX           Specify comma-separated HMM database for ITSx
  --trimoff             Turn off trimming with bbduk
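As an illustration of how the options above combine, the following sketch assembles an MBBSpades command line from Python. It is an example only: `build_command` and `validate_kmers` are hypothetical helpers, the path `/data/140922` is a placeholder, and invoking the script as a bare `MBBSpades` executable is an assumption taken from the usage line. The k-mer check reflects the SPAdes requirement that k values be odd, below 128, and listed in ascending order.

```python
import shlex
from typing import List, Optional, Sequence


def validate_kmers(kmers: Sequence[int]) -> None:
    """Reject k-mer lists that SPAdes would refuse."""
    if any(k % 2 == 0 or k >= 128 for k in kmers):
        raise ValueError("k-mer sizes must be odd and less than 128")
    if list(kmers) != sorted(set(kmers)):
        raise ValueError("k-mer sizes must be unique and ascending")


def build_command(
    path: str,
    threads: Optional[int] = None,
    kmers: Optional[Sequence[int]] = None,
    clade: Optional[str] = None,
    basic_assembly: bool = False,
    trim: bool = True,
) -> List[str]:
    """Build an argument list using only flags shown in the usage text."""
    cmd = ["MBBSpades"]  # assumption: the script is on PATH under this name
    if threads is not None:
        cmd += ["-t", str(threads)]
    if kmers:
        validate_kmers(kmers)
        cmd += ["-k", ",".join(str(k) for k in kmers)]
    if clade:
        cmd += ["--clade", clade]
    if basic_assembly:
        cmd.append("-b")
    if not trim:
        cmd.append("--trimoff")
    cmd.append(path)  # positional argument goes last
    return cmd


# Hypothetical example values; adjust to the local setup.
cmd = build_command("/data/140922", threads=8, kmers=[21, 33, 55], clade="bacteria")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.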