Name: sra-pipeline
Owner: Fred Hutchinson Cancer Research Center
Description: Download SRA files from SRA, pipe through fastq-dump and bowtie2 to S3, in a container
Created: 2018-04-05 19:04:16.0
Updated: 2018-05-03 05:22:48.0
Pushed: 2018-05-03 05:22:47.0
Homepage: null
Size: 4130
Language: Python
This repository contains code for running an analysis pipeline in AWS Batch.
Given a set of SRA accession numbers, AWS Batch will start an array job where each child will process a single accession number, doing the following:
1. Download the `.sra` file from SRA.
2. Convert the `.sra` file to `fastq` format using `fastq-dump`. The `.sra` file is highly compressed and this step can expand it to more than 20 times its size, which is one reason we stream the data in a pipe: so as not to need lots of scratch space (a sketch of the full stream follows this list).
3. Pipe the `fastq` data through `bowtie2` to search for the virus.
4. Pipe the output of `bowtie2` through `gzip` to compress it prior to the next step.
5. Upload the compressed `bowtie2` output to an S3 bucket. The resulting file will have an S3 URL like this: `s3://<bucket-name>/pipeline-results2/<SRA-accession-number>/<virus>/<SRA-accession-number>.sam.gz`
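The exact commands live in this repository's code, but as a rough illustration, here is a minimal Python sketch of how one child job could chain these steps with `subprocess` so that nothing large ever lands on disk. The accession number, `bowtie2` index path, bucket name, and virus label are all placeholder assumptions, not the repository's actual configuration.

```python
import subprocess

ACCESSION = "SRR000001"        # placeholder SRA accession number
INDEX = "/refs/virus_index"    # placeholder bowtie2 index prefix
BUCKET = "my-bucket"           # placeholder S3 bucket name
VIRUS = "my-virus"             # placeholder virus label used in the S3 key

# fastq-dump writes fastq to stdout; bowtie2 reads unpaired reads from
# stdin ("-"); gzip compresses the SAM output; `aws s3 cp - ...` streams
# stdin straight to S3. No intermediate file is ever written.
dump = subprocess.Popen(
    ["fastq-dump", "--stdout", ACCESSION],
    stdout=subprocess.PIPE)
align = subprocess.Popen(
    ["bowtie2", "-x", INDEX, "-U", "-"],
    stdin=dump.stdout, stdout=subprocess.PIPE)
compress = subprocess.Popen(
    ["gzip", "-c"],
    stdin=align.stdout, stdout=subprocess.PIPE)
upload = subprocess.Popen(
    ["aws", "s3", "cp", "-",
     f"s3://{BUCKET}/pipeline-results2/{ACCESSION}/{VIRUS}/{ACCESSION}.sam.gz"],
    stdin=compress.stdout)

# Close the parent's copies of each pipe so upstream tools see SIGPIPE
# if a downstream stage exits early, then wait for the upload to finish.
for stage in (dump, align, compress):
    stage.stdout.close()
upload.wait()
```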
Clone this repository (and run `git pull` periodically to keep your cloned repository up to date):

```
git clone https://github.com/FredHutch/sra-pipeline.git
cd sra-pipeline
```
The `sra_pipeline` utility

A script called `sra_pipeline` is available to simplify submitting and monitoring pipeline jobs. Running the utility with `--help` gives usage information:
```
$ sra_pipeline --help
usage: sra_pipeline.py [-h] [-c] [-i] [-s N] [-r N] [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c, --completed       show completed accession numbers
  -i, --in-progress     show accession numbers that are in progress
  -s N, --submit-small N
                        submit N jobs of ascending size
  -r N, --submit-random N
                        submit N randomly chosen jobs
  -f FILE, --submit-file FILE
                        submit accession numbers contained in FILE
```
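For example, `sra_pipeline --submit-file accessions.txt` would submit one child job per accession number listed in `accessions.txt` (the filename is illustrative). Under the hood, submissions like these presumably create the AWS Batch array job described above; a minimal boto3 sketch of such a submission, with placeholder queue and job-definition names, might look like this:

```python
import boto3

batch = boto3.client("batch")

# Placeholder names: substitute your own job queue and job definition.
response = batch.submit_job(
    jobName="sra-pipeline",
    jobQueue="my-job-queue",
    jobDefinition="sra-pipeline-jobdef",
    # One child job per accession number; each child can read its
    # AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick its accession.
    arrayProperties={"size": 10},
)
print("Submitted array job:", response["jobId"])
```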
You can get more detail about running jobs by using the Batch Dashboard and/or the AWS command-line client for Batch.
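As an alternative to the CLI, a short boto3 sketch (boto3 wraps the same API the CLI calls) can list the pipeline's running jobs; the queue name below is a placeholder:

```python
import boto3

batch = boto3.client("batch")

# List currently running jobs in a (placeholder) job queue.
running = batch.list_jobs(jobQueue="my-job-queue", jobStatus="RUNNING")
for job in running["jobSummaryList"]:
    print(job["jobId"], job["jobName"])
```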
See Using AWS Batch at Fred Hutch for more information.