Sage-Bionetworks/rnaseqSim

Name: rnaseqSim

Owner: Sage Bionetworks

Description: null

Created: 2017-01-12 18:53:03.0

Updated: 2017-04-06 08:18:08.0

Pushed: 2017-11-04 21:38:44.0

Homepage: null

Size: 169

Language: Python

README

rnaseqSim

Code for the SMC-RNA challenge: it analyzes datasets of validated fusions and generates synthetic fusion data.

CWL Workflow:

Running the workflow requires CWL v1.0+ and Docker.

To run the workflow:

cwltool workflow/fusion_simulation_workflow.cwl [INPUT.JSON]

cwltool workflow/fusion_simulation_workflow_all.cwl [INPUT.JSON]
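The two invocations above can also be assembled from Python, for example when launching several simulations in a row. This is a minimal sketch and not part of the repository: the helper names `cwltool_command` and `run_simulation` and the `keep_intermediates` flag are hypothetical, and it assumes `cwltool` is on the PATH and the script runs from the repository root.

```python
import subprocess

WORKFLOW = "workflow/fusion_simulation_workflow.cwl"
WORKFLOW_ALL = "workflow/fusion_simulation_workflow_all.cwl"


def cwltool_command(input_json, keep_intermediates=False):
    """Return the cwltool argument list for one run (hypothetical helper)."""
    # The *_all workflow is the one that stores intermediate files in archive.tgz.
    workflow = WORKFLOW_ALL if keep_intermediates else WORKFLOW
    return ["cwltool", workflow, input_json]


def run_simulation(input_json, keep_intermediates=False):
    """Run the workflow; raises CalledProcessError if cwltool exits non-zero."""
    subprocess.run(cwltool_command(input_json, keep_intermediates), check=True)
```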

The input JSON needs the fields:

SIM_NAME: string

GTF: File

NUM_EVENTS: int

TARGET_DEPTH: int

GENOME: File

EXPRESSION_PROFILE: File

RSEM_MODEL: File

DIP_GENOME: File

And can use the optional fields:

SEED: ["null", int]

MID_EXON_FUSIONS: ["null", boolean]
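One way to produce a valid input JSON is to build it in Python. The sketch below is illustrative only: the helper names `cwl_file` and `build_inputs` are hypothetical, and the file paths in the usage section are placeholders, not the real locations of the data. It relies on the CWL v1.0 convention that a File input is an object with `class` and `path` keys.

```python
import json

FILE_FIELDS = ["GTF", "GENOME", "EXPRESSION_PROFILE", "RSEM_MODEL", "DIP_GENOME"]


def cwl_file(path):
    """CWL v1.0 represents a File input as an object with class and path."""
    return {"class": "File", "path": path}


def build_inputs(sim_name, num_events, target_depth, files,
                 seed=None, mid_exon_fusions=None):
    """Assemble the input object; `files` maps each File field name to a path."""
    missing = [name for name in FILE_FIELDS if name not in files]
    if missing:
        raise ValueError("missing File inputs: " + ", ".join(missing))
    inputs = {
        "SIM_NAME": sim_name,
        "NUM_EVENTS": int(num_events),
        "TARGET_DEPTH": int(target_depth),
    }
    for name in FILE_FIELDS:
        inputs[name] = cwl_file(files[name])
    # Optional fields are left out entirely when they are not set.
    if seed is not None:
        inputs["SEED"] = int(seed)
    if mid_exon_fusions is not None:
        inputs["MID_EXON_FUSIONS"] = bool(mid_exon_fusions)
    return inputs


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    paths = {name: "/path/to/" + name.lower() for name in FILE_FIELDS}
    print(json.dumps(build_inputs("sim1", 10, 20, paths, seed=42), indent=2))
```

The printed JSON can be saved to a file and passed to cwltool as [INPUT.JSON].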

Description of inputs

SIM_NAME:

GTF:

NUM_EVENTS:

TARGET_DEPTH:

GENOME:

EXPRESSION_PROFILE:

RSEM_MODEL:

DIP_GENOME:

SEED: (optional) If given, every script in the workflow that has a random element is seeded with this integer.

MID_EXON_FUSIONS: (optional) If set to true, fusions will happen in the middle of exons.
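The effect of SEED can be illustrated with a small Python sketch. This is not the repository's code: `make_rng` and `pick_events` are hypothetical stand-ins for any workflow step that has a random element, and they only show the general pattern of seeding a generator when an optional seed is supplied.

```python
import random


def make_rng(seed=None):
    """Return a private random generator, seeded only when a seed is given."""
    # random.Random(None) seeds itself from the OS or the clock,
    # so runs without a seed differ from one another.
    return random.Random(seed)


def pick_events(candidates, num_events, seed=None):
    """Sample num_events items; the same seed always gives the same sample."""
    rng = make_rng(seed)
    return rng.sample(candidates, num_events)
```

With the same seed and the same inputs, two calls return the same selection, which is what makes a simulation reproducible.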

Description of outputs

[SIM_NAME]_filtered.bedpe:

[SIM_NAME]_isoforms_truth.txt:

[SIM_NAME]_mergeSort_1.fq.gz:

[SIM_NAME]_mergeSort_2.fq.gz:

archive.tgz: Stores the other intermediate files if fusion_simulation_workflow_all.cwl was used.
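The output names listed above follow from SIM_NAME, so a finished run can be checked with a few lines of Python. This is a sketch under stated assumptions: the helpers `expected_outputs` and `missing_outputs` are hypothetical, and it assumes that archive.tgz is only expected when the fusion_simulation_workflow_all.cwl workflow was run.

```python
import os


def expected_outputs(sim_name, all_workflow=False):
    """List the output file names described above for a given SIM_NAME."""
    names = [
        sim_name + "_filtered.bedpe",
        sim_name + "_isoforms_truth.txt",
        sim_name + "_mergeSort_1.fq.gz",
        sim_name + "_mergeSort_2.fq.gz",
    ]
    # Assumption: archive.tgz is only expected from the *_all workflow.
    if all_workflow:
        names.append("archive.tgz")
    return names


def missing_outputs(sim_name, out_dir=".", all_workflow=False):
    """Return the expected files that are not present in out_dir."""
    return [
        name
        for name in expected_outputs(sim_name, all_workflow)
        if not os.path.isfile(os.path.join(out_dir, name))
    ]
```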

Older descriptions:

Requirements:

STAR 2.4.2a

RSEM v1.2.31

Required Inputs

Diploid Genome - Homo_sapiens.GRCh37.75.primary.diploid.fa.gz (syn8348583)

Diploid GTF - Hsapiens_Ensembl_v75_diploid.gtf.gz (syn8348617)

Reference GTF - Hsapiens_Ensembl_v75_refonly.gtf (syn8348668)

Model file - CPCG_0258.R1.fastq.model (syn8348382)

Expression profile -

Basic Steps:

Step 1 - Index Diploid Genome:

rsem-prepare-reference --gtf [diploid.ref.gtf] --star [diploid.ref.fa] [Index name]

Step 2 - Create fusion events, truth file, and RSEM-format fusion reference:

fusion_create/module.py --gtf Hsapiens_Ensembl_v75_refonly.gtf --numEvents [XX] --simName [simName]

Step 3 - Adjust estimated isoform values to include expression for fusion genes according to a model:

model_isoforms/modify_model_tpm_for_diploid.R --TPM [input expression profile] --gtf [simName.gtf] --targetDepth [XX] --codeDir [/path/to/code] &> [output.log]

Step 4 - Generate reads from diploid and fusion references:

fastq_create/generate_reads.py --totalReads [targetDepth * 1000000] --numSimReads [output.log] --simName [simName] --RSEMmodel [model file] --isoformTPM [model_isoforms output] --fusionTPM [model_isoforms output] --fusRef [fusion_create output]
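In Step 4, --totalReads is the target depth multiplied by one million, so a target depth of 20 corresponds to 20,000,000 reads. The sketch below assembles the Step 4 command with that arithmetic done in Python. The helper name `generate_reads_command` and its parameter names are hypothetical; the flags are the ones shown in the command above, and all file arguments are placeholders supplied by the caller.

```python
def generate_reads_command(sim_name, target_depth, model_log, rsem_model,
                           isoform_tpm, fusion_tpm, fus_ref):
    """Build the Step 4 argument list (hypothetical helper)."""
    # --totalReads is targetDepth * 1000000, as in the command above.
    total_reads = int(target_depth) * 1000000
    return [
        "fastq_create/generate_reads.py",
        "--totalReads", str(total_reads),
        "--numSimReads", model_log,      # the log written in Step 3
        "--simName", sim_name,
        "--RSEMmodel", rsem_model,
        "--isoformTPM", isoform_tpm,     # model_isoforms output
        "--fusionTPM", fusion_tpm,       # model_isoforms output
        "--fusRef", fus_ref,             # fusion_create output
    ]
```

The returned list can be passed to subprocess.run, or joined with spaces to print the equivalent shell command.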


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.