raphael-group/cna-processing

Name: cna-processing

Owner: raphael-group

Description: Script for taking the output of the GISTIC2.0 module and converting the data for use with MAGI, HotNet2, and CoMEt.

Created: 2016-05-24 18:40:10.0

Updated: 2016-05-26 04:31:07.0

Pushed: 2016-07-11 13:32:07.0

Homepage: null

Size: 5533

Language: Python

README

CNA Processing

Purpose

Process CNA data into formats usable by MAGI, HotNet2, and CoMEt.

Requires
Setup

None required beyond the necessary Python libraries and the gene target and gene location lists.

Quick Start

This works provided the GISTIC2.0 output files use the standard naming convention.

python gistic2processing.py -g <path to gene dictionary> -d <path to data folder/tar file> -tg <path to gene target file>

Output
HotNet2

A tab-separated file in which each row is a sample and its associated gene mutations. The first column is the sample name; each entry after it is a gene with either (A) for amplified or (D) for deleted appended to the end. The file is named {prefix}_hotnet2.tsv. The prefix is an optional argument; if no prefix is supplied, the default is 'output'.
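The row layout described above can be read back with a few lines of Python. This is a minimal sketch, not code from this repository: the function name `parse_hotnet2_row` is hypothetical, the sample and gene names are made up, and it assumes the marker is appended directly as the literal text `(A)` or `(D)`.

```python
def parse_hotnet2_row(line):
    """Split one HotNet2-style row into (sample, [(gene, cna_type), ...]).

    Assumption: each gene entry ends with the literal suffix "(A)" or "(D)".
    """
    fields = line.rstrip("\n").split("\t")
    sample, entries = fields[0], fields[1:]
    events = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing tabs
        if entry.endswith("(A)"):
            events.append((entry[:-3].strip(), "Amp"))
        elif entry.endswith("(D)"):
            events.append((entry[:-3].strip(), "Del"))
        else:
            raise ValueError(f"entry without (A)/(D) suffix: {entry!r}")
    return sample, events


# Made-up row for illustration only.
sample, events = parse_hotnet2_row("SAMPLE1\tGENE1(A)\tGENE2(D)\n")
```

Raising on an unmarked entry is a design choice for the sketch; a reader that should tolerate unexpected entries could skip them instead.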

CoMEt

The same format as HotNet2, named {prefix}_all_cna_comet.tsv. There is an additional file named {prefix}_name_map_comet.tsv, which maps a shortened, more human-readable concatenated gene list to the full list of genes found in each peak. That file is a series of tab-separated rows with the shortened name in the first column and the longer name in the second.
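As an illustration of the two-column name map, the sketch below loads it into a dictionary keyed by the shortened name. The function name `read_name_map` and the sample names are hypothetical, and the long name is kept as an unparsed string because the way the full gene list is joined is not specified here.

```python
import csv
import io


def read_name_map(handle):
    """Return {short_name: long_name} from a two-column, tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    # Rows with fewer than two columns (e.g. blank lines) are skipped.
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Made-up content standing in for {prefix}_name_map_comet.tsv.
example = io.StringIO("SHORT_A\tLONG_NAME_A\nSHORT_B\tLONG_NAME_B\n")
name_map = read_name_map(example)
```

With a real file, the same function could be called as `read_name_map(open(path, newline=""))`.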

MAGI

A five-column, tab-separated file with a header row. Each row consists of, in order, a gene, a sample, the CNA type (Del/Amp), the left position, and the right position.
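A reader for this five-column layout might look like the following sketch. It reads columns by position because the header names are not given here; `CnaRecord` and `read_magi_cna` are hypothetical names, the sample data is made up, and treating the positions as integers is an assumption.

```python
import csv
import io
from typing import NamedTuple


class CnaRecord(NamedTuple):
    gene: str
    sample: str
    cna_type: str  # "Amp" or "Del"
    left: int
    right: int


def read_magi_cna(handle):
    """Read the five-column file, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # discard the header; its exact labels are not assumed
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, sample, cna_type, left, right = row
        records.append(CnaRecord(gene, sample, cna_type, int(left), int(right)))
    return records


# Made-up content; the header labels here are placeholders.
example = io.StringIO(
    "col1\tcol2\tcol3\tcol4\tcol5\n"
    "GENE1\tSAMPLE1\tAmp\t100\t200\n"
    "GENE2\tSAMPLE1\tDel\t300\t450\n"
)
records = read_magi_cna(example)
```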

Usage

All options can be set by command line or via a configuration file. The command line will override configuration file settings, and the names and usage for command line and configuration file options are the same.
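The precedence rule above (command line over configuration file) can be sketched with `argparse`. This is not the script's actual implementation: `resolve_options` is a hypothetical helper, the configuration is represented as an already-loaded dictionary because the config file format is not described here, and the `--` long-option spellings are assumed from the option names.

```python
import argparse


def resolve_options(config, argv):
    """Merge config-file values with command-line values; the command line wins."""
    parser = argparse.ArgumentParser()
    # Defaults are None so that "not given on the command line" is detectable.
    parser.add_argument("-g", "--gene_dictionary")
    parser.add_argument("-d", "--data")
    parser.add_argument("-tg", "--target_genes")
    cli = vars(parser.parse_args(argv))

    merged = dict(config)
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


# Made-up values: the config supplies two options, the command line overrides one.
options = resolve_options(
    {"gene_dictionary": "genes.json", "data": "from_config"},
    ["-d", "from_cli"],
)
```

Here `options["data"]` comes from the command line while `options["gene_dictionary"]` falls back to the configuration value.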

Required

The following options must be specified by the user either at the command line or in the config file, and cannot be left to defaults:

| Config/long argument | Short argument | Input type | Description |
| :--- | :--- | :--- | :--- |
| gene_dictionary | -g | Path to file | Path to the gene dictionary, a json file with a key/value pair of { gene : [[start location, end location, chromosome number]] }. See example folder in this repository. |
| data | -d | Path to folder or tar file | Path to either a folder with a flat structure containing all of the GISTIC2.0 data, or the path to a tar file (does not have to have a flat internal file structure) with the GISTIC2.0 data. |
| target_genes | -tg | Path to file | Path to a file containing the target genes. File format is two columns, tab separated, with the gene name in the first column and Amp/Del/Both (not case sensitive) in the second column. |
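To make the two input file formats concrete, the sketch below builds a tiny gene dictionary and target gene list in memory and reads them back. The gene names and coordinates are made up, `read_target_genes` is a hypothetical helper, and writing the chromosome as an integer is an assumption; the example folder in this repository remains the authoritative reference.

```python
import io
import json

# Gene dictionary: { gene : [[start location, end location, chromosome number]] }.
# Values are illustrative only.
gene_dictionary_text = json.dumps({"GENE1": [[100, 200, 1]], "GENE2": [[300, 450, 7]]})
gene_dictionary = json.loads(gene_dictionary_text)


def read_target_genes(handle):
    """Return {gene: "amp" | "del" | "both"} from a two-column, tab-separated file."""
    allowed = {"amp", "del", "both"}
    targets = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        gene, kind = line.split("\t")
        kind = kind.strip().lower()  # the second column is not case sensitive
        if kind not in allowed:
            raise ValueError(f"unexpected CNA type for {gene}: {kind!r}")
        targets[gene] = kind
    return targets


# Made-up target gene file with mixed capitalisation.
target_genes = read_target_genes(io.StringIO("GENE1\tAmp\nGENE2\tDEL\nGENE3\tboth\n"))
```

Lower-casing the second column is one simple way to honour the "not case sensitive" rule before validating it.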


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.