FredHutch/find-cags

Name: find-cags

Owner: Fred Hutchinson Cancer Research Center

Description: Find Co-Abundant Groups of Genes

Created: 2018-05-11 16:43:42.0

Updated: 2018-05-14 18:32:27.0

Pushed: 2018-05-14 18:32:26.0

Homepage: null

Size: 71292

Language: Python

README

Find Co-Abundant Groups of Genes

Purpose

Analyze gene abundance data from a large set of samples and calculate which sets of genes are found at a similar abundance across all samples. Those genes are expected to be biologically linked. For example, in metagenomic analysis by whole-genome shotgun sequencing, genes from the same genome tend to be found at a similar abundance.

Input Data Format

It is assumed that all input data will be in JSON format (gzip optional). The pertinent data for each individual sample is an abundance metric for each gene. The input file must contain a list in which each element is a dict that contains the gene ID under one key and the abundance metric under another key.

For initial development we will assume that each input file is a single dict, with the results located at a single key within that dict. In the future we may end up supporting more flexibility in extracting results from files with different structures, but for the first pass we'll just go with this.

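Therefore the features that must be specified by the user are the key for the list of results within each file (--results-key), the key for the gene ID within each element of that list (--gene-id-key), and the key for the abundance metric within each element (--abundance-key).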

NOTE: All abundance metric values must be >= 0
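
As an illustration of the layout described above, here is a minimal sketch that writes a tiny gzipped sample file and reads it back. The key names ("results", "id", "depth"), the gene IDs, and the read_sample helper are assumptions made for this example only; the actual keys are whatever the user supplies.

```python
import gzip
import json
import os
import tempfile

# Hypothetical key names, chosen for this example only.
RESULTS_KEY = "results"
GENE_ID_KEY = "id"
ABUNDANCE_KEY = "depth"


def read_sample(path):
    """Return {gene_id: abundance} from a sample JSON (gzip optional)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        data = json.load(handle)
    abundances = {}
    for record in data[RESULTS_KEY]:
        value = float(record[ABUNDANCE_KEY])
        # Enforce the rule that all abundance values are >= 0.
        if value < 0:
            raise ValueError("Negative abundance for " + str(record[GENE_ID_KEY]))
        abundances[record[GENE_ID_KEY]] = value
    return abundances


# A single dict, with the list of per-gene results under one key.
example = {
    RESULTS_KEY: [
        {GENE_ID_KEY: "gene1", ABUNDANCE_KEY: 1.5},
        {GENE_ID_KEY: "gene2", ABUNDANCE_KEY: 0.0},
    ]
}

example_path = os.path.join(tempfile.mkdtemp(), "sample1.json.gz")
with gzip.open(example_path, "wt") as handle:
    json.dump(example, handle)

sample_abundances = read_sample(example_path)
```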

Sample Sheet

To link individual files with sample names, the user will specify a sample sheet, which is a JSON file formatted as a dict, with sample names as keys and file locations as values.
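
A sample sheet of this shape could be written and loaded as in the sketch below. The sample names, file locations, and the read_sample_sheet helper are hypothetical and only illustrate the dict layout.

```python
import json
import os
import tempfile

# Hypothetical sample names and file locations, for illustration only.
sample_sheet = {
    "sampleA": "data/sampleA.json.gz",
    "sampleB": "s3://example-bucket/sampleB.json.gz",
}

sheet_path = os.path.join(tempfile.mkdtemp(), "sample_sheet.json")
with open(sheet_path, "w") as handle:
    json.dump(sample_sheet, handle, indent=2)


def read_sample_sheet(path):
    """Load a sample sheet and check that it is a dict of name -> location."""
    with open(path) as handle:
        sheet = json.load(handle)
    if not isinstance(sheet, dict):
        raise ValueError("Sample sheet must be a JSON dict")
    return sheet


loaded_sheet = read_sample_sheet(sheet_path)
```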

Data Locations

At the moment we will support data found in (a) the local file system or (b) AWS S3.
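
One way to tell the two kinds of location apart is to check for the s3:// prefix, as in this sketch. The classify_location helper and the bucket name are assumptions for illustration; the sketch only splits the location string and does not fetch anything from S3.

```python
from urllib.parse import urlparse


def classify_location(location):
    """Return ("s3", bucket, key) for S3 URLs, else ("local", None, path)."""
    if location.startswith("s3://"):
        parsed = urlparse(location)
        # netloc holds the bucket; the rest of the path is the object key.
        return ("s3", parsed.netloc, parsed.path.lstrip("/"))
    return ("local", None, location)


s3_location = classify_location("s3://example-bucket/folder/sample.json.gz")
local_location = classify_location("tests/sample_sheet.json")
```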

Test Dataset

For testing, I will use a set of JSONs which contain the abundance of individual genes for a set of microbiome samples. That data is found in the tests/ folder. There is also a JSON file indicating which sample goes with which file, which is formatted as a simple dict (keys are sample names and values are file locations) and located in tests/sample_sheet.json.

Normalization

The --normalization option accepts two values, median and sum. In each case the abundance metric for each gene within each sample is divided by either the median or the sum of the abundance metrics for all genes within that sample. When calculating the median, only genes with non-zero abundances are considered.
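
The two normalization modes can be sketched for a single sample as follows. This is a minimal illustration of the rule described above, not the tool's own code; the normalize function and the gene values are made up for the example.

```python
from statistics import median


def normalize(abundances, method):
    """Divide each gene's abundance by the per-sample sum or median."""
    if method == "sum":
        factor = sum(abundances.values())
    elif method == "median":
        # Only genes with non-zero abundance contribute to the median.
        nonzero = [v for v in abundances.values() if v > 0]
        factor = median(nonzero) if nonzero else 0
    else:
        raise ValueError("method must be 'median' or 'sum'")
    if factor == 0:
        raise ValueError("Sample has no non-zero abundances")
    return {gene: value / factor for gene, value in abundances.items()}


raw = {"gene1": 2.0, "gene2": 0.0, "gene3": 6.0, "gene4": 4.0}
by_sum = normalize(raw, "sum")        # every value divided by 12.0
by_median = normalize(raw, "median")  # median of 2.0, 4.0, 6.0 is 4.0
```

Because zero values are excluded, the median here is 4.0 rather than the 3.0 that would result from including gene2.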

Distance Metric

Any of the distance metrics supported by scipy.spatial.distance.cdist may be used. See the scipy documentation for more details.
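
As a small illustration of what cdist computes, the sketch below builds a made-up gene-by-sample matrix and calculates pairwise distances between genes with two different metrics. The abundance values are invented for the example, and treating rows as genes is an assumption about the layout.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Rows are genes, columns are samples (made-up abundances).
abundance = np.array([
    [1.0, 2.0, 3.0],  # geneA
    [2.0, 4.0, 6.0],  # geneB: proportional to geneA across samples
    [3.0, 0.5, 0.1],  # geneC: a different abundance pattern
])

# Pairwise gene-by-gene distance matrices, one per metric.
cosine_dist = cdist(abundance, abundance, metric="cosine")
euclidean_dist = cdist(abundance, abundance, metric="euclidean")
```

With the cosine metric, geneA and geneB are at a distance of approximately zero because their profiles are proportional, while the Euclidean metric still separates them.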

Clustering Method

At the moment we will support single-linkage clustering, using a single distance threshold.
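
Single-linkage clustering at a fixed threshold puts two genes in the same group whenever they are connected by a chain of pairs that each fall within the threshold. The sketch below shows that idea with a small union-find structure. It is an illustration only, not the tool's implementation, and the single_linkage function, gene names, and one-dimensional "positions" are invented for the example.

```python
from itertools import combinations


def single_linkage(names, dist, max_dist):
    """Group names so that chains of pairs with dist <= max_dist share a group."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if dist(a, b) <= max_dist:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: each gene is a point on a line, distance is the absolute difference.
positions = {"g1": 0.0, "g2": 0.1, "g3": 0.2, "g4": 1.0}
cags = single_linkage(
    list(positions),
    lambda a, b: abs(positions[a] - positions[b]),
    0.15,
)
```

Here g1 and g3 are farther apart than the threshold, but both are within it of g2, so all three end up in one group while g4 stays on its own.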

Invocation

usage: find-cags.py [-h] --sample-sheet SAMPLE_SHEET --output-prefix
                    OUTPUT_PREFIX --output-folder OUTPUT_FOLDER
                    [--metric METRIC] [--normalization NORMALIZATION]
                    [--max-dist MAX_DIST] [--temp-folder TEMP_FOLDER]
                    [--results-key RESULTS_KEY]
                    [--abundance-key ABUNDANCE_KEY]
                    [--gene-id-key GENE_ID_KEY]

Find a set of co-abundant genes

optional arguments:
  -h, --help            show this help message and exit
  --sample-sheet SAMPLE_SHEET
                        Location for sample sheet (.json[.gz]).
  --output-prefix OUTPUT_PREFIX
                        Prefix for output files.
  --output-folder OUTPUT_FOLDER
                        Folder to place results. (Supported: s3://, or local
                        path).
  --metric METRIC       Distance metric calculation method, see
                        scipy.spatial.distance.
  --normalization NORMALIZATION
                        Normalization factor per-sample (median or sum).
  --max-dist MAX_DIST   Maximum distance for single-linkage clustering.
  --temp-folder TEMP_FOLDER
                        Folder for temporary files.
  --results-key RESULTS_KEY
                        Key identifying the list of gene abundances for each
                        sample JSON.
  --abundance-key ABUNDANCE_KEY
                        Key identifying the abundance value for each element
                        in the results list.
  --gene-id-key GENE_ID_KEY
                        Key identifying the gene ID for each element in the
                        results list.
  --threads THREADS     Number of threads to use.
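
For reference, a command line using the flags listed above could be assembled from Python as in this sketch. The output prefix, output folder, metric, and distance threshold are hypothetical values chosen for illustration; only the flag names and the tests/sample_sheet.json path come from this README.

```python
import shlex

command = [
    "find-cags.py",
    "--sample-sheet", "tests/sample_sheet.json",
    "--output-prefix", "test_run",  # hypothetical prefix
    "--output-folder", "output/",   # hypothetical local folder
    "--metric", "cosine",           # any metric accepted by cdist
    "--normalization", "median",
    "--max-dist", "0.1",            # hypothetical threshold
]

# Render the argument list as a shell-safe string for inspection.
command_line = " ".join(shlex.quote(part) for part in command)
print(command_line)

# To actually run it (assuming find-cags.py is on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```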

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.