Name: find-cags
Owner: Fred Hutchinson Cancer Research Center
Description: Find Co-Abundant Groups of Genes
Created: 2018-05-11 16:43:42.0
Updated: 2018-05-14 18:32:27.0
Pushed: 2018-05-14 18:32:26.0
Homepage: null
Size: 71292
Language: Python
Analyze gene abundance data from a large set of samples and identify sets of genes that are found at a similar abundance across all samples. Such genes are expected to be biologically linked. For example, in metagenomic analysis by whole-genome shotgun sequencing, genes from the same genome tend to be found at a similar abundance.
It is assumed that all input data will be in JSON format (gzip optional).
The pertinent data for each individual sample is an abundance metric for
each gene. The input file must contain a list in which each element is a
dict that contains the gene ID under one key and the abundance metric under
another key.
For initial development we will assume that each input file is a single
dict, with the results located at a single key within that dict.
In the future we may end up supporting more flexibility in extracting
results from files with different structures, but for the first pass we'll
just go with this.
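Therefore the features that must be specified by the user are the key for the list of gene abundances in each sample JSON (--results-key), the key for the gene ID within each element of that list (--gene-id-key), and the key for the abundance value within each element (--abundance-key).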
NOTE: All abundance metric values must be >= 0
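As a rough illustration of the layout described above, the sketch below parses one sample file and builds a gene-to-abundance mapping. The helper name read_sample and the key names used in the usage note are hypothetical, chosen for the example only; this is not taken from the find-cags source.

```python
import gzip
import json


def read_sample(path, results_key, gene_id_key, abundance_key):
    """Return {gene_id: abundance} for one sample file (.json or .json.gz)."""
    # gzip is optional, so pick the opener from the file extension
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        data = json.load(handle)

    abundances = {}
    # The file is a single dict; the list of results sits under one key
    for record in data[results_key]:
        value = float(record[abundance_key])
        # All abundance metric values must be >= 0
        if value < 0:
            raise ValueError(f"negative abundance for {record[gene_id_key]}")
        abundances[record[gene_id_key]] = value
    return abundances
```

With a file shaped like {"results": [{"id": "gene1", "depth": 2.0}]}, the call would be read_sample(path, "results", "id", "depth").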
To link individual files with sample names, the user will specify a
sample sheet, which is a JSON file formatted as a dict, with sample
names as keys and file locations as values.
At the moment we will support data found in (a) the local file system or (b) AWS S3.
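To make the sample sheet format concrete, here is a minimal sketch that parses a sample sheet and groups the samples by where their files live. The sample names, file names, and bucket name are made up, and the assumption that S3 locations are written as s3:// URLs is mine (it mirrors how the output folder option describes S3 paths).

```python
import json


def split_sample_sheet(sample_sheet_text):
    """Parse a sample sheet and group sample names by storage backend."""
    sheet = json.loads(sample_sheet_text)
    if not isinstance(sheet, dict):
        raise ValueError("sample sheet must be a JSON object")
    local, s3 = {}, {}
    for sample_name, location in sheet.items():
        # Assumption: S3 objects are written as s3://bucket/key URLs
        target = s3 if location.startswith("s3://") else local
        target[sample_name] = location
    return local, s3


# Hypothetical sample sheet: keys are sample names, values are file locations
example = json.dumps({
    "sampleA": "tests/sampleA.json.gz",
    "sampleB": "s3://example-bucket/sampleB.json.gz",
})
local_files, s3_files = split_sample_sheet(example)
```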
For testing, I will use a set of JSON files which contain the abundance of
individual genes for a set of microbiome samples. That data is found in the
tests/ folder. There is also a JSON file indicating which sample goes
with which file, which is formatted as a simple dict (keys are sample names
and values are file locations) and located in tests/sample_sheet.json.
The --normalization option accepts two values, median and sum. In each case
the abundance metric for each gene within each sample is divided by either
the median or the sum of the abundance metrics for all genes within that
sample. When calculating the median, only genes with non-zero abundances
are considered.
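The two normalization modes can be sketched in a few lines of plain Python. The function name normalize and the example values are hypothetical; this only illustrates the arithmetic described above and is not the tool's own code.

```python
from statistics import median


def normalize(abundances, method):
    """Divide each gene's abundance by the per-sample median or sum."""
    if method == "sum":
        factor = sum(abundances.values())
    elif method == "median":
        # Only genes with non-zero abundance contribute to the median
        factor = median([v for v in abundances.values() if v > 0])
    else:
        raise ValueError("method must be 'median' or 'sum'")
    return {gene: value / factor for gene, value in abundances.items()}


# One sample with four genes; gene1 is absent (abundance 0)
sample = {"gene1": 0.0, "gene2": 2.0, "gene3": 4.0, "gene4": 6.0}
by_median = normalize(sample, "median")  # median of 2, 4, 6 is 4
by_sum = normalize(sample, "sum")        # sum of all values is 12
```

Here the zero-abundance gene stays at zero under both modes, and it does not pull the median down because it is excluded from that calculation.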
Any of the distance metrics supported by scipy.spatial.distance.cdist are
supported. See the scipy documentation for more details.
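For orientation, this is how cdist is called with two different metric names on a small matrix of gene abundance profiles. The matrix values are made up, and treating rows as genes and columns as samples is an assumption for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Rows are genes, columns are samples (made-up normalized abundances)
abundance = np.array([
    [1.0, 2.0, 3.0],   # geneA
    [1.0, 2.0, 3.1],   # geneB: nearly the same profile as geneA
    [3.0, 1.0, 0.0],   # geneC: a very different profile
])

# Pairwise distances between every pair of gene profiles
euclidean = cdist(abundance, abundance, metric="euclidean")
cosine = cdist(abundance, abundance, metric="cosine")
```

Under either metric, geneA and geneB come out much closer to each other than either is to geneC, which is the signal the clustering step relies on.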
At the moment we will support single-linkage clustering, using a single distance threshold.
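Single-linkage clustering with one distance threshold can be sketched with scipy's hierarchical clustering functions. This is only an illustration of the technique on made-up data; the tool itself may implement the step differently, and the gene names and threshold value are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Rows are genes, columns are samples (made-up normalized abundances)
abundance = np.array([
    [1.0, 2.0, 3.0],   # geneA
    [1.0, 2.0, 3.1],   # geneB
    [3.0, 1.0, 0.0],   # geneC
    [3.1, 1.0, 0.0],   # geneD
])
gene_ids = ["geneA", "geneB", "geneC", "geneD"]

# Plays the role of the maximum distance threshold
max_dist = 0.2

# Build the single-linkage tree, then cut it at the distance threshold
tree = linkage(abundance, method="single", metric="euclidean")
labels = fcluster(tree, t=max_dist, criterion="distance")

# Group gene IDs by cluster label to get the co-abundant groups
cags = {}
for gene, label in zip(gene_ids, labels):
    cags.setdefault(int(label), []).append(gene)
```

With single linkage, a gene joins a group if it is within the threshold of any one member, so groups can chain together through intermediate genes.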
usage: find-cags.py [-h] --sample-sheet SAMPLE_SHEET --output-prefix
                    OUTPUT_PREFIX --output-folder OUTPUT_FOLDER
                    [--metric METRIC] [--normalization NORMALIZATION]
                    [--max-dist MAX_DIST] [--temp-folder TEMP_FOLDER]
                    [--results-key RESULTS_KEY]
                    [--abundance-key ABUNDANCE_KEY]
                    [--gene-id-key GENE_ID_KEY]

Find a set of co-abundant genes

optional arguments:
  -h, --help            show this help message and exit
  --sample-sheet SAMPLE_SHEET
                        Location for sample sheet (.json[.gz]).
  --output-prefix OUTPUT_PREFIX
                        Prefix for output files.
  --output-folder OUTPUT_FOLDER
                        Folder to place results. (Supported: s3://, or local
                        path).
  --metric METRIC       Distance metric calculation method, see
                        scipy.spatial.distance.
  --normalization NORMALIZATION
                        Normalization factor per-sample (median or sum).
  --max-dist MAX_DIST   Maximum distance for single-linkage clustering.
  --temp-folder TEMP_FOLDER
                        Folder for temporary files.
  --results-key RESULTS_KEY
                        Key identifying the list of gene abundances for each
                        sample JSON.
  --abundance-key ABUNDANCE_KEY
                        Key identifying the abundance value for each element
                        in the results list.
  --gene-id-key GENE_ID_KEY
                        Key identifying the gene ID for each element in the
                        results list.
  --threads THREADS     Number of threads to use.