glympsed/glympsed

Name: glympsed

Owner: glympsed

Description: null

Created: 2016-06-20 21:55:16.0

Updated: 2016-06-20 22:57:58.0

Pushed: 2016-07-28 02:16:56.0

Homepage: null

Size: 71032

Language: Jupyter Notebook

README

glympsed: Gene Pathway and Structure Discovery

glympsed is a pipeline for querying large RNAseq gene expression data sets (more than 100 samples and 30,000 genes each) for patterns of common expression, using unsupervised machine-learning methods built with the Keras library. It can help discover common expression patterns where the relationships between samples are unknown. The results can then provide evidence for generating hypotheses for further investigation.

1 Running software

The pipeline uses a combination of Python scripts (file.py) and IPython Notebooks (file.ipynb). The user is expected to provide a count matrix of samples x gene IDs.
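As a rough illustration of the expected input, the sketch below builds a small samples x genes count matrix with pandas and log-transforms it. The sample and gene names, the orientation (rows as samples), the zero-count filter, and the log1p transform are all assumptions made for illustration; they are not taken from the pipeline itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 samples x 5 genes of raw read counts.
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.poisson(lam=20, size=(4, 5)),
    index=[f"sample_{i}" for i in range(4)],
    columns=[f"gene_{j}" for j in range(5)],
)

# Drop genes with zero counts in every sample (assumed preprocessing step).
counts = counts.loc[:, counts.sum(axis=0) > 0]

# log1p compresses the wide dynamic range typical of count data.
log_counts = np.log1p(counts)
```

A real data set would normally be read from a file, for example with `pd.read_csv`, rather than simulated.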

1.1 Script Overview

1.2 Calling from the Command Line

Calling the __main__.py script from the command line:

$ python __main__.py

The __main__.py script calls the executer function to run the project.


2 Data

This project contains three sets of data: open-access Pseudomonas, simulated, and MMETSP.

2.1 Pseudomonas

This data set is used to reproduce a previous study of Pseudomonas aeruginosa (Tan et al. 2015).

2.2 Simulated

Steps for creating simulated data:

2.3 MMETSP

The Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) contains RNAseq data from 678 divergent samples representing more than 40 phyla (Keeling et al. 2014). Many of the species in this data set do not have a reference genome. Following transcriptome assembly, the procedure for annotation is not straightforward.


3 Analysis
3.1 Overview

This project makes use of a variety of dimensionality reduction, clustering, and unsupervised and semi-supervised machine-learning techniques, including PCA, ICA, and autoencoders. In addition, t-SNE was used for dimensionality reduction, clustering, and visualization.

3.2 PCA and ICA

Neither PCA nor ICA performed well.
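For reference, both decompositions can be run in a few lines. The sketch below uses scikit-learn on simulated data; the library choice, matrix size, number of components, and the simulated sources are assumptions for illustration and do not reproduce the project's analysis or its results.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)

# Hypothetical expression matrix: 60 samples x 200 genes, generated from
# 5 non-Gaussian latent sources plus a little Gaussian noise.
sources = rng.laplace(size=(60, 5))
mixing = rng.normal(size=(5, 200))
X = sources @ mixing + 0.1 * rng.normal(size=(60, 200))

# PCA: orthogonal components ordered by explained variance.
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)

# ICA: components chosen to be statistically independent.
ica = FastICA(n_components=5, random_state=0, max_iter=1000)
X_ica = ica.fit_transform(X)

print(pca.explained_variance_ratio_)
```

Each row of `X_pca` and `X_ica` is a sample described by five component scores instead of 200 gene values.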

3.3 Autoencoder

3.4 t-SNE

t-SNE can be used as an independent method for dimensionality reduction and visualization, or in combination with another dimensionality reduction technique such as PCA.
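A minimal sketch of the combined approach, assuming scikit-learn: PCA first reduces the gene dimension, then t-SNE embeds the result in two dimensions for plotting. The data here are random, and the component count and perplexity are illustrative guesses rather than values used in this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical log-scaled expression matrix: 100 samples x 500 genes.
X = rng.normal(size=(100, 500))

# Step 1: reduce to 30 principal components to denoise and speed up t-SNE.
X_reduced = PCA(n_components=30).fit_transform(X)

# Step 2: embed the reduced data in 2D. Perplexity must be smaller than
# the number of samples.
tsne = TSNE(n_components=2, perplexity=20, init="pca", random_state=0)
embedding = tsne.fit_transform(X_reduced)
```

To use t-SNE on its own, pass `X` directly to `fit_transform` and skip the PCA step.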

Some examples of t-SNE usage:

3.5 Extra

In addition to the methods mentioned above, we have visualizations of the following post-structure-discovery methods:

NOTE: Feature importance is measured as the gain in reducing misclassifications attributable to each feature (predictor, or column). In general, it indicates how much a given feature improves the model's ability to predict the nodes (classes) of the genes.
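The README does not name the classifier behind these importances, so the sketch below is an assumption: it uses scikit-learn's RandomForestClassifier, whose `feature_importances_` attribute reports the mean decrease in impurity per feature, a gain-style measure comparable to the one described in the note. The data and class labels are simulated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 300 genes (rows) described by 6 features (columns).
n_genes, n_features = 300, 6
X = rng.normal(size=(n_genes, n_features))

# Hypothetical class labels that depend only on feature 0.
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; larger means more useful.
importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]
print(ranking, importances[ranking])
```

Because the labels are built from feature 0 alone, that feature ranks first; on real data the ranking shows which predictors contribute most to separating the gene classes.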

Collaboration

This is a collaboration between Lisa Cohen (Titus Brown lab, UC Davis), Harriet Alexander (Titus Brown lab, UC Davis), Dave Harris (Ethan White lab, University of Florida), Yuan Liu (Princeton University), and Oliver Muellerklein (Wayne Getz lab, UC Berkeley). We started as team “burgers and mushrooms” at the Moore Foundation's Data Driven Discovery Barn-Raising event, held at the Mount Desert Island Biological Laboratory in Bar Harbor, Maine, from May 1-6, 2016, and coordinated by Dr. Casey Greene and Dr. Blaire Sullivan.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.