Name: glympsed
Owner: glympsed
Description: null
Created: 2016-06-20 21:55:16.0
Updated: 2016-06-20 22:57:58.0
Pushed: 2016-07-28 02:16:56.0
Homepage: null
Size: 71032
Language: Jupyter Notebook
glympsed
glympsed is a pipeline for querying large RNAseq gene expression data sets (more than 100 samples and 30,000 genes each) for patterns of common expression using the unsupervised machine-learning library Keras. The tool enables discovery of common expression patterns where the relationships between samples are unknown. The results can then provide evidence for generating hypotheses for further investigation.
The pipeline uses a combination of Python scripts (file.py) and Jupyter/IPython notebooks (file.ipynb). The user is expected to provide a matrix of counts for each sample x gene ID.
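A minimal sketch of the expected input format, assuming a CSV counts matrix with one row per sample and one column per gene ID (the file contents, sample names, and gene IDs below are made up for illustration; they are not from the project):

```python
import io
import pandas as pd

# Hypothetical counts matrix: rows are samples, columns are gene IDs.
# A real run would read the user's file, e.g. pd.read_csv("counts.csv", ...).
csv_text = """sample,geneA,geneB,geneC
S1,10,0,5
S2,3,7,2
S3,0,1,9
"""

counts = pd.read_csv(io.StringIO(csv_text), index_col="sample")
print(counts.shape)  # (samples, genes)
```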
Scripts for visualizing the clustering of the hidden-layer nodes found by the Autoencoder are in the run_unsupervised.py script within the sample-models/ directory, and can be called from main.
calc-gene-pathway-counts.R - an R script that calculates the number of genes within each pathway and the number of pathways each gene appears in, writing each count as a CSV file.
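The same two counts can be sketched in Python with pandas, assuming a long-format gene-to-pathway membership table (the column names and rows below are assumptions, not the script's actual input format):

```python
import io
import pandas as pd

# Hypothetical gene -> pathway membership table.
tsv = """gene,pathway
g1,glycolysis
g1,TCA
g2,glycolysis
g3,TCA
"""
membership = pd.read_csv(io.StringIO(tsv))

# Genes per pathway, and pathways per gene (the two CSV outputs).
genes_per_pathway = membership.groupby("pathway")["gene"].nunique()
pathways_per_gene = membership.groupby("gene")["pathway"].nunique()
print(genes_per_pathway)
print(pathways_per_gene)
```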
Autoencoder weights - visualize 50 nodes (rows) x 5549 genes (columns) to explore the clustering of nodes within a subspace of the gene space.
Autoencoder codings - visualize 50 nodes (rows) x 950 samples (columns) to explore the clustering of nodes within a subspace of the sample space.
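One simple way to explore node clustering in both spaces is a node-by-node correlation matrix; nodes that correlate strongly form the clusters one looks for in the heatmaps. The sketch below uses random stand-in matrices with the shapes stated above (the real weights and codings come from the trained Autoencoder):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the real matrices: 50 hidden nodes x 5549 genes (weights)
# and 50 nodes x 950 samples (codings).
weights = rng.normal(size=(50, 5549))
codings = rng.normal(size=(50, 950))

# Node-node correlation in gene space and in sample space (both 50 x 50).
gene_space_corr = np.corrcoef(weights)
sample_space_corr = np.corrcoef(codings)
print(gene_space_corr.shape, sample_space_corr.shape)
```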
Calling the main script from the command line:
$ python __main__.py
The __main__.py script calls the executer function to run the project.
This project contains three sets of data - open-access Pseudomonas, simulated, and MMETSP.
Reproduces a previous study with data from Pseudomonas aeruginosa (Tan et al. 2015).
Steps for creating simulated data:
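The project's actual simulation steps are not listed here; as one plausible illustration only, a counts matrix with a planted co-expression pattern can be simulated like this (all sizes and distributions below are assumptions):

```python
import numpy as np

# Illustrative sketch, NOT the project's simulation procedure:
# 100 samples x 1000 genes, with the first 50 genes sharing one
# expression pattern across samples.
rng = np.random.default_rng(42)
n_samples, n_genes = 100, 1000

base = rng.gamma(shape=2.0, scale=5.0, size=(n_samples, n_genes))
pattern = rng.gamma(shape=2.0, scale=5.0, size=(n_samples, 1))
base[:, :50] = pattern          # first 50 genes co-vary across samples
counts = rng.poisson(base)      # integer counts with Poisson noise
print(counts.shape)
```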
The Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) contains RNAseq data from 678 divergent samples representing more than 40 phyla (Keeling et al. 2014). Many of the species in this data set do not have reference genomes, and following transcriptome assembly the procedure for annotation is not straightforward.
This project makes use of a variety of dimensional reduction, clustering, and unsupervised and semi-supervised machine-learning techniques. These include PCA, ICA, and Autoencoders. In addition, t-SNE was used for dimensional reduction, clustering, and visualization.
PCA, in particular, did not perform well on this data.
t-SNE can be used as an independent method for dimensional reduction and visualization or in combination with a dimensional reduction technique - e.g. PCA.
Some examples of t-SNE usage:
PCA into 50 dimensions, followed by a 2-dimensional t-SNE scatter plot of all genes, to see whether the clustering of those 50 PCs matches the 50 hidden nodes from the Autoencoder (bad results, as expected, since PCA does not do well on this data)
t-SNE directly on the raw data to look for clustering - a 2-dimensional scatter plot of all genes - also poor, as expected, given that PCA did not work well
2-dimensional plot of all hidden-layer nodes (Autoencoder) clustered with t-SNE - this is actually two plots: one for the Autoencoder weights and another for the Autoencoder codings. In these plots some nodes really stand out from the rest - such as node 18 - which probably has a major influence on gene expression
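The PCA-then-t-SNE combination described above can be sketched with scikit-learn; the stand-in matrix here (200 "genes" x 100 "samples") is random and much smaller than the real data, and the t-SNE parameters are just reasonable defaults:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in data: 200 genes (rows) x 100 samples (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))

# Reduce to 50 principal components, then embed in 2-D with t-SNE.
X50 = PCA(n_components=50).fit_transform(X)
X2 = TSNE(n_components=2, init="pca", perplexity=30,
          random_state=0).fit_transform(X50)
print(X2.shape)  # one 2-D point per gene, ready for a scatter plot
```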
In addition to the methods mentioned above, we have visualizations of the following post-structure-discovery methods:
Feature importance of nodes (from the Autoencoder's hidden layer) as predictors for classifying genes to their highest-weight node, via random-forest multi-class classification - each gene was assigned a label corresponding to the node with the highest weight value for that gene
Feature importance of samples (the 950 samples of the original dataset, used as predictors) for classifying genes to their highest-weight node, via random-forest multi-class classification followed by boosted classification methods (GBM)
NOTE: feature importance is measured as the gain in reducing misclassifications per feature (predictor / column). In other words, feature importance measures how much a given feature improves the model's prediction of the genes' node labels.
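The labeling and random-forest step can be sketched as follows; the matrix sizes are scaled down (the real ones are 50 nodes x 5549 genes and 5549 genes x 950 samples), and the data is random, so this shows the mechanics only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_nodes, n_genes, n_samples = 5, 300, 40

weights = rng.normal(size=(n_nodes, n_genes))        # autoencoder weights
expression = rng.normal(size=(n_genes, n_samples))   # genes x samples

# Each gene is labeled with the node holding its highest weight.
labels = np.argmax(weights, axis=0)

# Random-forest multi-class classification: samples act as predictors.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(expression, labels)

# Per-sample importance for predicting the genes' node labels.
importance = rf.feature_importances_
print(importance.shape)
```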
This is a collaboration between Lisa Cohen (Titus Brown lab, UC Davis), Harriet Alexander (Titus Brown lab, UC Davis), Dave Harris (Ethan White lab, University of Florida), Yuan Liu (Princeton University), and Oliver Muellerklein (Wayne Getz lab, UC Berkeley). We started as team "burgers and mushrooms" at the Moore Foundation's Data Driven Discovery Barn-Raising event, held at the Mount Desert Island Biological Laboratory in Bar Harbor, Maine, May 1-6, 2016, and coordinated by Dr. Casey Greene and Dr. Blair Sullivan.