Name: flashpca
Owner: Wellcome Trust Sanger Institute - Human Genetics Informatics
Description: Fast Principal Component Analysis of Large-Scale Genome-Wide Data
Forked from: gabraham/flashpca
Created: 2015-09-16 13:19:44.0
Updated: 2015-09-16 13:19:45.0
Pushed: 2015-09-16 16:56:30.0
Size: 32139
Language: C++
GitHub Committers
User | Most Recent Commit | # Commits |
---|
Other Committers
User | Most Recent Commit | # Commits |
---|
flashpca performs fast principal component analysis (PCA) of single nucleotide polymorphism (SNP) data, similar to smartpca from EIGENSOFT (http://www.hsph.harvard.edu/alkes-price/software/) and shellfish (https://github.com/dandavison/shellfish). flashpca is based on the randomized PCA algorithm (Alg. 3) of Halko et al. 2011 (http://arxiv.org/abs/1007.5510).
Main features:
Gad Abraham, gad.abraham@unimelb.edu.au
G. Abraham and M. Inouye, Fast Principal Component Analysis of Large-Scale Genome-Wide Data, PLos ONE 9(4): e93766. doi:10.1371/journal.pone.0093766
(preprint: http://biorxiv.org/content/early/2014/03/11/002238)
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
Copyright © 2014 Gad Abraham. All rights reserved.
Portions of this code are based on SparSNP (https://github.com/gabraham/SparSNP), Copyright © 2011-2012 Gad Abraham and National ICT Australia (http://www.nicta.com.au).
See Releases for statically-linked version for Linux x86-64 ≥ 2.6.15
--mem high
, the default), large datasets will require
large amounts of RAM, e.g. for 15,000 individuals (43K SNPs) you'll need
about 14GB RAM, for 150,000 individuals (43K SNPs) you'll need about 145GB
RAM (estimated using https://github.com/jhclark/memusg), roughly 8 ×
min(n 2 × p + n × p, p 2 × n + n × p)
bytes, where p is number of SNPs and n is number of individuals.
Using --mem low
, you will only need about 8 × n × p bytes of
RAM.To get the latest version:
it clone git://github.com/gabraham/flashpca
On Linux:
error: no match for 'operator/' in '1 / ((Eigen::MatrixBase...
you'll need a more recent Eigen)On Mac:
rt CXX=/usr/local/bin/g++-4.7
Edit the Makefile to reflect where you have installed the Eigen headers and Boost headers and libraries:
IGEN_INC=/usr/local/include/eigen
OOST_INC=/usr/local/include/boost
OOST_LIB=/usr/local/lib
Run make:
d flashpca
ake all
Note: the compilation process will first look for a local directory named Eigen. It should contain the file signature_of_eigen3_matrix_library. Next, it will look for the directory /usr/include/eigen3 (Debian/Ubuntu location for Eigen), although those available through apt-get tend to be older versions.
First thin the data by LD (highly recommend plink2 for this):
link --bfile data --indep-pairwise 1000 50 0.05 --exclude range exclusion_regions.txt
link --bfile data --extract plink.prune.in --make-bed --out data_pruned
where exclusion_regions.txt contains:
44000000 51500000 r1
25000000 33500000 r2
8000000 12000000 r3
1 45000000 57000000 r4
(You may need to change the –indep-pairwise parameters to get a suitable number of SNPs for you dataset, 10,000-50,000 is usually enough.)
To run on the pruned dataset:
/flashpca --bfile data_pruned
We highly recommend using multi-threading, to run in multi-threaded mode with 8 threads:
/flashpca --bfile data_pruned --numthreads 8
Eigensoft-scaling of genotypes (default):
/flashpca --stand binom ...
To use genotype centering (compatible with R prcomp):
/flashpca --stand center ...
To use the low-memory version:
/flashpca --mem low ...
To append a custom suffix '_mysuffix.txt' to all output files:
/flashpca --suffix _mysuffix.txt ...
To see all options
/flashpca --help
flashpca produces the following files:
eigenvectors.txt
: the top k eigenvectors of the covariance
X XT / (n - 1), same as matrix U from the SVD of the genotype matrix
X=UDVT.pcs.txt
: the top k principal components (the projection of the data on the
eigenvectors, scaled by the eigenvalues, same as XV (or UD). This is the file
you will want to plot the PCA plot from.eigenvalues.txt
: the top k eigenvalues of X XT / (n - 1). These are the
square of the singular values D (square of sdev from prcomp).pve.txt
: the proportion of total variance explained by each of the top k
eigenvectors (the total variance is given by the trace of the covariance
matrix X XT / (n - 1), which is the same as the sum of all eigenvalues).
To get the cumulative variance explained, simply
do the cumulative sum of the variances (cumsum
in R).You must perform quality control using PLINK (at least filter using –geno, –mind, –maf, –hwe) before running flashpca on your data. You will likely get spurious results otherwise.
--kernel
rbf
).--rbfsample
, default=min(1000, n)), and can also
be specified using --sigma
.FID, IID, pheno1, pheno2, pheno3, ...
except that there must be no header line. The phenotype file must be in the same order as
the FAM file.--lambda1
and for the phenotypes is
--lambda2
.--mem low
, roughly 8 x #Samples x (#SNPs + #Phenotypes))
and high memory (--mem high
, roughly 8bytes × #SNPs × #Phenotypes)./flashpca --scca --bfile data --pheno pheno.txt \
-lambda1 1e-3 --lambda2 1e-2 --ndim 10 --numthreads 8
We optimise the penalties by finding the values that maximise the correlation of the canonical components cor(X U, Y V) in independent test data.
flashpca is now available as an independent R package.
R packages Rcpp, RcppEigen, BH, g++ compiler
To install on Mac or Linux, you can use devtools::install_github:
ibrary(devtools)
nstall_github("gabraham/flashpca/flashpcaR")
Note: on Mac you will need a GCC/G++ compiler (e.g., from http://brew.sh), and to set the correct compiler in ~/.R/Makevars to point to that compiler, e.g.,
XX=/usr/local/bin/g++-4.9 -std=c++11
(issue https://github.com/gabraham/flashpca/issues/5)
Alternatively, after cloning the git archive, install using:
CMD INSTALL flashpcaR
On Windows, see Releases for a prebuilt Windows binary package.
Example usage, assuming X
is a 100-sample by 1000-SNP matrix in dosage
coding (0, 1, 2) (an actual matrix, not a path to PLINK data)
im(X)
1] 100 1000
ibrary(flashpcaR)
<- flashpca(X, do_loadings=TRUE, verbose=TRUE, stand="binom", ndim=10,
extra=100)
PLINK data can be loaded into R either by recoding the data into raw format (recode A
) or using package plink2R.
Output:
values
: eigenvaluesvectors
: eigenvectorsprojection
: projection of sample onto eigenvectors (X V)loadings
: SNP loadings, if using a linear kernelSparse CCA of matrices X and Y, with 5 components, penalties lambda1=0.1 and lambda2=0.1:
im(X)
1] 100 1000
im(Y)
1] 100 50
<- scca(X, Y, ndim=5, lambda1=0.1, lambda2=0.1)
See the HapMap3 directory
See CHANGELOG.txt