wtsi-hgi/flashpca

Name: flashpca

Owner: Wellcome Trust Sanger Institute - Human Genetics Informatics

Description: Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Forked from: gabraham/flashpca

Created: 2015-09-16 13:19:44.0

Updated: 2015-09-16 13:19:45.0

Pushed: 2015-09-16 16:56:30.0

Homepage:

Size: 32139

Language: C++

README

flashpca

flashpca performs fast principal component analysis (PCA) of single nucleotide polymorphism (SNP) data, similar to smartpca from EIGENSOFT (http://www.hsph.harvard.edu/alkes-price/software/) and shellfish (https://github.com/dandavison/shellfish). flashpca is based on the randomized PCA algorithm (Alg. 3) of Halko et al. 2011 (http://arxiv.org/abs/1007.5510).

Main features:

Contact

Gad Abraham, gad.abraham@unimelb.edu.au

Citation

G. Abraham and M. Inouye, Fast Principal Component Analysis of Large-Scale Genome-Wide Data, PLoS ONE 9(4): e93766. doi:10.1371/journal.pone.0093766

(preprint: http://biorxiv.org/content/early/2014/03/11/002238)

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Copyright © 2014 Gad Abraham. All rights reserved.

Portions of this code are based on SparSNP (https://github.com/gabraham/SparSNP), Copyright © 2011-2012 Gad Abraham and National ICT Australia (http://www.nicta.com.au).

Download statically linked version (stable versions only)

See Releases for a statically linked version for Linux x86-64 (kernel ≥ 2.6.15).

System requirements

Building from source

To get the latest version:

git clone git://github.com/gabraham/flashpca

Requirements

On Linux:

On Mac:

To install

Edit the Makefile to reflect where you have installed the Eigen headers and Boost headers and libraries:

EIGEN_INC=/usr/local/include/eigen
BOOST_INC=/usr/local/include/boost
BOOST_LIB=/usr/local/lib

Run make:

cd flashpca
make all

Note: the compilation process will first look for a local directory named Eigen. It should contain the file signature_of_eigen3_matrix_library. Next, it will look for the directory /usr/include/eigen3 (Debian/Ubuntu location for Eigen), although those available through apt-get tend to be older versions.

Quick start

First, thin the data by LD (we highly recommend plink2 for this):

plink --bfile data --indep-pairwise 1000 50 0.05 --exclude range exclusion_regions.txt
plink --bfile data --extract plink.prune.in --make-bed --out data_pruned

where exclusion_regions.txt contains:

5 44000000 51500000 r1
6 25000000 33500000 r2
8 8000000 12000000 r3
11 45000000 57000000 r4

(You may need to change the --indep-pairwise parameters to get a suitable number of SNPs for your dataset; 10,000-50,000 is usually enough.)

To run on the pruned dataset:

./flashpca --bfile data_pruned

We highly recommend using multi-threading. To run in multi-threaded mode with 8 threads:

./flashpca --bfile data_pruned --numthreads 8

Eigensoft-scaling of genotypes (default):

./flashpca --stand binom ...

To use genotype centering (compatible with R prcomp):

./flashpca --stand center ...
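
To make the difference between the two standardisation modes concrete, here is a minimal Python sketch. It is not flashpca's own code. It assumes that "binom" follows the Eigensoft-style scaling of subtracting each SNP's mean and dividing by sqrt(p(1 - p)), where p is the estimated allele frequency (mean / 2), and that "center" only subtracts the mean. The function name standardize and the toy matrix are hypothetical, and missing genotypes are not handled.

```python
import math


def standardize(genotypes, method="binom"):
    """Standardise a samples-by-SNPs matrix of 0/1/2 dosages, one SNP (column) at a time.

    method="center": subtract each SNP's mean only.
    method="binom":  subtract the mean, then divide by sqrt(p * (1 - p)),
                     with p = mean / 2 (assumed Eigensoft-style scaling).
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    out = [[0.0] * n_snps for _ in range(n_samples)]
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        mean = sum(column) / n_samples
        if method == "center":
            scale = 1.0
        elif method == "binom":
            p = mean / 2.0
            scale = math.sqrt(p * (1.0 - p))
        else:
            raise ValueError(f"unknown method: {method}")
        if scale == 0.0:
            # Monomorphic SNP: keep the centred zeros instead of dividing by zero.
            scale = 1.0
        for i in range(n_samples):
            out[i][j] = (column[i] - mean) / scale
    return out


# Toy data: 4 samples, 2 SNPs (hypothetical values).
X = [[0, 2], [1, 1], [2, 0], [1, 1]]
centered = standardize(X, "center")
scaled = standardize(X, "binom")
```

In this toy example both SNPs have allele frequency 0.5, so the "binom" values are the centred values divided by 0.5.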

To use the low-memory version:

./flashpca --mem low ...

To append a custom suffix '_mysuffix.txt' to all output files:

./flashpca --suffix _mysuffix.txt ...

To see all options:

./flashpca --help

Output

flashpca produces the following files:

Warning

You must perform quality control using PLINK (at a minimum, filter using --geno, --mind, --maf, and --hwe) before running flashpca on your data. You will likely get spurious results otherwise.
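
As an illustration of the QC step, the Python sketch below assembles a PLINK command line that applies the four filters named above. The helper name plink_qc_command, the output prefix data_qc, and every threshold value are placeholders chosen for the example, not recommendations from the flashpca authors; pick thresholds appropriate for your own study. The sketch only builds the command and does not run PLINK.

```python
import shlex


def plink_qc_command(bfile, out, geno=0.01, mind=0.01, maf=0.01, hwe=1e-6):
    """Build (but do not run) a PLINK QC command as an argument list.

    All default thresholds are illustrative placeholders.
    """
    return [
        "plink", "--bfile", bfile,
        "--geno", str(geno),   # drop SNPs with too many missing calls
        "--mind", str(mind),   # drop samples with too many missing calls
        "--maf", str(maf),     # drop SNPs with low minor allele frequency
        "--hwe", str(hwe),     # drop SNPs failing the Hardy-Weinberg test
        "--make-bed", "--out", out,
    ]


cmd = plink_qc_command("data", "data_qc")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the thresholds have been reviewed.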

Experimental features

Kernel PCA

Sparse Canonical Correlation Analysis (SCCA)

Quick example

./flashpca --scca --bfile data --pheno pheno.txt \
--lambda1 1e-3 --lambda2 1e-2 --ndim 10 --numthreads 8

Example scripts to tune the penalties via split validation

We optimise the penalties by finding the values that maximise the correlation of the canonical components cor(X U, Y V) in independent test data.

Calling flashpca from R

flashpca is now available as an independent R package.

Requirements for building

R packages Rcpp, RcppEigen, and BH, plus a g++ compiler.

To install on Mac or Linux, you can use devtools::install_github:

library(devtools)
install_github("gabraham/flashpca/flashpcaR")

Note: on Mac you will need a GCC/G++ compiler (e.g., from http://brew.sh), and you will need to set the compiler in ~/.R/Makevars to point to it, e.g.,

CXX=/usr/local/bin/g++-4.9 -std=c++11

(issue https://github.com/gabraham/flashpca/issues/5)

Alternatively, after cloning the git archive, install using:

R CMD INSTALL flashpcaR

On Windows, see Releases for a prebuilt Windows binary package.

PCA

Example usage, assuming X is a 100-sample by 1000-SNP matrix in dosage coding (0, 1, 2) (an actual matrix, not a path to PLINK data):

dim(X)
# [1]  100 1000
library(flashpcaR)
r <- flashpca(X, do_loadings=TRUE, verbose=TRUE, stand="binom", ndim=10,
   nextra=100)

PLINK data can be loaded into R either by recoding the data into raw format (--recode A) or by using the plink2R package.

Output:

Sparse CCA

Sparse CCA of matrices X and Y, with 5 components, penalties lambda1=0.1 and lambda2=0.1:

dim(X)
# [1]  100 1000
dim(Y)
# [1]  100 50
s <- scca(X, Y, ndim=5, lambda1=0.1, lambda2=0.1)

LD-pruned HapMap3 example data

See the HapMap3 directory

Changelog (stable versions only)

See CHANGELOG.txt


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.