Name: spark-examples
Owner: Google Genomics
Description: Apache Spark jobs such as Principal Coordinate Analysis.
Created: 2014-05-02 20:24:58.0
Updated: 2017-12-19 16:06:38.0
Pushed: 2017-01-30 18:39:14.0
Size: 262
Language: Scala
The projects in this repository demonstrate working with genomic data accessible via the Google Genomics API using Apache Spark.
If you are ready to start coding, take a look at the information below. But if you are looking for a task-oriented list (e.g., How do I compute principal coordinate analysis with Google Genomics?), a better place to start is the Google Genomics Cookbook.
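Conceptually, each example fetches reads or variants from the Genomics API and aggregates them with Spark transformations. The shape of such an aggregation can be sketched in plain Scala (hypothetical record type and helper, no Spark dependency; the real examples operate on RDDs of API results):

```scala
// Hypothetical, simplified variant record -- not the actual API model class.
case class Variant(referenceName: String, start: Long, names: Seq[String])

object VariantCounts {
  // Count variants per reference, as an RDD keyBy/countByKey would at scale.
  def perReference(vs: Seq[Variant]): Map[String, Int] =
    vs.groupBy(_.referenceName).map { case (k, v) => (k, v.size) }
}
```

With Spark, the same grouping would run distributed over partitions of API responses rather than an in-memory Seq.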
git clone this repository.
If you have not already done so, follow the Google Genomics getting started instructions to set up your environment, including installing gcloud and running gcloud init.
Download and install Apache Spark.
Install SBT.
This project now includes code for calling the Genomics API using gRPC. To use gRPC, you'll need a version of ALPN that matches your JRE version.
See the ALPN documentation for a table of which ALPN jar to use for your JRE version.
Then download the correct version from here.
From the spark-examples directory, run sbt run.
If you are using gRPC, first point SBT at the ALPN jar:

export SBT_OPTS='-Xbootclasspath/p:/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar'

Then run sbt "run --help" and use the following flags to match your runtime configuration:

--output-path <arg>
--spark-master <arg>    A Spark master URL. Leave empty if using spark-submit.
--help                  Show help message
For example:
sbt "run --spark-master local[4]"
A menu should appear asking you to pick the sample to run:
Multiple main classes detected, select one to run:

 [1] com.google.cloud.genomics.spark.examples.SearchVariantsExampleKlotho
 [2] com.google.cloud.genomics.spark.examples.SearchVariantsExampleBRCA1
 [3] com.google.cloud.genomics.spark.examples.SearchReadsExample1
 [4] com.google.cloud.genomics.spark.examples.SearchReadsExample2
 [5] com.google.cloud.genomics.spark.examples.SearchReadsExample3
 [6] com.google.cloud.genomics.spark.examples.SearchReadsExample4
 [7] com.google.cloud.genomics.spark.examples.VariantsPcaDriver

Enter number:
If you are seeing java.lang.OutOfMemoryError: PermGen space errors, set the following SBT_OPTS flag:

export SBT_OPTS='-XX:MaxPermSize=256m'
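The --spark-master flag above reflects a common driver pattern: connect to an explicit master when running under sbt, and leave the flag empty so spark-submit can supply the master on a cluster. A minimal sketch of that flag handling (hypothetical object, not the repo's actual code, with println standing in for SparkContext creation):

```scala
// Hypothetical driver skeleton: reads an optional --spark-master flag,
// mirroring how the examples' CLI distinguishes sbt runs from spark-submit.
object MasterFlag {
  // Return the value following --spark-master, if present.
  def parse(args: Array[String]): Option[String] = {
    val i = args.indexOf("--spark-master")
    if (i >= 0 && i + 1 < args.length) Some(args(i + 1)) else None
  }

  def main(args: Array[String]): Unit = {
    // None => defer to spark-submit's configuration; Some(url) => use it.
    println(parse(args).getOrElse("(deferred to spark-submit)"))
  }
}
```

Running with --spark-master local[4] would select a local four-thread master; omitting it leaves the choice to the cluster launcher.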
(1) Build the assembly.
sbt assembly
(2) Deploy your Spark cluster using Google Cloud Dataproc.
gcloud beta dataproc clusters create example-cluster --scopes cloud-platform
(3) Copy the assembly jar to the master node.
gcloud compute copy-files \
  target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar example-cluster-m:~/
(4) ssh to the master.
gcloud compute ssh example-cluster-m
(5) Run one of the examples.
spark-submit --class com.google.cloud.genomics.spark.examples.SearchReadsExample1 \
  googlegenomics-spark-examples-assembly-1.0.jar
To run the variant PCA analysis on GCE, make sure you have followed all the steps in the previous section and that you are able to run at least one of the examples.
Run the example PCA analysis for BRCA1 on the 1000 Genomes Project dataset.
spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
  googlegenomics-spark-examples-assembly-1.0.jar
The analysis will output the two principal components for each sample to the console. Here is an example of the last few lines:
811 0.0286308791579312 -0.008456233951873527
812 0.030970386921818943 -0.006755469223823698
813 0.03080348019961635 -0.007475822860939408
814 0.02865238920148145 -0.008084003476919057
815 0.028798695736608034 -0.003755789964021788
816 0.026104805529612096 -0.010430718823329282
818 -0.033609576645005836 -0.026655905606186293
819 0.032019557126552155 -0.00775750983842731
826 0.03026607917284046 -0.009102704080927001
828 -0.03412964005321165 -0.025991697661590686
313 -0.03401702847363714 -0.024555217139987182
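The two columns are each sample's projection onto the first two principal components. At toy scale, the same decomposition can be sketched in plain Scala with power iteration and deflation over a small sample-by-sample similarity matrix (illustrative only; the actual driver distributes this computation with Spark):

```scala
// Illustrative top-two-components extraction -- not the repo's implementation.
object TinyPca {
  type Mat = Array[Array[Double]]

  // Matrix-vector product.
  def matVec(m: Mat, v: Array[Double]): Array[Double] =
    m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

  // Scale a vector to unit length.
  def normalize(v: Array[Double]): Array[Double] = {
    val n = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / n)
  }

  // Dominant eigenpair by power iteration (asymmetric start vector
  // avoids landing exactly orthogonal to the dominant eigenvector).
  def topEigen(m: Mat, iters: Int = 200): (Double, Array[Double]) = {
    var v = normalize(Array.tabulate(m.length)(i => 1.0 + i))
    for (_ <- 1 to iters) v = normalize(matVec(m, v))
    val lambda = matVec(m, v).zip(v).map { case (a, b) => a * b }.sum
    (lambda, v)
  }

  // Deflate m by lambda1 * v1 v1^T, then find the second component.
  def topTwo(m: Mat): (Array[Double], Array[Double]) = {
    val (l1, v1) = topEigen(m)
    val deflated: Mat = m.zipWithIndex.map { case (row, i) =>
      row.zipWithIndex.map { case (x, j) => x - l1 * v1(i) * v1(j) }
    }
    val (_, v2) = topEigen(deflated)
    (v1, v2)
  }
}
```

Projecting each sample's row of the similarity matrix onto these two eigenvectors yields a coordinate pair like those shown above.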
This pipeline is described in greater detail in How do I compute principal coordinate analysis with Google Genomics?
For more information, see https://cloud.google.com/dataproc/faq