googlegenomics/spark-examples

Name: spark-examples

Owner: Google Genomics

Description: Apache Spark jobs such as Principal Coordinate Analysis.

Created: 2014-05-02 20:24:58.0

Updated: 2017-12-19 16:06:38.0

Pushed: 2017-01-30 18:39:14.0

Homepage:

Size: 262

Language: Scala

README

spark-examples

The projects in this repository demonstrate working with genomic data accessible via the Google Genomics API using Apache Spark.

If you are ready to start coding, take a look at the information below. But if you are looking for a task-oriented list (e.g., How do I compute principal coordinate analysis with Google Genomics?), a better place to start is the Google Genomics Cookbook.

Getting Started
  1. git clone this repository.

  2. If you have not already done so, follow the Google Genomics getting started instructions to set up your environment, including installing gcloud and running gcloud init.

  3. Download and install Apache Spark.

  4. Install SBT.

  5. This project now includes code for calling the Genomics API using gRPC. To use gRPC, you'll need a version of ALPN that matches your JRE version. See the ALPN documentation for a table of which ALPN jar to use for your JRE version, then download the correct version from here.
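
Because the ALPN jar has to match the JRE, it helps to check the exact JRE version first. The helper below is only a sketch and is not part of this repository: jre_version is a hypothetical function name, and it assumes the first line of the java -version output has the common form java version "1.8.0_121" or openjdk version "1.8.0_121" (an illustrative version string).

```shell
# Hypothetical helper (not part of spark-examples): read `java -version`
# output on stdin and print the quoted version string from its first line.
jre_version() {
  sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Usage: `java -version` writes to stderr, so redirect it into the pipe.
#   java -version 2>&1 | jre_version
```

The printed version can then be looked up in the ALPN documentation table mentioned above.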

Local Run

From the spark-examples directory, run the command sbt run.

Use the following flags to match your runtime configuration:

export SBT_OPTS='-Xbootclasspath/p:/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar'
sbt "run --help"
  --output-path  <arg>
  --spark-master  <arg>      A spark master URL. Leave empty if using spark-submit.
  ...
  --help                     Show help message

For example, to run locally with four worker threads:

sbt "run --spark-master local[4]"

A menu should appear asking you to pick the sample to run:

Multiple main classes detected, select one to run:

 com.google.cloud.genomics.spark.examples.SearchVariantsExampleKlotho
 com.google.cloud.genomics.spark.examples.SearchVariantsExampleBRCA1
 com.google.cloud.genomics.spark.examples.SearchReadsExample1
 com.google.cloud.genomics.spark.examples.SearchReadsExample2
 com.google.cloud.genomics.spark.examples.SearchReadsExample3
 com.google.cloud.genomics.spark.examples.SearchReadsExample4
 com.google.cloud.genomics.spark.examples.VariantsPcaDriver

Enter number:
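
To skip the menu, sbt can be told which main class to run. The sketch below assumes an sbt version that provides the runMain task (older sbt releases spell it run-main). example_cmd is a hypothetical helper, not part of this repository, that only builds the argument string from one of the class names listed above.

```shell
# Hypothetical helper: build the sbt argument that runs one example class
# directly, using the local[4] master from the example above.
example_cmd() {
  printf 'runMain com.google.cloud.genomics.spark.examples.%s --spark-master local[4]' "$1"
}

# Usage (requires sbt, run from the spark-examples directory):
#   sbt "$(example_cmd SearchReadsExample1)"
```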

Troubleshooting:

If you are seeing java.lang.OutOfMemoryError: PermGen space errors, set the following SBT_OPTS flag:

export SBT_OPTS='-XX:MaxPermSize=256m'
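
Note that each export replaces the previous value of SBT_OPTS, so setting the PermGen flag on its own drops an earlier -Xbootclasspath/p setting. A minimal sketch that sets both together, assuming the same placeholder jar path used in the Local Run flags (ALPN_JAR is a hypothetical variable name):

```shell
# Placeholder path copied from the Local Run flags; replace it with the
# real location of your ALPN boot jar.
ALPN_JAR='/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar'

# Set both JVM options in one value so neither overwrites the other.
export SBT_OPTS="-Xbootclasspath/p:${ALPN_JAR} -XX:MaxPermSize=256m"
echo "$SBT_OPTS"
```

On Java 8 and later the permanent generation no longer exists and the JVM ignores -XX:MaxPermSize with a warning, so the second flag only matters on older JREs.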

Run on Google Compute Engine

(1) Build the assembly.

sbt assembly

(2) Deploy your Spark cluster using Google Cloud Dataproc.

gcloud beta dataproc clusters create example-cluster --scopes cloud-platform

(3) Copy the assembly jar to the master node.

gcloud compute copy-files \
  target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar  example-cluster-m:~/

(4) ssh to the master.

gcloud compute ssh example-cluster-m

(5) Run one of the examples.

spark-submit --class com.google.cloud.genomics.spark.examples.SearchReadsExample1 \
  googlegenomics-spark-examples-assembly-1.0.jar

Running PCA variant analysis on GCE

To run the variant PCA analysis on GCE, make sure you have followed all the steps in the previous section and that you are able to run at least one of the examples.

Run the example PCA analysis for BRCA1 on the 1000 Genomes Project dataset.

spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
  googlegenomics-spark-examples-assembly-1.0.jar

The analysis will output the two principal components for each sample to the console. Here is an example of the last few lines.


811     0.0286308791579312      -0.008456233951873527
812     0.030970386921818943    -0.006755469223823698
813     0.03080348019961635     -0.007475822860939408
814     0.02865238920148145     -0.008084003476919057
815     0.028798695736608034    -0.003755789964021788
816     0.026104805529612096    -0.010430718823329282
818     -0.033609576645005836   -0.026655905606186293
819     0.032019557126552155    -0.00775750983842731
826     0.03026607917284046     -0.009102704080927001
828     -0.03412964005321165    -0.025991697661590686
313     -0.03401702847363714    -0.024555217139987182
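
For plotting or further analysis it can be convenient to turn this console output into CSV. The following is a sketch under stated assumptions, not part of the repository: pca_to_csv is a hypothetical helper, the column names sample, pc1 and pc2 are made up for illustration, and it assumes every result row has exactly three whitespace-separated fields, as in the lines above.

```shell
# Hypothetical helper: read PCA rows (sample, first component, second
# component) on stdin and print them as CSV with an assumed header.
# Lines that do not have exactly three fields (e.g. log output) are skipped.
pca_to_csv() {
  awk 'BEGIN { print "sample,pc1,pc2" } NF == 3 { print $1 "," $2 "," $3 }'
}

# Usage, assuming the console output was saved to a file named pca.txt:
#   pca_to_csv < pca.txt > pca.csv
```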

This pipeline is described in greater detail in How do I compute principal coordinate analysis with Google Genomics?

Debugging

For more information, see https://cloud.google.com/dataproc/faq


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.