bigdatagenomics/eggo

Name: eggo

Owner: Big Data Genomics

Description: Ready-to-go Parquet-formatted public 'omics datasets

Created: 2015-01-06 22:56:45.0

Updated: 2017-11-06 00:51:26.0

Pushed: 2015-11-02 22:12:51.0

Size: 1008

Language: Python

README

eggo

Eggo is two things:

  1. A CLI for easily provisioning fully functioning Hadoop clusters (CDH) using Cloudera Director.

  2. A set of Parquet-formatted public 'omics data sets in S3 for easily performing integrative genomics on the Hadoop stack (including Spark and Impala).

Eggo includes all of the scripts for processing the data, including the necessary DDL statements to register the data sets with the Hive Metastore and make them accessible to Hive/Impala.
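
As a rough illustration of what registering a Parquet data set with the Hive Metastore involves, the sketch below builds a `CREATE EXTERNAL TABLE` statement in Python. The function name, table name, columns, and location are hypothetical placeholders, not Eggo's actual schema or DDL scripts.

```python
def parquet_table_ddl(table, columns, location):
    """Build a HiveQL statement registering existing Parquet files as an external table.

    All arguments are illustrative; Eggo's real DDL lives in its own scripts.
    """
    cols = ",\n  ".join("`{}` {}".format(name, typ) for name, typ in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        "  {cols}\n"
        ")\n"
        "STORED AS PARQUET\n"
        "LOCATION '{location}'"
    ).format(table=table, cols=cols, location=location)


# Hypothetical table, columns, and path, for illustration only.
example_ddl = parquet_table_ddl(
    "example_variants",
    [("contig", "STRING"), ("pos", "BIGINT"), ("ref", "STRING"), ("alt", "STRING")],
    "s3a://example-bucket/example_variants/",
)
print(example_ddl)
```

Because the table is external, dropping it from the metastore leaves the underlying Parquet files in place.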

At the moment, Eggo is geared specifically towards scaling up variant stores and related functionality (e.g., population genomics, clinical genomics).

The pre-converted data sets are hosted at a publicly available S3 bucket:

    s3://bdg-eggo

See the datasets/ directory for a list of available data sets (with metadata conforming to the DataPackage spec).
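
Because the metadata follows the DataPackage spec, it can be read with any JSON parser. The sketch below lists the resources declared in a descriptor. The descriptor content and the `list_resources` helper are made up for illustration, and the actual files under datasets/ may carry different fields.

```python
import json

# A made-up DataPackage-style descriptor, for illustration only.
EXAMPLE_DESCRIPTOR = """
{
  "name": "example-dataset",
  "title": "Example data set",
  "resources": [
    {"name": "part-1", "url": "http://example.org/data/part-1.vcf.gz"},
    {"name": "part-2", "url": "http://example.org/data/part-2.vcf.gz"}
  ]
}
"""


def list_resources(descriptor_text):
    """Return (name, location) pairs for each resource in a descriptor."""
    descriptor = json.loads(descriptor_text)
    pairs = []
    for resource in descriptor.get("resources", []):
        # Depending on the spec version, the location may be under "url" or "path".
        location = resource.get("url") or resource.get("path")
        pairs.append((resource.get("name"), location))
    return pairs


resources = list_resources(EXAMPLE_DESCRIPTOR)
for name, location in resources:
    print(name, location)
```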

Getting started

    pip install git+https://github.com/bigdatagenomics/eggo.git

Eggo makes use of Fabric, Boto, and Click.

eggo-cluster command – provisioning clusters

Simply run eggo-cluster at the command line. The eggo-cluster tool expects the following four environment variables:

    $ eggo-cluster -h
    Usage: eggo-cluster [OPTIONS] COMMAND [ARGS]...

      eggo-cluster -- provisions Hadoop clusters using Cloudera Director

    Options:
      -h, --help  Show this message and exit.

    Commands:
      config_cluster    Configure cluster for genomics, incl.
      describe          Describe the EC2 instances in the cluster
      get_director_log  DEBUG: get the Director application log from...
      login             Login to the cluster
      provision         Provision a new cluster on AWS
      reinstall_eggo    DEBUG: reinstall a specific version of eggo
      teardown          Tear down a cluster and stack on AWS
      web_proxy         Set up ssh tunnels to web UIs

A typical set of commands for creating a cluster would be:

    eggo-cluster provision -n 5  # takes about 45 min
    eggo-cluster config_cluster  # takes about 15 min

    # login to the cluster's master node
    eggo-cluster login

    # in another terminal set up local ssh tunnels to give you access to the WebUIs
    eggo-cluster web_proxy
    # open up localhost:7180 to access Cloudera Manager

    # when you are done with the cluster, tear it down
    eggo-cluster teardown
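
The same sequence can be scripted. The sketch below only assembles the argument lists for the provisioning and teardown commands and, by default, prints them rather than running anything. The names `setup_commands`, `TEARDOWN_COMMAND`, and `run_commands` are illustrative and not part of Eggo.

```python
import subprocess

# Run separately, once work on the cluster is finished.
TEARDOWN_COMMAND = ["eggo-cluster", "teardown"]


def setup_commands(num_workers):
    """Return the provision and configure commands as argv lists."""
    return [
        ["eggo-cluster", "provision", "-n", str(num_workers)],
        ["eggo-cluster", "config_cluster"],
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check_call raises CalledProcessError if a step fails,
            # so later steps are not attempted after a failure.
            subprocess.check_call(argv)


# Dry run: prints the two setup commands without executing them.
run_commands(setup_commands(5))
```

Passing argument lists rather than a shell string avoids quoting problems if a stack name or path contains spaces.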

eggo-cluster provision

    $ eggo-cluster provision -h
    Usage: eggo-cluster provision [OPTIONS]

      Provision a new cluster on AWS

    Options:
      --region TEXT                  AWS Region  [default: us-east-1]
      --stack-name TEXT              Stack name for CloudFormation and cluster
                                     name  [default: bdg-eggo]
      --availability-zone TEXT       AWS Availability Zone  [default: us-east-1b]
      --cf-template-path TEXT        Path to AWS Cloudformation Template
                                     [default:
                                     /Users/laserson/miniconda/lib/python2.7/site-p
                                     ackages/eggo-0.1.0.dev0-py2.7.egg/eggo/resour
                                     ces/cloudformation.template]
      --launcher-ami TEXT            The AMI to use for the launcher node
                                     [default: ami-00a11e68]
      --launcher-instance-type TEXT  The instance type to use for the launcher
                                     node  [default: m3.medium]
      --director-conf-path TEXT      Path to Director conf for AWS cloud
                                     [default:
                                     /Users/laserson/miniconda/lib/python2.7/site-p
                                     ackages/eggo-0.1.0.dev0-py2.7.egg/eggo/resour
                                     ces/aws.conf]
      --cluster-ami TEXT             The AMI to use for the worker nodes
                                     [default: ami-00a11e68]
      -n, --num-workers INTEGER      The total number of worker nodes to provision
                                     [default: 3]
      -h, --help                     Show this message and exit.
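
To make the option defaults concrete, the sketch below models a few of the `provision` options with `argparse` from the Python standard library. Eggo itself uses Click, so this is a stand-in for illustration rather than Eggo's own code, and `build_provision_parser` is a hypothetical name. Only options and defaults that appear in the help text are reproduced.

```python
import argparse


def build_provision_parser():
    """Model a subset of the `eggo-cluster provision` options with argparse."""
    parser = argparse.ArgumentParser(prog="eggo-cluster provision")
    parser.add_argument("--region", default="us-east-1", help="AWS Region")
    parser.add_argument("--stack-name", default="bdg-eggo",
                        help="Stack name for CloudFormation and cluster name")
    parser.add_argument("--availability-zone", default="us-east-1b",
                        help="AWS Availability Zone")
    parser.add_argument("-n", "--num-workers", type=int, default=3,
                        help="The total number of worker nodes to provision")
    return parser


# With no arguments, every option falls back to its default.
defaults = build_provision_parser().parse_args([])
# Overriding the worker count, as in `eggo-cluster provision -n 5`.
five_workers = build_provision_parser().parse_args(["-n", "5"])
print(defaults.num_workers, five_workers.num_workers, five_workers.stack_name)
```

Note that the default availability zone belongs to the default region, so changing one on the command line usually means changing the other as well.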

eggo-cluster config_cluster

    $ eggo-cluster config_cluster -h
    Usage: eggo-cluster config_cluster [OPTIONS]

      Configure cluster for genomics, incl. ADAM, OpenCB, Quince, etc

    Options:
      --region TEXT           AWS Region  [default: us-east-1]
      --stack-name TEXT       Stack name for CloudFormation and cluster name
                              [default: bdg-eggo]
      --adam / --no-adam      Install ADAM?  [default: True]
      --adam-fork TEXT        GitHub fork to use for ADAM  [default:
                              bigdatagenomics]
      --adam-branch TEXT      GitHub branch to use for ADAM  [default: master]
      --opencb / --no-opencb  Install OpenCB?  [default: False]
      --gatk / --no-gatk      Install GATK? (v4 aka Hellbender)  [default: True]
      --quince / --no-quince  Install quince?  [default: True]
      -h, --help              Show this message and exit.
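
The paired `--x / --no-x` flags each control a single boolean. As a minimal sketch of that behaviour, the example below reproduces the four toggles with `argparse` from the standard library. This is an assumption-laden stand-in for Eggo's Click-based CLI, and `add_toggle` is a hypothetical helper.

```python
import argparse


def add_toggle(parser, name, default):
    """Add a --NAME / --no-NAME pair that sets one boolean destination."""
    parser.add_argument("--" + name, dest=name, action="store_true")
    parser.add_argument("--no-" + name, dest=name, action="store_false")
    parser.set_defaults(**{name: default})


parser = argparse.ArgumentParser(prog="eggo-cluster config_cluster")
# Defaults copied from the config_cluster help text.
add_toggle(parser, "adam", True)
add_toggle(parser, "opencb", False)
add_toggle(parser, "gatk", True)
add_toggle(parser, "quince", True)

# Equivalent of `eggo-cluster config_cluster --no-gatk --opencb`.
args = parser.parse_args(["--no-gatk", "--opencb"])
print(args.adam, args.opencb, args.gatk, args.quince)
```

Only the flags that differ from the defaults need to be given, so a plain `eggo-cluster config_cluster` installs ADAM, GATK, and quince but not OpenCB.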

eggo-cluster login

    $ eggo-cluster login -h
    Usage: eggo-cluster login [OPTIONS]

      Login to the cluster

    Options:
      --region TEXT                   AWS Region  [default: us-east-1]
      --stack-name TEXT               Stack name for CloudFormation and cluster
                                      name  [default: bdg-eggo]
      --node [master|manager|launcher]
                                      The node to login to  [default: master]
      -h, --help                      Show this message and exit.

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.