Name: eggo
Owner: Big Data Genomics
Description: Ready-to-go Parquet-formatted public 'omics datasets
Created: 2015-01-06 22:56:45.0
Updated: 2017-11-06 00:51:26.0
Pushed: 2015-11-02 22:12:51.0
Homepage: null
Size: 1008
Language: Python
eggo
Eggo is two things:
A CLI for easily provisioning fully functioning Hadoop clusters (CDH) using Cloudera Director.
A set of Parquet-formatted public 'omics data sets in S3 for easily performing integrative genomics on the Hadoop stack (including Spark and Impala).
Eggo includes all of the scripts for processing the data, including the DDL statements needed to register the data sets with the Hive Metastore and make them accessible to Hive/Impala.
At the moment, Eggo is geared specifically towards scaling up variant stores and related functionality (e.g., population genomics, clinical genomics).
The pre-converted data sets are hosted in a publicly available S3 bucket:
    s3://bdg-eggo
See the datasets/ directory for a list of available data sets (with metadata conforming to the DataPackage spec).
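As a rough illustration of reading such metadata, the sketch below summarizes a Data Package descriptor that has already been loaded into a Python dict. It assumes the descriptor has a top-level `name` and a `resources` list whose entries carry a `path` or `url` field; the exact layout of the files under datasets/ is not shown here, and the example descriptor is made up for illustration.

```python
import json


def summarize_datapackage(descriptor):
    """Return (dataset name, list of resource locations) from a descriptor dict.

    Assumes each entry in "resources" stores its location under "path"
    or "url"; entries with neither key are skipped.
    """
    locations = []
    for resource in descriptor.get("resources", []):
        loc = resource.get("path") or resource.get("url")
        if isinstance(loc, str):
            locations.append(loc)
        elif isinstance(loc, list):
            # Some descriptors give several paths for one resource.
            locations.extend(loc)
    return descriptor.get("name"), locations


# Hypothetical descriptor, not an actual Eggo data set.
example = json.loads("""
{
  "name": "example-dataset",
  "resources": [
    {"name": "variants", "url": "http://example.org/variants.vcf.gz"},
    {"name": "notes"}
  ]
}
""")

name, locations = summarize_datapackage(example)
```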
    pip install git+https://github.com/bigdatagenomics/eggo.git
Eggo makes use of Fabric, Boto, and Click.
eggo-cluster command – provisioning clusters
Simply run eggo-cluster at the command line. The eggo-cluster tool expects the following four environment variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
EC2_KEY_PAIR – the name of the EC2-registered key pair to use for instance authentication
EC2_PRIVATE_KEY_FILE – the local path to the corresponding private key
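Before running eggo-cluster it can be handy to confirm that the four variables listed above are actually set. The following is a minimal sketch of such a check; the helper name `missing_eggo_env` is made up for this example and is not part of Eggo.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "EC2_KEY_PAIR",
    "EC2_PRIVATE_KEY_FILE",
)


def missing_eggo_env(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the current process environment; passing a
    plain dict makes the check easy to try out.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_eggo_env()` with no arguments inspects the current shell environment; an empty list means all four variables are present. It does not verify that the credentials are valid or that the private key file exists.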
    $ eggo-cluster -h
    Usage: eggo-cluster [OPTIONS] COMMAND [ARGS]...

      eggo-cluster -- provisions Hadoop clusters using Cloudera Director

    Options:
      -h, --help  Show this message and exit.

    Commands:
      config_cluster    Configure cluster for genomics, incl.
      describe          Describe the EC2 instances in the cluster
      get_director_log  DEBUG: get the Director application log from...
      login             Login to the cluster
      provision         Provision a new cluster on AWS
      reinstall_eggo    DEBUG: reinstall a specific version of eggo
      teardown          Tear down a cluster and stack on AWS
      web_proxy         Set up ssh tunnels to web UIs
A typical set of commands for creating a cluster would be:
    eggo-cluster provision -n 5  # takes about 45 min
    eggo-cluster config_cluster  # takes about 15 min
    # login to the cluster's master node
    eggo-cluster login
    # in another terminal set up local ssh tunnels to give you access to the WebUIs
    eggo-cluster web_proxy
    # open up localhost:7180 to access Cloudera Manager
    # when you are done with the cluster, tear it down
    eggo-cluster teardown
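The two non-interactive steps above (provision, then config_cluster) could also be driven from a small script. The sketch below assumes Python 3 and that eggo-cluster is on the PATH; the function names are invented for this example, and login, web_proxy, and teardown are deliberately left to be run by hand.

```python
import shlex
import subprocess


def cluster_setup_commands(num_workers=5):
    """Return the provisioning commands from the sequence above as argv lists."""
    return [
        ["eggo-cluster", "provision", "-n", str(num_workers)],
        ["eggo-cluster", "config_cluster"],
    ]


def run_setup(num_workers=5, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True makes subprocess raise on a non-zero exit status, so
    config_cluster is never attempted after a failed provision.
    """
    commands = cluster_setup_commands(num_workers)
    for argv in commands:
        print("$ " + " ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            subprocess.run(argv, check=True)
    return commands
```

With the default `dry_run=True` the script only prints what it would run, which is a cheap way to review the commands before committing to roughly an hour of provisioning.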
eggo-cluster provision
    $ eggo-cluster provision -h
    Usage: eggo-cluster provision [OPTIONS]

      Provision a new cluster on AWS

    Options:
      --region TEXT                  AWS Region [default: us-east-1]
      --stack-name TEXT              Stack name for CloudFormation and cluster
                                     name [default: bdg-eggo]
      --availability-zone TEXT       AWS Availability Zone [default: us-east-1b]
      --cf-template-path TEXT        Path to AWS Cloudformation Template
                                     [default: /Users/laserson/miniconda/lib/python2.7/site-packages/eggo-0.1.0.dev0-py2.7.egg/eggo/resources/cloudformation.template]
      --launcher-ami TEXT            The AMI to use for the launcher node
                                     [default: ami-00a11e68]
      --launcher-instance-type TEXT  The instance type to use for the launcher
                                     node [default: m3.medium]
      --director-conf-path TEXT      Path to Director conf for AWS cloud
                                     [default: /Users/laserson/miniconda/lib/python2.7/site-packages/eggo-0.1.0.dev0-py2.7.egg/eggo/resources/aws.conf]
      --cluster-ami TEXT             The AMI to use for the worker nodes
                                     [default: ami-00a11e68]
      -n, --num-workers INTEGER      The total number of worker nodes to provision
                                     [default: 3]
      -h, --help                     Show this message and exit.
eggo-cluster config_cluster
    $ eggo-cluster config_cluster -h
    Usage: eggo-cluster config_cluster [OPTIONS]

      Configure cluster for genomics, incl. ADAM, OpenCB, Quince, etc

    Options:
      --region TEXT           AWS Region [default: us-east-1]
      --stack-name TEXT       Stack name for CloudFormation and cluster name
                              [default: bdg-eggo]
      --adam / --no-adam      Install ADAM? [default: True]
      --adam-fork TEXT        GitHub fork to use for ADAM [default:
                              bigdatagenomics]
      --adam-branch TEXT      GitHub branch to use for ADAM [default: master]
      --opencb / --no-opencb  Install OpenCB? [default: False]
      --gatk / --no-gatk      Install GATK? (v4 aka Hellbender) [default: True]
      --quince / --no-quince  Install quince? [default: True]
      -h, --help              Show this message and exit.
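Each tool in the help text above is toggled by a paired flag (for example --opencb / --no-opencb). As a small sketch of how a wrapper script might assemble such a command line, the helper below maps Python booleans onto those flags. The function name is hypothetical, its defaults simply mirror the defaults printed above, and the branch name in the usage example is made up.

```python
def config_cluster_argv(adam=True, opencb=False, gatk=True, quince=True,
                        adam_fork=None, adam_branch=None):
    """Build an argv list for `eggo-cluster config_cluster`.

    Every tool is passed explicitly as --NAME or --no-NAME so the
    resulting command does not depend on the CLI's own defaults.
    """
    argv = ["eggo-cluster", "config_cluster"]
    tools = (("adam", adam), ("opencb", opencb), ("gatk", gatk), ("quince", quince))
    for name, enabled in tools:
        argv.append("--" + name if enabled else "--no-" + name)
    # Fork and branch are only added when the caller overrides them.
    if adam_fork is not None:
        argv += ["--adam-fork", adam_fork]
    if adam_branch is not None:
        argv += ["--adam-branch", adam_branch]
    return argv


# Example: enable OpenCB and build ADAM from a hypothetical branch.
custom = config_cluster_argv(opencb=True, adam_branch="my-feature")
```

The returned list can be handed to `subprocess.run` directly, which avoids shell quoting issues.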
eggo-cluster login
    $ eggo-cluster login -h
    Usage: eggo-cluster login [OPTIONS]

      Login to the cluster

    Options:
      --region TEXT                  AWS Region [default: us-east-1]
      --stack-name TEXT              Stack name for CloudFormation and cluster
                                     name [default: bdg-eggo]
      -n, --node [master|manager|launcher]
                                     The node to login to [default: master]
      -h, --help                     Show this message and exit.