DataBiosphere/cgp-deployment

Name: cgp-deployment

Owner: Data Biosphere

Description: The UCSC Genomics Institute's Computational Genomics Platform (CGP). This repo contains the Docker compose-based deployment process.

Forked from: BD2KGenomics/dcc-ops

Created: 2018-03-13 19:50:05.0

Updated: 2018-05-15 04:57:34.0

Pushed: 2018-05-15 21:03:03.0

Homepage: https://ucsc-cgp.org

Size: 1780

Language: Shell

README

dcc-ops

About

This repository contains the Docker Compose files and bootstrap scripts used to deploy the UCSC Genomics Institute's Computational Genomics Platform on AWS. The system is designed to receive genomic data, run analysis at scale on the cloud, and return analyzed results to authorized users. It uses, supports, and drives development of several key GA4GH APIs and open source projects. In many ways it is the generalization of the PCAWG cloud infrastructure developed for that project and a potential reference implementation for the NIH Commons concept.

Components

The system has components fulfilling a range of functions, all of which are open source and can be used independently or together.

Computational Genomics Platform architecture

These components are set up with the install process available in this repository:

These are related projects that are either already set up and available for use on the web or are used by components above.

Installing the Platform

The directions below assume you are using AWS. We will include additional cloud instructions as dcc-ops matures.

Collecting Information

Make sure you have:

Starting an AWS VM

Use the AWS console or command line tool to create a host. For example:

We will refer to this as the host VM throughout the documentation below. It is the machine that runs all the Docker containers for the components below.

You should make a note of your security group name and ID and ensure you can connect via ssh.

NOTE: We have had problems when uploading big files to Virginia (~25GB). If possible, set up your AWS anywhere else but Virginia.

AWS Tasks

Make sure you do the following:

Setup for Redwood

Here is a summary of what you need to do. See the Redwood README for details.

Re-route Service Endpoints

Redwood exposes storage, metadata, and auth services. Each of these should be made a subdomain of your “base domain”.
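
As a rough illustration of that layout, the sketch below prints one hostname per service for a given base domain. The labels storage, metadata, and auth are assumptions taken from the service names in the text, and example.com is a placeholder; the Redwood README remains the authority on the actual names.

```shell
#!/usr/bin/env bash
# Print one candidate hostname per Redwood service for a base domain.
# ASSUMPTION: the subdomain labels simply mirror the service names
# (storage, metadata, auth); confirm them against the Redwood README.
redwood_endpoints() {
    local base_domain="$1"
    local service
    for service in storage metadata auth; do
        printf '%s.%s\n' "$service" "$base_domain"
    done
}

# example.com is a placeholder base domain.
redwood_endpoints "example.com"
```

Each printed name would need a DNS record pointing at the host VM.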

Make your S3 Bucket

Create an AWS IAM Encryption Key

Now we're ready to install Redwood.

Setup for Consonance

See the Consonance README for details. Consonance assumes you have an SSH key created and uploaded to a location on your host VM. Other than that, there are no additional pre-setup tasks.

Adding private SSH key to your VM

Add your private SSH key under ~/.ssh/<your_key>.pem. This is typically the same key that you use to SSH to your host VM; either way, it needs to be a key created on the AWS console so Amazon is aware of it. Then run chmod 400 ~/.ssh/<your_key>.pem so your key is not publicly viewable.
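
The permission step can be wrapped in a small check, sketched here under the assumption that you want the script to fail loudly when the key is missing. The function name secure_key is hypothetical and not part of dcc-ops.

```shell
#!/usr/bin/env bash
# Restrict a private key to owner read-only and confirm the resulting mode.
# secure_key is a hypothetical helper name, not part of dcc-ops.
secure_key() {
    local key_path="$1"
    local mode
    [ -f "$key_path" ] || { echo "no such key: $key_path" >&2; return 1; }
    chmod 400 "$key_path" || return 1
    # Read the mode from ls -l rather than stat, because stat's flags
    # differ between GNU and BSD systems.
    mode=$(ls -l "$key_path" | cut -c1-10)
    [ "$mode" = "-r--------" ]
}
```

Called as secure_key ~/.ssh/<your_key>.pem, it returns success only if the file exists and ends up with mode 400.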

TODO: Creating an AMI for Worker Node

Follow the instructions here to create an AMI for the worker node. Use an Ubuntu 14.04 base box. You can use the official Ubuntu release. You may need to make your own AMI with more storage. Make sure you create it in the same region where your VM and S3 buckets are located.

Consonance CLI on the Host VM

You probably want to install the Consonance command line on the host VM so you can submit work from outside the Docker containers running the various Consonance services. Likewise, you can install the CLI on other hosts and submit work to the queue.

Download the consonance command line from the Consonance releases page:

https://github.com/Consonance/consonance/releases

For example:

wget https://github.com/Consonance/consonance/releases/download/2.0.0-alpha.15/consonance
sudo mv consonance /usr/local/bin/
sudo chmod a+x /usr/local/bin/consonance
# running the command will install the tool and prompt you to enter your token; get the token after running install_bootstrap
consonance

Follow the interactive directions for setting up this CLI. You will need the elastic IP you set up previously (or, better yet, the “base domain” from above).

Setup for Boardwalk

Here is a summary of what you need to do. See the Boardwalk README for details.

Changing the maximum virtual memory in your VM

ElasticSearch requires vm.max_map_count to be at least 262144. The bootstrap installer sets this for you, but the change is not permanent: if you restart your VM, vm.max_map_count reverts to its default. To make the change survive a restart, edit the file /etc/sysctl.conf on your VM and add or edit this line: vm.max_map_count=262144.
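
One way to make that edit repeatable is a small idempotent helper, sketched below. The function name set_sysctl_line is hypothetical. It rewrites an existing entry or appends a new one, and it takes the file path as an optional third argument so it can be tried on a scratch copy before touching /etc/sysctl.conf (which requires root).

```shell
#!/usr/bin/env bash
# Idempotently set "key=value" in a sysctl-style config file.
# set_sysctl_line is a hypothetical helper name; the file defaults to
# /etc/sysctl.conf, which needs root privileges to modify.
set_sysctl_line() {
    local key="$1"
    local value="$2"
    local conf="${3:-/etc/sysctl.conf}"
    local pattern tmp status
    # Escape dots so the key is matched literally in the regexes below.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$conf"; then
        # Rewrite the existing entry via a temp file; sed -i is avoided
        # because its syntax differs between GNU and BSD sed.
        tmp=$(mktemp) || return 1
        sed -E "s/^[[:space:]]*${pattern}[[:space:]]*=.*/${key}=${value}/" "$conf" > "$tmp" &&
            cat "$tmp" > "$conf"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
    # No existing entry: append one.
    printf '%s=%s\n' "$key" "$value" >> "$conf"
}
```

Run as root, set_sysctl_line vm.max_map_count 262144 updates /etc/sysctl.conf, and sysctl -p then applies the file without a reboot.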

Create a Google OAuth2 app

You need to create a Google OAuth2 app to enable login and token download from the dashboard. If you don't want to enable this on the dashboard during installation, simply enter a random string when asked for the Google Client ID and the Google Client Secret. You can consult here under “Creating A Google Project” if you want to read more details. Here is a summary of what you need to do:

Please note: at this point, the dashboard only accepts login from emails with a 'ucsc.edu' domain. In the future, it will support different email domains.

Running the Installer

Once the above setup is done, clone this repository onto your server and run the bootstrap script:

# note, you may need to checkout the particular branch or release tag you are interested in...
git clone https://github.com/BD2KGenomics/dcc-ops.git && cd dcc-ops && sudo bash install_bootstrap

Installer Question Notes

The install_bootstrap script will ask you to configure each service interactively.

Once the installer completes, the system should be up and running. Congratulations! Run docker ps to get an idea of what's running.

Post-Installation
TODO

Here are things we need to explain how to do post install:

Confirm Proper Function

To test that everything installed successfully, you can run cd test && ./integration.sh. This will do an upload and download with core-client and check the results.

Run Consonance

Make sure you have the consonance CLI installed.

Make a run.json:

{
    "input_file": {
        "class": "File",
        "path": "https://raw.githubusercontent.com/briandoconnor/dockstore-tool-md5sum/master/md5sum.input"
    }
}

consonance run --tool-dockstore-id quay.io/briandoconnor/dockstore-tool-md5sum:1.0.3 --flavour r3.8xlarge --run-descriptor run.json
# and it produces this
"job_uuid" : "66a67327-ccd3-4af0-a5c8-688fb52da778"

# you can check the status
consonance status --job_uuid 66a67327-ccd3-4af0-a5c8-688fb52da778
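
To avoid copying the UUID by hand, the output line can be parsed with sed. This is a minimal sketch that assumes consonance run prints the job_uuid line to standard output in exactly the format shown above; the function name extract_job_uuid is hypothetical.

```shell
#!/usr/bin/env bash
# Read text on stdin and print the value of the first-level "job_uuid" field.
# ASSUMPTION: the line has the form  "job_uuid" : "<uuid>"  as in the
# sample output above. extract_job_uuid is a hypothetical helper name.
extract_job_uuid() {
    sed -n 's/.*"job_uuid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Demonstration using the sample output line from this README.
echo '"job_uuid" : "66a67327-ccd3-4af0-a5c8-688fb52da778"' | extract_job_uuid
```

With that in place, the output of consonance run can be piped through extract_job_uuid and the result passed to consonance status --job_uuid.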

Upload and Download

End users should be directed to use the quay.io/ucsc_cgl/core-client docker image as documented in its README. The test/integration.sh file also demonstrates normal core-client usage.

Here is a sample command you can run from the test folder to do an upload:

NOTE: Make sure you create an access token for yourself first. You can do so by running within dcc-ops the command redwood/cli/bin/redwood token create -u myemail@ucsc.edu -s 'aws.upload aws.download'. This will create a global token that you can use for testing for upload and download on any project. End users should only be given project-specific scopes like aws.PROJECT.upload.

 docker run --rm -it -e ACCESS_TOKEN=<your_token> -e REDWOOD_ENDPOINT=<your_url.com> \
        -v $(pwd)/manifest.tsv:/dcc/manifest.tsv -v $(pwd)/samples:/samples \
        -v $(pwd)/outputs:/outputs quay.io/ucsc_cgl/core-client:1.1.0-alpha spinnaker-upload \
        --force-upload /dcc/manifest.tsv

Here is a sample command you can run to download files using a manifest file. On the dashboard, go to the “BROWSER” tab, and click on “Download Manifest” at the bottom of the list. Save this file, and run the following command. This will download the files specified in the manifest:

 docker run --rm -it -e ACCESS_TOKEN=<your_token> -e REDWOOD_ENDPOINT=<your_url.com> \
        -v $(pwd)/<your_manifest_file_name.tsv>:/dcc/dcc-spinnaker-client/data/manifest.tsv \
        -v $(pwd)/samples:/samples -v $(pwd)/outputs:/outputs \
        -v $(pwd):/dcc/data quay.io/ucsc_cgl/core-client:1.1.0-alpha \
        redwood-download /dcc/dcc-spinnaker-client/data/manifest.tsv /dcc/data/
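
Both docker run commands above depend on an access token and an endpoint. If you keep these in shell variables rather than typing them inline, a preflight check like the sketch below can catch an empty value before the container starts. It assumes you export ACCESS_TOKEN and REDWOOD_ENDPOINT yourself; the function name require_env is hypothetical.

```shell
#!/usr/bin/env bash
# Return non-zero if any named environment variable is unset or empty.
# require_env is a hypothetical helper name, not part of core-client.
require_env() {
    local name value missing=0
    for name in "$@"; do
        # eval performs the indirect lookup portably; pass only trusted,
        # literal variable names to this function.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call is require_env ACCESS_TOKEN REDWOOD_ENDPOINT, followed by the docker run command with -e ACCESS_TOKEN="$ACCESS_TOKEN" -e REDWOOD_ENDPOINT="$REDWOOD_ENDPOINT" only if the check succeeds.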

Running RNA-Seq Analysis on Sample Data

To do RNA-Seq Analysis, you must first upload reference files to Redwood. You can obtain the reference files by running from within dcc-ops:

reference/download_reference.sh

This will download the files under reference/samples. You can then use the core client to do a spinnaker upload as described previously and use the manifest.tsv within the reference folder.

Once you have successfully uploaded the reference files, you can start submitting FASTQ files to Redwood to run analysis on them. See the help section on the file browser for more information on the template. Use RNA-Seq or scRNA-Seq when filling out the Submitter Experimental Design column on your manifest.

Troubleshooting

If something goes wrong, you can open an issue or contact a human.

Tips
To Do

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.