Name: cgp-data-store
Owner: BD2K Center for Translational Genomics
Owner: Data Biosphere
Description: Design specs and prototypes for the HCA Data Storage System ("blue box")
Forked from: HumanCellAtlas/data-store
Created: 2018-02-07 03:58:54.0
Updated: 2018-04-30 12:52:45.0
Pushed: 2018-04-27 21:28:08.0
Homepage: https://dss.staging.data.humancellatlas.org/
Size: 9105
Language: Python
This repository contains design specs and prototypes for the replicated data storage system (aka the “blue box”) of the Human Cell Atlas. We use this Google Drive folder for design docs and meeting notes, and this Zenhub board to track our GitHub work.
The prototype in this repository uses Swagger to specify the API in dss-api.yml, and Connexion to map the API specification to its implementation in Python.
You can use the Swagger Editor to review and edit the prototype API specification. When the prototype app is running, the Swagger spec is also available at `/v1/swagger.json`.
The prototype is deployed continuously from the `master` branch, with the resulting producer and consumer API available at https://dss.staging.data.humancellatlas.org/.
The HCA DSS prototype development environment requires Python 3.6+. To install the development dependencies, run `pip install -r requirements-dev.txt` in this directory.
The HCA DSS prototype itself also requires Python 3.6+. To install the runtime dependencies, run `pip install -r requirements.txt` in this directory.
Tests also use data from the data-bundle-examples subrepository. Run: git submodule update --init
Environment variables are required for test and deployment. The required environment variables and their default values are defined in the file `environment`. To customize the values of these environment variables:
Copy `environment.local.example` to `environment.local`
Edit `environment.local` to add custom entries that override the default values in `environment`
Run `source environment` now and whenever these environment files are modified.
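The override mechanism can be sketched as follows, assuming (as the DSS `environment` file does) that `environment` sources `environment.local` when present. The file contents below are hypothetical miniatures, not the real files:

```shell
# Build a scratch directory with toy versions of the two files; the real
# files in the repository define many more variables.
demo_dir=$(mktemp -d)
cd "$demo_dir"

cat > environment <<'EOF'
export DSS_DEPLOYMENT_STAGE=dev
# pick up local overrides if the file exists
[ -f environment.local ] && source environment.local
EOF

cat > environment.local <<'EOF'
export DSS_DEPLOYMENT_STAGE=mystage
EOF

source environment
# values from environment.local win over the defaults in environment
echo "$DSS_DEPLOYMENT_STAGE"
```

This is why you must re-run `source environment` after editing either file: the values only change in shells that have sourced the updated files.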
Follow the instructions in http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html to get the `aws` command-line utility.
Create an S3 bucket that you want DSS to use, and in `environment.local` set the environment variable `DSS_S3_BUCKET` to the name of that bucket. Make sure the bucket region is consistent with `AWS_DEFAULT_REGION` in `environment.local`.
Repeat the previous step for `DSS_S3_CHECKOUT_BUCKET`, `DSS_S3_CHECKOUT_BUCKET_TEST`, and `DSS_S3_CHECKOUT_BUCKET_TEST_FIXTURES`.
If you wish to run the unit tests, you must create two more S3 buckets, one for test data and another for test fixtures, and set the environment variables `DSS_S3_BUCKET_TEST` and `DSS_S3_BUCKET_TEST_FIXTURES` to the names of those buckets.
Hint: To create S3 buckets from the command line, use `aws s3 mb --region REGION s3://BUCKET_NAME/`.
Follow the instructions in https://cloud.google.com/sdk/downloads to get the `gcloud` command-line utility.
In the Google Cloud Console, select the correct Google user account on the top right and the correct GCP project in the drop-down in the top center. Go to “IAM & Admin”, then “Service accounts”, then click “Create service account” and select “Furnish a new private key”. Under “Roles” select “Project > Owner”, “Project > Service Account Actor” and “Cloud Functions > Cloud Function Developer”. Create the account and download the service account key JSON file.
In `environment.local`, set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of the service account key JSON file.
Choose a region that has support for Cloud Functions and set `GCP_DEFAULT_REGION` to that region. See https://cloud.google.com/about/locations/ for a list of supported regions.
Run `gcloud auth activate-service-account --key-file=/path/to/service-account.json`.
Run `gcloud config set project PROJECT_ID`, where PROJECT_ID is the ID, not the name (!), of the GCP project you selected earlier.
Enable the required APIs: `gcloud service-management enable cloudfunctions.googleapis.com; gcloud service-management enable runtimeconfig.googleapis.com`
Generate OAuth application secrets to be used for your instance:
1) Go to https://console.developers.google.com/apis/credentials (you may have to select Organization and Project again)
2) Click Create Credentials and select OAuth client
3) For Application type choose Other
4) Under application name, use `hca-dss-` followed by the stage name, i.e., the value of DSS_DEPLOYMENT_STAGE. This is a convention only and carries no technical significance.
5) Click Create. Don't worry about noting the client ID and secret; click OK.
6) Click the edit icon for the new credentials and click Download JSON
7) Place the downloaded JSON file into the project root as application_secrets.json
Create a Google Cloud Storage bucket, and in `environment.local` set the environment variable `DSS_GS_BUCKET` to the name of that bucket. Make sure the bucket region is consistent with `GCP_DEFAULT_REGION` in `environment.local`.
If you wish to run the unit tests, you must create two more buckets, one for test data and another for test fixtures, and set the environment variables `DSS_GS_BUCKET_TEST` and `DSS_GS_BUCKET_TEST_FIXTURES` to the names of those buckets.
Hint: To create GCS buckets from the command line, use `gsutil mb -c regional -l REGION gs://BUCKET_NAME/`.
Set the environment variables `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY`.
Run `./dss-api` in the top-level `data-store` directory.
Check that software packages required to test and deploy are available, and install them if necessary.
Run: make --dry-run
To run the tests, test fixture data must be set up using the following command. This command will completely empty the given buckets before populating them with test fixture data, so please ensure the correct bucket names are provided.
tests/fixtures/populate.py --s3-bucket $DSS_S3_BUCKET_TEST_FIXTURES --gs-bucket $DSS_GS_BUCKET_TEST_FIXTURES
Set the environment variable `DSS_TEST_ES_PATH` to the path of the `elasticsearch` binary on your machine. Then, to perform the data store tests, run `make test` in the top-level `data-store` directory.
Assuming the tests have passed above, the next step is to manually deploy. See the section below for information on CI/CD with Travis if continuous deployment is your goal.
The AWS Elasticsearch Service is used for metadata indexing. Currently, the AWS Elasticsearch Service must be configured manually. The AWS Elasticsearch Service domain name must either:
have the value `dss-index-$DSS_DEPLOYMENT_STAGE`, or
the environment variable `DSS_ES_DOMAIN` must be set to the domain name of the AWS Elasticsearch Service instance to be used.
For typical development deployments the t2.small.elasticsearch instance type is more than sufficient.
Now deploy using make:
make deploy
Set up AWS API Gateway. The gateway is automatically set up for you and associated with the Lambda. However, to get a friendly domain name, you need to follow the directions here. In summary:
Generate a HTTPS certificate via AWS Certificate Manager (ACM). See note below on choosing a region for the certificate.
Set up the custom domain name in the API gateway console. See note below on the DNS record type.
In Amazon Route 53 point the domain to the API gateway
In the API Gateway, fill in the endpoints for the custom domain name, e.g. Path=`/`, Destination=`dss` and `dev`. These might be different based on the profile used (dev, stage, etc.).
Set the environment variable `API_HOST` to your domain name in the `environment.local` file.
Note: The certificate should be in the same region as the API gateway or, if that's not possible, in us-east-1. If the ACM certificate's region is us-east-1 and the API gateway is in another region, the type of the custom domain name must be Edge Optimized. Provisioning such a domain name typically takes up to 40 minutes because the certificate needs to be replicated to all involved CloudFront edge servers. The corresponding record set in Route 53 needs to be an alias A record, not a CNAME or a regular A record, and it must point to the CloudFront host name associated with the edge-optimized domain name. Starting November 2017, API Gateway supports regional certificates, i.e., certificates in regions other than us-east-1. This makes it possible to match the certificate's region with that of the API gateway and cuts the provisioning of the custom domain name down to seconds. Simply create the certificate in the same region as that of the API gateway, create a custom domain name of type Regional, and in Route 53 add a CNAME record set that points to the gateway's canonical host name.
If successful, you should be able to see the Swagger API documentation at:
https://<domain_name>
And you should be able to list bundles like this:
curl -X GET "https://<domain_name>/v1/bundles" -H "accept: application/json"
Now that you have deployed the data store, the next step is to use the HCA Data Store CLI to upload and download data to the system. See data-store-cli for installation instructions. The client requires you to change `hca/api_spec.json` to point to the correct host, schemes, and, possibly, basePath. Examples of CLI use:
# list bundles
hca get-bundles
# upload full bundle
hca upload --replica aws --staging-bucket staging_bucket_name data-bundle-examples/smartseq2/paired_ends
Now that you've uploaded data, the next step is to confirm the indexing is working properly and you can query the indexed metadata.
hca post-search --query '
{
"query": {
"bool": {
"must": [{
"match": {
"files.sample_json.donor.species": "Homo sapiens"
}
}, {
"match": {
"files.assay_json.single_cell.method": "Fluidigm C1"
}
}, {
"match": {
"files.sample_json.ncbi_biosample": "SAMN04303778"
}
}]
}
}
}'
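The query above follows a regular pattern (a `bool` query whose `must` clause is a list of `match` clauses), so it can be generated programmatically. The helper below is a convenience sketch of ours, not part of the data-store codebase:

```python
import json

def match_all_of(criteria):
    """Build an Elasticsearch bool/must query from a dict mapping
    metadata field paths to required values (illustrative helper only)."""
    return {"query": {"bool": {"must": [
        {"match": {field: value}} for field, value in criteria.items()
    ]}}}

query = match_all_of({
    "files.sample_json.donor.species": "Homo sapiens",
    "files.assay_json.single_cell.method": "Fluidigm C1",
    "files.sample_json.ncbi_biosample": "SAMN04303778",
})
# The serialized form is the same document passed to `hca post-search` above.
print(json.dumps(query, indent=4))
```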
We use Travis CI for continuous integration testing and deployment. When `make test` succeeds, Travis CI deploys the application into the `dev` stage on AWS for every commit that goes on the master branch. This behavior is defined in the `deploy` section of `.travis.yml`.
Encrypted environment variables give Travis CI the AWS credentials needed to run the tests and deploy the app. Run `scripts/authorize_aws_deploy.sh IAM-PRINCIPAL-TYPE IAM-PRINCIPAL-NAME` (e.g. `authorize_aws_deploy.sh group travis-ci`) to give that principal the permissions needed to deploy the app. Because a group policy has a higher size limit (5,120 characters) than a user policy (2,048 characters), it is advisable to apply this to a group and add the principal to that group. Because this is a limited set of permissions, it does not have write access to IAM. To set up the IAM policies for resources in your account that the app will use, run `make deploy` using privileged account credentials once from your workstation. After this is done, Travis CI will be able to deploy on its own. You must repeat the `make deploy` step from a privileged account any time you change the IAM policy templates in `iam/policy-templates/`.
The direct runtime dependencies of this project are defined in `requirements.txt.in`. Direct development dependencies are defined in `requirements-dev.txt.in`. All dependencies, direct and transitive, are defined in the corresponding `requirements.txt` and `requirements-dev.txt` files. The latter two can be generated using `make requirements.txt` or `make requirements-dev.txt`, respectively. Modifications to any of these four files need to be committed. This process is aimed at making dependency handling more deterministic without accumulating the upgrade debt that would be incurred by simply pinning all direct and transitive dependencies. Avoid being overly restrictive when constraining the allowed version range of direct dependencies in `requirements.txt.in` and `requirements-dev.txt.in`.
If you need to modify or add a direct runtime dependency declaration, follow the steps below:
1) Make sure there are no pending changes to `requirements.txt` or `requirements-dev.txt`.
2) Make the desired change to `requirements.txt.in` or `requirements-dev.txt.in`.
3) Run `make requirements.txt`. Run `make requirements-dev.txt` if you have modified `requirements-dev.txt.in`.
4) Visually check the changes to `requirements.txt` and `requirements-dev.txt`.
5) Commit them with a message like `Bumping dependencies`.
You now have two commits: one that catches up with updates to transitive dependencies, and one that tracks your explicit change to a direct dependency. This process applies to development dependencies as well, using `requirements-dev.txt` and `requirements-dev.txt.in`, respectively.
If you wish to re-pin all the dependencies, run `make refresh_all_requirements`. It is advisable to do a full test-deploy-test cycle after this (the test after the deploy is required to test the Lambdas).
Always use a module-level logger, call it `logger`, and initialize it as follows:
import logging
logger = logging.getLogger(__name__)
Do not configure logging at module scope. It should be possible to import any module without side-effects on
logging. The dss.logging
module contains functions that configure logging for this application, its Lambda
functions and unit tests.
When logging a message, pass either
an f-string as the first and only positional argument, or
a %-string as the first argument and substitution values as subsequent arguments.
Do not mix the two string interpolation methods. If you mix them, any percent sign in a substituted value will raise an exception. In other words, use
logger.info(f"Foo is {foo} and bar is {bar}")
or
logger.info("Foo is %s and bar is %s", foo, bar)
but not
logger.info(f"Foo is {foo} and bar is %s", bar)
Keyword arguments can be used safely in conjunction with f-strings:
logger.info(f"Foo is {foo}", exc_info=True)
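To see why mixing is dangerous: once extra positional arguments are passed, the logging framework applies %-formatting to the whole message, including any percent signs the f-string has already substituted in. A stand-alone demonstration of that failure mode (variable names are ours):

```python
foo = "100%"   # a substituted value that happens to contain a percent sign
bar = "ready"

# Safe: %-style template, interpolation applied to the arguments separately
safe = "Foo is %s and bar is %s" % (foo, bar)

# Unsafe: the f-string bakes "100%" into the template, so when %-formatting
# is later applied for the remaining argument, the stray "%" is parsed as
# the start of a conversion specifier and formatting fails. (Inside logging,
# this surfaces as an error traceback instead of your log message.)
mixed_template = f"Foo is {foo} and bar is %s"
try:
    mixed_template % (bar,)
    raised = False
except (TypeError, ValueError):
    raised = True

print(safe)     # Foo is 100% and bar is ready
print(raised)   # True
```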
To enable verbose logging by application code, set the environment variable `DSS_DEBUG` to `1`. To enable verbose logging by dependencies, set `DSS_DEBUG` to `2`. To disable verbose logging, unset `DSS_DEBUG` or set it to `0`.
To assert in tests that certain messages were logged, use the `dss` logger or one of its children:
dss_logger = logging.getLogger('dss')
with self.assertLogs(dss_logger) as log_monitor:
    # do stuff
or
import dss
with self.assertLogs(dss.logger) as log_monitor:
    # do stuff
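Put together, the pattern looks like this in a self-contained test case. The logger name `dss.example` is a stand-in child of a 'dss'-style root logger; no actual dss import is needed for the sketch:

```python
import logging
import unittest

logger = logging.getLogger("dss.example")  # a child of the 'dss' logger

class LogAssertionTest(unittest.TestCase):
    def test_message_is_logged(self):
        # assertLogs captures records emitted anywhere under the 'dss'
        # logger hierarchy while the block runs
        with self.assertLogs("dss", level="INFO") as log_monitor:
            logger.info("checkout of bundle %s complete", "1234")
        self.assertTrue(any("checkout of bundle 1234 complete" in line
                            for line in log_monitor.output))
```

Run it with `python -m unittest` as usual; `assertLogs` also fails the test if no matching record is emitted at all.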