awslabs/harmonize-search-analyze

Name: harmonize-search-analyze

Owner: Amazon Web Services - Labs

Description: Code samples related to "Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS" (https://aws.amazon.com/blogs/big-data/harmonize-search-and-analyze-loosely-coupled-datasets-on-aws/) published on the Big Data Blog

Created: 2017-02-22 20:31:25

Updated: 2017-12-05 14:07:41

Pushed: 2017-12-01 23:59:09

Homepage:

Size: 425

Language: HTML

README

Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS

This repository contains the source code of the Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS blog post. It is a set of CloudFormation templates and tools for deploying a data harmonization and search application which uses sample data from the Public Safety Open Data Portal.

Click this CloudFormation button to launch your own copy of the sample application in the us-east-1 (N. Virginia) AWS region:

[cloudformation-launch-stack button]
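
If you prefer the AWS CLI, the equivalent launch is sketched below; the template URL is a placeholder for the S3 location of master.yaml, and the capability flag reflects the assumption that the stack creates IAM resources:

aws cloudformation create-stack \
  --stack-name datasearch-blog \
  --template-url https://s3.amazonaws.com/<your-bucket>/<prefix>/master.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-east-1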

Overview

This repository consists of a set of nested CloudFormation templates that deploy the infrastructure and application components described in the sections below.

Directory Structure

The project contains the following main directories:

.
|__ build               # project wide build make files and environment config
|__ infrastructure      # cloudformation templates and emr bootstrap scripts
|__ notebooks           # jupyter notebooks and associated source
|__ services            # web search application container definitions and sources

CloudFormation Template Descriptions

The CloudFormation templates below are included in this repository:

| Template | Description |
| --- | --- |
| master.yaml | This is the master template used to deploy the stack to CloudFormation. It uses nested templates to include the ECS Reference Architecture templates as well as the ones listed below. |
| infrastructure/elasticsearch.yaml | Elasticsearch cluster that enforces AWS authentication. The cluster holds the data dictionary, the indexed data, and the Kibana dashboard configuration. |
| infrastructure/jupyterspark.yaml | EMR cluster with Apache Spark and Jupyter Notebooks. Used to explore, clean, harmonize (transform), describe, save, and index multiple loosely coupled datasets. |
| infrastructure/pipeline.yaml | Continuous deployment of the data discovery web application (see the service template below) using CodePipeline and CodeBuild. The pipeline takes the source, builds the data discovery web application using CodeBuild, pushes the container images to ECR, and deploys the service to the ECS cluster using CloudFormation. The template deploys ECR, CodePipeline, CodeBuild, and associated IAM resources. |
| infrastructure/service.yaml | ECS service and task definition for the data discovery web application plus related IAM, CloudWatch, and ALB resources. It is used to run the containers that form the search interface: Kibana, aws-es-kibana, NGINX, and the web application. It is instantiated from the pipeline stack. |

Data Discovery Web Application Description

The data discovery web application is powered by Docker containers running in ECS. It is a JavaScript-based interface that drives an embedded Kibana dashboard.

The service is composed of the following containers: Kibana, aws-es-kibana, NGINX, and the web application.

You can find the Dockerfile definition and related source/configuration of each service under its own subdirectory in the services directory of the project.
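
As a rough sketch of that layout (webapp is referenced later in this README; the other subdirectory names are assumptions inferred from the container list above):

services
|__ aws-es-kibana    # request-signing proxy for the AWS Elasticsearch cluster (assumed name)
|__ kibana           # Kibana dashboard container (assumed name)
|__ nginx            # NGINX reverse proxy (assumed name)
|__ webapp           # JavaScript data discovery web application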

How do I …?

Deploy Using My Own S3 Bucket
  1. Modify the master.yaml template to point to your own S3 bucket. Note that the S3 bucket must have versioning enabled for CodePipeline to work, and it must be in the same region as the CloudFormation stack. The bucket and path are configured by the ArtifactBucket and ArtifactPrefix variables under the Mappings section of the template.
  2. Modify the variables in the local build environment file, build/config.env. These variables control the build environment and web application deployment. Specifically, you should modify the following variables (see the sketch after this list):
    • ENV_BUCKET_PATH: point it to your own bucket and prefix merged together as the path to the artifacts (same as step 1)
    • ENV_NAME: make it the same as the EnvironmentName parameter used when launching the CloudFormation stack
    • ENV_VERSION: bump this version variable every time you change the web application source to trigger a new ECS deployment
  3. Upload the files to your S3 bucket. The build directory under the root of the repo contains a Makefile that can be used to build the artifacts and upload the files into your S3 bucket. It uses the aws cli to upload to S3. The Makefile uploads a zip file (from git archive) of your local repository to S3, so you should commit any local changes before uploading.
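
A minimal sketch of the relevant build/config.env edits; all values below are placeholders rather than the repo defaults:

# build/config.env (example values only)
ENV_BUCKET_PATH=my-artifact-bucket/datasearch-artifacts  # bucket/prefix from step 1
ENV_NAME=datasearch-blog                                 # must match the EnvironmentName stack parameter
ENV_VERSION=0.2.0                                        # bump on every web application source change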

To upload the files to your S3 bucket, issue the following commands (from the root of the repository):

git commit any pending changes in the local repo prior to upload
cd build
make upload # requires properly configured aws cli
Build a Stand-Alone Version of the Web Application

The front-end part of the web application (HTML, JavaScript, and CSS) can be built and packaged so that it can be deployed separately on a different web server. The application build environment and dependencies are managed using npm. Here are the commands to build it:

cd services/webapp
npm install
npm build

The application is built and bundled using webpack. The output files of the build process can be found in the dist directory, including the bundled JavaScript and CSS files, which can be added to your web application.

Please note that moving it to a different web server may require configuring CORS and changing the publicPath variable in the webpack configuration file (webpack.config.js) to point it to the right URL path in the web server.

Run the Web Application on a Development Workstation

The data discovery web application can be run on a development workstation using Docker Compose. The services directory contains the files Makefile and docker-compose.yml, which are used to run the containers locally. The Makefile serves as a wrapper around docker-compose to set up the environment and build process.

This Docker Compose service points the local aws-es-kibana container to the AWS Elasticsearch Service cluster, which requires the Elasticsearch cluster created by the CloudFormation templates in this project to be up and running. Additionally, you need the aws cli configured with credentials that have permissions to obtain the Elasticsearch endpoint from CloudFormation and to make requests to the Elasticsearch cluster.

If the CloudFormation stack was deployed to a region other than the default (us-east-1), set the AWS_DEFAULT_REGION variable in the build/config.env file to the right AWS region.
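
For example, a single line in build/config.env (the region shown is illustrative):

AWS_DEFAULT_REGION=us-west-2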

The local development environment runs the web application using webpack-dev-server from the webapp container. It mounts the webapp source directory from the host to allow hot-module-replacement. Depending on your Docker configuration, you may need to configure Docker so that the webapp directory is available to be mounted by containers and point the WEBAPP_DIR environment variable to the directory.

To run the discovery web application on a workstation, issue the following commands:

cd services
make up
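
If Docker cannot find the webapp sources to mount, point the WEBAPP_DIR environment variable at the webapp directory explicitly. This is a sketch; the exact path depends on where the repository is checked out:

cd services
WEBAPP_DIR=$(pwd)/webapp make up
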
Clean Up the CloudFormation Stacks

The resources created in this environment can be easily removed from your account by deleting the master CloudFormation stack. The master stack (default stack name: datasearch-blog) is the one that was first created using the “Launch Stack” button. Deleting this stack deletes the nested sub-stacks as well. Some of the nested sub-stacks use CloudFormation Custom Resources to facilitate cleaning up the resources.

The environment retains the EMR logs S3 bucket in case you need it for troubleshooting. You should manually remove this bucket if you don't want to keep this data. The name of this bucket is datasearch-blog-jupyterspark-<ID> (assuming the default stack name was used).
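
As a sketch, the equivalent cleanup with the AWS CLI, assuming the default stack name and substituting your actual bucket suffix for <ID> (note that aws s3 rb --force deletes the bucket together with its contents):

aws cloudformation delete-stack --stack-name datasearch-blog
aws s3 rb s3://datasearch-blog-jupyterspark-<ID> --force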

Contributing

Please create a new GitHub issue for any feature requests, bugs, or documentation improvements.

Where possible, please also submit a pull request for the change.

License

Copyright 2011-2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the “License”). You may not use this file except in compliance with the License. A copy of the License is located at

http://aws.amazon.com/apache2.0/

or in the “license” file accompanying this file. This file is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

