Name: harmonize-search-analyze
Owner: AWS Samples
Description: Code samples related to "Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS" (https://aws.amazon.com/blogs/big-data/harmonize-search-and-analyze-loosely-coupled-datasets-on-aws/) published on the Big Data Blog
Created: 2017-02-22 20:31:25
Updated: 2017-12-05 14:07:41
Pushed: 2017-12-01 23:59:09
Size: 425
Language: HTML
This repository contains the source code for the Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS blog post. It is a set of CloudFormation templates and tools for deploying a data harmonization and search application that uses sample data from the Public Safety Open Data Portal.
Use the CloudFormation "Launch Stack" button to launch your own copy of the sample application in the us-east-1 (N. Virginia) AWS region.
This repository consists of a set of nested CloudFormation templates; the individual templates and the resources they deploy are described in the table below.
The project contains the following main directories:
```
.
|__ build           # project-wide build make files and environment config
|__ infrastructure  # CloudFormation templates and EMR bootstrap scripts
|__ notebooks       # Jupyter notebooks and associated source
|__ services        # web search application container definitions and sources
```
The CloudFormation templates below are included in this repository:
| Template | Description |
| --- | --- |
| master.yaml | This is the master template used to deploy the stack to CloudFormation. It uses nested templates to include the ECS Reference Architecture templates as well as the ones listed below. |
| infrastructure/elasticsearch.yaml | Elasticsearch cluster that enforces AWS authentication. The cluster holds the data dictionary, indexed data, and the Kibana dashboard configuration. |
| infrastructure/jupyterspark.yaml | EMR cluster with Apache Spark and Jupyter Notebooks. Used to explore, clean, harmonize (transform), describe, save, and index multiple loosely coupled datasets. |
| infrastructure/pipeline.yaml | Continuous deployment of the data discovery web application (see the service template below) using CodePipeline and CodeBuild. The pipeline takes the source, builds the data discovery web application using CodeBuild, pushes the container images to ECR, and deploys the service to the ECS cluster using CloudFormation. The template deploys ECR, CodePipeline, CodeBuild, and associated IAM resources. |
| infrastructure/service.yaml | ECS service and task definition for the data discovery web application plus related IAM, CloudWatch, and ALB resources. It is used to run the containers that form the search interface, including Kibana, aws-es-kibana, NGINX, and the web application. It is instantiated from the pipeline stack. |
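If you prefer the CLI to the console button, a hypothetical equivalent launch of the master template might look like the following sketch; the template URL is a placeholder for wherever master.yaml is hosted, and the IAM capability is assumed because the nested stacks create IAM resources:

```bash
# Launch the master stack in us-east-1 (template URL is a placeholder)
aws cloudformation create-stack \
  --stack-name datasearch-blog \
  --template-url https://s3.amazonaws.com/<artifact-bucket>/<prefix>/master.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-east-1
```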
The data discovery web application is powered by Docker containers running in ECS. It is a JavaScript-based interface that drives an embedded Kibana dashboard.
The service comprises the containers that form the search interface: Kibana, aws-es-kibana, NGINX, and the web application.
You can find the Dockerfile definition and related source/configuration of each service under its own subdirectory in the services directory of the project.
To deploy your own changes to the templates or web application:

1. Modify the ArtifactBucket and ArtifactPrefix variables under the Mappings section of the master template to point to your own S3 bucket and prefix.
2. Update the following variables in build/config.env (a sample follows this list):
   - ENV_BUCKET_PATH: point it to your own bucket and prefix merged together as the path to the artifacts (same as step 1)
   - ENV_NAME: make it the same as the EnvironmentName parameter used when launching the CloudFormation stack
   - ENV_VERSION: bump this version variable every time you make changes to the web application source to cause a new ECS deployment
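As a minimal sketch, a modified build/config.env might then contain lines like these (the bucket, prefix, name, and version values are illustrative placeholders, not the shipped defaults):

```bash
# build/config.env -- illustrative values only
ENV_BUCKET_PATH=my-artifact-bucket/harmonize-search-analyze
ENV_NAME=datasearch-blog
ENV_VERSION=0.2
```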
The build directory contains a Makefile that can be used to build the artifacts and upload the files into your S3 bucket. It uses the aws cli to upload to S3. The Makefile uploads a zip file (from git archive) of your local repository to S3, so you should commit any local changes before uploading.

To upload the files to your S3 bucket, issue the following commands (from the root of the repository):
```bash
# git commit any pending changes in the local repo prior to upload
cd build
make upload  # requires properly configured aws cli
```
The front-end part of the web application (HTML, JavaScript, and CSS) can be built and packaged so that it can be deployed separately on a different web server. The application build environment and dependencies are managed using npm. Here are the commands to build it:
```bash
cd services/webapp
npm install
npm build
```
The application is built and bundled using
webpack. The output files of the build
process can be found in the dist
directory. That includes the bundled
JavaScript and CSS files which can be added to your web application.
Please note that moving it to a
different web server may require configuring
CORS
and changing the publicPath
variable in the webpack configuration file
(webpack.config.js) to point
it to the right URL path in the web server.
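For instance, a hypothetical deployment of the bundled assets to another web server could be as simple as the following; the host and destination path are placeholders, and you would still need to adjust publicPath and CORS as noted above:

```bash
# Copy the webpack output to a different web server (placeholder host/path)
rsync -av services/webapp/dist/ user@webhost:/var/www/datasearch/
```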
The data discovery web application can be run on a development workstation using Docker Compose. The services directory contains the files Makefile and docker-compose.yml, which are used to run the containers locally. The Makefile serves as a wrapper around docker-compose to set up the environment and build process.
This Docker Compose service points the local aws-es-kibana container to the AWS Elasticsearch Service cluster. That requires the Elasticsearch cluster created by the CloudFormation templates in this project to be up and running. Additionally, you need the aws cli configured with credentials having permissions to obtain the Elasticsearch endpoint from CloudFormation and to make requests to the Elasticsearch cluster.
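For example, you can sanity-check that the stack and your credentials are in place with a query like the following; this assumes the default stack name, and the exact output keys depend on the templates:

```bash
# List the stack outputs (these include the Elasticsearch endpoint);
# requires aws cli credentials with CloudFormation read access
aws cloudformation describe-stacks \
  --stack-name datasearch-blog \
  --query "Stacks[0].Outputs"
```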
If the CloudFormation stack was deployed to a region other than the default one (us-east-1), you should set the AWS_DEFAULT_REGION variable in the build/config.env file to the right AWS region.
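For example (the region value here is only an illustration):

```bash
# In build/config.env
AWS_DEFAULT_REGION=us-west-2
```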
The local development environment runs the web application using
webpack-dev-server
from the webapp container. It mounts the
webapp source directory from the host to allow
hot-module-replacement.
Depending on your Docker configuration, you may need to configure Docker
so that the webapp
directory is available to be mounted by containers
and point the WEBAPP_DIR
environment variable to the directory.
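As a sketch, assuming the repository is checked out under your home directory (the path is a placeholder):

```bash
# Point docker-compose at the webapp source directory for volume mounting
export WEBAPP_DIR="$HOME/harmonize-search-analyze/services/webapp"
```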
To run the discovery web application on a workstation, issue the following commands:
```bash
cd services
make up
```
The resources created in this environment can be easily removed from your account by deleting the master CloudFormation stack. The master stack (default stack name: datasearch-blog) is the one that was first created using the “Launch Stack” button. By deleting this stack, the rest of the sub-stacks will be deleted as well. Some of the nested sub-stacks use CloudFormation Custom Resources to facilitate cleaning up the resources.
The environment retains the EMR logs S3 bucket in case you need to troubleshoot the cluster. You should manually remove this bucket if you don't want to keep this data. The name of this bucket is datasearch-blog-jupyterspark-<ID> (assuming the default stack name was used).
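For reference, the same cleanup can be done from the CLI; this assumes the default stack name, and <ID> stands for your bucket's actual suffix:

```bash
# Delete the master stack; nested sub-stacks are deleted with it
aws cloudformation delete-stack --stack-name datasearch-blog

# Optionally remove the retained EMR logs bucket and all of its contents
aws s3 rb s3://datasearch-blog-jupyterspark-<ID> --force
```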
Please create a new GitHub issue for any feature requests, bugs, or documentation improvements.
Where possible, please also submit a pull request for the change.
Copyright 2011-2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the “License”). You may not use this file except in compliance with the License. A copy of the License is located at
http://aws.amazon.com/apache2.0/
or in the “license” file accompanying this file. This file is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.