awslabs/ecs-machine-learning

Name: ecs-machine-learning

Owner: Amazon Web Services - Labs

Owner: AWS Samples

Description: An EC2 Container Service Architecture which provides the infrastructure for Deep Learning

Created: 2016-08-17 21:21:55.0

Updated: 2017-12-01 19:48:54.0

Pushed: 2016-10-05 20:50:46.0

Homepage:

Size: 28

Language: null


README

Orchestrating GPU-Accelerated Workloads on Amazon ECS

Background

AWS Solutions Architects are seeing an emerging type of application for ECS: GPU-accelerated workloads, or, more specifically, workloads that need to leverage large numbers of GPUs across many nodes. For example, at Amazon.com, the Amazon Personalization Team runs significant Machine Learning workloads that leverage many GPUs on Amazon ECS. Let's take a look at how ECS enables GPU workloads.

Solution

To run GPU-enabled work on an ECS cluster, a Docker image configured with the NVIDIA CUDA drivers, which allow the container to communicate with the GPU hardware, is built and stored in EC2 Container Registry (ECR). An ECS Task Definition points to the container image in ECR and specifies the container's runtime configuration: how much CPU and memory each container should use, the command to run inside the container, whether a data volume should be mounted in the container, where the source dataset lives in S3, and so on.
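As a rough sketch of what such a Task Definition contains, the dict below mirrors the fields described above. The family name, image URI, resource sizes, command, and S3 location are all hypothetical placeholders, not values from the actual template; in a real account you would pass this dict to boto3's `ecs_client.register_task_definition(**task_definition)`.

```python
# Sketch of an ECS task definition with the runtime settings described above.
# All names, sizes, and URIs here are hypothetical examples.
task_definition = {
    "family": "dsstne-training",  # hypothetical task family name
    "containerDefinitions": [
        {
            "name": "dsstne",
            # Hypothetical ECR image URI for the CUDA-enabled container
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dsstne:latest",
            "cpu": 4096,       # CPU units reserved for the container
            "memory": 16384,   # memory (MiB) for the container
            # Command run inside the container (placeholder, not the real
            # DSSTNE invocation from the template)
            "command": ["train", "-c", "config.json"],
            # Mount a data volume into the container
            "mountPoints": [
                {"sourceVolume": "data", "containerPath": "/data"}
            ],
            "environment": [
                # Where the source dataset lives in S3 (hypothetical bucket)
                {"name": "DATASET_S3_URI", "value": "s3://my-dataset-bucket/input/"}
            ],
        }
    ],
    "volumes": [{"name": "data", "host": {"sourcePath": "/ecs/data"}}],
}
```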

Once the ECS Tasks are run, the ECS scheduler finds a suitable place to run the containers by identifying an instance in the cluster with available resources. As shown in the architecture diagram below, ECS can place containers onto the cluster of GPU instances ("GPU slaves" in the diagram).
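The placement decision can be illustrated with a toy sketch: given each instance's remaining CPU and memory, pick one that can fit the task. This is only a simplified illustration with made-up instance IDs and sizes; the real ECS scheduler also applies placement strategies and constraints.

```python
# Toy model of the ECS scheduler's placement decision: find an instance
# in the cluster with enough remaining resources for the task.
def place_task(instances, cpu_needed, mem_needed):
    """Return the id of the first instance that can fit the task, else None."""
    for inst in instances:
        if inst["cpu"] >= cpu_needed and inst["memory"] >= mem_needed:
            return inst["id"]
    return None

# Two hypothetical GPU instances with their remaining resources
cluster = [
    {"id": "i-aaa", "cpu": 1024, "memory": 4096},
    {"id": "i-bbb", "cpu": 8192, "memory": 61440},
]

# A task needing 4096 CPU units and 16 GiB lands on the larger instance
print(place_task(cluster, 4096, 16384))  # -> i-bbb
```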

Deploying the architecture

In this template, we spin up an ECS cluster with a single GPU instance in an Auto Scaling group. You can, however, adjust the group's desired capacity to run a larger cluster if you'd like. The instance is configured with all of the necessary software, such as the NVIDIA drivers, that DSSTNE requires to interact with the underlying GPU hardware. We also install some development tools, like Make and GCC, so that we can compile the DSSTNE library at boot time. We then build a Docker container with the DSSTNE library packaged up and upload it to EC2 Container Registry. We grab the URL of the resulting container image in ECR and build an ECS Task Definition that points to the container.
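Growing the cluster amounts to raising the Auto Scaling group's desired capacity. The parameters below sketch that call with a hypothetical group name; in a real account you would pass them to boto3's `autoscaling_client.set_desired_capacity(**scale_request)`.

```python
# Sketch of scaling the cluster from one GPU instance to three by raising
# the Auto Scaling group's desired capacity. The group name is hypothetical;
# the real name appears in the CloudFormation stack's resources.
scale_request = {
    "AutoScalingGroupName": "ecs-gpu-asg",  # hypothetical ASG name
    "DesiredCapacity": 3,                   # target number of GPU instances
    "HonorCooldown": False,                 # scale immediately
}
```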

Once the CloudFormation template completes, take a look at the "Outputs" tab to get an idea of where to look for your new resources.

Prerequisites
Network configuration

The instances launched will need access to the internet: either place them in a public subnet with a public IP, or in a private subnet with access to a NAT gateway.

Accepting terms
  1. Accept the AWS Marketplace terms for the Amazon Linux AMI with NVIDIA GRID GPU Driver by going to its Marketplace page.

  2. Click Continue on the right.

  3. Click on the Manual Launch tab and click on the Accept Software Terms button.

  4. Wait for an email confirmation that your marketplace subscription is active.

Launch the stack
  1. Choose Launch Stack to launch the template in the us-east-1 region in your account: Launch ECS Machine Learning into North Virginia with CloudFormation

(The template will build a DSSTNE container on the ECS cluster instance. Note this can take up to 25 minutes and the CloudFormation stack will not report completion until the entire build process is done.)

  2. Give a Stack Name and select your preferred key name. If you do not have a key available, see Amazon EC2 Key Pairs.
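The same launch can be scripted instead of clicked through the console. The parameters below are a sketch with a hypothetical template URL, stack name, and key pair; in a real account you would pass them to boto3's `cloudformation_client.create_stack(**stack_request)`.

```python
# Sketch of launching the stack from code rather than the console.
# The template URL, stack name, and key pair name are hypothetical.
stack_request = {
    "StackName": "ecs-machine-learning",
    # Hypothetical S3 location of the CloudFormation template
    "TemplateURL": "https://s3.amazonaws.com/my-bucket/ecs-ml.template",
    "Parameters": [
        # The key pair you selected in the console step above
        {"ParameterKey": "KeyName", "ParameterValue": "my-key-pair"},
    ],
    # The template creates IAM resources, so this capability is required
    "Capabilities": ["CAPABILITY_IAM"],
}
```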
Run the model
  1. Find the name of the DSSTNE ECS Task Definition in the CloudFormation stack outputs. It will start with "arn:aws[…]" and contain the CloudFormation template name right after "task-definition/".

  2. Go to the ECS Console, click on Task Definitions (left column) and find the one you spotted in the step above.

  3. Tick the one revision you see. Click on the Actions drop-down menu and hit Run task. Make sure to select the ECS Cluster that was brought up by the CloudFormation template. By running this task, you are essentially running the DSSTNE sample modeling as described on the amazon-dsstne GitHub page.

  4. You can easily check that the GPU is being used by logging in to the EC2 instance and running `watch -n1 nvidia-smi`.
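The "Run task" console action above corresponds to a single API call. The parameters below sketch it with hypothetical cluster and task-definition names; in a real account you would pass them to boto3's `ecs_client.run_task(**run_request)`.

```python
# Sketch of the "Run task" action as an API request. The cluster name and
# task-definition family:revision are hypothetical; the real values come
# from the CloudFormation stack outputs and the ECS console.
run_request = {
    "cluster": "ecs-ml-cluster",            # cluster created by the template
    "taskDefinition": "dsstne-training:1",  # family:revision of the task
    "count": 1,                             # run a single task
}
```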

Collect predictions
  1. You should be able to find the name of the relevant CloudWatch Logs Group in CloudFormation stack outputs

  2. Look at the task logs for details and output from the task run, including the location of the results file in S3.

  3. Navigate to this S3 bucket via the S3 Console. This is where you will be able to access the results file and confirm that this GPU-enabled Machine Learning run was successful.
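When the task logs print the results location as an s3:// URI, a small helper can split it into the bucket and key needed for a follow-up download. The URI in the example is hypothetical.

```python
# Split an s3:// URI (as printed in the task logs) into bucket and key,
# e.g. for a follow-up s3_client.get_object(Bucket=..., Key=...) call.
def parse_s3_uri(uri):
    """Return (bucket, key) for an s3://bucket/key URI."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("not an s3:// URI: " + uri)
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key

# Hypothetical results location from the task logs
print(parse_s3_uri("s3://ecs-ml-results/run-1/predictions.txt"))
# -> ('ecs-ml-results', 'run-1/predictions.txt')
```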

Bonus activities

In both bonus steps, look at the CloudWatch Logs to view the task logs (different training commands, taking advantage of multiple GPUs, etc.)

Conclusion

You should now have a good grasp on how to leverage ECS and GPU-optimized EC2 instances for your Machine Learning needs. Head on over to the AWS Big Data blog to learn more about how DSSTNE interacts with Apache Spark, trains models, generates predictions, and other fun Machine Learning concepts.

