awslabs/ecs-refarch-task-rebalancing

Name: ecs-refarch-task-rebalancing

Owner: Amazon Web Services - Labs

Owner: AWS Samples

Description: A serverless approach using AWS Lambda and Amazon ECS Event Stream to proactively rebalance the ECS tasks.

Created: 2017-08-10 20:50:26.0

Updated: 2017-12-21 19:05:25.0

Pushed: 2017-09-20 17:27:04.0

Homepage: null

Size: 2079

Language: Python

README

Rebalancing Amazon ECS Tasks using AWS Lambda

  1. Introduction
  2. Create an ECS Cluster and Deploy a Sample App
  3. Use ECS Events to Rebalance Tasks
  4. Clean Up

Introduction

Containerization offers many benefits to modern microservices architectures. Amazon EC2 Container Service (ECS) allows you to easily run Docker Containers on a managed cluster of Amazon EC2 instances. As an organization grows in maturity, cost optimizations such as Auto-Scaling and deployment methodologies such as Blue/Green can create a lot of churn in the environment.

Consider an ECS cluster with tasks distributed evenly across multiple ECS instances within the cluster. If the cluster is scaled down in order to save cost, the tasks on the removed instance are assigned to the remaining nodes automatically. However, when the ECS cluster is scaled up again, tasks are not automatically redistributed across all available instances. This leads to unused capacity and an under-utilized cluster, which could negatively affect application availability.
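To make the imbalance concrete, here is a minimal sketch that counts tasks per container instance and lists the instances left idle. The function names, the instance IDs, and the input shape are hypothetical and are not part of the reference architecture; the numbers only illustrate the scale-down/scale-up scenario described above.

```python
from collections import Counter


def task_distribution(instance_ids, task_placements):
    """Count tasks per instance, keeping instances that run no tasks."""
    counts = Counter(task_placements)
    return {instance_id: counts[instance_id] for instance_id in instance_ids}


def idle_instances(distribution):
    """Return the instances that currently run zero tasks, sorted by ID."""
    return sorted(i for i, n in distribution.items() if n == 0)


# Hypothetical cluster state after a scale-down followed by a scale-up:
# all four tasks still sit on the instance that survived the scale-down.
distribution = task_distribution(
    ["instance-a", "instance-b"],
    ["instance-a", "instance-a", "instance-a", "instance-a"],
)
print(distribution)
print(idle_instances(distribution))
```

An instance that shows up in `idle_instances` is the unused capacity that the rest of this walkthrough aims to put back to work.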

In this reference architecture, we will demonstrate a serverless approach using AWS Lambda and Amazon ECS Event Stream to proactively rebalance the ECS tasks.

Create an ECS Cluster and Deploy a Sample App

For your convenience, we have created a CloudFormation template that will create the core infrastructure that we will use throughout this example. The template creates an Application Load Balancer (ALB), an ECS Cluster containing two m3.medium instances running the ECS Optimized AMI, and a task definition for a small web application. An S3 bucket is also created to host our lambda function.

Core Infrastructure

Let's Get Started

Sample App

Explore the ECS Cluster

Use ECS Events to Rebalance Tasks

We propose a solution that listens to "ECS Container Instance State Change" events on the ECS event stream and triggers a Lambda function that rebalances tasks on the ECS cluster.

Rebalancing Diagram

This involves:

Explore the Lambda Function

The code for our lambda function is ecs-task-rebalancer.py and can be found at https://github.com/awslabs/ecs-refarch-task-rebalancing/blob/master/ecs-task-rebalancer.py.

Let's take a look at what this script is doing in more detail. The Lambda function will be triggered by a specific CloudWatch Event from ECS. Details of the trigger are below.

{
    "detail-type": [
        "ECS Container Instance State Change"
    ],
    "source": [
        "aws.ecs"
    ],
    "detail": {
        "clusterArn": [
            "arn:aws:ecs:us-west-2:<AWS ACCOUNT NUMBER>:cluster/ECS-Rebalancing-Stack-ECSCluster"
        ]
    }
}
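As a rough illustration of how a pattern like this selects events, the sketch below builds the same pattern in Python and checks sample events against it. The real matching is performed by CloudWatch Events, not by this code; `build_event_pattern` and `matches` are hypothetical helpers that only handle the exact-value matching used here, and the account number `123456789012` is a placeholder.

```python
import json


def build_event_pattern(cluster_arn):
    """Build the trigger pattern for container instance state changes."""
    return {
        "detail-type": ["ECS Container Instance State Change"],
        "source": ["aws.ecs"],
        "detail": {"clusterArn": [cluster_arn]},
    }


def matches(pattern, event):
    """Simplified matcher: every pattern key must be present in the event,
    and the event value must be one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Placeholder ARN; substitute the ARN of your own cluster.
cluster_arn = "arn:aws:ecs:us-west-2:123456789012:cluster/ECS-Rebalancing-Stack-ECSCluster"
pattern = build_event_pattern(cluster_arn)

sample_event = {
    "detail-type": "ECS Container Instance State Change",
    "source": "aws.ecs",
    "detail": {"clusterArn": cluster_arn},
}
other_event = dict(sample_event, **{"detail-type": "ECS Task State Change"})

print(json.dumps(pattern, indent=4))
print(matches(pattern, sample_event), matches(pattern, other_event))
```

Scoping the pattern to a single `clusterArn` keeps the function from firing for state changes in unrelated clusters in the same account.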

Once the Lambda function has been triggered, it evaluates the event details to determine whether a rebalancing action is required.

# Describe the container instance that caused the event.
response = ecs.describe_container_instances(
    cluster=cluster_name,
    containerInstances=[containerInstanceArn]
)

containerInstances = response["containerInstances"]
print("Number of container instances", len(containerInstances))
if len(containerInstances) != 0:
    containerInstance = containerInstances[0]
    numberOfRunningTasks = containerInstance["runningTasksCount"]
    numberOfPendingTasks = containerInstance["pendingTasksCount"]
    # True when the ECS agent on the instance is connected to the cluster
    agentConnected = containerInstance["agentConnected"]
    version = containerInstance["version"]

    # An instance with a connected agent and no tasks has just joined the cluster
    if numberOfRunningTasks == 0 and numberOfPendingTasks == 0 and agentConnected:
        print("Rebalancing the tasks due to the event.")
        rebalance_tasks()
    else:
        print("Event does not warrant task rebalancing.")

If the ECS Agent is connected, but there are no running or pending tasks, then a new instance has joined the cluster and is ready to receive tasks. The script will iterate through all running services and create a new copy of the existing task definition. This will redeploy the service according to the configured placement strategy.
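The decision rule described above can be isolated into a small pure function, which makes it easy to check without calling AWS. This is a sketch rather than the code in the repository: `should_rebalance` is a hypothetical name, and the input is assumed to be one element of the `containerInstances` list returned by `describe_container_instances`.

```python
def should_rebalance(container_instance):
    """Return True when the instance is connected but runs no tasks.

    Missing fields are treated conservatively: an absent agentConnected
    value counts as disconnected, so no rebalancing is triggered.
    """
    running = container_instance.get("runningTasksCount", 0)
    pending = container_instance.get("pendingTasksCount", 0)
    agent_connected = bool(container_instance.get("agentConnected", False))
    return agent_connected and running == 0 and pending == 0


# A freshly registered instance: agent connected, nothing placed on it yet.
new_instance = {"runningTasksCount": 0, "pendingTasksCount": 0, "agentConnected": True}
# An instance that already carries work.
busy_instance = {"runningTasksCount": 2, "pendingTasksCount": 0, "agentConnected": True}
# An instance whose agent has disconnected.
offline_instance = {"runningTasksCount": 0, "pendingTasksCount": 0, "agentConnected": False}

for instance in (new_instance, busy_instance, offline_instance):
    print(should_rebalance(instance))
```

Checking `pendingTasksCount` as well as `runningTasksCount` avoids triggering a second rebalance while tasks are still being placed on the new instance.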

# Rebalance ECS tasks of all services deployed in the cluster
def rebalance_tasks():
    all_services = get_cluster_services()

    #For each service, figure out the taskDefinition, register a new version
    #and update the service -- This sequence will rebalance the tasks on all
    #available and connected instances
    response = ecs.describe_services(
        cluster=cluster_name,
        services=all_services
    )

    described_services = response["services"]
    for service in described_services:

        print ("service : ", service)

        #Get information about the task definition of the service
        task_definition = service["taskDefinition"]

        response = ecs.describe_task_definition(
            taskDefinition=task_definition
        )

        taskDefinitionDescription = response["taskDefinition"]

        containerDefinitions = taskDefinitionDescription["containerDefinitions"]
        volumes = taskDefinitionDescription["volumes"]
        family = taskDefinitionDescription["family"]

        print ("containerDefinitions : ", containerDefinitions)
        print ("volumes : ", volumes)
        print ("family : ", family)

        #Register a new version of the task_definition
        response = ecs.register_task_definition(
            family=family,
            containerDefinitions=containerDefinitions,
            volumes= volumes
        )

        newTaskDefinitionArn = response["taskDefinition"]["taskDefinitionArn"]
        print("New task definition arn : ", newTaskDefinitionArn)

        response = ecs.update_service(
            cluster=cluster_name,
            service=service["serviceArn"],
            taskDefinition=newTaskDefinitionArn
        )

        print ("Updated the service ", service, "with new task definition")
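The excerpt calls `get_cluster_services()` without showing it. One possible shape for that helper is sketched below; it is an assumption, not the repository's implementation. Unlike the original, it takes the ECS client and cluster name as parameters so it can be exercised with a stub. It relies on the boto3 `list_services` paginator and on the documented limit of 10 services per `describe_services` call; `batches_of`, `StubPaginator`, and `StubEcsClient` are hypothetical names used only for this illustration.

```python
def get_cluster_services(ecs_client, cluster_name):
    """Collect every service ARN in the cluster, following pagination."""
    service_arns = []
    paginator = ecs_client.get_paginator("list_services")
    for page in paginator.paginate(cluster=cluster_name):
        service_arns.extend(page["serviceArns"])
    return service_arns


def batches_of(items, size=10):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


class StubPaginator:
    """Stand-in for a boto3 paginator; yields pre-built pages."""

    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        return iter(self._pages)


class StubEcsClient:
    """Stand-in for boto3.client("ecs") so the sketch runs offline."""

    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        return StubPaginator(self._pages)


# Two pages of made-up service names, twelve services in total.
stub = StubEcsClient([
    {"serviceArns": ["service-%d" % n for n in range(8)]},
    {"serviceArns": ["service-%d" % n for n in range(8, 12)]},
])
arns = get_cluster_services(stub, "demo-cluster")
print(len(arns), [len(batch) for batch in batches_of(arns)])
```

With a real client, each batch from `batches_of` could be passed as the `services` argument of `describe_services`, which matters once a cluster hosts more than 10 services.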

The service performs an in-place deployment. Two new tasks are started, growing the number of tasks to 200% of the desired count, which is the maximum permitted.

After the new tasks are verified to be healthy by the Elastic Load Balancer health check, the two previous tasks with the older task definition are drained and stopped.
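The task counts during such a deployment follow from the service's deployment configuration. The sketch below works through the arithmetic, assuming a `minimumHealthyPercent` of 100 and a `maximumPercent` of 200 for the sample service, and assuming that ECS rounds the lower bound up and the upper bound down. `deployment_task_bounds` is a hypothetical helper, not part of the repository.

```python
def deployment_task_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running) task counts during a deployment.

    Integer arithmetic is used so the rounding is exact: the lower bound
    is rounded up and the upper bound is rounded down.
    """
    lower = -(-desired_count * minimum_healthy_percent // 100)  # ceiling division
    upper = desired_count * maximum_percent // 100              # floor division
    return lower, upper


# Assumed settings for the sample service: 2 desired tasks, 100% / 200%.
print(deployment_task_bounds(2, 100, 200))
# A fractional case: 3 desired tasks with 50% / 150%.
print(deployment_task_bounds(3, 50, 150))
```

With two desired tasks, the upper bound of four is what allows both replacement tasks to start before the two older tasks are drained.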

ECS Service Deployment Options

Now that we understand what the function is doing, let's deploy it!

Deploy the Lambda Function

Disclaimer: This solution works by creating new versions of the task definitions to trigger the rebalancing. This may impact applications with long lived connections. This will also create new task definition versions which are not deleted. Please test the solution with your application to ensure it will function as expected.


Clean Up

Now we will clean up the resources that we created to avoid being charged. CloudFormation cannot delete a bucket that is not empty, so first delete the Lambda function zip file from the S3 bucket. Next, delete the Lambda function CloudFormation stack that we created; this will delete the Lambda functions. Then delete the ECS cluster stack. This should delete all the resources that were created for this exercise.

Conclusion

We have explored using the ECS event stream to proactively rebalance ECS tasks. If you'd like to dive deeper into any of these topics, please check out the following links:

Monitor Cluster State with Amazon ECS Event Stream

ECS Reference Architecture: Continuous Deployment

Blue/Green deployments on ECS

