allegro/marathon-appcop

Name: marathon-appcop

Owner: Allegro Tech

Description: Marathon applications law enforcement

Created: 2017-03-28 06:37:14.0

Updated: 2018-05-24 06:52:23.0

Pushed: 2017-05-19 06:46:43.0

Homepage: null

Size: 51

Language: Go

README

AppCop

Marathon AppCop - Marathon applications law enforcement.

In large Mesos deployments, thousands of applications may be running and deployed every day. Some of them end up broken, forgotten, or unmaintained, which puts pressure on the cluster in numerous ways.

To address that, AppCop clears Marathon of broken application deployments.

How it works

AppCop takes information about application failures from the Marathon event stream and scales the failing applications down.

Scoring Mechanism

Based on Marathon events (TASK_KILL, TASK_FAIL, TASK_FINISHED), AppCop builds a score registry with one score per application. Each such event increments the application's score, so the score keeps rising while failure-related events keep coming. When an application's score passes the threshold, AppCop forcefully scales the application down by one instance and puts the appcop label in the app definition. After that, the score for this application is reset. When only one instance is left and the score passes the threshold, the application is suspended. Scores are also reset periodically.
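The scoring loop described above can be sketched roughly as follows. This is a minimal illustration and not AppCop's actual code: the `Registry` type, its methods, and the `Action` values are hypothetical names, and the sketch assumes that a score equal to the threshold already counts as passing it.

```go
package main

import "fmt"

// Action is the penalty the sketch decides on (hypothetical type).
type Action string

const (
	None      Action = "none"
	ScaleDown Action = "scaleDown"
	Suspend   Action = "suspend"
)

// Registry keeps one failure score per application ID (hypothetical type).
type Registry struct {
	threshold int
	scores    map[string]int
}

// NewRegistry creates an empty registry with the given score threshold.
func NewRegistry(threshold int) *Registry {
	return &Registry{threshold: threshold, scores: make(map[string]int)}
}

// Record adds one point for a failure-related event of the given app.
func (r *Registry) Record(appID string) {
	r.scores[appID]++
}

// Evaluate compares the app's score with the threshold. Below the threshold
// nothing happens. Otherwise the score is reset and the app is either scaled
// down by one instance or, if only one instance is left, suspended.
// Assumption: reaching the threshold exactly counts as passing it.
func (r *Registry) Evaluate(appID string, instances int) Action {
	if r.scores[appID] < r.threshold {
		return None
	}
	r.scores[appID] = 0
	if instances <= 1 {
		return Suspend
	}
	return ScaleDown
}

// ResetAll clears every score, mirroring the periodic reset.
func (r *Registry) ResetAll() {
	r.scores = make(map[string]int)
}

func main() {
	reg := NewRegistry(3)
	for i := 0; i < 3; i++ {
		reg.Record("/example/app")
	}
	fmt.Println(reg.Evaluate("/example/app", 2)) // scaleDown
	fmt.Println(reg.Evaluate("/example/app", 2)) // none (score was reset)
}
```

A real implementation would also need locking, since events and evaluation run concurrently; the sketch leaves that out to keep the decision logic visible.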

GarbageCollection

AppCop periodically fetches applications and groups from Marathon. When an application has been suspended, or a group has been empty, for a long (configurable) time, it is deleted.

Metrics

AppCop provides a set of standard system metrics as well as application-based metrics.

Metric Types

System Metrics - AppCop-specific telemetry (e.g. queue size, event delays). Location: metrics-prefix with metrics-system-sub-prefix appended.

Applications Metrics - application telemetry calculated from events provided by Marathon (e.g. task_killed and task_finished counters). Location: metrics-prefix with metrics-app-sub-prefix appended.

Note the appid-prefix config option: if set, the matching string is removed from the application ID when metrics are published. For example, assuming

appid-prefix = com.example.
application ID = com.example.exampleapp

your application's metrics will be placed under:

{metrics-prefix}.{metrics-app-sub-prefix}.exampleapp
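The naming rule above can be expressed in a few lines of Go. This is only a sketch: `metricPath` is a hypothetical helper, and it assumes the prefix is stripped only from the start of the application ID and that the parts are joined with dots.

```go
package main

import (
	"fmt"
	"strings"
)

// metricPath builds the location of an application metric from the
// metrics-prefix, metrics-app-sub-prefix and appid-prefix settings.
// Assumption: appid-prefix is removed only when it leads the app ID.
func metricPath(prefix, appSubPrefix, appIDPrefix, appID string) string {
	name := strings.TrimPrefix(appID, appIDPrefix)
	return strings.Join([]string{prefix, appSubPrefix, name}, ".")
}

func main() {
	fmt.Println(metricPath("default", "applications", "com.example.", "com.example.exampleapp"))
	// default.applications.exampleapp
}
```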

Installation

Installing from source code

To simply compile and run the source code:

go run main.go [options]

To run the tests:

 test

To build the binary:

 build

To build deb package:

 pack

Check dist/ dir.

Setting up AppCop

AppCop should be installed on all Marathon masters. The event subscription should be set to localhost to reduce network traffic. See the Options section for details.

Marathon Labels

AppCop uses Marathon labels to communicate its actions and to tune its execution logic.

Used labels:

Name | Possible values | r/w | Description
---|---|---|---
appcop | suspend, scaleDown | w | Every time AppCop scales down or suspends an application, it puts the appropriate label in the app definition
APP_IMMUNITY | false, true | r | When AppCop encounters this label in an app definition, it treats the app as immune to all penalties (excused from all criminal acts on the cluster). Use this feature wisely: if applied too often, it could defeat the whole purpose of using AppCop

r - label is read from the app definition and not altered; w - label is written by AppCop.
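To make the two label roles concrete, here is a small sketch of how such labels could be read and written on an app's label map. The helper names `isImmune` and `withPenalty` are hypothetical, and the sketch assumes the label values are compared as exact lowercase strings.

```go
package main

import "fmt"

// isImmune reports whether an app opted out of penalties.
// Assumption: only the exact value "true" grants immunity.
func isImmune(labels map[string]string) bool {
	return labels["APP_IMMUNITY"] == "true"
}

// withPenalty returns a copy of labels with the appcop label set to the
// action taken ("scaleDown" or "suspend"). The input map is not modified.
func withPenalty(labels map[string]string, action string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out["appcop"] = action
	return out
}

func main() {
	labels := map[string]string{"APP_IMMUNITY": "true"}
	fmt.Println(isImmune(labels))            // true
	fmt.Println(isImmune(nil))               // false
	fmt.Println(withPenalty(nil, "suspend")) // map[appcop:suspend]
}
```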

Options

Argument | Default | Description
---|---|---
config-file | | Path to a JSON file to read configuration from. Note: will override options set earlier on the command line
event-stream-location | /v2/events | Get events from this stream
my-leader | marathon-dev | My leader. When the Marathon /v2/leader endpoint returns the same string as this one, subscribe to the event stream and launch jobs
events-queue-size | 1000 | Size of the events queue
listen | :4444 | Accept connections at this address
log-file | | Save logs to file (e.g. /var/log/appcop.log). If empty, logs are published to STDERR
log-format | text | Log format: JSON, text
log-level | info | Log level: panic, fatal, error, warn, info or debug
marathon-location | example.com:8080 | Marathon URL
marathon-password | | Marathon password for basic auth
marathon-protocol | http | Marathon protocol (http or https)
marathon-ssl-verify | true | Verify certificates when connecting via SSL
marathon-timeout | 30s | Time limit for requests made by the Marathon HTTP client. A timeout of zero means no timeout
appid-prefix | | Prefix common to all fully qualified application IDs. This prefix is removed from application IDs (see Metric Types)
marathon-username | | Marathon username for basic auth
scale-down-score | 30 | Score at which an application is scaled one instance down
scale-limit | 2 | How many scale-down actions to commit in one scaling-down iteration
update-interval | 2s | Interval for updating app scores
reset-interval | 1d | How often collected scores are reset
evaluate-interval | 30s | How often collected scores are compared against scale-down-score
metrics-interval | 30s | Metrics reporting interval
metrics-location | | Graphite URL (used when metrics-target is set to graphite)
metrics-prefix | default | Metrics prefix (default is resolved to .
metrics-system-sub-prefix | appcop-internal | System specific metrics. Appended to metrics-prefix
metrics-app-sub-prefix | applications | Applications specific metrics. Appended to metrics-prefix
metrics-target | stdout | Metrics destination: stdout or graphite (empty string disables metrics)
workers-pool-size | 10 | Number of concurrent workers processing events
mgc-enabled | true | Enable garbage collection of Marathon; old suspended applications will be deleted
mgc-max-suspend-time | 7 days | How long an application should be suspended before it is deleted
mgc-interval | 8 hours | Marathon GC interval
mgc-appcop-only | true | Delete only applications suspended by AppCop
dry-run | false | Perform a trial run with no changes made to Marathon

Endpoints

Endpoint | Description
---|---
/health | healthcheck - returns OK


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.