Netflix-Skunkworks/atlas-system-agent

Name: atlas-system-agent

Owner: Netflix-Skunkworks

Description: null

Created: 2018-04-13 19:10:32.0

Updated: 2018-05-22 23:09:27.0

Pushed: 2018-05-22 23:10:31.0

Homepage: null

Size: 568

Language: C++

README

Atlas System Agent / Atlas Titus Agent

:warning: Experimental

An agent that reports metrics for ec2 instances or titus containers.

Build Instructions
sudo apt-get update
sudo apt-get install -y zlib1g-dev uuid-dev libblkid-dev libpcre3-dev

rm -rf build && mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DTITUS_AGENT=ON ..
make VERBOSE=1 -j4
./runtests && make DESTDIR=../root install

The commands above build the Titus agent. To build the system agent instead, repeat them but do not define -DTITUS_AGENT=ON.

Titus Agent
CPU Metrics
cgroup.cpu.processingCapacity

Amount of processing time requested for the container. This value is computed based on the number of shares allocated when creating the job. Note that this is not a hard limit: if there is no contention, a job can use more than the requested capacity. However, a user should not rely on getting more than requested.

Unit: seconds/second

cgroup.cpu.processingTime

Amount of time spent processing code in the container. This metric is typically used in one of two ways:

  1. Utilization: divide the processing time by the processing capacity to see how close the job is to saturating its requested resources.
  2. Performance regression: use the sum for comparative analysis. Make sure that both systems being compared have the same amount of resources.

Unit: seconds/second
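
The utilization use case above can be sketched as a small helper. This is an illustration only, not code from the agent: the function name `cpu_utilization` is hypothetical, and its inputs are assumed to be values already queried for cgroup.cpu.processingTime and cgroup.cpu.processingCapacity over the same time window.

```cpp
#include <limits>

// Hypothetical helper: fraction of the requested CPU capacity in use.
// Both arguments are in seconds/second. The result can exceed 1.0 because
// the requested capacity is not a hard limit.
double cpu_utilization(double processing_time, double processing_capacity) {
  if (processing_capacity <= 0.0) {
    // No meaningful ratio without a positive capacity.
    return std::numeric_limits<double>::quiet_NaN();
  }
  return processing_time / processing_capacity;
}
```

For example, a job that requested 2 seconds/second and uses 1.5 seconds/second is at 75% of its requested capacity.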

cgroup.cpu.shares

Number of shares configured for the job. The Titus scheduler treats each CPU core as 100 shares. Generally the processing capacity is more relevant to the user as it has been normalized to the same unit as the measured processing time.

Unit: num shares
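
As a rough sketch of how shares relate to processing capacity, the conversion below assumes capacity is simply the share count divided by the 100-shares-per-core convention described above. The README does not spell out the exact formula the agent uses, so treat the names `kSharesPerCore` and `processing_capacity` and the formula itself as assumptions.

```cpp
#include <cstdint>

// Assumption: one CPU core corresponds to 100 shares, as described for the
// Titus scheduler, and capacity scales linearly with the share count.
constexpr double kSharesPerCore = 100.0;

// Hypothetical conversion from shares to processing capacity (seconds/second).
constexpr double processing_capacity(std::uint64_t shares) {
  return static_cast<double>(shares) / kSharesPerCore;
}
```

Under this assumption, a job configured with 200 shares would have a processing capacity of 2 seconds/second.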

cgroup.cpu.usageTime

Amount of time spent processing code in the container in either the system or user category.

Unit: seconds/second

Dimensions:

Memory Metrics
cgroup.mem.failures

Counter indicating an allocation failure occurred. Typically this will be seen when the application hits the memory limit.

Unit: failures/second
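
A per-second unit on a counter usually means the cumulative count is sampled periodically and the difference is divided by the elapsed time. The sketch below shows that general technique. It is not taken from the agent, and the `CounterSample` struct and `rate_per_second` function are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical reading of a cumulative counter, such as allocation failures.
struct CounterSample {
  std::uint64_t value;    // cumulative count at the time of the reading
  double timestamp_secs;  // time of the reading, in seconds
};

// Events per second between two readings. Returns 0 if no time has elapsed
// or if the counter went backwards (for example after a reset).
double rate_per_second(const CounterSample& prev, const CounterSample& curr) {
  const double elapsed = curr.timestamp_secs - prev.timestamp_secs;
  if (elapsed <= 0.0 || curr.value < prev.value) {
    return 0.0;
  }
  return static_cast<double>(curr.value - prev.value) / elapsed;
}
```

With readings of 10 and 40 failures taken 60 seconds apart, this yields 0.5 failures/second.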

cgroup.mem.limit

Memory limit for the cgroup.

Unit: bytes

cgroup.mem.used

Memory usage for the cgroup.

Unit: bytes

cgroup.mem.pageFaults

Description from kernel.org

Counter indicating the number of times that a process of the cgroup triggered a “page fault” and a “major fault”, respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent a SIGSEGV signal, typically killing it with the famous Segmentation fault message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process's own copy of the page. “Major” faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it is a regular (or “minor”) fault.

Unit: faults/second
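
For illustration, the page fault counters described above appear in the cgroup v1 memory.stat file as `pgfault` and `pgmajfault` lines in a "key value" format. The sketch below parses text in that format. Whether the agent reads this file, and how, is an assumption, and `parse_stat` is a hypothetical name.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical parser for "key value" lines, the layout used by the
// cgroup v1 memory.stat file. Lines that do not match are skipped.
std::map<std::string, std::uint64_t> parse_stat(const std::string& text) {
  std::map<std::string, std::uint64_t> stats;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string key;
    std::uint64_t value = 0;
    if (fields >> key >> value) {
      stats[key] = value;
    }
  }
  return stats;
}
```

The resulting map can be queried for `pgfault` and `pgmajfault`. Because these are cumulative counts, two successive readings would be needed to derive a faults/second rate.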

Dimensions:

cgroup.mem.processUsage

Amount of memory used by processes running in the cgroup.

Unit: bytes

Dimensions:

