Name: atlas-system-agent
Owner: Netflix-Skunkworks
Created: 2018-04-13 19:10:32.0
Updated: 2018-05-22 23:09:27.0
Pushed: 2018-05-22 23:10:31.0
Size: 568
Language: C++
:warning: Experimental
An agent that reports metrics for EC2 instances or Titus containers.
This build requires a C++11 compiler, some system libraries, and libatlasclient.
To build the titus-agent:

```
apt-get update
apt-get install -y zlib1g-dev uuid-dev libblkid-dev libpcre3-dev
rm -rf build && mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DTITUS_AGENT=ON ..
make VERBOSE=1 -j4
make runtests && make DESTDIR=../root install
```
To build the system agent, repeat the above commands but omit `-DTITUS_AGENT=ON`.
Amount of processing time requested for the container. This value is computed based on the number of shares allocated when creating the job. Note that this is not a hard limit; if there is no contention, a job can use more than the requested capacity. However, a user should not rely on getting more than requested.
Unit: seconds/second
Amount of time spent processing code in the container. This metric would typically be used for one of two use-cases.
Unit: seconds/second
Number of shares configured for the job. The Titus scheduler treats each CPU core as 100 shares. Generally the processing capacity is more relevant to the user as it has been normalized to the same unit as the measured processing time.
Unit: num shares
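The relationship between shares and processing capacity described above can be sketched as follows. This is an illustrative helper based on the 100-shares-per-core convention, not the agent's actual code:

```cpp
#include <cassert>

// Titus treats each CPU core as 100 shares, so the processing capacity
// requested for a job, in seconds of CPU time per second, is the share
// count divided by 100.
double processing_capacity(int shares) {
  return shares / 100.0;
}
```

For example, a job created with 250 shares has a requested capacity of 2.5 seconds of CPU time per second, matching the unit of the measured processing time.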
Amount of time spent processing code in the container in either the system or user category.
Unit: seconds/second
Dimensions:

* `id`: category of usage, either `system` or `user`.
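On Linux, the system/user split for a cgroup is exposed in the cgroup v1 `cpuacct.stat` file as tick counts. A minimal parsing sketch, assuming a `USER_HZ` of 100 (the common default); the helper name is hypothetical:

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parse cgroup v1 cpuacct.stat text, e.g.:
//   user 4325
//   system 1250
// Values are in USER_HZ ticks; convert to seconds so they line up
// with the seconds/second unit used by this metric.
std::map<std::string, double> parse_cpuacct_stat(const std::string& text,
                                                 long user_hz = 100) {
  std::map<std::string, double> seconds;
  std::istringstream in(text);
  std::string id;
  long long ticks;
  while (in >> id >> ticks) {
    seconds[id] = static_cast<double>(ticks) / user_hz;
  }
  return seconds;
}
```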
Counter indicating an allocation failure occurred. Typically this will be seen when the application hits the memory limit.
Unit: failures/second
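Counters like this one are reported as per-second rates. A minimal sketch of the delta-over-interval conversion, with a hypothetical helper name rather than the agent's actual code:

```cpp
#include <cassert>

// Convert two samples of a monotonic counter (e.g. the cgroup's
// allocation-failure count) into a per-second rate over the
// sampling interval.
double counter_rate(long long prev, long long curr, double interval_seconds) {
  if (interval_seconds <= 0.0 || curr < prev) {
    return 0.0;  // counter reset or invalid interval; report no rate
  }
  return static_cast<double>(curr - prev) / interval_seconds;
}
```

For example, 60 additional failures observed over a 60-second interval yields a rate of 1 failure/second.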
Memory limit for the cgroup.
Unit: bytes
Memory usage for the cgroup.
Unit: bytes
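Since the limit and usage metrics share the same unit, they can be combined into a utilization ratio. A minimal sketch with a hypothetical helper name:

```cpp
#include <cassert>

// Express cgroup memory usage as a fraction of the configured limit.
// Both inputs are in bytes, matching the metrics above. Note that an
// unlimited cgroup reports a very large limit, so the ratio would be
// near zero in that case.
double memory_utilization(long long used_bytes, long long limit_bytes) {
  if (limit_bytes <= 0) {
    return 0.0;  // no meaningful limit configured
  }
  return static_cast<double>(used_bytes) / limit_bytes;
}
```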
Description from kernel.org:

> Counter indicating the number of times that a process of the cgroup triggered a “page fault” and a “major fault”, respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent a `SIGSEGV` signal, typically killing it with the famous `Segmentation fault` message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process's own copy of the page. “Major” faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it is a regular (or “minor”) fault.
Unit: faults/second
Dimensions:

* `id`: either `minor` or `major`.
Amount of memory used by processes running in the cgroup.
Unit: bytes
Dimensions:

* `id`: how the processes are using the memory. Values are `cache`, `rss`, `rss_huge`, and `mapped_file`.
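On Linux, these per-category values come from the cgroup v1 `memory.stat` file. A minimal parsing sketch that keeps only the categories listed above; the helper name is hypothetical, not the agent's actual code:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Parse cgroup v1 memory.stat text, e.g.:
//   cache 1048576
//   rss 2097152
// keeping only the usage categories reported by this metric.
// Values are already in bytes.
std::map<std::string, long long> parse_memory_stat(const std::string& text) {
  static const std::set<std::string> wanted = {"cache", "rss", "rss_huge",
                                               "mapped_file"};
  std::map<std::string, long long> usage;
  std::istringstream in(text);
  std::string id;
  long long bytes;
  while (in >> id >> bytes) {
    if (wanted.count(id)) {
      usage[id] = bytes;
    }
  }
  return usage;
}
```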