yahoo/egads

Name: egads

Owner: Yahoo Inc.

Description: Extendible Generic Anomaly Detection System

Created: 2015-05-06 17:47:52.0

Updated: 2018-05-24 13:02:57.0

Pushed: 2018-01-02 23:12:13.0

Size: 1342

Language: Java

README

EGADS Java Library

EGADS (Extensible Generic Anomaly Detection System) is an open-source Java package for automatically detecting anomalies in large-scale time-series data. EGADS is meant to be a library that contains a number of anomaly detection techniques applicable to many use cases in a single package, with Java as the only dependency. EGADS works by first building a time-series model, which is used to compute the expected value at time t. A number of errors E are then computed by comparing the expected value with the actual value at time t. EGADS automatically determines thresholds on E and outputs the most probable anomalies. The EGADS library can be used in a wide variety of contexts to detect outliers and change points in time-series that can have various seasonal, trend and noise components.
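The flow described above (expected value, error E, automatic threshold) can be illustrated with a minimal sketch. This is not EGADS code: the class and method names are hypothetical, the trailing moving average stands in for a time-series model, and the mean-plus-k-standard-deviations rule is just one assumed way to pick a threshold on E.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names come from the EGADS code base.
class ExpectedValueSketch {

    // Expected value at time t: mean of the previous `window` observations.
    // The first `window` points have no history and are set to NaN.
    static double[] expected(double[] series, int window) {
        double[] out = new double[series.length];
        for (int t = 0; t < series.length; t++) {
            if (t < window) {
                out[t] = Double.NaN;
                continue;
            }
            double sum = 0.0;
            for (int i = t - window; i < t; i++) {
                sum += series[i];
            }
            out[t] = sum / window;
        }
        return out;
    }

    // Flags every index whose absolute error exceeds mean(E) + k * sd(E).
    static List<Integer> anomalies(double[] series, int window, double k) {
        List<Integer> flagged = new ArrayList<>();
        int n = series.length - window;
        if (n <= 0) {
            return flagged; // not enough data to build any expectation
        }
        double[] exp = expected(series, window);
        double[] err = new double[series.length];
        double mean = 0.0;
        for (int t = window; t < series.length; t++) {
            err[t] = Math.abs(series[t] - exp[t]);
            mean += err[t];
        }
        mean /= n;
        double var = 0.0;
        for (int t = window; t < series.length; t++) {
            var += (err[t] - mean) * (err[t] - mean);
        }
        double threshold = mean + k * Math.sqrt(var / n);
        for (int t = window; t < series.length; t++) {
            if (err[t] > threshold) {
                flagged.add(t);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        double[] series = {10, 10, 10, 10, 10, 10, 10, 50, 10, 10, 10, 10};
        System.out.println(anomalies(series, 3, 2.0)); // prints [7]
    }
}
```

Note that the spike at index 7 also inflates the expected value for the next three points; their errors stay below the threshold here, but a more robust model would reduce that effect.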

How to get started

EGADS was designed as a self-contained library with a collection of time-series and anomaly detection models that are applicable to a wide range of use cases. To compile the library into a single jar, clone the repo and type the following:

mvn clean compile assembly:single

You may have to set your JAVA_HOME variable to the appropriate JVM. To do this run:

export JAVA_HOME=/usr/lib/jvm/{JVM directory for desired version}

Usage

To run a simple example type:

java -Dlog4j.configurationFile=src/test/resources/log4j2.xml -cp target/egads-*-jar-with-dependencies.jar com.yahoo.egads.Egads src/test/resources/sample_config.ini src/test/resources/sample_input.csv

which produces the following picture (note that you can enable this UI by setting the OUTPUT config key to GUI in sample_config.ini).

[Image: EGADS GUI output]

Config parameters can also be specified on the command line. For example, to run anomaly detection using Olympic Scoring as the time-series model and a density-based method as the anomaly detection model, use the following:

java -Dlog4j.configurationFile=src/test/resources/log4j2.xml -cp target/egads-*-jar-with-dependencies.jar com.yahoo.egads.Egads "MAX_ANOMALY_TIME_AGO:999999999;AGGREGATION:1;OP_TYPE:DETECT_ANOMALY;TS_MODEL:OlympicModel;AD_MODEL:ExtremeLowDensityModel;INPUT:CSV;OUTPUT:STD_OUT;BASE_WINDOWS:168;PERIOD:-1;NUM_WEEKS:3;NUM_TO_DROP:0;DYNAMIC_PARAMETERS:0;TIME_SHIFTS:0" src/test/resources/sample_input.csv

To run anomaly detection using no time-series model with an auto static threshold for anomaly detection, use the following:

java -Dlog4j.configurationFile=src/test/resources/log4j2.xml -cp target/egads-*-jar-with-dependencies.jar com.yahoo.egads.Egads "MAX_ANOMALY_TIME_AGO:999999999;AGGREGATION:1;OP_TYPE:DETECT_ANOMALY;TS_MODEL:NullModel;AD_MODEL:SimpleThresholdModel;SIMPLE_THRESHOLD_TYPE:AdaptiveMaxMinSigmaSensitivity;INPUT:CSV;OUTPUT:STD_OUT;AUTO_SENSITIVITY_ANOMALY_PCNT:0.2;AUTO_SENSITIVITY_SD:2.0" src/test/resources/sample_input.csv
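The inline configuration strings in the commands above follow a KEY:VALUE;KEY:VALUE pattern. As a rough illustration of that format only, the sketch below splits such a string into an ordered map. The class and method are hypothetical and make no claim about how EGADS itself parses its arguments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of EGADS.
class InlineConfigSketch {

    // Splits "KEY:VALUE;KEY:VALUE" into a map that preserves the original order.
    // Only the first ':' in each pair separates key from value.
    static Map<String, String> parse(String spec) {
        Map<String, String> config = new LinkedHashMap<>();
        for (String pair : spec.split(";")) {
            if (pair.trim().isEmpty()) {
                continue; // tolerate a trailing or doubled ';'
            }
            int colon = pair.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected KEY:VALUE but got: " + pair);
            }
            config.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
        }
        return config;
    }

    public static void main(String[] args) {
        System.out.println(parse("OP_TYPE:DETECT_ANOMALY;TS_MODEL:OlympicModel;PERIOD:-1"));
    }
}
```

Seeing the keys side by side this way makes it easier to compare a command-line invocation with the equivalent entries in sample_config.ini.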

Overview

While rapid advances in computing hardware and software have led to powerful applications, hundreds of software bugs and hardware failures still occur in large clusters, compromising user experience and, subsequently, revenue. Non-stop systems have strict uptime requirements, and continuous monitoring of these systems is critical. From a data-analysis point of view, this means non-stop monitoring of large volumes of time-series data in order to detect potential faults or anomalies. Due to the scale of the problem, human monitoring of this data is practically infeasible, which leads us to automated anomaly detection. An anomaly, or outlier, is a data point that is significantly different from the rest of the data. Generally, the data in most applications is created by one or more generating processes that reflect the functionality of a system.

When the underlying generating process behaves in an unusual way, it creates outliers. Fast and efficient identification of these outliers is useful in many applications, including intrusion detection, credit card fraud, sensor events, medical diagnoses, law enforcement and others. Current approaches to automated anomaly detection suffer from a large number of false positives, which limits the usefulness of these systems in practice. Use-case-specific, or category-specific, anomaly detection models may enjoy a low false-positive rate for a specific application, but when the characteristics of the time-series change, these techniques perform poorly without proper retraining.

EGADS (Extensible Generic Anomaly Detection System) enables the accurate and scalable detection of time-series anomalies. EGADS separates forecasting and anomaly detection into two components, which allows users to add their own models to either component.

Architecture

The EGADS framework consists of two main components: the time-series modeling module (TMM) and the anomaly detection module (ADM). Given a time-series, the TMM models it and produces an expected value, which the ADM then consumes to compute anomaly scores. EGADS was built as a framework to be easily integrated into an existing monitoring infrastructure. At Yahoo, our internal Yahoo Monitoring Service (YMS) processes millions of data points every second, so scalable, accurate and automated anomaly detection for YMS is critical. For this reason, EGADS can be compiled into a single lightweight jar and deployed easily at scale.

The TMM and ADM can be found under main/java/com/yahoo/egads/models.
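To make the TMM/ADM split concrete, here is a minimal sketch of the idea using two small interfaces. The interface and class names (ForecastModel, AnomalyModel, PipelineSketch) are invented for this example and are not the EGADS classes; the naive forecast and fixed absolute threshold are deliberately simple stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a time-series model: produces expected values.
interface ForecastModel {
    double[] expected(double[] observed);
}

// Hypothetical stand-in for an anomaly detection model: compares observed and expected.
interface AnomalyModel {
    List<Integer> detect(double[] observed, double[] expected);
}

class PipelineSketch {

    // The forecasting output feeds the detector; either side can be swapped independently.
    static List<Integer> run(double[] observed, ForecastModel tmm, AnomalyModel adm) {
        return adm.detect(observed, tmm.expected(observed));
    }

    // Naive forecast: the value at t is expected to equal the value at t - 1.
    static final ForecastModel NAIVE = observed -> {
        double[] exp = new double[observed.length];
        for (int t = 0; t < observed.length; t++) {
            exp[t] = (t == 0) ? observed[0] : observed[t - 1];
        }
        return exp;
    };

    // Flags indices whose absolute error exceeds a fixed limit.
    static AnomalyModel absoluteThreshold(double limit) {
        return (observed, expected) -> {
            List<Integer> flagged = new ArrayList<>();
            for (int t = 0; t < observed.length; t++) {
                if (Math.abs(observed[t] - expected[t]) > limit) {
                    flagged.add(t);
                }
            }
            return flagged;
        };
    }

    public static void main(String[] args) {
        double[] observed = {5, 5, 5, 20, 5};
        System.out.println(run(observed, NAIVE, absoluteThreshold(10))); // prints [3, 4]
    }
}
```

The naive forecast flags both the spike (index 3) and the return to normal (index 4), which shows why being able to swap in a better time-series model without touching the detector is useful.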

Examples of the models supported by the TMM and ADM can be found in the two tables below. We expect this collection to grow as the community contributes more models.

List of current TimeSeries Models

[Image: table of time-series models]

List of current Anomaly Detection Models

[Image: table of anomaly detection models]

Configuration

Below are the various configuration parameters supported by EGADS.

# Only show anomalies no older than this.
# If this is set to 0, then only output an anomaly
# if it occurs on the last time-stamp.
MAX_ANOMALY_TIME_AGO  99999

# Denotes how much should the time-series be aggregated by.
# If set to 1 or less, this setting is ignored.
AGGREGATION 1

# OP_TYPE specifies the operation type.
# Options: DETECT_ANOMALY,
#          UPDATE_MODEL,
#          TRANSFORM_INPUT
OP_TYPE DETECT_ANOMALY

# TS_MODEL specifies the time-series
# model type.
# Options: AutoForecastModel
#          DoubleExponentialSmoothingModel
#          MovingAverageModel
#          MultipleLinearRegressionModel
#          NaiveForecastingModel
#          OlympicModel
#          PolynomialRegressionModel
#          RegressionModel
#          SimpleExponentialSmoothingModel
#          TripleExponentialSmoothingModel
#          WeightedMovingAverageModel
#          SpectralSmoother
#          NullModel
TS_MODEL    OlympicModel

# AD_MODEL specifies the anomaly-detection
# model type.
# Options: ExtremeLowDensityModel
#          AdaptiveKernelDensityChangePointDetector
#          KSigmaModel
#          NaiveModel
#          DBScanModel
#          SimpleThresholdModel
AD_MODEL    ExtremeLowDensityModel

# Type of the simple threshold model.
# Options: AdaptiveMaxMinSigmaSensitivity
#          AdaptiveKSigmaSensitivity
# SIMPLE_THRESHOLD_TYPE

# Specifies the input src.
# Options: STDIN
#          CSV
INPUT   CSV

# Specifies the output src.
# Options: STD_OUT,
#          ANOMALY_DB
#          GUI
#          PLOT
OUTPUT  STD_OUT

# THRESHOLD specifies the threshold for the
# anomaly detection model.
# Comment to auto-detect all thresholds.
# Options: mapee,mae,smape,mape,mase.
# THRESHOLD mape#10,mase#15

#####################################
### Olympic Forecast Model Config ###
#####################################

# The possible time-shifts for Olympic Scoring.
TIME_SHIFTS 0,1

# The possible base windows for Olympic Scoring.
BASE_WINDOWS  24,168

# Period specifies the periodicity of the
# time-series (e.g., the difference between successive time-stamps).
# Options: (numeric)
#          0 - auto detect.
#          -1 - disable.
PERIOD  -1

# NUM_WEEKS specifies the number of weeks
# to use in OlympicScoring.
NUM_WEEKS 8

# NUM_TO_DROP specifies the number of
# highest and lowest points to drop.
NUM_TO_DROP 0

# If dynamic parameters is set to 1, then
# EGADS will dynamically vary parameters (NUM_WEEKS)
# to produce the best fit.
DYNAMIC_PARAMETERS  0

###################################################
### ExtremeLowDensityModel & DBScanModel Config ###
###################################################

# Denotes the expected % of anomalies
# in your data.
AUTO_SENSITIVITY_ANOMALY_PCNT   0.01

# Refers to the cluster standard deviation.
AUTO_SENSITIVITY_SD 3.0

############################
### NaiveModel Config ###
############################

# Window size where the spike is to be found.
WINDOW_SIZE 0.1

#######################################################
### AdaptiveKernelDensityChangePointDetector Config ###
#######################################################

# Change point detection parameters
PRE_WINDOW_SIZE 48
POST_WINDOW_SIZE    48
CONFIDENCE  0.8

###############################
### SpectralSmoother Config ###
###############################

# WINDOW_SIZE should be greater than the size of longest important seasonality.
# By default it is set to 192 = 8 * 24 which is worth of 8 days (> 1 week) for hourly time-series.
WINDOW_SIZE 192

# FILTERING_METHOD specifies the filtering method for Spectral Smoothing
# Options:  GAP_RATIO       (Recommended: FILTERING_PARAM = 0.01)
#           EIGEN_RATIO     (Recommended: FILTERING_PARAM = 0.1)
#           EXPLICIT        (Recommended: FILTERING_PARAM = 10)
#           K_GAP           (Recommended: FILTERING_PARAM = 8)
#           VARIANCE        (Recommended: FILTERING_PARAM = 0.99)
#           SMOOTHNESS      (Recommended: FILTERING_PARAM = 0.97)
FILTERING_METHOD GAP_RATIO

FILTERING_PARAM 0.01
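The NUM_WEEKS and NUM_TO_DROP settings above suggest the general shape of Olympic scoring: look back a fixed number of seasonal periods, discard the most extreme values, and average the rest. The sketch below illustrates that idea under those assumptions only. It is not the EGADS OlympicModel; the class name, method signature and the `period` argument are hypothetical, and features such as TIME_SHIFTS, BASE_WINDOWS and DYNAMIC_PARAMETERS are not reproduced.

```java
import java.util.Arrays;

// Illustrative sketch of an Olympic-style trimmed seasonal average; not EGADS code.
class OlympicScoringSketch {

    // Forecast for index t: collect the values at t - period, t - 2 * period, ...
    // (up to numWeeks of them), drop the numToDrop highest and numToDrop lowest,
    // and average what remains. Returns NaN when too few values are left.
    static double forecast(double[] series, int t, int period, int numWeeks, int numToDrop) {
        double[] history = new double[numWeeks];
        int count = 0;
        for (int w = 1; w <= numWeeks; w++) {
            int idx = t - w * period;
            if (idx < 0) {
                break; // ran out of history
            }
            if (idx >= series.length) {
                continue; // t lies too far beyond the observed data for this lag
            }
            history[count++] = series[idx];
        }
        int kept = count - 2 * numToDrop;
        if (kept <= 0) {
            return Double.NaN;
        }
        double[] sorted = Arrays.copyOf(history, count);
        Arrays.sort(sorted);
        double sum = 0.0;
        for (int i = numToDrop; i < count - numToDrop; i++) {
            sum += sorted[i];
        }
        return sum / kept;
    }

    public static void main(String[] args) {
        // Period of 2 samples; the even positions hold 10, 12, 100, 11.
        double[] series = {10, 1, 12, 1, 100, 1, 11, 1};
        System.out.println(forecast(series, 8, 2, 4, 1)); // prints 11.5
    }
}
```

With NUM_TO_DROP set to 0, as in the sample configuration, this reduces to a plain average over the selected weeks; dropping one value from each end keeps the outlier 100 from distorting the forecast.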

Contributions

  1. Clone your fork
  2. Hack away
  3. If you are adding new functionality, document it in the README
  4. Verify your code by running mvn package and adding additional tests.
  5. Push the branch up to GitHub
  6. Send a pull request to the yahoo/egads project.

We actively welcome contributions. If you don't know where to start, try checking out the issue list and fixing up the place. Or, you can add a model - a goal of this project is to have a robust, lightweight and dependency-free set of models to choose from that are ready to be deployed in production.

References

Generic and Scalable Framework for Automated Time-series Anomaly Detection by Nikolay Laptev, Saeed Amizadeh, Ian Flint, KDD 2015 (August 10, 2015)

Citation

If you use EGADS in your projects, please cite: Generic and Scalable Framework for Automated Time-series Anomaly Detection by Nikolay Laptev, Saeed Amizadeh, Ian Flint, KDD 2015

BibTeX:

@inproceedings{laptev2015generic,
    title={Generic and Scalable Framework for Automated Time-series Anomaly Detection},
    author={Laptev, Nikolay and Amizadeh, Saeed and Flint, Ian},
    booktitle={Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
    pages={1939--1947},
    year={2015},
    organization={ACM}
}

License

Code licensed under the GPL License. See LICENSE file for terms.

