openshift/node-problem-detector

Name: node-problem-detector

Owner: OpenShift

Description: This is a place for various problem detectors running on the Kubernetes nodes.

Forked from: kubernetes/node-problem-detector

Created: 2017-09-13 20:40:49.0

Updated: 2018-04-10 13:12:58.0

Pushed: 2018-05-24 01:20:41.0

Homepage: null

Size: 3613

Language: Go

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

node-problem-detector

Build Status Go Report Card

node-problem-detector aims to make various node problems visible to the upstream layers in cluster management stack. It is a daemon which runs on each node, detects node problems and reports them to apiserver. node-problem-detector can either run as a DaemonSet or run standalone. Now it is running as a Kubernetes Addon enabled by default in the GCE cluster.

Background

There are tons of node problems could possibly affect the pods running on the node such as:

Currently these problems are invisible to the upstream layers in cluster management stack, so Kubernetes will continue scheduling pods to the bad nodes.

To solve this problem, we introduced this new daemon node-problem-detector to collect node problems from various daemons and make them visible to the upstream layers. Once upstream layers have the visibility to those problems, we can discuss the remedy system.

Problem API

node-problem-detector uses Event and NodeCondition to report problems to apiserver.

Problem Daemon

A problem daemon is a sub-daemon of node-problem-detector. It monitors a specific kind of node problems and reports them to node-problem-detector.

A problem daemon could be:

Currently, a problem daemon is running as a goroutine in the node-problem-detector binary. In the future, we'll separate node-problem-detector and problem daemons into different containers, and compose them with pod specification.

List of supported problem daemons:

| Problem Daemon | NodeCondition | Description | |—————-|:—————:|:————| | KernelMonitor | KernelDeadlock | A system log monitor monitors kernel log and reports problem according to predefined rules. | | AbrtAdaptor | None | Monitor ABRT log messages and report them further. ABRT (Automatic Bug Report Tool) is health monitoring daemon able to catch kernel problems as well as application crashes of various kinds occurred on the host. For more information visit the link. | | CustomPluginMonitor | On-demand(According to users configuration) | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user defined check scripts. See proposal here. |

Usage

Flags
Build Image

Run make in the top directory. It will:

Push Image

make push uploads the docker image to registry. By default, the image will be uploaded to staging-k8s.gcr.io. It's easy to modify the Makefile to push the image to another registry.

Start DaemonSet
Start Standalone

To run node-problem-detector standalone, you should set inClusterConfig to false and teach node-problem-detector how to access apiserver with apiserver-override.

To run node-problem-detector standalone with an insecure apiserver connection:

-problem-detector --apiserver-override=http://APISERVER_IP:APISERVER_INSECURE_PORT?inClusterConfig=false

For more scenarios, see here

Links


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.