Testing Prometheus alerts

September 8, 2021

An alert
The test
Run test with promtool
Source files

So you have Prometheus set up and it is scraping time-series data. Now you want to be notified when servers reach certain usage thresholds.

An alert

For instance, you want to get an alert if the CPU is running high for at least 24 hours. No problem:

# prometheus-alerts.yaml
groups:
- name: core
  rules:
  - alert: HighCPU24
    expr: max by (instance) (irate(node_cpu_seconds_total{region="us-east-1"}[1m])) > 0.8
    for: 24h
    labels:
      severity: page
    annotations:
      summary: High CPU usage on {{ $labels.instance }}

The question is how to test the HighCPU24 alert above. One way is to deploy it and hope that it works. Another is to make some nodes or containers work harder so that the processor stays busy for at least 24 hours, but that is a waste of resources. Fortunately, you can also run unit tests against alert rules.

The test

# prometheus-alerts-test.yaml
rule_files: 
- 'prometheus-alerts.yaml' # 1
tests:
- name: 'alert HighCPU24 should be fired'
  interval: 1m # 2 every minute use one value in series
  input_series: # 3 simulate the data (starts at 0)
    - series: node_cpu_seconds_total{instance="10.0.2.9", region="us-east-1"} # 4
      values: '0+50x1500' # 5 - every minute 50sec of available 60sec CPU time is used, load is at 83%, it continues for 1500 minutes
    - series: node_cpu_seconds_total{instance="10.1.1.1", region="us-east-2"} # 4
      values: '0+10x1500' # constant ~17% load (10sec of every 60sec), in a region the rule does not select
  alert_rule_test:
    - alertname: HighCPU24 # 6
      eval_time: 500m # 7 500th minute from start
      exp_alerts: # nothing yet
    - alertname: HighCPU24
      eval_time: 1450m # 24h10m from start, past the 24h "for" window
      exp_alerts: # 8
        - exp_labels:
            severity: page
            instance: 10.0.2.9
          exp_annotations:
            summary: High CPU usage on 10.0.2.9

The numbered comments in the test file mean:

1. define the files containing the alert rules that the runner will include in this test
2. specify the time interval for the values in the series; 1m means there is a gap of one minute between two consecutive values
3. add the simulated time series
4. identify a metric by its name and labels
5. mock the values of the series, one value per interval (space-separated values or the expanding notation)
6. specify which alert the test targets
7. set the point in time at which the runner should check for the alert
8. list the alerts, with their labels and annotations, that you expect to be firing at that point
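To see why the check at 500m expects nothing while the check at 1450m expects a firing alert, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not how Prometheus evaluates rules internally: it assumes the rule is evaluated once per minute and that the rate comes from the two most recent samples, which is roughly what irate does. The function names are made up for this example.

```python
# Sketch only: mimics the arithmetic of the test, not the Prometheus engine.
STEP_SECONDS = 60        # interval: 1m, one sample per minute
THRESHOLD = 0.8          # from the alert expression
FOR_MINUTES = 24 * 60    # for: 24h


def per_second_rate(prev, curr, step=STEP_SECONDS):
    """Increase between two adjacent samples divided by the time between them."""
    return (curr - prev) / step


def first_firing_minute(values):
    """Return the first minute at which the alert would fire, or None.

    Assumption: the alert becomes pending when the rate first exceeds the
    threshold and fires once it has stayed above it for FOR_MINUTES.
    """
    active_since = None
    for minute in range(1, len(values)):
        rate = per_second_rate(values[minute - 1], values[minute])
        if rate > THRESHOLD:
            if active_since is None:
                active_since = minute      # alert becomes pending
            if minute - active_since >= FOR_MINUTES:
                return minute              # pending long enough, now firing
        else:
            active_since = None            # condition broke, start over
    return None


busy = [50 * i for i in range(1501)]   # same samples as '0+50x1500'
calm = [10 * i for i in range(1501)]   # same samples as '0+10x1500'

print(round(per_second_rate(busy[0], busy[1]), 3))
print(first_firing_minute(busy))
print(first_firing_minute(calm))
```

Under these assumptions the busy series is above the threshold from the first minute, is still pending at 500m, and is firing by 1450m. The calm series never crosses 0.8, and in the real test it would also be excluded by the region selector.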

The documentation is a bit “dry” on testing. The hard part was understanding how the time series values map to the “interval” (I hope the comments make sense). The other was how the shorthand for values expands: '0+50x1500' becomes 0 50 100 150 200 250 ..., that is, the initial value followed by 1500 increments of 50.
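The expansion can be reproduced with a small Python helper. It is a simplified sketch, not promtool's parser: the expand function is hypothetical and only handles plain numbers and the a+bxn / a-bxn forms, ignoring other tokens the real notation supports (such as _ for a missing sample).

```python
import re

# Matches tokens such as '0+50x1500' or '100-5x10': start, sign, step, count.
_EXPANDING = re.compile(r"^(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)$")


def expand(values):
    """Expand a space-separated values string into a list of floats."""
    samples = []
    for token in values.split():
        match = _EXPANDING.match(token)
        if match is None:
            samples.append(float(token))   # a plain literal value
            continue
        start, sign, step, count = match.groups()
        delta = float(step) if sign == "+" else -float(step)
        # 'a+bxn' yields the initial value plus n further steps: n + 1 samples
        samples.extend(float(start) + delta * i for i in range(int(count) + 1))
    return samples


series = expand("0+50x1500")
print(series[:6])
print(len(series), series[-1])
```

With interval set to 1m, the sample at index i is the value of the series at minute i, so the last sample of '0+50x1500' sits at minute 1500.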

Run test with promtool

The promtool binary is required to run the tests. You could use a Prometheus Docker image, which has it bundled. Just make sure to change the entrypoint:

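#!/bin/bash -e

# Mount the current directory (rules and tests) into the container
TEST_DIR="$(pwd)"
DOCKER_TMP_DIR=/tmp
PROMETHEUS_VERSION=v2.29.1

# Override the image entrypoint so that promtool runs instead of the server
docker run --rm \
	--volume "$TEST_DIR:$DOCKER_TMP_DIR" \
	--workdir "$DOCKER_TMP_DIR" \
	--user root \
	--entrypoint /bin/promtool \
	"prom/prometheus:$PROMETHEUS_VERSION" test rules prometheus-alerts-test.yaml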

--user root is intentional here because the testing framework needs to write temporary files to disk.

After a successful test, you’ll see:

Unit Testing:  prometheus-alerts-test.yaml
  SUCCESS

In case of a failure, the output looks like:

Unit Testing:  prometheus-alerts-test.yaml
  FAILED:
    name: alert HighCPU24 should be fired,
    alertname:HighCPU24, time:1d10m, 
        exp:"[Labels:{alertname=\"HighCPU24\", instance=\"9.9.9.9\", severity=\"warn\"} Annotations:{summary=\"Message\"}]", 
        got:"[Labels:{alertname=\"HighCPU24\", instance=\"10.0.2.9\", severity=\"page\"} Annotations:{summary=\"High CPU usage on 10.0.2.9\"}]"

Source files

You can find the source in a public GitHub repo: github.com/ivarprudnikov/prometheus-alert-test