Centralized host health alert rules

When we relate a metrics provider (e.g. some server) to Prometheus, we expect Prometheus to fire an alert if the server stops responding. With PromQL this can be expressed generically with the up and absent expressions (see the sketch after the list):

  • up < 1
  • absent(up)
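
To make the two expressions concrete, here is a minimal sketch that evaluates them through Prometheus's HTTP API; the Prometheus address and the juju_application label filter are assumptions for the example, not part of the generated rules.

# Minimal sketch: evaluate the two host-health expressions against the
# Prometheus HTTP API. The address and label filter below are assumptions.
import requests

PROM_URL = "http://prometheus:9090"  # assumed address of the Prometheus unit

def instant_query(expr: str) -> list:
    """Run an instant PromQL query and return the resulting samples."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Targets that were scraped but are reported down.
down_targets = instant_query("up < 1")

# The 'up' series is missing entirely, e.g. nothing was remote-written.
missing = instant_query('absent(up{juju_application="some-charm"})')

for sample in down_targets:
    print("down:", sample["metric"])
if missing:
    print("no 'up' series found for some-charm")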

Instead of having every single charm in the ecosystem duplicate the same alert rules, they are generated automatically by the prometheus_scrape and prometheus_remote_write charm libraries. This spares charm authors from implementing their own HostHealth rules per charm and reduces the risk of implementation errors.
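
To illustrate what this means for a charm author, the sketch below shows a typical use of the scrape library; the charm name, port, and job definition are assumptions, and constructor arguments may differ between library versions. The charm only declares its scrape job; the generic HostHealth rules are bundled by the library alongside any rules the charm ships itself.

# Hypothetical charm using the prometheus_scrape library. The library bundles
# the generic HostHealth alert rules together with any custom rules the charm
# ships, so no per-charm up/absent rules need to be written.
import ops
from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider

class SomeCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Expose a scrape job for this charm's workload (the port is assumed).
        self.metrics_endpoint = MetricsEndpointProvider(
            self,
            jobs=[{"static_configs": [{"targets": ["*:8080"]}]}],
        )

if __name__ == "__main__":
    ops.main(SomeCharm)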

Avoiding alert fatigue

The alert rules are designed to stay in the Pending state for 5 minutes before transitioning to the Firing state. This avoids false positives on a fresh installation or with flapping metrics.
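
During those 5 minutes the alert is visible with state "pending" in Prometheus's alerts API; a minimal sketch of inspecting that, with an assumed Prometheus address:

# Minimal sketch: list active alerts and their state (pending vs. firing)
# via the Prometheus HTTP API. The address is an assumption for the example.
import requests

PROM_URL = "http://prometheus:9090"

alerts = requests.get(f"{PROM_URL}/api/v1/alerts").json()["data"]["alerts"]
for alert in alerts:
    # An alert stays "pending" until its expression has held for the full
    # "for:" duration (5m here), then it transitions to "firing".
    print(alert["labels"].get("alertname"), alert["state"])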

“Host down” vs. “metrics missing”

Note that HostHealth has slightly different semantics between remote-write and scrape (one way to tell the two cases apart is sketched after the list):

  • If Prometheus failed to scrape, then the target is down (up < 1).
  • If Grafana Agent failed to remote-write (regardless of whether its scrape succeeded), then the metrics are missing (absent(up)).
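
A rough sketch of making that distinction when investigating an alert, using two Prometheus HTTP API endpoints (the address is an assumption):

# Rough sketch: distinguish "scrape target down" from "metrics missing".
# The Prometheus address is an assumption for the example.
import requests

PROM_URL = "http://prometheus:9090"

# Case 1: Prometheus scrapes the target itself and the scrape fails. The
# target still shows up in the targets list, with health "down".
targets = requests.get(f"{PROM_URL}/api/v1/targets").json()["data"]["activeTargets"]
down = [t["labels"] for t in targets if t["health"] == "down"]

# Case 2: metrics are remote-written (e.g. by Grafana Agent) and stop
# arriving. There is no scrape target to inspect; the 'up' series simply
# disappears, which is exactly what absent(up) detects.
absent = requests.get(
    f"{PROM_URL}/api/v1/query", params={"query": "absent(up)"}
).json()["data"]["result"]

print("down targets:", down)
print("absent(up):", absent)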

Scrape

With support for centralized (generic) alerts, Prometheus provides a HostDown alert for each charm and each of its units via alert labels.

The alert rule group within prometheus_scrape contains (ignoring annotations):

groups:
  - name: HostHealth
    rules:
    - alert: HostDown
      expr: up < 1
      for: 5m
      labels:
        severity: critical
    - alert: HostMetricsMissing
      # This alert is applicable only when the provider is linked via
      # an aggregator (such as grafana agent)
      expr: absent(up)
      for: 5m
      labels:
        severity: critical

Remote write

With support for centralized (generic) alerts, Prometheus provides a HostMetricsMissing alert for Grafana Agent itself and each application that is aggregated by it.

The HostMetricsMissing alert does not show each unit, only the application!

The alert rule within prometheus_remote_write contains (ignoring annotations):

groups:
  - name: AggregatorHostHealth
    rules:
    - alert: HostMetricsMissing
      expr: absent(up)
      for: 5m
      labels:
        severity: critical

Alerting scenarios

Without Grafana Agent

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.
  3. These alerts also arrive in Alertmanager (one way to check is sketched below).
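
For example, the alerts held by Alertmanager can be listed through its v2 HTTP API; the address is an assumption, and the juju_* label names follow the Juju topology labels injected by the charm libraries.

# Minimal sketch: list the alerts currently held by Alertmanager via its
# v2 HTTP API. The address is an assumption for the example.
import requests

AM_URL = "http://alertmanager:9093"

alerts = requests.get(f"{AM_URL}/api/v2/alerts").json()
for alert in alerts:
    labels = alert["labels"]
    print(
        labels.get("alertname"),
        labels.get("juju_application"),
        labels.get("juju_unit"),
    )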

With Grafana Agent

Scrape

  1. When a unit of some-charm is down for 5 minutes, the HostDown alert fires in the Prometheus UI (showing the specific unit).
  2. If multiple units are down, they show in the labels as well.
  3. These alerts also arrive in Alertmanager.

Remote write

  1. When Grafana Agent is down for 5 minutes, the HostMetricsMissing alert fires for both the HostHealth and AggregatorHostHealth groups in the Prometheus UI.
  2. These alerts also arrive in Alertmanager.

Upgrade Notes

(TODO: add revision information)

By fetching the new library versions you get the set of generic alerts automatically. If a charm already shipped its own up/absent alerts, this results in duplicated alerts and rules. Since these alerts are now ubiquitous and handled by the prometheus_scrape and prometheus_remote_write libraries, any custom alerts duplicating this behaviour can be removed (one way to spot such duplicates is sketched below).
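
As a rough aid for the cleanup, the sketch below lists the loaded alerting rules whose expression is an up/absent host-health check, so custom duplicates of the generic rules can be reviewed; the Prometheus address and the expression patterns are assumptions.

# Rough sketch: after upgrading, list alerting rules whose expression looks
# like an up/absent host-health check, so custom duplicates of the generic
# rules can be reviewed and removed. The address is an assumption.
import requests

PROM_URL = "http://prometheus:9090"

groups = requests.get(f"{PROM_URL}/api/v1/rules").json()["data"]["groups"]
for group in groups:
    for rule in group["rules"]:
        if rule["type"] != "alerting":
            continue
        expr = rule.get("query", "")
        if "absent(up" in expr or "up < 1" in expr or "up == 0" in expr:
            print(f"{group['name']}: {rule['name']} -> {expr}")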

References

  • Support for adding generic alerts is centralized in cos-lib, allowing the Prometheus libraries to consume this functionality (link to PR).
  • The Prometheus prometheus_scrape and prometheus_remote_write libraries inject generic up/absent rules into the existing rule set (link to PR).
  • The semantics of the up alert rules are formalized in an architecture decision record (ADR) (link to PR).