When we relate a metrics provider (e.g. some server) to Prometheus, we expect Prometheus to post an alert if the server is not responding. With Prometheus's PromQL this can be expressed universally with `up` and `absent` expressions:

up < 1
absent(up)

Instead of having every single charm in the ecosystem duplicate the same alert rules, they are automatically generated by the `prometheus_scrape` and `prometheus_remote_write` charm libraries. This spares charm authors from implementing their own `HostHealth` rules per charm and reduces implementation error.
Avoiding alert fatigue
The alert rules are designed to stay in the `Pending` state for 5 minutes before transitioning to the `Firing` state. This is necessary to avoid false-positive alerts in cases such as a new installation or flapping metric behaviour.
“Host down” vs. “metrics missing”
Note that `HostHealth` has slightly different semantics between remote-write and scrape:

- If Prometheus failed to scrape, then the target is down (`up < 1`).
- If Grafana Agent failed to remote-write (regardless of whether the scrape succeeded), then the `up` metric is missing entirely (`absent(up)`).
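In PromQL terms, the distinction comes down to whether the `up` series is still present. A minimal sketch, using the same generic expressions as above (no extra label matchers assumed):

```promql
# Scrape failure: the target's "up" series still exists, but its value is 0,
# so a threshold check catches it.
up < 1

# Remote-write failure: no "up" samples arrive at Prometheus at all,
# so only an absence check can catch it.
absent(up)
```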
Scrape
With support for centralized (generic) alerts, Prometheus provides a `HostDown` alert for each charm and each of its units via alert labels.

The alert rule within `prometheus_scrape` contains (ignoring annotations):
groups:
  - name: HostHealth
    rules:
      - alert: HostDown
        expr: up < 1
        for: 5m
        labels:
          severity: critical
      - alert: HostMetricsMissing
        # This alert is applicable only when the provider is linked via
        # an aggregator (such as Grafana Agent)
        expr: absent(up)
        for: 5m
        labels:
          severity: critical
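Because the scraped `up` series carries the Juju topology labels added by the relation, a firing `HostDown` alert identifies the specific unit. Roughly, a firing alert's labels look like the following (the values are made up for illustration, and the exact label set depends on your model):

```yaml
# Illustrative labels on a firing HostDown alert; one alert per down unit
alertname: HostDown
juju_model: cos              # hypothetical model name
juju_application: some-charm
juju_unit: some-charm/0
severity: critical
```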
Remote write
With support for centralized (generic) alerts, Prometheus provides a `HostMetricsMissing` alert for Grafana Agent itself and for each application that is aggregated by it.

The `HostMetricsMissing` alert does not show each unit, only the application!

The alert rule within `prometheus_remote_write` contains (ignoring annotations):
groups:
  - name: AggregatorHostHealth
    rules:
      - alert: HostMetricsMissing
        expr: absent(up)
        for: 5m
        labels:
          severity: critical
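The per-application (rather than per-unit) granularity noted above follows from how `absent()` works: it returns a synthetic series whose labels are derived only from equality matchers in its argument, so there is no unit label to propagate. A sketch, assuming the expression ends up scoped with a `juju_application` matcher (the matcher shown here is illustrative):

```promql
# Fires once per application; absent() cannot reconstruct per-unit labels
absent(up{juju_application="some-charm"})
```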
Alerting scenarios
Without Grafana Agent
- When a unit of `some-charm` is down for 5 minutes, the `HostDown` alert fires in the Prometheus UI (showing the specific unit).
- If multiple units are down, they show in the labels as well.
- These alerts also arrive in Alertmanager.
With Grafana Agent
Scrape
- When a unit of `some-charm` is down for 5 minutes, the `HostDown` alert fires in the Prometheus UI (showing the specific unit).
- If multiple units are down, they show in the labels as well.
- These alerts also arrive in Alertmanager.
Remote write
- When Grafana Agent is down for 5 minutes, the `HostMetricsMissing` alert fires for both the `HostHealth` and `AggregatorHostHealth` groups in the Prometheus UI.
- These alerts also arrive in Alertmanager.
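To confirm which of these alerts are currently active, one option is to query Prometheus's built-in `ALERTS` metric (the matchers below are just a sketch; adjust them to your deployment):

```promql
# Host-health alerts in either pending or firing state
ALERTS{alertname=~"HostDown|HostMetricsMissing"}

# Only the ones that have crossed the 5-minute threshold and are firing
ALERTS{alertname=~"HostDown|HostMetricsMissing", alertstate="firing"}
```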
Upgrade notes
(TODO: add revision information)
By fetching the new libraries you get the new set of alerts automatically. If a charm already had its own `up`/`absent` alerts, this will result in duplicated alerts and rules. These alerts are ubiquitous and are now handled by the Prometheus `prometheus_scrape` and `prometheus_remote_write` libraries, so any custom alerts duplicating this behaviour can be removed.
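For example, a hand-written rule like the one below, typically shipped in the charm's alert rules directory (conventionally `src/prometheus_alert_rules/`; the rule name and file are hypothetical), duplicates the generic `HostDown` rule and can be deleted:

```yaml
# src/prometheus_alert_rules/host_down.rule -- now-redundant custom rule
alert: SomeCharmHostDown
expr: up < 1
for: 5m
labels:
  severity: critical
```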
References
- Support for adding generic alerts is centralized in `cos-lib`, allowing the Prometheus libraries to consume this functionality (link to PR).
- The Prometheus `prometheus_scrape` and `prometheus_remote_write` libraries inject the generic `up`/`absent` rules into the existing rule set (link to PR).
- The semantics of the `up` alert rules are formalized in an architecture decision record (ADR) (link to PR).