Centralized host health alert rules - feature notes

crucible · 5 February 2025 20:11

We are happy to announce a new feature for generic HostHealth (up/absent) alert rules in Prometheus and Grafana Agent!

This relieves charm authors of having to implement their own HostHealth rules in every charm and reduces the risk of implementation errors.

How does it work?

See this explanation doc for further implementation details.

When we relate a metrics provider (e.g. some server) to Prometheus, we expect Prometheus to fire an alert if the server is not responding. In PromQL this can be expressed universally with up and absent expressions:

up < 1
absent(up)
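As a rough illustration, the two expressions could be wrapped into Prometheus alerting rules along the following lines. The group name, alert names, `for` duration, and severity label are assumptions chosen for the example, not the values the charm libraries actually generate.

```python
# Hypothetical sketch: names, duration, and labels are illustrative
# assumptions, not what the charm libraries really emit.

def generic_host_health_rules(for_duration="5m"):
    """Return a Prometheus rule group holding generic up/absent alerts."""
    return {
        "name": "HostHealth",
        "rules": [
            {
                # The target is scraped but reports itself as down.
                "alert": "HostDown",
                "expr": "up < 1",
                "for": for_duration,
                "labels": {"severity": "critical"},
            },
            {
                # No `up` series exists at all, e.g. the scrape job vanished.
                "alert": "HostMetricsMissing",
                "expr": "absent(up)",
                "for": for_duration,
                "labels": {"severity": "critical"},
            },
        ],
    }
```

Both rules are needed because `up < 1` only matches while the `up` series still exists; `absent(up)` covers the case where it has disappeared entirely.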

Instead of every charm in the ecosystem duplicating the same alert rules, these rules are now generated automatically by the prometheus_scrape, prometheus_remote_write, and cos_agent charm libraries.
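Conceptually, the injection step amounts to appending a generic rule group to whatever rules a charm already ships. The sketch below shows the idea only; `inject_generic_rules` and `GENERIC_GROUP` are hypothetical names, and the real library code may differ in structure and naming.

```python
import copy

# Hypothetical generic group; the real libraries may use different names.
GENERIC_GROUP = {
    "name": "HostHealth",
    "rules": [
        {"alert": "HostDown", "expr": "up < 1"},
        {"alert": "HostMetricsMissing", "expr": "absent(up)"},
    ],
}


def inject_generic_rules(rule_file, generic_group=GENERIC_GROUP):
    """Return a copy of rule_file with the generic group appended once."""
    result = copy.deepcopy(rule_file)  # leave the caller's rules untouched
    groups = result.setdefault("groups", [])
    # Skip the append if a group with this name is already present,
    # so repeated calls do not stack up copies.
    if not any(g.get("name") == generic_group["name"] for g in groups):
        groups.append(copy.deepcopy(generic_group))
    return result
```

A charm's own groups are preserved as they are; the generic group is simply added alongside them.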

Upgrade Notes

Charm revisions:

Grafana-agent (rev412)
- cos_agent charm-lib (v0.18)
Prometheus-k8s (rev229)
- prometheus_scrape charm-lib (v0.49)
- prometheus_remote_write charm-lib (v1.6)
Cos-lib (0.0.54)

Fetching the new libraries adds a set of new alerts automatically. If a charm already defines its own up/absent alerts, the result is duplicated alerts and rules. Because the generic rules are now provided by the prometheus_scrape, prometheus_remote_write, and cos_agent libraries, any custom alerts that duplicate this behaviour can be removed.
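To spot custom rules that are now redundant, a simple check over a charm's rule definitions might look like the sketch below. It is a heuristic, not part of any of the libraries: `find_redundant_rules` is a hypothetical helper, and it only recognises plain `up < 1`, `up == 0`, and `absent(up)` expressions, optionally with a label selector.

```python
import re

# Match `up < 1` / `up == 0` and `absent(up)`, with optional {label="..."}.
_UP_DOWN = re.compile(r"^up(\{[^}]*\})?(<1|==0)$")
_UP_ABSENT = re.compile(r"^absent\(up(\{[^}]*\})?\)$")


def find_redundant_rules(rule_file):
    """Return (group name, alert name) pairs that duplicate generic rules."""
    redundant = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            # Strip whitespace so `up<1` and `up < 1` compare equal.
            expr = re.sub(r"\s+", "", str(rule.get("expr", "")))
            if _UP_DOWN.match(expr) or _UP_ABSENT.match(expr):
                redundant.append((group.get("name"), rule.get("alert")))
    return redundant
```

Anything the helper reports is a candidate for removal; more complex expressions that merely contain `up` are deliberately left for manual review.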

References

Support for adding generic alerts is centralized in cos-lib, allowing the Prometheus libraries to consume this functionality (link to PR #115, PR #117).
The Prometheus prometheus_scrape and prometheus_remote_write libraries inject generic up/absent rules into the existing rule set (link to PR #660).
The Grafana-agent cos_agent library injects generic up/absent rules into the existing rule set (link to PR #232).
The semantics of the up alert rules are formalized in an architecture decision record (ADR) (link to PR #224).