Prometheus-k8s docs - How to write great charmed alert rules

:warning: This post is a living document and could change frequently!

Prometheus evaluates alert rules, and Alertmanager sends the resulting notifications. This document outlines best practices for writing alerts that are useful, descriptive, and not overwhelming.

Charmed alerts are packaged with their charm, and are forwarded to the Canonical Observability Stack when the charm is integrated with Prometheus.

Principles

Define clear objectives for your alerts

Before creating your alerts, you should understand the behavior of your application and the potential impact of issues on the users. Identify the key metrics, service level indicators (SLIs) and objectives (SLOs), and think about how you can best express your failure modes.

Write effective and relevant alerts

Receiving too many alert notifications for irrelevant issues will cause responders to start ignoring them: it’s a phenomenon called “notification overload”.

For alerts to be relevant, you should watch out for the following:

  • false positives: alerts whose conditions are met, but that are not indicative of an issue in the application;
  • recipients: make sure you’re sending alert notifications to people that can act on them;
  • non-actionable alerts: if the alert doesn’t contain enough information (or if it has too much), it becomes hard for a responder to quickly pinpoint the issue and resolve it.

Alerts should be relevant for all of your users

When you bundle alerts in a charm, they are active for all the users of your application; if an alert only applies to a subset of those users, it’s probably best not to include it in the charm.

Actionable Advice

Keep the alert title in PagerDuty short

The notification title should communicate what the problem is, but it doesn’t need to contain all the relevant information; the rest can go into the description, so that a responder can still dig deeper and pinpoint the issue.
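
As a minimal sketch of this split, the rule below keeps the summary annotation short and moves the detail into the description. The group name, alert name, severity, and 5-minute window are illustrative assumptions, and which annotation ends up as the PagerDuty title depends on how the Alertmanager receiver is configured.

```yaml
groups:
  - name: example-title-length  # hypothetical group name
    rules:
      - alert: TargetDown  # hypothetical alert name
        expr: up == 0
        for: 5m  # illustrative window
        labels:
          severity: critical
        annotations:
          # Short: states what is wrong and where.
          summary: "Target {{ $labels.instance }} is down"
          # Long: extra context so the responder can dig deeper.
          description: >-
            Prometheus has failed to scrape {{ $labels.instance }}
            (job {{ $labels.job }}) for more than 5 minutes.
            Current value of up: {{ $value }}.
```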

Use group_by to your advantage

Grouping alerts can be extremely helpful. Imagine that an application with many units goes down: receiving one alert per unit would be no more useful than receiving a single alert for the application. In fact, the excessive number of notifications could hinder the response, as you could easily miss important information.
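
A possible Alertmanager route for this scenario is sketched below. It assumes that alerts carry Juju topology labels named juju_model and juju_application; the receiver name and the timing values are placeholders, not recommendations.

```yaml
route:
  receiver: pagerduty  # hypothetical receiver name
  # Alerts that share these label values are batched into one notification,
  # so all units of one application produce a single page.
  group_by: [alertname, juju_model, juju_application]  # assumed label names
  group_wait: 30s      # wait before sending the first notification of a group
  group_interval: 5m   # wait before notifying about new alerts in the group
  repeat_interval: 4h  # wait before re-sending an unresolved notification
receivers:
  - name: pagerduty
    # The PagerDuty integration settings are left out of this sketch.
```

Leaving the unit label out of group_by is what collapses the per-unit alerts into one notification per application.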

Common Alert Rules

Prometheus Self-Monitoring

Target/Job Missing

Single Job

expr: absent(up{job=<Service>})

For example, <Service> might be:

  • "prometheus"
  • "alertmanager"
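
Wrapped in a complete rule file, the expression could look like the sketch below. The group name, alert name, severity, and 5-minute window are illustrative assumptions, and the job label may have a different value in your deployment.

```yaml
groups:
  - name: self-monitoring  # hypothetical group name
    rules:
      - alert: PrometheusJobMissing  # hypothetical alert name
        # absent() returns a value only when no matching series exists.
        expr: absent(up{job="prometheus"})
        for: 5m  # illustrative window
        labels:
          severity: warning
        annotations:
          summary: "Job prometheus is missing"
          description: "No up series has been reported for job prometheus for 5 minutes."
```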

Single Target

expr: up == 0

All Targets

expr: sum by (job) (up) == 0

With Warmup Time

expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))

Too Many Restarts

expr: changes(process_start_time_seconds{job=<Job>}[15m]) > 2

For example, <Job> might be one of the following (note the regex matcher, so the selector reads job=~"..."):

  • Prometheus → ~"prometheus|pushgateway|alertmanager"
  • Loki → ~".*loki.*"
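
Filled in with the Prometheus matcher from the list above, a full rule might look like this sketch. The group name, alert name, and severity are assumptions made for illustration.

```yaml
groups:
  - name: restarts  # hypothetical group name
    rules:
      - alert: TooManyRestarts  # hypothetical alert name
        # changes() counts how often the process start time changed,
        # which is how often the process restarted in the last 15 minutes.
        expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} is restarting too often"
          description: "{{ $labels.instance }} restarted more than twice in the last 15 minutes."
```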

Configuration Failure

Configuration Reload Failure

expr: <Service>_config_last_reload_successful != 1
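
With <Service> replaced by prometheus and alertmanager, the pattern yields the two rules sketched below. Both metric names are exposed by the respective upstream projects; the group name, alert names, and severity are illustrative assumptions.

```yaml
groups:
  - name: configuration-reload  # hypothetical group name
    rules:
      - alert: PrometheusConfigurationReloadFailure  # hypothetical alert name
        # The gauge is 1 when the last reload attempt succeeded.
        expr: prometheus_config_last_reload_successful != 1
        labels:
          severity: warning
        annotations:
          summary: "Prometheus configuration reload failed on {{ $labels.instance }}"
      - alert: AlertmanagerConfigurationReloadFailure  # hypothetical alert name
        expr: alertmanager_config_last_reload_successful != 1
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager configuration reload failed on {{ $labels.instance }}"
```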

Configuration Not Synced

expr: count(count_values("config_hash", <Service>_config_hash)) > 1
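
For Alertmanager, which exposes an alertmanager_config_hash metric, the check could be written as in the sketch below. The group name, alert name, and severity are assumptions; whether another service exposes a matching <Service>_config_hash metric needs to be verified per service.

```yaml
groups:
  - name: configuration-sync  # hypothetical group name
    rules:
      - alert: AlertmanagerConfigurationNotSynced  # hypothetical alert name
        # count_values() yields one series per distinct hash; more than one
        # distinct hash means the instances run different configurations.
        expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager instances are running different configurations"
```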

Host and Hardware

Host and hardware alert rules can warn users of a potential system failure, leaving time for remediation before a serious failure occurs. These alerts require the Prometheus node exporter, which exposes the hardware and OS metrics of *NIX kernels.

Some alerts that are worth mentioning:

  • (Over|Under)utilized Memory
    • Low Swap Memory
  • (Over|Under)utilized CPU
  • Unusual Network Throughput In/Out
  • Unusual Disk Read/Write Rate
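
As a hedged sketch of the first two items, the rules below use standard node exporter memory metrics. The thresholds, windows, group name, and alert names are illustrative assumptions and should be tuned for the workload.

```yaml
groups:
  - name: host-memory  # hypothetical group name
    rules:
      - alert: HostOutOfMemory  # hypothetical alert name
        # Fires when less than 10% of memory is available (assumed threshold).
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }} is low on memory"
      - alert: HostSwapIsFillingUp  # hypothetical alert name
        # Fires when more than 80% of swap is in use (assumed threshold).
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }} is running out of swap"
```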

Others

The list of alert rules above is not exhaustive. Alerting coverage can be extended to topics like:

  • Databases and brokers
  • Reverse proxies and load balancers
  • Runtimes
  • Orchestrators
  • Network, security and storage

References