Prometheus-k8s docs - How to write great charmed alert rules

:warning: This post is a living document and could change frequently!

Prometheus evaluates alert rules, and Alertmanager sends the resulting notifications. The purpose of this document is to outline best practices for creating alerts that are useful, descriptive, and not overwhelming.

Charmed alerts are packaged together with their charm, and are sent over to the Canonical Observability Stack when the charm is integrated with Prometheus.

Principles

Define clear objectives for your alerts

Before creating your alerts, you should understand the behavior of your application and the potential impact of issues on the users. Identify the key metrics, service level indicators (SLIs) and objectives (SLOs), and think about how you can best express your failure modes.

Map failure modes

Map failure modes to symptoms

Start by thinking about the potential failures that the system may encounter.

| Potential failure | Likelihood | Symptoms |
|---|---|---|
| Overloaded querying engine goes out of memory | Depends on load and resources | Querying errors are logged and counter accumulates |

Inspect official documentation

Known failure modes may already be documented; read up on them.

Map metrics to failure modes

Inspect the /metrics endpoint of the application and come up with potential failure modes and the steps that could be taken to resolve the failure.

Try to address root causes rather than symptoms.

For example:

| Metric name | Failure mode / root cause | Potential resolution |
|---|---|---|
| querying_errors_total | The querying engine service is OOM-killed | Increase resource limits or update rate limit in config file |
| querying_errors_total | Client sends malformed queries | Confirm the client is using an appropriate schema |
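
As a rough sketch of how the rows above could turn into a rule, the following assumes the illustrative counter `querying_errors_total` from the table; the group name, alert name, 5-minute window, `for` duration and severity are all hypothetical and should be tuned to the application's real behavior.

```yaml
groups:
  - name: querying-engine            # hypothetical group name
    rules:
      - alert: QueryingErrorsIncreasing
        # True while the error counter has grown over the last 5 minutes.
        expr: increase(querying_errors_total[5m]) > 0
        # Only fire if the condition holds for 10 minutes, to skip short blips.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Querying errors on {{ $labels.instance }}"
          description: >-
            The querying error counter keeps increasing. Possible root causes:
            the querying engine was OOM-killed (increase resource limits or
            update the rate limit in the config file), or a client is sending
            malformed queries (confirm the client uses an appropriate schema).
```

Because one metric can map to several failure modes, the description lists each candidate root cause next to its resolution, so the responder does not have to rediscover the table.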

Map log lines to failure modes

Inspect the logs your application emits immediately prior to a failure. Map the contents or amount of log lines to failure modes.

Try to address root causes rather than symptoms.

For example:

| Log line | Failure mode / root cause | Potential resolution |
|---|---|---|
| write: broken pipe (*tls.permanentError) | Client terminated connection prematurely due to too low timeout | Increase timeout on the client side |
| write: broken pipe (*tls.permanentError) | TLS certificate expired or version mismatch | Verify that certs and connections are configured correctly |
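
If the application's logs are shipped to Loki, a log line like the one above could be alerted on with a LogQL rule. This is only a sketch: the `juju_application` and `juju_unit` labels, the `my-app` value, the threshold of 10 lines and the alert name are assumptions, not something this document prescribes.

```yaml
groups:
  - name: connection-errors          # hypothetical group name
    rules:
      - alert: BrokenPipeLogLines
        # Count matching log lines per unit over the last 5 minutes.
        # A block scalar is used because the expression contains ": ".
        expr: |
          sum by (juju_unit) (
            count_over_time({juju_application="my-app"} |= "write: broken pipe" [5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Repeated broken pipe errors on {{ $labels.juju_unit }}"
          description: >-
            Possible root causes: the client timeout is too low (increase the
            timeout on the client side), or the TLS certificate expired or the
            TLS version is mismatched (verify certs and connection settings).
```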

Combine the tables above, grouping by failure mode

We want to alert on root causes (e.g. “cert expired”) rather than symptoms (e.g. “broken pipe”), and keep the list of failure modes up-to-date by updating it as part of incident retrospectives.

Write effective and relevant alerts

Receiving too many alert notifications for irrelevant issues causes responders to start ignoring them, a phenomenon called “notification overload”.

For alerts to be relevant, you should watch out for the following:

  • false positives: alerts whose conditions are met, but that are not indicative of an issue in the application;
  • recipients: make sure you’re sending alert notifications to people that can act on them;
  • non-actionable alerts: if the alert doesn’t contain enough information (or if it has too much), it becomes hard for someone to quickly pinpoint the issues and solve them.

Alerts should be relevant for all of your users

When bundling alerts in a charm, they’re going to be active for all the users of your application; if an alert only applies to a subset of them, then it’s probably best not to include it in the charm.

Actionable Advice

Keep the alert title in PagerDuty short

The notification title should communicate what the problem is, but it doesn’t need to contain all the relevant information; the rest can go into the description, so that a responder can still dig deeper and pinpoint the issue.
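
One way to apply this on the Alertmanager side is to map a short `summary` annotation to the PagerDuty incident title and keep the longer text in the incident details. The sketch below assumes rules carry `summary` and `description` annotations; the receiver name is hypothetical, and the routing key is a placeholder to be filled in.

```yaml
receivers:
  - name: oncall                     # hypothetical receiver name
    pagerduty_configs:
      - routing_key: <integration-key>
        # Short text shown as the incident title.
        description: '{{ .CommonAnnotations.summary }}'
        # Longer context a responder can dig into.
        details:
          description: '{{ .CommonAnnotations.description }}'
```

Note that `.CommonAnnotations` only contains annotations whose values are identical across every alert in the notification group, so a summary that embeds per-instance labels may come out empty once alerts are grouped.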

Use group_by to your advantage

Grouping alerts can be extremely helpful. Imagine an application with many units going down: an alert for each unit would be no more useful than a single alert for the application. In fact, the excessive number of notifications could hinder the response process, as you could easily miss important information.
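
A minimal Alertmanager route illustrating the idea could look like the following. It assumes alerts carry a `juju_application` label; the receiver name and the timing values are hypothetical starting points rather than recommendations.

```yaml
route:
  receiver: oncall                   # hypothetical receiver name
  # One notification per alert name and application, instead of one per unit.
  group_by: ["alertname", "juju_application"]
  group_wait: 30s        # wait for more alerts of the same group before notifying
  group_interval: 5m     # minimum time between updates for the same group
  repeat_interval: 4h    # re-send while the group is still firing
receivers:
  - name: oncall
```

With this grouping, ten units of the same application going down would produce a single notification that lists all ten alerts.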

Common Alert Rules

Prometheus Self-Monitoring

Target/Job Missing

Single Job

expr: absent(up{job=<Service>})

For example, <Service> might be:

  • "prometheus"
  • "alertmanager"
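
Put together as a full rule, the single-job case might look like the sketch below. The `job` value, the `for` duration, the severity and the group name are assumptions; the actual `job` label depends on how the scrape job is named in your deployment.

```yaml
groups:
  - name: self-monitoring            # hypothetical group name
    rules:
      - alert: PrometheusJobMissing
        # absent() returns a series only when no matching `up` series exists.
        expr: absent(up{job="prometheus"})
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus job missing"
          description: >-
            No up series exists for the prometheus job. The scrape job may
            have been removed from the configuration or is no longer
            being discovered.
```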

Single Target

expr: up == 0

All Targets

expr: sum by (job) (up) == 0

With Warmup Time

expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))

Too Many Restarts

expr: changes(process_start_time_seconds{job=<Job>}[15m]) > 2

For example, <Job> might be:

  • Prometheus → ~"prometheus|pushgateway|alertmanager"
  • Loki → ~".*loki.*"
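
With the Prometheus value substituted for `<Job>`, the placeholder expands to a regex matcher (`job=~"..."`). The rule below is a sketch under that reading; the alert name, severity and group name are hypothetical.

```yaml
groups:
  - name: restarts                   # hypothetical group name
    rules:
      - alert: PrometheusTooManyRestarts
        # More than two process start-time changes within 15 minutes.
        expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
        labels:
          severity: warning
        annotations:
          summary: "Too many restarts on {{ $labels.instance }}"
          description: >-
            The process restarted more than twice in the last 15 minutes
            and might be crash looping.
```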

Configuration Failure

Configuration Reload Failure

expr: <Service>_config_last_reload_successful != 1

Configuration Not Synced

expr: count(count_values("config_hash", <Service>_config_hash)) > 1
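
As an illustration, substituting `alertmanager` for `<Service>` in the two expressions above gives the following pair of rules. Alert names, severities and the group name are hypothetical.

```yaml
groups:
  - name: configuration              # hypothetical group name
    rules:
      - alert: AlertmanagerConfigurationReloadFailure
        # The gauge is 1 after a successful reload and 0 after a failed one.
        expr: alertmanager_config_last_reload_successful != 1
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager configuration reload failed on {{ $labels.instance }}"
      - alert: AlertmanagerConfigNotSynced
        # More than one distinct config hash means instances run different configs.
        expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager instances are running different configurations"
```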

Host and Hardware

Host and hardware alert rules can warn users of a potential system failure and allow for remediation before a serious failure occurs. These alerts require the Prometheus node exporter, which exposes hardware and OS metrics from *NIX kernels.

Some alerts that are worth mentioning:

  • (Over|Under)utilized Memory
    • Low Swap Memory
  • (Over|Under)utilized CPU
  • Unusual Network Throughput In/Out
  • Unusual Disk Read/Write Rate
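
A few of the items above could be sketched with node exporter metrics as follows. The thresholds (10%, 80%), the `for` durations, the alert names and the group name are assumptions chosen for illustration; sensible values depend heavily on the workload.

```yaml
groups:
  - name: host-and-hardware          # hypothetical group name
    rules:
      - alert: HostOutOfMemory
        # Less than 10% of memory is still available.
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory on {{ $labels.instance }}"
      - alert: HostSwapIsFillingUp
        # More than 80% of swap is in use.
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host swap is filling up on {{ $labels.instance }}"
      - alert: HostHighCpuLoad
        # Average non-idle CPU time across all cores above 80%.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Host high CPU load on {{ $labels.instance }}"
```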

Others

The list of alert rules above is not exhaustive. Alerting coverage can be extended to topics like:

  • Databases and brokers
  • Reverse proxies and load balancers
  • Runtimes
  • Orchestrators
  • Network, security and storage

References