This post is a living document and could change frequently!
Prometheus evaluates alert rules, and Alertmanager sends the resulting notifications. This document outlines best practices for creating alerts that are useful, descriptive, and not overwhelming.
Charmed alert rules are packaged with their charm and are forwarded to the Canonical Observability Stack when the charm is integrated with Prometheus.
Principles
Define clear objectives for your alerts
Before creating your alerts, you should understand the behavior of your application and the potential impact of issues on the users. Identify the key metrics, service level indicators (SLIs) and objectives (SLOs), and think about how you can best express your failure modes.
Write effective and relevant alerts
Receiving too many alert notifications for irrelevant issues will cause responders to start ignoring them: it’s a phenomenon called “notification overload”.
For alerts to be relevant, you should watch out for the following:
- false positives: alerts whose conditions are met, but that are not indicative of an issue in the application;
- recipients: make sure you are sending alert notifications to people who can act on them;
- non-actionable alerts: if an alert contains too little information (or too much), it becomes hard for a responder to quickly pinpoint the issue and resolve it.
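As a rough sketch of how these points can translate into a rule, the example below uses a `for` clause to filter out short-lived conditions and puts the context a responder needs into annotations. The group name, alert name, threshold, and the `http_requests_total` metric with its `status` label are illustrative assumptions, not taken from any particular charm.

```yaml
groups:
  - name: example-relevance          # hypothetical group name
    rules:
      - alert: HighErrorRatio        # hypothetical alert name
        # Share of 5xx responses among all responses over the last 5 minutes.
        # http_requests_total and its "status" label are assumed metric names.
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        # Require the condition to hold for 10 minutes to limit false positives.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'High error ratio for {{ $labels.job }}'
          description: '{{ $value | humanizePercentage }} of requests to {{ $labels.job }} failed over the last 5 minutes.'
```

The `summary` stays short, while the `description` carries the measured value so that the responder does not have to run the query again.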
Alerts should be relevant for all of your users
Alerts bundled in a charm are active for all users of your application; if an alert only applies to a subset of them, it is probably best not to include it in the charm.
Actionable Advice
Keep the alert title in PagerDuty short
The notification title should communicate what the problem is, but it doesn’t need to contain all the relevant information; the rest can go into the description, so that a responder can still dig deeper and pinpoint the issue.
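A minimal sketch of an Alertmanager receiver that keeps the PagerDuty title short, assuming the alert rules carry a brief `summary` annotation and a longer `description` annotation (a common convention, not a requirement). The receiver name and the routing key placeholder are hypothetical.

```yaml
receivers:
  - name: pagerduty-oncall                    # hypothetical receiver name
    pagerduty_configs:
      - routing_key: <your-integration-key>   # placeholder, supply your own key
        # Short incident title: only the summary annotation shared by the group.
        description: '{{ .CommonAnnotations.summary }}'
        # Longer context goes into the incident details.
        details:
          description: '{{ .CommonAnnotations.description }}'
          firing: '{{ .Alerts.Firing | len }} alert(s) firing'
```

`.CommonAnnotations` only contains annotations that are identical across all alerts in the group, so this sketch works best when the grouped alerts share the same summary text.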
Use group_by to your advantage
Grouping alerts can be extremely helpful. Imagine that an application with many units goes down: receiving one alert per unit would be no more useful than receiving a single alert for the whole application. In fact, the excessive number of notifications could hinder the response, as important information is easily missed.
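The grouping described above could be configured along these lines in Alertmanager. This is a sketch under assumptions: the `juju_model` and `juju_application` label names, the receiver name, and the timing values are illustrative and should be adapted to the labels your alerts actually carry.

```yaml
route:
  receiver: default                  # hypothetical receiver name
  # Alerts sharing these label values are batched into one notification.
  # juju_model and juju_application are assumed topology labels; the unit
  # label is deliberately left out so that all units end up in one group.
  group_by: ['alertname', 'juju_model', 'juju_application']
  group_wait: 30s        # wait for related alerts before the first notification
  group_interval: 5m     # wait before notifying about alerts newly added to the group
  repeat_interval: 4h    # wait before re-sending a notification that is still firing

receivers:
  - name: default
```

With this route, ten units of the same application going down produce a single notification listing ten alerts instead of ten separate notifications.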
Common Alert Rules
Prometheus Self-Monitoring
Target/Job Missing
Single Job
expr: absent(up{job=<Service>})
For example, <Service> might be:
- "prometheus"
- "alertmanager"
Single Target
expr: up == 0
All Targets
expr: sum by (job) (up) == 0
With Warmup Time
expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
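The expressions above are only the `expr` part of a rule. As a hedged sketch, two of them could be wrapped into a complete rule group as follows; the group name, alert names, `for` durations, and severities are assumptions chosen for illustration.

```yaml
groups:
  - name: target-missing             # hypothetical group name
    rules:
      - alert: PrometheusJobMissing  # hypothetical alert name
        # Fires when no "up" series exists at all for the job.
        expr: absent(up{job="prometheus"})
        for: 5m                      # assumed duration
        labels:
          severity: warning
        annotations:
          summary: 'Job {{ $labels.job }} is missing'
      - alert: TargetDown            # hypothetical alert name
        # Fires once per target whose most recent scrape failed.
        expr: up == 0
        for: 5m                      # assumed duration
        labels:
          severity: critical
        annotations:
          summary: 'Target {{ $labels.instance }} of job {{ $labels.job }} is down'
```

The two rules complement each other: `up == 0` can only fire for targets Prometheus still knows about, while `absent()` covers the case where the job has disappeared from the scrape configuration entirely.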
Too Many Restarts
expr: changes(process_start_time_seconds{job=<Job>}[15m]) > 2
An example <Job> might be:
- Prometheus → ~"prometheus|pushgateway|alertmanager"
- Loki → ~".*loki.*"
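Substituting the Prometheus example into the expression gives the regex matcher `job=~"..."`. A minimal sketch of the full rule follows, with the group name, alert name, and severity as assumptions.

```yaml
groups:
  - name: restarts                   # hypothetical group name
    rules:
      - alert: TooManyRestarts       # hypothetical alert name
        # More than two changes of the process start time within 15 minutes
        # means the process restarted more than twice in that window.
        expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.job }} on {{ $labels.instance }} is restarting too often'
          description: 'The process restarted {{ $value }} times in the last 15 minutes and might be crash looping.'
```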
Configuration Failure
Configuration Reload Failure
expr: <Service>_config_last_reload_successful != 1
Configuration Not Synced
expr: count(count_values("config_hash", <Service>_config_hash)) > 1
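As an illustration, replacing <Service> with `prometheus` and `alertmanager` yields the rules sketched below. The group name, alert names, `for` duration, and severities are assumptions.

```yaml
groups:
  - name: configuration              # hypothetical group name
    rules:
      - alert: PrometheusConfigReloadFailed     # hypothetical alert name
        # The gauge is 1 after a successful reload and 0 after a failed one.
        expr: prometheus_config_last_reload_successful != 1
        labels:
          severity: warning
        annotations:
          summary: 'Prometheus configuration reload failed on {{ $labels.instance }}'
      - alert: AlertmanagerConfigNotSynced      # hypothetical alert name
        # count_values produces one series per distinct hash; more than one
        # series means the instances are running different configurations.
        expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
        for: 5m                      # assumed duration, tolerates rolling reloads
        labels:
          severity: warning
        annotations:
          summary: 'Alertmanager instances are running different configurations'
```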
Host and Hardware
Host and hardware alert rules can warn users of a potential system failure and allow for remediation before a serious failure occurs. These alerts require the Prometheus node exporter, which exposes the hardware and OS metrics provided by *NIX kernels.
Some alerts that are worth mentioning:
- (Over|Under)utilized Memory
- Low Swap Memory
- (Over|Under)utilized CPU
- Unusual Network Throughput In/Out
- Unusual Disk Read/Write Rate
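Two of the alerts listed above could look roughly like this, using metrics exposed by the node exporter. The thresholds, durations, group name, and alert names are assumptions and should be tuned to the workload.

```yaml
groups:
  - name: host-and-hardware          # hypothetical group name
    rules:
      - alert: HostOutOfMemory       # hypothetical alert name
        # Less than 10% of memory is available (assumed threshold).
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'Host {{ $labels.instance }} is running out of memory'
      - alert: HostHighCpuLoad       # hypothetical alert name
        # Average non-idle CPU time across all cores above 80% (assumed threshold).
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Host {{ $labels.instance }} is under high CPU load'
```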
Others
The list of alert rules above is not exhaustive. Alerting coverage can be extended to topics like:
- Databases and brokers
- Reverse proxies and load balancers
- Runtimes
- Orchestrators
- Network, security and storage