This post is a living document and could change frequently!
Prometheus evaluates alert rules, and Alertmanager sends the resulting notifications. The purpose of this document is to outline best practices for creating alerts that are useful, descriptive, and not overwhelming.
Charmed alerts are packaged together with their charm, and are forwarded to the Canonical Observability Stack when the charm is integrated with Prometheus.
Principles
Define clear objectives for your alerts
Before creating your alerts, you should understand the behavior of your application and the potential impact of issues on the users. Identify the key metrics, service level indicators (SLIs) and objectives (SLOs), and think about how you can best express your failure modes.
Map failure modes
Map failure modes to symptoms
Start by thinking about the potential failures that the system may encounter.
Potential failure | Likelihood | Symptoms |
---|---|---|
Overloaded querying engine runs out of memory | Depends on load and resources | Querying errors are logged and the error counter accumulates |
Inspect official documentation
Known failure modes may already be documented. Read up on them.
Map metrics to failure modes
Inspect the /metrics endpoint of the application and come up with potential failure modes and the steps that could be taken to resolve each failure.
Try to address root causes rather than symptoms.
For example:
Metric name | Failure mode / root cause | Potential resolution |
---|---|---|
querying_errors_total | The querying engine service is OOM-killed | Increase resource limits or update rate limit in config file |
querying_errors_total | Client sends malformed queries | Confirm the client is using an appropriate schema |
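As a rough sketch, the metric in the table above could back a Prometheus alert rule like the one below. Only the `querying_errors_total` metric name comes from the table; the group name, alert name, time windows, and severity are illustrative assumptions that would need tuning for a real application.

```yaml
groups:
  - name: querying-engine  # hypothetical group name
    rules:
      - alert: QueryingErrorsIncreasing  # hypothetical alert name
        # True whenever the error counter grew during the last 5 minutes.
        expr: increase(querying_errors_total[5m]) > 0
        # Require the condition to hold for 10 minutes to filter out one-off blips.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Querying errors are accumulating
          description: >-
            querying_errors_total has kept increasing for 10 minutes.
            Check whether the querying engine was OOM-killed or whether
            a client is sending malformed queries.
```

The description points the responder at both candidate root causes from the table, since the metric alone cannot tell them apart.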
Map log lines to failure modes
Inspect the logs your application emits immediately prior to a failure. Map the contents or volume of log lines to failure modes. Try to address root causes rather than symptoms.
For example:
Log line | Failure mode / root cause | Potential resolution |
---|---|---|
write: broken pipe (*tls.permanentError) | Client terminated the connection prematurely because its timeout is too low | Increase the timeout on the client side |
write: broken pipe (*tls.permanentError) | TLS certificate expired or version mismatch | Verify that certificates and connections are configured correctly |
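If the logs are shipped to Loki, a log line like the one above can be turned into an alert through a Loki ruler rule, which uses the same file layout as Prometheus rules but with a LogQL expression. The sketch below is only illustrative: the stream selector `{job="my-app"}`, the names, and the threshold of 5 lines are assumptions, not values from this document.

```yaml
groups:
  - name: tls-log-alerts  # hypothetical group name
    rules:
      - alert: BrokenPipeLogged  # hypothetical alert name
        # Count matching log lines over the last 5 minutes.
        # A block scalar is used because the search string contains ": ".
        expr: |
          sum(count_over_time({job="my-app"} |= "write: broken pipe" [5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Broken pipe errors are being logged
          description: >-
            Check for an expired TLS certificate or a version mismatch,
            and for clients whose timeout is too low.
```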
Combine the tables above, grouping by failure mode
We want to alert on root causes (e.g. “cert expired”) rather than symptoms (e.g. “broken pipe”), and keep the list of failure modes up to date by revisiting it as part of every incident retrospective.
Write effective and relevant alerts
Receiving too many alert notifications for irrelevant issues causes responders to start ignoring them, a phenomenon called “notification overload”.
For alerts to be relevant, you should watch out for the following:
- false positives: alerts whose conditions are met, but that are not indicative of an issue in the application;
- recipients: make sure you’re sending alert notifications to people who can act on them;
- non-actionable alerts: if the alert doesn’t contain enough information (or if it has too much), it becomes hard for someone to quickly pinpoint the issue and resolve it.
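One way to address the recipients point is to route on labels in the Alertmanager configuration. The fragment below is a minimal sketch: the `team` label and all receiver names are hypothetical, and it assumes the alert rules attach `team` and `severity` labels.

```yaml
route:
  receiver: default-team  # fallback when no child route matches
  routes:
    - matchers:
        - team="database"  # assumes alert rules set a `team` label
      receiver: database-oncall
    - matchers:
        - severity="critical"
      receiver: platform-oncall
receivers:
  # Notification integrations (email, PagerDuty, ...) are omitted here.
  - name: default-team
  - name: database-oncall
  - name: platform-oncall
```

Child routes are evaluated in order, so the more specific matcher is listed first.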
Alerts should be relevant for all of your users
When bundling alerts in a charm, they’re going to be active for all the users of your application; if an alert only applies to a subset of them, then it’s probably best not to include it in the charm.
Actionable Advice
Keep the alert title in PagerDuty short
The notification title should communicate what the problem is, but it doesn’t need to contain all the relevant information; the rest can go into the description, so that a responder can still dig deeper and pinpoint the issue.
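A possible way to do this with Alertmanager's PagerDuty integration is sketched below. It assumes the alert rules define `summary` and `description` annotations; the receiver name and the routing key are placeholders.

```yaml
receivers:
  - name: pagerduty-oncall  # hypothetical receiver name
    pagerduty_configs:
      - routing_key: "<integration-key>"  # placeholder, not a real key
        # Short incident title, taken from the alerts' shared `summary` annotation.
        description: '{{ .CommonAnnotations.summary }}'
        # Longer context goes into the incident details.
        details:
          description: '{{ .CommonAnnotations.description }}'
```

`CommonAnnotations` only contains annotations whose value is identical across every alert in the group, so a `summary` that embeds per-unit labels may end up empty once alerts are grouped.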
Use group_by to your advantage
Grouping alerts via the configuration file can be extremely helpful. Imagine an application with lots of units goes down: if you got one alert per unit, it wouldn’t be more useful than just getting one alert for the application; in fact, the excessive number of notifications could hinder the response process, as you could easily miss some important information.
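A minimal sketch of such a grouping in the Alertmanager configuration follows. It assumes alerts carry the Juju topology labels `juju_model` and `juju_application`; the receiver name and the timing values are illustrative.

```yaml
route:
  receiver: default  # hypothetical receiver name
  # One notification per alert name and application, rather than one per unit.
  group_by: [alertname, juju_model, juju_application]
  # How long to wait for more alerts of the same group before the first notification.
  group_wait: 30s
  # How long to wait before notifying about new alerts added to an existing group.
  group_interval: 5m
  # How long to wait before re-sending a notification that is still firing.
  repeat_interval: 4h
receivers:
  - name: default
```

Leaving the unit label out of `group_by` is what collapses the per-unit alerts into a single notification.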
Common Alert Rules
Prometheus Self-Monitoring
Target/Job Missing
Single Job
expr: absent(up{job=<Service>})
For example, <Service> might be:
- "prometheus"
- "alertmanager"
Single Target
expr: up == 0
All Targets
expr: sum by (job) (up) == 0
With Warmup Time
expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
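The expressions in this section are only the `expr` field of a rule. As a rough sketch, two of them could be wrapped into a complete rule file as follows; the group name, alert names, `for` durations, and severities are illustrative assumptions.

```yaml
groups:
  - name: self-monitoring  # hypothetical group name
    rules:
      - alert: PrometheusJobMissing
        # Single job: no `up` series exists at all for the job.
        expr: absent(up{job="prometheus"})
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Prometheus job missing
      - alert: TargetDown
        # Single target: the target exists but its last scrape failed.
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Target {{ $labels.instance }} of job {{ $labels.job }} is down'
```

The `for` clause keeps a single failed scrape from firing the alert.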
Too Many Restarts
expr: changes(process_start_time_seconds{job=<Job>}[15m]) > 2
An example <Job> might be:
- Prometheus → ~"prometheus|pushgateway|alertmanager"
- Loki → ~".*loki.*"
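With the Prometheus value substituted, the matcher becomes a regular-expression match (`job=~"..."`). A sketch of the resulting rule, where the names and severity are illustrative assumptions:

```yaml
groups:
  - name: restarts  # hypothetical group name
    rules:
      - alert: TooManyRestarts
        # Counts how often the process start time changed in the last 15 minutes.
        expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.job }} on {{ $labels.instance }} restarted more than twice in 15 minutes'
```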
Configuration Failure
Configuration Reload Failure
expr: <Service>_config_last_reload_successful != 1
Configuration Not Synced
expr: count(count_values("config_hash", <Service>_config_hash)) > 1
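For instance, substituting `alertmanager` for <Service> gives the sketch below, using the `alertmanager_config_last_reload_successful` and `alertmanager_config_hash` metrics that Alertmanager exposes. Other services may not expose both metrics, so check the /metrics endpoint first; the group name, alert names, and severities are illustrative.

```yaml
groups:
  - name: alertmanager-config  # hypothetical group name
    rules:
      - alert: AlertmanagerConfigReloadFailed
        # The gauge is 1 after a successful reload and 0 after a failed one.
        expr: alertmanager_config_last_reload_successful != 1
        labels:
          severity: warning
      - alert: AlertmanagerConfigNotSynced
        # More than one distinct hash means cluster members run different configurations.
        expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
        labels:
          severity: warning
```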
Host and Hardware
Adding host and hardware alert rules can warn users of potential system failures and allows for remediation before a serious failure occurs. These alerts require the Prometheus node exporter, which exposes the hardware and OS metrics provided by *NIX kernels.
Some alerts that are worth mentioning:
- (Over|Under)utilized Memory
- Low Swap Memory
- (Over|Under)utilized CPU
- Unusual Network Throughput In/Out
- Unusual Disk Read/Write Rate
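As an illustration, two of the alerts listed above could be sketched with node exporter metrics as follows. The thresholds, durations, and names are assumptions to adapt to your own hosts, not recommended values.

```yaml
groups:
  - name: host-and-hardware  # hypothetical group name
    rules:
      - alert: HostMemoryOverutilized
        # Less than 10% of memory is still available.
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Host {{ $labels.instance }} is low on memory'
      - alert: HostCpuOverutilized
        # Average non-idle CPU time across all cores is above 80%.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Host {{ $labels.instance }} has high CPU usage'
```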
Others
The list of alert rules above is not exhaustive. Alerting coverage can be extended to topics like:
- Databases and brokers
- Reverse proxies and load balancers
- Runtimes
- Orchestrators
- Network, security and storage