Understanding SLOs: The Key to Reliable Service Delivery
When providing a service to a customer, it’s useful to agree on some levels of guaranteed performance and availability. These Service Level Agreements (SLAs) are a framework that uses quantifiable metrics and measurements, generally called Service Level Indicators (SLIs for short), to formulate a set of promises to a customer. These promises are Service Level Objectives (SLOs), and this post will explore what SLOs are, why they matter, and how SLO alerts can help teams stay on track.
What are SLOs?
SLOs are specific, measurable targets that define the expected performance or health of a service. The objectives are sometimes expressed in terms of their allowed leeway, or “error budget”, and any discussion about the system’s reliability should keep SLOs at the center.
For example, an SLO might specify that a web application should maintain 99.9% uptime over a given period: this means the service can only be down for a limited amount of time and still meet the agreed-upon standard. These standards can differ based on the context: you might want to have internal SLOs (higher objectives, striving for quality) and external SLOs (lower objectives, striving to reliably keep your promises). Uptime is only one example though (see the sketch after this list); others are:
- application performance: metrics such as response time and latency;
- resource utilization: metrics related to underlying infrastructure, such as CPU and memory usage.
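To make the uptime example concrete, here is a minimal sketch of an availability SLI expressed as a Prometheus query. The metric name (http_requests_total), the job label, and the 30-day window are assumptions for illustration, not part of any standard:
# Fraction of successful (non-5xx) requests over the last 30 days.
sum(rate(http_requests_total{job="myservice",code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="myservice"}[30d]))
Comparing the result of this ratio against 0.999 tells you whether a 99.9% availability objective is being met over that window.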
Why are SLOs Important?
SLOs are extremely important because they improve software quality: by defining acceptable levels of downtime for a service, they help teams provide software that meets user expectations.
There are more reasons why SLOs are important:
- they allow establishing an error budget, which helps balance the prioritization of new features against maintenance work (see the worked example after this list);
- when SLOs are simple and well-written, they improve cross-team communication by providing some ground truths that everyone understands.
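As a quick worked example of an error budget, assume a 99.9% availability objective measured over a 30-day window (the numbers are illustrative):
30 days × 24 hours × 60 minutes = 43,200 minutes in the window
(100% − 99.9%) × 43,200 minutes ≈ 43 minutes of allowed downtime
Those ~43 minutes are the budget: while some of it is left, the team can afford to ship riskier changes; once it is spent, the focus shifts to reliability work.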
SLO Alerts: Keeping Teams Informed
Writing SLO alerts helps ensure those objectives are met, and allows teams to take action if they’re not. Some examples (see the sketch after this list) are:
- if the uptime of a service drops below the agreed SLO, an alert could surface the need to investigate an underlying issue before the Service Level Agreement (SLA) is breached;
- if the average response time for an application exceeds a certain threshold, that might be an indicator of a performance bottleneck;
- if the error rate of a service exceeds the expected value (or if CPU or memory usage is too high), an alert can help teams respond quickly to a potential outage or service degradation.
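For the error-rate case, a hand-written Prometheus alerting rule might look like the sketch below; the metric name, job label, 1% threshold, and label values are assumptions, not a recommendation:
groups:
  - name: slo-alerts
    rules:
      # Fire when more than 1% of requests returned a 5xx over the last
      # 5 minutes, sustained for 10 minutes.
      - alert: MyServiceErrorRateHigh
        expr: |
          sum(rate(http_requests_total{job="myservice",code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="myservice"}[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for myservice"
Tools like Sloth (shown later in this post) can generate multi-window, burn-rate-based alerts from a single SLO definition, so you rarely need to maintain rules like this by hand.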
How do I write SLOs?
Writing SLOs is all about figuring out the correct thresholds for your Service Level Indicators (SLIs). This process is iterative in nature: it involves lots of trial and error, and it requires continuous refinement over time.
However, here are some general guidelines you can follow when writing SLOs:
- simplicity: try to establish simple, clear, measurable targets that are easy to understand and verify;
- minimalism: producing a limited number of SLOs helps you focus on the most critical aspects of service performance;
- weak thresholds: SLOs should be set to the lowest level of reliability that is acceptable for your users; this leaves enough error budget for potential new features, maintenance windows, etc.;
- try to use the rule of 9s: thresholds are usually formulated using “a certain number of 9s”, meaning they are set to 90%, 99%, 99.9%, and so on; this is especially useful if you don’t know where to start (see the breakdown after this list).
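To make the rule of 9s concrete, here is the downtime each level allows over a year (525,600 minutes), rounded for readability:
- 99% → 1% of 525,600 minutes ≈ 3.65 days per year;
- 99.9% → 0.1% of 525,600 minutes ≈ 8.8 hours per year;
- 99.99% → 0.01% of 525,600 minutes ≈ 53 minutes per year;
- 99.999% → 0.001% of 525,600 minutes ≈ 5.3 minutes per year.
Each extra 9 shrinks the budget by a factor of ten, which is why very high objectives get expensive quickly.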
What about an example?
There are several ways to define SLOs and alert on them. If you don’t know where to start, there are tools that help you formulate SLOs with a declarative approach, and then take care of generating the Prometheus configuration and alerts. Sloth is one of those tools: let’s inspect the SLO spec format for a concrete example.
Sloth requires you to define SLOs in a YAML file (source):
version: "prometheus/v1"
service: "myservice"
labels:
owner: "myteam"
repo: "myorg/myservice"
tier: "2"
slos:
# We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
- name: "requests-availability"
objective: 99.9
description: "Common SLO based on availability for HTTP request responses."
sli:
events:
error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
alerting:
name: MyServiceHighErrorRate
labels:
category: "availability"
annotations:
# Overwrite default Sloth SLO alert summmary on ticket and page alerts.
summary: "High error rate on 'myservice' requests responses"
page_alert:
labels:
severity: pageteam
routing_key: myteam
ticket_alert:
labels:
severity: "slack"
slack_channel: "#alerts-myteam"
In this example, http_request_duration_seconds_count is the counter component of a latency histogram, tracking how many HTTP requests the server has answered (labelled by status code); as the comment above states, we want to assert that failing requests (5xx and 429 responses) stay below 1 in every 1000, i.e. the 99.9% availability objective.
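To see what the SLI actually computes, here is the error ratio formed by the two queries above once the {{.window}} template variable is replaced with a concrete window (5m is just an illustrative value; Sloth generates several windows for its burn-rate alerts):
sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="myservice"}[5m]))
If this ratio stays below 0.001 (1 failing request in 1000), the 99.9% objective is being met for that window.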
Regardless of the tool or methodology you use, this hopefully shows which elements you should look at when writing an SLO.
Conclusion
We’ve seen how SLOs help you establish clear, measurable targets and ensure customer expectations are met. If you want to learn how to write even better alert rules, go read our best practices on how to write alerts and start writing your own SLOs!