Understand SLO Error Budgets and Burn Rate Alerts - Alibaba Cloud Service Mesh

Service Mesh (ASM) provides built-in monitoring and alerting based on service level objectives (SLOs). SLO monitoring tracks performance metrics -- such as latency and error rate -- for calls between application services, giving application developers, platform operators, and O&M personnel a shared benchmark to measure and improve service quality.

Key terms

Service level indicator (SLI): A quantitative metric that measures service health, such as availability or response latency.
Service level objective (SLO): A target percentage for an SLI over a defined period. An SLO consists of one or more SLIs. Combining multiple SLIs into one SLO provides a more accurate picture of overall service health.
Error budget: The amount of allowable unreliability derived from an SLO target (100% - SLO target). Use the error budget to balance reliability with development velocity -- it represents the risk budget you can invest in shipping features.

SLO examples:

Average queries per second (QPS) > 100,000/s
Latency of 99% of requests < 500 ms
Bandwidth per minute for 99% of requests > 200 MB/s

SLI types

ASM supports two SLI types:

SLI type	What it measures	Plug-in type	Failure condition
Service availability	The proportion of requests that receive a successful response	`availability`	HTTP status code is 429 or 5XX (any code starting with 5)
Latency	The time for a service to return a response	`latency`	Response time exceeds the specified maximum
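
The two failure conditions in the table can be written as simple predicates. This is a minimal sketch for illustration, not ASM's implementation; the function names and the `max_latency_ms` parameter are hypothetical.

```python
def is_availability_failure(status_code: int) -> bool:
    """Return True if a response counts against the availability SLI.

    Per the table: HTTP 429, or any status code starting with 5.
    """
    return status_code == 429 or 500 <= status_code <= 599


def is_latency_failure(response_time_ms: float, max_latency_ms: float) -> bool:
    """Return True if the response time exceeds the specified maximum."""
    return response_time_ms > max_latency_ms
```

For example, `is_availability_failure(404)` is False because a 4XX code other than 429 does not count as a failure under this definition.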

Set meaningful objectives

Objectives should reflect real user impact. For example, if users cannot perceive the difference between 200 ms and 600 ms latency, set the latency objective to 600 ms.

Different services warrant different targets based on their criticality:

Availability target	Approximate downtime per year	Typical use
99%	~3.65 days	Non-critical services
99.999%	~5 minutes	Mission-critical services

Compliance period

The compliance period defines how long SLIs are measured against the SLO target. The same target percentage has very different implications at different time scales:

Compliance period	99% availability allows continuous downtime of up to
1 day	~14 minutes (24 hours x 1%)
30 days	~7 hours (30 days x 1%)
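
The downtime figures in these tables follow from multiplying the period by (100% - target). A small sketch of that arithmetic, where `allowed_downtime` is a hypothetical helper name:

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, period: timedelta) -> timedelta:
    """Downtime permitted by an availability target over a period.

    slo_target is a fraction, e.g. 0.99 for 99%.
    """
    return period * (1.0 - slo_target)


one_day = allowed_downtime(0.99, timedelta(days=1))
thirty_days = allowed_downtime(0.99, timedelta(days=30))
print(one_day.total_seconds() / 60)        # about 14.4 minutes
print(thirty_days.total_seconds() / 3600)  # about 7.2 hours
```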

ASM supports compliance periods of 7, 14, 28, and 30 days.

Error budget

The error budget quantifies how much failure an SLO permits before a violation occurs.

Formula: Error budget = 100% - SLO target

Example: Consider a service with the following parameters:

Parameter	Value
SLI failure condition	HTTP status code is 429 or 5XX
Compliance period	30 days
SLO target	99.9%
Error budget	0.1% (100% - 99.9%)
Total requests in 30 days	10,000
Error requests allowed	10 (10,000 x 0.1%)

To meet this SLO, no more than 10 failed requests are allowed within the 30-day window.
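
The same calculation as a short sketch. The helper name `allowed_failures` is hypothetical, and the target is passed as a percentage string so that `Fraction` can avoid floating-point rounding:

```python
from fractions import Fraction


def allowed_failures(total_requests: int, slo_target_percent: str) -> int:
    """Number of failed requests the error budget permits.

    slo_target_percent is a string such as "99.9" so the arithmetic is exact.
    """
    error_budget = 1 - Fraction(slo_target_percent) / 100
    # Round down: a partial request cannot fail.
    return int(total_requests * error_budget)


print(allowed_failures(10_000, "99.9"))  # 10
```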

Use the error budget to guide decisions

The error budget provides a data-driven framework for balancing reliability with development velocity:

Budget nearly exhausted: Avoid deploying new versions. Focus on reliability.
Budget remaining at the end of the compliance period: Deploy updates, since the risk of an SLO violation is low.

The error budget is calculated over a rolling window equal to the compliance period:

Remaining error budget >= 0: The SLO is met within the compliance period.
Remaining error budget < 0: The budget is overspent and an SLO violation has occurred.
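
One way to picture the rolling window is to sum request counts over the most recent days only. The sketch below assumes per-day counts are available as `(total_requests, failed_requests)` pairs; that data shape and the name `remaining_budget` are assumptions for illustration, not an ASM interface.

```python
from fractions import Fraction
from typing import Sequence, Tuple


def remaining_budget(
    daily_counts: Sequence[Tuple[int, int]],
    slo_target_percent: str,
    period_days: int = 30,
) -> Fraction:
    """Failed requests still allowed in the rolling window (negative if overspent).

    daily_counts holds (total_requests, failed_requests) per day, oldest first.
    Only the most recent period_days entries are counted.
    """
    window = daily_counts[-period_days:]
    total = sum(t for t, _ in window)
    failed = sum(f for _, f in window)
    allowed = total * (1 - Fraction(slo_target_percent) / 100)
    return allowed - failed


# 30 days of 1,000 requests each, with 5 failures on the last day.
history = [(1000, 0)] * 29 + [(1000, 5)]
print(remaining_budget(history, "99.9") >= 0)  # True: the SLO is met
```

Because the window rolls, failures older than the compliance period stop counting against the budget.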

Burn rate

The burn rate measures how fast the error budget is being consumed relative to the compliance period. A burn rate of 1 means the budget will be fully consumed over exactly the compliance period. A burn rate of 2 means the budget will be exhausted in half the time.

Formula: Burn rate = Error rate / (1 - SLO)

To calculate how long until the budget runs out, divide the compliance period by the burn rate.

Example for a 30-day compliance period:

Burn rate	Time to exhaust the error budget	Meaning
1	30 days (30 / 1)	Budget consumed at exactly the expected pace
2	15 days (30 / 2)	Budget consumed at 2x the expected pace
60	12 hours (30 / 60)	Budget consumed at 60x the expected pace -- immediate attention required
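
The formula and the table rows can be reproduced with a few lines. This is a sketch; the function names are hypothetical, and `slo_target` and `error_rate` are fractions (0.999 for 99.9%).

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error rate / (1 - SLO)."""
    return error_rate / (1.0 - slo_target)


def days_to_exhaustion(compliance_period_days: float, rate: float) -> float:
    """Time until the budget runs out: compliance period / burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return compliance_period_days / rate


# A 0.2% error rate against a 99.9% SLO burns the budget at about 2x.
rate = burn_rate(0.002, 0.999)
print(round(rate, 6))                # 2.0
print(days_to_exhaustion(30, 2))     # 15.0 days
print(days_to_exhaustion(30, 60))    # 0.5 days, that is, 12 hours
```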

Alert rules

Alert rules notify you when the error budget is being consumed too quickly, giving you time to respond before an SLO violation occurs.

ASM uses a multi-window burn rate approach. This method detects both fast burns (high error rates over short periods) and slow burns (moderate error rates over longer periods), while filtering out unnecessary alerts from brief, minor spikes.

How multi-window alerting works

Each alert rule uses two time windows:

Window	Role	Typical length
Long window	Detection -- measures the burn rate over a longer period to identify sustained issues	Hours to days
Short window	Recovery -- checks the recent burn rate to clear alerts promptly after an issue is resolved	Minutes to hours (1/12 of the long window)

An alert fires only when the burn rate exceeds the threshold in both windows. This two-window design prevents stale alerts from persisting after a fault is fixed.
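
The firing condition can be sketched as follows. The function names are hypothetical, and the burn rates are assumed to have been measured already over each window; this illustrates the rule, not how ASM evaluates it internally.

```python
from datetime import timedelta


def short_window_for(long_window: timedelta) -> timedelta:
    """The short window is 1/12 of the long window."""
    return long_window / 12


def should_alert(long_burn_rate: float, short_burn_rate: float, threshold: float) -> bool:
    """Fire only when the burn rate exceeds the threshold in both windows."""
    return long_burn_rate > threshold and short_burn_rate > threshold


print(short_window_for(timedelta(hours=1)))  # 0:05:00
print(should_alert(15.0, 16.0, 14.4))        # True: sustained and still happening
print(should_alert(15.0, 0.5, 14.4))         # False: the issue has already subsided
```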

Fast burn vs. slow burn alerts

For a 30-day SLO, ASM supports two alert severity levels that correspond to different burn patterns:

Alert type	Burn pattern	Trigger condition	Burn rate threshold
Page-level (fast burn)	High error rate over a short period -- requires immediate attention	2% of error budget consumed in 1 hour, or 5% consumed in 6 hours	14.4x or 6x
Ticket-level (slow burn)	Moderate error rate over a longer period -- investigate during working hours	10% of error budget consumed in 1 day, or 10% consumed in 3 days	3x or 1x
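
Each threshold in the table follows from the trigger condition: the fraction of the budget consumed, multiplied by the compliance period and divided by the window length. A short sketch of that derivation, with a hypothetical helper name:

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float, period_days: int = 30) -> float:
    """Burn rate at which budget_fraction of the error budget is consumed in window_hours."""
    period_hours = period_days * 24
    return budget_fraction * period_hours / window_hours


# (budget fraction consumed, window in hours) for each trigger condition.
for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]:
    print(fraction, hours, round(burn_rate_threshold(fraction, hours), 2))
# Thresholds: 14.4, 6.0, 3.0, 1.0
```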

Why the short window matters

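Without the short window, an alert keeps firing until the long-window average falls back below the threshold, which can take up to the full length of the long window after the underlying fault is fixed. With the short window set to 1/12 of the long window, alerts clear shortly after the error rate drops.

Example: The error rate remains at 2x the threshold for 3 days, and the fault is fixed on day 3. With a 6-hour short window (1/12 of the 3-day long window), the alert clears within about 6 hours, once the short-window burn rate falls below the threshold. Without the short window, the alert would keep firing until the 3-day average drops below the threshold, which takes far longer and in the worst case the full 3 days after the fault is resolved.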