Conceptual framework for service level objectives - Alibaba Cloud Service Mesh

Service Mesh (ASM) provides built-in monitoring and alerting based on service level objectives (SLOs). SLO monitoring tracks call performance between application services -- including latency and error rate -- and triggers alerts when reliability degrades.

Key concepts

Service level indicator (SLI): A quantitative metric that measures service health. Examples include the percentage of requests that return a successful response and the percentage of requests served within a latency threshold.

Service level objective (SLO): A target value or range for an SLI over a defined period. An SLO consists of one or more SLIs and serves as a shared reliability benchmark that application developers, platform operators, and O&M personnel use to measure and continuously improve service quality.

Error budget: The amount of failure a service can tolerate while still meeting its SLO, calculated as 100% - SLO target. For example, an SLO of 99.9% gives an error budget of 0.1%.

Burn rate: The speed at which the error budget is consumed relative to the compliance period. A burn rate of 1 means the error budget will be fully consumed exactly at the end of the compliance period. Higher burn rates indicate faster depletion and more severe issues.

SLI types

ASM supports two SLI types:

SLI type	What it measures	Plug-in type	Failure condition
Service availability	Did the service respond successfully?	availability	HTTP status code is 429 or 5XX (any code starting with 5)
Latency	How long did the service take to respond?	latency	Response time exceeds the specified maximum latency
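The failure conditions in the table can be written as two small predicates. This is a minimal sketch for illustration only: the function names and the `max_latency_ms` parameter are hypothetical and are not part of any ASM API.

```python
def is_availability_failure(status_code: int) -> bool:
    """Return True if a response counts against the availability SLI.

    Follows the table above: HTTP 429 or any 5XX status code is a failure.
    """
    return status_code == 429 or 500 <= status_code <= 599


def is_latency_failure(response_time_ms: float, max_latency_ms: float) -> bool:
    """Return True if a response counts against the latency SLI.

    A request fails only when it exceeds the configured maximum latency.
    """
    return response_time_ms > max_latency_ms
```

Under this sketch, a 404 response does not count as an availability failure, and a response that takes exactly the maximum latency does not count as a latency failure.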

Set realistic objectives

An SLO that consists of multiple SLIs describes service health more accurately than a single SLI alone.

Examples of SLOs:

Average queries per second (QPS) > 100,000/s
Latency of 99% of requests < 500 ms
Bandwidth per minute for 99% of requests > 200 MB/s

When setting objectives, focus on what users actually perceive. If your users cannot distinguish between 200 ms and 600 ms latency, set the latency objective to 600 ms -- a tighter target consumes engineering effort without improving the user experience.

Different services warrant different targets. For example:

Availability target	Approximate downtime per year	Typical use case
99%	~3.65 days	Non-critical services
99.999%	~5 minutes	Mission-critical systems

Compliance period

The compliance period defines the time window over which SLIs are measured against the SLO target. The same target percentage means very different things depending on the compliance period:

99% availability over 1 day: no more than ~14 minutes of total downtime (24 hours x 1%)
99% availability over 30 days: up to ~7 hours of total downtime (30 days x 1%)
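The arithmetic behind these two figures can be reproduced with a short helper. This is a sketch that assumes a time-based availability target expressed as a fraction (0.99 for 99%); the function name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, period: timedelta) -> timedelta:
    """Total downtime that an availability target permits within `period`.

    slo_target is a fraction between 0 and 1, for example 0.99 for 99%.
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return period * (1.0 - slo_target)


one_day = allowed_downtime(0.99, timedelta(days=1))
thirty_days = allowed_downtime(0.99, timedelta(days=30))
print(one_day.total_seconds() / 60)        # about 14.4 minutes
print(thirty_days.total_seconds() / 3600)  # about 7.2 hours
```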

ASM supports compliance periods of 7, 14, 28, and 30 days.

Error budget

The error budget quantifies how much unreliability a service can tolerate while still meeting its SLO:

Error budget = 100% - SLO target

Worked example

Parameter	Value
SLI failure definition	HTTP status code is 429 or 5XX
Compliance period	30 days
SLO target	99.9%
Error budget	0.1% (100% - 99.9%)
Total requests in 30 days	10,000
Allowed error requests	10 (10,000 x 0.1%)

To meet this SLO, the service must have no more than 10 failed requests in a 30-day window.
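The worked example can be checked with a few lines of code. This is an illustrative sketch, not ASM's implementation: the function names are hypothetical, and the SLO target is passed as a decimal string so that the subtraction is exact.

```python
import math
from decimal import Decimal


def allowed_failures(total_requests: int, slo_target: str) -> int:
    """Largest number of failed requests that still meets the SLO.

    slo_target is a decimal string such as "0.999" (99.9%).
    """
    error_budget = Decimal(1) - Decimal(slo_target)
    return math.floor(total_requests * error_budget)


def slo_met(failed_requests: int, total_requests: int, slo_target: str) -> bool:
    """True when the failures observed fit within the error budget."""
    return failed_requests <= allowed_failures(total_requests, slo_target)


print(allowed_failures(10_000, "0.999"))  # 10
```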

Use error budgets to guide decisions

The remaining error budget is tracked over a rolling window whose duration equals the compliance period:

Remaining error budget >= 0: The SLO is met during the compliance period.
Remaining error budget < 0: The SLO is violated.

Use the remaining error budget to decide when to deploy changes:

Budget nearly exhausted: Postpone deployments. The risk of an SLO violation is high.
Budget sufficient at the end of the compliance period: Deploy with confidence. Even if the new version introduces some errors, the SLO is unlikely to be violated.
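One way to turn this guidance into a check is to compute the share of the error budget that is still unspent and compare it with a policy threshold. The sketch below is an assumption-laden illustration: the function names and the 20% default for `min_remaining` are hypothetical choices, not ASM settings.

```python
def remaining_budget_fraction(failed: int, total: int, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means the budget is untouched; a negative value means the SLO is violated.
    """
    allowed = total * (1.0 - slo_target)
    if allowed == 0:
        # A 100% target or zero traffic leaves no budget to spend.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed


def deployment_advice(remaining: float, min_remaining: float = 0.2) -> str:
    """Map the remaining budget to a decision.

    min_remaining is a hypothetical policy threshold chosen for this example.
    """
    return "postpone" if remaining < min_remaining else "deploy"
```

For the worked example (10,000 requests, 99.9% target), 4 failed requests leave about 60% of the budget, which this sketch treats as sufficient to deploy.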

Burn rate

Burn rate measures how fast the error budget is consumed relative to the compliance period. It is the ratio of the current error rate to the error budget:

Burn rate = Error rate / (1 - SLO target)

A burn rate of 1 means the error budget will be fully consumed exactly at the end of the compliance period. A burn rate of 2 means the budget will run out in half the time.

Burn rate examples (30-day compliance period)

Burn rate	Budget consumed per compliance period	Time to exhaust budget
1	100%	30 days
2	200%	15 days
60	6,000%	12 hours

Higher burn rates indicate more severe faults and trigger higher-severity alerts.
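The burn rate formula and the "time to exhaust budget" column can be reproduced as follows. This is a minimal sketch with hypothetical function names; it assumes the error rate stays constant over the period.

```python
from datetime import timedelta


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error rate / (1 - SLO target), with both given as fractions."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1 to leave an error budget")
    return error_rate / error_budget


def time_to_exhaust(rate: float, period: timedelta) -> timedelta:
    """Time until the whole budget is spent at a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return period / rate


print(time_to_exhaust(60, timedelta(days=30)))  # 12:00:00
```

With a 99.9% target, a sustained error rate of 0.2% corresponds to a burn rate of about 2, which exhausts a 30-day budget in 15 days.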

Alert rules

Alert rules notify you when the error budget is consumed too quickly, so that you can respond before an SLO violation occurs.

ASM uses multi-window burn rate alerting, which combines a long window for detection with a short window for auto-resolution. This approach catches both sharp spikes and slow-building issues, while clearing alerts promptly after recovery.

How multi-window alerting works

Each alert rule defines two windows:

Long window: Detects when the burn rate exceeds the threshold over a longer period. This triggers the alert.
Short window (1/12 of the long window): Checks whether the elevated error rate persists. When the error rate drops below the threshold in the short window, the alert clears automatically.

This design prevents two common problems:

Missing a gradual increase that depletes the error budget over days
Keeping alerts active long after a fault is resolved
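The two-window check can be sketched as a small function. This is a simplified illustration rather than ASM's actual evaluation logic: it assumes error rates are sampled at equal intervals with comparable traffic per sample, and it treats the alert as active only while both windows exceed the threshold, so the alert clears as soon as the short window recovers. All names are hypothetical.

```python
from typing import Sequence


def window_burn_rate(error_rates: Sequence[float], window: int, slo_target: float) -> float:
    """Mean error rate over the most recent `window` samples, divided by the error budget."""
    recent = error_rates[-window:]
    if not recent:
        return 0.0
    mean_error_rate = sum(recent) / len(recent)
    return mean_error_rate / (1.0 - slo_target)


def alert_active(error_rates: Sequence[float], long_window: int,
                 threshold: float, slo_target: float) -> bool:
    """Evaluate one multi-window burn rate rule.

    The short window is 1/12 of the long window (at least one sample).
    """
    short_window = max(1, long_window // 12)
    long_burn = window_burn_rate(error_rates, long_window, slo_target)
    short_burn = window_burn_rate(error_rates, short_window, slo_target)
    return long_burn > threshold and short_burn > threshold


# Twelve hourly samples, a 99.9% target, and a burn rate threshold of 6.
ongoing = [0.01] * 12            # fault still in progress
recovered = [0.01] * 11 + [0.0]  # fault fixed during the last hour
print(alert_active(ongoing, 12, 6, 0.999))    # True
print(alert_active(recovered, 12, 6, 0.999))  # False
```

In the `recovered` case the long window still shows a burn rate above the threshold, but the short window has dropped to zero, so the alert clears.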

Alert severity levels (30-day SLO example)

Severity	Condition	Burn rate	Response
Page-level	2% of error budget consumed in 1 hour	14.4x	Immediate action required
Page-level	5% of error budget consumed in 6 hours	6x	Immediate action required
Ticket-level	10% of error budget consumed in 1 day	3x	Track via ticket
Ticket-level	10% of error budget consumed in 3 days	1x (threshold)	Track via ticket
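The burn rate column follows from the condition column: consuming a fraction of the budget within a window requires a burn rate equal to that fraction multiplied by the ratio of the compliance period to the window. The sketch below reproduces the four thresholds for a 30-day period; the function name is hypothetical.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: float = 30) -> float:
    """Burn rate at which `budget_fraction` of the error budget is spent in `window_hours`."""
    return budget_fraction * (period_days * 24) / window_hours


# (budget fraction consumed, window in hours) for each row of the table
rules = [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]
for fraction, hours in rules:
    print(f"{fraction:.0%} in {hours} h -> burn rate {burn_rate_threshold(fraction, hours):.1f}")
```

For a 99.9% target, a burn rate of 14.4 corresponds to an error rate of about 1.44% (14.4 x 0.1%).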

Short window in practice

Suppose the error rate stays at twice the threshold for 3 days, and the fault is fixed at the end of day 3. With a 3-day long window, the short window is 6 hours (3 days / 12):

With a short window: The alert clears within ~6 hours of the fix, because the short window detects the recovery.
Without a short window: The alert can persist for up to 3 more days, until the faulty interval ages out of the long window, even though the fault no longer exists.