Alibaba Cloud Service Mesh: SLO overview

Last Updated: Mar 11, 2026

Service Mesh (ASM) provides built-in monitoring and alerting based on service level objectives (SLOs). SLO monitoring tracks performance metrics -- such as latency and error rate -- for calls between application services, giving application developers, platform operators, and O&M personnel a shared benchmark to measure and improve service quality.

Key terms

  • Service level indicator (SLI): A quantitative metric that measures service health, such as availability or response latency.

  • Service level objective (SLO): A target percentage for an SLI over a defined period. An SLO consists of one or more SLIs. Combining multiple SLIs into one SLO provides a more accurate picture of overall service health.

  • Error budget: The amount of allowable unreliability derived from an SLO target (100% - SLO target). Use the error budget to balance reliability with development velocity -- it represents the risk budget you can invest in shipping features.

SLO examples:

  • Average queries per second (QPS) > 100,000/s

  • Latency of 99% of requests < 500 ms

  • Bandwidth per minute for 99% of requests > 200 MB/s

SLI types

ASM supports two SLI types:

| SLI type | What it measures | Plug-in type | Failure condition |
|---|---|---|---|
| Service availability | The proportion of requests that receive a successful response | availability | HTTP status code is 429 or 5XX (any code starting with 5) |
| Latency | The time for a service to return a response | latency | Response time exceeds the specified maximum |
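
The failure conditions in the table can be sketched in a few lines of Python. This is only an illustration of the classification rules, not ASM's implementation; the function names and the sample status codes are hypothetical.

```python
def is_availability_failure(status_code: int) -> bool:
    """True if the response counts against the availability SLI (429 or any 5XX)."""
    return status_code == 429 or 500 <= status_code <= 599


def is_latency_failure(response_ms: float, max_ms: float) -> bool:
    """True if the response time exceeds the specified maximum."""
    return response_ms > max_ms


def availability_sli(status_codes: list[int]) -> float:
    """Proportion of requests that received a successful response."""
    if not status_codes:
        raise ValueError("at least one request is required")
    good = sum(1 for code in status_codes if not is_availability_failure(code))
    return good / len(status_codes)


# Hypothetical sample: 404 is not a failure under the rule above, 429 and 503 are.
sample_codes = [200, 200, 404, 429, 503]
print(availability_sli(sample_codes))
```

Note that under this rule a 4XX response other than 429 still counts as successful, because only 429 and 5XX codes match the failure condition.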

Set meaningful objectives

Objectives should reflect real user impact. For example, if users cannot perceive the difference between 200 ms and 600 ms latency, set the latency objective to 600 ms.

Different services warrant different targets based on their criticality:

| Availability target | Approximate downtime per year | Typical use |
|---|---|---|
| 99% | ~3.65 days (365 days x 1%) | Non-critical services |
| 99.999% | ~5 minutes (365 days x 0.001%) | Mission-critical services |

Compliance period

The compliance period defines how long SLIs are measured against the SLO target. The same target percentage has very different implications at different time scales:

| Compliance period | 99% availability allows continuous downtime of up to |
|---|---|
| 1 day | ~14 minutes (24 hours x 1%) |
| 30 days | ~7 hours (30 days x 1%) |

ASM supports compliance periods of 7, 14, 28, and 30 days.
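
As a rough check of the arithmetic above, the allowed downtime is the compliance period multiplied by (1 - target). The sketch below assumes the target is given as a fraction (0.99 for 99%); the function name is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(period: timedelta, target: float) -> timedelta:
    """Downtime permitted by an availability target over a given period."""
    if not 0 < target < 1:
        raise ValueError("target must be a fraction between 0 and 1")
    return period * (1 - target)


# Roughly 14 minutes for a 1-day period and roughly 7 hours for a 30-day period.
for days in (1, 7, 14, 28, 30):
    print(days, "days:", allowed_downtime(timedelta(days=days), 0.99))
```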

Error budget

The error budget quantifies how much failure an SLO permits before a violation occurs.

Formula: Error budget = 100% - SLO target

Example: Consider a service with the following parameters:

| Parameter | Value |
|---|---|
| SLI failure condition | HTTP status code is 429 or 5XX |
| Compliance period | 30 days |
| SLO target | 99.9% |
| Error budget | 0.1% (100% - 99.9%) |
| Total requests in 30 days | 10,000 |
| Error requests allowed | 10 (10,000 x 0.1%) |

To meet this SLO, no more than 10 failed requests are allowed within the 30-day window.
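
The same calculation can be written as a minimal Python sketch. It uses `Decimal` with the target passed as a string so that 100 - 99.9 is exactly 0.1; the function names are hypothetical, and rounding the result down to a whole request is an assumption made here.

```python
import math
from decimal import Decimal


def error_budget_percent(slo_target_percent: str) -> Decimal:
    """Error budget = 100% - SLO target, both in percent."""
    return Decimal(100) - Decimal(slo_target_percent)


def allowed_failed_requests(total_requests: int, slo_target_percent: str) -> int:
    """Number of failed requests the error budget permits (rounded down)."""
    budget_fraction = error_budget_percent(slo_target_percent) / Decimal(100)
    return math.floor(total_requests * budget_fraction)


# Values from the example: 99.9% target, 10,000 requests in the 30-day window.
print(error_budget_percent("99.9"))
print(allowed_failed_requests(10_000, "99.9"))
```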

Use the error budget to guide decisions

The error budget provides a data-driven framework for balancing reliability with development velocity:

  • Budget nearly exhausted: Avoid deploying new versions. Focus on reliability.

  • Budget remaining at the end of the compliance period: Deploy updates, since the risk of an SLO violation is low.

The remaining error budget is calculated over a rolling window equal to the compliance period:

  • Remaining error budget >= 0: The SLO is met within the compliance period.

  • Remaining error budget < 0: An SLO violation has occurred.
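
A minimal sketch of this check, assuming the budget is counted in requests over the rolling window. The function names are hypothetical, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def remaining_budget(total_requests: int, failed_requests: int, slo_target: str) -> Fraction:
    """Failed requests still permitted in the window (negative means a violation)."""
    allowed = total_requests * (1 - Fraction(slo_target))
    return allowed - failed_requests


def slo_is_met(total_requests: int, failed_requests: int, slo_target: str) -> bool:
    """The SLO is met while the remaining budget is not negative."""
    return remaining_budget(total_requests, failed_requests, slo_target) >= 0


# With a 99.9% target and 10,000 requests, 10 failures are allowed.
print(remaining_budget(10_000, 4, "0.999"), slo_is_met(10_000, 4, "0.999"))
print(remaining_budget(10_000, 12, "0.999"), slo_is_met(10_000, 12, "0.999"))
```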

Burn rate

The burn rate measures how fast the error budget is being consumed relative to the compliance period. A burn rate of 1 means the budget will be fully consumed over exactly the compliance period. A burn rate of 2 means the budget will be exhausted in half the time.

Formula: Burn rate = Error rate / (1 - SLO target), with both values expressed as fractions (for example, 0.999 for a 99.9% target)

To calculate how long until the budget runs out, divide the compliance period by the burn rate.

Example for a 30-day compliance period:

| Burn rate | Time to exhaust the error budget | Meaning |
|---|---|---|
| 1 | 30 days (30 / 1) | Budget consumed at exactly the expected pace |
| 2 | 15 days (30 / 2) | Budget consumed at 2x the expected pace |
| 60 | 12 hours (30 / 60) | Budget consumed at 60x the expected pace -- immediate attention required |
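
The burn rate formula and the table above can be reproduced with a short Python sketch. The function names are hypothetical; the error rate and SLO target are assumed to be fractions.

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error rate / (1 - SLO target)."""
    return error_rate / (1 - slo_target)


def days_to_exhaustion(compliance_days: float, rate: float) -> float:
    """Compliance period divided by the burn rate; infinite if nothing is burning."""
    if rate <= 0:
        return math.inf
    return compliance_days / rate


# Reproduce the rows of the table for a 30-day compliance period.
for rate in (1, 2, 60):
    days = days_to_exhaustion(30, rate)
    print(f"burn rate {rate}: {days} days ({days * 24} hours)")
```

For instance, an observed error rate of 0.2% against a 99.9% target gives a burn rate of about 2, which corresponds to the 15-day row.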

Alert rules

Alert rules notify you when the error budget is being consumed too quickly, giving you time to respond before an SLO violation occurs.

ASM uses a multi-window burn rate approach. This method detects both fast burns (high error rates over short periods) and slow burns (moderate error rates over longer periods), while filtering out unnecessary alerts from brief, minor spikes.

How multi-window alerting works

Each alert rule uses two time windows:

| Window | Role | Unit |
|---|---|---|
| Long window | Detection -- measures the burn rate over a longer period to identify sustained issues | Hours to days |
| Short window | Recovery -- checks the recent burn rate to clear alerts promptly after an issue is resolved | Minutes (1/12 of the long window) |

An alert fires only when the burn rate exceeds the threshold in both windows. This two-window design prevents stale alerts from persisting after a fault is fixed.
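
The two-window condition can be illustrated with a small Python sketch. It is not ASM's rule engine: the function names and sample burn rates are hypothetical, and using a strict "greater than" comparison is an assumption.

```python
from datetime import timedelta


def short_window_for(long_window: timedelta) -> timedelta:
    """Short window sized at 1/12 of the long window."""
    return long_window / 12


def alert_fires(long_burn_rate: float, short_burn_rate: float, threshold: float) -> bool:
    """Fire only when the burn rate exceeds the threshold in BOTH windows."""
    return long_burn_rate > threshold and short_burn_rate > threshold


# Ongoing fault: both windows are above the threshold, so the alert fires.
print(alert_fires(long_burn_rate=16.0, short_burn_rate=20.0, threshold=14.4))
# Fault fixed: the long window is still elevated, but the short window has recovered.
print(alert_fires(long_burn_rate=16.0, short_burn_rate=0.5, threshold=14.4))
# A 1-hour long window pairs with a 5-minute short window.
print(short_window_for(timedelta(hours=1)))
```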

Fast burn vs. slow burn alerts

For a 30-day SLO, ASM supports two alert severity levels that correspond to different burn patterns:

| Alert type | Burn pattern | Trigger condition | Burn rate threshold |
|---|---|---|---|
| Page-level (fast burn) | High error rate over a short period -- requires immediate attention | 2% of error budget consumed in 1 hour, or 5% consumed in 6 hours | 14.4x or 6x |
| Ticket-level (slow burn) | Moderate error rate over a longer period -- investigate during working hours | 10% of error budget consumed in 1 day, or 10% consumed in 3 days | 3x or 1x |
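
The burn rate thresholds follow from the trigger conditions: consuming a given share of the budget within a window implies a burn rate of (budget share x compliance period / window length). The Python sketch below checks the four values in the table; the function name is hypothetical.

```python
import math


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        compliance_days: float = 30) -> float:
    """Burn rate at which `budget_fraction` of the budget is consumed in `window_hours`."""
    compliance_hours = compliance_days * 24
    return budget_fraction * compliance_hours / window_hours


# (budget share, window in hours) pairs taken from the table above.
conditions = [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]
for share, hours in conditions:
    print(f"{share:.0%} in {hours} h -> {burn_rate_threshold(share, hours):.1f}x")

# Floating-point results are compared with a tolerance rather than exact equality.
assert math.isclose(burn_rate_threshold(0.02, 1), 14.4)
```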

Why the short window matters

Without the short window, an alert persists for the duration of the long window even after the underlying fault is fixed. With the short window set to 1/12 of the long window, alerts clear shortly after the error rate drops.

Example: The error rate remains at 2x the threshold for 3 days, and the fault is fixed on day 3. With the short window configured (6 hours, which is 1/12 of the 3-day long window), the alert clears within approximately 6 hours. Without the short window, the alert could persist for up to 3 more days after the fault is resolved, until the old errors age out of the long window.