Service Mesh (ASM) provides built-in monitoring and alerting based on service level objectives (SLOs). SLO monitoring tracks performance metrics -- such as latency and error rate -- for calls between application services, giving application developers, platform operators, and O&M personnel a shared benchmark to measure and improve service quality.
Key terms
Service level indicator (SLI): A quantitative metric that measures service health, such as availability or response latency.
Service level objective (SLO): A target percentage for an SLI over a defined period. An SLO consists of one or more SLIs. Combining multiple SLIs into one SLO provides a more accurate picture of overall service health.
Error budget: The amount of allowable unreliability derived from an SLO target (100% - SLO target). Use the error budget to balance reliability with development velocity -- it represents the risk budget you can invest in shipping features.
SLO examples:
Average queries per second (QPS) > 100,000/s
Latency of 99% of requests < 500 ms
Bandwidth per minute for 99% of requests > 200 MB/s
SLI types
ASM supports two SLI types:
| SLI type | What it measures | Plug-in type | Failure condition |
|---|---|---|---|
| Service availability | The proportion of requests that receive a successful response | availability | HTTP status code is 429 or 5XX (any code starting with 5) |
| Latency | The time for a service to return a response | latency | Response time exceeds the specified maximum |
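
The failure conditions in the table above can be sketched as two small predicates. This is an illustrative sketch, not ASM's implementation: the function names and the choice to report a ratio of 1.0 when there is no traffic are assumptions made for the example.

```python
from typing import Iterable


def is_availability_failure(status_code: int) -> bool:
    """True if the response counts against the availability SLI (429 or 5XX)."""
    return status_code == 429 or 500 <= status_code <= 599


def is_latency_failure(response_ms: float, max_ms: float) -> bool:
    """True if the response time exceeds the specified maximum."""
    return response_ms > max_ms


def good_ratio(failures: Iterable[bool]) -> float:
    """Proportion of requests that did not fail.

    Returns 1.0 when there are no requests (an assumption for this sketch).
    """
    flags = list(failures)
    if not flags:
        return 1.0
    return (len(flags) - sum(flags)) / len(flags)


status_codes = [200, 200, 503, 429, 404]
availability = good_ratio(is_availability_failure(c) for c in status_codes)
print(availability)  # 0.6: only 503 and 429 count as failures, 404 does not
```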
Set meaningful objectives
Objectives should reflect real user impact. For example, if users cannot perceive the difference between 200 ms and 600 ms latency, set the latency objective to 600 ms.
Different services warrant different targets based on their criticality:
| Availability target | Approximate downtime per year | Typical use |
|---|---|---|
| 99% | ~3.65 days | Non-critical services |
| 99.999% | ~5 minutes | Mission-critical services |
Compliance period
The compliance period defines how long SLIs are measured against the SLO target. The same target percentage has very different implications at different time scales:
| Compliance period | 99% availability allows continuous downtime of up to |
|---|---|
| 1 day | ~14 minutes (24 hours x 1%) |
| 30 days | ~7 hours (30 days x 1%) |
ASM supports compliance periods of 7, 14, 28, and 30 days.
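The downtime figures in this section follow from multiplying the compliance period by the unavailable fraction (1 - target). The sketch below reproduces that arithmetic; `allowed_downtime` is a hypothetical helper name, not part of ASM.

```python
from datetime import timedelta


def allowed_downtime(target: float, period: timedelta) -> timedelta:
    """Maximum downtime permitted by an availability target over a period.

    `target` is a fraction, for example 0.99 for 99%.
    """
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return period * (1 - target)


# Compliance periods supported by ASM, evaluated at a 99% target.
for days in (7, 14, 28, 30):
    downtime = allowed_downtime(0.99, timedelta(days=days))
    print(f"{days} days at 99%: {downtime.total_seconds() / 3600:.2f} hours")
```

For a 1-day period at 99%, the result is roughly 14.4 minutes, and for 30 days roughly 7.2 hours, which matches the table above.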
Error budget
The error budget quantifies how much failure an SLO permits before a violation occurs.
Formula: Error budget = 100% - SLO target
Example: Consider a service with the following parameters:
| Parameter | Value |
|---|---|
| SLI failure condition | HTTP status code is 429 or 5XX |
| Compliance period | 30 days |
| SLO target | 99.9% |
| Error budget | 0.1% (100% - 99.9%) |
| Total requests in 30 days | 10,000 |
| Error requests allowed | 10 (10,000 x 0.1%) |
To meet this SLO, no more than 10 failed requests are allowed within the 30-day window.
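The arithmetic in this example can be checked with a short sketch. It uses `fractions.Fraction` so that 100% - 99.9% is exactly 0.1% rather than a floating-point approximation. The function names are hypothetical, and passing the target as a decimal string is a choice made for this sketch.

```python
import math
from fractions import Fraction


def error_budget(slo_target: str) -> Fraction:
    """Error budget as a fraction: 1 - SLO target (target given as e.g. "0.999")."""
    return 1 - Fraction(slo_target)


def allowed_failures(total_requests: int, slo_target: str) -> int:
    """Largest whole number of failed requests that still meets the SLO."""
    return math.floor(total_requests * error_budget(slo_target))


print(error_budget("0.999"))              # 1/1000, that is 0.1%
print(allowed_failures(10_000, "0.999"))  # 10
```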
Use the error budget to guide decisions
The error budget provides a data-driven framework for balancing reliability with development velocity:
Budget nearly exhausted: Avoid deploying new versions. Focus on reliability.
Budget remaining at the end of the compliance period: Deploy updates, since the risk of an SLO violation is low.
The remaining error budget is calculated over a rolling window equal to the compliance period:
Remaining error budget >= 0: The SLO is met within the compliance period.
Remaining error budget < 0: An SLO violation has occurred.
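One way to read the two conditions above is as the sign of the budget left after subtracting the observed failure ratio in the window. The sketch below assumes that interpretation. The function name and the request-count inputs are hypothetical and do not describe how ASM computes the value internally.

```python
from fractions import Fraction


def remaining_error_budget(total: int, failed: int, slo_target: str) -> Fraction:
    """Budget left in the window, as a fraction of all requests.

    Computed as (1 - SLO target) minus the observed failure ratio.
    With no traffic, the full budget is reported (an assumption).
    """
    budget = 1 - Fraction(slo_target)
    if total == 0:
        return budget
    return budget - Fraction(failed, total)


for failed in (4, 10, 12):
    left = remaining_error_budget(10_000, failed, "0.999")
    status = "SLO met" if left >= 0 else "SLO violated"
    print(f"{failed} failures: remaining budget {float(left):+.4%} -> {status}")
```

With 10,000 requests and a 99.9% target, 4 or 10 failures leave a non-negative budget, while 12 failures push it below zero.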
Burn rate
The burn rate measures how fast the error budget is being consumed relative to the compliance period. A burn rate of 1 means the budget will be fully consumed over exactly the compliance period. A burn rate of 2 means the budget will be exhausted in half the time.
Formula: Burn rate = Error rate / (1 - SLO)
To calculate how long until the budget runs out, divide the compliance period by the burn rate.
Example for a 30-day compliance period:
| Burn rate | Time to exhaust the error budget | Meaning |
|---|---|---|
| 1 | 30 days (30 / 1) | Budget consumed at exactly the expected pace |
| 2 | 15 days (30 / 2) | Budget consumed at 2x the expected pace |
| 60 | 12 hours (30 / 60) | Budget consumed at 60x the expected pace -- immediate attention required |
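
The burn rate formula and the time-to-exhaustion rule can be sketched as follows. The function names are hypothetical, and the error rate and SLO target are expressed as fractions (0.002 for 0.2%, 0.999 for 99.9%).

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error rate / (1 - SLO target)."""
    if not 0 <= slo_target < 1:
        raise ValueError("slo_target must be in [0, 1)")
    return error_rate / (1 - slo_target)


def days_to_exhaustion(period_days: float, rate: float) -> float:
    """Compliance period divided by burn rate; infinite if nothing is burning."""
    if rate <= 0:
        return math.inf
    return period_days / rate


# A 0.2% error rate against a 99.9% target burns the budget about twice as
# fast as the pace that would use it up in exactly one compliance period.
rate = burn_rate(error_rate=0.002, slo_target=0.999)
print(f"burn rate ~ {rate:.2f}, budget lasts ~ {days_to_exhaustion(30, rate):.1f} days")
```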
Alert rules
Alert rules notify you when the error budget is being consumed too quickly, giving you time to respond before an SLO violation occurs.
ASM uses a multi-window burn rate approach. This method detects both fast burns (high error rates over short periods) and slow burns (moderate error rates over longer periods), while filtering out unnecessary alerts from brief, minor spikes.
How multi-window alerting works
Each alert rule uses two time windows:
| Window | Role | Unit |
|---|---|---|
| Long window | Detection -- measures the burn rate over a longer period to identify sustained issues | Hours to days |
| Short window | Recovery -- checks the recent burn rate to clear alerts promptly after an issue is resolved | Minutes to hours (1/12 of the long window) |
An alert fires only when the burn rate exceeds the threshold in both windows. This two-window design prevents stale alerts from persisting after a fault is fixed.
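The two-window condition can be sketched as a simple predicate. This is a minimal illustration that assumes the burn rates for both windows have already been computed elsewhere. `BurnRateRule` and `should_fire` are hypothetical names, not ASM configuration fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    threshold: float          # burn rate that must be exceeded
    long_window_hours: float  # detection window

    @property
    def short_window_hours(self) -> float:
        """Recovery window, fixed at 1/12 of the long window."""
        return self.long_window_hours / 12


def should_fire(rule: BurnRateRule, long_burn: float, short_burn: float) -> bool:
    """Fire only when BOTH windows exceed the threshold."""
    return long_burn > rule.threshold and short_burn > rule.threshold


rule = BurnRateRule(threshold=14.4, long_window_hours=1)
print(should_fire(rule, long_burn=20.0, short_burn=20.0))  # True: fault is ongoing
print(should_fire(rule, long_burn=20.0, short_burn=0.5))   # False: fault already fixed
```

The second call shows the recovery role of the short window: the long window still reflects the earlier errors, but the alert no longer fires because the recent burn rate is back under the threshold.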
Fast burn vs. slow burn alerts
For a 30-day SLO, ASM supports two alert severity levels that correspond to different burn patterns:
| Alert type | Burn pattern | Trigger condition | Burn rate threshold |
|---|---|---|---|
| Page-level (fast burn) | High error rate over a short period -- requires immediate attention | 2% of error budget consumed in 1 hour, or 5% consumed in 6 hours | 14.4x or 6x |
| Ticket-level (slow burn) | Moderate error rate over a longer period -- investigate during working hours | 10% of error budget consumed in 1 day, or 10% consumed in 3 days | 3x or 1x |
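
The burn rate thresholds in the table follow from the trigger conditions: consuming a given fraction of the budget within a window requires a burn rate of that fraction multiplied by the compliance period and divided by the window length. The sketch below works through that arithmetic for a 30-day period; `burn_rate_threshold` is a hypothetical helper name.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the budget is used in `window_hours`."""
    period_hours = period_days * 24
    return budget_fraction * period_hours / window_hours


# (fraction of budget consumed, window in hours) for the four trigger conditions.
conditions = [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]
for fraction, hours in conditions:
    threshold = burn_rate_threshold(fraction, hours)
    print(f"{fraction:.0%} of budget in {hours} h -> {threshold:g}x")
```

The four results correspond to the 14.4x, 6x, 3x, and 1x thresholds listed above.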
Why the short window matters
Without the short window, an alert persists for the duration of the long window even after the underlying fault is fixed. With the short window set to 1/12 of the long window, alerts clear shortly after the error rate drops.
Example: The error rate remains at 2x the threshold for 3 days. The fault is fixed on day 3. With the short window configured, the alert clears approximately 6 hours later. Without the short window, the alert could persist for up to 3 days after the fault is resolved.