Service Mesh (ASM) provides built-in monitoring and alerting based on service level objectives (SLOs). SLO monitoring tracks call performance between application services -- including latency and error rate -- and triggers alerts when reliability degrades.
Key concepts
Service level indicator (SLI): A quantitative metric that measures service health. For example: the percentage of requests that return a successful response, or the percentage of requests served within a latency threshold.
Service level objective (SLO): A target value or range for an SLI over a defined period. An SLO consists of one or more SLIs and serves as a shared reliability benchmark for application developers, platform operators, and O&M personnel to measure and continuously improve service quality.
Error budget: The allowable amount of failure or technical debt derived from the SLO target. Calculated as 100% - SLO target. An SLO of 99.9% gives an error budget of 0.1%.
Burn rate: The speed at which the error budget is consumed. A burn rate of 1 means the error budget will be fully consumed by the end of the compliance period. Higher burn rates indicate faster depletion and more severe issues.
SLI types
ASM supports two SLI types:
| SLI type | What it measures | Plug-in type | Failure condition |
|---|---|---|---|
| Service availability | Did the service respond successfully? | availability | HTTP status code is 429 or 5XX (any code starting with 5) |
| Latency | How long did the service take to respond? | latency | Response time exceeds the specified maximum latency |
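The failure conditions in the table can be sketched as two small predicates. This is an illustrative sketch only, not ASM's implementation; the function names and the millisecond unit are assumptions made for the example.

```python
def is_availability_failure(status_code: int) -> bool:
    """Return True if a response counts against the availability SLI.

    Follows the table above: HTTP 429 or any 5XX status is a failure.
    """
    return status_code == 429 or 500 <= status_code <= 599


def is_latency_failure(response_ms: float, max_latency_ms: float) -> bool:
    """Return True if the response time exceeds the specified maximum latency."""
    return response_ms > max_latency_ms


if __name__ == "__main__":
    for code in (200, 404, 429, 503):
        print(code, "failure" if is_availability_failure(code) else "ok")
```

Note that under this definition a 4XX response other than 429 (for example, 404) does not count as an availability failure.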
Set realistic objectives
An SLO that consists of multiple SLIs describes service health more accurately than a single SLI alone.
Examples of SLOs:
Average queries per second (QPS) > 100,000/s
Latency of 99% of requests < 500 ms
Bandwidth per minute for 99% of requests > 200 MB/s
When setting objectives, focus on what users actually perceive. If your users cannot distinguish between 200 ms and 600 ms latency, set the latency objective to 600 ms -- a tighter target consumes engineering effort without improving the user experience.
Different services warrant different targets. For example:
| Availability target | Approximate downtime per year | Typical use case |
|---|---|---|
| 99% | ~3.65 days | Non-critical services |
| 99.999% | ~5 minutes | Mission-critical systems |
Compliance period
The compliance period defines the time window over which SLIs are measured against the SLO target. The same target percentage means very different things depending on the compliance period:
99% availability over 1 day: no more than ~14 minutes of continuous downtime (24 hours x 1%)
99% availability over 30 days: up to ~7 hours of continuous downtime (30 days x 1%)
ASM supports compliance periods of 7, 14, 28, and 30 days.
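The downtime figures above follow from multiplying the compliance period by (1 - target). A minimal sketch of that arithmetic, assuming the target is given as a fraction (0.99 for 99%); the function name is hypothetical:

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, period: timedelta) -> timedelta:
    """Maximum total downtime that still meets an availability target.

    slo_target is a fraction, e.g. 0.99 for 99%.
    """
    return period * (1 - slo_target)


if __name__ == "__main__":
    # 7, 14, 28, and 30 days are the periods ASM supports; 1 day is
    # included only to reproduce the comparison in the text.
    for days in (1, 7, 14, 28, 30):
        print(days, "days:", allowed_downtime(0.99, timedelta(days=days)))
```

For a 99% target this yields about 14 minutes over 1 day and about 7.2 hours over 30 days, matching the comparison above.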
Error budget
The error budget quantifies how much unreliability a service can tolerate while still meeting its SLO:
Error budget = 100% - SLO target
Worked example
| Parameter | Value |
|---|---|
| SLI failure definition | HTTP status code is 429 or 5XX |
| Compliance period | 30 days |
| SLO target | 99.9% |
| Error budget | 0.1% (100% - 99.9%) |
| Total requests in 30 days | 10,000 |
| Allowed error requests | 10 (10,000 x 0.1%) |
To meet this SLO, the service must have no more than 10 failed requests in a 30-day window.
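The same calculation can be sketched in code. The function name is hypothetical, and the target is passed as a string so that `Decimal` avoids the binary floating-point rounding that would otherwise turn 10 allowed failures into 9.99...:

```python
from decimal import Decimal


def allowed_failures(total_requests: int, slo_target_percent: str) -> int:
    """Number of failed requests permitted by the error budget.

    slo_target_percent is a percentage given as a string, e.g. "99.9".
    """
    budget_percent = Decimal(100) - Decimal(slo_target_percent)
    # Truncate: a fractional request cannot fail, so round the allowance down.
    return int(total_requests * budget_percent / Decimal(100))


if __name__ == "__main__":
    print(allowed_failures(10_000, "99.9"))
```

With the values from the table (10,000 requests, 99.9% target), the function returns 10.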
Use error budgets to guide decisions
The remaining error budget is recalculated over a rolling window with the same duration as the compliance period:
Remaining error budget >= 0: The SLO is met during the compliance period.
Remaining error budget < 0: The SLO is violated.
Use the remaining error budget to decide when to deploy changes:
Budget nearly exhausted: Postpone deployments. The risk of an SLO violation is high.
Budget sufficient at the end of the compliance period: Deploy with confidence. Even if the new version introduces some errors, the SLO is unlikely to be violated.
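One way to turn this guidance into a deployment gate is sketched below. The function names and the 20% "nearly exhausted" cutoff are assumptions chosen for illustration; the source does not define a specific threshold.

```python
def remaining_budget_fraction(total_requests: int, failed_requests: int,
                              slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is a fraction, e.g. 0.999 for 99.9%.
    """
    allowed = total_requests * (1 - slo_target)
    return (allowed - failed_requests) / allowed


def deploy_advice(fraction_left: float, min_fraction: float = 0.2) -> str:
    """Map the remaining budget to an action.

    min_fraction is a hypothetical cutoff for "nearly exhausted".
    """
    if fraction_left < 0:
        return "violated"
    if fraction_left < min_fraction:
        return "postpone"
    return "deploy"


if __name__ == "__main__":
    for failed in (4, 9, 12):
        left = remaining_budget_fraction(10_000, failed, 0.999)
        print(failed, "failed:", deploy_advice(left))
```

A team would tune the cutoff to its own risk tolerance and release cadence.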
Burn rate
Burn rate measures how fast the error budget is consumed relative to the compliance period. It is the ratio of the current error rate to the error budget:
Burn rate = Error rate / (1 - SLO target)
A burn rate of 1 means the error budget will be fully consumed exactly at the end of the compliance period. A burn rate of 2 means the budget will run out in half the time.
Burn rate examples (30-day compliance period)
| Burn rate | Budget consumed per compliance period | Time to exhaust budget |
|---|---|---|
| 1 | 100% | 30 days |
| 2 | 200% | 15 days |
| 60 | 6,000% | 12 hours |
Higher burn rates indicate more severe faults and trigger higher-severity alerts.
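The burn rate formula and the "time to exhaust" column can be reproduced with a few lines. This is a sketch with hypothetical function names; rates and targets are fractions (0.002 for a 0.2% error rate, 0.999 for a 99.9% target).

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error rate / (1 - SLO target)."""
    return error_rate / (1 - slo_target)


def days_to_exhaust(rate: float, compliance_days: float = 30) -> float:
    """Days until the whole error budget is spent at a constant burn rate."""
    return compliance_days / rate


if __name__ == "__main__":
    for rate in (1, 2, 60):
        print(f"burn rate {rate}: {days_to_exhaust(rate)} days")
```

For example, a 0.2% error rate against a 99.9% target is a burn rate of about 2, which exhausts a 30-day budget in 15 days.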
Alert rules
Alert rules notify you when the error budget is consumed too quickly, so that you can respond before an SLO violation occurs.
ASM uses multi-window burn rate alerting, which combines a long window for detection with a short window for auto-resolution. This approach catches both sharp spikes and slow-building issues, while clearing alerts promptly after recovery.
How multi-window alerting works
Each alert rule defines two windows:
Long window: Detects when the burn rate exceeds the threshold over a longer period. This triggers the alert.
Short window (1/12 of the long window): Checks whether the elevated error rate persists. When the error rate drops below the threshold in the short window, the alert clears automatically.
This design prevents two common problems:
Missing a gradual increase that depletes the error budget over days
Keeping alerts active long after a fault is resolved
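A common way to express this two-window logic is to keep the alert active only while both windows exceed the burn rate threshold. The sketch below assumes that reading; it is not ASM's actual rule definition, and the function names are hypothetical.

```python
def short_window_hours(long_window_hours: float) -> float:
    """The short window is 1/12 of the long window."""
    return long_window_hours / 12


def alert_active(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Alert while BOTH windows exceed the threshold (assumed semantics).

    long_burn:  burn rate averaged over the long window (detection).
    short_burn: burn rate averaged over the short window (recovery check).
    """
    return long_burn > threshold and short_burn > threshold


if __name__ == "__main__":
    print(short_window_hours(72))        # short window for a 3-day long window
    print(alert_active(2.0, 2.0, 1.0))   # fault ongoing
    print(alert_active(1.5, 0.0, 1.0))   # fault fixed, long window still elevated
```

In the last call the long window still reports an elevated burn rate, but the short window has recovered, so the alert is no longer active.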
Alert severity levels (30-day SLO example)
| Severity | Condition | Burn rate | Response |
|---|---|---|---|
| Page-level | 2% of error budget consumed in 1 hour | 14.4x | Immediate action required |
| Page-level | 5% of error budget consumed in 6 hours | 6x | Immediate action required |
| Ticket-level | 10% of error budget consumed in 1 day | 3x | Track via ticket |
| Ticket-level | 10% of error budget consumed in 3 days | 1x (threshold) | Track via ticket |
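The burn rate column follows from the condition column: consuming a fraction of the budget in a given window corresponds to a burn rate of (fraction x compliance period / window). A short sketch of that relationship, with a hypothetical function name:

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        compliance_days: float = 30) -> float:
    """Burn rate at which budget_fraction of the error budget is
    consumed within window_hours."""
    return budget_fraction * (compliance_days * 24) / window_hours


if __name__ == "__main__":
    # (fraction of budget, window in hours) for each row of the table above
    rules = [(0.02, 1), (0.05, 6), (0.10, 24), (0.10, 72)]
    for fraction, hours in rules:
        print(f"{fraction:.0%} in {hours} h -> "
              f"{burn_rate_threshold(fraction, hours):.1f}x")
```

Evaluating the four rows gives approximately 14.4, 6, 3, and 1, in agreement with the table.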
Short window in practice
Suppose the error rate stays at twice the threshold for 3 days under the 3-day ticket-level rule, and the fault is fixed on day 3:
With a short window: The alert clears within ~6 hours of the fix (1/12 of the 3-day long window), because the short window detects the recovery.
Without a short window: The alert can persist for up to 3 more days, until the errors from the faulty period roll out of the long window, even though the fault no longer exists.