Standard load balancing algorithms like round robin and least request distribute traffic based on static rules -- they have no visibility into how each pod is actually performing. If one pod responds slowly due to resource contention, a cold cache, or noisy neighbors on the host, these algorithms keep routing traffic to it, driving up overall latency and error rates.
Alibaba Cloud Service Mesh (ASM) 1.21 introduces peak EWMA (peak Exponentially Weighted Moving Average), a load balancing algorithm that tracks each pod's response time in real time and automatically shifts traffic away from slow pods toward faster ones. This significantly reduces tail latency (P90, P95, and P99).
How peak EWMA works
Peak EWMA scores each backend pod by combining static weights, observed latencies, and error rates into a moving average. Unlike a simple average, EWMA gives more weight to recent observations, so the algorithm reacts quickly to latency spikes. The "peak" variant specifically tracks worst-case latency, making it effective for burst traffic scenarios where some pods temporarily degrade.
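The scoring idea described above can be sketched in a few lines of Python. This is an illustrative sketch of the general peak-EWMA technique (as popularized by latency-aware balancers such as Finagle's), not ASM's or Envoy's exact implementation; the class name, decay constant, and update rule are assumptions for demonstration.

```python
import math
import time

class PeakEwma:
    """Illustrative peak-EWMA latency tracker (not ASM's exact implementation).

    A latency spike is adopted immediately ("peak" behavior), while recovery
    toward lower latencies is smoothed exponentially over time, so the balancer
    backs off a slow pod quickly but returns traffic to it gradually.
    """

    def __init__(self, decay_seconds=10.0):
        self.decay = decay_seconds          # time constant for downward decay
        self.cost = 0.0                     # current smoothed latency estimate (ms)
        self.last_update = time.monotonic()

    def observe(self, latency_ms, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(now - self.last_update, 0.0)
        self.last_update = now
        if latency_ms > self.cost:
            # "Peak" behavior: a worse-than-current latency is adopted at once.
            self.cost = latency_ms
        else:
            # Otherwise decay exponentially toward the new, lower observation.
            w = math.exp(-elapsed / self.decay)
            self.cost = self.cost * w + latency_ms * (1.0 - w)
        return self.cost
```

A balancer using this tracker would keep one `PeakEwma` per pod and route each request to the pod with the lowest current cost, so a pod whose latency spikes immediately receives less traffic.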
When to use peak EWMA
| Algorithm | Best for | Limitations |
|---|---|---|
| ROUND_ROBIN | Uniform workloads with similar pod performance | Ignores pod health and latency |
| LEAST_REQUEST | General-purpose load balancing (ASM default) | Does not account for response time differences |
| RANDOM | Simple, stateless distribution | No intelligence about pod performance |
| PEAK_EWMA | Workloads with uneven pod latency or burst traffic | Requires ASM 1.21 or later |
Choose PEAK_EWMA when:
- Backend pods have inconsistent response times (for example, due to resource contention or heterogeneous hardware).
- Your service handles burst traffic that can temporarily overload individual pods.
- Tail latency (P95/P99) matters more than average latency for your use case.
Prerequisites
Before you begin, make sure that you have:
- An ASM instance of version 1.21 or later. See Create an ASM instance.
- A Container Service for Kubernetes (ACK) cluster added to the ASM instance. See Add a cluster to an ASM instance and Update an ASM instance.
- A kubectl client connected to the ACK cluster. See Connect to a cluster by using kubectl.
Configure peak EWMA
Apply a DestinationRule that sets the PEAK_EWMA load balancing algorithm for your target service.
Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.
On the Mesh Management page, click the name of your ASM instance. In the left-side navigation pane, choose Traffic Management Center > DestinationRule. Click Create from YAML.
Paste the following YAML and click Create. Replace `simple-server` and `simple-server.default.svc.cluster.local` with your actual service name and host.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: simple-server
  namespace: default
spec:
  host: simple-server.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: PEAK_EWMA # Latency-aware load balancing
```
Example: Measure the latency improvement
This example demonstrates how peak EWMA reduces tail latency when backend pods have uneven response times. The setup uses two deployments behind a single service:
- simple-server-normal: responds in 50–100 ms (healthy pod).
- simple-server-high-latency: responds in 500–2000 ms (simulates a degraded pod).
A sleep pod acts as the client, sending 100 requests to measure latency with and without peak EWMA.
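Before running the test, it helps to see why an even traffic split inflates tail latency in this setup. The following Python simulation is illustrative only: the latency ranges come from the deployments above, but the 97/3 latency-aware split is an assumed figure chosen to show the effect, not a measured ASM ratio.

```python
import random

random.seed(7)

def normal_pod():
    # Healthy pod: uniform 50–100 ms, matching simple-server-normal.
    return random.uniform(50, 100)

def slow_pod():
    # Degraded pod: uniform 500–2000 ms, matching simple-server-high-latency.
    return random.uniform(500, 2000)

def percentile(samples, p):
    s = sorted(samples)
    return s[min(int(len(s) * p / 100), len(s) - 1)]

# Even 50/50 split: roughly what a latency-unaware balancer does here.
even = [normal_pod() if i % 2 == 0 else slow_pod() for i in range(100)]

# Latency-aware split: assume the balancer shifts ~97% of traffic to the fast pod.
aware = [slow_pod() if i < 3 else normal_pod() for i in range(100)]

p95_even = percentile(even, 95)
p95_aware = percentile(aware, 95)
print(f"P95 even split: {p95_even:.0f} ms, P95 latency-aware: {p95_aware:.0f} ms")
```

With an even split, half of all requests land on the slow pod, so the P95 is always a slow-pod sample; once most traffic avoids that pod, the P95 drops into the healthy pod's range.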
Step 1: Enable metric monitoring
Enable metric monitoring for the ASM instance to observe latency changes before and after you enable peak EWMA. See Collect metrics to Managed Service for Prometheus.
Step 2: Deploy the test environment
Connect to the ACK cluster with kubectl, then create a `sleep.yaml` file with the following content:

Deploy the sleep application:

```shell
kubectl apply -f sleep.yaml
```

Create a `simple.yaml` file with the following content:

Deploy both server deployments:

```shell
kubectl apply -f simple.yaml
```
Step 3: Run a baseline test with the default algorithm
The default LEAST_REQUEST algorithm serves as the baseline. Send 100 requests from the sleep pod to the simple-server service:
```shell
kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 100); do time curl simple-server:8080/hello; echo "request $i done"; done'
```

Expected output (abbreviated):

```
hello
this is port: 8080
real    0m 0.06s
user    0m 0.00s
sys     0m 0.00s
request 1 done
hello
this is port: 8080
real    0m 0.09s
...
hello
this is port: 8080
real    0m 1.72s
user    0m 0.00s
sys     0m 0.00s
request 100 done
```

After the test completes, view the latency metrics:
On the Mesh Management page, click your ASM instance name. In the left-side navigation pane, choose Observability Management Center > Monitoring metrics.
Click the Cloud ASM Istio Service tab and set the following filters:
| Filter | Value |
|---|---|
| Namespace | default |
| Service | simple-server.default.svc.cluster.local |
| Reporter | destination |
| Client Workload Namespace | default |
| Client Workload | sleep |
| Service Workload Namespace | default |
| Service Workload | simple-server-normal + simple-server-high-latency |

Click Client Workloads to view the Incoming Request Duration By Source panel.

The baseline results show a P50 latency of 87.5 ms and a P95 latency of 2.05 s. Because LEAST_REQUEST distributes traffic evenly, the slow pod inflates overall tail latency.
These test results are from a controlled environment. Actual results vary depending on your workload and infrastructure.
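As an optional cross-check of the console panels, you can compute percentiles locally from the curl loop output. The sketch below is an assumption-laden helper, not part of the ASM tooling: the regex assumes the BusyBox `time` format that the sleep image emits (for example, `real 0m 1.72s`).

```python
import re

def parse_real_seconds(output):
    """Extract 'real 0m 1.72s' durations (BusyBox time format) as seconds."""
    times = []
    for m, s in re.findall(r"real\s+(\d+)m\s+([\d.]+)s", output):
        times.append(int(m) * 60 + float(s))
    return times

def percentile(samples, p):
    # Nearest-rank percentile over the sorted samples.
    s = sorted(samples)
    idx = min(int(round(p / 100 * (len(s) - 1))), len(s) - 1)
    return s[idx]

# Feed in the captured output of the kubectl exec loop, e.g. via a saved file.
sample = "real 0m 0.06s\nreal 0m 0.09s\nreal 0m 1.72s\n"
lat = parse_real_seconds(sample)
print(f"P50: {percentile(lat, 50)} s")
```

In practice you would redirect the kubectl output to a file and pass its contents to `parse_real_seconds`.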
Step 4: Enable peak EWMA and rerun the test
Create a DestinationRule to enable peak EWMA for the simple-server service. Follow the steps in Configure peak EWMA using the following YAML:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: simple-server
  namespace: default
spec:
  host: simple-server.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: PEAK_EWMA
```

Rerun the same test:

```shell
kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 100); do time curl simple-server:8080/hello; echo "request $i done"; done'
```

View the latency metrics using the same filters as Step 3.

The P90, P95, and P99 latencies drop significantly. Peak EWMA detects that the high-latency pod responds slowly and reduces its traffic share, routing most requests to the faster pod. Overall request latency for the service decreases substantially.