
Alibaba Cloud Service Mesh:Load balancing based on workload latency using EWMA

Last Updated: Mar 11, 2026

Standard load balancing algorithms like round robin and least request distribute traffic based on static rules -- they have no visibility into how each pod is actually performing. If one pod responds slowly due to resource contention, a cold cache, or noisy neighbors on the host, these algorithms keep routing traffic to it, driving up overall latency and error rates.

Service Mesh (ASM) version 1.21 introduces peak EWMA (peak Exponentially Weighted Moving Average), a load balancing algorithm that tracks each pod's response time in real time and automatically shifts traffic away from slow pods toward faster ones. This significantly reduces tail latency (P90, P95, P99).

How peak EWMA works

Peak EWMA scores each backend pod by combining static weights, observed latencies, and error rates into a moving average. Unlike a simple average, EWMA gives more weight to recent observations, so the algorithm reacts quickly to latency spikes. The "peak" variant specifically tracks worst-case latency, making it effective for burst traffic scenarios where some pods temporarily degrade.
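ASM does not publish its exact scoring formula in this topic, but the general peak-EWMA idea described above can be sketched in a few lines. The class names, the decay constant, and the update rule below are illustrative assumptions, not ASM's implementation:

```python
import math
import time

# Illustrative sketch of peak-EWMA scoring; the names and the decay
# constant are assumptions, not ASM's actual implementation.
DECAY_SECONDS = 10.0  # horizon over which old observations fade

class Backend:
    def __init__(self, name):
        self.name = name
        self.cost = 0.0                     # current latency score in ms
        self.last_update = time.monotonic()

    def observe(self, latency_ms):
        """Fold one observed response time into the score."""
        now = time.monotonic()
        dt = now - self.last_update
        self.last_update = now
        if latency_ms > self.cost:
            # "Peak" behavior: a latency spike is adopted immediately,
            # so a degrading pod loses traffic right away.
            self.cost = latency_ms
        else:
            # Otherwise decay the old score toward the new sample; the
            # longer since the last update, the less the old score counts.
            w = math.exp(-dt / DECAY_SECONDS)
            self.cost = self.cost * w + latency_ms * (1.0 - w)

def pick(backends):
    # Route the next request to the backend with the lowest score.
    return min(backends, key=lambda b: b.cost)
```

Production balancers additionally weigh in-flight request counts and endpoint weights; the sketch only shows the latency-tracking core.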

When to use peak EWMA

Algorithm     | Best for                                           | Limitations
ROUND_ROBIN   | Uniform workloads with similar pod performance     | Ignores pod health and latency
LEAST_REQUEST | General-purpose load balancing (ASM default)       | Does not account for response time differences
RANDOM        | Simple, stateless distribution                     | No intelligence about pod performance
PEAK_EWMA     | Workloads with uneven pod latency or burst traffic | Requires ASM 1.21 or later

Choose PEAK_EWMA when:

  • Backend pods have inconsistent response times (for example, due to resource contention or heterogeneous hardware).

  • Your service handles burst traffic that can temporarily overload individual pods.

  • Tail latency (P95/P99) matters more than average latency for your use case.

Prerequisites

Before you begin, make sure that you have:

  • An ASM instance of version 1.21 or later. Peak EWMA is not available in earlier versions.

  • A Container Service for Kubernetes (ACK) cluster that is added to the ASM instance, with automatic sidecar injection enabled for the default namespace.

  • A kubectl client that is connected to the ACK cluster.

Configure peak EWMA

Apply a DestinationRule that sets the PEAK_EWMA load balancing algorithm for your target service.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of your ASM instance. In the left-side navigation pane, choose Traffic Management Center > DestinationRule. Click Create from YAML.

  3. Paste the following YAML and click Create. Replace simple-server and simple-server.default.svc.cluster.local with your actual service name and host.

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: simple-server
      namespace: default
    spec:
      host: simple-server.default.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          simple: PEAK_EWMA  # Latency-aware load balancing

Example: Measure the latency improvement

This example demonstrates how peak EWMA reduces tail latency when backend pods have uneven response times. The setup uses two deployments behind a single service:

  • simple-server-normal: responds in 50--100 ms (healthy pod).

  • simple-server-high-latency: responds in 500--2000 ms (simulates a degraded pod).

A sleep pod acts as the client, sending 100 requests to measure latency with and without peak EWMA.
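The expected effect can be previewed without a cluster. The following toy simulation routes 100 requests to the two pods, first with an even split and then with a simple lowest-EWMA picker; all names and constants are illustrative, and this is not ASM's actual algorithm:

```python
import random

# Toy simulation of the test setup: one pod answers in 50-100 ms, one in
# 500-2000 ms. Compare an even split against a lowest-EWMA picker.
random.seed(7)  # deterministic for reproducibility

pods = {
    "normal": lambda: random.uniform(50, 100),    # healthy pod
    "slow": lambda: random.uniform(500, 2000),    # degraded pod
}

def p95(samples):
    return sorted(samples)[int(len(samples) * 0.95) - 1]  # nearest-rank P95

# Even split: requests alternate between the two pods.
names = list(pods)
even = [pods[names[i % 2]]() for i in range(100)]

# Latency-aware: keep an EWMA per pod and always pick the lower one.
ewma = {name: 0.0 for name in pods}
aware = []
for _ in range(100):
    target = min(ewma, key=ewma.get)  # pod with the best (lowest) score
    latency = pods[target]()
    # The first sample initializes the average; later samples blend in at 30%.
    ewma[target] = latency if ewma[target] == 0.0 else (0.7 * ewma[target] + 0.3 * latency)
    aware.append(latency)

print(f"even split    P95: {p95(even):7.1f} ms")
print(f"latency-aware P95: {p95(aware):7.1f} ms")
```

In this toy model the slow pod is probed only once and then starved; real implementations keep sending it a small share of traffic so that recovery is detected.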

Architecture diagram

Step 1: Enable metric monitoring

Enable metric monitoring for the ASM instance to observe latency changes before and after you enable peak EWMA. See Collect metrics to Managed Service for Prometheus.

Step 2: Deploy the test environment

  1. Connect to the ACK cluster with kubectl, then create a sleep.yaml file with the following content:

    Show sleep.yaml

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: curlimages/curl
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
    ---

    Deploy the sleep application:

    kubectl apply -f sleep.yaml
  2. Create a simple.yaml file with the following content:

    Show simple.yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: simple-server
      labels:
        app: simple-server
        service: simple-server
    spec:
      ports:
      - port: 8080
        name: http
      selector:
        app: simple-server
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: simple-server
      name: simple-server-normal
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: simple-server
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: simple-server
        spec:
          containers:
          - args:
            - --delayMin
            - "50"
            - --delayMax
            - "100"
            image: registry-cn-hangzhou.ack.aliyuncs.com/test-public/simple-server:v1.0.0.0-g88293ca-aliyun
            imagePullPolicy: IfNotPresent
            name: simple-server
            ports:
            - containerPort: 8080
              protocol: TCP
            resources:
              limits:
                cpu: 500m
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: simple-server
      name: simple-server-high-latency
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: simple-server
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: simple-server
        spec:
          containers:
          - args:
            - --delayMin
            - "500"
            - --delayMax
            - "2000"
            image: registry-cn-hangzhou.ack.aliyuncs.com/test-public/simple-server:v1.0.0.0-g88293ca-aliyun
            imagePullPolicy: IfNotPresent
            name: simple-server
            ports:
            - containerPort: 8080
              protocol: TCP
            resources:
              limits:
                cpu: 500m
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    ---

    Deploy both server deployments:

    kubectl apply -f simple.yaml

Step 3: Run a baseline test with the default algorithm

The default LEAST_REQUEST algorithm serves as the baseline. Send 100 requests from the sleep pod to the simple-server service:

kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 100); do time curl simple-server:8080/hello; echo "request $i done"; done'

Expected output (abbreviated):

hello
 this is port: 8080real 0m 0.06s
user    0m 0.00s
sys     0m 0.00s
request 1 done
hello
 this is port: 8080real 0m 0.09s
...
hello
 this is port: 8080real 0m 1.72s
user    0m 0.00s
sys     0m 0.00s
request 100 done

After the test completes, view the latency metrics:

  1. On the Mesh Management page, click your ASM instance name. In the left-side navigation pane, choose Observability Management Center > Monitoring metrics.

  2. Click the Cloud ASM Istio Service tab and set the following filters:

    Filter                     | Value
    Namespace                  | default
    Service                    | simple-server.default.svc.cluster.local
    Reporter                   | destination
    Client Workload Namespace  | default
    Client Workload            | sleep
    Service Workload Namespace | default
    Service Workload           | simple-server-normal + simple-server-high-latency
  3. Click Client Workloads to view the Incoming Request Duration By Source panel.

Baseline test results with LEAST_REQUEST

The baseline results show a P50 latency of 87.5 ms and a P95 latency of 2.05 s. Because LEAST_REQUEST distributes traffic evenly, the slow pod inflates overall tail latency.
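These numbers are consistent with an even 50/50 split: half of the samples come from the 500 to 2000 ms pod, so the 95th percentile lands deep inside the slow pod's range. A quick nearest-rank check with idealized, evenly spaced samples (illustrative values, not the measured data):

```python
# Nearest-rank percentiles for an idealized 50/50 split: 50 evenly spaced
# samples from each pod's latency range (illustrative, not measured data).
fast = [50 + i * 50 / 49 for i in range(50)]     # 50-100 ms pod
slow = [500 + i * 1500 / 49 for i in range(50)]  # 500-2000 ms pod
samples = sorted(fast + slow)

def percentile(values, p):
    return values[max(0, int(len(values) * p / 100) - 1)]

print(f"P50: {percentile(samples, 50):6.1f} ms")  # at the fast pod's ceiling
print(f"P95: {percentile(samples, 95):6.1f} ms")  # deep in the slow pod's range
```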

Important

These test results are from a controlled environment. Actual results vary depending on your workload and infrastructure.

Step 4: Enable peak EWMA and rerun the test

  1. Create a DestinationRule to enable peak EWMA for the simple-server service. Follow the steps in Configure peak EWMA using the following YAML:

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: simple-server
      namespace: default
    spec:
      host: simple-server.default.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          simple: PEAK_EWMA
  2. Rerun the same test:

    kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 100); do time curl simple-server:8080/hello; echo "request $i done"; done'
  3. View the latency metrics using the same filters as Step 3.

Test results with PEAK_EWMA

The P90, P95, and P99 latencies drop significantly. Peak EWMA detects that the high-latency pod responds slowly and reduces its traffic share, routing most requests to the faster pod. Overall request latency for the service decreases substantially.
