
Alibaba Cloud Service Mesh:Load balancing based on workload latency using EWMA

Last Updated: Mar 11, 2026

Standard load balancing algorithms like round robin and least request distribute traffic based on static rules -- they have no visibility into how each pod is actually performing. If one pod responds slowly due to resource contention, a cold cache, or noisy neighbors on the host, these algorithms keep routing traffic to it, driving up overall latency and error rates.

Service Mesh (ASM) version 1.21 introduces peak EWMA (peak Exponentially Weighted Moving Average), a load balancing algorithm that tracks each pod's response time in real time and automatically shifts traffic away from slow pods toward faster ones. This significantly reduces tail latency (P90, P95, P99).

How peak EWMA works

Peak EWMA scores each backend pod by combining static weights, observed latencies, and error rates into a moving average. Unlike a simple average, EWMA gives more weight to recent observations, so the algorithm reacts quickly to latency spikes. The "peak" variant specifically tracks worst-case latency, making it effective for burst traffic scenarios where some pods temporarily degrade.
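ASM does not publish its exact scoring formula in this topic, but the general peak-EWMA idea described above can be sketched in a few lines. The class names, the decay constant, and the update rule below are illustrative assumptions, not ASM's implementation:

```python
import math
import time

# Illustrative sketch of peak-EWMA scoring; the names and the decay
# constant are assumptions, not ASM's actual implementation.
DECAY_SECONDS = 10.0  # horizon over which old observations fade

class Backend:
    def __init__(self, name):
        self.name = name
        self.cost = 0.0                     # current latency score in ms
        self.last_update = time.monotonic()

    def observe(self, latency_ms):
        """Fold one observed response time into the score."""
        now = time.monotonic()
        dt = now - self.last_update
        self.last_update = now
        if latency_ms > self.cost:
            # "Peak" behavior: a latency spike is adopted immediately,
            # so a degrading pod loses traffic right away.
            self.cost = latency_ms
        else:
            # Otherwise decay the old score toward the new sample; the
            # longer since the last update, the less the old score counts.
            w = math.exp(-dt / DECAY_SECONDS)
            self.cost = self.cost * w + latency_ms * (1.0 - w)

def pick(backends):
    # Route the next request to the backend with the lowest score.
    return min(backends, key=lambda b: b.cost)
```

Production balancers additionally weigh in-flight request counts and endpoint weights; the sketch only shows the latency-tracking core.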

When to use peak EWMA

Algorithm     | Best for                                           | Limitations
ROUND_ROBIN   | Uniform workloads with similar pod performance     | Ignores pod health and latency
LEAST_REQUEST | General-purpose load balancing (ASM default)       | Does not account for response time differences
RANDOM        | Simple, stateless distribution                     | No intelligence about pod performance
PEAK_EWMA     | Workloads with uneven pod latency or burst traffic | Requires ASM 1.21 or later

Choose PEAK_EWMA when:

  • Backend pods have inconsistent response times (for example, due to resource contention or heterogeneous hardware).

  • Your service handles burst traffic that can temporarily overload individual pods.

  • Tail latency (P95/P99) matters more than average latency for your use case.

Prerequisites

Before you begin, make sure that you have:

  • An ASM instance of version 1.21 or later. Peak EWMA is not available in earlier versions.

  • A Container Service for Kubernetes (ACK) cluster that is added to the ASM instance, with automatic sidecar injection enabled for the default namespace.

  • A kubectl client that is connected to the ACK cluster.

Configure peak EWMA

Apply a DestinationRule that sets the PEAK_EWMA load balancing algorithm for your target service.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of your ASM instance. In the left-side navigation pane, choose Traffic Management Center > DestinationRule. Click Create from YAML.

  3. Paste the following YAML and click Create. Replace simple-server and simple-server.default.svc.cluster.local with your actual service name and host.

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: simple-server
      namespace: default
    spec:
      host: simple-server.default.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          simple: PEAK_EWMA  # Latency-aware load balancing

Example: Measure the latency improvement

This example demonstrates how peak EWMA reduces tail latency when backend pods have uneven response times. The setup uses two deployments behind a single service:

  • simple-server-normal: responds in 50--100 ms (healthy pod).

  • simple-server-high-latency: responds in 500--2000 ms (simulates a degraded pod).

A sleep pod acts as the client, sending 100 requests to measure latency with and without peak EWMA.
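The expected effect can be previewed without a cluster. The following toy simulation routes 100 requests to the two pods, first with an even split and then with a simple lowest-EWMA picker; all names and constants are illustrative, and this is not ASM's actual algorithm:

```python
import random

# Toy simulation of the test setup: one pod answers in 50-100 ms, one in
# 500-2000 ms. Compare an even split against a lowest-EWMA picker.
random.seed(7)  # deterministic for reproducibility

pods = {
    "normal": lambda: random.uniform(50, 100),    # healthy pod
    "slow": lambda: random.uniform(500, 2000),    # degraded pod
}

def p95(samples):
    return sorted(samples)[int(len(samples) * 0.95) - 1]  # nearest-rank P95

# Even split: requests alternate between the two pods.
names = list(pods)
even = [pods[names[i % 2]]() for i in range(100)]

# Latency-aware: keep an EWMA per pod and always pick the lower one.
ewma = {name: 0.0 for name in pods}
aware = []
for _ in range(100):
    target = min(ewma, key=ewma.get)  # pod with the best (lowest) score
    latency = pods[target]()
    # The first sample initializes the average; later samples blend in at 30%.
    ewma[target] = latency if ewma[target] == 0.0 else (0.7 * ewma[target] + 0.3 * latency)
    aware.append(latency)

print(f"even split    P95: {p95(even):7.1f} ms")
print(f"latency-aware P95: {p95(aware):7.1f} ms")
```

In this toy model the slow pod is probed only once and then starved; real implementations keep sending it a small share of traffic so that recovery is detected.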

Architecture diagram

Step 1: Enable metric monitoring

Enable metric monitoring for the ASM instance to observe latency changes before and after you enable peak EWMA. See Collect metrics to Managed Service for Prometheus.

Step 2: Deploy the test environment

  1. Connect to the ACK cluster with kubectl, then create a sleep.yaml file with the following content:

    Show sleep.yaml

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sleep
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: sleep
      labels:
        app: sleep
        service: sleep
    spec:
      ports:
      - port: 80
        name: http
      selector:
        app: sleep
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sleep
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sleep
      template:
        metadata:
          labels:
            app: sleep
        spec:
          terminationGracePeriodSeconds: 0
          serviceAccountName: sleep
          containers:
          - name: sleep
            image: curlimages/curl
            command: ["/bin/sleep", "infinity"]
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /etc/sleep/tls
              name: secret-volume
          volumes:
          - name: secret-volume
            secret:
              secretName: sleep-secret
              optional: true
    ---

    Deploy the sleep application:

    kubectl apply -f sleep.yaml
  2. Create a simple.yaml file with the following content:

    Show simple.yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: simple-server
      labels:
        app: simple-server
        service: simple-server
    spec:
      ports:
      - port: 8080
        name: http
      selector:
        app: simple-server
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: simple-server
      name: simple-server-normal
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: simple-server
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: simple-server
        spec:
          containers:
          - args:
            - --delayMin
            - "50"
            - --delayMax
            - "100"
            image: registry-cn-hangzhou.ack.aliyuncs.com/test-public/simple-server:v1.0.0.0-g88293ca-aliyun
            imagePullPolicy: IfNotPresent
            name: simple-server
            ports:
            - containerPort: 8080
              protocol: TCP
            resources:
              limits:
                cpu: 500m
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: simple-server
      name: simple-server-high-latency
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: simple-server
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: simple-server
        spec:
          containers:
          - args:
            - --delayMin
            - "500"
            - --delayMax
            - "2000"
            image: registry-cn-hangzhou.ack.aliyuncs.com/test-public/simple-server:v1.0.0.0-g88293ca-aliyun
            imagePullPolicy: IfNotPresent
            name: simple-server
            ports:
            - containerPort: 8080
              protocol: TCP
            resources:
              limits:
                cpu: 500m
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    ---

    Deploy both server deployments:

    kubectl apply -f simple.yaml

Step 3: Run a baseline test with the default algorithm

The default LEAST_REQUEST algorithm serves as the baseline. Send 100 requests from the sleep pod to the simple-server service:

kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 100); do time curl simple-server:8080/hello; echo "request $i done"; done'

Expected output (abbreviated):

hello
 this is port: 8080real 0m 0.06s
user    0m 0.00s
sys     0m 0.00s
request 1 done
hello
 this is port: 8080real 0m 0.09s
...
hello
 this is port: 8080real 0m 1.72s
user    0m 0.00s
sys     0m 0.00s
request 100 done

After the test completes, view the latency metrics:

  1. On the Mesh Management page, click your ASM instance name. In the left-side navigation pane, choose Observability Management Center > Monitoring metrics.

  2. Click the Cloud ASM Istio Service tab and set the following filters:

    Filter                     | Value
    Namespace                  | default
    Service                    | simple-server.default.svc.cluster.local
    Reporter                   | destination
    Client Workload Namespace  | default
    Client Workload            | sleep
    Service Workload Namespace | default
    Service Workload           | simple-server-normal + simple-server-high-latency
  3. Click Client Workloads to view the Incoming Request Duration By Source panel.

Baseline test results with LEAST_REQUEST

The baseline results show a P50 latency of 87.5 ms and a P95 latency of 2.05 s. Because LEAST_REQUEST distributes traffic evenly, the slow pod inflates overall tail latency.
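These numbers are consistent with an even 50/50 split: half of the samples come from the 500 to 2000 ms pod, so the 95th percentile lands deep inside the slow pod's range. A quick nearest-rank check with idealized, evenly spaced samples (illustrative values, not the measured data):

```python
# Nearest-rank percentiles for an idealized 50/50 split: 50 evenly spaced
# samples from each pod's latency range (illustrative, not measured data).
fast = [50 + i * 50 / 49 for i in range(50)]     # 50-100 ms pod
slow = [500 + i * 1500 / 49 for i in range(50)]  # 500-2000 ms pod
samples = sorted(fast + slow)

def percentile(values, p):
    return values[max(0, int(len(values) * p / 100) - 1)]

print(f"P50: {percentile(samples, 50):6.1f} ms")  # at the fast pod's ceiling
print(f"P95: {percentile(samples, 95):6.1f} ms")  # deep in the slow pod's range
```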

Important

These test results are from a controlled environment. Actual results vary depending on your workload and infrastructure.

Step 4: Enable peak EWMA and rerun the test

  1. Create a DestinationRule to enable peak EWMA for the simple-server service. Follow the steps in Configure peak EWMA using the following YAML:

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: simple-server
      namespace: default
    spec:
      host: simple-server.default.svc.cluster.local
      trafficPolicy:
        loadBalancer:
          simple: PEAK_EWMA
  2. Rerun the same test:

    kubectl exec -it deploy/sleep -c sleep -- sh -c 'for i in $(seq 1 100); do time curl simple-server:8080/hello; echo "request $i done"; done'
  3. View the latency metrics using the same filters as Step 3.

Test results with PEAK_EWMA

The P90, P95, and P99 latencies drop significantly. Peak EWMA detects that the high-latency pod responds slowly and reduces its traffic share, routing most requests to the faster pod. Overall request latency for the service decreases substantially.
