
Alibaba Cloud Service Mesh:Configure an outlier detection policy

Last Updated: Mar 11, 2026

Outlier detection is a form of passive health checking. The sidecar proxy tracks error responses from each endpoint of a service and automatically ejects endpoints that exceed an error threshold from the load balancing pool. This prevents cascading failures by stopping traffic to unhealthy endpoints for a configurable duration.

In Service Mesh (ASM), you configure outlier detection through a DestinationRule.

How it works

The sidecar proxy monitors HTTP 5xx responses from each endpoint and ejects endpoints that exceed the error threshold:

  1. The proxy counts consecutive 5xx responses from each endpoint and runs an ejection analysis sweep once every interval.

  2. When an endpoint's consecutive error count reaches the consecutiveErrors threshold, the proxy marks it as unhealthy and removes it from the load balancing pool.

  3. The ejected endpoint stops receiving traffic for a base duration (baseEjectionTime). After this period, the endpoint returns to the pool.

  4. If the same endpoint is ejected again, the ejection duration increases linearly: the second ejection lasts 2x the base duration, the third lasts 3x, and so on.

Before ejecting an endpoint, the proxy checks whether the ejection would push the percentage of ejected endpoints above maxEjectionPercent. If it would, the endpoint stays in the pool. Set maxEjectionPercent to 100 to allow every endpoint to be ejected, but be aware that this can make the entire service unavailable if all endpoints are unhealthy.

Parameters

  • consecutiveErrors: Number of consecutive 5xx errors that triggers ejection of an endpoint. (Recent Istio releases deprecate this field in favor of consecutive5xxErrors; the behavior described here is the same.)

  • interval: Time between ejection analysis sweeps.

  • baseEjectionTime: Minimum duration an ejected endpoint remains out of the pool. The actual ejection time equals this value multiplied by the number of times the endpoint has been ejected.

  • maxEjectionPercent: Maximum percentage of endpoints that can be ejected at the same time. Setting this to 100 means all endpoints can be ejected simultaneously, which may cause the entire service to become unavailable.

Parameter interactions

  • consecutiveErrors and maxEjectionPercent: When maxEjectionPercent is 100 and only one endpoint exists, a single error makes the entire service unavailable. Use a value below 100 to always keep some endpoints available.

  • interval and consecutiveErrors: A short interval with a low consecutiveErrors threshold increases ejection sensitivity. This is useful for testing but can cause instability in production due to transient errors.

  • baseEjectionTime and repeated ejections: Repeated ejections result in progressively longer ejection periods. The fifth ejection of the same endpoint lasts 5x the baseEjectionTime.
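The interactions above can be traced on a concrete configuration. The following sketch uses hypothetical values (not taken from this guide) for a service with four endpoints; the comments spell out the resulting behavior:

```yaml
# Hypothetical policy for a service with 4 endpoints (illustrative values only).
trafficPolicy:
  outlierDetection:
    consecutiveErrors: 5     # eject an endpoint after 5 consecutive 5xx responses
    interval: 10s            # run the ejection analysis sweep every 10 seconds
    baseEjectionTime: 30s    # 1st ejection lasts 30s, 2nd lasts 60s, 3rd lasts 90s
    maxEjectionPercent: 50   # at most 2 of the 4 endpoints ejected at any one time
```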

Prerequisites

Complete the preparations and deploy the HTTPBin and sleep services. For details, see Preparations.

Create a DestinationRule for outlier detection

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Traffic Management Center > DestinationRule.

  3. On the DestinationRule page, click Create. Configure the following parameters, and click Preview. Confirm that the YAML content is correct, click Submit, and then click Create.

    DestinationRule creation page

    The following YAML defines a DestinationRule that ejects any endpoint of the httpbin service after a single error, checks every 1 second, and keeps the endpoint ejected for 15 seconds:

       apiVersion: networking.istio.io/v1beta1
       kind: DestinationRule
       metadata:
         name: httpbin
         namespace: legacy
       spec:
         host: httpbin.legacy.svc.cluster.local
         trafficPolicy:
           outlierDetection:
             consecutiveErrors: 1
             interval: 1s
             baseEjectionTime: 15s
             maxEjectionPercent: 100
    This example uses aggressive values (consecutiveErrors: 1, interval: 1s) to make the policy easy to verify. For production deployments, use higher thresholds and longer intervals to avoid instability from transient errors.
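A less aggressive policy for production might look like the following sketch; the values are illustrative assumptions, not recommendations from this guide. Keep the same host and metadata as above and change only the trafficPolicy:

```yaml
trafficPolicy:
  outlierDetection:
    consecutiveErrors: 5      # tolerate brief error bursts before ejecting
    interval: 30s             # sweep far less often than the 1s used for testing
    baseEjectionTime: 30s     # ejected endpoints stay out longer
    maxEjectionPercent: 50    # never eject more than half the endpoints
```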

Verify the policy

Confirm that outlier detection ejects unhealthy endpoints as expected.

  1. Connect to the Container Service for Kubernetes (ACK) cluster using kubectl with your kubeconfig file, and send a request that triggers a 502 error:

       kubectl -n legacy exec -it deploy/sleep -- curl httpbin.legacy:8000/status/502 -I

    Expected output:

       HTTP/1.1 502 Bad Gateway
       server: envoy
       date: xxx, xx xxx 202x xx:xx:xx GMT
       content-type: text/html; charset=utf-8
       access-control-allow-origin: *
       access-control-allow-credentials: true
       content-length: 0
       x-envoy-upstream-service-time: 4

    This 502 response comes from the HTTPBin service itself. Because consecutiveErrors is set to 1, this single error triggers ejection of the endpoint.

  2. Within 15 seconds, send the same request again:

       kubectl -n legacy exec -it deploy/sleep -- curl httpbin.legacy:8000/status/502 -I

    Expected output:

       HTTP/1.1 503 Service Unavailable
       content-length: 19
       content-type: text/plain
       date: xxx, xx xxx 202x xx:xx:xx GMT
       server: envoy

    The response is now 503 Service Unavailable instead of 502 Bad Gateway, returned by the sidecar proxy rather than by HTTPBin. This confirms that the outlier detection policy is working: the proxy has ejected every endpoint of the HTTPBin service and returns 503 because no healthy endpoints remain in the load balancing pool.