
Alibaba Cloud Service Mesh: Troubleshoot HTTP 503 status codes in ASM

Last Updated: Mar 11, 2026

HTTP 503 errors in Service Mesh (ASM) typically result from connection lifecycle mismatches, configuration changes, or traffic interception issues. This guide covers each scenario with its root cause, diagnostic signals, and solution.

Identify the root cause

Check the Envoy access logs of the affected pod to find the response flag. The flag indicates why the request failed:

kubectl logs <pod-name> -c istio-proxy -n <namespace>
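
To isolate the failed requests, pipe the log through grep. With the default text access log format, the response code and response flag appear next to each other, so a pattern such as the one below surfaces the UC failures. Adjust the pattern if your mesh uses a customized or JSON access log format:

kubectl logs <pod-name> -c istio-proxy -n <namespace> | grep '503 UC'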

Match the response flag to a scenario:

Response flag | Symptom                                                    | Scenario
UC            | Intermittent 503 under normal traffic, no config changes  | Idle connection timeout mismatch
UC            | Brief 503 spike immediately after changing custom metrics | Metric customization config change
N/A           | Persistent 503 after enabling mTLS, health checks fail    | Health check failure with mTLS
N/A           | All requests to a specific service return 503             | Application listening on localhost

Intermittent 503 scenarios

Brief 503 spike after a metric customization config change

A small number of requests return HTTP 503 immediately after you update custom metric configurations.

Root cause

The metric customization feature generates an Envoy filter that updates the istio.stats configuration. This update is delivered through the Listener Discovery Service (LDS), which modifies the Envoy Listener. When the listener configuration changes, existing connections are terminated and any in-transit requests on those connections receive a 503 response.

The 503 is not sent by the upstream server. The client-side sidecar proxy generates it in response to the upstream connection reset.
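
To confirm that a listener update coincided with the 503 spike, you can query the proxy's Envoy admin stats through pilot-agent and inspect the standard listener-manager LDS counters. This is a diagnostic sketch; an update_success counter that increments at the time of the spike indicates a listener push:

kubectl exec <pod-name> -c istio-proxy -n <namespace> -- pilot-agent request GET stats | grep 'listener_manager.lds'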

Why the default retry policy does not help

The default sidecar proxy retry policy covers these conditions:

"retry_policy": {
    "retry_on": "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes",
    "num_retries": 2,
    "retry_host_predicate": [
     {
      "name": "envoy.retry_host_predicates.previous_hosts"
     }
    ],
    "host_selection_retry_max_attempts": "5",
    "retriable_status_codes": [
     503
    ]
}

Condition              | Trigger
connect-failure        | Connection failure (connect timeout)
refused-stream         | HTTP/2 REFUSED_STREAM error
unavailable            | gRPC unavailable status
cancelled              | gRPC cancelled status
retriable-status-codes | Response status code matches a code in retriable_status_codes (503 by default)

The reset condition -- which covers upstream disconnects and connection resets -- is not included. That is the condition this scenario triggers.

Solution: add reset to the retry policy

Add reset (and optionally 503) to the retryOn field in a VirtualService for the affected service.

The following example configures retries for the Ratings service:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-route
spec:
  hosts:
  - ratings.prod.svc.cluster.local
  http:
  - route:
    - destination:
        host: ratings.prod.svc.cluster.local
        subset: v1
    retries:
      attempts: 2
      retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes,reset,503

Replace ratings.prod.svc.cluster.local and subset v1 with the host and subset of your target service.
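
To verify that the new policy reached the calling side, you can dump the route configuration of a client pod with istioctl and look for the retryOn value. This is an optional check; the output uses Envoy's JSON field names:

istioctl proxy-config route <client-pod-name> -n <namespace> -o json | grep retryOn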

For the full list of Envoy retry conditions, see the Envoy documentation on retry policies.

Intermittent 503 from idle connection timeout mismatch

HTTP 503 errors appear intermittently without any configuration changes, often increasing under higher traffic. The Envoy access log shows response flag UC (Upstream Connection termination). This typically affects inbound sidecar proxy traffic.

Root cause

The sidecar proxy and the application have different idle connection timeout values. The default idle connection timeout for the sidecar proxy is 1 hour.

When the proxy timeout is longer than the application timeout:

The application closes the idle connection first, but the sidecar proxy still considers the connection active. If a new request arrives on that connection, the proxy forwards it to a closed connection and returns HTTP 503 (response_flags=UC).

(Figure: Idle connection timeout mismatch - proxy timeout too long)

When the proxy timeout is shorter than the application timeout:

The proxy closes the connection first and creates a new one for the next request. No 503 error occurs in this case.

(Figure: Idle connection timeout mismatch - proxy timeout too short)

Solution 1: set idleTimeout in a DestinationRule

Align the idle timeout by setting idleTimeout in a DestinationRule. This setting applies to both inbound and outbound sidecar proxy traffic. It also works when the client does not have a sidecar proxy.

Set idleTimeout to a value slightly shorter than the application's idle timeout. A value that is too short increases the total number of connections.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: <your-service-idle-timeout>
spec:
  host: <your-service-host>
  trafficPolicy:
    connectionPool:
      tcp:
        idleTimeout: 30m

Replace <your-service-idle-timeout> and <your-service-host> with your service name and host. Adjust 30m based on your application's idle timeout -- set it slightly shorter than the application's value.
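
To confirm that the timeout was applied, you can inspect the Envoy cluster configuration of a proxy that calls the service. The value appears under the cluster's common HTTP protocol options; a case-insensitive grep keeps the check robust across field-name casings:

istioctl proxy-config cluster <pod-name> -n <namespace> --fqdn <your-service-host> -o json | grep -i idletimeout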

Solution 2: configure retries in a VirtualService

A retry triggers a new connection, which resolves the stale-connection problem. Follow the same retry configuration as in Brief 503 spike after a metric customization config change.

Important

Retrying a non-idempotent request (such as a POST) is risky: the retried request can execute the same operation twice. Evaluate carefully before enabling retries for these request types.

Intermittent 503 during pod restarts due to sidecar lifecycle misconfiguration

HTTP 503 errors appear briefly each time pods restart.

Root cause

The sidecar proxy container lifecycle is misconfigured. The proxy may shut down before the application finishes draining connections, or begin receiving traffic before the application is ready.

Solution

Configure the sidecar proxy container lifecycle to align with your application's startup and shutdown sequence. For details, see Sidecar proxy lifecycle.
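
As an illustrative sketch, Istio-compatible sidecars accept per-pod proxy settings through the proxy.istio.io/config annotation. The example below holds application startup until the proxy is ready and gives the proxy a longer drain window on shutdown; the values are assumptions to adapt to your application, and the ASM-specific options are described in the linked document:

  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          holdApplicationUntilProxyStarts: true
          terminationDrainDuration: 30s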

Persistent 503 scenarios

Application listening on localhost

All requests from other pods to a specific application return HTTP 503.

Root cause

The application binds to localhost (127.0.0.1) instead of 0.0.0.0. The sidecar proxy forwards traffic to the application's port, but the application rejects connections from non-loopback addresses.
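
To confirm the binding, list the listening sockets inside the application container (this assumes netstat is available in the image; ss -tlnp provides the same view):

kubectl exec <pod-name> -c <app-container> -n <namespace> -- netstat -tlnp

A local address of 127.0.0.1:<port> confirms the loopback-only binding; 0.0.0.0:<port> means the port is reachable from the sidecar proxy.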

Solution

Bind the application to 0.0.0.0 so that the sidecar proxy and other pods can reach it. For details, see Expose a cluster application that listens on localhost to other pods.

Health check failure after enabling mTLS

After sidecar injection, pod health checks (liveness and readiness probes) consistently fail, and an HTTP 503 status code is reported.

Root cause

When mutual TLS (mTLS) is enabled in ASM, the sidecar proxy intercepts all incoming traffic to the pod, including kubelet health check requests. Because kubelet lacks an Istio-issued TLS certificate, it cannot complete the mTLS handshake and every health check fails.

Solution

Exclude the health check port from sidecar traffic interception so that kubelet can reach the application directly. For details, see Why is no valid health check information displayed after sidecar injection?
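
As an illustrative sketch, Istio-compatible sidecars support excluding inbound ports from traffic interception through a pod annotation. Replace 8080 with your probe port, and confirm the ASM-specific procedure in the linked document:

  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/excludeInboundPorts: "8080"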