This topic describes common scenarios in which an HTTP 503 status code is returned in Service Mesh (ASM) and the corresponding solutions.
Occasional return of an HTTP 503 status code
Scenario 1: The metric customization feature is used to apply custom metric configurations. Whenever you change the configurations of custom metrics, logs show that an HTTP 503 status code is returned for a small number of requests.
Causes
The metric customization feature works by generating an Envoy filter that updates the istio.stats configuration. This configuration takes effect on Envoy listeners through the Listener Discovery Service (LDS). When the configuration of an Envoy listener changes, existing connections are disconnected. An HTTP 503 status code is returned for the corresponding in-transit requests because the connections are reset or closed.
Solutions
If an upstream server closes a connection, the HTTP 503 status code that you see is not sent by the upstream server. Instead, it is returned by the client sidecar proxy as a response to the disconnection by the upstream server.
The default retry configurations of Istio do not cover the scenario in which the upstream server closes a connection. Among the retry conditions supported by the Envoy proxy, the reset condition matches this scenario. Therefore, you must configure a retry policy for the route of the corresponding service. Specifically, add the reset trigger condition to the retry policy of the virtual service.
The following code provides a configuration example. This configuration takes effect only for the Ratings service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-route
spec:
  hosts:
  - ratings.prod.svc.cluster.local
  http:
  - route:
    - destination:
        host: ratings.prod.svc.cluster.local
        subset: v1
    retries:
      attempts: 2
      retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes,reset,503
FAQ
Why does the default retry mechanism of a sidecar proxy not take effect?
The following configuration shows the conditions that trigger the default retries of a sidecar proxy. Retries are performed twice by default. The preceding scenario is not covered by these retry conditions. Therefore, the default retry mechanism does not take effect.
"retry_policy": {
"retry_on": "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes",
"num_retries": 2,
"retry_host_predicate": [
{
"name": "envoy.retry_host_predicates.previous_hosts"
}
],
"host_selection_retry_max_attempts": "5",
"retriable_status_codes": [
503
]
}
connect-failure: indicates a connection failure.
refused-stream: indicates that a REFUSED_STREAM error code is returned for an HTTP/2 stream.
unavailable: indicates that an unavailable error code is returned for a gRPC request.
cancelled: indicates that a cancelled error code is returned for a gRPC request.
retriable-status-codes: indicates that the status_code returned for a request matches an error code defined in the retriable_status_codes configuration.
For more information about all the retry conditions for the latest version of the Envoy proxy, see the following documents:
Router: existing HTTP request retry conditions (including HTTP/2 and HTTP/3)
x-envoy-retry-grpc-on: gRPC-specific retry conditions
Scenario 2: An HTTP 503 status code is occasionally returned even though no configuration changes are made. No specific pattern can be found for this issue.
An HTTP 503 status code is occasionally returned, and more often when traffic is high. Generally, this issue occurs on the inbound traffic of a sidecar proxy.
Causes
The idle connection timeout of the sidecar proxy does not match that of the application. The default idle connection timeout of the sidecar proxy is 1 hour. Two cases exist:
The idle connection timeout of the sidecar proxy is longer than that of the application. The application has already closed the idle connection, but the sidecar proxy still considers the connection open. If a new request is then sent over this connection, an HTTP 503 status code (response_flags=UC) is reported.
The idle connection timeout of the sidecar proxy is shorter than that of the application. No HTTP 503 status code is reported in this case, because the sidecar proxy considers the previous connection closed and creates a new connection.
Solutions
Solution 1: Configure idleTimeout in a destination rule
The cause of this issue is a mismatch in the idleTimeout settings at the two ends. To resolve this issue, we recommend that you configure idleTimeout in a destination rule.
The idleTimeout setting takes effect for both outbound traffic and inbound traffic of the sidecar proxy. If the client does not have a sidecar proxy, the idleTimeout setting also takes effect and can effectively reduce the reporting of HTTP 503 status codes.
Configuration recommendation: The appropriate idleTimeout value depends on your service. If idleTimeout is set to an excessively short period, a large number of connections are created. We recommend that you set this parameter to a period slightly shorter than the actual idle timeout period of your service.
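The following example shows one possible way to configure idleTimeout in a destination rule. The host name ratings.prod.svc.cluster.local and the 30-second value are assumptions for illustration; replace them with the values of your service.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ratings-idle-timeout
spec:
  host: ratings.prod.svc.cluster.local   # assumed host; replace with your service
  trafficPolicy:
    connectionPool:
      http:
        # Close connections that remain idle for longer than 30 seconds.
        # Set this to a period slightly shorter than the idle timeout of the application.
        idleTimeout: 30s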
Solution 2: Configure retries in a virtual service
A retry triggers connection re-establishment, which can resolve this issue. For more information, see the solutions for scenario 1.
Retries are risky for non-idempotent requests. Exercise caution when you configure retries in a virtual service.
Scenario 3: An HTTP 503 status code is occasionally returned because the lifecycle of the sidecar proxy container is improperly configured.
Causes
An HTTP 503 status code is occasionally returned when pods are restarted due to an improper lifecycle setting of the sidecar proxy container.
Solutions
For more information, see Sidecar proxy lifecycle.
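As a sketch of one common adjustment (the linked topic describes the settings recommended by ASM), the termination drain duration of the sidecar proxy can be extended by using the proxy.istio.io/config pod annotation so that in-transit requests can complete before the proxy exits during a pod restart. The workload name, image, and the 30-second value below are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ratings-v1                 # hypothetical workload name
spec:
  selector:
    matchLabels:
      app: ratings
  template:
    metadata:
      labels:
        app: ratings
      annotations:
        # Keep the sidecar proxy draining for 30 seconds (assumed value) during
        # pod termination so that in-transit requests are not cut off with a 503.
        proxy.istio.io/config: |
          terminationDrainDuration: 30s
    spec:
      containers:
      - name: ratings
        image: example/ratings:v1  # placeholder image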
Inevitable return of an HTTP 503 status code
Scenario 1: An application listens on localhost.
Causes
When an application in a cluster listens on localhost, other pods in the cluster cannot access the application.
Solutions
You can resolve this issue by exposing the application to other pods in the cluster. For more information, see How can I expose a cluster application that listens on localhost to other pods in the cluster?
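For reference, the following sketch assumes that the application listens on 127.0.0.1:8080. An Istio Sidecar resource can forward inbound traffic that the sidecar proxy receives to the loopback endpoint. The resource name, namespace, workload label, and port are assumptions; see the linked topic for the approach recommended by ASM.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: ratings-localhost          # hypothetical name
  namespace: prod                  # assumed namespace
spec:
  workloadSelector:
    labels:
      app: ratings                 # assumed workload label
  ingress:
  - port:
      number: 8080                 # assumed application port
      protocol: HTTP
      name: http
    # Forward inbound traffic that the sidecar proxy receives to the
    # application process that listens only on localhost.
    defaultEndpoint: 127.0.0.1:8080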
Scenario 2: After a sidecar proxy is injected into a pod, health checks of the pod always fail and an HTTP 503 status code is reported.
Causes
After you enable mutual Transport Layer Security (mTLS) in ASM, the requests for health checks sent by kubelet to the pod are intercepted by the sidecar proxy. If kubelet cannot provide the required TLS certificate, the health checks fail.
Solutions
You can configure ports for which traffic is not redirected to the sidecar proxy. For more information, see Why is no valid health check information displayed after sidecar injection?
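For reference, one way to exclude a port from sidecar interception (a sketch, assuming the health check probe targets port 8080) is the traffic.sidecar.istio.io/excludeInboundPorts pod annotation. The workload name, image, probe path, and port are assumptions; the linked topic describes the approach recommended by ASM.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ratings-v1                 # hypothetical workload name
spec:
  selector:
    matchLabels:
      app: ratings
  template:
    metadata:
      labels:
        app: ratings
      annotations:
        # Inbound traffic on port 8080 (assumed health check port) is not redirected
        # to the sidecar proxy, so kubelet probes reach the application directly.
        traffic.sidecar.istio.io/excludeInboundPorts: "8080"
    spec:
      containers:
      - name: ratings
        image: example/ratings:v1  # placeholder image
        readinessProbe:
          httpGet:
            path: /healthz         # assumed probe path
            port: 8080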