Alibaba Cloud Service Mesh: Why long-running requests fail after sidecar proxy injection

Last Updated: Mar 11, 2026

When a pod with an injected sidecar proxy terminates, long-running requests can be dropped or fail. By default, Istio forcibly stops the sidecar proxy 5 seconds after receiving the termination signal (terminationDrainDuration = 5s). Requests that take longer than this window are terminated regardless of their processing state.

Symptoms

When a pod with an injected sidecar proxy shuts down, two types of failures can occur:

  • Inbound requests are dropped. Long-running requests sent to this pod are lost mid-processing because the proxy shuts down before the requests complete.

  • Outbound requests fail. Requests that the application sends to other services through the proxy fail because the proxy has already stopped.

Root cause

With a sidecar proxy injected, all pod traffic flows through the proxy. When Kubernetes begins terminating a pod, two things happen in parallel:

  1. Kubernetes removes the pod from Service endpoints. New traffic is no longer routed to this pod.

  2. Istio starts draining the sidecar proxy. On receiving the SIGTERM signal, istio-agent tells Envoy to begin graceful draining: it rejects new connections while allowing existing ones to complete. After the terminationDrainDuration elapses, Istio forcibly kills the proxy.

The default terminationDrainDuration is 5 seconds. During this window:

  • No new inbound traffic is accepted.

  • Existing inbound connections continue processing.

  • Outbound connections remain functional.

If any request takes longer than 5 seconds, the proxy is killed and all remaining inbound and outbound connections are dropped.
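For reference, in open-source Istio this window is set by the terminationDrainDuration field in the mesh-wide default proxy configuration. The following excerpt is a minimal sketch of where that default lives; ASM exposes the setting through its console, so treat the snippet as illustrative rather than configuration you apply directly:

    # Istio meshConfig excerpt (illustrative; 5s is the upstream default)
    meshConfig:
      defaultConfig:
        terminationDrainDuration: 5s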

Kubernetes termination lifecycle

Understanding the Kubernetes pod termination sequence is essential for configuring timeouts correctly:

  1. Kubernetes sends a SIGTERM signal to all containers in the pod.

  2. Each container handles the signal according to its own logic (for example, Istio begins draining the proxy).

  3. Kubernetes waits up to terminationGracePeriodSeconds (default: 30 seconds) for all containers to exit.

  4. If any container is still running after the grace period, Kubernetes sends a SIGKILL to force termination.

The key timing constraint is:

preStop hook duration + terminationDrainDuration < terminationGracePeriodSeconds

If the combined time exceeds terminationGracePeriodSeconds, Kubernetes sends a SIGKILL and forcibly terminates all containers, bypassing the graceful drain entirely.
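For example, a 20-second preStop hook plus a 30-second drain duration requires a grace period longer than 50 seconds. A minimal pod-spec sketch (the 60-second value is an illustrative choice, not a recommendation):

    # Pod spec excerpt: preStop (20s) + drain (30s) = 50s, which fits within 60s
    spec:
      terminationGracePeriodSeconds: 60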

Solutions

Solution 1: Extend the termination drain duration

Increase the drain duration to give long-running connections enough time to complete. This approach works best when you can estimate the maximum request duration.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Data Plane Component Management > Sidecar Proxy Setting.

  3. On the Sidecar Proxy Setting page, click the Namespace tab.

  4. Select a namespace from the Namespace drop-down list. Click Lifecycle Management, select Sidecar Proxy Drain Duration at Pod Termination, enter a value that exceeds your longest expected request duration, and then click Update Settings.

Note The drain duration must be shorter than the pod's terminationGracePeriodSeconds (default: 30 seconds). If the drain duration exceeds the grace period, Kubernetes sends a SIGKILL and forcibly terminates all containers, bypassing the graceful drain.
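If you prefer to configure individual workloads directly, open-source Istio also accepts a per-pod override through the proxy.istio.io/config annotation; confirm that your ASM version honors this annotation before relying on it. An illustrative Deployment excerpt (the 60s value assumes requests finish within a minute):

    # Pod template excerpt (illustrative per-pod override)
    template:
      metadata:
        annotations:
          proxy.istio.io/config: |
            terminationDrainDuration: 60s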

Solution 2: Use a preStop hook to wait for active requests

When request durations are unpredictable, configure a preStop lifecycle hook on the sidecar proxy. The hook polls for active connections and delays proxy shutdown until all requests complete, instead of relying on a fixed timeout.

How it works: The preStop script uses netstat to check once per second for listening TCP sockets that do not belong to Envoy. Once the application has closed all of its sockets, the sidecar proxy exits gracefully after the default 5-second drain.

  1. Log on to the ASM console. In the left-side navigation pane, choose Service Mesh > Mesh Management.

  2. On the Mesh Management page, click the name of the ASM instance. In the left-side navigation pane, choose Data Plane Component Management > Sidecar Proxy Setting.

  3. On the Sidecar Proxy Setting page, click the Namespace tab.

  4. Select a namespace from the Namespace drop-down list. Click Lifecycle Management, select Lifecycle of Sidecar Proxy, enter the following JSON in the code editor, and then click Update Settings.

    {
      "postStart": {
        "exec": {
          "command": [
            "pilot-agent",
            "wait"
          ]
        }
      },
      "preStop": {
        "exec": {
          "command": [
            "/bin/sh",
            "-c",
            "while [ $(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs) -ne 0 ]; do sleep 1; done"
          ]
        }
      }
    }

    The two hooks serve different purposes:

    Hook | Purpose
    postStart | Runs pilot-agent wait to make sure the sidecar proxy is fully ready before the application container starts accepting traffic.
    preStop | Polls netstat once per second for listening TCP sockets that do not belong to Envoy. The sidecar proxy shuts down only after the application closes all of its sockets.
Note The total time spent in the preStop hook plus the drain duration must not exceed the pod's terminationGracePeriodSeconds. If your workloads have very long-running requests, increase terminationGracePeriodSeconds at the pod level accordingly.
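Because the preStop script above waits indefinitely while application sockets remain open, the grace period becomes the effective upper bound on shutdown time. A minimal Deployment excerpt showing where to raise it (the 120-second value is an illustrative choice):

    # Deployment excerpt: give the preStop poll plus the 5s drain room to finish
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 120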

Which solution to choose

Scenario | Recommended solution
Request durations are predictable (for example, under 60 seconds) | Solution 1: Set the drain duration to a value that exceeds your longest expected request.
Request durations vary or are unpredictable | Solution 2: Use a preStop hook to dynamically wait for all connections to close.
Both long-running and short-lived requests coexist | Combine both: set a reasonable drain duration as a safety net, and add the preStop hook for adaptive shutdown.