Locate the cause of high response latency by using access logs - Alibaba Cloud Service Mesh

After sidecar proxies are injected into workloads, sidecar proxies intercept and route traffic based on the specified policy. Sidecar proxies need time to process and forward traffic. This delays responses to requests. However, this delay is insignificant for concurrent processing if the node performance is robust. This topic describes how to use access logs to locate the cause of high response latency.

Step 1: Locate the component that causes high latency in the request path

Check the duration field in access logs to obtain the period of time consumed by a component on the data plane. During this period of time, the component reads and forwards the request to the upstream service, waits for the upstream service to return a response, and then reads and forwards the response to the downstream node.

If this period of time is unexpected, check the next upstream component in the request path. If the latency caused by the next upstream component is expected, the current component on the data plane causes the high latency. If the latency caused by the next upstream component is unexpected, continue to check upstream components until you find the one that causes the high latency.

Step 2: Check why the component causes high latency

Reasons for high latency include delayed network transmission and slow upstream processing. You can confirm the reason by checking related fields in access logs.

Delayed network transmission

View the request_duration and request_tx_duration fields in access logs.

If the value of the request_duration field is larger than expected, the component on the data plane takes a long time to read the request from the downstream node. The component on the data plane can be either the sidecar proxy or the gateway.
If the value of the request_tx_duration field is larger than expected, the component takes a long time to forward the request to the upstream service.

For an HTTP request with a body, the body is simultaneously read and sent to the upstream service, instead of being sent after it is read. Therefore, a long request_duration may lead to a long request_tx_duration.

If only request_tx_duration is long, the request is fast read but slowly sent to the upstream service.
If both request_duration and response_tx_duration are long, the response is slowly read from the upstream service or slowly forwarded to the downstream node.

Slow upstream processing

The response_duration field in access logs indicates the span from the time when the request is initiated to the time when the first byte of the response is received.

The difference between the values of response_duration and request_tx_duration indicates the time consumed by the upstream service to process the request. If the difference between the two values is large, the processing speed of the upstream service is slow or the upstream network latency is high.