
Trace details

Last Updated: Mar 12, 2018

The trace details function lets you use a TraceId to query the details of a specific service invocation trace in a selected region.

The trace details page displays only the RPC service calls in a trace. Local method calls are not included.

The trace details function is mainly used to track the time consumed and the exceptions thrown at each point of a distributed service call. Local methods are not the core of a distributed call, so we recommend that you use logs to track the time consumed and the exceptions thrown by local methods. For example, if methodA() calls localMethodB() and localMethodC(), the trace details page does not display these local calls. As a result, the elapsed time on a parent node can be longer than the total elapsed time on all of its subnodes.
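The gap between a parent node and its subnodes can be sketched in Python. This is a minimal illustration, assuming a trace is modeled as a tree of RPC spans; the `Span` class, the `untraced_time_ms` helper, the span names, and the durations are all hypothetical and are not the EDAS data model. The gap it reports is the time spent in local methods that the trace does not record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Hypothetical trace node: one RPC call and the RPC calls it made."""
    name: str
    elapsed_ms: int
    children: List["Span"] = field(default_factory=list)


def untraced_time_ms(span: Span) -> int:
    """Parent elapsed time minus the summed elapsed time of its direct children.

    Local method calls (such as localMethodB() and localMethodC()) never appear
    as children, so their time ends up in this gap. Assumes the child calls run
    sequentially; concurrent child calls can make the sum exceed the parent.
    """
    return span.elapsed_ms - sum(child.elapsed_ms for child in span.children)


# Hypothetical methodA: two RPC children are traced, local calls are not.
method_a = Span("methodA", 607, [
    Span("DB query", 45),
    Span("HSF call", 40),
])

print(untraced_time_ms(method_a))  # 522
```

Here 607 ms minus 85 ms of traced subnode time leaves 522 ms that only logs in the local methods can explain.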

You can log on to the EDAS console and choose Digital Operations > Trace Details in the left-side navigation pane to view the details of a service invocation trace. However, a more typical scenario is to view the trace details of slow or erroneous services. The following example demonstrates how to view trace details by starting from Trace Query in the left-side navigation pane.

  1. In the trace query results, find the HSF method, DB request, or other RPC service call that consumes the most time.

    • For DB, Redis, MQ, or other simple calls, find out why access to these nodes is slow, for example, whether it is caused by slow SQL statements or network congestion.

    • For HSF methods, further analyze the reason why the method consumes so much time.

  2. Confirm where the time of the method call is spent.

    Hover over the time bar on the method row. The information that appears shows the elapsed time for the client to send the request, the elapsed time for the server to process the request, and the elapsed time for the client to receive the response.

    If the server takes a long time to process the request, analyze the server-side tasks. Otherwise, analyze the call in the same way as a call timeout (see step 5).

  3. Check whether the total time consumed on subnodes is close to that consumed on the method.

    • If the difference is small, most of the time is consumed by network calls. In this case, reduce the number of network calls as much as possible to shorten the time consumed by each method.

      The preceding figure shows that the same method is called repeatedly in a loop. Instead, it could be called once in batch.

    • If the difference is large, most of the time is consumed by the server's own task logic rather than by RPC service calls. For example, the parent node may consume 607 ms while its subnodes consume less than 100 ms in total.

  4. Locate the time-consuming call.

    Use the time bars to locate the call before which a large amount of time elapses. That time is consumed purely by local logic and requires further troubleshooting.

    1. After locating the time-consuming logic, review the code or add logs to it to locate the problem.

      If the code does not appear to consume that much time, proceed to the next step.

    2. Check whether garbage collection (GC) occurred at that time. The gc.log file is essential for this check.

  5. Locate the timeout error.

    When a timeout error occurs, evaluate how the elapsed time is distributed.

    The time is divided into three parts:

    • 0 ms for the client to send the request. This phase includes serialization, network transmission, and deserialization. If this phase takes a long time, check whether GC occurred on the consumer. This phase also takes longer if the object to be serialized or deserialized is large, if the network is under heavy transmission load, or if GC occurs on the provider.

    • 10,077 ms for the server to process the request. This time covers only the server's processing of the request and excludes the other phases.

    • 3,002 ms for the client to receive the response. Because the timeout is set to 3 s, the call returns a timeout error after about 3 s even though the server is still processing the request. If this phase takes a long time, troubleshoot it in the same way as the request-sending phase.
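The timeout breakdown in step 5 can be expressed as a small check. The Python sketch below takes the three phase durations shown when you hover over a time bar, plus the configured timeout, and suggests where to look first. The `CallPhases` class and the `diagnose` function are hypothetical helpers, not part of EDAS, and the decision rules are a simplification of the steps above; the example values reproduce the case in step 5 (0 ms, 10,077 ms, 3,002 ms, and a 3 s timeout).

```python
from dataclasses import dataclass


@dataclass
class CallPhases:
    """Hypothetical per-call timing, in milliseconds, as read from the time bar."""
    client_send_ms: int     # serialization, network transmission, deserialization
    server_process_ms: int  # server-side processing only
    client_receive_ms: int  # time the client spent waiting for the response


def diagnose(phases: CallPhases, timeout_ms: int) -> str:
    """Return a rough hint about where to start troubleshooting a slow call."""
    timed_out = phases.client_receive_ms >= timeout_ms
    if timed_out and phases.server_process_ms > phases.client_receive_ms:
        # The client stopped waiting while the server was still processing.
        return "timeout: server still processing; analyze server logic and GC"
    durations = {
        "client send (check consumer GC, payload size, network)": phases.client_send_ms,
        "server processing (analyze server logic)": phases.server_process_ms,
        "client receive (troubleshoot like client send)": phases.client_receive_ms,
    }
    # Pick the phase with the largest duration.
    slowest = max(durations, key=durations.get)
    prefix = "timeout" if timed_out else "slowest phase"
    return f"{prefix}: {slowest}"


example = CallPhases(client_send_ms=0, server_process_ms=10_077, client_receive_ms=3_002)
print(diagnose(example, timeout_ms=3_000))
# timeout: server still processing; analyze server logic and GC
```

A hint like this only narrows the search; the cause still has to be confirmed by reviewing the code, the logs, and gc.log as described in step 4.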
