
Application Real-Time Monitoring Service:Troubleshoot applications based on slow and failed traces

Last Updated: Sep 03, 2024

In a production environment, application exceptions such as sudden increases in response time and error rates can have various causes, including uneven traffic, instance failures, program errors, and dependency failures. Before an application is released or a promotion is launched, comprehensive performance optimization is required: performance bottlenecks must be identified, and the interfaces or components that frequently fail or consume excessive time must be optimized. This topic describes how to use slow and failed traces to troubleshoot slow and failed calls and locate performance bottlenecks.

Prerequisites

  • An Application Real-Time Monitoring Service (ARMS) agent is installed for the application. For more information, see Application Monitoring overview.

  • The new ARMS console is used.

    image


Troubleshoot failed calls based on failed traces

Step 1: Identify the time period

  1. Log on to the ARMS console. In the left-side navigation pane, choose Application Monitoring > Application List.

  2. On the Application List page, select a region in the top navigation bar and click the name of the application that you want to manage.

    Note

    If the Java icon is displayed in the Language column, the application is connected to Application Monitoring. If a hyphen (-) is displayed, the application is connected to Managed Service for OpenTelemetry.

  3. In the top navigation bar, click the Trace Explorer tab.

    Note

    The Trace Explorer tab is available in the new ARMS console. For information about how to use the new console, see the Prerequisites section.

    As shown in the following figure, some HTTP errors occurred in the sample application mall-gateway between 15:20 and 15:28.

    image

  4. Set the query time range to this period (15:20 to 15:28) for troubleshooting.

    image

Step 2: Locate the interfaces or components

As shown on the Wrong/slow Trace analysis tab, the failed traces are mainly concentrated in the /components/api/v1/mall/product interface, and the error code 500 was returned for these traces.

image

Troubleshoot the /components/api/v1/mall/product interface

  1. In the spanName chart, click spanName: /components/api/v1/mall/product.

    image

    The serviceName="mall-gateway" AND spanName="/components/api/v1/mall/product" filter condition is automatically added.

    As shown in the query results, all the traces related to the /components/api/v1/mall/product interface failed.

    image

  2. On the List tab, find a trace and click Details in the Actions column to view the trace details.

    image
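
The console steps above are sufficient to locate the failing interface. For background on why such calls appear as failed traces, the following is a minimal sketch, assuming manual instrumentation with the OpenTelemetry Java SDK; in practice, the ARMS agent instruments HTTP entry points automatically, and the class, method, and exception shown here are hypothetical.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ProductHandler {

    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("mall-gateway");

    // Illustrative entry point; the span name matches the interface shown in Trace Explorer.
    public String getProduct(String productId) {
        Span span = tracer.spanBuilder("/components/api/v1/mall/product").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return queryProduct(productId); // business logic that may throw
        } catch (RuntimeException e) {
            // Recording the exception and setting the status to ERROR is what
            // makes the call show up as a failed trace (surfaced as HTTP 500 by the gateway).
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }

    private String queryProduct(String productId) {
        // Hypothetical downstream dependency that is currently failing.
        throw new IllegalStateException("product service unavailable");
    }
}
```

The sketch only illustrates the span-level data that a failed call produces; no code changes are needed to obtain this information when the ARMS agent is installed.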

Troubleshoot slow calls based on slow traces

Step 1: Identify the time period

  1. Log on to the ARMS console. In the left-side navigation pane, choose Application Monitoring > Application List.

  2. On the Application List page, select a region in the top navigation bar and click the name of the application that you want to manage.

    Note

    If the Java icon is displayed in the Language column, the application is connected to Application Monitoring. If a hyphen (-) is displayed, the application is connected to Managed Service for OpenTelemetry.

  3. In the top navigation bar, click the Trace Explorer tab.

    As shown in the following figure, the sample application mall-user-server has various slow calls that consume more than 5 seconds between 15:40 and 15:49.

    image

  4. Set the query time range to this period (15:40 to 15:49) for troubleshooting.

    image

  5. On the Wrong/slow Trace analysis tab, click Modify Time-consuming Threshold to change the threshold to 5000 ms.

    image

Step 2: Locate the interfaces or components

As shown on the Wrong/slow Trace analysis tab, the slow traces are concentrated in the /components/api/v1/http/success interface, the spans are reported through EagleEye, and the interface is deployed in the arms-test namespace.

image

Troubleshoot the /components/api/v1/http/success interface

In the spanName chart, click spanName: /components/api/v1/http/success.

image

The serviceName="mall-user-server" AND spanName="/components/api/v1/http/success" filter condition is automatically added.

As shown in the query results, each call takes more than 5 seconds, and the /components/api/v1/http/success interface is the root cause of the slow calls.

image

As shown in the Time Percentile chart, the average call duration exceeds 5 seconds.

image
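
For background on why these calls appear as slow traces, the following is a minimal sketch, again assuming manual instrumentation with the OpenTelemetry Java SDK; the handler class and the simulated 5-second dependency are hypothetical. The span duration covers everything between startSpan() and end(), so any call that exceeds the 5000 ms threshold set on the Wrong/slow Trace analysis tab is listed as a slow trace.

```java
import java.time.Duration;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class SuccessHandler {

    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("mall-user-server");

    // Illustrative entry point; a slow dependency pushes the whole trace
    // above the 5000 ms slow-call threshold.
    public String handle() throws InterruptedException {
        Span span = tracer.spanBuilder("/components/api/v1/http/success").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            callDownstream(Duration.ofSeconds(5)); // hypothetical slow dependency
            return "success";
        } finally {
            span.end();
        }
    }

    private void callDownstream(Duration latency) throws InterruptedException {
        // Simulates a dependency that responds slowly, for example a saturated database.
        Thread.sleep(latency.toMillis());
    }
}
```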

Troubleshoot spans reported through EagleEye

In the attributes._arms.trace.protocol.type chart, click attributes._arms.trace.protocol.type: EagleEye.

image

The serviceName="mall-user-server" AND attributes._arms.trace.protocol.type="EagleEye" filter condition is automatically added.

As shown in the query results, all the slow traces are concentrated in the /components/api/v1/http/success interface.

image

Configure the /components/api/v1/http/success interface as a filter condition. As shown in the query results, each call takes more than 5 seconds.

image

As shown in the Time Percentile chart, the average call duration exceeds 5 seconds.

Troubleshoot the spans related to the arms-test namespace

As shown in the query results of the serviceName="mall-user-server" AND attributes.namespace="arms-test" filter condition, all the slow traces are concentrated in the /components/api/v1/http/success interface.

image

Configure the /components/api/v1/http/success interface as a filter condition. As shown in the query results, each call takes more than 5 seconds.

image

Based on the preceding troubleshooting, we can conclude that all the slow traces are concentrated in the /components/api/v1/http/success interface, which is deployed in the arms-test namespace, and that the traces are reported through EagleEye.