Locating and troubleshooting response time errors is often a long and complex process. Application Real-Time Monitoring Service (ARMS) provides the active diagnosis feature to help you quickly and accurately locate the causes of response time errors. Resolving the issues that active diagnosis uncovers shortens the response time of your applications.

Background

Response time errors are typically caused by slow responses from downstream applications, uneven traffic, an excessive number of full garbage collection (GC) activities, or an excessive load. To locate and troubleshoot response time errors, you need to:

  • Identify the server that causes the spike in response time.
  • Analyze how much time the application spends on SQL calls.
  • Check the number of full GC activities of the application and whether the time they consume spikes.
  • Check whether there are memory leaks.
  • Check the exception log.
  • Inspect whether the response time of downstream applications follows the same trend.
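
The first and last checks in this list can be sketched as a small script. This is a minimal illustration, not an ARMS API: the function names, the data layout, the threshold of 2.0, and every number below are assumptions made for the example. It flags instances whose latest response time is well above their own recent baseline, and it measures how closely the application's response time tracks that of a downstream application.

```python
from math import sqrt
from statistics import mean


def find_spiking_instances(samples, baseline_window=5, threshold=2.0):
    """Return (instance, ratio) pairs whose latest response time is at least
    `threshold` times the mean of the preceding `baseline_window` points."""
    spiking = []
    for instance, series in samples.items():
        if len(series) <= baseline_window:
            continue  # not enough history to form a baseline
        baseline = mean(series[-baseline_window - 1:-1])
        latest = series[-1]
        if baseline > 0 and latest / baseline >= threshold:
            spiking.append((instance, round(latest / baseline, 2)))
    # Worst offender first
    return sorted(spiking, key=lambda item: item[1], reverse=True)


def pearson(xs, ys):
    """Pearson correlation of two equally long series (0.0 if either is flat)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical per-instance response times in ms, one value per minute.
samples = {
    "10.0.0.1": [110, 120, 115, 118, 112, 119],
    "10.0.0.2": [105, 108, 111, 109, 107, 560],
}
spiking = find_spiking_instances(samples)

# Hypothetical response times (ms) of the application and a downstream service.
app_rt = [120, 118, 125, 900, 1700]
downstream_rt = [100, 99, 104, 880, 1680]
trend = pearson(app_rt, downstream_rt)

print(spiking)
print(f"correlation with downstream: {trend:.3f}")
```

A correlation close to 1 only suggests that the two curves move together. It does not by itself prove that the downstream application is the cause, so the remaining checks in the list still apply.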

Prerequisites

Make sure you have installed the ARMS agent for your application.

Step 1: View exception information

The installed ARMS agent collects and shows the application's total requests, average response time, number of errors, real-time instances, number of full garbage collection (GC) activities, slow SQL statements, exceptions, and slow calls within the selected period. Follow these steps to view the exception information of the application.

  1. In the left-side pane of the ARMS console, choose Application Monitoring > Applications.

    On the Applications page, if the application has an exception, the status bar is displayed in red.

  2. On the Applications page, click the red dot in the status bar of the target application.

  3. View exception details on the Key Events page.

    The Key Events page shows the time series curves of the response time and average response time of the current application and the downstream applications. Information about abnormal APIs and the Trace IDs of the top 5 abnormal calls are also displayed.

Step 2: Diagnose the causes of the exception

Exception statistics alone are not enough to locate the cause of an exception. You can use SQL analysis, tracing analysis, or interface snapshots to quickly locate the cause.

  1. On the Key Events page, click the name of the dependent service. For example, click Invoke MYSQL. Then, in the Application Dependent Services section of the Overview page, view the details of downstream applications.

    The Application Dependent Services section shows the time series curves of the request volume, average response time, number of application instances, and HTTP status code statistics of the application dependent services. In this example, the response time of the downstream application reaches 1680 ms. This indicates that the application's spike in response time is caused by the downstream application's spike in response time.

  2. Return to the Key Events page and click the API that contributes the most to the spike. For example, click /Xxxdata/..., and then click the SQL Analysis tab on the /Xxxdata/... page to view API information.

    The SQL Analysis tab shows the SQL call statistics and SQL statement details. For more information, see SQL analysis. In this example, the spike in response time of the current application is caused by slow SQL calls of the downstream application.

  3. Return to the Key Events page and click the Trace ID of a call whose time consumption is among the top 5. Then click the magnifier icon in the Method Stack column to view the problem code.

    To find the target trace, see Trace query.

    In this example, most of the 536 ms spent on the call is consumed by the SELECT t1.id... statement of this SQL call.
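
The reasoning in step 3 amounts to finding the entry in the method stack that accounts for the largest share of the call's time. The sketch below shows that idea under stated assumptions: the `dominant_entry` helper, the `self_ms` field, and the entry names and durations are all hypothetical and are not taken from the ARMS console or from the example trace.

```python
def dominant_entry(method_stack):
    """Return (name, share) for the entry with the largest self time."""
    total = sum(entry["self_ms"] for entry in method_stack)
    if total == 0:
        raise ValueError("method stack has no recorded time")
    top = max(method_stack, key=lambda entry: entry["self_ms"])
    return top["name"], top["self_ms"] / total


# Hypothetical method stack: time spent in each entry itself, in ms.
method_stack = [
    {"name": "controller.handle", "self_ms": 15},
    {"name": "service.load", "self_ms": 35},
    {"name": "SELECT t1.id...", "self_ms": 450},
]
name, share = dominant_entry(method_stack)
print(f"{name} accounts for {share:.0%} of the call")
```

When a single entry dominates in this way, optimizing that SQL statement is likely to have more effect than tuning the surrounding application code.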

At this point, the cause of the exception has been identified, which gives you a concrete starting point for subsequent code optimization. You can also return to the Key Events page to view other slow calls in the list and solve them one by one.