Automatic application fault diagnostics - Enterprise Distributed Application Service

On the Application Overview page of the Enterprise Distributed Application Service (EDAS) console, you can specify a time range for a diagnostics test. The system then automatically diagnoses the status of your application within that time range. If an issue is found, a red shield icon appears in the upper part of the Application Overview page. Click the icon to open the diagnostics report. You can identify and fix the issue based on the fault definition and root cause analysis in the report.

Common scenarios of automatic fault diagnostics

Sudden increase in the RT

If a downstream application causes a sudden increase in the response time (RT) of an application, contact the owner of the downstream application for troubleshooting.
If an application change causes the sudden increase in the RT, review the specific changes for troubleshooting.
If a service of the application causes a sudden increase in the RT, you can perform the following operations to troubleshoot the issue:
- Check whether the service threw an exception at the time of the increase.
- Check whether the upstream service that calls the service has a long RT.
- Check whether a downstream service that is called by the service has a long RT.
If a single node causes the sudden increase in the RT, the cause may be one of the following issues on the node:
- Full thread pool. A time series curve chart for the number of threads is provided in the diagnostics report.
- Full garbage collection (GC) on the single node.
- Disk read and write errors on the single node.
- Out of memory (OOM) issues on the single node.
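
The following Python sketch shows one simple way to locate a sudden RT increase in a series of samples, so that you know which point in time to inspect for exceptions, changes, or node issues. It is not the algorithm that EDAS uses. The function name find_rt_spike, the baseline size, the threshold, and the sample values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_rt_spike(rt_ms, baseline_size=5, threshold=3.0):
    """Return the index of the first sample that exceeds the baseline mean
    by more than `threshold` standard deviations, or None if there is none.

    Hypothetical helper: the first `baseline_size` samples are assumed to
    represent normal behavior.
    """
    if len(rt_ms) <= baseline_size:
        return None  # not enough data beyond the baseline window
    baseline = rt_ms[:baseline_size]
    mu = mean(baseline)
    # Fall back to 1.0 ms so that a perfectly flat baseline does not
    # flag every tiny fluctuation.
    sigma = pstdev(baseline) or 1.0
    for i in range(baseline_size, len(rt_ms)):
        if rt_ms[i] > mu + threshold * sigma:
            return i
    return None


# Illustrative RT samples in milliseconds, one per monitoring interval.
samples = [102, 98, 101, 99, 100, 103, 97, 480, 510, 495]
spike_index = find_rt_spike(samples)
```

With these sample values, spike_index points at the first sample of the jump. You can then compare that point in time with deployment records and with the thread, GC, disk, and memory charts of each node.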

High proportion of error requests or a large number of requests

The number of error requests for a service of the application suddenly increases. This results in a high proportion of error requests.
A large number of requests and responses occur in a specific period. As a result, serialization and deserialization take a long time.
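
As a rough illustration of what a high proportion of error requests means, the following Python sketch computes the error ratio for one period and compares it with a limit. The function names and the 5% limit are assumptions for this example and are not thresholds documented by EDAS.

```python
def error_ratio(total_requests, error_requests):
    """Return the share of requests that failed in a period (0.0 to 1.0)."""
    if total_requests <= 0:
        return 0.0  # no traffic in the period, so nothing to report
    return error_requests / total_requests


def is_error_ratio_high(total_requests, error_requests, limit=0.05):
    """Return True if the error ratio exceeds `limit` (assumed value)."""
    return error_ratio(total_requests, error_requests) > limit


# Illustrative numbers: 30 failed requests out of 200 in one period.
ratio = error_ratio(200, 30)
high = is_error_ratio_high(200, 30)
```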

Excessive loads on a host

Excessive loads on the host reduce the capability of the container to provide services.

Network issues

When the application is running, an exception occurs due to a network failure in a data center.

View the report of automatic fault diagnostics

1. Log on to the EDAS console.
2. In the left-side navigation pane, choose Application Management > Applications. In the top navigation bar, select a region. In the upper part of the page, select a namespace. Select Container Service or Serverless Kubernetes Cluster from the Cluster Type drop-down list. Then, find the application that you want to diagnose and click the application name.
3. On the Overview tab of the Application Overview page, specify a time range for a diagnostics test in the upper-right corner.
Important
If an exception is detected for the application within the specified time range, a red shield icon appears on the right side of the application name in the upper part of the page. If no exception is detected, the application may still have potential issues.
4. In the upper part of the Application Overview page, click the red shield icon on the right side of the application name.
5. View the fault symptom and cause analysis in the diagnostics report.

Components of a diagnostics report

The diagnostics report consists of four parts: diagnostics details, fault definition, root cause analysis, and data support.

Diagnostics details: This part shows the diagnosed application, the diagnostics time, and the fault symptom.
Fault definition: This part contains the surface-level cause of the application failure that is inferred by the diagnostics model. In most cases, the cause is one of the following:
- An instance error of the application causes an overall failure.
- An API or service error of the application causes an overall failure.
- A downstream application failure of the application causes a failure of the application.
Root cause analysis: This part contains the underlying causes that are inferred by the diagnostics model. Many underlying causes are possible, and they vary based on the actual situation.
Data support: This part contains the data that supports the inference. The diagnostics reports of different faults contain different analysis data.
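
If you record report findings outside the console, for example in an incident log, the four parts can be modeled as a small data structure. The following Python sketch is a hypothetical model only. EDAS does not document such a schema here, and the class name, field names, and sample values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiagnosticsReport:
    # Diagnostics details
    application: str
    diagnosed_at: str
    fault_symptom: str
    # Fault definition: the surface-level cause
    fault_definition: str
    # Root cause analysis: the underlying causes
    root_causes: List[str] = field(default_factory=list)
    # Data support: named series that back the inference
    supporting_data: Dict[str, List[float]] = field(default_factory=dict)

    def summary(self) -> str:
        """Return a one-line summary suitable for an incident log."""
        causes = "; ".join(self.root_causes) or "no root cause inferred"
        return (
            f"{self.application}: {self.fault_symptom} "
            f"({self.fault_definition}) -> {causes}"
        )


# Illustrative values, not taken from a real report.
report = DiagnosticsReport(
    application="demo-app",
    diagnosed_at="2024-01-01T10:00:00Z",
    fault_symptom="sudden increase in RT",
    fault_definition="instance error",
    root_causes=["full thread pool"],
    supporting_data={"thread_count": [50, 120, 200]},
)
line = report.summary()
```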