Alert rule metrics - Application Real-Time Monitoring Service

This topic describes the alert rule metrics of Application Real-Time Monitoring Service (ARMS) Application Monitoring. The data of these metrics is monitored once a minute.

JVM monitoring

Note

The following JVM metrics are for reference only. For authoritative descriptions of JVM behavior, see the official JVM documentation.

Metrics

Metric	Unit	Common	Description
Number of JVM Full GCs (Instantaneous Value)	None	Yes	The number of full garbage collections (GCs) performed by the JVM in the last N minutes. If full GCs frequently occur in your application, errors may occur.
JVM Full GC Duration (Instantaneous Value)	Milliseconds	No	The time consumed for full GCs in the last N minutes. The instantaneous value of full GC duration indicates the garbage collection performance of the current JVM. Generally, the shorter the full GC duration, the better the JVM performance. If full GCs take too long, applications may significantly stutter. This affects the user experience.
Number of JVM Young GCs (Instantaneous Value)	None	Yes	The number of young GCs performed by the JVM in the last N minutes. The instantaneous value of the number of young GCs indicates the speed of JVM object creation and destruction and the usage of the young generation. Generally, the more young GCs, the more objects are created in the application. A high value may also indicate memory leaks or inefficient memory usage in the application.
JVM Young GC Duration (Instantaneous Value)	Milliseconds	No	The time consumed for young GCs in the last N minutes. The instantaneous value of young GC duration indicates the garbage collection performance of the current JVM. Generally, the longer the young GC duration, the lower the garbage collection efficiency. In this case, the application may stutter.
Total JVM Heap Memory	MB	No	The total size of the JVM heap memory, including the memory of young and old generations. The size of JVM heap memory must be properly configured based on the load and performance requirements of the application. Excessively small JVM heap memory leads to frequent garbage collections and affects application performance. Excessively large JVM heap memory occupies a large amount of system resources and affects system stability.
Used JVM Heap Memory	MB	Yes	The size of JVM heap memory that has been used by Java programs. The size of used JVM heap memory must be strictly controlled to prevent system performance degradation, or memory overflow caused by memory leaks or excessive memory usage.
Committed JVM Non-heap Memory	MB	No	The size of non-heap memory that is committed to the JVM, that is, guaranteed to be available for use by Java programs. The size of committed JVM non-heap memory must be strictly controlled to prevent excessive memory usage caused by excessive class loading or a large number of static variables and constants.
Initial JVM Non-heap Memory	MB	No	Generally, the initial size of JVM non-heap memory is dynamically calculated based on factors such as JVM version, operating system, and JVM parameters.
Maximum JVM Non-heap Memory	MB	No	If you are using a Java version earlier than 8, this metric is controlled by the JVM parameter MaxPermSize. Otherwise, this metric is controlled by MaxMetaspaceSize.
Used JVM Non-heap Memory	MB	Yes	The size of used JVM non-heap memory, including Metaspace and PermGen.
Number of JVM Blocked Threads	None	No	The number of blocked threads that are waiting for monitor locks. Excessive blocked threads may cause system performance degradation.
Total Number of JVM Threads	None	Yes	The number of threads in all states. An excessive number of threads may result in insufficient memory and CPU resources. This affects the application performance and stability.
Number of JVM Deadlocked Threads	None	No	The number of threads that are deadlocked. A deadlock occurs when two or more threads are waiting for one another to release a resource. When deadlocks occur in the JVM, the number of deadlocked threads increases. Generally, the more deadlocked threads, the more severe the situation. The application may even crash.
Number of New JVM Threads	None	No	The number of threads that have been created but not yet started (threads in the NEW state). A large number of threads can be created in a JVM, but excessive threads may result in a waste of system resources and pressure on thread scheduling.
Number of JVM Runnable Threads	None	No	The number of threads that are executing in the JVM or are ready to execute (threads in the RUNNABLE state). If excessive threads run at the same time, a large amount of CPU and memory resources are consumed. The system may run slow or crash.
Number of JVM Terminated Threads	None	No	The number of threads that have finished execution (threads in the TERMINATED state). The number of threads can be controlled based on the actual situation to prevent thread resource waste or thread starvation.
Number of JVM Timed-out Waiting Threads	None	Yes	The number of threads that are waiting for a resource with a specified timeout (threads in the TIMED_WAITING state). If this number is too large, some bottlenecks may exist in the system. In this case, you need to optimize resources to improve the processing capabilities and response speed of the system.
Number of JVM Waiting Threads	None	No	The number of waiting threads in the current JVM. For highly concurrent applications, an increase in the number of JVM waiting threads may result in performance degradation.
Number of JVM GCs (Cumulative Value)	None	No	The cumulative number of GCs performed in the JVM.
Number of JVM Mark and Sweep Implementations (Cumulative Value)	None	No	The cumulative number of times the mark-and-sweep algorithm is used in the JVM.
JVM Heap Memory Usage	%	No	The ratio of the allocated heap memory to the total heap memory during JVM runtime. This metric can be used to measure the efficiency and performance of JVM memory management. Generally, keep the JVM heap memory usage below 70% to prevent problems such as memory overflow.
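
As a rough illustration of how the JVM Heap Memory Usage percentage relates to the heap sizes in the preceding table, the following Python sketch computes the ratio and compares it with the 70% guideline. It assumes that the ratio is taken as used heap over total heap. The function names and sample values are hypothetical and are not part of ARMS.

```python
def heap_usage_percent(used_mb: float, total_mb: float) -> float:
    """Return heap usage as a percentage of the total heap size."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return used_mb / total_mb * 100.0


def exceeds_guideline(used_mb: float, total_mb: float,
                      limit_percent: float = 70.0) -> bool:
    """Return True when heap usage is above the guideline (70% by default)."""
    return heap_usage_percent(used_mb, total_mb) > limit_percent
```

For example, 800 MB used out of a 1,024 MB heap is above the guideline, whereas 512 MB used out of 1,024 MB is not.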

Metric dimensions

The preceding metrics are monitored by node IP address. You can use the following methods to configure alerting based on the IP addresses:

Traversal: traverses the IP address of each node and monitors the metric data of each node.
=: specifies specific nodes for monitoring and alerting. Example: =172.20.XX.XX.
No dimension: aggregates and monitors the metric data of all nodes.
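
The difference between the three methods can be sketched in Python as follows. This is only an illustration of the concept, not the ARMS implementation: the mode names, the function name, and the use of an average for aggregation are assumptions.

```python
from statistics import mean


def evaluate_nodes(samples, threshold, mode, selected=()):
    """Return a mapping of label -> True if the threshold is exceeded.

    samples:  {node_ip: metric_value} for one evaluation cycle
    mode:     "traversal", "equals", or "none" (hypothetical names)
    selected: node IP addresses to check when mode is "equals"
    """
    if mode == "traversal":
        # One independent check per node.
        return {ip: value > threshold for ip, value in samples.items()}
    if mode == "equals":
        # Only the explicitly specified nodes are checked.
        return {ip: samples[ip] > threshold for ip in selected if ip in samples}
    if mode == "none":
        # All nodes collapse into one series; averaging is an assumption.
        return {"all-nodes": mean(samples.values()) > threshold}
    raise ValueError(f"unknown mode: {mode}")
```

With traversal, a single overloaded node triggers its own alert. With no dimension, the same node may be masked or amplified by the other nodes, depending on how the data is aggregated.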

Scheduled tasks

Note

ARMS Application Monitoring supports only scheduled tasks of the XXL-JOB, SchedulerX, and JDK-Timer types.

Metrics

Metric	Unit	Common	Description
Duration	Milliseconds	No	The average duration of the scheduled task.
Total Number of Executions	None	No	The number of times that the scheduled task is executed.
Number of Execution Errors	None	No	The number of times that the scheduled task is not executed as expected within the specified time interval.
Scheduling Time Delay	Milliseconds	No	The time spent on scheduling before the scheduled task is started.
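
One way to read the Scheduling Time Delay metric is as the gap between the time a task was planned to run and the time it actually started. The following Python sketch works under that assumption. The function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def scheduling_delay_ms(planned: datetime, started: datetime) -> float:
    """Milliseconds between the planned trigger time and the actual start."""
    # Dividing two timedeltas yields a float number of milliseconds.
    return (started - planned) / timedelta(milliseconds=1)
```

A task planned for 12:00:00.000 that starts at 12:00:00.250 has a scheduling delay of 250 milliseconds under this reading.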

Metric dimensions

The preceding metrics are monitored by scheduled tasks. You can use the following methods to configure alerting based on the scheduled tasks:

Traversal: traverses the scheduled tasks and monitors the metric data of each scheduled task.
=: specifies specific scheduled tasks for monitoring and alerting. Example: =LoadGenerator.mockUserApiLoad.
No dimension: aggregates and monitors the metric data of all scheduled tasks.

Exceptions

Metrics

Metric	Unit	Common	Description
Number of Exceptions	None	Yes	The number of exceptions that occur during software runtime, such as null pointer exceptions, array out-of-bounds exceptions, and I/O exceptions. You can use this metric to check whether a call stack throws errors and whether application call errors occur.
Response Time of Abnormal API Calls	Milliseconds	Yes	The response time of an abnormal API call for the application. If an API call is abnormal, errors occur. You can use this metric to estimate the impact of errors thrown by the call stack on the response time of the API call, and check whether errors occur.

Metric dimensions

The preceding metrics are monitored by API operation. You can use the following methods to configure alerting based on the API operations:

Traversal: traverses the accessed API operations and monitors the metric data of each API operation.
=: specifies specific API operations for monitoring and alerting. Example: =/tb/api/users/{userId}.
!=: excludes specific API operations from monitoring and alerting, and separately monitors the other API operations. Example: !=/tb/api/users/{userId}.
Inclusion: monitors API operations that contain a specific keyword. Example: Include api.
Exclusion: monitors API operations that do not contain a specific keyword. Example: Exclude api.
Regular expression: monitors the API operations that match the specified regular expression. Example: =/(api)/i.
No dimension: aggregates and monitors the metric data of all API operations.

The preceding metrics are monitored by exception. You can use the following methods to configure alerting based on the exceptions:

Traversal: traverses the exceptions and monitors the metric data of each exception.
=: specifies specific exceptions for monitoring and alerting. Example: =FeignException$InternalServerError.
!=: excludes specific exceptions from monitoring and alerting, and separately monitors the other exceptions. Example: !=FeignException$InternalServerError.
Inclusion: monitors exceptions that contain a specific keyword. Example: Include data.
Exclusion: monitors exceptions that do not contain a specific keyword. Example: Exclude data.
Regular expression: monitors the exceptions that match the specified regular expression. Example: =/(data)/i.
No dimension: aggregates and monitors the metric data of all exceptions.
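
The filter methods in the preceding lists can be thought of as predicates on the name of an API operation or exception. The following Python sketch shows one possible reading of each method. The operator keywords and the function name are hypothetical, and treating the regular expression as case-insensitive is an assumption based on the /i flag in the examples.

```python
import re


def matches(name: str, operator: str, operand: str) -> bool:
    """Return True if the name is selected by the given filter."""
    if operator == "=":
        return name == operand
    if operator == "!=":
        return name != operand
    if operator == "include":
        return operand in name
    if operator == "exclude":
        return operand not in name
    if operator == "regex":
        # Case-insensitive search, mirroring the /i flag in the examples.
        return re.search(operand, name, re.IGNORECASE) is not None
    raise ValueError(f"unknown operator: {operator}")
```

Under this reading, a filter such as Include api selects /tb/api/users/{userId} but not /health, and the regular expression (api) also selects /tb/API/orders because the match ignores case.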

Application-dependent services

Metrics

Metric	Unit	Common	Description
Number of Application-dependent Service Calls	None	No	The number of calls to the downstream API operations on which the application depends. You can use this metric to check whether the number of downstream dependent service calls increases.
Application-dependent Service Call Error Rate	%	No	The value of this metric is calculated by using the following formula: Error rate of application-dependent service calls = Number of abnormal downstream API requests/Total number of downstream API requests × 100%. You can use this metric to check whether the errors of the downstream dependent services increase and affect the application.
Response Time of Application-dependent Service Calls	Milliseconds	Yes	The average response time of the downstream API operations on which the application depends. You can use this metric to check whether the time consumed by the downstream dependent services increases and affects the current application.

Metric dimensions

The preceding metrics are monitored by API request type. You can use the following methods to configure alerting based on the API request types:

Traversal: traverses the API request types and separately monitors the metric data of each type, such as HTTP, MySQL, and Redis.
=: specifies specific API request types for monitoring and alerting. Example: =http.
No dimension: aggregates and monitors the metric data of all API request types.

Host monitoring

Metrics

Metric	Unit	Common	Description
Node CPU Utilization	%	No	The CPU utilization of the node. Each node is a server. Excessive CPU utilization may cause problems such as slow system response and service unavailability.
Node CPU Utilization in User Mode	%	No	The ratio between the node CPU time occupied by processes running in user mode and the total CPU time. Processes in user mode are applications in user space, such as web services and databases.
Idle Disk Space of Node	MB	Yes	The unused disk space of the node. You can use this metric to check whether the disk space is full. If the disk space is full, the system may crash or fail to work as expected.
Node Disk Utilization	%	No	The ratio between the used disk space and the total disk space. The higher the disk utilization, the less free storage capacity remains on the node.
Node System Load	None	Yes	You can use this metric to check whether the workload of the node is excessively high. For a node that has N cores, the maximum workload is N.
Idle Node Memory	MB	Yes	The size of the unused memory in the node. You can use this metric to check whether the memory of the node is sufficient. If the memory of the node is insufficient, exceptions such as out-of-memory (OOM) errors may occur.
Node Memory Usage	%	No	The percentage of memory in use. If the memory usage of the node exceeds 80%, you need to reduce memory pressure by adjusting the configurations of the node or optimizing the memory usage of tasks.
Number of Error Packets Received on Node	None	No	The number of error packets that the node receives when it processes network communication. These error packets may be caused by network transmission issues or application issues. If error packets are received, the node may fail to process the network communication, and the system may be affected.
Number of Error Packets Sent from Node	None	No	The number of error packets that the node sends when it processes network communication. These error packets may be caused by network transmission issues or application issues. You can use this metric to check whether the node network is normal.
Number of JVM Instances	None	Yes	The number of JVM instances that are running in real time. Generally, this metric is used to configure service downtime alerting.
Number of Bytes Sent from Node	None	No	The amount of data sent by the node over a network, including data, system messages, and error messages sent by the application.
Number of Packets Sent from Node	None	No	The number of packets sent from the node over a network.
Number of Bytes Received on Node	None	No	The total amount of data received by the node over a network.
Number of Packets Received on Node	None	No	The number of packets received by the node over a network.
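
The Node System Load description notes that a node with N cores has a maximum workload of N. A common way to compare nodes of different sizes is to normalize the load by the number of cores, as in the following Python sketch. The function names are hypothetical, and treating a per-core value of 1.0 or higher as saturated is an assumption derived from that description.

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Return the system load divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def is_saturated(load_average: float, cores: int) -> bool:
    """Return True when the load reaches or exceeds the number of cores."""
    return load_per_core(load_average, cores) >= 1.0
```

For example, a load of 4.0 on an 8-core node corresponds to 0.5 per core, whereas a load of 8.0 on the same node reaches the limit.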

Metric dimensions

The preceding metrics are monitored by node IP address. You can use the following methods to configure alerting based on the IP addresses:

Traversal: traverses the IP address of each node and monitors the metric data of each node.
=: specifies specific nodes for monitoring and alerting. Example: =172.20.XX.XX.
No dimension: aggregates and monitors the metric data of all nodes.

Application-provided services

Metrics

Metric	Unit	Common	Description
Number of Calls	None	Yes	The number of application entry calls, including HTTP and Dubbo calls. You can use this metric to analyze the number of calls of the application, estimate the business volume, and check whether exceptions occur in the application.
Number of Error Calls	None	Yes	The number of error application entry calls, including HTTP and Dubbo calls. If the status code 400 is returned or the application entry call is intercepted by the top layer of Dubbo, the call is considered as an error. You can use this metric to check whether the application has call errors.
Call Error Rate	%	Yes	The error rate of application entry calls is calculated by using the following formula: Error rate = Number of error application entry calls/Total number of application entry calls × 100%.
Call Response Time	Milliseconds	Yes	The response time of an application entry call, such as an HTTP call or a Dubbo call. You can use this metric to check for slow requests and exceptions.
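
The Call Error Rate formula in the preceding table can be sketched in Python as follows. The function name is hypothetical, and returning 0 when no calls were made is an assumption, because the formula itself does not define that case.

```python
def call_error_rate(error_calls: int, total_calls: int) -> float:
    """Error rate = error calls / total calls x 100, as a percentage."""
    if total_calls == 0:
        # Assumption: no traffic is reported as a 0% error rate.
        return 0.0
    return error_calls / total_calls * 100.0
```

For example, 50 error calls out of 200 entry calls gives an error rate of 25%.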

Metric dimensions

The preceding metrics are monitored by API operation. You can use the following methods to configure alerting based on the API operations:

Traversal: traverses the accessed API operations and monitors the metric data of each API operation.
=: specifies specific API operations for monitoring and alerting. Example: =/tb/api/users/{userId}.
!=: excludes specific API operations from monitoring and alerting, and separately monitors the other API operations. Example: !=/tb/api/users/{userId}.
Inclusion: monitors API operations that contain a specific keyword. Example: Include api.
Exclusion: monitors API operations that do not contain a specific keyword. Example: Exclude api.
Regular expression: monitors the API operations that match the specified regular expression. Example: =/(api)/i.
No dimension: aggregates and monitors the metric data of all API operations.

The preceding metrics are monitored by API request type. You can use the following methods to configure alerting based on the API request types:

Traversal: traverses the API request types and separately monitors the metric data of each type, such as HTTP, MySQL, and Redis.
=: specifies specific API request types for monitoring and alerting. Example: =http.
No dimension: aggregates and monitors the metric data of all API request types.

Thread pools

Metrics

Metric	Common	Description
Number of Core Threads	Yes	The number of threads that are always active in the thread pool.
Maximum Number of Threads	Yes	The maximum number of threads that can exist simultaneously in the thread pool.
Number of Active Threads	Yes	The number of threads that are executing tasks. You can use this metric to monitor the status of the thread pool and evaluate the performance of the thread pool.
Queue Size	Yes	The size of the thread queue depends on the application requirements and system resource availability. In multithreaded programming, if the queue size is excessively small, tasks may queue for a long time. This reduces the performance of the programs. If the queue size is excessively large, a large amount of system resources may be consumed. This causes system crashes or performance degradation.
Current Number of Threads	Yes	The number of threads that are running or waiting to run.
Number of Executed Tasks	Yes	The number of tasks that have been executed and completed in a task queue or the thread pool. You can use this metric to evaluate the performance of the task queue or thread pool.
Thread Pool Usage	Yes	The ratio between the number of threads in use in the thread pool and the total number of threads in the thread pool.

Metric dimensions

The preceding metrics are monitored by node IP address. You can use the following methods to configure alerting based on the IP addresses:

Traversal: traverses the IP address of each node and monitors the metric data of each node.
=: specifies specific nodes for monitoring and alerting. Example: =172.20.XX.XX.
No dimension: aggregates and monitors the metric data of all nodes.

The preceding metrics are monitored by thread pool. You can use the following methods to configure alerting based on the thread pools:

Traversal: traverses the thread pools and monitors the metric data of each thread pool.
=: specifies specific thread pools for monitoring and alerting. Example: =pool-*-thread-*.
No dimension: aggregates and monitors the metric data of all thread pools.

The preceding metrics are monitored by thread pool type. You can use the following methods to configure alerting based on the thread pool types:

Traversal: traverses the thread pool types and monitors the metric data of each thread pool type.
=: specifies specific thread pool types for monitoring and alerting. Example: =FixedThreadPool.
No dimension: aggregates and monitors the metric data of all thread pool types.

HTTP status codes

Metrics

Metric	Common	Description
Number of HTTP Requests Returning 4XX Status Code	Yes	The number of HTTP requests for which status codes 4XX are returned. 4XX status codes indicate client errors, for example, the requested resource does not exist or the required parameters are missing. Common 4XX status codes include 400 and 404.
Number of HTTP Requests Returning 5XX Status Code	Yes	The number of HTTP requests for which status codes 5XX are returned. 5XX status codes indicate server errors, for example, an internal server error has occurred or the system is busy. Common 5XX status codes include 500 and 503.
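
To illustrate how the two counters relate to individual responses, the following Python sketch groups a list of HTTP status codes into the 4XX and 5XX classes. The function name and the sample data are hypothetical. ARMS collects these counts itself.

```python
def count_error_classes(status_codes):
    """Count responses in the 4XX and 5XX status code classes."""
    counts = {"4XX": 0, "5XX": 0}
    for code in status_codes:
        if 400 <= code <= 499:
            counts["4XX"] += 1
        elif 500 <= code <= 599:
            counts["5XX"] += 1
        # Other classes, such as 2XX and 3XX, are not counted.
    return counts
```

For the codes 200, 404, 400, 500, 503, and 302, the sketch reports two 4XX responses and two 5XX responses.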

Metric dimensions

The preceding metrics are monitored by API operation. You can use the following methods to configure alerting based on the API operations:

Traversal: traverses the accessed API operations and monitors the metric data of each API operation.
=: specifies specific API operations for monitoring and alerting. Example: =/tb/api/users/{userId}.
!=: excludes specific API operations from monitoring and alerting, and separately monitors the other API operations. Example: !=/tb/api/users/{userId}.
Inclusion: monitors API operations that contain a specific keyword. Example: Include api.
Exclusion: monitors API operations that do not contain a specific keyword. Example: Exclude api.
Regular expression: monitors the API operations that match the specified regular expression. Example: =/(api)/i.
No dimension: aggregates and monitors the metric data of all API operations.

Databases

Metrics

Metric	Unit	Common	Description
Number of Database Requests	None	Yes	The number of requests that the application sends to a database during runtime. Each request contains a read or write operation. The number of database requests affects the performance and response time of the application.
Number of Database Request Errors	None	Yes	The number of errors that occur when the application requests the database during runtime, such as database connection failures, query statement errors, and insufficient permissions. A large number of database request errors indicates that the interaction between the application and the database is abnormal. In this case, the application cannot run as expected.
Database Request Response Time	Milliseconds	Yes	The time interval between the time that the application sends a request to a database and the time that the database returns a response. The response time of database requests affects the application performance and user experience. If the response time is excessively long, the application may stutter or slow down.

Metric dimensions

The preceding metrics are monitored by database. You can use the following methods to configure alerting based on the databases:

Traversal: traverses the databases and monitors the metric data of each database.
=: specifies specific databases for monitoring and alerting. Example: =mysql-pod:3306(demo_db).
No dimension: aggregates and monitors the metric data of all databases.