Application Real-Time Monitoring Service: Best practices of alerting

Last Updated: Aug 23, 2024

The Application Monitoring sub-service of Application Real-Time Monitoring Service (ARMS) allows you to configure alerting for the preset metrics. You can also use Prometheus Query Language (PromQL) statements to configure advanced alerting because Application Monitoring data is integrated into Managed Service for Prometheus by default. This topic provides a set of alert configurations and sample PromQL statements to meet the requirements for O&M and emergency response.

Prerequisites

Your application is monitored by Application Monitoring. For more information, see Application Monitoring overview.

Basic alerting

Strategy

To ensure service stability and meet Service-Level Agreement (SLA) requirements, timely alerting is essential for emergency response. This topic provides a hierarchical system that vertically covers the business, application, and infrastructure layers to enable quick emergency response and troubleshooting.

In this example, the following metrics are used. For information about all preset metrics provided by Application Monitoring, see Alert rule metrics.

Metric

Description

Number of Calls

The number of entry calls, including HTTP and Dubbo calls. You can use this metric to analyze the number of calls of the application, estimate the business volume, and check whether exceptions occur in the application.

Call Error Rate (%)

The error rate of entry calls. The value is calculated by using the following formula: Error rate = Number of failed entry calls/Total number of entry calls × 100%.

Call Response Time

The response time of an entry call, such as an HTTP call or a Dubbo call. You can use this metric to check for slow requests and exceptions.

Number of Exceptions

The number of exceptions that occur during software runtime, such as null pointer exceptions, array out-of-bounds exceptions, and I/O exceptions. You can use this metric to check whether a call stack throws errors and whether application call errors occur.

Number of HTTP Requests Returning 5XX Status Codes

The number of HTTP requests for which status codes 5XX are returned. 5XX status codes indicate that internal server errors have occurred, or the system is busy. Common 5XX status codes include 500 and 503.

Database Request Response Time

The time interval between the time when the application sends a request to a database and the time when the database returns a response. The response time of database requests affects application performance and user experience. If the response time is excessively long, the application may stutter or slow down.

Downstream Service Call Error Rate (%)

The value of this metric is calculated by using the following formula: Error rate of downstream service calls = Number of failed downstream service requests/Total number of downstream service requests × 100%. You can use this metric to check whether errors in downstream services increase and affect the application.

Average Response Time of Downstream Service Calls (ms)

The average response time of downstream service calls. You can use this metric to check whether the time consumed by the downstream services increases and affects the application.

Number of JVM Full GCs (Instantaneous Value)

The number of full garbage collections (GCs) performed by the JVM in the last N minutes. If full GCs frequently occur in your application, exceptions may occur.

Number of Runnable JVM Threads

The number of runnable threads in the JVM. If excessive threads are created, a large amount of memory is consumed, and the system may slow down or crash.

Thread Pool Usage

The ratio between the number of threads in use in the thread pool and the total number of threads in the thread pool.

Node CPU Utilization (%)

The CPU utilization of the node. Each node is a server. Excessive CPU utilization may cause problems such as slow system response and service unavailability.

Node Disk Utilization (%)

The ratio of used disk space to total disk space. The higher the disk utilization, the less free storage space remains on the node.

Node Memory Usage (%)

The percentage of memory in use. If the memory usage of the node exceeds 80%, you need to reduce memory pressure by adjusting the configurations of the node or optimizing the memory usage of tasks.

Business

You can configure alerting for interfaces related to the key business. In the e-commerce industry, alerting can be configured for interfaces related to the business volume. In the game industry, alerting can be configured for the interfaces related to sign-in.

The following example describes alert configurations for an interface that adds a product to the cart in an e-commerce scenario.

  • The number of interface calls is the most commonly used metric. Interface calls usually drop when the business is affected; if they surge beyond capacity, the business is overloaded. You can configure a data range to trigger alerts when the number of interface calls falls outside that range.

    Because alerts may be triggered too easily during off-peak hours, such as midnight, we recommend that you configure a lower limit together with a period-over-period (chain) comparison to detect a rapid, abnormal decrease in interface calls. A PromQL sketch of such a lower-limit alert follows this list.

  • Configure an upper limit for the error rate of interface calls.

  • In addition, you can configure other metrics based on your business requirements. If your business is latency-sensitive, you can configure alerting for the response time or the number of slow calls.

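If you prefer to monitor the same condition in Managed Service for Prometheus, the following is a minimal sketch of a lower-limit alert expressed in PromQL, based on the sample statements provided later in this topic. The mall/pay interface name and the threshold of 10 calls per minute are hypothetical placeholder values.

    # Hypothetical example: alert when the mall/pay interface receives
    # fewer than 10 calls in the last minute.
    sum by (rpc) (sum_over_time_lorc(arms_http_requests_count{rpc="mall/pay"}[1m])) < 10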

Application

When the business is affected, you can troubleshoot problems based on the application metrics.

  • Generally, spikes in exceptions accompany bugs introduced by releases or updates, or abnormal downstream services. Exceptions are crucial to troubleshooting. To monitor exception spikes, we recommend that you configure an upper limit together with a period-over-period (chain) comparison to detect a rapid increase in exceptions.

  • An increase in the number of exceptions does not necessarily indicate that the application has encountered problems, because graceful degradation handles exceptions without disrupting the application. However, some exceptions may go unnoticed and affect the result returned by an interface call, which constitutes an error. Therefore, you can also configure an upper limit for the error rate.

  • For HTTP services, you can focus on HTTP status codes. Generally, 4xx status codes indicate client-side errors, whereas 5xx status codes indicate server-side errors. Therefore, we recommend that you configure an upper limit together with a period-over-period (chain) comparison to monitor an increase in 5xx status codes.

  • When the application encounters problems or interface calls increase, the overall response time often increases greatly. Therefore, you can configure an upper limit for the response time over a specified time period based on your needs, for example, for the average response time within the last minute. If the business fluctuates frequently, you can specify a longer time period, such as 5 or 10 minutes.

  • An increase in application response time generally has external and internal causes. External causes are usually concentrated on the databases or downstream services on which the application depends.

    In most cases, database issues increase the response time by dozens or even hundreds of times, which easily triggers alerts. Therefore, you can configure a relatively high upper limit for the response time of database calls.

    Then, you can configure upper limits for the response time and error rate of downstream service calls.

    Configure an upper limit for the response time of downstream service calls

    Configure an upper limit for the error rate of downstream service calls

  • To troubleshoot internal application issues, you can enable Java virtual machine (JVM) and thread pool monitoring.

    Although the JVM provides a variety of metrics, we recommend that you configure alerting for the number of Full GC events and traverse all nodes. Because both continuous Full GC events and frequent Full GC events within a short period of time are abnormal, you can configure two alerting conditions to monitor them. A PromQL sketch of the frequent-Full-GC condition follows this list.

    Excessive runnable JVM threads consume a large amount of memory. Conversely, if no JVM thread is runnable, the service is abnormal. Therefore, you can add an alerting condition that checks for the absence of runnable JVM threads.

    To prevent thread pools from staying continuously full, you can configure alerting for the thread pool usage, the number of active threads, or the maximum number of threads within a period of time. If the thread pool size is not specified, the reported size is 2147483647, the maximum value of a 32-bit signed integer. In this case, we recommend that you do not use the thread pool size (maximum number of threads) as an alerting condition.

    Note

    In the Metric Type section, ThreadPool_Monitoring is available for the ARMS agent V3.x, and Thread_Pool_Version_2 is available for the ARMS agent V4.x.

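As a reference, the Full GC condition described above can be approximated in Managed Service for Prometheus by using the arms_jvm_gc_delta metric listed later in this topic. This is a minimal sketch; the threshold of 2 Full GCs per minute is an assumed value that you should tune for your application.

    # Hypothetical example: alert when a process performs more than
    # 2 Full GCs in the last minute, grouped by host and process ID.
    sum by (serverIp, pid) (sum_over_time_lorc(arms_jvm_gc_delta{gen="old"}[1m])) > 2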

Infrastructure

For applications deployed on Elastic Compute Service (ECS) instances, ARMS collects node information to trigger alerts. We recommend that you configure upper limits for the most significant alerting conditions: CPU utilization, memory usage, and disk utilization.

Because the CPU utilization of nodes fluctuates greatly, check whether the CPU utilization continuously reaches the upper limit over a period of time, and configure similar upper limits for the following node metrics.

Node memory usage

Node disk utilization

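The node CPU condition can also be approximated in PromQL by using the system metrics listed later in this topic. The following is a minimal sketch; the threshold of 80 is an assumed percentage value, not an official recommendation.

    # Hypothetical example: alert when the CPU utilization of a node
    # (system + user + I/O wait) exceeds 80 in the last minute.
    max by (serverIp) (last_over_time_lorc(arms_system_cpu_system[1m])) + max by (serverIp) (last_over_time_lorc(arms_system_cpu_user[1m])) + max by (serverIp) (last_over_time_lorc(arms_system_cpu_io_wait[1m])) > 80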

For applications deployed in Container Service for Kubernetes (ACK) clusters and monitored by Managed Service for Prometheus, we recommend that you configure Prometheus alerts. For more information, see Monitor an ACK cluster.

For applications deployed in ACK clusters but not monitored by Managed Service for Prometheus, the ARMS agent V4.1.0 and later collects the CPU and memory information of the clusters for monitoring and alerting. Because appropriate upper limits for the CPU utilization and memory usage of ACK clusters depend on the workload, no reference values are provided. Configure the upper limits based on the request volume and resource size of your clusters.

CPU utilization of ACK clusters

Memory usage of ACK clusters

Additional information

Filter conditions

  • Traversal: traverses all nodes or interfaces, which is similar to the GROUP BY clause of SQL. Note that the filter condition is not suitable for all interfaces.

  • =: specifies the most significant nodes or interfaces, which is similar to the WHERE clause of SQL.

  • No dimension: monitors the metric as a whole. For host metrics such as CPU utilization, the host with the highest CPU utilization is monitored. For service call metrics, overall service calls are monitored. For response time metrics, the average response time is monitored.

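In terms of the sample PromQL statements provided later in this topic, the three filter conditions roughly map to the following constructs. The mall/pay value is a hypothetical placeholder.

    # Traversal: group by a dimension, similar to GROUP BY.
    sum by (rpc) (sum_over_time_lorc(arms_http_requests_count[1m]))
    # =: filter on a specific value, similar to WHERE.
    sum(sum_over_time_lorc(arms_http_requests_count{rpc="mall/pay"}[1m]))
    # No dimension: aggregate the entire metric.
    sum(sum_over_time_lorc(arms_http_requests_count[1m]))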

Alerting conditions

  • AVG/SUM/MAX/MIN: triggers alerts if the average value, sum of all values, maximum value, or minimum value of a metric in the last X minutes reaches the limit.

  • CONTINUOUS: triggers alerts if the value of a metric continuously reaches the limit in the last X minutes. This alerting condition is often used in scenarios with large fluctuations, where instantaneous values may change greatly.

  • Pxx: specifies a percentile, such as P99. This alerting condition is often used for latency metrics.

Note

The minimum time period that can be specified for a metric is 1 minute. When 1 minute is specified, the AVG, SUM, MAX, MIN, and CONTINUOUS alerting conditions behave identically.

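In standard PromQL, these alerting conditions roughly correspond to range-vector aggregation functions. The following sketch uses the standard avg_over_time, max_over_time, and min_over_time functions with a placeholder metric name, for illustration only.

    # AVG: the average value in the last 5 minutes reaches the limit.
    avg_over_time(some_metric[5m]) > 100
    # MAX: the maximum value in the last 5 minutes reaches the limit.
    max_over_time(some_metric[5m]) > 100
    # CONTINUOUS: the value stays above the limit for the whole window,
    # which is equivalent to the minimum exceeding the limit.
    min_over_time(some_metric[5m]) > 100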

Threshold

Application Monitoring provides recommended thresholds to meet the monitoring requirements for a variety of services or scenarios.

Advanced alerting

You can use Managed Service for Prometheus to configure alerting for metrics through PromQL.

After an application is connected to Application Monitoring, Managed Service for Prometheus automatically creates a Prometheus instance in the region to store the metric data.

In addition to the preset alert rules of Application Monitoring, you can configure advanced Prometheus alerting.

Take the JVM Heap Memory Usage metric as an example. In Application Monitoring, you can monitor the metric of only one application in a region. However, Prometheus alerting allows you to monitor the metric of all applications in the region by using the following PromQL statement:

    max by (serverIp,pid) (last_over_time(arms_jvm_mem_used_bytes{area="heap",id="eden"}[1m]))

For information about the metrics that Application Monitoring supports, see Application Monitoring metrics. We recommend that you use metrics listed in this topic because metrics that are not listed may be incompatible with Application Monitoring in the future.

Sample PromQL statements

Business metrics

Metrics

Metric

PromQL

Number of HTTP Interface Calls

sum by ($dims) (sum_over_time_lorc(arms_http_requests_count{$labelFilters}[1m]))

Response Time of HTTP Interface Calls

sum by ($dims) (sum_over_time_lorc(arms_http_requests_seconds{$labelFilters}[1m])) / sum by ($dims) (sum_over_time_lorc(arms_http_requests_count{$labelFilters}[1m]))

Number of HTTP Interface Call Errors

sum by ($dims) (sum_over_time_lorc(arms_http_requests_error_count{$labelFilters}[1m]))

Number of Slow HTTP Interface Calls

sum by ($dims) (sum_over_time_lorc(arms_http_requests_slow_count{$labelFilters}[1m]))

Dimensions

  • Similar to the GROUP BY clause of SQL, $dims is used for grouping.

  • Similar to the WHERE clause of SQL, $labelFilters is used for filtering.

Dimension name

Dimension key

Service name

service

Service PID

pid

Server IP address

serverIp

Interface

rpc

Examples:

  • Calculate the number of HTTP interface calls on the host whose IP address is 127.0.0.1 and group the results by interface.

    sum by (rpc) (sum_over_time_lorc(arms_http_requests_count{serverIp="127.0.0.1"}[1m]))
  • Calculate the number of HTTP interface calls for the mall/pay interface and group the results by host.

    sum by (serverIp) (sum_over_time_lorc(arms_http_requests_count{rpc="mall/pay"}[1m]))

JVM metrics

Metrics

Metric

PromQL

Total JVM Heap Memory

max by ($dims) (last_over_time_lorc(arms_jvm_mem_used_bytes{area="heap",id="old",$labelFilters}[1m])) + max by ($dims) (last_over_time_lorc(arms_jvm_mem_used_bytes{area="heap",id="eden",$labelFilters}[1m])) + max by ($dims) (last_over_time_lorc(arms_jvm_mem_used_bytes{area="heap",id="survivor",$labelFilters}[1m]))

Number of JVM Young GCs

sum by ($dims) (sum_over_time_lorc(arms_jvm_gc_delta{gen="young",$labelFilters}[1m]))

Number of JVM Full GCs

sum by ($dims) (sum_over_time_lorc(arms_jvm_gc_delta{gen="old",$labelFilters}[1m]))

Young GC Duration

sum by ($dims) (sum_over_time_lorc(arms_jvm_gc_seconds_delta{gen="young",$labelFilters}[1m]))

Full GC Duration

sum by ($dims) (sum_over_time_lorc(arms_jvm_gc_seconds_delta{gen="old",$labelFilters}[1m]))

Number of Active Threads

max by ($dims) (last_over_time_lorc(arms_jvm_threads_count{state="live",$labelFilters}[1m]))

Heap Memory Usage

(max by ($dims) (last_over_time_lorc(arms_jvm_mem_used_bytes{area="heap",id="old",$labelFilters}[1m])) + max by ($dims) (last_over_time_lorc(arms_jvm_mem_used_bytes{area="heap",id="eden",$labelFilters}[1m])) + max by ($dims) (last_over_time_lorc(arms_jvm_mem_used_bytes{area="heap",id="survivor",$labelFilters}[1m])))/max by ($dims) (last_over_time_lorc(arms_jvm_mem_max_bytes{area="heap",id="total",$labelFilters}[1m]))

Dimensions

Dimension name

Dimension key

Service name

service

Service PID

pid

Server IP address

serverIp
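
For example, to turn the Number of Active Threads template above into a concrete statement, substitute $dims and $labelFilters. The service name mall-center below is a hypothetical placeholder.

    # Hypothetical example: number of active threads per process of the
    # mall-center service in the last minute.
    max by (pid) (last_over_time_lorc(arms_jvm_threads_count{state="live",service="mall-center"}[1m]))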

System metrics

Metrics

Metric

PromQL

CPU Utilization

max by ($dims) (last_over_time_lorc(arms_system_cpu_system{$labelFilters}[1m])) + max by ($dims) (last_over_time_lorc(arms_system_cpu_user{$labelFilters}[1m])) + max by ($dims) (last_over_time_lorc(arms_system_cpu_io_wait{$labelFilters}[1m]))

Memory Usage

max by ($dims) (last_over_time_lorc(arms_system_mem_used_bytes{$labelFilters}[1m]))/max by ($dims) (last_over_time_lorc(arms_system_mem_total_bytes{$labelFilters}[1m]))

Disk Utilization

max by ($dims) (last_over_time_lorc(arms_system_disk_used_ratio{$labelFilters}[1m]))

System Load

max by ($dims) (last_over_time_lorc(arms_system_load{$labelFilters}[1m]))

Number of Inbound Network Errors

max by ($dims) (max_over_time_lorc(arms_system_net_in_err{$labelFilters}[1m]))

Dimensions

Dimension name

Dimension key

Service name

service

Service PID

pid

Server IP address

serverIp
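
For example, the Memory Usage template can be instantiated as follows. The IP address 127.0.0.1 is a placeholder.

    # Hypothetical example: memory usage of the node 127.0.0.1.
    max by (serverIp) (last_over_time_lorc(arms_system_mem_used_bytes{serverIp="127.0.0.1"}[1m])) / max by (serverIp) (last_over_time_lorc(arms_system_mem_total_bytes{serverIp="127.0.0.1"}[1m]))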

Thread pool and connection pool metrics

Metrics

ARMS agent V4.1.x and later

Metric

PromQL

Thread Pool Usage

avg by ($dims) (avg_over_time_lorc(arms_thread_pool_active_thread_count{$labelFilters}[1m]))/avg by ($dims) (avg_over_time_lorc(arms_thread_pool_max_pool_size{$labelFilters}[1m]))

Connection Pool Usage

avg by ($dims) (avg_over_time_lorc(arms_connection_pool_connection_count{state="used",$labelFilters}[1m]))/avg by ($dims) (avg_over_time_lorc(arms_connection_pool_connection_max_count{$labelFilters}[1m]))

ARMS agent earlier than V4.1.x

ARMS agents earlier than V4.1.x use the same metrics for thread pools and connection pools. When you use PromQL statements, you must specify the ThreadPoolType parameter. Valid values: Tomcat, apache-http-client, Druid, SchedulerX, okhttp3, and Hikaricp. For more information about the frameworks supported by thread pool and connection pool monitoring, see Thread pool and connection pool monitoring.

Metric

PromQL

Thread Pool Usage

avg by ($dims) (avg_over_time_lorc(arms_threadpool_active_size{ThreadPoolType="$ThreadPoolType",$labelFilters}[1m]))/avg by ($dims) (avg_over_time_lorc(arms_threadpool_max_size{ThreadPoolType="$ThreadPoolType",$labelFilters}[1m]))

Dimensions

Dimension name

Dimension key

Service name

service

Service PID

pid

Server IP address

serverIp

Thread pool name (ARMS agent earlier than V4.1.x)

name

Thread pool type (ARMS agent earlier than V4.1.x)

type
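
For example, for an ARMS agent earlier than V4.1.x that monitors a Tomcat thread pool, the Thread Pool Usage template can be instantiated as follows. The service name mall-center and the 0.8 threshold are hypothetical placeholder values.

    # Hypothetical example: alert when the usage of a Tomcat thread pool
    # of the mall-center service exceeds 80% in the last minute.
    avg by (name) (avg_over_time_lorc(arms_threadpool_active_size{ThreadPoolType="Tomcat",service="mall-center"}[1m])) / avg by (name) (avg_over_time_lorc(arms_threadpool_max_size{ThreadPoolType="Tomcat",service="mall-center"}[1m])) > 0.8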