To maintain stable business operations, you need to monitor the resource usage and business request response status of your instances and configure corresponding alert rules based on your actual requirements. This way, you can take measures at the earliest opportunity when resources are insufficient or business is affected.
System metrics
CPU and load
This module monitors the CPU utilization and load of the system, including metrics such as CPU utilization, CPU WIO utilization, CPU idle rate, and average load. CPU utilization includes CPU utilization User(%) and CPU utilization System(%).
When you configure alert rules, we recommend that you set the alert threshold for CPU idle rate(%) based on your business characteristics and sensitivity to latency. Generally, an increase in CPU utilization leads to longer response time (RT), but the impact varies depending on the type of your business. For example, when CPU utilization exceeds 40%, online business may experience latency, while offline business may remain unaffected even at 100% CPU utilization. Therefore, we recommend that you set the alert threshold based on your business situation. If CPU utilization is too high, you can scale up or upgrade the instance to meet business demands. For more information about how to scale up or upgrade the instance, see Manage instance storage and Modify the configurations of an instance.
CPU WIO usage(%) indicates the proportion of time that the CPU spends waiting for I/O operations. A high value indicates a bottleneck in disk read/write operations. You can analyze CPU WIO usage(%) together with Average load per minute(load1) (referred to as the load metric below) to assess the health status of a machine. The load metric reflects both CPU utilization and disk usage.
The acceptable load typically depends on CPU configuration. For example, if your machine has 8 CPU cores, a load value greater than 8 indicates that CPU processing tasks start to queue and the machine is in a suboptimal state. If CPU utilization is low but the load value is high, it indicates that disk usage is too high. When CPU load or WIO utilization is too high, we recommend that you scale up or upgrade your instance to avoid affecting your business.
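The following minimal sketch shows how these rules of thumb can be combined into a single check. The threshold values, sample metric values, and the check_cpu_health function are illustrative assumptions, not metrics or APIs provided by Lindorm.

```python
# Illustrative health check that combines CPU utilization, WIO, and the
# load-per-core rule of thumb described above. All thresholds and sample
# values are assumptions for demonstration only.

def check_cpu_health(cpu_user_pct, cpu_system_pct, cpu_wio_pct, load1, cpu_cores,
                     busy_threshold_pct=60.0, wio_threshold_pct=20.0):
    """Return a list of findings based on the rules of thumb above."""
    findings = []
    cpu_busy = cpu_user_pct + cpu_system_pct
    if cpu_busy > busy_threshold_pct:
        findings.append(f"CPU utilization {cpu_busy:.1f}% exceeds {busy_threshold_pct}%")
    if cpu_wio_pct > wio_threshold_pct:
        findings.append(f"CPU WIO {cpu_wio_pct:.1f}% suggests a disk I/O bottleneck")
    if load1 > cpu_cores:
        # Load above the number of cores means tasks are queuing for the CPU.
        findings.append(f"load1 {load1:.1f} exceeds the core count {cpu_cores}")
        if cpu_busy <= busy_threshold_pct:
            findings.append("high load with low CPU utilization usually points to disk pressure")
    return findings

# Example: an 8-core node with moderate CPU usage but high WIO and load.
print(check_cpu_health(cpu_user_pct=30, cpu_system_pct=10, cpu_wio_pct=35,
                       load1=12.5, cpu_cores=8))
```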
Network and disk
This module monitors the network and disk performance of the machine. You need to take note of key metrics such as network traffic, disk read/write operations, and IOPS, and make sure that values of these metrics remain below the throttling thresholds of Elastic Compute Service (ECS) instances and cloud disks.
Different types of ECS instances have different network bandwidths. For network throttling thresholds, see Overview of instance families. For disk throttling thresholds, see Block storage performance. If you have questions about network and disk limits, contact Lindorm technical support (DingTalk ID: s0s3eg3).
You can refer to the corresponding ECS performance parameters based on the storage type of your Lindorm instance:
Performance storage: Refer to the performance parameters of SSDs.
Standard storage: Refer to the performance parameters of ESSDs.
Local disk: Refer to the performance parameters of local disks.
Non-local ECS disks have a limit on the total bandwidth. If the combined read and write traffic exceeds the ECS disk bandwidth limit, throttling may occur and impact business operations. During use, you need to closely monitor disk and network usage to prevent exceeding the network and disk limits of the underlying machine.
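As a rough illustration of this point, the following sketch compares the combined disk read and write throughput and the network traffic of a node with per-instance limits. The limit values below are placeholders; use the values published for your actual ECS instance family and disk category.

```python
# Compare observed throughput with the throttling limits of the underlying
# ECS instance and cloud disk. The limits below are placeholder assumptions.

DISK_BANDWIDTH_LIMIT_MBPS = 350.0      # assumed total read + write limit of the cloud disk
NETWORK_BANDWIDTH_LIMIT_MBPS = 1250.0  # assumed ECS network bandwidth (10 Gbit/s)

def check_throttling_risk(disk_read_mbps, disk_write_mbps, net_mbps, headroom=0.8):
    """Warn when usage exceeds `headroom` (for example, 80%) of the published limits."""
    issues = []
    disk_total = disk_read_mbps + disk_write_mbps
    if disk_total > DISK_BANDWIDTH_LIMIT_MBPS * headroom:
        issues.append(f"disk throughput {disk_total:.0f} MB/s is above "
                      f"{headroom:.0%} of the {DISK_BANDWIDTH_LIMIT_MBPS:.0f} MB/s limit")
    if net_mbps > NETWORK_BANDWIDTH_LIMIT_MBPS * headroom:
        issues.append(f"network traffic {net_mbps:.0f} MB/s is above "
                      f"{headroom:.0%} of the {NETWORK_BANDWIDTH_LIMIT_MBPS:.0f} MB/s limit")
    return issues

print(check_throttling_risk(disk_read_mbps=180, disk_write_mbps=120, net_mbps=400))
```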
Cluster storage details
This module monitors the storage usage of your instance. You need to take note of key metrics, such as Storage (hot storage) water level(%) and Cold storage water level(%). When either water level exceeds 95%, the system automatically blocks write operations.
We recommend that you set the alert thresholds to a value between 75% and 80% and pay close attention to alert notifications. When a water level reaches the alert threshold, scale up your instance at the earliest opportunity to avoid affecting your business.
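The following is a minimal sketch of such a water level check. The 80% alert threshold and the sample values are assumptions within the recommended range.

```python
# Evaluate hot and cold storage water levels against the write-block level
# and an alert threshold in the recommended 75%-80% range. Sample values are
# assumptions for illustration.

WRITE_BLOCK_THRESHOLD = 95.0   # writes are blocked above this level
ALERT_THRESHOLD = 80.0         # assumed alert threshold within 75%-80%

def check_storage_water_level(hot_pct, cold_pct):
    for name, pct in (("hot storage", hot_pct), ("cold storage", cold_pct)):
        if pct >= WRITE_BLOCK_THRESHOLD:
            print(f"{name} water level {pct:.1f}%: writes are blocked, scale up immediately")
        elif pct >= ALERT_THRESHOLD:
            print(f"{name} water level {pct:.1f}%: above the alert threshold, plan a scale-up")
        else:
            print(f"{name} water level {pct:.1f}%: OK")

check_storage_water_level(hot_pct=82.3, cold_pct=41.0)
```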
LindormTable metrics
Cluster load
This module includes the following metrics:
LindormTable compute node memory usage ratio(%): the ratio of heap memory currently used by LindormTable. If the heap usage ratio remains high for a long time, LindormTable may experience out-of-memory (OOM) errors or full garbage collection (GC), which affects your business. Heap memory usage fluctuates: if heap memory is temporarily overused, the system reclaims it through mechanisms such as GC. However, if heap usage continuously exceeds a specific threshold, attention is required. In this case, we recommend that you upgrade the specifications of LindormTable nodes to increase memory. When you create an alert rule, we recommend that you specify that the alert condition is met when this ratio is greater than a value between 85% and 90% and lasts for 30 to 60 minutes. For more information about how to upgrade specifications, see Change the engine specification of an instance.
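The following sketch illustrates this kind of sustained-threshold alert condition: it triggers only when the memory usage ratio stays above the threshold for the entire duration window, so short spikes that GC reclaims are ignored. The sampling interval, thresholds, and sample series are assumptions.

```python
# Minimal sketch of a "sustained threshold" alert: trigger only when the heap
# usage ratio stays above the threshold for the whole duration window
# (for example, above 85% for 30 minutes). Samples are assumed to be one
# data point per minute.

def sustained_breach(samples_pct, threshold_pct=85.0, duration_minutes=30):
    """Return True if the last `duration_minutes` samples all exceed the threshold."""
    if len(samples_pct) < duration_minutes:
        return False
    window = samples_pct[-duration_minutes:]
    return all(value > threshold_pct for value in window)

# A temporary spike that GC reclaims should not trigger the alert.
spike = [70] * 40 + [92] * 5 + [72] * 15
print(sustained_breach(spike))        # False

# Heap usage stuck above 85% for more than 30 minutes should trigger it.
sustained = [70] * 20 + [91] * 40
print(sustained_breach(sustained))    # True
```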
Number of regions of RS (unit): the number of data shards (Regions) on each LindormTable node. LindormTable divides tables into shards by range and distributes shards across machines, with centralized allocation managed by the master node. Each shard occupies metadata memory space, so an excessive number of shards can lead to insufficient machine memory. You need to control the number of shards, for example, by reducing the number of tables or reducing the number of pre-partitions when you create tables.
The following table describes the recommended number of shards per machine for different configurations.
Machine configuration (memory size) | Recommended number of shards per machine
8 GB | Less than 500
16 GB | Less than 1000
32 GB | Less than 2000
64 GB | Less than 3000
128 GB | Less than 5000
The above values are only for reference. In actual use, you can determine whether the instance has insufficient memory based on the following formula: Used memory of the LindormTable compute node / Total memory of the LindormTable compute node.
HandlerQueue Length (unit): the number of requests queued on the server. If the HandlerQueue length is greater than 0, requests are queuing on the server, which indicates that server resources are insufficient to handle the current requests in a timely manner. We recommend that you upgrade the instance configuration to increase CPU resources.
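The following minimal sketch combines the shard count reference table and the memory usage formula above. The recommended limits follow the table; the input values are examples only.

```python
# Check the number of regions on a node against the reference table above and
# compute the memory usage ratio from the formula
#   used memory of the LindormTable compute node / total memory of the node.
# The input values are examples only.

RECOMMENDED_MAX_REGIONS = {8: 500, 16: 1000, 32: 2000, 64: 3000, 128: 5000}

def check_node(memory_gb, region_count, used_memory_gb):
    limit = RECOMMENDED_MAX_REGIONS.get(memory_gb)
    if limit is not None and region_count >= limit:
        print(f"{region_count} regions exceeds the ~{limit} recommended for {memory_gb} GB nodes; "
              "reduce the number of tables or pre-partitions")
    usage_ratio = used_memory_gb / memory_gb
    print(f"memory usage ratio: {usage_ratio:.0%}")

check_node(memory_gb=32, region_count=2300, used_memory_gb=27.5)
```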
Compaction Queue Length (unit): the number of compaction tasks queued on the server. When write operations increase, more compaction operations need to be executed, which may cause tasks to queue.
Note: A Compaction Queue Length value greater than 0 does not necessarily indicate that the instance is in an unhealthy state. If your business has distinct periodic write patterns, for example, daytime peaks and nighttime troughs, compaction tasks may accumulate during the daytime write peak and cause the Compaction Queue Length value to exceed 0. The system automatically processes the accumulated tasks at night, during which the value drops back to 0. In this case, the instance is healthy. Additionally, if the compaction queue length remains stable at a specific value for a long time, the instance is in a stable state and does not require attention.
If the Compaction Queue Length value continues to rise without a downward trend, it indicates that instance resources are insufficient. You need to add nodes or upgrade configurations to increase CPU resources so that the system can process compaction tasks in a timely manner. While short-term accumulation of compaction tasks will not affect business, long-term accumulation may lead to an excessive number of files within shards, which may increase read RT. If the number of files continues to grow, write backpressure will occur, resulting in increased write RT or even timeouts.
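One way to tell a normal periodic backlog from a continuously rising queue is to compare the daily minimum of the Compaction Queue Length: if the queue drains back to roughly the same floor every day, the backlog is periodic; if the daily minimum keeps rising, the instance cannot keep up. The following sketch assumes one queue length sample per hour and is only an illustration.

```python
# Distinguish a periodic compaction backlog from a continuously rising one by
# looking at the minimum queue length of each day. Samples are assumed to be
# hourly Compaction Queue Length values.

def daily_minimums(hourly_samples):
    return [min(hourly_samples[i:i + 24]) for i in range(0, len(hourly_samples), 24)]

def queue_keeps_rising(hourly_samples):
    """True if the daily floor of the queue keeps increasing (the queue never drains)."""
    floors = daily_minimums(hourly_samples)
    return len(floors) >= 3 and all(b > a for a, b in zip(floors, floors[1:]))

# Periodic pattern: the queue builds up by day and drains to 0 at night -> healthy.
periodic = ([0] * 8 + [20, 40, 60, 40, 20, 10, 5, 0] + [0] * 8) * 3
print(queue_keeps_rising(periodic))   # False

# Rising floor: the queue never fully drains and the backlog grows each day.
rising = [10 * day + hour % 5 for day in range(3) for hour in range(24)]
print(queue_keeps_rising(rising))     # True
```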
Average number of files in Region: the average number of files in a shard. A higher value indicates increased read RT. File metadata occupies memory space and an excessive number of files may trigger Full GC or OOM.
Maximum number of files in Region: the maximum number of files in a single shard. If this limit is exceeded, write backpressure will occur, resulting in write timeouts. For more information, see Limits on data requests.
Read requests
This module includes the following metrics:
Get operation metrics: Get requests (pieces/second), Get Average RT (ms), and Get P99 RT (ms). These metrics monitor point queries, which are executed on the Lindorm server by using a complete primary key, and include the QPS, average RT, and P99 RT of these queries. A BatchGet operation is counted as one point query call regardless of how many rows it contains, because a BatchGet operation is executed serially on a single server. Therefore, if you use only BatchGet operations, or both BatchGet and single-row Get operations, the average RT is higher than the RT of a single-row Get operation.
Scan operation metrics: Scan requests (pieces/second), Scan Average RT (ms), and Scan P99 RT (ms). These metrics are used to monitor range scan requests. The Lindorm server splits a large-range scan request and returns data in a streaming manner. Scan requests (pieces/second) indicates the number of sub-scan requests that are sent to the server per second after large-range scan requests are split, and Scan Average RT (ms) indicates the average time consumed by each sub-scan operation. Therefore, the Scan requests (pieces/second) value may be greater than the number of scan requests that your application actually initiates. The time consumed by a complete scan request is the sum of the time consumed by its sub-scan requests.
Read operation metrics: Read Requests (s/sec), Read Average RT (ms), and Read traffic. These metrics are used to monitor both Get requests and scan requests and collect information such as the number of rows returned by the instance per second and the average RT used to return each row of data. Multiple rows may be returned in a single Get request or scan request, so these metrics can more accurately reflect the read throughput of the instance.
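Because a single Get, BatchGet, or sub-scan request can return multiple rows, you can combine the request-level and row-level metrics to estimate how many rows each call returns on average. The following arithmetic sketch uses illustrative metric values.

```python
# Combine request-level and row-level read metrics to estimate how many rows
# are returned per call. All metric values are illustrative assumptions.

def rows_per_request(read_rows_per_sec, get_requests_per_sec, scan_requests_per_sec):
    total_requests = get_requests_per_sec + scan_requests_per_sec
    if total_requests == 0:
        return 0.0
    return read_rows_per_sec / total_requests

# Example: 50,000 rows/s returned by 2,000 Get calls/s and 500 sub-scan calls/s.
print(f"{rows_per_request(50_000, 2_000, 500):.1f} rows per request on average")
```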
Write requests
Write traffic: This metric is used to monitor the throughput (unit: KB/s) of write operations to LindormTable. When LindormTable writes data to the underlying storage, the columns of the wide table are converted into key-value pairs, so the stored columns have a larger data volume than the columns actually written. We recommend that you use this metric to determine the write throughput of LindormTable.
High write throughput may lead to accumulated compaction tasks, which may affect the stability of the instance. We recommend that you appropriately configure CPUs based on your business requirements.
The following table describes the recommended write throughput for different configurations.
CPU configuration | Recommended write throughput
4 cores | Less than 5 MB/s
8 cores | Less than 10 MB/s
16 cores | Less than 30 MB/s
32 cores | Less than 60 MB/s
The above values are only for reference. In actual use, you can further consider them in combination with metrics, such as Compaction Queue Length, Average number of files in Region, and Maximum number of files in Region.
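The following sketch checks the Write traffic metric (KB/s) against the reference throughput for each CPU configuration in the preceding table. The mapping and sample values are for illustration only.

```python
# Check the write traffic of LindormTable (KB/s) against the reference
# throughput per CPU configuration listed above. Values are for reference only.

RECOMMENDED_MAX_WRITE_MBPS = {4: 5, 8: 10, 16: 30, 32: 60}

def check_write_throughput(cpu_cores, write_traffic_kb_per_sec):
    limit_mbps = RECOMMENDED_MAX_WRITE_MBPS.get(cpu_cores)
    write_mbps = write_traffic_kb_per_sec / 1024
    if limit_mbps is not None and write_mbps > limit_mbps:
        print(f"write traffic {write_mbps:.1f} MB/s exceeds the ~{limit_mbps} MB/s "
              f"reference for {cpu_cores}-core nodes; watch the compaction metrics")
    else:
        print(f"write traffic {write_mbps:.1f} MB/s is within the reference range")

check_write_throughput(cpu_cores=8, write_traffic_kb_per_sec=15_360)  # 15 MB/s
```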
Number of times exceeding the upper limit of Memstore (times): LindormTable first writes data to the Memstore of the corresponding shard. When the Memstore usage is too high, the system triggers a flush operation to write the data to disks. If write requests are concentrated on a few shards, the Memstore usage of these shards becomes too high, which results in write backpressure and reduces throughput. Therefore, when this metric value is greater than 0, you need to check whether write hotspots exist or whether the write transactions per second (TPS) exceed the limit that the instance can handle, either of which may prevent data from being flushed to disks in a timely manner. In this case, you can use a hash algorithm to distribute primary keys more evenly and avoid hotspots, as shown in the sketch below. For more information, see Design primary keys for Lindorm wide tables.
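The following sketch shows one common way to apply such hash-based distribution: prefix the primary key with a short hash bucket derived from part of the key so that sequential keys, such as timestamps, are spread across shards. The bucket count, key layout, and field names are assumptions for illustration; design the actual primary key according to Design primary keys for Lindorm wide tables.

```python
# One common way to spread sequential primary keys (such as timestamps) across
# shards: prefix the key with a short hash bucket derived from part of the key.
# The bucket count and key layout here are assumptions for illustration.

import hashlib

NUM_BUCKETS = 16  # typically aligned with the number of pre-partitions

def salted_row_key(device_id: str, timestamp_ms: int) -> str:
    raw_key = f"{device_id}:{timestamp_ms}"
    bucket = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    # The two-digit bucket prefix determines which shard range the row lands in.
    return f"{bucket:02d}:{raw_key}"

# Writes for different devices at the same time land in different buckets.
print(salted_row_key("sensor-001", 1700000000000))
print(salted_row_key("sensor-002", 1700000000000))
```

In this sketch, hashing only the device identifier keeps all rows of the same device in the same bucket, so per-device range queries remain efficient while writes for different devices are spread across shards.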