Keeping a Lindorm instance healthy means tracking the right metrics before problems surface — not after. This guide explains what each metric measures, what values indicate trouble, and what to do when a threshold is breached.
Find the right metric for your situation
Most operators arrive here with a symptom, not a metric name. Use this table to jump to the relevant section.
| If you see... | Check |
|---|---|
| Rising response times (RT) on reads or writes | CPU and load, Cluster load |
| Write timeouts or write rejections | Write requests, Cluster load |
| Disk or network I/O throttling | Network and disk |
| Storage full or writes blocked | Cluster storage |
| Out-of-memory or Full GC events | Cluster load |
| Slow scans or high scan RT | Read requests |
System metrics
CPU and load
What to monitor: CPU utilization, CPU idle rate, CPU WIO utilization, and average load.
CPU utilization breaks down into CPU utilization User(%) (user-space processes) and CPU utilization System(%) (kernel-space processes).
Alert threshold
Use CPU idle rate(%) as your primary alert metric rather than CPU utilization. The impact of high CPU utilization varies by workload:
Online (latency-sensitive) workloads may degrade when CPU utilization exceeds 40%.
Offline batch workloads may tolerate 100% CPU utilization without issue.
Set the alert threshold based on the point at which your specific workload starts experiencing latency, rather than a fixed number.
When CPU resources are insufficient, scale up or upgrade the instance. See also Modify the configurations of an instance.
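The workload-specific approach above can be sketched in a few lines. This is an illustrative Python helper, not part of any Lindorm API: given historical samples of (CPU utilization, average RT), it finds the lowest utilization at which latency exceeded your budget, which you can convert into a CPU idle rate alert floor (idle = 100 − utilization). The sample data and the 10 ms budget are assumptions.

```python
def latency_knee(samples, rt_budget_ms):
    """Given (cpu_util_pct, avg_rt_ms) samples, return the lowest CPU
    utilization at which average RT exceeded the latency budget, or
    None if the budget was never exceeded. Use (100 - result) as a
    data-driven starting point for the CPU idle rate alert threshold."""
    over = [util for util, rt in samples if rt > rt_budget_ms]
    return min(over) if over else None

# Hypothetical samples: RT starts degrading past 60% utilization,
# so an idle-rate alert floor of 40% would be a reasonable start.
samples = [(20, 5.0), (40, 6.0), (60, 14.0), (80, 30.0)]
knee = latency_knee(samples, rt_budget_ms=10.0)  # 60
```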
Diagnosing CPU vs. disk bottlenecks
Two metrics help distinguish CPU pressure from disk pressure:
CPU WIO usage(%): The percentage of time the CPU spends waiting for I/O. A high value indicates a disk read/write bottleneck.
Average load per minute(load1): Reflects combined CPU and disk usage.
Read these together. An acceptable load value is roughly equal to the number of CPU cores — for an 8-core machine, a load above 8 means tasks are queuing and the machine is in a suboptimal state. If CPU utilization is low but load is high, disk I/O is the bottleneck.
When CPU load or WIO utilization is too high, scale up or upgrade the instance.
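The triage logic described above can be expressed as a small decision function. This is a rough sketch with assumed cutoffs (50% CPU, 20% WIO), not values published by Lindorm; tune them against your own baselines.

```python
def diagnose(cpu_util_pct, wio_pct, load1, cores):
    """Rough triage of CPU vs. disk pressure, following the guidance
    that acceptable load is roughly equal to the core count, and that
    high load with low CPU (or high WIO) points at disk I/O."""
    if load1 <= cores:
        return "healthy"            # tasks are not queuing
    if cpu_util_pct < 50 or wio_pct > 20:
        return "disk bottleneck"    # high load but CPU idle / waiting on I/O
    return "cpu bottleneck"         # high load and CPU saturated

# 8-core machine, load 12, CPU 90% busy, almost no I/O wait:
print(diagnose(90, 2, 12, 8))
```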
Network and disk
What to monitor: Network traffic, disk read/write throughput, and IOPS.
Keep all values below the throttling thresholds of the underlying Elastic Compute Service (ECS) instances and cloud disks. Non-local ECS disks have a combined read/write bandwidth limit — exceeding it triggers throttling that affects business operations.
Throttling thresholds
ECS network bandwidth limits vary by instance type. See Overview of instance families. For disk limits, see Block storage performance.
Match your Lindorm storage type to the correct ECS performance parameters:
| Lindorm storage type | ECS performance parameters to reference |
|---|---|
| Performance storage | SSD |
| Standard storage | ESSD |
| Local disk | Local disk |
For questions about network and disk limits, contact Lindorm technical support (DingTalk ID: s0s3eg3).
Cluster storage
What to monitor: Storage (hot storage) water level(%) and Cold storage water level(%).
| Threshold | Level | Action |
|---|---|---|
| 75%–80% | Alert threshold | Scale up the instance promptly |
| 95% | Critical threshold | System automatically blocks all write operations |
Set alerts at 75%–80% to give enough lead time before the system reaches the 95% write-block threshold. When storage reaches the alert level, scale up immediately to avoid a write outage.
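The two thresholds above map directly to an alert rule. A minimal sketch (the 75% alert floor is the conservative end of the recommended 75%–80% range):

```python
def storage_action(water_level_pct):
    """Map a hot or cold storage water level to the recommended action
    from the threshold table: alert at 75%-80%, write block at 95%."""
    if water_level_pct >= 95:
        return "critical: system blocks all writes"
    if water_level_pct >= 75:
        return "alert: scale up the instance promptly"
    return "ok"
```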
LindormTable metrics
Cluster load
Memory usage
Metric: LindormTable compute node memory usage ratio(%)
This is the percentage of heap memory currently in use by LindormTable. Heap usage fluctuates naturally; short spikes are handled by garbage collection (GC). Sustained high usage is the concern.
Alert rule: Fire when this ratio exceeds 85%–90% for 30–60 consecutive minutes.
When heap usage is consistently high, upgrade the LindormTable node specifications to increase memory. See Change the engine specification of an instance.
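The "sustained for 30–60 minutes" condition is what separates real memory pressure from normal GC sawtooth. A sketch of that rule, assuming one sample per minute (adjust `window` to your collection interval):

```python
def sustained_breach(samples_pct, threshold=85.0, window=30):
    """True only if the last `window` consecutive samples all exceed
    `threshold`. Transient GC-driven spikes shorter than the window
    do not fire the alert."""
    if len(samples_pct) < window:
        return False
    return all(s > threshold for s in samples_pct[-window:])

# A single dip inside the window resets the condition:
steady_high = [90.0] * 60
with_dip = [90.0] * 29 + [60.0] + [90.0] * 29
```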
Shard count
Metric: Number of regions of RS (unit)
LindormTable divides tables into data shards (Regions) and distributes them across nodes. Each shard consumes metadata memory, so an excessive number of shards causes memory pressure.
Reference limits by node memory:
| Node memory | Max recommended shards |
|---|---|
| 8 GB | < 500 |
| 16 GB | < 1,000 |
| 32 GB | < 2,000 |
| 64 GB | < 3,000 |
| 128 GB | < 5,000 |
These are starting points. For a more accurate assessment, check the actual memory ratio: Used memory / Total memory of the LindormTable compute node. If the shard count is too high, reduce it by lowering the number of tables or reducing pre-partitions when creating tables.
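The reference table can be encoded as a conservative lookup: for a node size between two tiers, use the lower tier's limit. A sketch:

```python
# Reference table from this guide: node memory (GB) -> max recommended shards.
SHARD_LIMITS = {8: 500, 16: 1000, 32: 2000, 64: 3000, 128: 5000}

def shard_limit(mem_gb):
    """Return the limit of the largest tier at or below mem_gb,
    staying conservative for in-between node sizes; 0 if below
    the smallest tier."""
    tiers = [m for m in sorted(SHARD_LIMITS) if m <= mem_gb]
    return SHARD_LIMITS[tiers[-1]] if tiers else 0
```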
Request queuing
Metric: HandlerQueue Length (unit)
This metric shows how many requests are queued waiting for a server thread. Any value above 0 means the server can't process requests fast enough.
A persistently non-zero HandlerQueue indicates insufficient CPU resources. Upgrade the instance configuration to add CPU capacity.
Compaction queuing
Metric: Compaction Queue Length (unit)
Compaction consolidates data files within shards. Write-heavy workloads naturally generate more compaction work, which can queue up.
Not all queuing is a problem. Workloads with predictable peaks (busy during the day, quiet at night) may accumulate compaction tasks during peak hours and drain the queue overnight — this is healthy behavior. Similarly, a queue that stays stable at a fixed value indicates a steady state and does not require attention.
When to act: If the queue grows continuously without any downward trend, the instance lacks the resources to keep up. Add nodes or upgrade the CPU configuration.
Short-term compaction backlogs don't affect reads or writes. Long-term backlogs do — more data files per shard may increase read response time (RT). Eventually, a shard can accumulate enough files to trigger write backpressure, resulting in increased write RT or even write timeouts.
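The distinction above (healthy drain cycles and stable queues vs. continuous growth) can be approximated with a simple trend check over a sampling window. This is a heuristic sketch, not a Lindorm-provided rule; the window should span at least one full daily cycle so overnight drains are visible.

```python
def backlog_growing(queue_samples):
    """True if the compaction queue shows no downward trend across the
    window: every step is flat or rising AND the queue ends higher than
    it started. A stable flat queue or one that drains is fine."""
    nondecreasing = all(b >= a for a, b in zip(queue_samples, queue_samples[1:]))
    return nondecreasing and queue_samples[-1] > queue_samples[0]
```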
Files per shard
Two related metrics track shard file accumulation:
Average number of files in Region: Higher values increase read RT. Excessive files also increase memory pressure and risk triggering Full GC or OOM.
Maximum number of files in Region: When this exceeds the limit, the instance applies write backpressure, causing write timeouts. See Limits on data requests.
Monitor these alongside Compaction Queue Length — high file counts usually follow sustained compaction backlogs.
Read requests
LindormTable exposes read metrics at three levels: Get operations, Scan operations, and aggregate reads (Read).
Get operations
| Metric | What it measures |
|---|---|
| Get requests (pieces/second) | Point query throughput (QPS) |
| Get Average RT (ms) | Average response time for Get operations |
| Get P99 RT (ms) | 99th percentile response time for Get operations |
A point query uses a complete primary key to retrieve a single row. BatchGet executes serially on a single server — regardless of how many rows it fetches, it counts as one point query call. This means Get Average RT reflects BatchGet duration, which is higher than single-row Get RT when BatchGet operations are common.
Scan operations
| Metric | What it measures |
|---|---|
| Scan requests (pieces/second) | Sub-scan throughput after server-side splitting |
| Scan Average RT (ms) | Average response time per sub-scan |
| Scan P99 RT (ms) | 99th percentile response time per sub-scan |
Lindorm splits large range scans into sub-scans and returns results in a streaming manner. Scan requests (pieces/second) counts sub-scans per second — not the number of original client requests. The total time for a complete scan is the sum of all its sub-scan durations.
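A worked example of that relationship, with hypothetical numbers: if one client Scan is split into four sub-scans, the monitor counts four entries toward Scan requests, Scan Average RT reflects the per-sub-scan time, and the client-perceived total is the sum.

```python
# Hypothetical sub-scan RTs (ms) for ONE client Scan that the server
# split into four sub-scans, with results streamed back between them.
sub_scan_rt_ms = [12.0, 15.5, 11.0, 14.5]

total_scan_ms = sum(sub_scan_rt_ms)                     # client-perceived scan time
avg_sub_scan_ms = total_scan_ms / len(sub_scan_rt_ms)   # what Scan Average RT reflects
```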
Aggregate read metrics
| Metric | What it measures |
|---|---|
| Read Requests (rows/sec) | Total read throughput (rows per second, covering both Get and Scan) |
| Read Average RT (ms) | Average time to return one row of data |
| Read traffic | Total read data volume |
These metrics cover both Get and Scan operations and measure throughput at the row level. Because a single Get or Scan may return multiple rows, these metrics reflect actual read throughput more accurately than operation count alone.
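To see why row-level counting matters, consider a hypothetical mix (all numbers invented for illustration): operation counts alone would suggest Gets dominate, but at the row level the scans carry most of the read load.

```python
# 1,000 Get/s returning 1 row each, plus 50 sub-scans/s returning
# ~200 rows each. Read Requests aggregates at the ROW level.
get_qps, rows_per_get = 1_000, 1
scan_qps, rows_per_scan = 50, 200

read_rows_per_sec = get_qps * rows_per_get + scan_qps * rows_per_scan
# Gets contribute 1,000 rows/s; scans contribute 10,000 rows/s.
```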
Write requests
Write throughput
Metric: Write traffic (unit: KB/s)
This is the actual data volume written to LindormTable's underlying storage. Because wide table columns are converted to key-value pairs during storage, the stored data volume is larger than the data actually written by the client.
High write throughput increases compaction pressure. Use these guidelines as a starting point, then adjust based on Compaction Queue Length, Average number of files in Region, and Maximum number of files in Region:
| CPU cores | Max recommended write throughput |
|---|---|
| 4 | < 5 MB/s |
| 8 | < 10 MB/s |
| 16 | < 30 MB/s |
| 32 | < 60 MB/s |
If write throughput consistently exceeds these limits, add CPU resources to keep compaction current.
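As with the shard table, the write-throughput guideline can be encoded as a conservative lookup plus a headroom check. A sketch:

```python
# Reference table from this guide: CPU cores -> max recommended write MB/s.
WRITE_LIMITS_MB_S = {4: 5, 8: 10, 16: 30, 32: 60}

def write_headroom(cores, observed_mb_s):
    """Remaining headroom (MB/s) against the guideline for this core
    count; a negative result means compaction pressure will likely
    build. Uses the nearest tier at or below the core count."""
    tiers = [c for c in sorted(WRITE_LIMITS_MB_S) if c <= cores]
    limit = WRITE_LIMITS_MB_S[tiers[-1]] if tiers else 0
    return limit - observed_mb_s
```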
MemStore pressure
Metric: Number of times exceeding the upper limit of Memstore (times)
LindormTable writes data to a MemStore (in-memory buffer) first, then flushes it to disk. When write traffic concentrates on a small number of shards, those shards' MemStores fill up faster than they can flush, causing write backpressure that reduces throughput.
Any value above 0 warrants investigation:
Write hotspot: Writes are concentrated on a few primary key ranges. Redesign the primary key using a hash algorithm to distribute writes more evenly. See Design primary keys for Lindorm wide tables.
TPS exceeding instance capacity: The total write rate has exceeded what the instance can handle. Scale up or add nodes.
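One common form of the hash-based redesign is a salted row key: derive a bucket from a hash of a business identifier and prepend it, so monotonically increasing keys (such as timestamps) spread across shards instead of hammering one MemStore. This is an illustrative sketch; the key layout, the `md5` choice, and the 16-bucket count are assumptions, and the bucket count should align with the table's pre-partitions.

```python
import hashlib

def salted_row_key(user_id, ts_ms, buckets=16):
    """Build a row key of the form '<bucket>|<user_id>|<ts_ms>' where
    the bucket is hash-derived, distributing writes across shards."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
    return f"{bucket:02d}|{user_id}|{ts_ms}"
```

Note that range scans by time for a single `user_id` still work within one bucket, since the bucket is a deterministic function of `user_id`.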