Keeping a Lindorm instance healthy means tracking the right metrics before problems surface — not after. This guide explains what each metric measures, what values indicate trouble, and what to do when a threshold is breached.
Find the right metric for your situation
Most operators arrive here with a symptom, not a metric name. Use this table to jump to the relevant section.
| If you see... | Check |
|---|---|
| Rising response times (RT) on reads or writes | CPU and load, Cluster load |
| Write timeouts or write rejections | Write requests, Cluster load |
| Disk or network I/O throttling | Network and disk |
| Storage full or writes blocked | Cluster storage |
| Out-of-memory or Full GC events | Cluster load |
| Slow scans or high scan RT | Read requests |
System metrics
CPU and load
What to monitor: CPU utilization, CPU idle rate, CPU WIO utilization, and average load.
CPU utilization breaks down into CPU utilization User(%) (user-space processes) and CPU utilization System(%) (kernel-space processes).
Alert threshold
Use CPU idle rate(%) as your primary alert metric rather than CPU utilization. The impact of high CPU utilization varies by workload:
Online (latency-sensitive) workloads may degrade when CPU utilization exceeds 40%.
Offline batch workloads may tolerate 100% CPU utilization without issue.
Set the alert threshold based on the point at which your specific workload starts experiencing latency, rather than a fixed number.
When CPU resources are insufficient, scale up or upgrade the instance. See also Modify the configurations of an instance.
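The workload-specific approach above can be sketched in a few lines. This is an illustrative Python helper, not part of any Lindorm API: given historical samples of (CPU utilization, average RT), it finds the lowest utilization at which latency exceeded your budget, which you can convert into a CPU idle rate alert floor (idle = 100 − utilization). The sample data and the 10 ms budget are assumptions.

```python
def latency_knee(samples, rt_budget_ms):
    """Given (cpu_util_pct, avg_rt_ms) samples, return the lowest CPU
    utilization at which average RT exceeded the latency budget, or
    None if the budget was never exceeded. Use (100 - result) as a
    data-driven starting point for the CPU idle rate alert threshold."""
    over = [util for util, rt in samples if rt > rt_budget_ms]
    return min(over) if over else None

# Hypothetical samples: RT starts degrading past 60% utilization,
# so an idle-rate alert floor of 40% would be a reasonable start.
samples = [(20, 5.0), (40, 6.0), (60, 14.0), (80, 30.0)]
knee = latency_knee(samples, rt_budget_ms=10.0)  # 60
```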
Diagnosing CPU vs. disk bottlenecks
Two metrics help distinguish CPU pressure from disk pressure:
CPU WIO usage(%): The percentage of time the CPU spends waiting for I/O. A high value indicates a disk read/write bottleneck.
Average load per minute(load1): Reflects combined CPU and disk usage.
Read these together. An acceptable load value is roughly equal to the number of CPU cores — for an 8-core machine, a load above 8 means tasks are queuing and the machine is in a suboptimal state. If CPU utilization is low but load is high, disk I/O is the bottleneck.
When CPU load or WIO utilization is too high, scale up or upgrade the instance.
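The triage logic described above can be expressed as a small decision function. This is a rough sketch with assumed cutoffs (50% CPU, 20% WIO), not values published by Lindorm; tune them against your own baselines.

```python
def diagnose(cpu_util_pct, wio_pct, load1, cores):
    """Rough triage of CPU vs. disk pressure, following the guidance
    that acceptable load is roughly equal to the core count, and that
    high load with low CPU (or high WIO) points at disk I/O."""
    if load1 <= cores:
        return "healthy"            # tasks are not queuing
    if cpu_util_pct < 50 or wio_pct > 20:
        return "disk bottleneck"    # high load but CPU idle / waiting on I/O
    return "cpu bottleneck"         # high load and CPU saturated

# 8-core machine, load 12, CPU 90% busy, almost no I/O wait:
print(diagnose(90, 2, 12, 8))
```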
Network and disk
What to monitor: Network traffic, disk read/write throughput, and IOPS.
Keep all values below the throttling thresholds of the underlying Elastic Compute Service (ECS) instances and cloud disks. Non-local ECS disks have a combined read/write bandwidth limit — exceeding it triggers throttling that affects business operations.
Throttling thresholds
ECS network bandwidth limits vary by instance type. See Overview of instance families. For disk limits, see Block storage performance.
Match your Lindorm storage type to the correct ECS performance parameters:
| Lindorm storage type | ECS performance parameters to reference |
|---|---|
| Performance storage | SSD |
| Standard storage | ESSD |
| Local disk | Local disk |
For questions about network and disk limits, contact Lindorm technical support (DingTalk ID: s0s3eg3).
Cluster storage
What to monitor: Storage (hot storage) water level(%) and Cold storage water level(%).
| Threshold | Level | Action |
|---|---|---|
| 75%–80% | Alert threshold | Scale up the instance promptly |
| 95% | Critical threshold | System automatically blocks all write operations |
Set alerts at 75%–80% to give enough lead time before the system reaches the 95% write-block threshold. When storage reaches the alert level, scale up immediately to avoid a write outage.
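The two thresholds above map directly to an alert rule. A minimal sketch (the 75% alert floor is the conservative end of the recommended 75%–80% range):

```python
def storage_action(water_level_pct):
    """Map a hot or cold storage water level to the recommended action
    from the threshold table: alert at 75%-80%, write block at 95%."""
    if water_level_pct >= 95:
        return "critical: system blocks all writes"
    if water_level_pct >= 75:
        return "alert: scale up the instance promptly"
    return "ok"
```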
LindormTable metrics
Cluster load
Memory usage
Metric: LindormTable compute node memory usage ratio(%)
This is the percentage of heap memory currently in use by LindormTable. Heap usage fluctuates naturally; short spikes are handled by garbage collection (GC). Sustained high usage is the concern.
Alert rule: Fire when this ratio exceeds 85%–90% for 30–60 consecutive minutes.
When heap usage is consistently high, upgrade the LindormTable node specifications to increase memory. See Change the engine specification of an instance.
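The "sustained for 30–60 minutes" condition is what separates real memory pressure from normal GC sawtooth. A sketch of that rule, assuming one sample per minute (adjust `window` to your collection interval):

```python
def sustained_breach(samples_pct, threshold=85.0, window=30):
    """True only if the last `window` consecutive samples all exceed
    `threshold`. Transient GC-driven spikes shorter than the window
    do not fire the alert."""
    if len(samples_pct) < window:
        return False
    return all(s > threshold for s in samples_pct[-window:])

# A single dip inside the window resets the condition:
steady_high = [90.0] * 60
with_dip = [90.0] * 29 + [60.0] + [90.0] * 29
```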
Shard count
Metric: Number of regions of RS (unit)
LindormTable divides tables into data shards (Regions) and distributes them across nodes. Each shard consumes metadata memory, so an excessive number of shards causes memory pressure.
Reference limits by node memory:
| Node memory | Max recommended shards |
|---|---|
| 8 GB | < 500 |
| 16 GB | < 1,000 |
| 32 GB | < 2,000 |
| 64 GB | < 3,000 |
| 128 GB | < 5,000 |
These are starting points. For a more accurate assessment, check the actual memory ratio: Used memory / Total memory of the LindormTable compute node. If the shard count is too high, reduce it by lowering the number of tables or reducing pre-partitions when creating tables.
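The reference table can be encoded as a conservative lookup: for a node size between two tiers, use the lower tier's limit. A sketch:

```python
# Reference table from this guide: node memory (GB) -> max recommended shards.
SHARD_LIMITS = {8: 500, 16: 1000, 32: 2000, 64: 3000, 128: 5000}

def shard_limit(mem_gb):
    """Return the limit of the largest tier at or below mem_gb,
    staying conservative for in-between node sizes; 0 if below
    the smallest tier."""
    tiers = [m for m in sorted(SHARD_LIMITS) if m <= mem_gb]
    return SHARD_LIMITS[tiers[-1]] if tiers else 0
```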
Request queuing
Metric: HandlerQueue Length (unit)
This metric shows how many requests are queued waiting for a server thread. Any value above 0 means the server can't process requests fast enough.
A persistently non-zero HandlerQueue indicates insufficient CPU resources. Upgrade the instance configuration to add CPU capacity.
Compaction queuing
Metric: Compaction Queue Length (unit)
Compaction consolidates data files within shards. Write-heavy workloads naturally generate more compaction work, which can queue up.
Not all queuing is a problem. Workloads with predictable peaks (busy during the day, quiet at night) may accumulate compaction tasks during peak hours and drain the queue overnight — this is healthy behavior. Similarly, a queue that stays stable at a fixed value indicates a steady state and does not require attention.
When to act: If the queue grows continuously without any downward trend, the instance lacks the resources to keep up. Add nodes or upgrade the CPU configuration.
Short-term compaction backlogs don't affect reads or writes. Long-term backlogs do — more data files per shard may increase read response time (RT). Eventually, a shard can accumulate enough files to trigger write backpressure, resulting in increased write RT or even write timeouts.
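The distinction above (healthy drain cycles and stable queues vs. continuous growth) can be approximated with a simple trend check over a sampling window. This is a heuristic sketch, not a Lindorm-provided rule; the window should span at least one full daily cycle so overnight drains are visible.

```python
def backlog_growing(queue_samples):
    """True if the compaction queue shows no downward trend across the
    window: every step is flat or rising AND the queue ends higher than
    it started. A stable flat queue or one that drains is fine."""
    nondecreasing = all(b >= a for a, b in zip(queue_samples, queue_samples[1:]))
    return nondecreasing and queue_samples[-1] > queue_samples[0]
```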
Files per shard
Two related metrics track shard file accumulation:
Average number of files in Region: Higher values increase read RT. Excessive files also increase memory pressure and risk triggering Full GC or OOM.
Maximum number of files in Region: When this exceeds the limit, the instance applies write backpressure, causing write timeouts. See Limits on data requests.
Monitor these alongside Compaction Queue Length — high file counts usually follow sustained compaction backlogs.
Read requests
LindormTable exposes read metrics at three levels: Get operations, Scan operations, and aggregate reads (Read).
Get operations
| Metric | What it measures |
|---|---|
| Get requests (pieces/second) | Point query throughput (QPS) |
| Get Average RT (ms) | Average response time for Get operations |
| Get P99 RT (ms) | 99th percentile response time for Get operations |
A point query uses a complete primary key to retrieve a single row. BatchGet executes serially on a single server — regardless of how many rows it fetches, it counts as one point query call. This means Get Average RT reflects BatchGet duration, which is higher than single-row Get RT when BatchGet operations are common.
Scan operations
| Metric | What it measures |
|---|---|
| Scan requests (pieces/second) | Sub-scan throughput after server-side splitting |
| Scan Average RT (ms) | Average response time per sub-scan |
| Scan P99 RT (ms) | 99th percentile response time per sub-scan |
Lindorm splits large range scans into sub-scans and returns results in a streaming manner. Scan requests (pieces/second) counts sub-scans per second — not the number of original client requests. The total time for a complete scan is the sum of all its sub-scan durations.
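A worked example of that relationship, with hypothetical numbers: if one client Scan is split into four sub-scans, the monitor counts four entries toward Scan requests, Scan Average RT reflects the per-sub-scan time, and the client-perceived total is the sum.

```python
# Hypothetical sub-scan RTs (ms) for ONE client Scan that the server
# split into four sub-scans, with results streamed back between them.
sub_scan_rt_ms = [12.0, 15.5, 11.0, 14.5]

total_scan_ms = sum(sub_scan_rt_ms)                     # client-perceived scan time
avg_sub_scan_ms = total_scan_ms / len(sub_scan_rt_ms)   # what Scan Average RT reflects
```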
Aggregate read metrics
| Metric | What it measures |
|---|---|
| Read Requests (rows/sec) | Total read throughput (rows per second, covering both Get and Scan) |
| Read Average RT (ms) | Average time to return one row of data |
| Read traffic | Total read data volume |
These metrics cover both Get and Scan operations and measure throughput at the row level. Because a single Get or Scan may return multiple rows, these metrics reflect actual read throughput more accurately than operation count alone.
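To see why row-level counting matters, consider a hypothetical mix (all numbers invented for illustration): operation counts alone would suggest Gets dominate, but at the row level the scans carry most of the read load.

```python
# 1,000 Get/s returning 1 row each, plus 50 sub-scans/s returning
# ~200 rows each. Read Requests aggregates at the ROW level.
get_qps, rows_per_get = 1_000, 1
scan_qps, rows_per_scan = 50, 200

read_rows_per_sec = get_qps * rows_per_get + scan_qps * rows_per_scan
# Gets contribute 1,000 rows/s; scans contribute 10,000 rows/s.
```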
Write requests
Write throughput
Metric: Write traffic (unit: KB/s)
This is the actual data volume written to LindormTable's underlying storage. Because wide table columns are converted to key-value pairs during storage, the stored data volume is larger than the data actually written by the client.
High write throughput increases compaction pressure. Use these guidelines as a starting point, then adjust based on Compaction Queue Length, Average number of files in Region, and Maximum number of files in Region:
| CPU cores | Max recommended write throughput |
|---|---|
| 4 | < 5 MB/s |
| 8 | < 10 MB/s |
| 16 | < 30 MB/s |
| 32 | < 60 MB/s |
If write throughput consistently exceeds these limits, add CPU resources to keep compaction current.
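As with the shard table, the write-throughput guideline can be encoded as a conservative lookup plus a headroom check. A sketch:

```python
# Reference table from this guide: CPU cores -> max recommended write MB/s.
WRITE_LIMITS_MB_S = {4: 5, 8: 10, 16: 30, 32: 60}

def write_headroom(cores, observed_mb_s):
    """Remaining headroom (MB/s) against the guideline for this core
    count; a negative result means compaction pressure will likely
    build. Uses the nearest tier at or below the core count."""
    tiers = [c for c in sorted(WRITE_LIMITS_MB_S) if c <= cores]
    limit = WRITE_LIMITS_MB_S[tiers[-1]] if tiers else 0
    return limit - observed_mb_s
```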
MemStore pressure
Metric: Number of times exceeding the upper limit of Memstore (times)
LindormTable writes data to a MemStore (in-memory buffer) first, then flushes it to disk. When write traffic concentrates on a small number of shards, those shards' MemStores fill up faster than they can flush, causing write backpressure that reduces throughput.
Any value above 0 warrants investigation:
Write hotspot: Writes are concentrated on a few primary key ranges. Redesign the primary key using a hash algorithm to distribute writes more evenly. See Design primary keys for Lindorm wide tables.
TPS exceeding instance capacity: The total write rate has exceeded what the instance can handle. Scale up or add nodes.
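One common form of the hash-based redesign is a salted row key: derive a bucket from a hash of a business identifier and prepend it, so monotonically increasing keys (such as timestamps) spread across shards instead of hammering one MemStore. This is an illustrative sketch; the key layout, the `md5` choice, and the 16-bucket count are assumptions, and the bucket count should align with the table's pre-partitions.

```python
import hashlib

def salted_row_key(user_id, ts_ms, buckets=16):
    """Build a row key of the form '<bucket>|<user_id>|<ts_ms>' where
    the bucket is hash-derived, distributing writes across shards."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
    return f"{bucket:02d}|{user_id}|{ts_ms}"
```

Note that range scans by time for a single `user_id` still work within one bucket, since the bucket is a deterministic function of `user_id`.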