View the monitoring information about an EMR Serverless StarRocks instance

EMR Serverless StarRocks provides monitoring and alerting features that allow you to view the status and key performance metrics of EMR Serverless StarRocks instances in real time. This helps you identify issues efficiently.

Limitations

Only monitoring data from the previous 30 days is available.

Precautions

Some metrics are related to the root account, such as the Query metric. The root account is a dedicated account used to manage StarRocks instances. Users cannot view or use the root account.

Procedure

1. Go to the EMR Serverless StarRocks instance list page.
   a. Log on to the E-MapReduce console.
   b. In the navigation pane on the left, choose EMR Serverless > StarRocks.
   c. In the top menu bar, select the required region.
2. Click the ID of the instance.
3. Click the Monitoring and Alerting tab.
4. On the Monitoring and Alerting tab, configure the Resource Group and Select Time parameters to view specific metrics.
   Valid values of the Resource Group parameter:
   - default_wg: the default resource group used by query tasks.
   - default_mv_wg: the default resource group used by materialized views.

Metrics

Instance

Overview

Metric	Description
FE Availability	The availability of frontend nodes (FEs).
BE/CN Availability	The availability of backend nodes (BEs) or compute nodes (CNs).
FE Count	The number of FEs.
BE or CN Count	The number of BEs or CNs.
Disk Usage (Avg)	The average disk usage of all BEs in the StarRocks instance.
Storage	The actual storage space used by StarRocks. This metric is available only for compute-storage separation scenarios. The value of the metric is updated with a delay of about one hour.
Compaction Score (Max)	The highest compaction score of each FE. This metric is available only for StarRocks shared-nothing instances.
FE Detection	The detection status of FEs. EMR Serverless StarRocks detects the status of FEs by sending HTTP requests. The value On indicates that the detection result is normal, and the value Off indicates that the detection fails.
BE/CN Node Status	The status of BE/CN nodes reported by FE. If the number of Alive nodes is abnormal, you can use the SHOW COMPUTE NODES command to view node details.
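
When the number of Alive nodes looks wrong, the rows returned by SHOW COMPUTE NODES can be filtered for nodes that are not alive. The following Python sketch assumes the rows were already fetched with a MySQL-protocol client and converted to dictionaries. The column names `ComputeNodeId`, `IP`, `Alive`, and `ErrMsg` are assumptions that should be checked against the output of your StarRocks version, and the sample rows are made up.

```python
from typing import Dict, List


def find_dead_nodes(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the rows whose Alive column is not "true".

    Assumption: each row is a dict keyed by column name, and Alive is
    reported as the string "true" or "false".
    """
    return [row for row in rows if str(row.get("Alive", "")).lower() != "true"]


# Hypothetical rows, shaped like the result of SHOW COMPUTE NODES.
sample_rows = [
    {"ComputeNodeId": "10001", "IP": "192.0.2.11", "Alive": "true", "ErrMsg": ""},
    {"ComputeNodeId": "10002", "IP": "192.0.2.12", "Alive": "false", "ErrMsg": "heartbeat timeout"},
]

dead_nodes = find_dead_nodes(sample_rows)
for node in dead_nodes:
    # Print the ID, address, and error message of each node that is not alive.
    print(node["ComputeNodeId"], node["IP"], node["ErrMsg"])
```

The function only inspects rows that you pass in, so the same check works for rows returned for BEs if they expose an equivalent Alive column.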

Query

Metric	Description
Queries per minute	The number of query tasks per minute.
Number of query faults per minute	The number of query errors per minute.
Query latency p99	The 99th percentile (P99) of query latency.
Slow Query	The number of slow queries per minute.

FE

Metric	Description
FE transaction resolution statistics	The statistics on the transaction status of each FE or all FEs per minute.
FE Disk Usage	The data disk used by each FE or all FEs. The metric value is updated every hour.

FE CPU

Metric	Description
CPU Util	The CPU utilization of each FE.
FE CPU Load 1min	The average CPU load of each FE in the previous minute.

FE Mem

Metric	Description
JVM Heap Usage	The ratio of used memory to maximum memory in the JVM heap.
JVM Young GC	The number of garbage collections performed in the young generation of the Java virtual machine (JVM) and the time that they take.
JVM Heap	The usage of JVM heap memory.
JVM Old GC	The number of garbage collections performed in the old generation of the JVM and the time that they take.

FE Net

Metric	Description
Network Receive Rate	The amount of data that is received per second.
Net Out	The amount of data that is sent per second.
FE Connections	The number of active connections to each FE.

Resource Group

Metric	Description
Query	The number of query tasks that run on the selected resource group per minute.
Query Latency p99	The 99th percentile (P99) latency of queries that run on the selected resource group.
Query (Resource Group)	The number of query tasks that run on all the resource groups per minute.

Materialized View

Metric	Description
MV Status	The status of materialized views. Valid values: 0 and 1. The value 0 indicates that the materialized view is active, and the value 1 indicates that the materialized view is inactive.
MV Refresh Duration p99	The 99th percentile (P99) of the time required to refresh materialized views.
MV Jobs (Total)	The total number of refresh tasks.
MV Jobs (Successful)	The number of successful refresh tasks.
Purge job failed	The number of failed refresh tasks.
Purge Job Empty	The number of refresh tasks that are canceled because no new data is available.
MV Jobs (Running)	The number of refresh tasks that are in progress.
Purge job pending	The number of refresh tasks that wait to run.
MV Hit Count	The number of queries that are rewritten on each materialized view, excluding the queries that are directly run on materialized views.
MV Query Count	The number of queries that are rewritten on each materialized view, including the queries that are directly run on materialized views.
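
Because MV Query Count includes queries that run directly on a materialized view and MV Hit Count excludes them, the difference between the two gives an estimate of the number of direct queries. The following Python sketch shows that arithmetic. The function name and the counter values are made up, and both counters are assumed to cover the same time window.

```python
def direct_mv_queries(mv_query_count: int, mv_hit_count: int) -> int:
    """Estimate the queries that ran directly on a materialized view.

    MV Query Count = rewritten queries + direct queries
    MV Hit Count   = rewritten queries only
    """
    if mv_hit_count > mv_query_count:
        raise ValueError("MV Hit Count should not exceed MV Query Count")
    return mv_query_count - mv_hit_count


# Made-up counter values for one materialized view.
print(direct_mv_queries(mv_query_count=120, mv_hit_count=95))  # prints 25
```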

Tables

Metric	Description
DataBase Tables	The distribution of tables across databases in the instance.
Table Count	The number of tables in the instance.
Tablet Count	The number of tablets in the instance.
Table Scan Bytes	The total amount of data scanned from non-system tables. Unit: bytes.
Table Load Bytes	The total amount of data imported to non-system tables. Unit: bytes.

Others

Metric	Description
Transfer Progress	The progress of table migration. This metric is applicable only to cluster migration scenarios.

Compute group

Overview

Metric	Description
CPU Util (Avg)	The average CPU utilization of all BEs or CNs.
Mem Util (Avg)	The average memory usage of all BEs or CNs.
Disk Usage (Max)	The maximum usage of multiple data disks of all BEs or CNs.
BE/CN Node Status	The detection status of BEs or CNs. EMR Serverless StarRocks detects the status of BEs or CNs by sending HTTP requests. The value On indicates that the detection result is normal, and the value Off indicates that the detection fails.

Compaction

Metric	Description
Maximum Compaction Score	The highest compaction score of the FEs.
Mem (Compaction)	The memory used by compaction tasks.
Compaction Bytes	The amount of data that is compacted per minute during the base compaction and cumulative compaction process.
Compaction Rowsets	The number of rowsets that are compacted per minute during the base compaction and cumulative compaction process.

BE/CN

Metric	Description
Query Scan Bytes	The amount of data scanned during the queries on each BE.
Query Scan Rows	The number of rows scanned during the queries on each BE.
Request Statistics	The total number of requests on specific nodes, such as the requests to create tables, publish versions, and clone tables.
Engine Requests (Failed)	The number of failed requests on BEs, such as the requests to create tables, publish versions, and clone tables.
Transaction Requests	The statistics of transaction phases per minute.

BE/CN CPU

Metric	Description
CPU Util	The CPU utilization.
BE/CN CPU Load 1min	The average CPU load of specific nodes in the previous minute.

BE/CN Mem

Metric	Description
Memory utilization	The memory utilization of the node, which covers BE/CN process memory, memory used by UDFs, memory reserved for BE/CN, and so on.
Process Mem (BE/CN)	The memory usage of the BE/CN process.
Process memory	The breakdown of process memory, which depends on the memory items collected by the kernel. Memory items that are not fully collected or fall outside the collection scope are labeled as "Other". For more memory information, see Memory_management.
Node Mem	The node memory, divided into three components: pod available memory (Pod Avail Mem), process memory (Process Mem), and non-process memory (Non Process Mem).
Node mem (BE/CN)	The memory of the BE/CN node, which includes total node memory, the 81% node memory threshold, node memory usage, and process memory usage. The upper limit of available BE/CN memory is determined jointly by a fixed coefficient of 0.9 in the StarRocks code and the mem_limit configuration parameter (default: 0.9). By default, BE/CN can therefore use 0.9 × 0.9 = 81% of total node memory.
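
The 81% threshold is the product of the two limits named in the Node mem (BE/CN) description. The following Python sketch works through the arithmetic. The function name and the 64 GiB node size are illustrative assumptions, and mem_limit is assumed to be expressed as a fraction; only the two 0.9 factors come from the description.

```python
STARROCKS_CODE_COEFFICIENT = 0.9  # fixed coefficient in the StarRocks code, per the description
DEFAULT_MEM_LIMIT = 0.9           # default value of the mem_limit parameter, as a fraction


def available_memory_gib(total_node_memory_gib: float,
                         mem_limit: float = DEFAULT_MEM_LIMIT) -> float:
    """Return the memory that a BE/CN process can use, in GiB."""
    return total_node_memory_gib * STARROCKS_CODE_COEFFICIENT * mem_limit


# Hypothetical 64 GiB node with the default mem_limit.
print(round(available_memory_gib(64), 2))  # prints 51.84
```

Lowering mem_limit in this sketch lowers the threshold proportionally, for example 0.9 × 0.8 = 72% of total node memory.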

BE/CN Disk

Metric	Description
Disk usage	The ratio of used disk space to total disk capacity. Used space includes Data, Trash, and so on.
Used disk space	The amount of used disk space.
Disk Usage (Data)	The disk space occupied by data files on specific nodes.
Disk Usage (Data)	The disk usage of data files on specific nodes.

BE/CN Disk IO

Metric	Description
Read Traffic (SUM)	The read traffic of all disks per second on specific nodes.
Disk IO (Write)	The write traffic of all disks per second on specific nodes.
Disk IOPS (Read)	The number of read operations on all disks per second on specific nodes.
Disk IOPS (Write)	The number of write operations on all disks per second on specific nodes.
Disk IO Latency (Read)	The average read latency of all disks.
Disk IO Latency (Write)	The average write latency of all disks.
IO Util (Max)	The percentage of time that an I/O device, such as a disk or a network interface, is busy over a period of time.

BE/CN Net

Metric	Description
Net (In)	The amount of data that is received per second.
Net (Out)	The amount of data that is sent per second.
TCP connection count	The number of TCP connections.

Cache

Note: The metrics described in the following table are available only for compute-storage separation scenarios.

Metric	Description
FSLIB Cache Hit Ratio	The cache hit ratio per minute.
FSLIB Cache Hit/Miss	The number of cache hits per minute.

Storage

Note: The metrics described in the following table are available only for StarRocks shared-data instances.

Metric	Description
Storage	The amount of fully managed data. Unit: GiB.
Storage IO	The read and write traffic of fully managed data.

Resource Group

Metric	Description
Resource Group Use CPU Cores	The number of CPU cores used by a specific resource group. The value is an estimated average value within two consecutive sampling periods. This metric is available for StarRocks instances of V3.1.4 and later.
Resource Group CPU Usage (v2.x)	The ratio of the CPU time consumed by a specific resource group to the total CPU time.
Resource Group Mem Usage	The memory used by a specific resource group.
Running tasks	The number of query tasks that are running on a specific resource group.
Resource Group Concurrency Overflow	The number of queries that reach the concurrency limit in a specific resource group.
Number of times the large query limit is triggered	The number of times that the large query limit is reached in a specific resource group.
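
The two CPU metrics in this table can be read as a rate and a ratio. The following Python sketch shows one common way to derive them from cumulative CPU-time counters: divide the CPU time consumed between two consecutive samples by the elapsed time, and divide the CPU time of a resource group by the total CPU time. Whether EMR Serverless StarRocks computes the metrics exactly this way is an assumption. The `CpuSample` type, the function names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    timestamp_s: float  # wall-clock time of the sample, in seconds
    cpu_seconds: float  # cumulative CPU time used by the resource group


def average_cores_used(prev: CpuSample, curr: CpuSample) -> float:
    """Average number of CPU cores used between two consecutive samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.cpu_seconds - prev.cpu_seconds) / elapsed


def cpu_usage_ratio(group_cpu_seconds: float, total_cpu_seconds: float) -> float:
    """Share of the total CPU time that one resource group consumed."""
    if total_cpu_seconds <= 0:
        return 0.0
    return group_cpu_seconds / total_cpu_seconds


# Made-up samples taken 60 seconds apart.
prev = CpuSample(timestamp_s=0.0, cpu_seconds=100.0)
curr = CpuSample(timestamp_s=60.0, cpu_seconds=220.0)
print(average_cores_used(prev, curr))  # prints 2.0
print(cpu_usage_ratio(120.0, 480.0))   # prints 0.25
```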

Others

Metric	Description
Page Cache Hit Rate	The ratio of requests that hit the page cache to all page cache requests.
Publish Version Latency P99	The 99th percentile (P99) of the time consumed to publish a version when data is written to StarRocks.

Storage

Data Storage

Metric	Description
Storage	The amount of fully managed data. Unit: GiB. This metric is available only for StarRocks shared-data instances. The value of the metric is updated with a delay of about one hour.
Storage IO	The read and write traffic of fully managed data. This metric is available only for StarRocks shared-data instances.

Disk Usage

Compute-storage separation

Metric	Description
Disk usage	The disk usage.
Used disk space	The amount of disk space used.

In-memory computing

Metric	Description
Free space percentage	The percentage of the available space of specific nodes.
Disk Usage (Avail)	The available disk space of specific nodes.
Disk Usage (Data)	The disk space occupied by data files on specific nodes.
Disk Usage (Data)	The disk usage of data files on specific nodes.
Disk Usage (Sum)	The usage of the available, cache, and data files on the disk.
Disk Usage (Sum)

Disk IO

Metric	Description
Disk IO (Read)	The read traffic of all disks per second on specific nodes.
Disk IO (Write)	The write traffic of all disks per second on specific nodes.
Disk IOPS (Read)	The number of read operations on all disks per second on specific nodes.
Disk IOPS (Write)	The number of write operations on all disks per second on specific nodes.
Disk IO Latency (Read)	The average read latency of all disks.
Disk IO Latency (Write)	The average write latency of all disks.
IO Util (Max)	The percentage of time that an I/O device, such as a disk or a network interface, is busy over a period of time.
