Enable and use ack-sysom-monitor - Container Service for Kubernetes

System Observer Monitoring (SysOM) is an OS kernel-level container monitoring method. Container Service for Kubernetes (ACK) allows you to monitor containers at the OS kernel level based on SysOM. This capability can help you better deploy and migrate containerized applications and monitor containers. This topic describes how to enable and use ack-sysom-monitor. This topic also describes the SysOM metrics for container monitoring.

Prerequisites

An ACK managed cluster is created or an ACK Serverless cluster is created after October 2021, and the Kubernetes version of the cluster is 1.18.8 or later. For more information about how to create a cluster, see Create an ACK managed cluster and Create an ACK Serverless cluster. For more information about how to update a cluster, see Manually update ACK clusters.
Managed Service for Prometheus is enabled. For more information, see Enable Managed Service for Prometheus.

Introduction to ack-sysom-monitor

ack-sysom-monitor is a SysOM component that uses the extended Berkeley Packet Filter (eBPF) technology to collect node and container metrics at the kernel level. In addition to system metrics, ack-sysom-monitor provides enhanced metrics and supports pod kernel-level monitoring and node kernel-level monitoring to help you identify common issues, including system jitters, delays, resource leaks, and pod memory exceptions.

Billing of ack-sysom-monitor

After the ack-sysom-monitor component is enabled, related components automatically send monitoring metrics to Managed Service for Prometheus. These metrics are considered custom metrics, and fees are charged for custom metrics.

Before you enable this feature, we recommend that you read Billing overview to understand the billing rules of custom metrics. The fees may vary based on the cluster size and number of applications. You can follow the steps in View resource usage to monitor and manage resource usage.

Enable ack-sysom-monitor

Log on to the ARMS console.
In the left-side navigation pane, click Integration Center.
In the Infrastructure section of the Integration Center page, find and click SysOM System Observation.
In the Start Integration step of the SysOM System Observation panel, select the ACK cluster that you want to integrate and click OK.

Use ack-sysom-monitor

Procedure

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, click the SysOM tab to view the metrics provided by ack-sysom-monitor.
ack-sysom-monitor supports node kernel-level monitoring and pod kernel-level monitoring.
- Node kernel-level monitoring
  On the SysOM - Nodes tab, you can view the CPU, memory, scheduling, storage, and network metrics of a node.
- Pod kernel-level monitoring
  On the SysOM - Pods tab, you can view the memory, CPU, network, and I/O metrics of a pod in real time.

What to do next

If you want to disable kernel-level container monitoring based on SysOM, you can uninstall the ack-sysom-monitor component. This avoids incurring additional fees. For more information, see Manage components.

Metrics

The metrics provided by ack-sysom-monitor are defined based on the data model used by Prometheus.
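Because the metrics follow the Prometheus data model, they can be read through the standard Prometheus HTTP API like any other series. The following is a minimal sketch, assuming an endpoint that exposes the Prometheus instant-query API (`/api/v1/query`). The base URL is a placeholder and the response body is hand-written sample data, not output from ack-sysom-monitor.

```python
import json
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL of a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> list:
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]


# Placeholder endpoint (assumption): replace with the HTTP API URL of your
# Prometheus instance.
url = build_instant_query_url("http://prometheus.example.com:9090", "sysom_uptime")

# Hand-written sample response in the instant-vector shape; the label set and
# the value are illustrative only.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "sysom_uptime", "instance": "node-a"},
             "value": [1700000000.0, "86400"]},
        ],
    },
})
samples = parse_instant_vector(sample_body)
```

The labels attached to each SysOM series are not listed in this topic, so inspect a returned sample before you filter or aggregate by label.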

Node metrics

Node metrics include CPU, memory, storage, network, and other metrics.

Metrics related to CPUs and scheduling

Metric	Type	Unit	Description
sysom_proc_cpu_total	gauge	%	Displays information about the CPU uptime of a node. This metric indicates the ratio of the CPU uptime in a state to the total CPU uptime. The following states are supported: user mode, kernel mode, softirq, hardirq, idle, and iowait.
sysom_proc_cpus	gauge	%	Displays information about the uptime of a CPU on a node. This metric indicates the ratio of the uptime of a CPU in a state to the total uptime of the CPU. The following states are supported: user mode, kernel mode, softirq, hardirq, idle, and iowait.
sysom_proc_sirq	gauge	%	Displays information about softirq of a node. This metric indicates the number of times that each type of softirq occurs. Supported softirq types include HI, TIMER, NET_TX, NET_RX, BLOCK, IRQ_POLL, TASKLET, SCHED, HRTIMER, and RCU softirqs.
sysom_proc_stat_counters	gauge	-	Shows whether the node runs an excessive number of processes in the D state and provides information about the system load. This metric indicates the number of processes in the Running or D state, the system startup time, and the number of context switches.
sysom_proc_loadavg	gauge	-	Displays the load average of a node. This metric indicates the load average, including the runq length, load average within the previous 1 minute, load average within the previous 5 minutes, load average within the previous 15 minutes, and total number of system processes.
sysom_proc_schedstat	gauge	ns (nanoseconds)	Displays information about the scheduling latency of a node. This metric displays statistics related to CPU scheduling, including the waiting time of the processes in the queue of the current CPU and the length of the timeslice that runs in the current CPU.
sysom_cpu_dist	gauge	-	Displays the overall scheduling information of a node. This metric indicates the interval between the time when a process releases the CPU and the next time when the process is scheduled onto the CPU. The metric also counts the number of times that this interval falls into each of the following ranges: 1us, 10us, 100us, 1ms, 10ms, 100ms, and 1s.
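The per-state ratio described for sysom_proc_cpu_total and sysom_proc_cpus can be illustrated with the aggregate `cpu` line of `/proc/stat`. The sketch below computes the share of elapsed CPU time per state between two samples. Whether ack-sysom-monitor derives its values this way is an assumption, and the sample lines are hand-written rather than measured.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# Guest time is already included in "user", so the guest fields are skipped.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse a 'cpu ...' line from /proc/stat into named tick counters."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def state_shares(before: dict, after: dict) -> dict:
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hand-written sample lines, not real measurements.
first = parse_cpu_line("cpu 1000 0 500 8000 300 50 150 0 0 0")
second = parse_cpu_line("cpu 1400 0 700 8200 400 60 240 0 0 0")
shares = state_shares(first, second)
```

In this sample, 400 of the 1,000 elapsed ticks fall in user mode, so `shares["user"]` is 40 percent. Here `system` corresponds to kernel mode and `irq` to hardirq.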

Metrics related to memory

Metric	Type	Unit	Description
sysom_proc_meminfo	gauge	KiB	Displays the usage of different types of memory resources on a node. This metric indicates the memory usage, including but not limited to the total memory (Total), free memory (Free), available memory (Available), caches (Cache), buffers (Buffers), reclaimable memory (SReclaimable), and unreclaimable memory (SUnreclaim).
sysom_proc_vmstat	gauge	-	Displays the memory usage and memory events of a node in detail. This metric indicates the memory statistics of different pages and memory events, including free pages (Free Pages), dirty pages (Dirty Pages), page reads and writes (Pages Read/Write), the number of pages reclaimed from the Inactive list (Pages Reclaimed from Inactive List), and the number of times that the Out-of-Memory (OOM) killer kills applications.
sysom_proc_buddyinfo	gauge	-	Displays information about how the buddy system allocates and releases kernel memory. This metric indicates the detailed information about the kernel buddy system, including all memory nodes and zones and the number of blocks in different sizes in linked lists.
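To make the "number of blocks in different sizes" concrete, the following sketch parses one line in the `/proc/buddyinfo` layout and converts the per-order block counts into a total of free pages. It assumes that sysom_proc_buddyinfo mirrors this kernel interface, which the metric name suggests but this topic does not state. The sample line is hand-written.

```python
def parse_buddyinfo_line(line: str) -> tuple:
    """Parse one /proc/buddyinfo line into (node, zone, free_block_counts)."""
    parts = line.split()
    # Layout: "Node", "<id>,", "zone", "<name>", then one count per order.
    node = int(parts[1].rstrip(","))
    zone = parts[3]
    counts = [int(value) for value in parts[4:]]
    return node, zone, counts


def free_pages(counts: list) -> int:
    """Total free pages: a block of order n spans 2**n contiguous pages."""
    return sum(count << order for order, count in enumerate(counts))


# Hand-written sample line, not a real measurement.
node, zone, counts = parse_buddyinfo_line(
    "Node 0, zone   Normal   4 3 2 1 0 0 0 0 0 0 1"
)
total = free_pages(counts)
```

A zone whose free memory sits almost entirely in low-order blocks is fragmented: the page total can look healthy while large contiguous allocations still fail.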

Metrics related to storage

Metric	Type	Unit	Description
sysom_proc_disks	gauge	-	Displays information about the input, output, IOPS, and latency of each disk on a node. This metric indicates disk and partition statistics, including the number of read and write requests completed by a partition, total amount of time used to complete the read and write requests, number of times that read and write requests are merged, and number of inflight read and write requests.
sysom_fs_stat	gauge	-	Displays the usage of file systems mounted to a node. This metric indicates the usage of a file system, including the mount target of the file system, block size, number of used blocks and number of available blocks, and number of used inodes and number of available inodes.
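The block and inode figures described for sysom_fs_stat correspond to what the POSIX `statvfs` call reports for a mounted file system. The sketch below reads them with Python's `os.statvfs` on a Unix-like host. It illustrates the quantities only and is not how ack-sysom-monitor collects them, and the dictionary keys are names chosen for this example.

```python
import os


def filesystem_usage(mount_point: str) -> dict:
    """Summarise block and inode usage for the file system at mount_point."""
    st = os.statvfs(mount_point)
    return {
        "mount_point": mount_point,
        "block_size": st.f_frsize,            # bytes per block
        "blocks_used": st.f_blocks - st.f_bfree,
        "blocks_available": st.f_bavail,      # usable by unprivileged users
        "inodes_used": st.f_files - st.f_ffree,
        "inodes_available": st.f_favail,
    }


usage = filesystem_usage("/")
used_bytes = usage["blocks_used"] * usage["block_size"]
```

Tracking inodes alongside blocks matters because a file system can run out of inodes while free blocks remain, which surfaces as "no space left on device" errors.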

Metrics related to networks

Metric	Type	Unit	Description
sysom_proc_networks	gauge	-	Displays information about the data transfer of the network interface cards (NICs) on a node. This metric indicates the data transfer information of an NIC, including the total number of data packets received or sent by the NIC, total number of bytes, total number of data packets discarded by the device driver, and total number of data packets that fail to be sent or received.
sysom_proc_pkt_status	gauge	-	Displays information about data packets processed by the network protocol stack of a node. This metric indicates the number of events that occur when data packets pass through the network protocol stack, including the number of times of packet loss, the number of overflows, and the number of invalid assertions.
sysom_sock_stat	gauge	-	This metric can help identify the insufficient socket or buffer issue caused by the application logic or system parameters. The metric displays statistics about the usage of sockets and buffers, including the usage of total, raw, TCP, and UDP sockets, the number of sockets in the TCP time wait or orphan state, and the memory usage of TCP and UDP sockets.
sysom_softnets	gauge	-	Displays information about data packets received by the NIC softirqs of each CPU on a node. This metric indicates statistics about the NIC softirqs of a CPU, including the number of packets received or sent by a softirq and the number of times that the net_rx_action function is called to handle packet reception softirqs.
sysom_net_health_hist	gauge	-	Displays the trend of the round-trip time (RTT) of all TCP connections on a node. This metric indicates the trend of the RTT of all TCP connections on a node. It counts the number of connections that correspond to each average RTT value, such as 10 milliseconds, 100 milliseconds, and 1 second.
sysom_net_health_count	gauge	-	This metric is similar to the `sysom_net_health_hist` metric. This metric indicates the average RTT of TCP connections.
sysom_net_retrans_count	gauge	-	Displays retransmission information about all TCP connections on a node. This metric indicates the types of data packets that are retransmitted through TCP connections and the number of retransmitted data packets of each type (such as SYN, SYN-ACK, and RESET packets), including the number of packets retransmitted due to retransmission timeouts.
sysom_net_tcp_count	gauge	-	Displays basic information about the TCP connections on a node. This metric indicates statistics about TCP connections, including the number of active TCP connections, number of TCP segments received or sent, number of TCP segments retransmitted, and number of packets that fail to be received.
sysom_net_udp_count	gauge	-	Displays basic information about the UDP connections on a node. This metric indicates statistics about UDP connections, including the number of UDP packets received or sent, the number of times that the UDP send or receive buffer encounters errors, and the number of data packets that encounter errors because no ports are available.
sysom_net_ip_count	gauge	-	Displays basic information about the IP layer of a node. This metric indicates statistics about the IP layer, including the number of data packets that are forwarded, received, or sent.
sysom_net_icmp_count	gauge	-	Displays basic information about the ICMP protocol of a node. This metric indicates statistics about the ICMP protocol, including the number of data packets that are received or sent by ICMP and the number of data packets that fail to be received or sent.

Other system metrics

Metric	Type	Unit	Description
sysom_cgroups	gauge	-	Displays the number of cgroups used by different cgroup subsystems to help you identify cgroup leaks. This metric indicates the number of cgroups in different cgroup subsystems, including the CPU, Cpuacct, Memory, Pids, Blkio, and Devices subsystems.
sysom_uptime	gauge	s (seconds)	Displays system loads. This metric indicates the uptime of the system from the time when the system starts up to the current time. This metric also indicates the idle time of the system.

Metrics related to containers

Container metrics include CPU, memory, IO, network, and other metrics.

Metrics related to CPUs and scheduling

Metric	Type	Unit	Description
sysom_container_cpu_stat	gauge	-	Helps you monitor and assess whether resource quotas need to be adjusted or other optimizations are required. This metric indicates statistics about CPU limits for containers, including the number of times that CPU limits are enforced in each cgroup, total number of times that CPU limits are enforced, and duration of CPU limit enforcement.
sysom_container_cpu_acctstat	gauge	%	Displays the CPU usage information of containers. This metric indicates the CPU utilization of tasks in a container that runs in each mode, including the CPU utilization in user mode, CPU utilization in kernel mode, and total CPU utilization.
sysom_container_cpu_cfsquota	gauge	-	Displays the period of time during which a container is limited by the Completely Fair Scheduler (CFS). This metric indicates the amount of time that a container can run within each CFS time window, including the cfs_period_us and cfs_quota_us parameters. The cfs_period_us parameter indicates the length of the time window. The cfs_quota_us parameter indicates the total amount of time for which processes or tasks in the cgroup can use CPU resources within the time window (cfs_period_us).
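The relationship between cfs_quota_us and cfs_period_us can be expressed as a CPU limit in cores: the quota divided by the period. The sketch below shows that arithmetic. The function name is chosen for this example, and the handling of a negative quota follows the cgroup v1 convention in which -1 means no limit.

```python
from typing import Optional


def cpu_limit_cores(cfs_quota_us: int, cfs_period_us: int) -> Optional[float]:
    """Return the CPU limit in cores, or None when no limit is configured."""
    if cfs_period_us <= 0:
        raise ValueError("cfs_period_us must be positive")
    if cfs_quota_us < 0:
        # cgroup v1 writes -1 to cpu.cfs_quota_us when the group is unlimited.
        return None
    return cfs_quota_us / cfs_period_us


half_core = cpu_limit_cores(50_000, 100_000)    # 50 ms of CPU time per 100 ms window
two_cores = cpu_limit_cores(200_000, 100_000)   # 200 ms of CPU time per 100 ms window
unlimited = cpu_limit_cores(-1, 100_000)
```

A quota larger than the period is valid: it lets the tasks in the cgroup run on more than one CPU within the same window.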

Metrics related to memory

Metric	Type	Unit	Description
sysom_container_memory_stat	gauge	KiB	Displays the usage of different types of memory resources in containers. This metric indicates statistics about the memory usage of containers, including the total memory (Total), free memory (Free), available memory (Available), caches (Cache), buffers (Buffers), reclaimable memory (SReclaimable), and unreclaimable memory (SUnreclaim).
sysom_container_memory_filecache	gauge	KiB	This metric helps you quickly learn the usage of page caches in containers and identify issues such as insufficient memory, memory latency, and memory jitters caused by overuse of page caches. The metric indicates the usage of page caches in containers, including the top 10 files that occupy the most page caches in each container, the size of each file, and the total size of page caches that are occupied.
sysom_container_memory_gdrcm_latency	gauge	Times	Displays the number of delays caused by memory reclamation due to insufficient memory resources and the duration of the delays. This metric counts these delays by duration: 1 to 5 milliseconds, 5 to 10 milliseconds, 10 to 100 milliseconds, 100 to 500 milliseconds, 500 to 1,000 milliseconds, and more than 1,000 milliseconds.
sysom_container_memory_cdrcm_latency	gauge	Times	Displays the number of delays caused by memory reclamation due to insufficient memory cgroups and the duration of the delays. Note: This metric is valid only if the current memory cgroups are non-root cgroups or memory limits are configured for the current memory cgroups. This metric counts these delays by duration: 1 to 5 milliseconds, 5 to 10 milliseconds, 10 to 100 milliseconds, 100 to 500 milliseconds, 500 to 1,000 milliseconds, and more than 1,000 milliseconds.
sysom_container_memory_cpt_latency	gauge	Times	Displays the number of delays caused by kernel memory adjustment. When a process in a container applies for memory resources, memory adjustment is triggered if the node has insufficient memory or an excessive number of memory fragments exists. This metric counts these delays by duration: 1 to 5 milliseconds, 5 to 10 milliseconds, 10 to 100 milliseconds, 100 to 500 milliseconds, 500 to 1,000 milliseconds, and more than 1,000 milliseconds.
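The three latency metrics above report counts per duration range rather than individual delays. The sketch below shows how a list of delays maps onto those ranges. Two points are assumptions made for this example because the table does not state them: each range includes its lower bound and excludes its upper bound, and delays shorter than 1 millisecond are not counted. The sample delays are hand-written.

```python
from bisect import bisect_right
from collections import Counter

# Range boundaries in milliseconds, taken from the ranges listed in the table.
EDGES_MS = [1, 5, 10, 100, 500, 1000]
LABELS = ["1-5ms", "5-10ms", "10-100ms", "100-500ms", "500-1000ms", ">=1000ms"]


def count_delays(delays_ms: list) -> Counter:
    """Count delays per range; delays below 1 ms fall outside every range."""
    counts = Counter()
    for delay in delays_ms:
        index = bisect_right(EDGES_MS, delay)
        if index == 0:
            continue  # shorter than 1 ms: not part of any listed range
        counts[LABELS[index - 1]] += 1
    return counts


# Hand-written sample delays in milliseconds, not real measurements.
histogram = count_delays([0.4, 3, 7, 7.5, 250, 1200])
```

A rising count in the upper ranges is the signal to look for: it means that processes in the container are stalling for long periods while the kernel frees or rearranges memory.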

Metrics related to IO

Metric	Type	Unit	Description
sysom_container_blkio_stat	gauge	-	Displays basic IO information about containers. This metric indicates the IO statistics of a disk used by a container, including the number and bytes of read or write requests to the disk, the number and bytes of read or write requests that are submitted to the queue, and the waiting time of the read or write requests.

Metrics related to networks

Metric	Type	Unit	Description
sysom_container_network_stat	gauge	-	Displays basic data transfer information about containers. This metric indicates the data transfer statistics of a virtual NIC, including the number of data packets or bytes received or sent by the virtual NIC and the number of data packets that are discarded by the virtual NIC device. Data packets that are discarded by the network protocol stack are not taken into account.