Infrastructure monitoring and operating system monitoring

Elastic Compute Service (ECS) provides two types of metrics for monitoring resources such as CPU utilization and disk usage: infrastructure monitoring metrics and operating system monitoring metrics. Infrastructure monitoring metrics are collected by ECS from the host. This agentless method provides an external perspective and does not require you to install a probe. Operating system monitoring metrics are collected by the CloudMonitor agent installed on the ECS instance. This agent-based method provides an internal perspective and collects metrics from within the operating system. This topic describes the collection methods, scenarios, and definitions for these two types of metrics.

Differences between infrastructure monitoring and operating system monitoring

Comparison	Infrastructure monitoring	Operating system monitoring
Monitoring location	Virtualization stack	Inside the virtual machine's operating system
Collection frequency	Once per minute	Once per second
Aggregated output	None	Data is sampled once per second and aggregated into a data point every 15 seconds. Three metrics are generated: minimum (min), average (avg), and maximum (max).
Installation requirements	No probe required. Ready to use out of the box.	Requires the CloudMonitor agent to be installed.
Pros	No extra resource overhead. Wide applicability. Unaffected by high workloads running on the instance.	Higher data precision. Can be associated with processes to diagnose issues such as "Steal Time".
Cons	Low precision. Cannot detect burst CPU fluctuations. Cannot be associated with the overhead of specific processes.	Requires installation and maintenance. Incurs resource overhead. Data may be lost if the virtual machine (VM) hangs or experiences startup or shutdown issues.
Typical scenarios	Infrastructure monitoring for an instance is not affected by the VM's running state. It is suitable for troubleshooting failures such as instance hangs or breakdowns. However, its low sampling frequency makes it unsuitable for scenarios that require capturing rapid performance fluctuations.	Application performance diagnostics, real-time monitoring, and alerting.

Infrastructure monitoring

ECS collects instance monitoring data from the host. You do not need to install an operating system (OS)-level plugin. This feature is ready to use out of the box.

Collection and reporting

The collection probe on the host gathers one data point per minute for the instance. This data point represents the average usage value for that one-minute interval.
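
Because each data point is a one-minute average, a short spike is flattened into the surrounding quiet period. The following minimal sketch illustrates the effect with made-up per-second readings. The readings and the `one_minute_average` helper are assumptions for illustration only and do not describe how the host probe works internally.

```python
from statistics import mean


def one_minute_average(per_second_usage):
    """Collapse 60 per-second CPU readings (in %) into one averaged data point."""
    if len(per_second_usage) != 60:
        raise ValueError("expected exactly 60 one-second readings")
    return mean(per_second_usage)


# Hypothetical minute: 55 quiet seconds at 10% plus a 5-second burst at 100%.
readings = [10.0] * 55 + [100.0] * 5
point = one_minute_average(readings)

# The burst is no longer visible in the single averaged value.
print(f"one-minute data point: {point:.1f}%")
```

With these inputs the single data point is 17.5%, which is why this method is not suited to capturing burst CPU fluctuations.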

Metrics

Infrastructure monitoring data for an ECS instance is collected at a one-minute granularity. The following table describes the metrics.

Note

If the chart displays data points at a one-minute granularity, the values for Maximum, Minimum, and Average are the same.

Metric Name	Description	Unit	MetricName	Dimensions	Statistics
(ECS) CPU utilization	CPU usage	%	CPUUtilization	userId, instanceId	Maximum, Minimum, Average
(ECS) Inbound Internet bandwidth (classic network)	Average rate of inbound Internet traffic	bit/s	InternetInRate	userId, instanceId	Maximum, Minimum, Average
(ECS) Inbound private network bandwidth	Average rate of inbound private network traffic	bit/s	IntranetInRate	userId, instanceId	Maximum, Minimum, Average
(ECS) Outbound Internet bandwidth (classic network)	Average rate of outbound Internet traffic	bit/s	InternetOutRate	userId, instanceId	Maximum, Minimum, Average
(ECS) Outbound private network bandwidth	Average rate of outbound private network traffic	bit/s	IntranetOutRate	userId, instanceId	Maximum, Minimum, Average
(ECS) Read BPS for all disks	Total bytes read from all disks per second	Byte/s	DiskReadBPS	userId, instanceId	Maximum, Minimum, Average
(ECS) Write BPS for all disks	Total bytes written to all disks per second	Byte/s	DiskWriteBPS	userId, instanceId	Maximum, Minimum, Average
(ECS) Read IOPS for all disks	Read IOPS for all disks	counts/s	DiskReadIOPS	userId, instanceId	Maximum, Minimum, Average
(ECS) Write IOPS for all disks	Write IOPS for all disks	counts/s	DiskWriteIOPS	userId, instanceId	Maximum, Minimum, Average
(ECS) Inbound Internet bandwidth by IP address	Inbound Internet bandwidth	bit/s	VPC_PublicIP_InternetInRate	userId, instanceId, ip	Maximum, Minimum, Average
(ECS) Outbound Internet bandwidth by IP address	Outbound Internet bandwidth	bit/s	VPC_PublicIP_InternetOutRate	userId, instanceId, ip	Maximum, Minimum, Average
(ECS) Outbound Internet bandwidth utilization by IP address	Outbound Internet bandwidth usage	%	VPC_PublicIP_InternetOutRate_Percent	userId, instanceId, ip	Average
(ECS) Inbound Internet traffic (classic network)	Inbound Internet traffic	Byte	InternetIn	userId, instanceId	Average, Minimum, Maximum, Sum
(ECS) Outbound Internet traffic (classic network)	Outbound Internet traffic	Byte	InternetOut	userId, instanceId	Maximum, Minimum, Average

View infrastructure monitoring data

Log on to the Cloud Monitor console.
In the left-side navigation pane, choose Cloud Resource Monitoring > Host Monitoring.
On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.
Click the Basic Monitoring tab.
On the Basic Monitoring tab, view the infrastructure monitoring data of the target host. You can also set alert rules for metrics and view alerts. For more information, see Create an alert rule for a host and View alerts.

Operating system monitoring

CloudMonitor collects a wide range of OS-level metrics using the CloudMonitor agent installed on Alibaba Cloud hosts (ECS instances) and non-Alibaba Cloud hosts. You can set alert rules for these metrics. When a metric triggers an alert rule, CloudMonitor sends you an alert notification so you can promptly address the issue.

Prerequisites

Make sure that you have installed the CloudMonitor agent on your Alibaba Cloud hosts (ECS instances) and non-Alibaba Cloud hosts.

Collection and reporting

The CloudMonitor host probe samples data once per second. The data is then aggregated into a single data point every 15 seconds before being reported to the server. Each report includes three values for the 15-second interval: min (the minimum value), max (the maximum value), and avg (the average value).
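
The aggregation described above can be sketched as follows. This is an illustrative reimplementation, not the agent's code: the function name `aggregate_windows` and the sample values are assumptions.

```python
def aggregate_windows(samples, window=15):
    """Group per-second samples into fixed windows and return min/avg/max for each.

    A trailing partial window is dropped because it is not yet complete.
    """
    points = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        points.append({
            "min": min(chunk),
            "avg": sum(chunk) / len(chunk),
            "max": max(chunk),
        })
    return points


# Hypothetical minute: 55 quiet seconds at 10% plus a 5-second burst at 100%.
samples = [10.0] * 55 + [100.0] * 5
points = aggregate_windows(samples)

for index, point in enumerate(points):
    print(index, point)
```

For this input, the first three windows report 10% for all three values, and the last window reports min 10%, avg 40%, and max 100%, so the burst remains visible in the `max` value.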

Metrics

Operating system monitoring metrics are reported at a 15-second granularity. The metrics are categorized as follows:

CPU-related metrics

Windows
The `NtQuerySystemInformation` function in `ntdll` is called to obtain the time that the CPU spends in each state. Calling this function twice at an interval makes it possible to calculate the percentage of time spent in each state during that interval.
Linux
For more information about the metrics in the following table, see the output of the top command.

Metric Name	Description	Unit	MetricName	Dimensions	Statistics	Description (Linux only)
(Agent) cpu.idle	Percentage of idle CPU.	%	cpu_idle	userId, instanceId	Maximum, Minimum, Average	The percentage of time that the CPU is idle.
(Agent) cpu.system	Percentage of CPU time spent in kernel space.	%	cpu_system	userId, instanceId	Maximum, Minimum, Average	Overhead from system context switches. A high value for this metric indicates that too many processes or threads are running on the server.
(Agent) cpu.user	Percentage of CPU time spent in user space.	%	cpu_user	userId, instanceId	Maximum, Minimum, Average	CPU consumption by user processes.
(Agent) cpu.wait	Percentage of CPU time spent waiting for I/O operations.	%	cpu_wait	userId, instanceId	Maximum, Minimum, Average	A high value for this metric indicates frequent I/O operations.
(Agent) cpu.other	Percentage of CPU time spent on other tasks.	%	cpu_other	userId, instanceId	Maximum, Minimum, Average	Other consumption = Nice + SoftIrq + Irq + Stolen.
(Agent) cpu.total	Total percentage of CPU consumed.	%	cpu_total	userId, instanceId	Maximum, Minimum, Average	CPU usage = 100% - cpu.idle
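
On Linux, percentages like these are typically derived from the difference between two readings of the cumulative tick counters in the aggregate `cpu` line of `/proc/stat`. The sketch below assumes that data source and uses hypothetical helper names (`parse_cpu_line`, `cpu_percentages`); it mirrors the formulas in the table but is not the agent's implementation.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Convert two counter snapshots into the percentages listed in the table."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    pct = {name: 100.0 * ticks / total for name, ticks in delta.items()}
    return {
        "cpu_idle": pct["idle"],
        "cpu_system": pct["system"],
        "cpu_user": pct["user"],
        "cpu_wait": pct["iowait"],
        # Other consumption = Nice + SoftIrq + Irq + Stolen
        "cpu_other": pct["nice"] + pct["softirq"] + pct["irq"] + pct["steal"],
        # CPU usage = 100% - idle
        "cpu_total": 100.0 - pct["idle"],
    }


# Two made-up snapshots taken some time apart.
first = parse_cpu_line("cpu 100 0 50 800 50 0 0 0")
second = parse_cpu_line("cpu 200 10 70 1650 60 5 5 0")
result = cpu_percentages(first, second)
print(result)
```

On a real host, the two lines would come from reading `/proc/stat` twice with a pause in between.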

Memory-related metrics

Windows
The `GlobalMemoryStatusEx` function in `kernel32.dll` is called to obtain the current usage of physical and virtual memory for a 32-bit Windows operating system.
Linux
For more information about the metrics in the following table, see the output of the free command. The data source is /proc/meminfo.

Metric	Description	Unit	MetricName	Dimensions	Statistics	Description (Linux only)
(Agent) memory.total.space	Total memory.	Byte	memory_totalspace	userId, instanceId	Maximum, Minimum, Average	The total amount of memory on the server. This corresponds to MemTotal in /proc/meminfo.
(Agent) memory.free.space	Amount of free memory.	Byte	memory_freespace	userId, instanceId	Maximum, Minimum, Average	The amount of available memory in the system. This corresponds to MemFree in /proc/meminfo.
(Agent) memory.used.space	Amount of used memory.	Byte	memory_usedspace	userId, instanceId	Maximum, Minimum, Average	The amount of used memory in the system. Calculation method: total - free.
(Agent) memory.actualused.space	The amount of memory consumed by the user.	Byte	memory_actualusedspace	userId, instanceId	Maximum, Minimum, Average	Calculation method: If MemAvailable is present in /proc/meminfo: total - MemAvailable. If MemAvailable is not present in /proc/meminfo: used - buffers - cached. Note: On systems that use a newer Linux kernel, such as CentOS 7.2 and Ubuntu 16.04 or later, memory estimation is more accurate. For more information about the specific meaning of MemAvailable, see this commit.
(Agent) memory.free.utilization	Percentage of free memory.	%	memory_freeutilization	userId, instanceId	Maximum, Minimum, Average	Calculation method: If MemAvailable is present in /proc/meminfo: (MemAvailable / total) × 100%. If MemAvailable is not present in /proc/meminfo: ((total - actualused) / total) × 100%.
(Agent) memory.used.utilization	Memory usage.	%	memory_usedutilization	userId, instanceId	Maximum, Minimum, Average	Calculation method: If MemAvailable is present in /proc/meminfo: ((total - MemAvailable) / total) × 100%. If MemAvailable is not present in /proc/meminfo: ((total - free - buffers - cached) / total) × 100%.
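
The calculation methods in the table can be expressed as a short sketch over `/proc/meminfo`-style text. The function names and the sample values are assumptions for illustration. Only the formulas come from the table.

```python
def parse_meminfo(text):
    """Return {field: bytes} for the kB-valued lines of /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        # Only lines such as "MemTotal: 8000000 kB" carry a size in kB.
        if len(fields) == 2 and fields[1] == "kB":
            info[key.strip()] = int(fields[0]) * 1024
    return info


def memory_metrics(info):
    """Apply the table's formulas, preferring MemAvailable when it is present."""
    total, free = info["MemTotal"], info["MemFree"]
    used = total - free
    if "MemAvailable" in info:
        actual_used = total - info["MemAvailable"]
    else:
        actual_used = used - info.get("Buffers", 0) - info.get("Cached", 0)
    return {
        "memory_totalspace": total,
        "memory_freespace": free,
        "memory_usedspace": used,
        "memory_actualusedspace": actual_used,
        "memory_freeutilization": 100.0 * (total - actual_used) / total,
        "memory_usedutilization": 100.0 * actual_used / total,
    }


# Made-up sample in the format of /proc/meminfo.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    6000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
"""

metrics = memory_metrics(parse_meminfo(SAMPLE))
print(metrics["memory_usedutilization"], metrics["memory_freeutilization"])
```

With this sample, used utilization is 25% and free utilization is 75%. If the `MemAvailable` line is removed, the fallback formula gives a used utilization of 47.5%.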

System average load metrics

Windows
These metrics are not available on Windows.
Linux
For more information about the metrics in the following table, see the output of the top command. A higher value indicates a busier system.

Metric Name	Description	Unit	MetricName	Dimensions	Statistics
(Agent) load.1m	Average system load over the past 1 minute.	None	load_1m	userId, instanceId	Maximum, Minimum, Average
(Agent) load.5m	Average system load over the past 5 minutes.	None	load_5m	userId, instanceId	Maximum, Minimum, Average
(Agent) load.15m	Average system load over the past 15 minutes.	None	load_15m	userId, instanceId	Maximum, Minimum, Average
(Agent) load.1m.percore	Average system load per CPU core over the past 1 minute.	None	load_per_core_1m	userId, instanceId	Maximum, Minimum, Average
(Agent) load.5m.percore	Average system load per CPU core over the past 5 minutes.	None	load_per_core_5m	userId, instanceId	Maximum, Minimum, Average
(Agent) load.15m.percore	Average system load per CPU core over the past 15 minutes.	None	load_per_core_15m	userId, instanceId	Maximum, Minimum, Average
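
The per-core metrics can be understood as the load averages divided by the number of CPU cores. The sketch below assumes that definition. `load_per_core` is a hypothetical helper, and `os.getloadavg` is only available on Unix-like systems.

```python
import os


def load_per_core(load_averages, cores):
    """Divide the 1-, 5-, and 15-minute load averages by the core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(load / cores for load in load_averages)


# Made-up example: a 4-core host with load averages 4.0, 2.0, and 1.0.
example = load_per_core((4.0, 2.0, 1.0), 4)
print(example)

if __name__ == "__main__" and hasattr(os, "getloadavg"):
    # On a Linux host, read the live values instead of the made-up ones.
    print(load_per_core(os.getloadavg(), os.cpu_count() or 1))
```

A per-core value near or above 1 suggests that, on average, every core has at least one runnable task waiting or running.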

Disk-related metrics

Windows
First, the `GetDiskFreeSpaceExA` function in `Kernel32.dll` is called to retrieve the available disk space. This provides the used storage space, disk usage, free storage space, and total storage space of the disk. Then, the `RegConnectRegistryA` function is called to connect to the `HKEY_PERFORMANCE_DATA` registry. Finally, the `RegQueryValueExA` function is called to query disk-related properties from the `HKEY_PERFORMANCE_DATA` registry. These properties include read count, write count, bytes written, bytes read, time spent reading, time spent writing, and disk usage time.
Linux
For more information about disk usage and inode usage, see the output of the df command. For more information about disk reads and writes, see the output of the iostat command. This information helps you understand the metrics in the following table.

Metric	Description	Unit	MetricName	Dimensions	Statistics
Host.diskusage.used	Used disk storage space.	Byte	diskusage_used	userId, instanceId, device	Maximum, Minimum, Average
Host.diskusage.utilization	Disk usage for regular users.	%	diskusage_utilization	userId, instanceId, device	Maximum, Minimum, Average
Host.diskusage.free	Free disk storage space for regular users and superusers.	Byte	diskusage_free	userId, instanceId, device	Maximum, Minimum, Average
(Agent) disk.usage.avail_device	Free disk storage space for regular users.	Byte	diskusage_avail	userId, instanceId, device	Maximum, Minimum, Average
Host.diskusage.total	Total disk storage space.	Byte	diskusage_total	userId, instanceId, device	Maximum, Minimum, Average
(Agent) disk.read.bps_device	Bytes read from the disk per second.	Byte/s	disk_readbytes	userId, instanceId, device	Maximum, Minimum, Average
(Agent) disk.write.bps_device	Bytes written to the disk per second.	Byte/s	disk_writebytes	userId, instanceId, device	Maximum, Minimum, Average
(Agent) disk.read.iops_device	Number of read requests to the disk per second.	counts/s	disk_readiops	userId, instanceId, device	Maximum, Minimum, Average
(Agent) disk.write.iops_device	Number of write requests to the disk per second.	counts/s	disk_writeiops	userId, instanceId, device	Maximum, Minimum, Average
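
The difference between free space for all users and free space for regular users maps onto the `f_bfree` and `f_bavail` fields that `statvfs` reports on Linux. The sketch below assumes that mapping and follows the `df` convention of computing utilization as used / (used + available). The helper name and the sample numbers are illustrative, not taken from the agent.

```python
import os
from types import SimpleNamespace


def disk_usage_from_statvfs(st):
    """Derive the disk usage metrics from a statvfs-like result."""
    block = st.f_frsize
    total = st.f_blocks * block
    free = st.f_bfree * block    # free for regular users and superusers
    avail = st.f_bavail * block  # free for regular users only
    used = total - free
    denominator = used + avail
    utilization = 100.0 * used / denominator if denominator else 0.0
    return {
        "diskusage_total": total,
        "diskusage_used": used,
        "diskusage_free": free,
        "diskusage_avail": avail,
        "diskusage_utilization": utilization,
    }


# Made-up file system: 1000 blocks of 4 KiB, 300 free, 250 usable by regular users.
sample = SimpleNamespace(f_frsize=4096, f_blocks=1000, f_bfree=300, f_bavail=250)
usage = disk_usage_from_statvfs(sample)
print(usage)

if __name__ == "__main__" and hasattr(os, "statvfs"):
    # On a Linux host, inspect the root file system.
    print(disk_usage_from_statvfs(os.statvfs("/")))
```

The gap between `diskusage_free` and `diskusage_avail` is the space that the file system reserves for the superuser.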

File system metrics

Windows
This metric is not available on Windows.
Linux
For more information about the metrics in the following table, see the output of the df command.

Metric Name	Description	Unit	MetricName	Dimensions	Statistics	Description (Linux only)
(Agent) fs.inode.utilization_device	inode usage.	%	fs_inodeutilization	userId, instanceId, device	Maximum, Minimum, Average	Linux systems use inode numbers instead of filenames to identify files. If the disk is not full but all inodes are allocated, you cannot create new files on the disk. Therefore, you need to monitor inode usage. The number of inodes represents the number of files in the file system. Many small files can lead to high inode usage.
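
As a rough illustration, inode usage can be derived from the total and free inode counts that `statvfs` exposes as `f_files` and `f_ffree`. The helper below is a hypothetical sketch of that calculation, not the agent's code.

```python
import os


def inode_utilization(total_inodes, free_inodes):
    """Return inode usage in percent, or None if the file system reports no inodes."""
    if total_inodes <= 0:
        # Some file systems allocate inodes dynamically and report a total of 0.
        return None
    return 100.0 * (total_inodes - free_inodes) / total_inodes


# Made-up file system with 1000 inodes, 250 of them still free.
example = inode_utilization(1000, 250)
print(example)

if __name__ == "__main__" and hasattr(os, "statvfs"):
    # On a Linux host, inspect the root file system.
    st = os.statvfs("/")
    print(inode_utilization(st.f_files, st.f_ffree))
```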

Network-related metrics

Windows
First, the `GetAdaptersAddresses` function in `iphlpapi.dll` is called to retrieve the adapter addresses on the local machine. Then, the `GetIfTable` function is called to retrieve network metrics for each interface. These metrics include bits received per second, bits sent per second, packets received per second, packets sent per second, received error packets, and sent error packets.
Linux
- For more information about the collection of TCP connection counts, see the output of the ss command.
  Note
  The TCP connection count refers to all connections that use the TCP protocol on the ECS host.
  By default, the following TCP connection states are collected: TCP_TOTAL (total connections), ESTABLISHED (connections in the established state), and NON_ESTABLISHED (connections in non-established states, which includes all states other than ESTABLISHED).
- For more information about the network-related metrics in the following table, see the output of the iftop command.

Metric Name	Description	Unit	MetricName	Dimensions	Statistics
(Agent) network.in.rate_device	Bits received by the network interface card (NIC) per second, which is the downstream bandwidth of the NIC.	bit/s	networkin_rate	userId, instanceId, device	Maximum, Minimum, Average
(Agent) network.out.rate_device	Bits sent by the NIC per second, which is the upstream bandwidth of the NIC.	bit/s	networkout_rate	userId, instanceId, device	Maximum, Minimum, Average
(Agent) network.in.packages_device	Packets received by the NIC per second.	packets/s	networkin_packages	userId, instanceId, device	Maximum, Minimum, Average
(Agent) network.out.packages_device	Packets sent by the NIC per second.	packets/s	networkout_packages	userId, instanceId, device	Maximum, Minimum, Average
(Agent) network.in.errorpackages_device	Number of received error packets detected by the device driver.	packets/s	networkin_errorpackages	userId, instanceId, device	Maximum, Minimum, Average
(Agent) network.out.errorpackages_device	Number of sent error packets detected by the device driver.	packets/s	networkout_errorpackages	userId, instanceId, device	Maximum, Minimum, Average
(Agent) network.tcp.connection_state	Number of TCP connections in various states, including the following: LISTEN, SYN_SENT, ESTABLISHED, SYN_RECV, FIN_WAIT1, CLOSE_WAIT, FIN_WAIT2, LAST_ACK, TIME_WAIT, CLOSING, and CLOSED.	Count	net_tcpconnection	userId, instanceId, state	Maximum, Minimum, Average
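
Two calculations underlie these metrics: converting cumulative interface byte counters into a bit/s rate, and grouping per-connection states into the three default categories. The sketch below illustrates both with hypothetical helpers and made-up inputs. How the agent actually obtains the counters and states is not shown here.

```python
from collections import Counter


def rate_bits_per_second(bytes_before, bytes_after, interval_seconds):
    """Convert two cumulative byte counter readings into a bit/s rate."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (bytes_after - bytes_before) * 8 / interval_seconds


def summarize_tcp_states(states):
    """Group connection states into TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED."""
    counts = Counter(states)
    total = sum(counts.values())
    established = counts.get("ESTABLISHED", 0)
    return {
        "TCP_TOTAL": total,
        "ESTABLISHED": established,
        # Everything that is not ESTABLISHED, for example LISTEN or TIME_WAIT.
        "NON_ESTABLISHED": total - established,
    }


# Made-up inputs: 1000 bytes received in one second, and four connections.
rate = rate_bits_per_second(1000, 2000, 1)
summary = summarize_tcp_states(["ESTABLISHED", "ESTABLISHED", "LISTEN", "TIME_WAIT"])
print(rate, summary)
```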

Top 5 process-related metrics

Windows
- Query
  First, the `OpenProcess` function in `Kernel32.dll` is called to access the process. The `GetProcessTimes` function is called twice at an interval to calculate the CPU usage ratio. Then, the `RegConnectRegistryA` function is called to connect to the `HKEY_PERFORMANCE_DATA` registry. Finally, the `RegQueryValueExA` function is called to query the registry for process properties. These properties include process ID, parent process ID, priority, virtual memory, resident memory, shared memory, process name, number of open files, number of threads, page faults, bytes read, and bytes written.
- Process count (Host.process.number)
  - The `OpenProcess` function is called to open the target process. The `NtQueryInformationProcess` function in `NTDLL` is called to retrieve `RTL_USER_PROCESS_PARAMETERS` information. The `ReadProcessMemory` function is called to retrieve the process command line. This action obtains the process arguments (args) and its root running path, which is the current working directory.
  - The `OpenProcessToken` function is called to retrieve the access token handle. The `GetTokenInformation` function is called to retrieve the token information. The `LookupAccountSid` function is called to obtain the process username and user group.
  - For each process, its arguments (args), root running path, username, and user group are matched against a keyword. If a match is found, a counter is incremented by 1.
Linux
- For more information about process CPU and memory usage, see the output of the top command. The CPU usage reflects multi-core usage.
- For more information about Host.process.openfile, see the output of the lsof command.
- For more information about Host.process.number, see the output of the ps aux | grep '<keyword>' command.

Metric	Description	Unit	MetricName	Dimensions	Statistics	Notes
(Agent) process.cpu_pid	Percentage of CPU consumed by a specific process.	%	process.cpu	userId, instanceId, name, pid	Average	Alerting is not supported.
(Agent) process.memory_pid	Percentage of memory consumed by a specific process.	%	process.memory	userId, instanceId, name, pid	Average	Alerting is not supported.
(Agent) process.openfile_pid	Number of files opened by the current process.	Count	process.openfile	userId, instanceId, name, pid	Average	Alerting is not supported.
(Agent) process.count_processname	Number of processes with the specified keyword.	Count	process.number	userId, instanceId, processName	Average	Alerting is not supported.
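
The keyword matching behind the process count can be sketched as follows. The process records, their field names (`args`, `cwd`, `user`, `group`), and the `count_matching_processes` helper are assumptions for illustration. The matching rule follows the description above: a process is counted once if the keyword appears in its arguments, running path, username, or user group.

```python
def count_matching_processes(processes, keyword):
    """Count processes whose args, working directory, user, or group contain keyword."""
    count = 0
    for proc in processes:
        fields = (
            proc.get("args", ""),
            proc.get("cwd", ""),
            proc.get("user", ""),
            proc.get("group", ""),
        )
        if any(keyword in field for field in fields):
            count += 1  # each matching process is counted once
    return count


# Made-up process list.
processes = [
    {"args": "/usr/sbin/nginx -g daemon off;", "cwd": "/", "user": "root", "group": "root"},
    {"args": "nginx: worker process", "cwd": "/", "user": "www-data", "group": "www-data"},
    {"args": "/usr/bin/python3 app.py", "cwd": "/srv/app", "user": "app", "group": "app"},
]

nginx_count = count_matching_processes(processes, "nginx")
print(nginx_count)
```

Because the match also covers the username and path, a short keyword such as `app` would count the third process through any of several fields, so specific keywords give more predictable counts.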

View operating system monitoring data

Log on to the Cloud Monitor console.
In the left-side navigation pane, choose Cloud Resource Monitoring > Host Monitoring.
On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.
On the OS Monitoring tab, view the operating system monitoring data of the target host. You can also set alert rules for metrics and view alerts. For more information, see Create an alert rule for a host and View alerts.
