Elastic Compute Service (ECS) provides two types of metrics for monitoring resources such as CPU utilization and disk usage: infrastructure monitoring metrics and operating system monitoring metrics. Infrastructure monitoring metrics are collected by ECS from the host. This agentless method provides an external perspective and does not require you to install a probe. Operating system monitoring metrics are collected by the CloudMonitor agent installed on the ECS instance. This agent-based method provides an internal perspective and collects metrics from within the operating system. This topic describes the collection methods, scenarios, and definitions for these two types of metrics.
Differences between infrastructure monitoring and operating system monitoring
Comparison | Infrastructure monitoring | Operating system monitoring |
Monitoring location | Virtualization stack | Inside the virtual machine's operating system |
Collection frequency | Once per minute | Once per second |
Aggregated output | None | Data is sampled once per second and aggregated into a data point every 15 seconds. Three metrics are generated: minimum (min), average (avg), and maximum (max). |
Installation requirements | No probe required. Ready to use out of the box. | Requires the CloudMonitor agent to be installed. |
Pros |
|
|
Cons |
|
|
Typical scenarios | Infrastructure monitoring for an instance is not affected by the VM's running state. It is suitable for troubleshooting failures such as instance hangs or breakdowns. However, its low sampling frequency makes it unsuitable for scenarios that require capturing rapid performance fluctuations. | Application performance diagnostics, real-time monitoring, and alerting. |
Infrastructure monitoring
ECS collects instance monitoring data from the host. You do not need to install an operating system (OS)-level plugin. This feature is ready to use out of the box.
Collection and reporting
The collection probe on the host gathers one data point per minute for the instance. This data point represents the average usage value for that one-minute interval.
Metrics
Infrastructure monitoring data for an ECS instance is collected at a one-minute granularity. The following table describes the metrics.
If the chart displays data points at a one-minute granularity, the values for Maximum, Minimum, and Average are the same.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics |
(ECS) CPU utilization | CPU usage | % | CPUUtilization | userId, instanceId | Maximum, Minimum, Average |
(ECS) Inbound Internet bandwidth (classic network) | Average rate of inbound Internet traffic | bit/s | InternetInRate | userId, instanceId | Maximum, Minimum, Average |
(ECS) Inbound private network bandwidth | Average rate of inbound private network traffic | bit/s | IntranetInRate | userId, instanceId | Maximum, Minimum, Average |
(ECS) Outbound Internet bandwidth (classic network) | Average rate of outbound Internet traffic | bit/s | InternetOutRate | userId, instanceId | Maximum, Minimum, Average |
(ECS) Outbound private network bandwidth | Average rate of outbound private network traffic | bit/s | IntranetOutRate | userId, instanceId | Maximum, Minimum, Average |
(ECS) Read BPS for all disks | Total bytes read from the system disk per second | Byte/s | DiskReadBPS | userId, instanceId | Maximum, Minimum, Average |
(ECS) Write BPS for all disks | Total bytes written to the system disk per second | Byte/s | DiskWriteBPS | userId, instanceId | Maximum, Minimum, Average |
(ECS) Read IOPS for all disks | Read IOPS for all disks | counts/s | DiskReadIOPS | userId, instanceId | Maximum, Minimum, Average |
(ECS) Write IOPS for all disks | Write IOPS for all disks | counts/s | DiskWriteIOPS | userId, instanceId | Average, Minimum, Maximum |
(ECS) Inbound Internet bandwidth by IP address | Inbound Internet bandwidth | bit/s | VPC_PublicIP_InternetInRate | userId, instanceId, ip | Maximum, Minimum, Average |
(ECS) Outbound Internet bandwidth by IP address | Outbound Internet bandwidth | bit/s | VPC_PublicIP_InternetOutRate | userId, instanceId, ip | Maximum, Minimum, Average |
(ECS) Outbound Internet bandwidth utilization by IP address | Outbound Internet bandwidth usage | % | VPC_PublicIP_InternetOutRate_Percent | userId, instanceId, ip | Average |
(ECS) Inbound Internet traffic (classic network) | Inbound Internet traffic | Byte | InternetIn | userId, instanceId | Average, Minimum, Maximum, Sum |
(ECS) Outbound Internet traffic (classic network) | Outbound Internet traffic | Byte | InternetOut | userId, instanceId | Maximum, Minimum, Average |
View infrastructure monitoring data
Log on to the Cloud Monitor console.
In the left-side navigation pane, open the Host Monitoring page.
On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.
Click the Basic Monitoring tab.
On the Basic Monitoring tab, view the infrastructure monitoring data of the target host. You can also set alert rules for metrics and view alerts. For more information, see Create an alert rule for a host and View alerts.
Operating system monitoring
CloudMonitor collects a wide range of OS-level metrics using the CloudMonitor agent installed on Alibaba Cloud hosts (ECS instances) and non-Alibaba Cloud hosts. You can set alert rules for these metrics. When a metric triggers an alert rule, CloudMonitor sends you an alert notification so you can promptly address the issue.
Prerequisites
Make sure that you have installed the CloudMonitor agent on your Alibaba Cloud hosts (ECS instances) and non-Alibaba Cloud hosts.
Collection and reporting
The CloudMonitor host probe samples data once per second. The data is then aggregated into a single data point every 15 seconds before being reported to the server. Each report includes three values for the 15-second interval: min (the minimum value), max (the maximum value), and avg (the average value).
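The aggregation described above can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function name `aggregate` and the sample values are made up, and the sketch assumes that windows are simply consecutive, non-overlapping groups of 15 samples.

```python
def aggregate(samples, window=15):
    """Collapse per-second samples into one min/avg/max point per window."""
    points = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        points.append({
            "min": min(chunk),
            "avg": sum(chunk) / len(chunk),
            "max": max(chunk),
        })
    return points


# 30 made-up per-second CPU readings: a flat 20% with a one-second spike to 95%.
samples = [20.0] * 30
samples[7] = 95.0

points = aggregate(samples)
for point in points:
    print(point)
```

In this example the one-second spike barely moves the average of the first window, but it is preserved in the `max` value. This is why reporting all three values matters when you need to catch short bursts.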
Metrics
Operating system monitoring metrics are reported as one data point every 15 seconds. The metrics are categorized as follows:
CPU-related metrics
Windows
The `NtQuerySystemInformation` function in `ntdll` is called to obtain the time spent by each part of the CPU. By calling this function twice at an interval, you can calculate the percentage of time spent by each part of the CPU during that interval.
Linux
For more information about the metrics in the following table, see the output of the `top` command.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics | Description (Linux only) |
(Agent) cpu.idle | Percentage of idle CPU. | % | cpu_idle | userId, instanceId | Maximum, Minimum, Average | The percentage of time that the CPU is idle. |
(Agent) cpu.system | Percentage of CPU time spent in kernel space. | % | cpu_system | userId, instanceId | Maximum, Minimum, Average | Overhead from system context switches. A high value for this metric indicates that too many processes or threads are running on the server. |
(Agent) cpu.user | Percentage of CPU time spent in user space. | % | cpu_user | userId, instanceId | Maximum, Minimum, Average | CPU consumption by user processes. |
(Agent) cpu.wait | Percentage of CPU time spent waiting for I/O operations. | % | cpu_wait | userId, instanceId | Maximum, Minimum, Average | A high value for this metric indicates frequent I/O operations. |
(Agent) cpu.other | Percentage of CPU time spent on other tasks. | % | cpu_other | userId, instanceId | Maximum, Minimum, Average | Other consumption = Nice + SoftIrq + Irq + Stolen. |
(Agent) cpu.total | Total percentage of CPU consumed. | % | cpu_total | userId, instanceId | Maximum, Minimum, Average | CPU usage = 1 - Host.cpu.idle |
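The relationship between these metrics can be illustrated with a short sketch that turns two snapshots of cumulative CPU time counters into percentages. The counter names follow the fields of `/proc/stat` on Linux (`steal` corresponds to Stolen in the table). Whether the agent reads exactly this source is an assumption, and the function name `cpu_percentages` and the snapshot values are hypothetical.

```python
def cpu_percentages(before, after):
    """Convert two snapshots of cumulative CPU counters into percentages."""
    delta = {name: after[name] - before[name] for name in before}
    total = sum(delta.values())
    if total == 0:
        raise ValueError("no CPU time elapsed between the two snapshots")
    pct = {name: 100.0 * ticks / total for name, ticks in delta.items()}
    return {
        "cpu_idle": pct["idle"],
        "cpu_system": pct["system"],
        "cpu_user": pct["user"],
        "cpu_wait": pct["iowait"],
        # Other consumption = Nice + SoftIrq + Irq + Stolen
        "cpu_other": pct["nice"] + pct["softirq"] + pct["irq"] + pct["steal"],
        # Total usage is everything that is not idle time.
        "cpu_total": 100.0 - pct["idle"],
    }


# Made-up counter snapshots taken some interval apart.
before = {"user": 1000, "nice": 0, "system": 500, "idle": 8000,
          "iowait": 100, "irq": 0, "softirq": 0, "steal": 0}
after = {"user": 1030, "nice": 2, "system": 510, "idle": 8050,
         "iowait": 105, "irq": 1, "softirq": 1, "steal": 1}

metrics = cpu_percentages(before, after)
print(metrics)
```

The same two-snapshot approach is what the Windows description above refers to when it says that the function is called twice at an interval.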
Memory-related metrics
Windows
The `GlobalMemoryStatusEx` function in `kernel32.dll` is called to obtain the current usage of physical and virtual memory for a 32-bit Windows operating system.
Linux
For more information about the metrics in the following table, see the output of the `free` command. The data source is `/proc/meminfo`.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics | Description (Linux only) |
(Agent) memory.total.space | Total memory. | Byte | memory_totalspace | userId, instanceId | Maximum, Minimum, Average | The total amount of memory on the server. This corresponds to MemTotal in /proc/meminfo. |
(Agent) memory.free.space | Amount of free memory. | Byte | memory_freespace | userId, instanceId | Maximum, Minimum, Average | The amount of free memory in the system. This corresponds to MemFree in /proc/meminfo. |
(Agent) memory.used.space | Amount of used memory. | Byte | memory_usedspace | userId, instanceId | Maximum, Minimum, Average | The amount of used memory in the system. Calculation method: total - free. |
(Agent) memory.actualused.space | The amount of memory consumed by the user. | Byte | memory_actualusedspace | userId, instanceId | Maximum, Minimum, Average | Calculation method: If MemAvailable is present in /proc/meminfo: total - MemAvailable. If MemAvailable is not present in /proc/meminfo: used - buffers - cached. Note: On systems that use a newer Linux kernel, such as CentOS 7.2 or later and Ubuntu 16.04 or later, the memory estimate is more accurate. For the specific meaning of MemAvailable, see the Linux kernel commit that introduced it. |
(Agent) memory.free.utilization | Percentage of free memory. | % | memory_freeutilization | userId, instanceId | Maximum, Minimum, Average | Calculation method: If MemAvailable is present in /proc/meminfo: (MemAvailable / total) × 100%. If MemAvailable is not present in /proc/meminfo: ((total - actualused) / total) × 100%. |
(Agent) memory.used.utilization | Memory usage. | % | memory_usedutilization | userId, instanceId | Maximum, Minimum, Average | Calculation method: If MemAvailable is present in /proc/meminfo: ((total - MemAvailable) / total) × 100%. If MemAvailable is not present in /proc/meminfo: ((total - free - buffers - cached) / total) × 100%. |
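The calculation methods above can be sketched in a few lines. This is an illustration under stated assumptions rather than the agent's code: the helper names `parse_meminfo` and `memory_metrics` are hypothetical, the sample text is made up, and the sketch assumes the `kB` values in `/proc/meminfo` are multiples of 1,024 bytes.

```python
SAMPLE_MEMINFO = """
MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    6000000 kB
Buffers:          500000 kB
Cached:          4000000 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in bytes."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            value = int(fields[0])
            # Only entries with a kB suffix are sizes; others are plain counts.
            info[key.strip()] = value * 1024 if fields[-1] == "kB" else value
    return info


def memory_metrics(info):
    """Apply the formulas from the table, with and without MemAvailable."""
    total, free = info["MemTotal"], info["MemFree"]
    used = total - free
    if "MemAvailable" in info:
        actual_used = total - info["MemAvailable"]
    else:
        actual_used = used - info["Buffers"] - info["Cached"]
    return {
        "memory_totalspace": total,
        "memory_freespace": free,
        "memory_usedspace": used,
        "memory_actualusedspace": actual_used,
        "memory_usedutilization": 100.0 * actual_used / total,
        "memory_freeutilization": 100.0 * (total - actual_used) / total,
    }


info = parse_meminfo(SAMPLE_MEMINFO)
metrics = memory_metrics(info)
print(metrics)
```

On a Linux host, the same functions could be fed the real contents of `/proc/meminfo` instead of the sample string.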
System average load metrics
Windows
These metrics are not available on Windows.
Linux
For more information about the metrics in the following table, see the output of the `top` command. A higher value indicates a busier system.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics |
(Agent) load.1m | Average system load over the past 1 minute. | None | load_1m | userId, instanceId | Maximum, Minimum, Average |
(Agent) load.5m | Average system load over the past 5 minutes. | None | load_5m | userId, instanceId | Maximum, Minimum, Average |
(Agent) load.15m | Average system load over the past 15 minutes. | None | load_15m | userId, instanceId | Maximum, Minimum, Average |
(Agent) load.1m.percore | Average system load per CPU core over the past 1 minute. | None | load_per_core_1m | userId, instanceId | Maximum, Minimum, Average |
(Agent) load.5m.percore | Average system load per CPU core over the past 5 minutes. | None | load_per_core_5m | userId, instanceId | Maximum, Minimum, Average |
(Agent) load.15m.percore | Average system load per CPU core over the past 15 minutes. | None | load_per_core_15m | userId, instanceId | Maximum, Minimum, Average |
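The per-core metrics can be read as the plain load averages divided by the number of CPU cores. The following sketch assumes that interpretation (in particular, that logical cores are counted); the function name `load_per_core` and the input values are hypothetical.

```python
def load_per_core(load_averages, cores):
    """Divide the 1-, 5-, and 15-minute load averages by the core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(load / cores for load in load_averages)


# Made-up load averages for a 4-core instance.
per_core = load_per_core((4.0, 2.0, 1.0), cores=4)
print(per_core)
```

On a Linux host, the standard library functions `os.getloadavg()` and `os.cpu_count()` could supply the two inputs. A per-core value near or above 1 suggests that, on average, every core has had runnable work queued.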
Disk-related metrics
Windows
First, the `GetDiskFreeSpaceExA` function in `Kernel32.dll` is called to retrieve the available disk space. This provides the used storage space, disk usage, free storage space, and total storage space of the disk. Then, the `RegConnectRegistryA` function is called to connect to the `HKEY_PERFORMANCE_DATA` registry. Finally, the `RegQueryValueExA` function is called to query disk-related properties from the `HKEY_PERFORMANCE_DATA` registry. These properties include read count, write count, bytes written, bytes read, time spent reading, time spent writing, and disk usage time.
Linux
For more information about disk usage and inode usage, see the output of the `df` command. For more information about disk reads and writes, see the output of the `iostat` command. This information helps you understand the metrics in the following table.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics |
Host.diskusage.used | Used disk storage space. | Byte | diskusage_used | userId, instanceId, device | Maximum, Minimum, Average |
Host.diskusage.utilization | Disk usage for regular users. | % | diskusage_utilization | userId, instanceId, device | Maximum, Minimum, Average |
Host.diskusage.free | Free disk storage space for regular users and superusers. | Byte | diskusage_free | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) disk.usage.avail_device | Free disk storage space for regular users. | Byte | diskusage_avail | userId, instanceId, device | Maximum, Minimum, Average |
Host.diskusage.total | Total disk storage space. | Byte | diskusage_total | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) disk.read.bps_device | Bytes read from the disk per second. | Byte/s | disk_readbytes | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) disk.write.bps_device | Bytes written to the disk per second. | Byte/s | disk_writebytes | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) disk.read.iops_device | Number of read requests to the disk per second. | counts/s | disk_readiops | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) disk.write.iops_device | Number of write requests to the disk per second. | counts/s | disk_writeiops | userId, instanceId, device | Maximum, Minimum, Average |
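The difference between the free, available, and utilization metrics is easier to see with numbers. The following sketch derives the disk usage metrics from file system statistics in the way `df` does on Linux. Treating utilization as used / (used + available to regular users) is an assumption based on `df`, not a confirmed agent formula. `FsStat` is a hypothetical stand-in for the result of `os.statvfs()`, and the values are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for the os.statvfs() fields used below.
FsStat = namedtuple("FsStat", "f_frsize f_blocks f_bfree f_bavail")


def disk_usage(st):
    """Derive disk usage metrics from statvfs-style block counts."""
    total = st.f_blocks * st.f_frsize
    free = st.f_bfree * st.f_frsize    # includes blocks reserved for superusers
    avail = st.f_bavail * st.f_frsize  # blocks usable by regular users
    used = total - free
    usable = used + avail
    return {
        "diskusage_total": total,
        "diskusage_used": used,
        "diskusage_free": free,
        "diskusage_avail": avail,
        # Usage as seen by regular users (assumed df-style formula).
        "diskusage_utilization": 100.0 * used / usable if usable else 0.0,
    }


st = FsStat(f_frsize=4096, f_blocks=1000, f_bfree=400, f_bavail=350)
usage = disk_usage(st)
print(usage)
```

Because reserved blocks are excluded from the denominator, the utilization seen by regular users is slightly higher than used / total. On a Linux host, `disk_usage(os.statvfs("/"))` would work with the same function because the real result exposes the same attribute names.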
File system metrics
Windows
These metrics are not available on Windows.
Linux
For more information about the metrics in the following table, see the output of the `df` command.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics | Description (Linux only) |
(Agent) fs.inode.utilization_device | Inode usage. | % | fs_inodeutilization | userId, instanceId, device | Maximum, Minimum, Average | Linux systems use inode numbers instead of filenames to identify files. If the disk is not full but all inodes are allocated, you cannot create new files on the disk. Therefore, you need to monitor inode usage. The number of inodes represents the number of files in the file system. Many small files can lead to high inode usage. |
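As a rough illustration of why inode usage is tracked separately from disk usage, the following sketch computes inode utilization from the total and free inode counts that a file system reports (for example, `f_files` and `f_ffree` from `os.statvfs()`). The formula is an assumption modeled on `df -i`, and the function name `inode_utilization` and the numbers are hypothetical.

```python
def inode_utilization(total_inodes, free_inodes):
    """Return the percentage of inodes that are allocated."""
    if total_inodes == 0:
        # Some file systems do not report a fixed inode limit.
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


# A made-up volume that holds many small files: the inode table is almost
# exhausted even though the data blocks might be mostly empty.
small_files_volume = inode_utilization(total_inodes=1000, free_inodes=10)
print(small_files_volume)
```

A volume in this state rejects new files even if `diskusage_utilization` is still low, which is the situation the description above warns about.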
Network-related metrics
Windows
First, the `GetAdaptersAddresses` function in `iphlpapi.dll` is called to retrieve the adapter addresses on the local machine. Then, the `GetIfTable` function is called to retrieve network metrics for each interface. These metrics include bits received per second, bits sent per second, packets received per second, packets sent per second, received error packets, and sent error packets.
Linux
For more information about the collection of TCP connection counts, see the output of the `ss` command.
Note: The TCP connection count refers to all connections that use TCP on the ECS host. By default, the following TCP connection states are collected: TCP_TOTAL (total connections), ESTABLISHED (connections in the established state), and NON_ESTABLISHED (connections in non-established states, which includes all states other than ESTABLISHED).
For more information about the network-related metrics in the following table, see the output of the `iftop` command.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics |
(Agent) network.in.rate_device | Bits received by the network interface card (NIC) per second, which is the downstream bandwidth of the NIC. | bit/s | networkin_rate | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) network.out.rate_device | Bits sent by the NIC per second, which is the upstream bandwidth of the NIC. | bit/s | networkout_rate | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) network.in.packages_device | Packets received by the NIC per second. | packets/s | networkin_packages | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) network.out.packages_device | Packets sent by the NIC per second. | packets/s | networkout_packages | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) network.in.errorpackages_device | Number of received error packets detected by the device driver. | packets/s | networkin_errorpackages | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) network.out.errorpackages_device | Number of sent error packets detected by the device driver. | packets/s | networkout_errorpackages | userId, instanceId, device | Maximum, Minimum, Average |
(Agent) network.tcp.connection_state | Number of TCP connections in various states, including the following: LISTEN, SYN_SENT, ESTABLISHED, SYN_RECV, FIN_WAIT1, CLOSE_WAIT, FIN_WAIT2, LAST_ACK, TIME_WAIT, CLOSING, and CLOSED. | Count | net_tcpconnection | userId, instanceId, state | Maximum, Minimum, Average |
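The three connection counts that are collected by default (TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED) can be derived from a list of per-connection states, as in the following sketch. It assumes that the states have already been extracted and normalized to the names used in the table (the `ss` command itself prints abbreviated names such as `ESTAB`). The function name `summarize_tcp_states` and the sample list are hypothetical.

```python
from collections import Counter


def summarize_tcp_states(states):
    """Summarize per-connection TCP states into the three default counts."""
    counts = Counter(states)
    total = sum(counts.values())
    established = counts.get("ESTABLISHED", 0)
    return {
        "TCP_TOTAL": total,
        "ESTABLISHED": established,
        # Every state other than ESTABLISHED falls into this bucket.
        "NON_ESTABLISHED": total - established,
    }


# Made-up states, one entry per TCP connection on the host.
states = ["LISTEN", "ESTABLISHED", "ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT"]
summary = summarize_tcp_states(states)
print(summary)
```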
Top 5 process-related metrics
Windows
Query
First, the `OpenProcess` function in `Kernel32.dll` is called to access the process. The `GetProcessTimes` function is called twice at an interval to calculate the CPU usage ratio. Then, the `RegConnectRegistryA` function is called to connect to the `HKEY_PERFORMANCE_DATA` registry. Finally, the `RegQueryValueExA` function is called to query the registry for process properties. These properties include process ID, parent process ID, priority, virtual memory, resident memory, shared memory, process name, number of open files, number of threads, page faults, bytes read, and bytes written.
Process count (Host.process.number)
The `OpenProcess` function is called to open the target process. The `NtQueryInformationProcess` function in `NTDLL` is called to retrieve `RTL_USER_PROCESS_PARAMETERS` information. The `ReadProcessMemory` function is called to retrieve the process command line. This action obtains the process arguments (args) and its root running path, which is the current working directory.
The `OpenProcessToken` function is called to retrieve the access token handle. The `GetTokenInformation` function is called to retrieve the token information. The `LookupAccountSid` function is called to obtain the process username and user group.
For each process, its arguments (args), root running path, username, and user group are matched against a keyword. If a match is found, a counter is incremented by 1.
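The keyword-matching step can be sketched as follows. This is only an illustration of the counting logic: `ProcessInfo` and `count_matching` are hypothetical names, the process list is made up, and the sketch assumes a case-sensitive substring match, which the description above does not specify.

```python
from dataclasses import dataclass


@dataclass
class ProcessInfo:
    args: str   # command line arguments
    cwd: str    # root running path (current working directory)
    user: str   # process username
    group: str  # process user group


def count_matching(processes, keyword):
    """Count processes whose args, path, user, or group contain the keyword."""
    count = 0
    for proc in processes:
        fields = (proc.args, proc.cwd, proc.user, proc.group)
        if any(keyword in field for field in fields):
            count += 1  # each matching process is counted once
    return count


processes = [
    ProcessInfo("nginx: master process /usr/sbin/nginx", "/", "root", "root"),
    ProcessInfo("nginx: worker process", "/", "www-data", "www-data"),
    ProcessInfo("python3 app.py", "/srv/app", "deploy", "deploy"),
]
nginx_count = count_matching(processes, "nginx")
print(nginx_count)
```

A process that matches the keyword in more than one field still increments the counter only once, which mirrors the "incremented by 1" wording above.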
Linux
For more information about process CPU and memory usage, see the output of the `top` command. The CPU usage reflects multi-core usage.
For more information about Host.process.openfile, see the output of the `lsof` command.
For more information about Host.process.number, see the output of the `ps aux | grep '<keyword>'` command.
Metric Name | Description | Unit | MetricName | Dimensions | Statistics | Notes |
(Agent) process.cpu_pid | Percentage of CPU consumed by a specific process. | % | process.cpu | userId, instanceId, name, pid | Average | Alerting is not supported. |
(Agent) process.memory_pid | Percentage of memory consumed by a specific process. | % | process.memory | userId, instanceId, name, pid | Average | Alerting is not supported. |
(Agent) process.openfile_pid | Number of files opened by the current process. | Count | process.openfile | userId, instanceId, name, pid | Average | Alerting is not supported. |
(Agent) process.count_processname | Number of processes with the specified keyword. | Count | process.number | userId, instanceId, processName | Average | Alerting is not supported. |
View operating system monitoring data
Log on to the Cloud Monitor console.
In the left-side navigation pane, open the Host Monitoring page.
On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.
On the OS Monitoring tab, view the operating system monitoring data of the target host. You can also set alert rules for metrics and view alerts. For more information, see Create an alert rule for a host and View alerts.