Cloud Monitor: Infrastructure monitoring and operating system monitoring

Last Updated: Dec 05, 2025

Elastic Compute Service (ECS) provides two types of metrics for monitoring resources such as CPU utilization and disk usage: infrastructure monitoring metrics and operating system monitoring metrics. Infrastructure monitoring metrics are collected by ECS from the host. This agentless method provides an external perspective and does not require you to install a probe. Operating system monitoring metrics are collected by the CloudMonitor agent installed on the ECS instance. This agent-based method provides an internal perspective and collects metrics from within the operating system. This topic describes the collection methods, scenarios, and definitions for these two types of metrics.

Differences between infrastructure monitoring and operating system monitoring

| Comparison | Infrastructure monitoring | Operating system monitoring |
| --- | --- | --- |
| Monitoring location | Virtualization stack | Inside the virtual machine's operating system |
| Collection frequency | Once per minute | Once per second |
| Aggregated output | None | Data is sampled once per second and aggregated into a data point every 15 seconds. Each data point carries three values: minimum (min), average (avg), and maximum (max). |
| Installation requirements | No probe required. Ready to use out of the box. | Requires the CloudMonitor agent to be installed. |
| Pros | No extra resource overhead. Wide applicability. Unaffected by high workloads running on the instance. | Higher data precision. Can be associated with processes to diagnose issues such as "Steal Time". |
| Cons | Low precision. Cannot detect burst CPU fluctuations. Cannot be associated with the overhead of specific processes. | Requires installation and maintenance. Incurs resource overhead. Data may be lost if the virtual machine (VM) hangs or experiences startup or shutdown issues. |
| Typical scenarios | Infrastructure monitoring for an instance is not affected by the VM's running state. It is suitable for troubleshooting failures such as instance hangs or breakdowns. However, its low sampling frequency makes it unsuitable for scenarios that require capturing rapid performance fluctuations. | Application performance diagnostics, real-time monitoring, and alerting. |

Infrastructure monitoring

ECS collects instance monitoring data from the host. You do not need to install an operating system (OS)-level plugin. This feature is ready to use out of the box.

Collection and reporting

The collection probe on the host gathers one data point per minute for the instance. This data point represents the average usage value for that one-minute interval.

Metrics

Infrastructure monitoring data for an ECS instance is collected at a one-minute granularity. The following table describes the metrics.

Note

If the chart displays data points at a one-minute granularity, the values for Maximum, Minimum, and Average are the same.

| Metric name | Description | Unit | MetricName | Dimensions | Statistics |
| --- | --- | --- | --- | --- | --- |
| (ECS) CPU utilization | CPU usage | % | CPUUtilization | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Inbound Internet bandwidth (classic network) | Average rate of inbound Internet traffic | bit/s | InternetInRate | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Inbound private network bandwidth | Average rate of inbound private network traffic | bit/s | IntranetInRate | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Outbound Internet bandwidth (classic network) | Average rate of outbound Internet traffic | bit/s | InternetOutRate | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Outbound private network bandwidth | Average rate of outbound private network traffic | bit/s | IntranetOutRate | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Read BPS for all disks | Total bytes read from the system disk per second | Byte/s | DiskReadBPS | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Write BPS for all disks | Total bytes written to the system disk per second | Byte/s | DiskWriteBPS | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Read IOPS for all disks | Number of read operations per second across all disks | counts/s | DiskReadIOPS | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Write IOPS for all disks | Number of write operations per second across all disks | counts/s | DiskWriteIOPS | userId, instanceId | Maximum, Minimum, Average |
| (ECS) Inbound Internet bandwidth by IP address | Inbound Internet bandwidth | bit/s | VPC_PublicIP_InternetInRate | userId, instanceId, ip | Maximum, Minimum, Average |
| (ECS) Outbound Internet bandwidth by IP address | Outbound Internet bandwidth | bit/s | VPC_PublicIP_InternetOutRate | userId, instanceId, ip | Maximum, Minimum, Average |
| (ECS) Outbound Internet bandwidth utilization by IP address | Outbound Internet bandwidth usage | % | VPC_PublicIP_InternetOutRate_Percent | userId, instanceId, ip | Average |
| (ECS) Inbound Internet traffic (classic network) | Inbound Internet traffic | Byte | InternetIn | userId, instanceId | Maximum, Minimum, Average, Sum |
| (ECS) Outbound Internet traffic (classic network) | Outbound Internet traffic | Byte | InternetOut | userId, instanceId | Maximum, Minimum, Average |

View infrastructure monitoring data

  1. Log on to the Cloud Monitor console.

  2. In the left-side navigation pane, choose Cloud Resource Monitoring > Host Monitoring.

  3. On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.

  4. Click the Basic Monitoring tab.

    On the Basic Monitoring tab, view the infrastructure monitoring data of the target host. You can also set alert rules for metrics and view alerts. For more information, see Create an alert rule for a host and View alerts.

Operating system monitoring

CloudMonitor collects a wide range of OS-level metrics using the CloudMonitor agent installed on Alibaba Cloud hosts (ECS instances) and non-Alibaba Cloud hosts. You can set alert rules for these metrics. When a metric triggers an alert rule, CloudMonitor sends you an alert notification so you can promptly address the issue.

Prerequisites

Make sure that you have installed the CloudMonitor agent on your Alibaba Cloud hosts (ECS instances) and non-Alibaba Cloud hosts.

Collection and reporting

The CloudMonitor host probe samples data once per second. The data is then aggregated into a single data point every 15 seconds before being reported to the server. Each report includes three values for the 15-second interval: min (the minimum value), max (the maximum value), and avg (the average value).
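
The aggregation described above can be sketched in a few lines of Python. This is an illustration only, not the agent's actual implementation: the function name `aggregate` and the sample values are hypothetical. It also shows why a short burst that a one-minute average flattens still shows up in the 15-second `max` value.

```python
from statistics import mean


def aggregate(samples, window=15):
    """Group per-second samples into fixed windows and emit min/avg/max per window."""
    points = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        points.append({"min": min(chunk), "avg": mean(chunk), "max": max(chunk)})
    return points


# Hypothetical minute of CPU samples: 5% idle load with a 5-second burst at 100%.
samples = [5.0] * 60
samples[20:25] = [100.0] * 5

points = aggregate(samples)   # four 15-second data points
minute_avg = mean(samples)    # what a single one-minute average would report

for index, point in enumerate(points):
    print(index, point)
print("one-minute average:", round(minute_avg, 2))
```

In this made-up minute, the second 15-second data point reports a `max` of 100, while the single one-minute average stays below 13.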

Metrics

Operating system monitoring metrics are reported at a 15-second granularity. The metrics are categorized as follows:

  • CPU-related metrics

    • Windows

      The `NtQuerySystemInformation` function in `ntdll` is called to obtain the cumulative time that the CPU has spent in each state. Calling this function twice, a short interval apart, yields the percentage of time spent in each state during that interval.

    • Linux

      For more information about the metrics in the following table, see the output of the top command.

    | Metric name | Description | Unit | MetricName | Dimensions | Statistics | Description (Linux only) |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent) cpu.idle | Percentage of idle CPU. | % | cpu_idle | userId, instanceId | Maximum, Minimum, Average | The percentage of time that the CPU is idle. |
    | (Agent) cpu.system | Percentage of CPU time spent in kernel space. | % | cpu_system | userId, instanceId | Maximum, Minimum, Average | Overhead from system context switches. A high value for this metric indicates that too many processes or threads are running on the server. |
    | (Agent) cpu.user | Percentage of CPU time spent in user space. | % | cpu_user | userId, instanceId | Maximum, Minimum, Average | CPU consumption by user processes. |
    | (Agent) cpu.wait | Percentage of CPU time spent waiting for I/O operations. | % | cpu_wait | userId, instanceId | Maximum, Minimum, Average | A high value for this metric indicates frequent I/O operations. |
    | (Agent) cpu.other | Percentage of CPU time spent on other tasks. | % | cpu_other | userId, instanceId | Maximum, Minimum, Average | Other consumption = Nice + SoftIrq + Irq + Stolen. |
    | (Agent) cpu.total | Total percentage of CPU consumed. | % | cpu_total | userId, instanceId | Maximum, Minimum, Average | CPU usage = 100% - cpu.idle |
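
On Linux, the "sample twice and take the difference" approach can be illustrated with the aggregate `cpu` line of `/proc/stat`. The sketch below is an assumption about how such percentages can be derived, not the agent's actual code: the helper names and the two sample lines are hypothetical, and the field order (user, nice, system, idle, iowait, irq, softirq, steal) follows the Linux `proc` documentation.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]  # parts[0] is the label "cpu"
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Turn two counter snapshots into percentages for the interval between them."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    pct = {name: 100.0 * ticks / total for name, ticks in delta.items()}
    return {
        "cpu_idle": pct["idle"],
        "cpu_system": pct["system"],
        "cpu_user": pct["user"],
        "cpu_wait": pct["iowait"],
        # Other consumption = Nice + SoftIrq + Irq + Stolen
        "cpu_other": pct["nice"] + pct["softirq"] + pct["irq"] + pct["steal"],
        # CPU usage = 100% - idle
        "cpu_total": 100.0 - pct["idle"],
    }


# Two hypothetical snapshots of the first line of /proc/stat.
before = parse_cpu_line("cpu 1000 10 500 8000 100 5 15 20 0 0")
after = parse_cpu_line("cpu 1030 12 510 8050 104 6 17 21 0 0")
result = cpu_percentages(before, after)
print(result)
```

On a real host, the two lines would come from reading `/proc/stat` twice with a short sleep in between.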

  • Memory-related metrics

    • Windows

      The `GlobalMemoryStatusEx` function in `kernel32.dll` is called to obtain the current usage of physical and virtual memory for a 32-bit Windows operating system.

    • Linux

      For more information about the metrics in the following table, see the output of the free command. The data source is /proc/meminfo.

    | Metric name | Description | Unit | MetricName | Dimensions | Statistics | Description (Linux only) |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent) memory.total.space | Total memory. | Byte | memory_totalspace | userId, instanceId | Maximum, Minimum, Average | The total amount of memory on the server. This corresponds to MemTotal in /proc/meminfo. |
    | (Agent) memory.free.space | Amount of free memory. | Byte | memory_freespace | userId, instanceId | Maximum, Minimum, Average | The amount of free (unused) memory in the system. This corresponds to MemFree in /proc/meminfo. |
    | (Agent) memory.used.space | Amount of used memory. | Byte | memory_usedspace | userId, instanceId | Maximum, Minimum, Average | The amount of used memory in the system. Calculation method: total - free. |
    | (Agent) memory.actualused.space | Amount of memory actually consumed by user programs. | Byte | memory_actualusedspace | userId, instanceId | Maximum, Minimum, Average | Calculation method: if MemAvailable is present in /proc/meminfo, total - MemAvailable; otherwise, used - buffers - cached. Note: On systems that use a newer Linux kernel, such as CentOS 7.2 and Ubuntu 16.04 or later, memory estimation is more accurate. For the specific meaning of MemAvailable, see the Linux kernel commit that introduced it. |
    | (Agent) memory.free.utilization | Percentage of free memory. | % | memory_freeutilization | userId, instanceId | Maximum, Minimum, Average | Calculation method: if MemAvailable is present in /proc/meminfo, (MemAvailable / total) × 100%; otherwise, ((total - actualused) / total) × 100%. |
    | (Agent) memory.used.utilization | Memory usage. | % | memory_usedutilization | userId, instanceId | Maximum, Minimum, Average | Calculation method: if MemAvailable is present in /proc/meminfo, ((total - MemAvailable) / total) × 100%; otherwise, ((total - free - buffers - cached) / total) × 100%. |
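
The calculation methods above, including the fallback for kernels without MemAvailable, can be written out as a small Python sketch. The function names and the sample `/proc/meminfo` excerpt are hypothetical. The code mirrors the formulas in the table and is not the agent's implementation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of values in bytes."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024  # /proc/meminfo reports sizes in kB
        info[key.strip()] = value
    return info


def memory_metrics(info):
    """Apply the calculation methods from the table, with the MemAvailable fallback."""
    total = info["MemTotal"]
    free = info["MemFree"]
    used = total - free
    if "MemAvailable" in info:
        actual_used = total - info["MemAvailable"]
    else:
        actual_used = used - info.get("Buffers", 0) - info.get("Cached", 0)
    return {
        "memory_totalspace": total,
        "memory_freespace": free,
        "memory_usedspace": used,
        "memory_actualusedspace": actual_used,
        "memory_freeutilization": 100.0 * (total - actual_used) / total,
        "memory_usedutilization": 100.0 * actual_used / total,
    }


# Hypothetical excerpt of /proc/meminfo.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    6000000 kB
Buffers:          500000 kB
Cached:          4000000 kB
"""

info = parse_meminfo(SAMPLE)
with_available = memory_metrics(info)

# Simulate an older kernel that does not report MemAvailable.
legacy_info = {k: v for k, v in info.items() if k != "MemAvailable"}
without_available = memory_metrics(legacy_info)

print(with_available["memory_usedutilization"], without_available["memory_usedutilization"])
```

With these made-up numbers the two methods give different results (25% versus 31.25% used), which illustrates why the estimate changes once MemAvailable is available.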

  • System average load metrics

    • Windows

      These metrics are not available on Windows.

    • Linux

      For more information about the metrics in the following table, see the output of the top command. A higher value indicates a busier system.

    | Metric name | Description | Unit | MetricName | Dimensions | Statistics |
    | --- | --- | --- | --- | --- | --- |
    | (Agent) load.1m | Average system load over the past 1 minute. | None | load_1m | userId, instanceId | Maximum, Minimum, Average |
    | (Agent) load.5m | Average system load over the past 5 minutes. | None | load_5m | userId, instanceId | Maximum, Minimum, Average |
    | (Agent) load.15m | Average system load over the past 15 minutes. | None | load_15m | userId, instanceId | Maximum, Minimum, Average |
    | (Agent) load.1m.percore | Average system load per CPU core over the past 1 minute. | None | load_per_core_1m | userId, instanceId | Maximum, Minimum, Average |
    | (Agent) load.5m.percore | Average system load per CPU core over the past 5 minutes. | None | load_per_core_5m | userId, instanceId | Maximum, Minimum, Average |
    | (Agent) load.15m.percore | Average system load per CPU core over the past 15 minutes. | None | load_per_core_15m | userId, instanceId | Maximum, Minimum, Average |
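
As a rough illustration of the per-core variants, the sketch below divides each load average by the number of CPU cores. This assumes that "per CPU core" means a plain division by the logical core count, which the text above does not spell out. The function name `load_per_core` is hypothetical.

```python
import os


def load_per_core(load_averages, cores):
    """Divide the 1-, 5-, and 15-minute load averages by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(load / cores for load in load_averages)


# Hypothetical values: a 4-core host with load averages 2.0, 1.0, and 0.5.
example = load_per_core((2.0, 1.0, 0.5), 4)
print(example)

# On a Unix-like host, the live values can be read with os.getloadavg().
if hasattr(os, "getloadavg"):
    print(load_per_core(os.getloadavg(), os.cpu_count() or 1))
```

A per-core value close to or above 1 suggests that, on average, every core has had runnable work queued for the whole period.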

  • Disk-related metrics

    • Windows

      First, the `GetDiskFreeSpaceExA` function in `Kernel32.dll` is called to retrieve the available disk space. This provides the used storage space, disk usage, free storage space, and total storage space of the disk. Then, the `RegConnectRegistryA` function is called to connect to the `HKEY_PERFORMANCE_DATA` registry. Finally, the `RegQueryValueExA` function is called to query disk-related properties from the `HKEY_PERFORMANCE_DATA` registry. These properties include read count, write count, bytes written, bytes read, time spent reading, time spent writing, and disk usage time.

    • Linux

      For more information about disk usage and inode usage, see the output of the df command. For more information about disk reads and writes, see the output of the iostat command. This information helps you understand the metrics in the following table.

    | Metric name | Description | Unit | MetricName | Dimensions | Statistics |
    | --- | --- | --- | --- | --- | --- |
    | Host.diskusage.used | Used disk storage space. | Byte | diskusage_used | userId, instanceId, device | Maximum, Minimum, Average |
    | Host.diskusage.utilization | Disk usage for regular users. | % | diskusage_utilization | userId, instanceId, device | Maximum, Minimum, Average |
    | Host.diskusage.free | Free disk storage space for regular users and superusers. | Byte | diskusage_free | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) disk.usage.avail_device | Free disk storage space for regular users. | Byte | diskusage_avail | userId, instanceId, device | Maximum, Minimum, Average |
    | Host.diskusage.total | Total disk storage space. | Byte | diskusage_total | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) disk.read.bps_device | Bytes read from the disk per second. | Byte/s | disk_readbytes | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) disk.write.bps_device | Bytes written to the disk per second. | Byte/s | disk_writebytes | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) disk.read.iops_device | Number of read requests to the disk per second. | counts/s | disk_readiops | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) disk.write.iops_device | Number of write requests to the disk per second. | counts/s | disk_writeiops | userId, instanceId, device | Maximum, Minimum, Average |
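
The difference between the "regular users" and "regular users and superusers" figures can be illustrated with the counters that `os.statvfs` exposes on Linux. The sketch below assumes that the metrics follow the same convention as `df`, where some blocks are reserved for the superuser and the usage percentage is computed as used / (used + available). That mapping is an assumption, and the function names are hypothetical.

```python
import os


def usage_from_counts(f_frsize, f_blocks, f_bfree, f_bavail):
    """Derive disk usage figures from statvfs-style block counters."""
    total = f_blocks * f_frsize
    free = f_bfree * f_frsize     # free space, including blocks reserved for the superuser
    avail = f_bavail * f_frsize   # free space that regular users can actually allocate
    used = total - free
    visible = used + avail        # capacity as seen by a regular user
    utilization = 100.0 * used / visible if visible else 0.0
    return {
        "diskusage_total": total,
        "diskusage_used": used,
        "diskusage_free": free,
        "diskusage_avail": avail,
        "diskusage_utilization": utilization,
    }


def disk_usage(path="/"):
    """Read the counters for a mounted file system (Unix-like systems only)."""
    st = os.statvfs(path)
    return usage_from_counts(st.f_frsize, st.f_blocks, st.f_bfree, st.f_bavail)


# Hypothetical file system: 1000 blocks of 4 KiB, 300 free, 250 available to regular users.
example = usage_from_counts(4096, 1000, 300, 250)
print(example)
```

On a Linux host, `disk_usage("/")` returns the same figures for the root file system.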

  • File system metrics

    • Windows

      These metrics are not available on Windows.

    • Linux

      For more information about the metrics in the following table, see the output of the df command.

    | Metric name | Description | Unit | MetricName | Dimensions | Statistics | Description (Linux only) |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent) fs.inode.utilization_device | inode usage. | % | fs_inodeutilization | userId, instanceId, device | Maximum, Minimum, Average | Linux systems use inode numbers instead of filenames to identify files. If the disk is not full but all inodes are allocated, you cannot create new files on the disk. Therefore, you need to monitor inode usage. The number of inodes represents the number of files in the file system. Many small files can lead to high inode usage. |
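
A minimal sketch of how an inode usage percentage can be derived from the total and free inode counts, which `os.statvfs` exposes as `f_files` and `f_ffree` on Linux. The function names are hypothetical, and the formula (used inodes divided by total inodes) is an assumption about how the metric is computed.

```python
import os


def inode_utilization(f_files, f_ffree):
    """Return the percentage of inodes in use, given total and free inode counts."""
    if f_files == 0:
        return 0.0  # guard against file systems that report no inode total
    return 100.0 * (f_files - f_ffree) / f_files


def inode_utilization_for(path="/"):
    """Read the inode counters for a mounted file system (Unix-like systems only)."""
    st = os.statvfs(path)
    return inode_utilization(st.f_files, st.f_ffree)


# Hypothetical file system with 1000 inodes, 250 of them still free.
example = inode_utilization(1000, 250)
print(example)
```

A file system can reach 100% here while `df` still shows free space, which is the situation the description above warns about.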

  • Network-related metrics

    • Windows

      First, the `GetAdaptersAddresses` function in `iphlpapi.dll` is called to retrieve the adapter addresses on the local machine. Then, the `GetIfTable` function is called to retrieve network metrics for each interface. These metrics include bits received per second, bits sent per second, packets received per second, packets sent per second, received error packets, and sent error packets.

    • Linux

      • For more information about the collection of TCP connection counts, see the output of the ss command.

        Note

        The TCP connection count refers to all connections that use the TCP protocol on the ECS host.

        By default, the following TCP connection states are collected: TCP_TOTAL (total connections), ESTABLISHED (connections in the established state), and NON_ESTABLISHED (connections in non-established states, which includes all states other than ESTABLISHED).

      • For more information about the network-related metrics in the following table, see the output of the iftop command.

    | Metric name | Description | Unit | MetricName | Dimensions | Statistics |
    | --- | --- | --- | --- | --- | --- |
    | (Agent) network.in.rate_device | Bits received by the network interface card (NIC) per second, which is the downstream bandwidth of the NIC. | bit/s | networkin_rate | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) network.out.rate_device | Bits sent by the NIC per second, which is the upstream bandwidth of the NIC. | bit/s | networkout_rate | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) network.in.packages_device | Packets received by the NIC per second. | packets/s | networkin_packages | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) network.out.packages_device | Packets sent by the NIC per second. | packets/s | networkout_packages | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) network.in.errorpackages_device | Number of received error packets detected by the device driver. | packets/s | networkin_errorpackages | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) network.out.errorpackages_device | Number of sent error packets detected by the device driver. | packets/s | networkout_errorpackages | userId, instanceId, device | Maximum, Minimum, Average |
    | (Agent) network.tcp.connection_state | Number of TCP connections in various states, including the following: LISTEN, SYN_SENT, ESTABLISHED, SYN_RECV, FIN_WAIT1, CLOSE_WAIT, FIN_WAIT2, LAST_ACK, TIME_WAIT, CLOSING, and CLOSED. | Count | net_tcpconnection | userId, instanceId, state | Maximum, Minimum, Average |
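
The three default TCP connection states (TCP_TOTAL, ESTABLISHED, and NON_ESTABLISHED) can be derived from `ss` output roughly as follows. This sketch assumes the text format printed by `ss -tan`, in which the first column is the state and established connections are shown as `ESTAB`. The function name and the sample output are hypothetical.

```python
from collections import Counter


def summarize_tcp_states(ss_output):
    """Count TCP connections by state from the text output of `ss -tan`."""
    lines = ss_output.strip().splitlines()[1:]  # skip the header row
    states = Counter(line.split()[0] for line in lines if line.strip())
    total = sum(states.values())
    established = states.get("ESTAB", 0)
    return {
        "TCP_TOTAL": total,
        "ESTABLISHED": established,
        "NON_ESTABLISHED": total - established,  # every state other than ESTABLISHED
    }


# Hypothetical `ss -tan` output.
SAMPLE = """
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:22          0.0.0.0:*
ESTAB      0      0      10.0.0.5:22         10.0.0.9:51234
ESTAB      0      0      10.0.0.5:443        10.0.0.7:40112
TIME-WAIT  0      0      10.0.0.5:443        10.0.0.8:40990
"""

summary = summarize_tcp_states(SAMPLE)
print(summary)
```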

  • Top 5 process-related metrics

    • Windows

      • Query

        First, the `OpenProcess` function in `Kernel32.dll` is called to access the process. The `GetProcessTimes` function is called twice at an interval to calculate the CPU usage ratio. Then, the `RegConnectRegistryA` function is called to connect to the `HKEY_PERFORMANCE_DATA` registry. Finally, the `RegQueryValueExA` function is called to query the registry for process properties. These properties include process ID, parent process ID, priority, virtual memory, resident memory, shared memory, process name, number of open files, number of threads, page faults, bytes read, and bytes written.

      • Process count (Host.process.number)

        • The `OpenProcess` function is called to open the target process. The `NtQueryInformationProcess` function in `NTDLL` is called to retrieve `RTL_USER_PROCESS_PARAMETERS` information. The `ReadProcessMemory` function is called to retrieve the process command line. This action obtains the process arguments (args) and its root running path, which is the current working directory.

        • The `OpenProcessToken` function is called to retrieve the access token handle. The `GetTokenInformation` function is called to retrieve the token information. The `LookupAccountSid` function is called to obtain the process username and user group.

        • For each process, its arguments (args), root running path, username, and user group are matched against a keyword. If a match is found, a counter is incremented by 1.

    • Linux

      • For more information about process CPU and memory usage, see the output of the top command. The CPU usage reflects multi-core usage.

      • For more information about Host.process.openfile, see the output of the lsof command.

      • For more information about Host.process.number, see the output of the ps aux | grep '<keyword>' command.

    | Metric name | Description | Unit | MetricName | Dimensions | Statistics | Notes |
    | --- | --- | --- | --- | --- | --- | --- |
    | (Agent) process.cpu_pid | Percentage of CPU consumed by a specific process. | % | process.cpu | userId, instanceId, name, pid | Average | Alerting is not supported. |
    | (Agent) process.memory_pid | Percentage of memory consumed by a specific process. | % | process.memory | userId, instanceId, name, pid | Average | Alerting is not supported. |
    | (Agent) process.openfile_pid | Number of files opened by the current process. | Count | process.openfile | userId, instanceId, name, pid | Average | Alerting is not supported. |
    | (Agent) process.count_processname | Number of processes with the specified keyword. | Count | process.number | userId, instanceId, processName | Average | Alerting is not supported. |
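
The keyword-based process count, which on Linux is comparable to `ps aux | grep '<keyword>'`, can be sketched as a substring match over process command lines. The helper names and the sample command lines are hypothetical, and the sketch matches only the command line, whereas the description above also mentions the running path, username, and user group.

```python
import os


def read_command_lines(proc_root="/proc"):
    """Collect the command line of every running process (Linux only)."""
    commands = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are process IDs
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # the process exited or is not readable
        # Arguments are separated by NUL bytes in /proc/<pid>/cmdline.
        commands.append(raw.replace(b"\0", b" ").decode(errors="replace").strip())
    return commands


def count_matching_processes(command_lines, keyword):
    """Count the processes whose command line contains the keyword."""
    return sum(1 for command in command_lines if keyword in command)


# Hypothetical command lines, as they might appear on a web server.
sample = [
    "nginx: master process /usr/sbin/nginx",
    "nginx: worker process",
    "/usr/bin/python3 app.py",
    "sshd: /usr/sbin/sshd -D",
]
nginx_count = count_matching_processes(sample, "nginx")
print(nginx_count)
```

On a Linux host, `count_matching_processes(read_command_lines(), "nginx")` applies the same count to the live process list.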

View operating system monitoring data

  1. Log on to the Cloud Monitor console.

  2. In the left-side navigation pane, choose Cloud Resource Monitoring > Host Monitoring.

  3. On the Host Monitoring page, click the host name or click Monitoring Charts in the Actions column of the host.

    On the OS Monitoring tab, view the operating system monitoring data of the target host. You can also set alert rules for metrics and view alerts. For more information, see Create an alert rule for a host and View alerts.
