Host monitoring metrics include agent-collected metrics and ECS basic metrics. Agent-collected metrics are monitored by the CloudMonitor agent and monitoring data is collected every 15 seconds. The monitoring data of ECS basic metrics is collected every minute.
- Different monitoring frequencies
The monitoring data displayed on monitoring charts is the average value of the data collected during one statistical period. The statistical period of ECS basic monitoring data is one minute, whereas the statistical period of agent-collected monitoring data is 15 seconds. When the monitored value fluctuates sharply, averaging over one minute smooths out short peaks, so the peak value of ECS basic monitoring data is smaller than that of agent-collected data.
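
The effect of the statistical period can be illustrated with a short sketch. The readings and the 5-second raw sampling interval are hypothetical and chosen only to make the arithmetic easy to follow; this is not how CloudMonitor stores data internally.

```python
from statistics import mean

def period_averages(samples, samples_per_period):
    """Average consecutive groups of samples, one value per statistical period."""
    return [
        mean(samples[i:i + samples_per_period])
        for i in range(0, len(samples), samples_per_period)
    ]

# Hypothetical CPU usage readings (%), one every 5 seconds for one minute.
readings = [10, 10, 10, 10, 10, 10, 90, 90, 90, 10, 10, 10]

agent_view = period_averages(readings, 3)   # four 15-second periods
ecs_view = period_averages(readings, 12)    # one 60-second period

print(agent_view)  # [10, 10, 90, 10]
print(ecs_view)    # [30]
```

The 15-second view still shows the burst at 90%, while the one-minute view reports 30% for the same interval.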
- Different monitoring perspectives
The network traffic data collected by monitoring ECS basic metrics is used for billing. It does not include the traffic between ECS and SLB because such traffic is not billed. The network traffic data collected through the CloudMonitor agent, by contrast, records the actual network traffic of each NIC. Therefore, the network traffic data collected through the agent can be greater than that collected by monitoring ECS basic metrics, and the agent-collected value can exceed the purchased bandwidth or traffic quota.
- CPU metrics
You can refer to the Linux top command to understand the meaning of the metrics listed in the following table.

| Metric | Definition | Unit | Description |
| --- | --- | --- | --- |
| Host.cpu.idle | The percentage of CPU currently not utilized | % | The percentage of CPU currently in the idle state. |
| Host.cpu.system | The percentage of CPU currently occupied by the kernel space | % | Measures the CPU occupied by system context switching. A high value indicates that many processes or threads are running on the server. |
| Host.cpu.user | The percentage of CPU currently occupied by user processes | % | Measures the CPU occupied by user processes. |
| Host.cpu.iowait | The percentage of CPU currently waiting for I/O operations | % | A high value indicates frequent I/O operations. |
| Host.cpu.other | The percentage of CPU occupied by other operations | % | Calculation method: CPU usage of Nice + CPU usage of SoftIrq + CPU usage of Irq + CPU usage of Stolen. |
| Host.cpu.totalUsed | The percentage of CPU currently occupied | % | The sum of the preceding CPU usage values. It is usually used for alarm purposes. |
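
The calculation of Host.cpu.other can be sketched as follows. The function names and sample percentages are hypothetical. Treating Host.cpu.totalUsed as the sum of the system, user, iowait, and other values (that is, everything except idle) is an assumption based on the description of that metric.

```python
def cpu_other(nice, softirq, irq, steal):
    """Host.cpu.other as described: Nice + SoftIrq + Irq + Stolen (all in %)."""
    return nice + softirq + irq + steal

def cpu_total_used(system, user, iowait, other):
    """Assumed reading of Host.cpu.totalUsed: the sum of the non-idle metrics."""
    return system + user + iowait + other

# Hypothetical percentages for one statistical period.
other = cpu_other(nice=1.0, softirq=0.5, irq=0.5, steal=0.0)
total = cpu_total_used(system=8.0, user=20.0, iowait=2.0, other=other)
idle = 100.0 - total
```

With these sample values, other is 2.0 and total is 32.0, which leaves 68.0 for idle.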
- Memory metrics
You can refer to the free command to understand the meaning of the metrics listed in the following table.

| Metric | Definition | Unit | Description |
| --- | --- | --- | --- |
| Host.mem.total | Total memory | Byte | The total memory of the server. |
| Host.mem.used | The amount of memory in use | Byte | Calculation method: the memory used by user programs + buffers + cached. "buffers" is the memory space occupied by the buffer. "cached" is the memory space occupied by the system cache. |
| Host.mem.actualused | The memory actually used by the user | Byte | Calculation method 1: the memory in use - buffers - cached. Calculation method 2: total memory - available memory. CentOS 7.2, Ubuntu 16.04, and later versions use a newer Linux kernel, which estimates memory more accurately. For the specific meaning of the available column, refer to Commit. |
| Host.mem.free | The amount of memory not in use | Byte | Calculation method: total memory - memory in use. |
| Host.mem.freeutilization | The percentage of available memory | % | Calculation method: available memory/total memory × 100%. |
| Host.mem.usedutilization | The memory usage | % | Calculation method: actually used memory/total memory × 100%. |
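
The calculation methods in the table can be sketched as follows. The function name, parameter names, and byte counts are hypothetical. Deriving available memory as total memory minus actually used memory when no kernel estimate is supplied is an assumption made so that the example stays self-contained.

```python
def mem_metrics(total, used, buffers, cached, available=None):
    """Derive the memory metrics in the table from raw byte counts.

    `used` follows the Host.mem.used definition (user programs + buffers + cached).
    """
    if available is not None:
        actual_used = total - available          # calculation method 2
    else:
        actual_used = used - buffers - cached    # calculation method 1
        available = total - actual_used          # assumption: no kernel estimate given
    return {
        "Host.mem.free": total - used,
        "Host.mem.actualused": actual_used,
        "Host.mem.freeutilization": available / total * 100,
        "Host.mem.usedutilization": actual_used / total * 100,
    }

GIB = 1024 ** 3
m = mem_metrics(total=8 * GIB, used=6 * GIB, buffers=1 * GIB, cached=3 * GIB)
```

For this hypothetical 8 GiB host, Host.mem.actualused is 2 GiB, Host.mem.usedutilization is 25.0, and Host.mem.freeutilization is 75.0.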
- Metrics of average system loads
You can refer to the Linux top command to understand the meaning of the metrics listed in the following table. A higher value of a metric indicates a busier system.

| Metric | Definition | Unit |
| --- | --- | --- |
| Host.load1 | The average system load over the past minute. This metric is not available for Windows operating systems. | None |
| Host.load5 | The average system load over the past five minutes. This metric is not available for Windows operating systems. | None |
| Host.load15 | The average system load over the past 15 minutes. This metric is not available for Windows operating systems. | None |
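
On Linux, the same three load averages that top displays are also exposed in /proc/loadavg. The following sketch parses a line in that format. The function name and the sample line are hypothetical, and the sketch does not claim to reproduce how the CloudMonitor agent collects these values.

```python
def parse_loadavg(text):
    """Parse the first three fields of a /proc/loadavg line into load metrics."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return {"Host.load1": one, "Host.load5": five, "Host.load15": fifteen}

# Hypothetical contents of /proc/loadavg on a Linux host.
sample = "0.42 0.35 0.30 2/345 12345"
loads = parse_loadavg(sample)
```

On Unix-like systems, the Python standard library function os.getloadavg() returns the same three values without any parsing.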
- Disk metrics
- You can refer to the Linux df command to understand the disk usage and inode usage metrics.
- You can refer to the Linux iostat command to understand the disk read/write metrics.

| Metric | Definition | Unit |
| --- | --- | --- |
| Host.diskusage.used | The disk space in use | Byte |
| Host.disk.utilization | The disk usage | % |
| Host.diskusage.free | The remaining storage space of the disk | Byte |
| Host.diskusage.total | The total disk storage | Byte |
| Host.disk.readbytes | The number of bytes read per second on the disk | Byte/s |
| Host.disk.writebytes | The number of bytes written per second on the disk | Byte/s |
| Host.disk.readiops | The number of read requests received by the disk per second | requests/s |
| Host.disk.writeiops | The number of write requests received by the disk per second | requests/s |
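
The relationship between these metrics can be sketched as follows. The helper names and counter values are hypothetical. Computing the disk usage as used space divided by total space, and computing per-second rates from the difference between two cumulative counter readings, are assumptions about how such values are typically derived.

```python
import shutil

def disk_utilization(used, total):
    """Disk usage in percent, from used and total bytes."""
    return used / total * 100

def per_second(previous, current, interval_seconds):
    """Rate of a cumulative counter (bytes or requests) between two readings."""
    return (current - previous) / interval_seconds

# shutil.disk_usage returns a named tuple (total, used, free) in bytes.
usage = shutil.disk_usage("/")
root_utilization = disk_utilization(usage.used, usage.total)

# Hypothetical cumulative read-byte counters taken 15 seconds apart.
read_rate = per_second(previous=1_000_000, current=1_600_000, interval_seconds=15)
```

With the hypothetical counters, read_rate is 40000.0 Byte/s. The value of root_utilization depends on the host where the code runs.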
- File system metrics

| Metric | Definition | Unit | Description |
| --- | --- | --- | --- |
| Host.fs.inode | Inode usage | % | This metric is not available for Windows operating systems. Linux and UNIX systems use inode numbers, instead of file names, to identify files. When inode numbers are used up, new files cannot be created even if the disk space has not been filled up. Therefore, the inode usage must be monitored. The number of inodes in use reflects the number of files. A large number of small files can cause a high inode usage. |
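
Inode usage can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the formula (inodes in use divided by total inodes) is an assumption that mirrors the IUse% column of df -i.

```python
import os

def inode_usage_percent(total_inodes, free_inodes):
    """Share of inodes in use, in percent."""
    if total_inodes == 0:
        return 0.0  # guard: a file system may report no inode count
    return (total_inodes - free_inodes) / total_inodes * 100

def inode_usage_for(path):
    """Read inode counts with os.statvfs (available on Unix-like systems only)."""
    st = os.statvfs(path)
    return inode_usage_percent(st.f_files, st.f_ffree)

# Hypothetical file system with 1,000,000 inodes, 250,000 of them still free.
print(inode_usage_percent(1_000_000, 250_000))  # 75.0
```

On a Linux host, inode_usage_for("/") returns the current inode usage of the root file system.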
- Network metrics
- You can refer to the Linux iftop command to understand the network-related metrics. You can refer to the Linux ss command to understand how TCP connection data is collected.
- The following TCP connection data is collected by default: TCP_TOTAL (the total number of connections), ESTABLISHED (the number of established connections), and NON_ESTABLISHED (the number of connections not in the established state). To also obtain the number of connections in each individual state, follow these steps:
  - Linux: Change the value of netstat.tcp.disable in the configuration file cloudmonitor/config/conf.properties to false to collect the data. Then, restart the agent.
  - Windows: Change the value of netstat.tcp.disable in the configuration file C:\"Program Files"\Alibaba\cloudmonitor\config to false to collect the data. Then, restart the agent.

| Metric | Definition | Unit |
| --- | --- | --- |
| Host.netin.rate | The number of bits received by the NIC per second, that is, the inbound bandwidth of the NIC | bit/s |
| Host.netout.rate | The number of bits sent by the NIC per second, that is, the outbound bandwidth of the NIC | bit/s |
| Host.netin.packages | The number of packets received by the NIC per second | packets/s |
| Host.netout.packages | The number of packets sent by the NIC per second | packets/s |
| Host.netin.errorpackage | The number of incoming error packets detected by the driver | packets/s |
| Host.netout.errorpackages | The number of outgoing error packets detected by the driver | packets/s |
| Host.tcpconnection | The number of TCP connections in various states, including LISTEN, SYN_SENT, ESTABLISHED, SYN_RECV, FIN_WAIT1, CLOSE_WAIT, FIN_WAIT2, LAST_ACK, TIME_WAIT, CLOSING, and CLOSED. | None |
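
The TCP connection counts described above can be approximated from the output of ss -tan. The following is a minimal sketch: the function name and the sample output are hypothetical, and counting every listed socket (including those in the LISTEN state) toward TCP_TOTAL is an assumption, not a statement about how the agent counts connections.

```python
from collections import Counter

def tcp_summary(ss_output):
    """Summarize the State column of `ss -tan` output text."""
    states = Counter()
    for line in ss_output.strip().splitlines()[1:]:   # skip the header row
        fields = line.split()
        if fields:
            states[fields[0]] += 1
    total = sum(states.values())
    established = states["ESTAB"]                     # ss prints ESTABLISHED as ESTAB
    return {
        "TCP_TOTAL": total,
        "ESTABLISHED": established,
        "NON_ESTABLISHED": total - established,
        "by_state": dict(states),
    }

# Hypothetical `ss -tan` output.
sample = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN     0      128    0.0.0.0:22           0.0.0.0:*
ESTAB      0      0      10.0.0.5:22          10.0.0.9:51234
ESTAB      0      0      10.0.0.5:443         10.0.0.7:40022
TIME-WAIT  0      0      10.0.0.5:443         10.0.0.8:40100
"""
summary = tcp_summary(sample)
```

For the sample output, TCP_TOTAL is 4, ESTABLISHED is 2, and NON_ESTABLISHED is 2. Note that ss abbreviates and hyphenates state names (for example, ESTAB and TIME-WAIT), so they differ in spelling from the names listed for Host.tcpconnection.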
- Process metrics
- For the CPU usage and memory usage of processes, refer to the Linux top command. The CPU usage indicates the consumption of multi-core CPUs.
- For details about Host.process.openfile, refer to the Linux lsof command.
- For details about Host.process.number, refer to the Linux ps aux |grep 'keyword' command.

| Metric | Definition | Unit |
| --- | --- | --- |
| Host.process.cpu | The CPU usage of a process | % |
| Host.process.memory | The memory usage of a process | % |
| Host.process.openfile | The number of files opened by the current process | Count |
| Host.process.number | The number of processes that match the specified keyword | Count |

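
The keyword match behind Host.process.number can be sketched as follows. The function name and the sample ps aux output are hypothetical. Unlike ps aux | grep 'keyword', which matches the whole line and also lists the grep process itself, this sketch matches only the command column.

```python
def count_matching_processes(ps_output, keyword):
    """Count `ps aux` rows whose command column contains the keyword."""
    count = 0
    for line in ps_output.strip().splitlines()[1:]:   # skip the header row
        fields = line.split(None, 10)                 # the 11th field is the full command
        if len(fields) == 11 and keyword in fields[10]:
            count += 1
    return count

# Hypothetical `ps aux` output.
sample = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 169000 11000 ?        Ss   10:00   0:01 /sbin/init
www-data   812  0.3  1.2 250000 98000 ?        S    10:01   0:12 nginx: worker process
www-data   813  0.2  1.1 250000 97000 ?        S    10:01   0:10 nginx: worker process
root       900  0.0  0.0  10000  2000 pts/0    R+   10:05   0:00 ps aux
"""
nginx_count = count_matching_processes(sample, "nginx")
```

For the sample output, nginx_count is 2.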
ECS basic metrics
If your host is an ECS server, the metrics listed in the following table are monitored automatically after you purchase the ECS instance. You do not need to install the agent. The monitoring frequency is one minute.

| Metric | Definition | Unit |
| --- | --- | --- |
| ECS.InternetInRate | The average rate of inbound Internet traffic | bit/s |
| ECS.IntranetInRate | The average rate of inbound intranet traffic | bit/s |
| ECS.InternetOutRate | The average rate of outbound Internet traffic | bit/s |
| ECS.IntranetOutRate | The average rate of outbound intranet traffic | bit/s |
| ECS.SystemDiskReadbps | The number of bytes read on the system disk per second | Byte/s |
| ECS.SystemDiskWritebps | The number of bytes written on the system disk per second | Byte/s |
| ECS.SystemDiskReadOps | The number of read operations on the system disk per second | times/s |
| ECS.SystemDiskWriteOps | The number of write operations on the system disk per second | times/s |
| ECS.InternetIn | Internet inbound traffic | Byte |
| ECS.InternetOut | Internet outbound traffic | Byte |
| ECS.IntranetIn | Intranet inbound traffic | Byte |
| ECS.IntranetOut | Intranet outbound traffic | Byte |
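
Note that the rate metrics are expressed in bit/s, whereas the traffic metrics are expressed in bytes. The following sketch shows the unit conversion. The function name and the traffic value are hypothetical, and treating a rate metric as the traffic of one statistical period averaged over 60 seconds is an assumption for illustration only.

```python
def average_rate_bits_per_second(traffic_bytes, period_seconds=60):
    """Convert traffic counted in bytes over one period to an average rate in bit/s."""
    return traffic_bytes * 8 / period_seconds

# Hypothetical: 7,500,000 bytes of inbound Internet traffic in one minute.
rate = average_rate_bits_per_second(7_500_000)
```

With the hypothetical value, rate is 1000000.0 bit/s, that is, 1 Mbit/s.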