This topic describes the diagnostic items in the Elastic Compute Service (ECS) console and the API diagnostic metrics supported by the self-service diagnostics feature. This topic also describes diagnostic scope and recommended operations.
Diagnostic types in the ECS console
The instance health diagnostics feature supports the following types of diagnostics:
Diagnostics of computing service health: the underlying resources and virtualization layer of ECS. You can check whether the underlying services of ECS run as expected.
Diagnostics of network service health: the status of network components in an instance and the exceptions in the external network environment.
Diagnostics of storage service health: whether exceptions exist in the disks of an instance.
Diagnostics of instance configuration management health: whether an operation is preventing an instance from starting or running as expected.
Diagnostics of security control health: whether inbound traffic on common ports is allowed in all the security groups to which an instance belongs.
Diagnostics of billing health: whether you have overdue payments for an instance and its associated components such as the public IP address and elastic IP addresses (EIPs).
Diagnostics of resource quotas health: whether the quota usage of critical resources is approaching the upper limit.
Diagnostics of configurations in the Linux operating system of the instance: the system files, key processes, and usage status of common ports and firewalls in the instance operating system.
Diagnostics of configurations in the Windows operating system of the instance: the usage status of common ports and firewalls in the instance operating system.
Exceptions discovered in the diagnostics of computing service health, network service health, storage service health, and instance configuration management health are not real-time exceptions. The diagnostic results include the exceptions present within the last 12 hours. These exceptions may not need to be fixed in real time.
Exceptions discovered in the diagnostics of security control health, billing, resource quotas, and instance operating system configurations are real-time exceptions. We recommend that you fix these exceptions in real time.
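The split between real-time exceptions and exceptions from the last 12 hours can be captured in a small lookup, which may help when you triage diagnostic results in a script. The following is a minimal sketch. The category strings and the `is_real_time` helper are illustrative names chosen here and are not part of any ECS API.

```python
# Categories whose exceptions cover the last 12 hours (not real-time).
HISTORICAL = {
    "computing service health",
    "network service health",
    "storage service health",
    "instance configuration management health",
}

# Categories whose exceptions reflect the current state (real-time).
REAL_TIME = {
    "security control health",
    "billing health",
    "resource quotas health",
    "instance operating system configurations",
}


def is_real_time(category: str) -> bool:
    """Return True if exceptions in this category are real-time exceptions."""
    key = category.strip().lower()
    if key in REAL_TIME:
        return True
    if key in HISTORICAL:
        return False
    raise ValueError(f"unknown diagnostic category: {category!r}")
```

For example, `is_real_time("Billing health")` returns `True`, which suggests that the exception should be fixed promptly.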
Diagnostic items of computing service health
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Insufficient Resources | The instance cannot start due to insufficient CPU or memory resources. | Check whether the required physical CPU or memory resources are sufficient. If physical resources are insufficient when the system attempts to reallocate resources to the instance, such as when you start an instance stopped in economical mode, the instance cannot start. You can wait a few minutes and try again or create another instance in another zone or region. |
Exceptions in Instance Operating System | The instance operating system experiences a kernel panic exception, an out-of-memory (OOM) exception, or internal downtime. | Check whether faults, such as kernel panic, OOM exception, or internal downtime, exist in the instance operating system. These faults may be caused by improper configurations of the instance or user programs in the instance operating system. You can restart the instance for recovery. |
Exceptions on Instance Virtualization | The instance does not respond or unexpectedly stops during runtime. | Check whether exceptions exist in the core services at the underlying virtualization layer of the instance. If exceptions exist, the instance may not respond or may unexpectedly stop. You can restart the instance for recovery. |
Alerts for Instance Host | Alerts are triggered on the physical device that hosts the instance. | Check whether faults exist on the underlying physical server that hosts the instance. If faults exist on the underlying physical server, the running state or performance of the instance may be affected. You can restart the instance for recovery. |
Instance Performance Limited | The burstable instance is in standard mode. | Check whether the CPU credits of the burstable instance are sufficient to maintain high performance. If the CPU credits are insufficient, the instance cannot burst its performance and can deliver only baseline performance during peak hours. |
Instance CPU Exceptions | An exception occurred because instances compete for CPUs or because CPUs cannot be bound to the dedicated instance. | Check whether shared instances compete for CPUs at the underlying layer. If shared instances compete for CPUs at the underlying layer, the dedicated instance cannot obtain CPUs or other exceptions occur. You can restart the instance for recovery. |
Exceptions on Instance Management System | An exception occurred in the backend management system of the instance. | Check whether the backend management system of the instance works as expected. If the system is not working as expected, exceptions may occur on the instance. You can restart the instance for recovery. |
Instance Performance Temporarily Degraded | Check whether the performance of the instance is temporarily degraded due to issues with underlying software or hardware. | Check whether the performance of the instance is temporarily degraded due to issues with underlying software or hardware. If the performance of the instance is degraded, the time when the performance is degraded appears. You can view the historical events or system logs of the instance to identify the cause of the performance degradation. For more information, see View historical system events and View system logs and screenshots. |
Diagnostic items of network service health
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Packet Loss on Instance Network Link | Packets are lost on the physical devices or in the network service of the instance. | Check whether packet loss occurs on the network link of the instance. If packet loss occurs, the network connectivity or throughput is affected. For example, the connection to the ECS instance fails or network access slows down. You can restart the instance for recovery. |
Inconsistent Network Configurations | The network configurations of the instance are inconsistent with those of the underlying service. | Check whether the network configurations of the instance are consistent with those of the service. If inconsistency exists, the instance network performance is affected. You can restart the instance for recovery. |
Exceptions on Instance Link Layer | An exception occurred at the link layer of the network interface controllers (NICs) of the instance. | Send Address Resolution Protocol (ARP) requests to NICs to check whether the basic network configuration of the instance is normal. If the requests fail, the instance is not started normally or the network configuration is abnormal. You can restart the instance for recovery. |
NIC Loading Exceptions | An exception occurred when the NIC of the instance is being loaded. | Check whether the NIC of the instance can be loaded. If the NIC cannot be loaded, the network connectivity of the instance is affected. For example, you cannot connect to the instance. You can restart the instance for recovery. |
Packet Loss on NIC | Inbound or outbound packet loss occurred on the NIC. | Check whether inbound or outbound packet loss has occurred on the NIC. If packet loss exists, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow. You can restart the instance for recovery. |
Network Connection Exceptions | NIC connections cannot be established or the maximum number of connections is reached. | Check whether connections can be established on the NIC of the instance. If connections cannot be established on the NIC or if the maximum number of connections is reached, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow. You can restart the instance for recovery. |
Abnormal DDoS Protection State | Check the DDoS protection state of the instance and check whether the public IP address of the instance suffers from DDoS attacks. | Check whether the public IP address of the instance suffers from DDoS attacks. The free Anti-DDoS Origin service provided by Alibaba Cloud can help you scrub malicious traffic and mitigate unavailability caused by DDoS attacks. If the amount of malicious traffic exceeds the protection capability of your instance, the instance becomes unavailable or inaccessible. For more information about DDoS attacks, see What is a DDoS attack? You can purchase other anti-DDoS services to protect your instance against DDoS attacks. For more information, see Comparison of Alibaba Cloud Anti-DDoS solutions. For information about the best practices for mitigating DDoS attacks, see Best practices for mitigating DDoS attacks. |
Burst Bandwidth Limited | Check whether the burst bandwidth of the instance is limited. | Check the burst bandwidth of the instance. If the burst bandwidth of the instance exceeds the upper limit allowed for the instance type, network performance becomes a business bottleneck. We recommend upgrading the instance to an instance type with higher bandwidth capabilities. For more information, see Change the instance type. Note For information about the burst bandwidth capabilities of various instance types, see Overview of instance families. |
Network Traffic Throttled | Check whether the total internal and public bandwidth of the instance has reached the maximum bandwidth allowed for the instance type. | Check the total internal and public bandwidth of the instance. If the total internal and public bandwidth exceeds the maximum baseline bandwidth supported by the instance type, network performance becomes a business bottleneck. We recommend upgrading the instance to an instance type with higher bandwidth capabilities. For more information, see Change the instance type. Note For information about the baseline network bandwidth capabilities of different instance types, see Overview of instance families. |
Diagnostic items of storage service health
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Ineffective Disk Resizing Operation | After the disk of a Linux instance is resized in the ECS console, check whether further disk resize operations are required. | After the disk resize operation is performed in the ECS console, check whether the disk of the instance is resized. If not, run commands on the instance to extend the partitions and file systems of the disk. For more information, see Step 1: Resize a disk to extend the disk capacity. |
Disk I/O Hang | A disk on the instance is experiencing an I/O hang, and data cannot be read from or written to the disk. | Check whether an I/O hang occurred in the system disk of the instance. The file systems of the disk have a high read and write I/O latency, causing the instance to be unstable or break down. If a disk is experiencing an I/O hang, data cannot be read from or written to the disk. We recommend checking the performance metrics of the disk. For more information, see View the monitoring data of a cloud disk. For information about how to check for I/O hangs on instances that run Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers. |
Disk Loading Exceptions | An exception occurred when you create or attach a disk. | Check whether a disk can be attached when the instance is being started. If the disk cannot be attached to the instance, the instance may fail to start. Stop and restart the instance. You can also reattach the disk for instance recovery. For information about how to attach a disk, see Attach a data disk. |
Disk Read/Write Limited | The I/O latency of the disk on the instance is high or the disk IOPS has reached the upper limit. | Check whether the system disk of the instance has a read and write I/O latency and whether the disk has reached its maximum read and write IOPS. If a disk has reached its maximum read and write IOPS, the read and write operations on the disk are limited. For information about how to view disk metrics, see View the monitoring data of a cloud disk. To prevent the preceding issues, reduce the read and write frequency of the disk or upgrade the disk to a category that can deliver higher performance. For information about the read and write performance metrics of disk categories, see Block storage performance. |
Disk Resizing Exceptions | After the disk is resized, the operating system cannot adjust the size of the file systems. | Check whether the size of the file systems in the system disk of the instance is also resized after you resize the system disk. If the size of the file systems is not resized, the disk cannot be resized due to insufficient resources or other reasons. The disk cannot be used. You must resize the disk again. For information about how to resize disks in various operating systems and the limits that apply when you resize disks, see Overview. |
Diagnostic items of instance configuration management health
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Instance Startup Exceptions | The instance cannot be started by the management system. | Check whether you can perform the boot operation on the instance. If not, create another instance. |
Core Operation Error | The operation you performed on the instance failed. | Check whether operations that you recently performed on the instance are successful. The operations include starting and stopping the instance and upgrading its configurations. If the operations failed, repeat them. |
Image Loading Exceptions | The image used by the instance cannot be loaded. | Check whether the image used by the instance can be loaded on startup. The image may fail to be loaded due to system or image issues. You can restart the instance for recovery. |
Diagnostic items of security control health
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Status of Common Ports | For Windows and Linux instances, check whether traffic is allowed on ports 3389 and 22 in the security groups of the instance, respectively. | Check whether traffic on common ports is allowed in the security groups of the instance. If traffic on the common ports is denied, some services may not run as expected or the instance may not be accessible. Allow inbound traffic on the following ports:
|
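The port check described above can be approximated offline against a list of security group rules. The following is a minimal sketch under stated assumptions: the rule dictionaries and their keys (`direction`, `policy`, `protocol`, `port_range` in `low/high` form) are a simplified shape chosen for illustration, and the function ignores rule priority, source CIDR blocks, and explicit deny rules, all of which a real evaluation would need to consider.

```python
def port_allowed(rules, port, protocol="tcp"):
    """Return True if any inbound accept rule covers the given port.

    `rules` is a list of dicts in an assumed, simplified shape, for example:
    {"direction": "ingress", "policy": "accept",
     "protocol": "tcp", "port_range": "22/22"}
    """
    for rule in rules:
        if rule.get("direction", "").lower() != "ingress":
            continue  # only inbound rules matter for this check
        if rule.get("policy", "").lower() != "accept":
            continue  # deny rules are ignored in this sketch
        if rule.get("protocol", "").lower() != protocol:
            continue
        low, sep, high = rule.get("port_range", "").partition("/")
        if not sep:
            continue  # not in the assumed "low/high" form
        try:
            low_port, high_port = int(low), int(high)
        except ValueError:
            continue  # skip malformed port ranges
        if low_port <= port <= high_port:
            return True
    return False
```

With a single rule that accepts inbound TCP traffic on `22/22`, `port_allowed(rules, 22)` is true and `port_allowed(rules, 3389)` is false, which matches the Linux and Windows cases in the table.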
Diagnostic items of billing health
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Expiration of Subscription Instance | Check whether the subscription instance has expired. | Check whether your subscription instance has expired. If your instance expires, it is stopped and cannot be accessed. For information about changes to resource states after a subscription instance expires, see Subscription. To recover the service, renew the instance. For more information, see Renew a subscription instance. |
Check Whether the Pay-as-you-go Instance Is Stopped Due to an Overdue Payment | Check whether the pay-as-you-go instance is stopped and cannot be used due to overdue payments. | Check whether your pay-as-you-go instance has overdue payments. If so, the instance is stopped and cannot be used. For information about changes to resource states after payments become overdue within your account, see Pay-as-you-go. You must add funds to your account and then reactivate the instance. |
Overdue Payments for Instance Components | Check whether the disks or network bandwidth of the instance is unavailable due to overdue payments within your account. | Check whether the pay-as-you-go disks attached to the subscription instance or the bandwidth is unavailable due to overdue payments within your account. If you have overdue payments for instance components, access to the instance is also affected. You must add funds to your account. |
Diagnostic items of resource quota health
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Insufficient Disk Capacity Quota | Your disk capacity is approaching the quota. | Log on to the ECS console to request a quota increase. For more information, see ECS quota management. |
Insufficient Image Quota | The number of images in your account is approaching the quota. | To increase the image quota, go to the General Quotas of Elastic Compute Service page, and click Apply in the Actions column for Total number of custom images that current account can own. |
Insufficient ENI Quota | The number of secondary Elastic Network Interfaces (ENIs) in your account is approaching the quota. | Apply for a quota increase in the ECS console. For more information, see ECS quota management. |
Insufficient NIC Queue Quota | The instance has reached the maximum number of NIC queues. |
|
Insufficient Security Group Quota | The number of security groups in your account is approaching the quota. | To increase the security group quota, go to the General Quotas of Elastic Compute Service page, and click Apply in the Actions column for Maximum Number Of Security Groups. |
Insufficient Security Group Quota for Resource | The ENI is approaching the maximum number of security groups to which it can be added. | Apply for a quota increase in the ECS console. For more information, see Manage ECS quotas. If you adjust the limit on the number of security groups that an ECS instance or elastic network interface can join, the maximum number of rules in the security group will also change. For more information, see Security groups. |
Insufficient Rule Quota for Security Group | The number of rules in the security group is approaching the quota. | Apply for a quota increase in the ECS console. For more information, see Manage ECS quotas. If you adjust the maximum number of rules in a security group, the number of security groups that your ECS instance or elastic network interface can join will also change. For more information, see Security groups. |
Diagnostic items of Linux-related configurations
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Total CPU Utilization | The | Check the total CPU utilization of the instance. If the CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to query the usage of CPU resources, see Resolve high CPU utilization or load on a Linux instance. |
Inodes in Disks | Check whether disk inodes are sufficient. | Check the inode usage of disks on the instance. If the inode usage of a disk is high, files may fail to be created on the disk. Resize disks as needed. For more information, see Overview. |
DHCP Service | Check whether network-related processes exist when Dynamic Host Configuration Protocol (DHCP) is configured. If not, the IP address may be lost after the lease expires. | Check the DHCP process of the eth0 NIC on the instance. If the DHCP process does not exist, the IP address of the instance may fail to renew after the lease expires, which causes network interruptions. For information about how to enable DHCP, see Configure DHCP on a Linux instance. |
Devices in fstab | Check whether the fstab file contains the configurations of nonexistent devices. | Check the /etc/fstab file on the instance. If the /etc/fstab file contains the configurations of the nonexistent devices, the instance may fail to start. For more information about how to remove the configurations of the nonexistent devices from the /etc/fstab file, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance? |
Mounting Status of Devices in fstab | Check whether devices in the fstab file are correctly mounted. | Check the /etc/fstab file on the instance. If devices are not configured to be mounted automatically in the /etc/fstab file, these devices cannot be used after the instance is restarted. You must run the mount command to manually mount the devices or configure the devices to be mounted automatically in the /etc/fstab file. For information about how to configure disks to be mounted automatically, see Automatically mount a data disk using a UUID in /etc/fstab. |
fstab File Format | Check whether the content of the fstab file is in the correct format. | Check the /etc/fstab file on the instance. If the /etc/fstab file has an invalid format, the instance may fail to start. For information about how to change the /etc/fstab file format, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance? |
System Firewall Status | Check whether the system firewall is enabled. | Check the firewall configurations. If the firewall is enabled for your instance and has rules configured to block external access, you may fail to connect to the instance. For information about how to enable and disable a firewall, see Manage the system firewall on a Linux instance. |
System File Status | Check the status of critical system files. | The For information about how to check and repair file systems, see Check and repair the file systems on a Linux instance. |
Limits Configuration | Check whether the limits configuration is correct. | Check the /etc/security/limits.conf file on the instance. If the |
Memory Configuration | Check whether the configured huge page size is large. | Check the /etc/sysctl.conf file on the instance. If the number of huge pages and the huge page size configured in the /etc/sysctl.conf file are large, the total huge page size may exceed the total instance memory size. The total huge page size is calculated based on the following formula: Total huge page size = Number of huge pages × Size of each huge page. For information about how to adjust the huge page size, see How do I adjust the huge page size on a Linux ECS instance? |
Listening Status of Common Ports | Check whether common ports, such as port 22 and port 3389, are in the listening state. | Check the common ports of the instance. If the common ports are not in the listening state, applications on the instance may be inaccessible. For information about how to check and modify common ports, see Test methods for TCP and UDP ports in Linux. |
Processes with CPU Utilization Exceeding 50% | The | Check the CPU utilization of processes on the instance. If the CPU utilization of some processes is high, check whether the processes are normal. For information about how to check the CPU utilization, see Resolve high CPU utilization or load on a Linux instance. |
High Single-CPU Utilization | The | Check the single-CPU utilization of the instance over a period of time. If the single-CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to check the CPU utilization, see Resolve high CPU utilization or load on a Linux instance. |
Startup Status of Key System Processes | Check whether critical system processes are started. | Check the critical system processes of the instance. If the critical system processes are not in the Running state, the instance may be inaccessible. |
Kernel Parameters in NAT Environment | Check whether the kernel parameters in the NAT environment are valid. | Check the kernel parameters related to the NAT environment on the instance. If exceptions exist in the kernel parameters related to the NAT environment, the instance cannot be connected over SSH and exceptions occur when you access the HTTP service on the instance. Check and adjust the net.ipv4.tcp_tw_recycle and net.ipv4.tcp_timestamps values in /etc/sysctl.conf. For information about how to fix kernel parameters in the NAT environment, see Why am I unable to access an ECS instance or an ApsaraDB RDS instance after I configure NAT for my client? |
TCP SACK Configuration | Check whether TCP SACK is enabled. | Check whether TCP SACK is enabled for the instance. If TCP SACK is disabled, the network performance of the instance may be affected. For information about how to enable TCP SACK, see Enable TCP SACK on a Linux instance. |
Check Whether the Operating System is OOM | Check whether an OOM issue occurred in the instance operating system. | Check whether an OOM issue occurred in the instance operating system. If so, check whether the amount of available instance memory is sufficient to support the business that runs on the instance. If the amount of available memory is insufficient, upgrade the instance configurations to increase the memory size. For information about how to analyze the root cause of an OOM issue and resolve it, see How do I handle OOM errors on a Linux instance? |
Critical System File Format | Check the formats of critical system files. | Check whether critical system files on the instance are in the UNIX format. If not, you may fail to connect to the instance. For information about how to change the system file format, see Critical files in non-Unix formats on a Linux instance. |
SELinux Status | Check whether SELinux is enabled. | Check whether SELinux is enabled on the instance. If so, an error is reported when you connect to the instance over SSH. You can temporarily or permanently disable SELinux. For information about how to disable SELinux, see What do I do if an SSH connection to a Linux ECS instance becomes abnormal when SELinux is enabled? |
Status and Password Settings of Critical System Users | Check whether critical system users have passwords. Critical system users include the root user in Linux and the administrator user in Windows. | Check whether critical users exist for the instance operating system. If not, you may fail to connect to the instance. Check the status and password settings of critical users in /etc/passwd. For information about how to check a critical user, see A critical system user does not exist in a Linux instance. |
SSH Access Permissions | Check whether the SSH access permissions are correctly configured. | Check the SSH access permissions of the instance. If the SSH access permissions are incorrectly configured, you may fail to connect to the instance. For information about how to modify the SSH access permissions, see A critical system user does not exist in a Linux instance. |
Critical File Systems for SSH | Check whether critical files or directories for SSH access exist. | Check critical files or directories required by SSH. If the critical files or directories required by SSH do not exist, you may fail to connect to the instance over SSH. For information about how to fix critical files or directories required by SSH, see Check Linux instances for the required files or directories required by the SSH service. |
Whether SSH Allows Root Logon | Check whether SSH allows you to log on as the root user. | Check whether SSH allows you to log on as the root user. If SSH denies access from the root user, the "Permission denied, please try again" error message is returned when you attempt to connect to the instance as the root user over SSH. For information about how to fix the error, see Resolve the "Permission denied, please try again" error for SSH connections to a Linux instance. |
NIC Multi-queue Status | Check whether NIC multi-queue is enabled. | Check whether NIC multi-queue is enabled for the NICs of the instance. If not, the network performance of the instance may be affected. For information about how to enable NIC multi-queue, see NIC multi-queue. |
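Two of the preceding items, Memory Configuration and Kernel Parameters in NAT Environment, depend on values in /etc/sysctl.conf. The following is a minimal sketch of how such checks could be reproduced on the text of that file. It is not the implementation used by the diagnostics feature. The function names are illustrative. The sketch assumes that the number of huge pages is set by `vm.nr_hugepages`, that the huge page size is supplied by the caller (2048 KiB is a common default on x86_64 and is typically reported as `Hugepagesize` in /proc/meminfo), and that the NAT-related risk is the combination of `net.ipv4.tcp_tw_recycle=1` and `net.ipv4.tcp_timestamps=1`.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def total_huge_page_kib(settings, huge_page_size_kib=2048):
    """Total huge page size = number of huge pages x size of each huge page."""
    return int(settings.get("vm.nr_hugepages", "0")) * huge_page_size_kib


def huge_pages_exceed_memory(settings, mem_total_kib, huge_page_size_kib=2048):
    """Return True if the configured huge pages exceed the total memory."""
    return total_huge_page_kib(settings, huge_page_size_kib) > mem_total_kib


def nat_risk(settings):
    """Return True if tcp_tw_recycle and tcp_timestamps are both enabled.

    Assumption: tcp_timestamps is treated as enabled when the file does not
    set it, because Linux enables it by default.
    """
    recycle = settings.get("net.ipv4.tcp_tw_recycle") == "1"
    timestamps = settings.get("net.ipv4.tcp_timestamps", "1") == "1"
    return recycle and timestamps
```

For example, 1,024 huge pages of 2,048 KiB each total 2,097,152 KiB (2 GiB), which exceeds the memory of an instance that has 1 GiB of memory.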
Diagnostic items of Windows-related configurations
Diagnostic item in the ECS console | Description | Diagnostic scope and recommended operation |
Windows Operating System Version | Microsoft no longer provides support for Windows Server 2008 and earlier versions. | Check the Windows operating system version of the instance. Alibaba Cloud and Microsoft no longer provide support for Windows Server 2008 and earlier versions. We recommend installing an operating system version later than Windows Server 2008. For more information, see Replace the system disk (operating system). |
High Total CPU Utilization | Check whether the total CPU utilization of the Windows instance exceeds 85%. | Check the CPU utilization of the instance. If the total CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to check the CPU utilization, see What do I do if a Windows instance has high CPU utilization? |
High Single-CPU Utilization | Check whether the single-CPU utilization exceeds 80%. | Check the CPU utilization of the instance. If the single-CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to check the single-CPU utilization, see What do I do if a Windows instance has high CPU utilization? |
High Memory Usage | Check whether the memory usage of the Windows instance exceeds 80%. | Check whether the memory usage exceeds 80%. If so, the top five processes with the highest memory usage are displayed. Check whether the processes run as expected. For information about how to analyze the memory usage of Windows instances, see Memory analysis tools for Windows. |
Common Windows Service Port Status | Check whether port 3389 is enabled for the Windows instance. | Check port 3389 of the instance. If port 3389 is disabled, the instance cannot be accessed by using RDP. For information about how to enable port 3389 to allow remote desktop connections, see How do I enable Remote Desktop Services on a Windows ECS instance? |
Windows NIC Status | Check whether the NICs of the Windows instance are enabled. | Check the NICs of the instance. If the NICs are unavailable, the instance cannot be connected. For information about how to check and repair NICs, see Check network connectivity. |
IPv4 Addresses of NICs | Check whether the NICs of the Windows instance are assigned IPv4 addresses. | Check whether the NICs are assigned IPv4 addresses. If not, services on the instance may be inaccessible. Check whether DHCP is enabled for the instance or whether a static IP address is assigned to the instance. For information about how to enable DHCP, see Install and configure the DHCP server. |
Network Proxy Status | Check whether network proxy information is configured. | Check whether network proxy information is configured. If network proxy information is configured for the instance, services on the instance may be inaccessible. You must enable or disable the network proxies based on your business requirements. For information about how to disable the network proxies in Windows, see How to reset your Internet Explorer proxy settings. |
DHCP Configuration Status | Check whether DHCP is enabled for the NICs of the Windows instance. | Check the status of DHCP on the NICs. If DHCP is disabled for the NICs, services may be inaccessible. Modify the DHCP configurations of the NICs based on your business requirements. For information about how to enable and configure DHCP for Windows instances, see How To Install and Configure a DHCP Server in a Workgroup. |
Windows Virtual Disk Driver Status | Check the virtio driver version. | Check the virtio driver version of the instance. If the virtio driver is of an earlier version, disks attached to the instance cannot be resized online. For information about how to upgrade the virtio driver version, see Update the virtio driver for a Windows instance. |
Disk Capacity | Check whether the available capacity of the system disk C:\ is less than 1 GB. | Check the available capacity of the system disk C:\ on the instance. If the available capacity is less than 1 GB, the system may run slowly or the instance may fail to start. Resize the system disk based on your business requirements. For more information, see Overview. |
Windows Firewall Status | Check whether the Windows firewall is enabled. | Check whether the firewall is enabled for the instance. If so, services on the instance may be inaccessible. Modify the firewall policies. For information about how to configure a firewall policy, see Configure firewall rules for a Windows ECS instance. |
Crash Dump Configuration Status | Check whether crash dump collection is enabled for the instance. | Check whether crash dump collection is enabled for the instance. If not, the instance cannot save relevant information for recovery when it unexpectedly restarts or encounters a blue screen of death. Enable or disable crash dump collection based on your business requirements. For information about how to enable crash dump collection in Windows, see Enable or disable the kernel crash dump service for an instance. |
Administrator Account | Check whether the Administrator account exists. | Check whether the Administrator account exists. If not, services may be inaccessible. You can create the Administrator account based on your business requirements. For information about how to create an account in Windows, see How to add or remove an administrator by using the Management Console. |
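The Disk Capacity item above compares the free space on the system disk with a 1 GB threshold. The following is a minimal sketch of a comparable local check, not the implementation used by the diagnostics feature. The function names are hypothetical, and reading "1 GB" as 1024**3 bytes is an assumption.

```python
import shutil

ONE_GB = 1024 ** 3  # assumption: "1 GB" is read as 2**30 bytes


def is_low_on_space(free_bytes: int, threshold: int = ONE_GB) -> bool:
    """Return True when the free space is below the threshold."""
    return free_bytes < threshold


def system_disk_low(path: str = "C:\\") -> bool:
    """Check the free space of the volume that contains `path`."""
    return is_low_on_space(shutil.disk_usage(path).free)
```

On a Windows instance, `system_disk_low()` inspects drive C. Pass another path to check a different volume.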
API diagnostic metric categorization
Terms
Diagnostic metric (DiagnosticMetric): A unit that checks the status of an instance or account, such as CPU utilization.
Diagnostic item (Issue): An associated item discovered when a diagnostic metric is checked. The items are classified by severity level as Info, Warn, or Critical. Each diagnostic metric may be associated with multiple diagnostic items. If no associated diagnostic items exist, no issues are found when the system checks the diagnostic metric. However, this does not mean that no actual issues with the diagnostic metric exist.
Diagnostic metric set (DiagnosticMetricSet): A collection of diagnostic metrics that enables you to diagnose all metrics at a time.
Important: Diagnostic results are only used as a reference. A normal diagnostic result does not imply that no issues with the related system metrics occur.
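The relationship between the terms above can be sketched as a small data model: one diagnostic metric carries zero or more diagnostic items, each with a severity level. The class and field names below are hypothetical and do not reflect the API response format. Ranking the levels as Info < Warn < Critical is also an assumption.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    # Assumed ordering of the three severity levels.
    INFO = 1
    WARN = 2
    CRITICAL = 3


@dataclass
class Issue:
    issue_id: str        # for example, "GuestOS.CPU.HighUtilization"
    severity: Severity   # severity values used here are illustrative only


@dataclass
class MetricResult:
    metric_id: str                                  # for example, "GuestOS.CPUUtil"
    issues: List[Issue] = field(default_factory=list)

    def worst_severity(self) -> Optional[Severity]:
        """Highest severity among the issues, or None if none were reported."""
        return max((issue.severity for issue in self.issues), default=None)
```

A result of `None` means only that the check reported no items. As noted above, it does not prove that the metric is free of issues.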
The following table describes the instance health diagnostics items classified by feature and module.
Category code | Category name | Description |
ECSService.ServiceHealth | Diagnostics of computing service health | Checks the physical server resources and virtualization layer of ECS. |
ECSService.InstanceNetwork | Diagnostics of network service health | Checks the status of network components on an instance and exceptions in the external network environment. |
ECSService.InstanceStorage | Diagnostics of storage service health | Checks whether exceptions exist in the disks of an instance. |
ECSService.InstanceConfigure | Diagnostics of instance configuration management health | Checks whether an operation is preventing an instance from starting or running as expected. |
ECSService.SecurityGroup | Diagnostics of security control health | Checks whether inbound traffic on common ports is allowed in all security groups associated with an instance. |
ECSService.AccountBalance | Diagnostics of billing health | Checks whether you have overdue payments for an instance and its associated components such as the public IP address and EIP traffic. |
ECSService.GuestOS | Diagnostics of configurations in the Linux operating system | Checks the system files, key processes, and usage status of common ports and firewalls in the instance operating system. |
ECSService.GuestOS | Diagnostics of configurations in the Windows operating system | Checks the usage status of common ports and firewalls in the instance operating system. |
ECSService.ActionTrace | | Audits and traces instance billing-related operations, security group-related operations, and instance state-related operations. |
Exceptions detected during diagnostics of computing service health, network service health, storage service health, and instance configuration management are not real-time exceptions. The diagnostic results include exceptions that occurred within the past 12 hours so that you can review historical issues. These exceptions may not require immediate resolution.
Exceptions detected during diagnostics of security control health, billing, resource quotas, and configurations in instance operating systems are real-time exceptions. These exceptions exist at the time of diagnosis, and we recommend that you resolve them immediately.
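The distinction between historical and real-time exceptions can be sketched as a lookup by category code. The assignment of category codes to the health types named above is inferred from the category descriptions and is an assumption, as are the function name and the returned labels. ECSService.ActionTrace and resource quotas are not mapped because the text does not give a classification or a code for them.

```python
# Assumed mapping: categories whose results cover the past 12 hours.
HISTORICAL = {
    "ECSService.ServiceHealth",
    "ECSService.InstanceNetwork",
    "ECSService.InstanceStorage",
    "ECSService.InstanceConfigure",
}

# Assumed mapping: categories whose results reflect the time of diagnosis.
REAL_TIME = {
    "ECSService.SecurityGroup",
    "ECSService.AccountBalance",
    "ECSService.GuestOS",
}


def urgency(category_code: str) -> str:
    """Return a hypothetical handling hint for a category code."""
    if category_code in REAL_TIME:
        return "resolve now"
    if category_code in HISTORICAL:
        return "review history"
    return "unclassified"
```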
Diagnostic items of computing service health
Diagnostic metric ID | Diagnostic metric description | Diagnostic result item ID | Diagnostic result item description | Recommended operation |
Instance.ControllerError | Check whether the backend management system of the instance runs as expected. | Instance.ECSService.MngServiceException | The backend management system does not run as expected, which may cause the instance to run abnormally. | |
Instance.CPUException | Check whether shared instances compete for CPUs at the underlying layer. | Instance.ECSService.CPUBindFailure | CPU contention exists, which may cause the instance to be unable to obtain CPU resources or experience other exceptions. | |
Instance.CPUSplitLock | Check for an Intel CPU Split Lock issue. | Instance.ECSService.CPUSplitLock | The instance encounters an Intel CPU Split Lock issue. | Check whether your application on the ECS instance contains abnormal code that causes this issue and optimize the code. |
Instance.GuestOSCrash | Check whether the instance operating system has crashed. | Instance.ECSService.GuestOSCrashed | The operating system has crashed. | Check whether your application on the ECS instance contains abnormal code that causes this issue and optimize the code. |
Instance.HostDownAlert | Check whether faults exist in the underlying physical server that hosts the instance. | Instance.ECSService.HostDown | Faults exist in the underlying physical server. The status or performance of the instance may be affected. | |
Instance.PerformanceAffected | Check whether the instance performance is temporarily degraded due to issues with underlying software or hardware. | Instance.ECSService.PerformanceAffected | The performance of the instance is degraded. | Check the historical system events or system logs of the instance to identify the cause. For more information, see View historical system events and View system logs and screenshots. |
Instance.PerfRestrict | Check whether the CPU credits of the burstable instance are sufficient to maintain high performance. | Instance.ECSService.BurstPerformanceRestricted | If the CPU credits are insufficient, the burstable instance can deliver only baseline performance during peak hours and cannot burst its performance. | Check whether the instance meets your business requirements. If not, we recommend that you upgrade the instance type. For more information, see Upgrade the instance types of subscription instances or Change the instance type of a pay-as-you-go instance. |
Instance.ResourceNotEnough | Check whether the required physical CPU or memory resources are sufficient. | Instance.ECSService.ResourceOutOfStock | If physical resources are insufficient when the system attempts to reallocate resources to the instance (such as when you start an instance that was stopped in economical mode), the instance cannot start. | Wait a few minutes and try again or create another instance in another zone or region. |
Instance.SystemException | Check whether faults such as kernel panic, OOM exception, or internal downtime exist in the instance operating system. | Instance.ECSService.GuestOSException | Internal OS exceptions may be caused by improper instance configurations or improper program configurations in user space. | |
Instance.VirtException | Check whether exceptions exist in the core services at the underlying virtualization layer of the instance. | Instance.ECSService.VirtualizationException | This exception may cause the instance to stop responding or be unexpectedly stopped. | |
Instance.RecentUtilHigh | Check whether the historical load exceeds 80%. | Instance.UtilizationHigh.IntranetBandwidth | During the diagnostic period you selected, the internal bandwidth utilization of the instance exceeds 80%. High internal bandwidth utilization indicates that your instance transfers a large amount of internal network traffic. | Alibaba Cloud cannot determine the specific process information. Analyze further based on your business. For detailed monitoring information, log on to the CloudMonitor console. |
Instance.UtilizationHigh.DiskIOPS | During the diagnostic period you selected, the IOPS utilization of the instance reached 80%. High IOPS utilization indicates that your instance is performing frequent I/O read and write operations. | Alibaba Cloud cannot determine the specific process information. Analyze further based on your business. For detailed monitoring information, log on to the CloudMonitor console. | ||
Instance.UtilizationHigh.DiskBPS | During the diagnostic period you selected, the BPS utilization of the instance reached 80%. High BPS utilization indicates that your instance is transferring a large amount of data. | Alibaba Cloud cannot determine the specific process information. Analyze further based on your business. For detailed monitoring information, log on to the CloudMonitor console. | ||
Instance.UtilizationHigh.CPU | During the diagnostic period you selected, the CPU utilization of the instance reached 80%. High CPU utilization indicates that your instance is performing high-frequency computing tasks. | For detailed monitoring information, log on to the CloudMonitor console. | ||
Instance.KMSInvalid | Check whether the KMS key is working properly. | Instance.KMSInvalid.SecretInvalid | The current instance uses the key service provided by Key Management Service (KMS) to encrypt the system disk or data disks, but the instance fails to start because the key is invalid. | You can log on to the KMS console to check the status of the key used for the instance's disks. If the instance has an overdue payment, renew your subscription and restart the instance. If the instance runs as expected, ignore this alert. |
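The Instance.RecentUtilHigh metric in the table above reports one result item per resource whose utilization reaches 80% during the selected period. The following is a minimal sketch of that thresholding, not the service's implementation. It assumes that utilization samples are percentages, that the peak sample is compared against the threshold, and that the comparison is inclusive. The sample keys and the function name are hypothetical.

```python
from typing import Dict, List, Sequence

THRESHOLD = 80.0  # percent, as stated in the table

# Hypothetical sample keys mapped to the result item IDs from the table.
ITEM_IDS = {
    "intranet_bandwidth": "Instance.UtilizationHigh.IntranetBandwidth",
    "disk_iops": "Instance.UtilizationHigh.DiskIOPS",
    "disk_bps": "Instance.UtilizationHigh.DiskBPS",
    "cpu": "Instance.UtilizationHigh.CPU",
}


def high_utilization_items(samples: Dict[str, Sequence[float]]) -> List[str]:
    """Return the result item IDs whose peak utilization reaches the threshold."""
    flagged = []
    for key, item_id in ITEM_IDS.items():
        values = samples.get(key, ())
        # Assumption: the peak sample decides, and 80% itself counts as high.
        if values and max(values) >= THRESHOLD:
            flagged.append(item_id)
    return flagged
```

For example, `high_utilization_items({"cpu": [10, 85]})` flags only the CPU item. Identifying the responsible process still requires monitoring data from inside the instance.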
Diagnostic items of network service health
Metric ID | Metric description | Result item ID | Result description | Recommended operation |
Instance.ArpPingError | Send ARP requests to NICs to check whether the basic network configurations of the instance are functioning properly. | Instance.ECSService.ARPPingIssue | An exception occurred at the link layer of the NICs of the instance. | If the requests fail, the instance does not start as expected or the network configuration is abnormal. Restart the instance. |
Instance.DDoSStatus | Check whether the public IP address of the instance experiences DDoS attacks. | Instance.Security.SufferDDoSAttacks | The following sample data is returned in the additional information of the item: Attributes in the returned result:
| The free Anti-DDoS Origin service provided by Alibaba Cloud can help you scrub malicious traffic and mitigate unavailability caused by DDoS attacks. If the amount of malicious traffic exceeds the protection capacity of your instance, the instance becomes unavailable or inaccessible. For more information, see What is a DDoS attack? You can purchase other anti-DDoS services to protect your instance. For more information, see Comparison of Alibaba Cloud Anti-DDoS solutions. For information about the best practices for mitigating DDoS attacks, see Best practices for mitigating DDoS attacks. |
Instance.NetworkBoundLimit | Check the total internal and public bandwidth of the instance. | Instance.Network.IOLimit | The total bandwidth exceeds the maximum baseline bandwidth that the instance type supports, causing network performance to become a bottleneck for your business. | Upgrade the instance to an instance type that provides higher bandwidth capabilities. For more information, see Change the instance type. |
Instance.NetworkBurstLimit | Check whether the burst bandwidth of the instance has reached the upper limit. | Instance.Network.BurstBoundLimit | The burst bandwidth exceeds the upper limit allowed for the instance type, causing network performance to become a bottleneck for your business. | Upgrade the instance to an instance type that provides higher bandwidth capabilities. For more information, see Change the instance type. |
Instance.NetworkLoadFailure | Check whether the NIC of the instance can be loaded. | Instance.Network.ENILoadFailure | If the NIC cannot be loaded, the network connectivity of the instance is affected. For example, you cannot connect to the instance. | |
Instance.NetworkSessionError | Check whether connections can be established on the NIC of the instance. | Instance.Network.SessionException | If connections cannot be established on the NIC or if the maximum number of connections is reached, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow. | |
Instance.PacketDrop | Check whether inbound or outbound packet loss has occurred on the NIC. | Instance.Network.PacketDrop | If packet loss exists, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow. | |
Instance.NetworkConfigConsistency | Check whether the network metrics of the instance are normal. | Instance.NetworkConfig.Inconsistent | The effective network configuration of the instance is inconsistent with the underlying service configuration, which may affect the network performance of the instance. | |
Instance.NetworkLinkException | Check whether packet loss exists on the internal links of the instance. | Instance.Network.LinkException | The instance encounters packet loss on the underlying network links during the detection period, which may affect the performance of the instance. | |
Diagnostic items of storage service health
Metric ID | Metric description | Result item ID | Result description | Recommended operation |
Instance.DiskLimit | Check whether the instance's system disk has read and write I/O latency and whether the read and write IOPS exceeds the upper limit of the disk. | Instance.Disk.IOLimit | The disk read and write IOPS has exceeded the upper limit, and read and write operations are restricted. For information about how to view disk metrics, see View the monitoring data of a cloud disk. | To prevent this issue from occurring, reduce the read and write frequency of the disk or upgrade it to a category that can deliver higher performance. For information about the read and write performance metrics of disk categories, see Block storage performance. |
Instance.DiskLoadFailure | Check whether a disk can be attached to the instance during instance startup. | Instance.Disk.EBSLoadFailure | The disk cannot be attached to the instance. The instance cannot be started. | Stop and then restart the instance. Alternatively, you can re-attach the disk for instance recovery. For information about how to attach a disk, see Attach a data disk. |
Instance.IOHang | Check whether an I/O hang occurred on the instance's system disk, such as when the disk's file systems have a high read and write I/O latency, causing instance instability or crash. | Instance.Disk.IOHang | The system disk experiences an I/O hang, and data cannot be read from or written to the disk. | We recommend that you check the performance metrics of the disk. For more information, see View the monitoring data of a cloud disk. For information about how to check for I/O hangs in instances that run Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers. |
Instance.ResizeFsFailure | Check whether the file systems on the system disk are also extended after you resize the system disk. | Instance.Disk.ResizeFailure | The file systems are not extended, and the newly resized disk cannot be used. | Resize the disk again. For information about how to resize disks in various operating systems and the limits that apply when you resize disks, see Overview. |
Instance.DiskFull | Check whether the disk usage reached 100% during a time period. | Instance.Disk.Full | The disk usage of the instance reached 100% during a specific period of time, which may cause instance exceptions. | Select one of the following solutions based on your needs to ensure that the system runs properly:
|
Diagnostic items of instance configuration management
Metric ID | Metric description | Result item ID | Result description | Recommended operation |
Instance.BootFailure | Check whether you can perform the boot operation on the instance. | Instance.ECSService.BootIssue | The instance cannot start. | |
Instance.ImageLoadFailure | Check whether the image used by the instance can be loaded on startup. | Instance.ECSService.ImageIssue | The image may fail to be loaded due to system or image issues. | |
Instance.OperationFailure | Check whether operations that you performed on the instance are successful. These operations include starting and stopping the instance and upgrading the configurations of the instance. | Instance.ECSService.OperationError | An operation fails. | Try again. |
Instance.BootScreenshot | Check whether the operating system boot failure is caused by operating system issues. | Instance.BootScreenshot.Exception | The instance operating system cannot start due to issues, such as abnormal configurations in the operating system or abnormal shutdown. | Log on to the instance by using VNC. |
Diagnostic items of security health
Metric ID | Metric description | Result item ID | Result description | Recommended operation |
Instance.SGIngress | Check whether inbound traffic on common ports is allowed in the security group rules of the instance NIC. | Instance.Network.SSHPortRuleDeny | The inbound SSH port 22 is not allowed. | To access the instance over SSH, configure an inbound rule in the security group to allow SSH access. For more information, see Add a security group rule. |
Instance.SgRule.PingPortDeny | The instance cannot be pinged. | To ping the instance, configure an inbound rule in the security group to allow the ping messages. For more information, see Add a security group rule. | ||
Instance.SgRule.WinRemotePortDeny | The instance cannot be connected over RDP. | To access the instance by using Remote Desktop, configure an inbound rule in the security group to allow remote desktop access. For more information, see Add a security group rule. | ||
Instance.SecurityRisk | Check whether security risks exist on the instance. | Instance.Security.Risk | The instance has security risks that may cause exceptions. | For more information about security risks, log on to the Security Center. |
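The Instance.SGIngress metric in the table above reports an item for each common inbound port that the security group rules do not allow. The following sketch shows that idea over a simplified rule list. The rule shape (`protocol`, `port_range` written as "start/end", `policy`) is a hypothetical simplification. Mapping RDP to TCP port 3389 (its default) and ping to ICMP is an assumption, and the sketch ignores source addresses, rule priority, and deny rules.

```python
from typing import Dict, List, Optional

# (protocol, port, result item ID) for the three checks listed in the table.
CHECKS = [
    ("tcp", 22, "Instance.Network.SSHPortRuleDeny"),
    ("icmp", None, "Instance.SgRule.PingPortDeny"),
    ("tcp", 3389, "Instance.SgRule.WinRemotePortDeny"),
]


def _allows(rule: Dict[str, str], protocol: str, port: Optional[int]) -> bool:
    """Return True if a hypothetical inbound rule allows the protocol and port."""
    if rule["policy"] != "accept":
        return False
    if rule["protocol"] not in (protocol, "all"):
        return False
    if port is None:  # ICMP has no port
        return True
    low, high = (int(p) for p in rule["port_range"].split("/"))
    if low == -1:     # assumed convention: "-1/-1" means all ports
        return True
    return low <= port <= high


def denied_items(rules: List[Dict[str, str]]) -> List[str]:
    """Return the result item IDs for common ports that no rule allows."""
    return [
        item_id
        for protocol, port, item_id in CHECKS
        if not any(_allows(rule, protocol, port) for rule in rules)
    ]
```

A rule list that allows only SSH yields the ping and Remote Desktop items, which matches the recommended operation of adding an inbound rule for each access method you need.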
Billing diagnostic items and results
Metric ID | Metric description | Result item ID | Result description | Recommended operation |
Instance.ExpenseException | Check whether the billing status of the ECS instance is abnormal. | Account.Balance.ExpenseException | Some resources of the instance have billing status exceptions (including subscription expiration or account overdue payment), which prevents connections to the instance or normal use of the instance. The resources with billing status exceptions are listed below. Renew the instance or add funds to your account, and then restart and log on to the instance.
Example: Example: Attributes in the returned result:
| For information about ECS billing, overdue payments, and renewal operations, see Billing overview. |
Diagnostic items and results of Linux operating system configurations
Metric ID | Metric description | Result item ID | Result description | Recommended operation |
GuestOS.CPUUtil | Check whether CPU utilization is too high. | GuestOS.CPU.HighUtilization | The total CPU utilization of the instance exceeds 80%. Check the following top 5 processes by CPU utilization. Attributes in the returned result:
| For information about how to query CPU utilization, see Resolve high CPU utilization or load on a Linux instance. |
GuestOS.CoreCPU.HighUtilization | One or more CPUs of the instance have utilization of more than 85%. Check the following processes whose CPU utilization exceeds 85%. Attributes in the returned result:
| For information about how to query CPU resource usage, see Resolve high CPU utilization or load on a Linux instance. | ||
GuestOS.MemUtil | Check whether the instance memory usage is too high. | GuestOS.Memory.HighUtilization | The total memory utilization of the instance exceeds 80%. Example of the top five processes with the highest memory usage: Attributes in the returned result:
| Disable unnecessary services or processes as needed. If this is caused by your normal business operations, we recommend that you upgrade your ECS configuration. For information about how to query memory usage, see What do I do if the memory usage of a Linux instance is high? |
GuestOS.DiskUtil | Check whether the system disk usage is too high. | GuestOS.SystemDisk.InsufficientSpace | The disk space or inode usage of some file systems on the instance's disks exceeds 80%. This may prevent new files from being created on these partitions. Example of disks with high inode usage: Attributes in the returned result:
| Resize your disk as needed. For more information, see Overview. For information about how to resolve inode capacity issues, see Resolve "no space left" issues on a Linux instance. |
GuestOS.SystemConfig | Check whether critical system configurations are correct. | GuestOS.AuditConfig.AutoShutdown | The Audit service configuration file of the instance has high-risk parameter configurations. When the file system storing Audit service logs runs out of space, the operating system automatically shuts down. After restart, the operating system may shut down repeatedly because Audit service logs are continuously generated. Attributes in the returned result:
| Modify the configuration items in the Audit service configuration as needed. For more information, see How do I modify the auditd service configuration to prevent automatic shutdown due to insufficient disk space? |
GuestOS.LimitsFile.UnreasonableConfig | Some configurations in the Examples of abnormal parameters: Attributes in the returned result:
| Modify the configurations in the | ||
GuestOS.EnormousPageSize.UnreasonableConfig | The number of huge pages in the system file Attributes in the returned result:
| Change the number of huge pages as needed. For more information, see How do I adjust the huge page size on a Linux ECS instance? | ||
GuestOS.SELinuxService.Enabled | The SELinux service is enabled on the instance, which may prevent SSH connections to the instance. | Temporarily or permanently disable the SELinux service. For more information, see What do I do if an SSH connection to a Linux ECS instance becomes abnormal when SELinux is enabled? | ||
GuestOS.NvmeIOTimeout.UnreasonableConfig | A short I/O read/write timeout period is configured for Non-Volatile Memory Express (NVMe) disks in the system file of the instance. This may cause the NVMe disks to become read-only after an I/O timeout, resulting in data write failures. Attributes in the returned result:
| Change the value to 4294967295 as needed. For more information, see What do I do if a NVMe disk on a Linux ECS instance is unavailable due to an invalid I/O timeout parameter? | ||
GuestOS.SysctlUnknownNmiPanic.Enabled | The non-maskable interrupt configuration in the kernel of the instance is inappropriate. This can cause unexpected kernel panic and instance restart when the instance encounters a non-maskable interrupt. Attributes in the returned result:
| Change the value to 0 as needed. For more information, see Why does the setting of the kernel parameter kernel.unknown_nmi_panic cause an abnormal restart of a Linux instance? | ||
GuestOS.NetworkInterfaceMultiQueue.Disabled | The multi-queue feature is disabled for one or more NICs on the instance, which may affect network performance. Attributes in the returned result:
| Enable the multi-queue feature as needed. For more information, see NIC multi-queue. | ||
GuestOS.SysctlIPv4TCPSACK.Disabled | The Attributes in the returned result:
| Change the value to 1 as needed. For information about how to enable | ||
GuestOS.SysctlIPv4TCPTWRecycle.Enabled | The NAT-related kernel parameters are incorrectly configured on the instance. This prevents SSH connections to the instance and causes abnormal access to HTTP services on the instance. Attributes in the returned result:
| Change the value to 0 as needed. For information about how to fix kernel parameters in the NAT environment, see Common kernel network parameters of Linux ECS instances and FAQ. | ||
GuestOS.SysctlIPv4TCPTWReuse.Disabled | The TIME-WAIT sockets reuse feature is disabled for the instance. Sockets in the TIME-WAIT state cannot be used for new TCP connections. This may affect the network performance when the instance sends requests. Attribute in the returned result:
| Change the | ||
GuestOS.SysctlNetfilterNfMaxConnections.Unreasonable | The instance's historical system logs contain error logs within a period of time. This issue occurs when the full hash table space is used by the Attributes in the returned result:
| Change the values of these two parameters in the instance kernel configuration file based on your business requirements and system conditions. For more information, see Common kernel network parameters of Linux ECS instances and FAQ. | ||
GuestOS.PidMax.TooSmall | The number of running processes on the instance exceeds two-thirds of the maximum number of processes ( Attributes in the returned result:
| Increase the value of | ||
GuestOS.SysctlTcpMaxTwBuckets.Unreasonable | The instance's historical system logs contain error logs within a period of time. This issue occurs when the instance has too many TIME_WAIT connections, which may cause unexpected disconnections or failures to respond to new connections and affect instance access or service response. Attributes in the returned result:
| This issue typically results from improper configuration of the | ||
GuestOS.SystemUserPwd | Check the system account and password settings. | GuestOS.SystemUser.MissingInfo | The system account of the instance does not exist, which may cause instance logon failures. Attributes in the returned result:
| Add the account information as needed. For information about how to check for missing system users, see A critical system user does not exist in a Linux instance. |
GuestOS.SystemUserFile.NotUnixFormat | The format of the system account file on the instance is incorrect, which may cause instance logon failures. Attribute in the returned result:
| Modify the file format as needed. For more information, see Critical files in non-Unix formats on a Linux instance. | ||
GuestOS.SystemUserFile.InvalidExtensionAttribute | The extended attributes of the system account file on the instance are incorrect. This may prevent some instance features from working as expected. For example, changes to the root account password in the ECS console may not take effect. Attributes in the returned result:
| Modify the file format as needed. For more information, see Critical files in non-Unix formats on a Linux instance. | ||
GuestOS.FileSystems | Check the file system status. | GuestOS.Filesystems.UUIDConflicts | The instance contains file systems with duplicate UUIDs, which may cause the system to automatically mount unexpected file systems during boot. This can lead to boot failures or unexpected behavior. Example of file systems with the identical UUID: Attributes in the returned result:
| Modify the duplicate file system UUIDs as needed. For information about how to modify the UUID of a file system, see Modify the UUID of a disk. |
GuestOS.FstabFile.InvalidFormatExists | The Example: Attributes in the returned result:
| Modify the For information about how to modify the | ||
Windows Firewall Status Check | A device configured in the Attributes in the returned result:
| Remove non-existent devices from For more information about how to modify the | ||
GuestOS.FstabFile.LossMountDevice | The instance has disks for which automatic mounting is disabled in the Attributes in the returned result:
| Modify the recommended mount attributes for the disk. For more information, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance? | ||
GuestOS.FileSystems.PartitionUnaligned | The disk of the instance has partitions that are not aligned to the recommended 2,048 sectors. When the disk is resized, the automatic partition extension operation in Linux may fail due to unaligned partitions, resulting in no increase in the available space of the file system. Attributes in the returned result:
| Fix the unaligned disk partition issue based on your business requirements. For more information, see How do I handle the failure to extend GPT partitions using growpart after resizing a cloud disk? | ||
GuestOS.FstabFile.IncorrectType | The file system configured for a device in the Attributes in the returned result:
| Modify the file system configured in the | ||
GuestOS.Mountpoint.Multiple | The Attributes in the returned result:
| Modify the | ||
GuestOS.NetworkStatus | Check the network configurations and status. | GuestOS.Network.InvalidNetmask | The IPv4 address or corresponding subnet mask of the instance is incorrectly configured, making the IP address configuration invalid and causing instance connection failures. Attribute in the returned result:
| Modify the subnet mask as needed. For more information, see How do I configure a static IP address for a Linux ECS instance? |
GuestOS.Network.InvalidDefaultRoute | No default route is configured for the instance, which may cause instance access failures. Attribute in the returned result:
| Modify the NIC configuration or system routing configuration to add the necessary routing rules. For more information, see What do I do if the "Network is unreachable" error message appears when I access a public IP address from a Linux instance? | ||
GuestOS.DHCPService.Disabled | The DHCP service process for the NICs on the instance is disabled. This may cause the instance's IP address to fail to renew after the lease expires, resulting in network interruption. The DHCP configuration for the Attributes in the returned result:
| Check the DHCP service-related configuration. For more information, see What do I do if network service exceptions occur on a Linux ECS instance? | ||
GuestOS.Udev.MacAddressNotExist | The udev rules for dynamic device management in the kernel of the instance contain residual entries in which MAC addresses do not match the actual configuration of the NICs. This inconsistency may cause the instance's network to malfunction or result in unexpected network device naming. Attributes in the returned result:
| Modify the udev rules to delete inconsistent configurations, such as the MAC addresses and network interface names. For more information, see How do I resolve network interface name drift in Linux instances with multiple network interfaces? | ||
GuestOS.DHCPService.CustomPort | ECS instances running specific versions of CentOS or RHEL 7 include DHClient versions earlier than 4.2.5-60. These earlier versions contain bugs and may listen on ports other than 67, 68, 546, and 547. If other services or processes on the ECS instances also use these ports, conflicts may occur, causing other services or processes to fail to start or become unavailable. Attributes in the returned result:
| Upgrade the DHClient service version at the earliest opportunity. For more information, see A port conflict occurs when you start a service or a process on a CentOS or RHEL 7 instance. | ||
GuestOS.NetworkConfig.InvalidInterface | The network configuration file of the instance specifies a nonexistent network interface, which may cause the system network service to fail to start or run abnormally. Possible causes include the following:
Attributes in the returned result:
| Add the required ENIs or delete the configuration files of nonexistent ENIs. | ||
GuestOS.Firewall | Check the system firewall status. | GuestOS.NetworkFirewall.Enabled | The firewall (iptables settings) of the instance is enabled. If the firewall rules block external access, connections to the instance may fail. | Modify the firewall configuration as needed. For more information, see Manage the system firewall on a Linux instance. |
GuestOS.CloudInitService | Check the cloud-init status. | GuestOS.CloudinitService.BadDriverStatus | The cloud-init driver of the instance is abnormal, which may prevent system configurations from being correctly executed during the system initialization phase, resulting in instance access failures. Attributes in the returned result:
| Check and start cloud-init as needed. For more information, see Install cloud-init. |
GuestOS.CloudinitService.StartFailed | The cloud-init service of the instance does not start as expected, which may cause system configuration failures and make the instance inaccessible. | Log on to the instance by using VNC, check cloud-init system logs, and restart the instance. | ||
GuestOS.SSHServiceStatus | Check the SSH service status. | GuestOS.SSH.ForbiddenRootLogin | The SSH service of the instance prohibits root account logon, preventing the root account from accessing the instance over SSH. Attribute in the returned result:
| Fix the root remote logon issue. For more information, see Resolve the "Permission denied, please try again" error for SSH connections to a Linux instance. |
GuestOS.SSH.MissingCriticalFileOrDirectory | Critical files or directories for the SSH service of the instance are missing, which prevents access to the instance over SSH. Attribute in the returned result:
| Reconfigure SSH-related directories and files. For more information, see Check Linux instances for the required files or directories required by the SSH service. | ||
GuestOS.SSH.IncorrectSSHFilePermission | The access permissions on files that the SSH service depends on are improperly configured, which prevents SSH access to the instance. Attributes in the returned result:
| Reconfigure SSH-related directories and files as needed. For more information, see Check Linux instances for the required files or directories required by the SSH service. | ||
GuestOS.SSH.ListeningPortMismatchWithConfig | The address and port that the sshd process is listening on do not match those in the configuration file. This mismatch may cause SSH connections to the expected address and port to fail. The address and port that the sshd process is listening on are not defined in the Attributes in the returned result:
| Change the listening address and port in the sshd configuration file based on your actual needs, then restart the sshd process to apply the changes. For more information, see Failed to remotely connect to a Linux instance due to an SSH access exception. | ||
GuestOS.TimeSyncService | Check the time synchronization service status. | GuestOS.TimeSyncService.Disabled | The time synchronization service of the instance is not working properly or is incorrectly configured. This may cause the system time to deviate from the actual standard time, affecting the normal operation of some applications on the instance. Attributes in the returned result:
| Modify the time synchronization service configuration as needed. For more information, see Clock synchronization. |
GuestOS.OSOOM | Check whether an OOM error occurs. | GuestOS.Memory.OOM | An OOM error occurs within the Guest OS of the instance. Example of the timestamp and detailed logs for the most recent OOM incident: | Check whether the current memory size of the instance is sufficient to support the business running on it. If necessary, upgrade the configuration to increase the instance memory. For more information about how to analyze the root cause of an OOM issue and resolve it, see How do I handle OOM errors on a Linux instance? |
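Two of the SSH result items above (GuestOS.SSH.ForbiddenRootLogin and GuestOS.SSH.ListeningPortMismatchWithConfig) can be checked by hand by comparing the sshd configuration file with the ports that sshd is actually listening on. The following Python sketch illustrates the idea. It is not the logic that the diagnostics feature uses: the function names `parse_sshd_config` and `find_ssh_issues` are hypothetical, the comparison rules are assumptions, and the caller is expected to supply the listening ports (for example, collected from the output of `ss -tlnp`).

```python
def parse_sshd_config(text):
    """Extract Port, ListenAddress, and PermitRootLogin from sshd_config text.

    Simplified on purpose: keywords are matched case-insensitively, comment
    lines are skipped, and parsing stops at the first Match block so that
    only global settings are considered.
    """
    ports, addresses, permit_root = [], [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # sshd_config accepts "Keyword value" and "Keyword=value".
        parts = line.replace("=", " ", 1).split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "match":
            break
        if keyword == "port" and value.isdigit():
            ports.append(int(value))
        elif keyword == "listenaddress":
            addresses.append(value)
        elif keyword == "permitrootlogin" and permit_root is None:
            # For most sshd keywords, the first occurrence takes effect.
            permit_root = value.lower()
    return {
        "ports": ports or [22],  # sshd listens on port 22 when no Port is set
        "listen_addresses": addresses,
        "permit_root_login": permit_root,
    }


def find_ssh_issues(config, listening_ports):
    """Map a parsed config to result item IDs from the table (assumed rules)."""
    issues = []
    if config["permit_root_login"] == "no":
        issues.append("GuestOS.SSH.ForbiddenRootLogin")
    if set(config["ports"]) != set(listening_ports):
        issues.append("GuestOS.SSH.ListeningPortMismatchWithConfig")
    return issues


if __name__ == "__main__":
    sample = "# example only\nPort 2222\nPermitRootLogin no\n"
    print(find_ssh_issues(parse_sshd_config(sample), listening_ports=[22]))
```

If the sketch reports a mismatch, the recommended operation in the table still applies: update the sshd configuration file as needed and restart the sshd process.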
Diagnostic items and results of Windows operating system configurations
Metric ID | Metric description | Result item ID | Item description | Recommended operation |
GuestOS.WinCPUUtil | Check whether CPU utilization is too high. | GuestOS.CPU.HighUtilization | The total CPU utilization of the instance exceeds 80%. The top five processes with the highest CPU utilization are listed below. Check whether they run as expected. Attributes in the returned result:
| Check whether CPU processes are abnormal. If this is caused by normal business operations, we recommend upgrading the ECS configuration. For more information about how to check high single-CPU utilization, see What do I do if a Windows instance has high CPU utilization? |
GuestOS.WinCoreCPU.HighUtilization | The instance has one or more CPUs with utilization exceeding 85%. Information about CPUs with utilization exceeding 85% is listed below. Check whether the processes run as expected. Attributes in the returned result:
| Check whether the processes run as expected. For more information about how to check high single-CPU utilization, see What do I do if a Windows instance has high CPU utilization? | ||
GuestOS.WinMemoryUtil | Check whether the memory usage is too high. | GuestOS.WinMemory.HighUtilization | The total memory usage of the instance exceeds 80%. Example of the top five processes with the highest memory usage: Attributes in the returned result:
| Disable unnecessary services or processes. For information about how to analyze high memory usage in Windows, see Memory analysis tools for Windows. |
GuestOS.WinMemory.LicenseCorrupted | The instance's Windows license database is corrupted or misconfigured. As a result, Windows Task Manager shows abnormally high hardware-reserved memory relative to the available memory, and the memory usage of the instance is high. Attribute in the returned result:
| Restore the Windows license database and then restart the instance. For information about how to fix a corrupted or improperly configured Windows license database, see What do I do if a Windows instance stutters due to excessive memory reserved for hardware? | ||
GuestOS.WinSysDiskUtil | Check whether the system disk usage is too high. | GuestOS.WinFileSystem.InsufficientSpace | The remaining space on the instance's system disk (C:) is insufficient, which may cause slow system responses or instance start failures. Attributes in the returned result:
| Resize the system disk or upgrade the instance type as needed. For more information, see Overview. |
GuestOS.WinSystemConfig | Check whether the critical system configurations are correct. | GuestOS.WinOSVersion.Low | The instance's guest operating system is an earlier version that is no longer maintained by Alibaba Cloud and Microsoft. Attribute in the returned result:
| Re-install the system and upgrade to a later version of Windows. For information about how to reinstall the system, see Replace the system disk (operating system) or Operating system migration. |
GuestOS.VirtIOVersion.Low | The instance's operating system uses a previous version of the virtio driver, which prevents online disk resizing. The Attributes in the returned result:
| Upgrade the virtio version as needed. For more information, see Update the virtio driver for a Windows instance. | ||
GuestOS.WinCrashDump.Disabled | The crash dump feature is disabled for the instance. When the system experiences an abnormal restart or blue screen, it cannot save the error information for troubleshooting. Attribute in the returned result:
| Enable the crash dump feature as needed. For information about how to enable the feature in Windows, see Enable or disable the kernel crash dump service for an instance. | ||
GuestOS.KMSService.MismatchedKey | The instance uses KMS to activate the Windows operating system, but the activation key used by the KMS client does not match the Windows version, causing activation failures. Attribute in the returned result:
| Follow the Windows activation tutorials to select a key that matches your Windows version. For information about how to activate Windows by using KMS, see Activate a genuine Windows Server system on an ECS instance using KMS domain names. | ||
GuestOS.KMSService.Disconnected | The instance cannot connect to the KMS activation server, causing activation failure. Attribute in the returned result:
| Check whether the firewall configuration or third-party software on the instance blocks access to the KMS activation server, and modify the relevant configurations as needed. For information about how to check the KMS activation server, see Resolve Windows activation failure on an ECS instance. | ||
GuestOS.SPPSVCService.Unhealthy | The instance's Software Protection Platform Service (SPPSVC.exe) cannot start or run, which prevents Windows activation and access to activation settings. Attribute in the returned result:
| Follow the Windows activation tutorials to restart the SPPSVC.exe service and change its startup type to Automatic (Delayed Start) to ensure the service starts automatically next time. | ||
GuestOS.SystemPatch.Incorrect | Incorrect system patches are installed on the instance, which may cause abnormal restarts or system crashes. Example of an incorrect patch: Attribute in the returned result:
| Uninstall the incorrect patches during an appropriate period of time as needed. For more information, see How do I uninstall system patches from a Windows ECS instance? | ||
GuestOS.WinFiles.Missing | Some critical system files are missing from the instance's system directory (
| Restore the system file as needed. For more information, see What do I do if a black screen appears and I cannot access the desktop when I remotely log on to a Windows instance? | ||
GuestOS.OperatingSystem.Unactivated | The instance's Windows operating system is not activated, which may cause unavailability of specific Windows personalization services. | Follow the Windows activation tutorials to activate the Windows operating system of the instance by using the correct KMS key. For more information, see Windows system ECS instance activation failed. | ||
GuestOS.WinSystemInit | Check the system initialization status. | GuestOS.SysPrepService.Interrupted | The system preparation service (SysPrep) initialization process is interrupted during instance creation because the instance is restarted too early. Some critical configurations of the operating system are incomplete, which may cause instance start failures. Attribute in the returned result:
| Due to incomplete system initialization, you must replace the system disk to reinstall the system or create another instance to replace this one. For more information, see Replace the system disk (operating system) or Re-initialize a system disk (reset the operating system). |
GuestOS.SysPrepService.InitFailed | The system initialization process is completed abnormally during instance creation, which may prevent the instance from working properly. The following error message appears: Attribute in the returned result:
| Replace the system disk to reinstall the system or create another instance to replace this one. For more information, see Replace the system disk (operating system) or Re-initialize a system disk (reset the operating system). | ||
GuestOS.WinSystemUser | Check the administrator account. | GuestOS.WinAdministrator.NotExist | The Administrator account does not exist, which may cause services to be inaccessible. Attribute in the returned result:
| Enable the Administrator account as needed. |
GuestOS.WinNetworkStatus | Check the network configuration and status. | GuestOS.WinNetworkInterfaceDriver.Disabled | The instance's NIC is unavailable, which may prevent connections to the instance. The NIC is disabled. Attributes in the returned result:
| Repair the NIC status as needed. For information about how to check and repair the NIC status, see Step 7: Check the network. |
GuestOS.WinRDPPort.Closed | The instance's system port is not open, or the firewall is enabled, preventing access to the instance over RDP. Attributes in the returned result:
| Change the open status of this port as needed. For information about how to enable port 3389 to allow RDP connections, see How do I enable Remote Desktop Services on a Windows ECS instance? | ||
GuestOS.WinDHCPService.Disabled | The DHCP configuration is disabled on the instance's NIC, which may cause services to be inaccessible. Attributes in the returned result:
| Enable DHCP on the instance's NIC as needed. | ||
GuestOS.WinNetworkInterface.LackIPV4Address | The instance's NIC has no IPv4 address, which may cause services to be inaccessible. Attribute in the returned result:
| Check whether DHCP is enabled on the instance or a static IP address is configured. | ||
GuestOS.NetworkProxy.Enabled | Network proxies are configured for the instance, which may cause services to be inaccessible. Attribute in the returned result:
| Disable the network proxies as needed. | ||
GuestOS.WinPort.Conflict | The instance's RDP port is used by another process, causing a port conflict that may prevent access to the instance over RDP. Attributes in the returned result:
| Log on to the instance by using VNC and modify the port for the Remote Desktop Service to work properly. For more information, see What do I do if port conflicts occur when I connect to a Windows ECS instance? | ||
GuestOS.WinDiskStatus | Check the Windows disk status. | GuestOS.SystemDisk.Corrupted | The instance's system disk (C:) is abnormal, which may cause instance restart failures or driver installation issues. Attribute in the returned result:
| Recover the system disk during an appropriate period of time by using one of the following methods:
|
GuestOS.VirtIODriver.DiskIDConflicts | The instance has duplicate disk IDs due to an outdated virtio driver version, which may cause data loss on disks during disk reset operations. Examples of disks with the same ID: Attribute in the returned result:
| Upgrade the virtio driver at the earliest opportunity. For more information, see Update the virtio driver for a Windows instance. | ||
GuestOS.WinFirewall | Check the Windows firewall status. | GuestOS.WinFirewall.Enabled | The firewall of the instance is enabled, which may cause services to be inaccessible. Attributes in the returned result:
| Modify the relevant firewall policy configurations as needed. For more information, see Configure firewall rules for a Windows ECS instance. |
GuestOS.WinDriverStatus | Check the critical Windows driver status. | GuestOS.DiskFilterDriver.Vestigital | The instance has residual disk filter driver files, which may prevent the instance from recognizing newly attached disks. Attributes in the returned result:
| Clear invalid disk filter drivers as needed and restart the instance. For more information, see How do I check for residual disk driver entries in the registry of a Windows ECS instance? |
GuestOS.VirtIODriver.Low | The instance's virtio driver version is {VirtioVersion}, which is outdated and may cause issues, such as blue screens, network packet loss, and disk data loss. | Upgrade the virtio driver version during an appropriate period of time. For more information, see Update the virtio driver for a Windows instance. | ||
Instance.Type.Xen | The instance type is outdated (based on the Xen architecture), which may cause operating system startup failures or Device Manager issues. Attribute in the returned result:
| Upgrade to a new-generation instance type as needed. For more information, see Upgrade the instance types of subscription instances or Change the instance type of a pay-as-you-go instance. | ||
GuestOS.WinSystemProcess | Check the critical Windows system process status. | GuestOS.RDPService.Unavailable | The instance's RDP service is disabled or corrupted, preventing access to the instance over RDP. | Restart or reinstall the RDP service as needed. For more information, see How do I enable Remote Desktop Services on a Windows ECS instance? |
GuestOS.RDP.BlockedByFirewall | The instance's firewall blocks access to the RDP service, which may prevent connections to the instance over RDP. Attribute in the returned result:
| Disable the firewall or add a rule to allow RDP (port 3389) access in the firewall rules. For information about how to allow RDP access in Windows, see Configure firewall rules for a Windows ECS instance. | ||
GuestOS.WSUS.Disconnected | The instance cannot connect to Windows Server Update Services (WSUS), which may prevent the operating system from receiving product updates. | Reconfigure WSUS as needed. | ||
GuestOS.Metaserver.Disconnected | The instance's metadata service (metaserver) cannot be connected or the connection times out, which may cause the instance's metadata to be inaccessible. | Check whether the instance's firewall configuration blocks 100.100.100.200. If so, allow it in the firewall settings before accessing the metadata service. For more information, see Instance metadata. | ||
GuestOS.WinLicence.Expired | The instance's license for the Remote Desktop Service has expired, causing the RDP service to malfunction and preventing access to the instance over RDP. | Log on to the instance by using VNC, and purchase the Microsoft Remote Desktop Service license or uninstall the Remote Desktop Service as needed. For information about how to fix Windows Remote Desktop license issues, see What do I do if I cannot connect to a Windows ECS instance by using RDP because no valid license is available for Remote Desktop Services? | ||
GuestOS.WinThirdPartSoftware | Check the third-party software installation status. | GuestOS.Operation.InfluencedByAntivirusProcess | Third-party antivirus software is installed on the instance, which may cause management operation failures (such as password reset and remote connection) and instance exceptions. Example of installed antivirus software: Attribute in the returned result:
| Uninstall the corresponding software as needed. |
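The utilization items in the table above use fixed thresholds: total CPU utilization above 80%, any single CPU above 85%, and total memory usage above 80%. The following Python sketch shows how locally collected values map to those result item IDs. It is an illustration only: the function name `classify_utilization` is hypothetical, and treating each threshold as a strict "greater than" comparison is an assumption based on the word "exceeds" in the descriptions.

```python
# Thresholds taken from the item descriptions in the table (percent).
TOTAL_CPU_THRESHOLD = 80.0
PER_CPU_THRESHOLD = 85.0
MEMORY_THRESHOLD = 80.0


def classify_utilization(total_cpu, per_cpu, memory):
    """Return the result item IDs whose thresholds are exceeded.

    total_cpu: total CPU utilization of the instance, in percent
    per_cpu:   list of per-CPU utilization values, in percent
    memory:    total memory usage of the instance, in percent
    """
    findings = []
    if total_cpu > TOTAL_CPU_THRESHOLD:
        findings.append("GuestOS.CPU.HighUtilization")
    if any(value > PER_CPU_THRESHOLD for value in per_cpu):
        findings.append("GuestOS.WinCoreCPU.HighUtilization")
    if memory > MEMORY_THRESHOLD:
        findings.append("GuestOS.WinMemory.HighUtilization")
    return findings


if __name__ == "__main__":
    # One busy core and high memory usage, but moderate total CPU utilization.
    print(classify_utilization(total_cpu=50.0, per_cpu=[10.0, 90.0], memory=81.0))
```

A finding from this sketch only indicates that a threshold is crossed. As the recommended operations state, check whether the listed processes run as expected before you change the instance configuration.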
Diagnostic items and results of user behavior tracking
Metric ID | Metric description | Result item ID | Result description | Recommended operation |
Instance.UnexpectedSgCreationOrDeletion | Query operations related to creating and deleting security groups within a specified time period based on the Resource Access Management (RAM) role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it. | Instance.UnexpectedSgCreationOrDeletion.Log | Display operations related to creating and deleting security groups. | View more details using ActionTrail. For more information, see Query events in the ActionTrail console. |
Instance.UnexpectedSgMember | Query operations related to instances' association with or disassociation from security groups within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it. | Instance.UnexpectedSgMember.Log | Display operations related to instances' association with or disassociation from security groups. | View more details using ActionTrail. For more information, see Query events by using the ActionTrail console. |
Instance.UnexpectedFee | Query operations related to instance billing within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it. | Instance.UnexpectedFee.Log | Display operations related to instance billing. | View more details using ActionTrail. For more information, see Query events in the ActionTrail console. |
Instance.UnexpectedCreationOrRelease | Query operations related to creating and deleting instances within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it. | Instance.UnexpectedCreationOrRelease.Log | Display operations related to creating and deleting instances. | View more details using ActionTrail. For more information, see Query events in the ActionTrail console. |
Instance.UnexpectedRunningStatus | Query operations that affect the instance running status within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it. | Instance.UnexpectedRunningStatus.Log | Display operations that affect the instance running status. | View more details using ActionTrail. For more information, see Query events in the ActionTrail console. |