Elastic Compute Service: Diagnostic items and results

Last Updated: Jan 13, 2026

This topic describes the diagnostic items in the Elastic Compute Service (ECS) console and the API diagnostic metrics supported by the self-service diagnostics feature. This topic also describes diagnostic scope and recommended operations.

Diagnostic types in the ECS console

The instance health diagnostics feature supports the following types of diagnostics:

Note
  • Exceptions discovered in the diagnostics of computing service health, network service health, storage service health, and instance configuration management health are not real-time exceptions. The diagnostic results include the exceptions that occurred within the last 12 hours. These exceptions may not require immediate action.

  • Exceptions discovered in the diagnostics of security control health, billing, resource quotas, and instance operating system configurations are real-time exceptions. We recommend that you fix these exceptions promptly.

Diagnostic items of computing service health

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Insufficient Resources

The instance cannot start due to insufficient CPU or memory resources.

Check whether the required physical CPU or memory resources are sufficient.

If physical resources are insufficient when the system attempts to reallocate resources to the instance, such as when you start an instance stopped in economical mode, the instance cannot start. You can wait a few minutes and try again or create another instance in another zone or region.

Exceptions in Instance Operating System

The instance operating system experiences a kernel panic exception, an out-of-memory (OOM) exception, or internal downtime.

Check whether faults, such as kernel panic, OOM exception, or internal downtime, exist in the instance operating system.

These faults may be caused by improper configurations of the instance or user programs in the instance operating system. You can restart the instance for recovery.

Exceptions on Instance Virtualization

The instance does not respond or unexpectedly stops during runtime.

Check whether exceptions exist in the core services at the underlying virtualization layer of the instance.

If exceptions exist, the instance may not respond or may unexpectedly stop. You can restart the instance for recovery.

Alerts for Instance Host

Alerts are triggered on the physical device that hosts the instance.

Check whether faults exist on the underlying physical server that hosts the instance.

If faults exist on the underlying physical server, the running state or performance of the instance may be affected. You can restart the instance for recovery.

Instance Performance Limited

The burstable instance is in standard mode.

Check whether the CPU credits of the burstable instance are sufficient to maintain high performance.

If the CPU credits are insufficient, the instance cannot burst its performance and can deliver only baseline performance during peak hours.

Instance CPU Exceptions

An exception occurred because instances compete for CPUs or because CPUs cannot be bound to the dedicated instance.

Check whether shared instances compete for CPUs at the underlying layer.

If shared instances compete for CPUs at the underlying layer, the dedicated instance cannot obtain CPUs or other exceptions occur. You can restart the instance for recovery.

Exceptions on Instance Management System

An exception occurred in the backend management system of the instance.

Check whether the backend management system of the instance works as expected.

If the system is not working as expected, exceptions may occur on the instance. You can restart the instance for recovery.

Instance Performance Temporarily Degraded

Check whether the performance of the instance is temporarily degraded due to issues with underlying software or hardware.

Check whether the performance of the instance is temporarily degraded due to issues with underlying software or hardware.

If the performance of the instance is degraded, the time when the performance is degraded appears. You can view the historical events or system logs of the instance to identify the cause of the performance degradation. For more information, see View historical system events and View system logs and screenshots.

Diagnostic items of network service health

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Packet Loss on Instance Network Link

Packets are lost on the physical devices or in the network service of the instance.

Check whether packet loss occurs on the network link of the instance.

If packet loss occurs, the network connectivity or throughput is affected. For example, the connection to the ECS instance fails or network access slows down. You can restart the instance for recovery.

Inconsistent Network Configurations

The network configurations of the instance are inconsistent with those of the underlying service.

Check whether the network configurations of the instance are consistent with those of the service.

If inconsistency exists, the instance network performance is affected. You can restart the instance for recovery.

Exceptions on Instance Link Layer

An exception occurred at the link layer of the network interface controllers (NICs) of the instance.

Send Address Resolution Protocol (ARP) requests to NICs to check whether the basic network configuration of the instance is normal.

If the requests fail, the instance is not started normally or the network configuration is abnormal. You can restart the instance for recovery.

NIC Loading Exceptions

An exception occurred when the NIC of the instance was being loaded.

Check whether the NIC of the instance can be loaded.

If the NIC cannot be loaded, the network connectivity of the instance is affected. For example, you cannot connect to the instance. You can restart the instance for recovery.

Packet Loss on NIC

Inbound or outbound packet loss occurred on the NIC.

Check whether inbound or outbound packet loss has occurred on the NIC.

If packet loss exists, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow. You can restart the instance for recovery.

Network Connection Exceptions

NIC connections cannot be established or the maximum number of connections is reached.

Check whether connections can be established on the NIC of the instance.

If connections cannot be established on the NIC or if the maximum number of connections is reached, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow. You can restart the instance for recovery.

Abnormal DDoS Protection State

Check the DDoS protection state of the instance and check whether the public IP address of the instance suffers from DDoS attacks.

Check whether the public IP address of the instance suffers from DDoS attacks.

The free Anti-DDoS Origin service provided by Alibaba Cloud can help you scrub malicious traffic and mitigate unavailability caused by DDoS attacks. If the amount of malicious traffic exceeds the protection capability of your instance, the instance becomes unavailable or inaccessible. For more information about DDoS attacks, see What is a DDoS attack?

You can purchase other anti-DDoS services to protect your instance against DDoS attacks. For more information, see Comparison of Alibaba Cloud Anti-DDoS solutions.

For information about the best practices for mitigating DDoS attacks, see Best practices for mitigating DDoS attacks.

Burst Bandwidth Limited

Check whether the burst bandwidth of the instance is limited.

Check the burst bandwidth of the instance.

If the burst bandwidth of the instance exceeds the upper limit allowed for the instance type, network performance becomes a business bottleneck. We recommend upgrading the instance to an instance type with higher bandwidth capabilities. For more information, see Change the instance type.

Note

For information about the burst bandwidth capabilities of various instance types, see Overview of instance families.

Network Traffic Throttled

Check whether the total internal and public bandwidth of the instance has reached the maximum bandwidth allowed for the instance type.

Check the total internal and public bandwidth of the instance.

If the total internal and public bandwidth exceeds the maximum baseline bandwidth supported by the instance type, network performance becomes a business bottleneck. We recommend upgrading the instance to an instance type with higher bandwidth capabilities. For more information, see Change the instance type.

Note

For information about the baseline network bandwidth capabilities of different instance types, see Overview of instance families.

Diagnostic items of storage service health

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Ineffective Disk Resizing Operation

After the disk of a Linux instance is resized in the ECS console, check whether further disk resize operations are required.

After the disk resize operation is performed in the ECS console, check whether the disk of the instance is resized. If not, run commands on the instance to extend the partitions and file systems of the disk. For more information, see Step 1: Resize a disk to extend the disk capacity.

Disk I/O Hang

A disk on the instance is experiencing an I/O hang, and data cannot be read from or written to the disk.

Check whether an I/O hang occurred in the system disk of the instance. During an I/O hang, the file systems of the disk have high read and write I/O latency, which can make the instance unstable or cause it to break down.

If a disk is experiencing an I/O hang, data cannot be read from or written to the disk. We recommend checking the performance metrics of the disk. For more information, see View the monitoring data of a cloud disk. For information about how to check for I/O hangs on instances that run Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers.

Disk Loading Exceptions

An exception occurred when a disk was created or attached.

Check whether a disk can be attached when the instance is being started.

If the disk cannot be attached to the instance, the instance may fail to start. Stop and restart the instance. You can also reattach the disk for instance recovery. For information about how to attach a disk, see Attach a data disk.

Disk Read/Write Limited

The I/O latency of the disk on the instance is high or the disk IOPS has reached the upper limit.

Check whether the system disk of the instance has high read and write I/O latency and whether the disk has reached its maximum read and write IOPS.

If a disk has reached its maximum read and write IOPS, the read and write operations on the disk are limited. For information about how to view disk metrics, see View the monitoring data of a cloud disk. To prevent the preceding issues, reduce the read and write frequency of the disk or upgrade the disk to a category that can deliver higher performance. For information about the read and write performance metrics of disk categories, see Block storage performance.

Disk Resizing Exceptions

After the disk is resized, the operating system cannot adjust the size of the file systems.

Check whether the size of the file systems in the system disk of the instance is also resized after you resize the system disk.

If the file systems are not resized, the resize operation failed due to insufficient resources or other reasons, and the disk cannot be used. You must resize the disk again. For information about how to resize disks in various operating systems and the limits that apply when you resize disks, see Overview.

Diagnostic items of instance configuration management health

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Instance Startup Exceptions

The instance cannot be started by the management system.

Check whether you can perform the boot operation on the instance.

If not, create another instance.

Core Operation Error

The operation you performed on the instance failed.

Check whether operations that you recently performed on the instance are successful. The operations include starting and stopping the instance and upgrading its configurations.

If the operations failed, repeat them.

Image Loading Exceptions

The image used by the instance cannot be loaded.

Check whether the image used by the instance can be loaded on startup.

The image may fail to be loaded due to system or image issues. You can restart the instance for recovery.

Diagnostic items of security control health

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Status of Common Ports

Check whether the security groups of the instance allow traffic on port 3389 (Windows instances) or port 22 (Linux instances).

Check whether traffic on common ports is allowed in the security groups of the instance.

If traffic on the common ports is denied, some services may not run as expected or the instance may not be accessible. Allow inbound traffic on the following ports:

  • SSH port 22

  • Remote Desktop Protocol (RDP) port 3389
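The port requirement above can be spot-checked from a client machine. The following is a minimal sketch, not the method the console uses: it attempts a TCP connection, so a failure may stem from the security group, the system firewall, or a service that is not listening. The function name `is_port_open` is illustrative, and the host you pass in is your own instance address.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("<instance public IP>", 22)` checks SSH and `is_port_open("<instance public IP>", 3389)` checks RDP.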

Diagnostic items of billing health

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Expiration of Subscription Instance

Check whether the subscription instance has expired.

Check whether your subscription instance has expired.

If your instance expires, it is stopped and cannot be accessed. For information about changes to resource states after a subscription instance expires, see Subscription. To recover the service, renew the instance. For more information, see Renew a subscription instance.

Check Whether the Pay-as-you-go Instance Is Stopped Due to an Overdue Payment

Check whether the pay-as-you-go instance is stopped and cannot be used due to overdue payments.

Check whether your pay-as-you-go instance has overdue payments.

If so, the instance is stopped and cannot be used. For information about changes to resource states after payments become overdue within your account, see Pay-as-you-go. You must add funds to your account and then reactivate the instance.

Overdue Payments for Instance Components

Check whether the disks or network bandwidth of the instance is unavailable due to overdue payments within your account.

Check whether the pay-as-you-go disks attached to the subscription instance or the bandwidth is unavailable due to overdue payments within your account.

If you have overdue payments for instance components, access to the instance is also affected. You must add funds to your account.

Diagnostic items of resource quota health

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Insufficient Disk Capacity Quota

Your disk capacity is approaching the quota.

Log on to the ECS console to request a quota increase. For more information, see ECS quota management.

Insufficient Image Quota

The number of images in your account is approaching the quota.

To increase the image quota, go to the General Quotas of Elastic Compute Service page, and click Apply in the Actions column for Total number of custom images that current account can own.

Insufficient ENI Quota

The number of secondary Elastic Network Interfaces (ENIs) in your account is approaching the quota.

Apply for a quota increase in the ECS console. For more information, see ECS quota management.

Insufficient NIC Queue Quota

The instance has reached the maximum number of NIC queues.

Insufficient Security Group Quota

The number of security groups in your account is approaching the quota.

To increase the security group quota, go to the General Quotas of Elastic Compute Service page, and click Apply in the Actions column for Maximum Number Of Security Groups.

Insufficient Security Group Quota for Resource

The ENI is approaching the maximum number of security groups to which it can be added.

Apply for a quota increase in the ECS console. For more information, see Manage ECS quotas.

If you adjust the limit on the number of security groups that an ECS instance or elastic network interface can join, the maximum number of rules in the security group will also change. For more information, see Security groups.

Insufficient Rule Quota for Security Group

The number of rules in the security group is approaching the quota.

Apply for a quota increase in the ECS console. For more information, see Manage ECS quotas.

If you adjust the maximum number of rules in a security group, the number of security groups that your ECS instance or elastic network interface can join will also change. For more information, see Security groups.

Diagnostic items of Linux-related configurations

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Total CPU Utilization

The top command output indicates that the CPU utilization of the instance exceeds 80%.

Check the total CPU utilization of the instance.

If the CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to query the usage of CPU resources, see Resolve high CPU utilization or load on a Linux instance.
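The 80% threshold above can also be checked without `top`. The sketch below is an assumption-laden alternative, not the diagnostic's own implementation: it derives total CPU utilization from two samples of the aggregate `cpu` line in `/proc/stat`, counting `idle` and `iowait` ticks as idle time. The function names are illustrative.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return idle, sum(values)


def cpu_utilization(prev_line: str, curr_line: str) -> float:
    """Return the percentage of non-idle CPU time between two samples."""
    prev_idle, prev_total = parse_cpu_line(prev_line)
    curr_idle, curr_total = parse_cpu_line(curr_line)
    delta_total = curr_total - prev_total
    if delta_total <= 0:
        return 0.0
    busy = delta_total - (curr_idle - prev_idle)
    return 100.0 * busy / delta_total
```

To use it, read the first line of `/proc/stat` twice a few seconds apart and compare the result against 80.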

Inodes in Disks

Check whether disk inodes are sufficient.

Check the inode usage of disks on the instance.

If the inode usage of a disk is high, files may fail to be created on the disk. Resize disks as needed. For more information, see Overview.
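Inode usage can be inspected with `df -i` or programmatically. The following is a minimal sketch that uses `os.statvfs` on a Linux instance. The helper names are illustrative, and the source does not state which usage level the diagnostic treats as high.

```python
import os
from typing import Optional


def inode_usage_percent(total: int, free: int) -> Optional[float]:
    """Return the percentage of inodes in use, or None if the total is unknown."""
    if total <= 0:
        # Some file systems do not report a fixed inode count.
        return None
    return 100.0 * (total - free) / total


def inode_usage_for_path(path: str) -> Optional[float]:
    """Return inode usage for the file system that contains path."""
    st = os.statvfs(path)
    return inode_usage_percent(st.f_files, st.f_ffree)
```

For example, `inode_usage_for_path("/")` reports usage for the root file system.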

DHCP Service

Check whether network-related processes exist when Dynamic Host Configuration Protocol (DHCP) is configured. If not, the IP address may be lost after the lease expires.

Check the DHCP process of the eth0 NIC on the instance.

If the DHCP process does not exist, the IP address of the instance may fail to renew after the lease expires, which causes network interruptions. For information about how to enable DHCP, see Configure DHCP on a Linux instance.

Devices in fstab

Check whether the fstab file contains the configurations of nonexistent devices.

Check the /etc/fstab file on the instance.

If the /etc/fstab file contains the configurations of the nonexistent devices, the instance may fail to start. For more information about how to remove the configurations of the nonexistent devices from the /etc/fstab file, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance?
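As an illustration of this check, the sketch below lists `/etc/fstab` entries whose device path does not exist. It is a simplified approximation, not the diagnostic's actual logic: it inspects only entries that begin with `/dev/` and skips `UUID=` and `LABEL=` entries, which would need to be resolved separately. The function name and the injectable `exists` parameter are illustrative.

```python
import os
from typing import Callable, List


def missing_fstab_devices(
    fstab_text: str, exists: Callable[[str], bool] = os.path.exists
) -> List[str]:
    """Return /dev/... device paths named in fstab_text that do not exist."""
    missing = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        spec = line.split()[0]  # first field is the device specification
        if spec.startswith("/dev/") and not exists(spec):
            missing.append(spec)
    return missing
```

On an instance, pass the contents of `/etc/fstab`, for example `missing_fstab_devices(open("/etc/fstab").read())`.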

Mounting Status of Devices in fstab

Check whether devices in the fstab file are correctly mounted.

Check the /etc/fstab file on the instance.

If devices are not configured to be mounted automatically in the /etc/fstab file, these devices cannot be used after the instance is restarted. You must run the mount command to manually mount the devices or configure the devices to be mounted automatically in the /etc/fstab file. For information about how to configure disks to be mounted automatically, see Automatically mount a data disk using a UUID in /etc/fstab.

fstab File Format

Check whether the content of the fstab file is in the correct format.

Check the /etc/fstab file on the instance.

If the /etc/fstab file has an invalid format, the instance may fail to start. For information about how to change the /etc/fstab file format, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance?

System Firewall Status

Check whether the system firewall is enabled.

Check the firewall configurations.

If the firewall is enabled for your instance and has rules configured to block external access, you may fail to connect to the instance. For information about how to enable and disable a firewall, see Manage the system firewall on a Linux instance.

System File Status

Check the status of critical system files.

The fsck tool diagnoses exceptions in the file systems of the instance, which may cause data loss and lead to issues such as instance access failures.

For information about how to check and repair file systems, see Check and repair the file systems on a Linux instance.

Limits Configuration

Check whether the limits configuration is correct.

Check the /etc/security/limits.conf file on the instance.

If the nofile value in the /etc/security/limits.conf file is larger than expected, you may fail to connect to the instance. For information about how to modify the limits system parameters, see Resolve remote connection failures or "Too many open files" errors on a Linux instance.
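The source does not define what "larger than expected" means. One plausible reading, assumed here, is a nofile value that exceeds the kernel limit `fs.nr_open` (readable from `/proc/sys/fs/nr_open`). The following sketch extracts numeric nofile entries from limits.conf text and flags those above a given limit. The function names are illustrative.

```python
from typing import List, Tuple


def nofile_entries(limits_text: str) -> List[Tuple[str, str, int]]:
    """Return (domain, type, value) for every numeric nofile entry."""
    entries = []
    for raw in limits_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        # Non-numeric values such as "unlimited" are skipped in this sketch.
        if len(parts) == 4 and parts[2] == "nofile" and parts[3].isdigit():
            entries.append((parts[0], parts[1], int(parts[3])))
    return entries


def oversized_nofile(limits_text: str, nr_open: int) -> List[Tuple[str, str, int]]:
    """Return the nofile entries whose value exceeds nr_open (assumed limit)."""
    return [entry for entry in nofile_entries(limits_text) if entry[2] > nr_open]
```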

Memory Configuration

Check whether the configured total huge page size is too large.

Check the /etc/sysctl.conf file on the instance.

If the number of huge pages and the huge page size configured in the /etc/sysctl.conf file are large, the total huge page size may exceed the total instance memory size. The total huge page size is calculated based on the following formula: Total huge page size = Number of huge pages × Size of each huge page. For information about how to adjust the huge page size, see How do I adjust the huge page size on a Linux ECS instance?
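The formula above can be worked through in a few lines. This sketch assumes the number of huge pages comes from `vm.nr_hugepages` in the sysctl configuration and that the page size and total memory come from the `Hugepagesize` and `MemTotal` fields of `/proc/meminfo`. The source names only the configuration file, so those inputs and the function names are assumptions.

```python
import re
from typing import Optional


def parse_nr_hugepages(sysctl_text: str) -> Optional[int]:
    """Return the vm.nr_hugepages value from sysctl.conf text, if set."""
    value = None
    for raw in sysctl_text.splitlines():
        line = raw.strip()
        if line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "vm.nr_hugepages" and val.strip().isdigit():
            value = int(val.strip())  # a later entry overrides an earlier one
    return value


def parse_meminfo_kb(meminfo_text: str, field: str) -> Optional[int]:
    """Return a kB-valued field such as MemTotal or Hugepagesize."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB"
    match = re.search(pattern, meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def total_huge_page_bytes(nr_hugepages: int, hugepage_size_kb: int) -> int:
    """Total huge page size = number of huge pages x size of each huge page."""
    return nr_hugepages * hugepage_size_kb * 1024
```

For example, 1,024 huge pages of 2,048 kB each reserve 2 GiB, which exceeds the memory of an instance with less than 2 GiB of RAM.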

Listening Status of Common Ports

Check whether common ports, such as port 22 and port 3389, are in the listening state.

Check the common ports of the instance.

If the common ports are not in the listening state, applications on the instance may be inaccessible. For information about how to check and modify common ports, see Test methods for TCP and UDP ports in Linux.

Processes with CPU Utilization Exceeding 50%

The top command output indicates that the CPU utilization of one or more processes exceeds 50%.

Check the CPU utilization of processes on the instance.

If the CPU utilization of some processes is high, check whether the processes are normal. For information about how to check the CPU utilization, see Resolve high CPU utilization or load on a Linux instance.

High Single-CPU Utilization

The top command output indicates that the single-CPU utilization exceeds 85%.

Check the single-CPU utilization of the instance over a period of time.

If the single-CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to check the CPU utilization, see Resolve high CPU utilization or load on a Linux instance.

Startup Status of Key System Processes

Check whether critical system processes are started.

Check the critical system processes of the instance.

If the critical system processes are not in the Running state, the instance may be inaccessible.

Kernel Parameters in NAT Environment

Check whether the kernel parameters in the NAT environment are valid.

Check the kernel parameters related to the NAT environment on the instance.

If exceptions exist in the kernel parameters related to the NAT environment, the instance cannot be connected over SSH and exceptions occur when you access the HTTP service on the instance. Check and adjust the net.ipv4.tcp_tw_recycle and net.ipv4.tcp_timestamps values in /etc/sysctl.conf. For information about how to fix kernel parameters in the NAT environment, see Why am I unable to access an ECS instance or an ApsaraDB RDS instance after I configure NAT for my client?
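The source does not say which parameter values are flagged. A commonly reported problem is `net.ipv4.tcp_tw_recycle = 1` combined with enabled TCP timestamps, which can drop connections from clients behind NAT. The sketch below checks for that combination only and rests on that assumption. The function names are illustrative.

```python
from typing import Dict


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse key = value pairs from sysctl.conf text."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def nat_risk(params: Dict[str, str]) -> bool:
    """Return True for the assumed risky combination of the two parameters."""
    # tcp_timestamps is treated as enabled when it is not set explicitly.
    return (
        params.get("net.ipv4.tcp_tw_recycle") == "1"
        and params.get("net.ipv4.tcp_timestamps", "1") == "1"
    )
```

Note that `net.ipv4.tcp_tw_recycle` does not exist on newer kernels, in which case this check never reports a risk.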

TCP SACK Configuration

Check whether TCP SACK is enabled.

Check whether TCP SACK is enabled for the instance.

If TCP SACK is disabled, the network performance of the instance may be affected. For information about how to enable TCP SACK, see Enable TCP SACK on a Linux instance.

Check Whether the Operating System is OOM

Check whether an OOM issue occurred in the instance operating system.

Check whether an OOM issue occurred in the instance operating system.

If so, check whether the amount of available instance memory is sufficient to support the business that runs on the instance. If the amount of available memory is insufficient, upgrade the instance configurations to increase the memory size. For information about how to analyze the root cause of an OOM issue and resolve it, see How do I handle OOM errors on a Linux instance?

Critical System File Format

Check the formats of critical system files.

Check whether critical system files on the instance are in the UNIX format.

If not, you may fail to connect to the instance. For information about how to change the system file format, see Critical files in non-Unix formats on a Linux instance.
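Assuming that "UNIX format" refers to LF line endings rather than Windows-style CRLF line endings, a file can be checked and converted as in the sketch below. Which files the diagnostic treats as critical is not stated in the source, and the function names are illustrative.

```python
def has_crlf_line_endings(data: bytes) -> bool:
    """Return True if the content contains Windows-style CRLF line endings."""
    return b"\r\n" in data


def to_unix_line_endings(data: bytes) -> bytes:
    """Return the content with CRLF line endings converted to LF."""
    return data.replace(b"\r\n", b"\n")
```

Read the file in binary mode before you pass its content to these helpers, and back up the original before you write the converted content.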

SELinux Status

Check whether SELinux is enabled.

Check whether SELinux is enabled on the instance.

If so, an error is reported when you connect to the instance over SSH. You can temporarily or permanently disable SELinux. For information about how to disable SELinux, see What do I do if an SSH connection to a Linux ECS instance becomes abnormal when SELinux is enabled?

Status and Password Settings of Critical System Users

Check whether critical system users have passwords. Critical system users include the root user in Linux and the administrator user in Windows.

Check whether critical users exist for the instance operating system.

If not, you may fail to connect to the instance. Check the status and password settings of critical users in /etc/passwd. For information about how to check a critical user, see A critical system user does not exist in a Linux instance.

SSH Access Permissions

Check whether the SSH access permissions are correctly configured.

Check the SSH access permissions of the instance.

If the SSH access permissions are incorrectly configured, you may fail to connect to the instance. For information about how to modify the SSH access permissions, see A critical system user does not exist in a Linux instance.

Critical File Systems for SSH

Check whether critical files or directories for SSH access exist.

Check critical files or directories required by SSH.

If the critical files or directories required by SSH do not exist, you may fail to connect to the instance over SSH. For information about how to fix critical files or directories required by SSH, see Check Linux instances for the required files or directories required by the SSH service.

Whether SSH Allows Root Logon

Check whether SSH allows you to log on as the root user.

Check whether SSH allows you to log on as the root user.

If SSH denies access from the root user, the "Permission denied, please try again" error message is returned when you attempt to connect to the instance as the root user over SSH. For information about how to fix the error, see Resolve the "Permission denied, please try again" error for SSH connections to a Linux instance.

NIC Multi-queue Status

Check whether NIC multi-queue is enabled.

Check whether NIC multi-queue is enabled for the NICs of the instance.

If not, the network performance of the instance may be affected. For information about how to enable NIC multi-queue, see NIC multi-queue.

Diagnostic items of Windows-related configurations

Diagnostic item in the ECS console

Description

Diagnostic scope and recommended operation

Windows Operating System Version

Microsoft no longer provides support for Windows Server 2008 and earlier versions.

Check the Windows operating system version of the instance.

Alibaba Cloud and Microsoft no longer provide support for Windows Server 2008 and earlier versions. We recommend installing an operating system version later than Windows Server 2008. For more information, see Replace the system disk (operating system).

High Total CPU Utilization

Check whether the total CPU utilization of the Windows instance exceeds 85%.

Check the CPU utilization of the instance.

If the total CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to check the CPU utilization, see What do I do if a Windows instance has high CPU utilization?

High Single-CPU Utilization

Check whether the single-CPU utilization exceeds 80%.

Check the CPU utilization of the instance.

If the single-CPU utilization is high, identify the processes that use large amounts of CPU resources and determine whether they are normal. For information about how to check the single-CPU utilization, see What do I do if a Windows instance has high CPU utilization?

High Memory Usage

Check whether the memory usage of the Windows instance exceeds 80%.

Check whether the memory usage exceeds 80%.

If so, the top five processes with the highest memory usage are displayed. Check whether the processes run as expected. For information about how to analyze the memory usage of Windows instances, see Memory analysis tools for Windows.

Common Windows Service Port Status

Check whether port 3389 is enabled for the Windows instance.

Check port 3389 of the instance.

If port 3389 is disabled, the instance cannot be accessed by using RDP. For information about how to enable port 3389 to allow remote desktop connections, see How do I enable Remote Desktop Services on a Windows ECS instance?

Windows NIC Status

Check whether the NICs of the Windows instance are enabled.

Check the NICs of the instance.

If the NICs are unavailable, the instance cannot be connected. For information about how to check and repair NICs, see Check network connectivity.

IPv4 Addresses of NICs

Check whether the NICs of the Windows instance are assigned IPv4 addresses.

Check whether the NICs are assigned IPv4 addresses.

If not, services on the instance may be inaccessible. Check whether DHCP is enabled for the instance or whether a static IP address is assigned to the instance. For information about how to enable DHCP, see Install and configure the DHCP server.

Network Proxy Status

Check whether network proxy information is configured.

Check whether network proxy information is configured.

If network proxy information is configured for the instance, services on the instance may be inaccessible. You must enable or disable the network proxies based on your business requirements. For information about how to disable the network proxies in Windows, see How to reset your Internet Explorer proxy settings.

DHCP Configuration Status

Check whether DHCP is enabled for the NICs of the Windows instance.

Check the status of DHCP on the NICs.

If DHCP is disabled for the NICs, services may be inaccessible. Modify the DHCP configurations of the NICs based on your business requirements. For information about how to enable and configure DHCP for Windows instances, see How To Install and Configure a DHCP Server in a Workgroup.

Windows Virtual Disk Driver Status

Check the virtio driver version.

Check the virtio driver version of the instance.

If the virtio driver is of an earlier version, disks attached to the instance cannot be resized online. For information about how to upgrade the virtio driver version, see Update the virtio driver for a Windows instance.

Disk Capacity

Check whether the available capacity of the system disk C:\ is less than 1 GB.

Check the available capacity of the system disk C:\ on the instance.

If the available capacity is less than 1 GB, the system may run slowly or the instance may fail to start. Resize the system disk based on your business requirements. For more information, see Overview.
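The 1 GB threshold can be checked with the Python standard library. This is a minimal sketch that assumes 1 GB means 1 GiB (1024³ bytes); the source does not specify the unit convention. The function names are illustrative.

```python
import shutil

ONE_GIB = 1024 ** 3  # assumed interpretation of "1 GB"


def is_low_on_space(free_bytes: int, threshold_bytes: int = ONE_GIB) -> bool:
    """Return True if free space is below the threshold."""
    return free_bytes < threshold_bytes


def free_bytes_on(path: str) -> int:
    """Return the free space, in bytes, of the volume that contains path."""
    return shutil.disk_usage(path).free
```

On a Windows instance, `is_low_on_space(free_bytes_on("C:\\"))` applies the check to the system disk.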

Windows Firewall Status

Check whether the Windows firewall is enabled.

Check whether the firewall is enabled for the instance.

If so, services on the instance may be inaccessible. Modify the firewall policies. For information about how to configure a firewall policy, see Configure firewall rules for a Windows ECS instance.

Crash Dump Configuration Status

Check whether crash dump collection is enabled for the instance.

Check whether crash dump collection is enabled for the instance.

If not, the instance cannot save relevant information for recovery when it unexpectedly restarts or encounters a blue screen of death. Enable or disable crash dump collection based on your business requirements. For information about how to enable crash dump collection in Windows, see Enable or disable the kernel crash dump service for an instance.

Administrator Account

Check whether the Administrator account exists.

Check whether the Administrator account exists.

If not, services may be inaccessible. You can create the Administrator account based on your business requirements. For information about how to create an account in Windows, see How to add or remove an administrator by using the Management Console.

API diagnostic metric categorization

Terms

  • Diagnostic metric (DiagnosticMetric): A unit that checks the status of an instance or account, such as CPU utilization.

  • Diagnostic item (Issue): An associated item discovered when a diagnostic metric is checked. The items are classified by severity level as Info, Warn, or Critical. Each diagnostic metric may be associated with multiple diagnostic items. If no associated diagnostic items exist, no issues are found when the system checks the diagnostic metric. However, this does not mean that no actual issues with the diagnostic metric exist.

  • Diagnostic metric set (DiagnosticMetricSet): A collection of diagnostic metrics that enables you to diagnose all metrics at a time.

    Important

    Diagnostic results are only used as a reference. A normal diagnostic result does not imply that no issues with the related system metrics occur.
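
The relationship between these terms can be sketched as a small data model. The following Python sketch is illustrative only: the class names, field names, the `worst_severity` helper, and the severity assigned in the example are assumptions made for this illustration and are not the response structure of any ECS API operation.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    """Severity levels of a diagnostic item, from least to most severe."""
    INFO = 1
    WARN = 2
    CRITICAL = 3


@dataclass
class Issue:
    """A diagnostic item found while a diagnostic metric is checked."""
    issue_id: str
    severity: Severity


@dataclass
class DiagnosticMetric:
    """A unit that checks one aspect of an instance or account."""
    metric_id: str
    issues: List[Issue] = field(default_factory=list)

    def worst_severity(self) -> Optional[Severity]:
        # None means that no issue was found, not that no issue exists.
        return max((i.severity for i in self.issues), default=None)


# The WARN level below is chosen for illustration only.
metric = DiagnosticMetric(
    "GuestOS.CPUUtil",
    [Issue("GuestOS.CPU.HighUtilization", Severity.WARN)],
)
print(metric.worst_severity().name)
```

A diagnostic metric set would then simply be a list of `DiagnosticMetric` objects that are checked together.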

The following table describes the instance health diagnostics items classified by feature and module.

Category code

Category name

Description

ECSService.ServiceHealth

Computing service health diagnostics

Checks the physical server resources and virtualization layer of ECS.

ECSService.InstanceNetwork

Diagnostics of network service health

Checks the status of network components on an instance and exceptions in the external network environment.

ECSService.InstanceStorage

Diagnostics of storage service health

Checks whether exceptions exist in the disks of an instance.

ECSService.InstanceConfigure

Diagnostics of instance configuration management health

Checks whether an operation is preventing an instance from starting or running as expected.

ECSService.SecurityGroup

Diagnostics of security control health

Checks whether inbound traffic on common ports is allowed in all security groups associated with an instance.

ECSService.AccountBalance

Diagnostics of billing health

Checks whether you have overdue payments for an instance and its associated components such as the public IP address and EIP traffic.

ECSService.GuestOS

Diagnostics of configurations in the Linux operating system

Checks the system files, key processes, and usage status of common ports and firewalls in the instance operating system.

ECSService.GuestOS

Diagnostics of configurations in the Windows operating system

Checks the usage status of common ports and firewalls in the instance operating system.

ECSService.ActionTrace

User behavior tracking diagnostics

Audits and traces instance billing-related operations, security group-related operations, and instance state-related operations.

Note
  • Exceptions detected during diagnostics of computing service health, network service health, storage service health, and instance configuration management are non-real-time exceptions. The diagnostic results include exceptions that occurred within the past 12 hours for viewing historical issues and may not require immediate resolution.

  • Exceptions detected during diagnostics of security control health, billing, resource quotas, and configurations in instance operating systems are real-time exceptions. These exceptions exist at the time of diagnosis, and we recommend that you resolve them immediately.
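
When you process diagnostic results in a script, the note above can be applied as a simple triage step. The following Python sketch uses the category codes from the table. The set names, the `triage` function, and the returned labels are hypothetical. Resource quotas have no category code in the table, and the note does not classify ECSService.ActionTrace, so both are left unclassified here.

```python
# Categories whose results cover exceptions from the past 12 hours.
HISTORICAL_CATEGORIES = {
    "ECSService.ServiceHealth",
    "ECSService.InstanceNetwork",
    "ECSService.InstanceStorage",
    "ECSService.InstanceConfigure",
}

# Categories whose results reflect the state at the time of diagnosis.
REAL_TIME_CATEGORIES = {
    "ECSService.SecurityGroup",
    "ECSService.AccountBalance",
    "ECSService.GuestOS",
}


def triage(category_code: str) -> str:
    """Return a hypothetical handling label for a category code."""
    if category_code in REAL_TIME_CATEGORIES:
        return "fix now"
    if category_code in HISTORICAL_CATEGORIES:
        return "review history"
    return "unclassified"


print(triage("ECSService.GuestOS"))
```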

Diagnostic items of computing service health

Diagnostic metric ID

Diagnostic metric description

Diagnostic result item ID

Diagnostic result item description

Recommended operation

Instance.ControllerError

Check whether the backend management system of the instance runs as expected.

Instance.ECSService.MngServiceException

The backend management system does not run as expected, which may cause the instance to run abnormally.

Restart the instance.

Instance.CPUException

Check whether shared instances compete for CPUs at the underlying layer.

Instance.ECSService.CPUBindFailure

CPU contention exists, which may cause the instance to be unable to obtain CPU resources or experience other exceptions.

Restart the instance.

Instance.CPUSplitLock

Check for an Intel CPU Split Lock issue.

Instance.ECSService.CPUSplitLock

The instance encounters an Intel CPU Split Lock issue.

Check whether your application on the ECS instance contains abnormal code that causes this issue and optimize the code.

Instance.GuestOSCrash

Check whether the instance operating system has crashed.

Instance.ECSService.GuestOSCrashed

The operating system has crashed.

Check whether your application on the ECS instance contains abnormal code that causes this issue and optimize the code.

Instance.HostDownAlert

Check whether faults exist in the underlying physical server that hosts the instance.

Instance.ECSService.HostDown

Faults exist in the underlying physical server. The status or performance of the instance may be affected.

Restart the instance.

Instance.PerformanceAffected

Check whether the instance performance is temporarily degraded due to issues with underlying software or hardware.

Instance.ECSService.PerformanceAffected

The performance of the instance is degraded. Check the historical system events or system logs of the instance to identify the cause. For more information, see View historical system events and View system logs and screenshots.

Restart the instance.

Instance.PerfRestrict

Check whether the CPU credits of the burstable instance are sufficient to maintain high performance.

Instance.ECSService.BurstPerformanceRestricted

If the CPU credits are insufficient, the burstable instance can deliver only baseline performance during peak hours and cannot burst its performance.

Check whether the instance meets your business requirements. If not, we recommend that you upgrade the instance type. For more information, see Upgrade the instance types of subscription instances or Change the instance type of a pay-as-you-go instance.

Instance.ResourceNotEnough

Check whether the required physical CPU or memory resources are sufficient.

Instance.ECSService.ResourceOutOfStock

If physical resources are insufficient when the system attempts to reallocate resources to the instance, such as when you start an instance that was stopped in economical mode, the instance cannot start.

Wait a few minutes and try again or create another instance in another zone or region.

Instance.SystemException

Check whether faults such as kernel panic, OOM exception, or internal downtime exist in the instance operating system.

Instance.ECSService.GuestOSException

Internal OS exceptions may be caused by improper instance configurations or improper program configurations in user space.

Restart the instance.

Instance.VirtException

Check whether exceptions exist in the core services at the underlying virtualization layer of the instance.

Instance.ECSService.VirtualizationException

This exception may cause the instance to stop responding or be unexpectedly stopped.

Restart the instance.

Instance.RecentUtilHigh

Check whether the historical load exceeds 80%.

Instance.UtilizationHigh.IntranetBandwidth

During the diagnostic period you selected, the internal bandwidth utilization of the instance exceeded 80%. High internal bandwidth utilization indicates that your instance transfers a large amount of internal network traffic.

Alibaba Cloud cannot determine the specific process information. Analyze further based on your business. For detailed monitoring information, log on to the CloudMonitor console.

Instance.UtilizationHigh.DiskIOPS

During the diagnostic period you selected, the IOPS utilization of the instance reached 80%. High IOPS utilization indicates that your instance is performing frequent I/O read and write operations.

Alibaba Cloud cannot determine the specific process information. Analyze further based on your business. For detailed monitoring information, log on to the CloudMonitor console.

Instance.UtilizationHigh.DiskBPS

During the diagnostic period you selected, the BPS utilization of the instance reached 80%. High BPS utilization indicates that your instance is transferring a large amount of data.

Alibaba Cloud cannot determine the specific process information. Analyze further based on your business. For detailed monitoring information, log on to the CloudMonitor console.

Instance.UtilizationHigh.CPU

During the diagnostic period you selected, the CPU utilization of the instance reached 80%. High CPU utilization indicates that your instance is performing high-frequency computing tasks.

For detailed monitoring information, log on to the CloudMonitor console.

Instance.KMSInvalid

Check whether the KMS key is working properly.

Instance.KMSInvalid.SecretInvalid

The current instance uses the key service provided by Key Management Service (KMS) to encrypt the system disk or data disks, but the instance fails to start because the key is invalid.

You can log on to the KMS console to check the status of the key used for the instance's disks. If the instance has an overdue payment, renew your subscription and restart the instance.

If the instance runs as expected, ignore this alert.

Diagnostic items of network service health

Metric ID

Metric description

Result item ID

Result description

Recommended operation

Instance.ArpPingError

Send ARP requests to NICs to check whether the basic network configurations of the instance are functioning properly.

Instance.ECSService.ARPPingIssue

An exception occurred at the link layer of the NICs of the instance.

If the requests fail, the instance does not start as expected or the network configuration is abnormal. Restart the instance.

Instance.DDoSStatus

Check whether the public IP address of the instance experiences DDoS attacks.

Instance.Security.SufferDDoSAttacks

The following sample data is returned in the additional information of the item:

{
 "Status": "DDoSDefense",
 "StartTime": "2022-07-07T02:25:20Z"
}

Attributes in the returned result:

  • ${Status}: the event that occurred. Valid values: DDoSDefense (the instance is under a DDoS attack and defense is in effect) and DDoSHole (the instance is under a DDoS attack and blackhole filtering is triggered).

  • ${StartTime}: the time when the event occurred.

The free Anti-DDoS Origin service provided by Alibaba Cloud can help you scrub malicious traffic and mitigate unavailability caused by DDoS attacks. If the amount of malicious traffic exceeds the protection capacity of your instance, the instance becomes unavailable or inaccessible. For more information, see What is a DDoS attack?

You can purchase other anti-DDoS services to protect your instance. For more information, see Comparison of Alibaba Cloud Anti-DDoS solutions.

For information about the best practices for mitigating DDoS attacks, see Best practices for mitigating DDoS attacks.
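
As a minimal sketch of how the additional information for Instance.Security.SufferDDoSAttacks might be consumed, the following Python example parses the sample shown above. The function name and the wording of the status descriptions are assumptions made for this example. Only the Status and StartTime fields come from the sample.

```python
import json
from datetime import datetime, timezone

# Descriptions paraphrased from the attribute list; wording is illustrative.
STATUS_MEANING = {
    "DDoSDefense": "attack detected, defense active",
    "DDoSHole": "attack detected, blackhole filtering triggered",
}


def describe_ddos_event(additional_info: str) -> str:
    """Turn the JSON additional information into a one-line summary."""
    info = json.loads(additional_info)
    # StartTime is in UTC, in the yyyy-MM-ddTHH:mm:ssZ format of the sample.
    started = datetime.strptime(info["StartTime"], "%Y-%m-%dT%H:%M:%SZ")
    started = started.replace(tzinfo=timezone.utc)
    meaning = STATUS_MEANING.get(info["Status"], "unknown status")
    return f"{info['Status']} since {started.isoformat()}: {meaning}"


sample = '{"Status": "DDoSDefense", "StartTime": "2022-07-07T02:25:20Z"}'
print(describe_ddos_event(sample))
```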

Instance.NetworkBoundLimit

Check the total internal and public bandwidth of the instance.

Instance.Network.IOLimit

The total bandwidth exceeds the maximum baseline bandwidth that the instance type supports, causing network performance to become a bottleneck for your business.

Upgrade the instance to an instance type that provides higher bandwidth capabilities. For more information, see Change the instance type.

Instance.NetworkBurstLimit

Check whether the burst bandwidth of the instance has reached the upper limit.

Instance.Network.BurstBoundLimit

The burst bandwidth exceeds the upper limit allowed for the instance type, causing network performance to become a bottleneck for your business.

Upgrade the instance to an instance type that provides higher bandwidth capabilities. For more information, see Change the instance type.

Instance.NetworkLoadFailure

Check whether the NIC of the instance can be loaded.

Instance.Network.ENILoadFailure

If the NIC cannot be loaded, the network connectivity of the instance is affected. For example, you cannot connect to the instance.

Restart the instance.

Instance.NetworkSessionError

Check whether connections can be established on the NIC of the instance.

Instance.Network.SessionException

If connections cannot be established on the NIC or if the maximum number of connections is reached, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow.

Restart the instance.

Instance.PacketDrop

Check whether inbound or outbound packet loss has occurred on the NIC.

Instance.Network.PacketDrop

If packet loss exists, the network connectivity or throughput of the instance is affected. For example, you cannot connect to the instance, or the network speed is slow.

Restart the instance.

Instance.NetworkConfigConsistency

Check whether the network metrics of the instance are normal.

Instance.NetworkConfig.Inconsistent

The effective network configuration of the instance is inconsistent with the underlying service configuration, which may affect the network performance of the instance.

  • If the instance runs as expected, ignore the alert.

  • If packet loss exists on the instance, restart the instance at an appropriate time.

Instance.NetworkLinkException

Check whether packet loss exists on the internal links of the instance.

Instance.Network.LinkException

The instance encounters packet loss on the underlying network links during the detection period, which may affect the performance of the instance.

  • If the instance runs as expected, ignore the alert.

  • If packet loss issues persist on the instance, restart the instance at an appropriate time.

Diagnostic items of storage service health

Metric ID

Metric description

Result item ID

Result description

Recommended operation

Instance.DiskLimit

Check whether the instance's system disk has read and write I/O latency and whether the read and write IOPS exceeds the upper limit of the disk.

Instance.Disk.IOLimit

The disk read and write IOPS has exceeded the upper limit, and read and write operations are restricted. For information about how to view disk metrics, see View the monitoring data of a cloud disk.

To prevent this issue from occurring, reduce the read and write frequency of the disk or upgrade it to a category that can deliver higher performance. For information about the read and write performance metrics of disk categories, see Block storage performance.

Instance.DiskLoadFailure

Check whether a disk can be attached to the instance during instance startup.

Instance.Disk.EBSLoadFailure

The disk cannot be attached to the instance. The instance cannot be started.

Stop and then restart the instance. Alternatively, you can re-attach the disk for instance recovery. For information about how to attach a disk, see Attach a data disk.

Instance.IOHang

Check whether an I/O hang occurred on the instance's system disk, such as when the disk's file systems have a high read and write I/O latency, causing instance instability or crash.

Instance.Disk.IOHang

The system disk experiences an I/O hang, and data cannot be read from or written to the disk.

We recommend that you check the performance metrics of the disk. For more information, see View the monitoring data of a cloud disk. For information about how to check for I/O hangs in instances that run Alibaba Cloud Linux 2, see Detect I/O hangs of file systems and block layers.

Instance.ResizeFsFailure

Check whether the file systems on the system disk are also extended after you resize the system disk.

Instance.Disk.ResizeFailure

The file systems are not extended, and the newly resized disk cannot be used.

Resize the disk again. For information about how to resize disks in various operating systems and the limits that apply when you resize disks, see Overview.

Instance.DiskFull

Check whether the disk usage reached 100% during a time period.

Instance.Disk.Full

The disk usage of the instance reached 100% during a specific period of time, which may cause instance exceptions.

Select one of the following solutions based on your needs to ensure that the system runs properly:

Diagnostic items of instance configuration management

Metric ID

Metric description

Result item ID

Result description

Recommended operation

Instance.BootFailure

Check whether you can perform the boot operation on the instance.

Instance.ECSService.BootIssue

The instance cannot start.

Restart the instance.

Instance.ImageLoadFailure

Check whether the image used by the instance can be loaded on startup.

Instance.ECSService.ImageIssue

The image may fail to be loaded due to system or image issues.

Restart the instance.

Instance.OperationFailure

Check whether operations that you performed on the instance are successful. These operations include starting and stopping the instance and upgrading the configurations of the instance.

Instance.ECSService.OperationError

An operation fails.

Try again.

Instance.BootScreenshot

Check whether the operating system boot failure is caused by operating system issues.

Instance.BootScreenshot.Exception

The instance operating system cannot start due to issues, such as abnormal configurations in the operating system or abnormal shutdown.

Log on to the instance by using VNC.

Diagnostic items of security health

Metric ID

Metric description

Result item ID

Result description

Recommended operation

Instance.SGIngress

Check whether inbound traffic on common ports is allowed in the security group rules of the instance NIC.

Instance.Network.SSHPortRuleDeny

The inbound SSH port 22 is not allowed.

{
 "Policy": "accept",
 "Port": "22",
 "Service": "SSH",
 "Protocol": "TCP",
 "Direction": "ingress"
}

To access the instance over SSH, configure an inbound rule in the security group to allow SSH access. For more information, see Add a security group rule.

Instance.SgRule.PingPortDeny

The instance cannot be pinged.

{
 "Policy": "accept",
 "Port": "-1",
 "Service": "PING",
 "Protocol": "ICMP",
 "Direction": "ingress"
}

To ping the instance, configure an inbound rule in the security group to allow the ping messages. For more information, see Add a security group rule.

Instance.SgRule.WinRemotePortDeny

The instance cannot be connected over RDP.

{
  "Policy": "drop",
  "Port": "3389",
  "Service": "WIN-REMOTE-DESKTOP",
  "Protocol": "TCP",
  "Direction": "ingress"
}

To access the instance by using Remote Desktop, configure an inbound rule in the security group to allow remote desktop access. For more information, see Add a security group rule.

Instance.SecurityRisk

Check whether security risks exist on the instance.

Instance.Security.Risk

The instance has security risks that may cause exceptions.

For more information about security risks, log on to the Security Center.

Billing diagnostic items and results

Metric ID

Metric description

Result item ID

Result description

Recommended operation

Instance.ExpenseException

Check whether the billing status of the ECS instance is abnormal.

Account.Balance.ExpenseException

Some resources of the instance have billing status exceptions, such as an expired subscription or an overdue payment. These exceptions prevent connections to the instance or normal use of the instance.

The resources with billing status exceptions are listed below. Renew the instance or add funds to your account, and then restart and log on to the instance.

{$InstanceId}/{$Ip} is in the {status} state.

Example:

{
 "InstanceId":"i-bp1amip45xxxxxxxx",
 "Status":"AccountEnough/AccountNotEnough/Expired/NotExpired"
}

Example:

{
 "Ip":"123.x.x.x",
 "Status":"AccountEnough/AccountNotEnough/Expired/NotExpired"
}

Attributes in the returned result:

  • AccountEnough: The instance has no overdue payment.

  • AccountNotEnough: The instance has an overdue payment.

  • Expired: The subscription instance expired.

  • NotExpired: The subscription instance has not expired.

For information about ECS billing, overdue payments, and renewal operations, see Billing overview.
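
The message template and status values above can be combined as in the following Python sketch. It assumes that each record carries either an InstanceId or an Ip field plus a single Status value, as in the examples. The function name, the `ACTION_REQUIRED` set, and the resource ID used in the call are hypothetical.

```python
# Status values that call for renewal or adding funds, per the list above.
ACTION_REQUIRED = {"AccountNotEnough", "Expired"}


def summarize_billing(record: dict) -> str:
    """Build the '{resource} is in the {status} state.' message for one record."""
    # A record identifies the resource by either InstanceId or Ip.
    resource = record.get("InstanceId") or record.get("Ip")
    status = record["Status"]
    line = f"{resource} is in the {status} state."
    if status in ACTION_REQUIRED:
        line += " Renew the instance or add funds."
    return line


# "i-example" is a placeholder ID used only for this illustration.
print(summarize_billing({"InstanceId": "i-example", "Status": "Expired"}))
```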

Diagnostic items and results of Linux operating system configurations

Metric ID

Metric description

Result item ID

Result description

Recommended operation

GuestOS.CPUUtil

Check whether CPU utilization is too high.

GuestOS.CPU.HighUtilization

The total CPU utilization of the instance exceeds 80%.

Check the following top 5 processes by CPU utilization.

{
  "ProcessCPUUsageTop5": [
    {
      "Pid": "1234",
      "CommandName": "/usr/bin/cpu_load.py",
      "AverageCPU": 80
    }
  ]
}

Attributes in the returned result:

  • ${ProcessCPUUsageTop5}: the top 5 processes by CPU utilization.

  • ${Pid}: the process ID.

  • ${CommandName}: the process name.

  • ${AverageCPU}: the average CPU utilization.

For information about how to query CPU utilization, see Resolve high CPU utilization or load on a Linux instance.
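
The following Python sketch shows one way to rank the processes returned for GuestOS.CPU.HighUtilization. The field names come from the sample above. The function name and the 50% default threshold are assumptions chosen for illustration.

```python
import json


def busiest_processes(additional_info: str, threshold: float = 50.0):
    """Return (Pid, CommandName, AverageCPU) tuples at or above threshold, highest first."""
    processes = json.loads(additional_info)["ProcessCPUUsageTop5"]
    hot = [p for p in processes if p["AverageCPU"] >= threshold]
    hot.sort(key=lambda p: p["AverageCPU"], reverse=True)
    return [(p["Pid"], p["CommandName"], p["AverageCPU"]) for p in hot]


sample = json.dumps({
    "ProcessCPUUsageTop5": [
        {"Pid": "1234", "CommandName": "/usr/bin/cpu_load.py", "AverageCPU": 80},
    ]
})
print(busiest_processes(sample))
```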

GuestOS.CoreCPU.HighUtilization

One or more CPU cores of the instance have a utilization of more than 85%.

Check the following CPU cores whose utilization exceeds 85%.

{
  "CPUCoreUsage": [
    {
      "Processor": 1,
      "AverageCPU": 80
    }
  ]
}

Attributes in the returned result:

  • ${CPUCoreUsage}: the CPU cores with utilization exceeding 85%.

  • ${Processor}: the CPU core number.

  • ${AverageCPU}: the CPU core utilization.

For information about how to query CPU resource usage, see Resolve high CPU utilization or load on a Linux instance.

GuestOS.MemUtil

Check whether the instance memory usage is too high.

GuestOS.Memory.HighUtilization

The total memory utilization of the instance exceeds 80%.

Example of the top five processes with the highest memory usage:

{
  "TotalPercent": 95,
  "TopUtilizationProcesses": [
    {
      "Pid": "1223",
      "CommandName": "/usr/bin/mem.py",
      "PhysicalMemoryPercent": 50
    }
  ]
}

Attributes in the returned result:

  • ${TotalPercent}: the overall memory usage.

  • ${TopUtilizationProcesses}: the top 5 processes by memory usage.

  • ${Pid}: the process ID.

  • ${CommandName}: the process name.

  • ${PhysicalMemoryPercent}: the memory usage of the current process.

Disable unnecessary services or processes as needed. If this is caused by your normal business operations, we recommend that you upgrade your ECS configuration.

For information about how to query memory usage, see What do I do if the memory usage of a Linux instance is high?

GuestOS.DiskUtil

Check whether the system disk usage is too high.

GuestOS.SystemDisk.InsufficientSpace

The disk space or inode usage of some file systems on the instance's disks exceeds 80%. This may prevent new files from being created on these partitions.

Example of disks with high inode usage:

[
  {
    "FilesystemName": "ext4",
    "FilesystemType": "ext4",
    "MountPoint": "/root",
    "SpaceUsedPercent": 10,
    "InodeUsedPercent": 50
  }
]

Attributes in the returned result:

  • ${FilesystemName}: the file system name.

  • ${FilesystemType}: the file system type.

  • ${MountPoint}: the mount point of the file system.

  • ${SpaceUsedPercent}: the percentage of used disk space.

  • ${InodeUsedPercent}: the percentage of used inodes.

Resize your disk as needed. For more information, see Overview.

For information about how to resolve inode capacity issues, see Resolve "no space left" issues on a Linux instance.
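
The 80% rule for GuestOS.SystemDisk.InsufficientSpace can be sketched in Python as follows. The field names come from the sample above. The function name and the shape of the returned tuples are assumptions made for this example.

```python
def filesystems_over_limit(filesystems, limit=80):
    """Return (MountPoint, resource, percent) for each usage figure above limit."""
    findings = []
    for fs in filesystems:
        for key, label in (("SpaceUsedPercent", "space"),
                           ("InodeUsedPercent", "inodes")):
            if fs[key] > limit:
                findings.append((fs["MountPoint"], label, fs[key]))
    return findings


# Hypothetical input in the same shape as the sample.
report = [
    {"FilesystemName": "/dev/vda1", "FilesystemType": "ext4",
     "MountPoint": "/", "SpaceUsedPercent": 91, "InodeUsedPercent": 12},
]
print(filesystems_over_limit(report))
```

Space and inode usage are checked separately because a file system can run out of inodes while free space remains.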

GuestOS.SystemConfig

Check whether critical system configurations are correct.

GuestOS.AuditConfig.AutoShutdown

The Audit service configuration file of the instance has high-risk parameter configurations. When the file system storing Audit service logs runs out of space, the operating system automatically shuts down. After restart, the operating system may shut down repeatedly because Audit service logs are continuously generated.

{
  "ActionValue": "halt",
  "ConfigPath": "/etc/audit/auditd.conf",
  "ActionKey": "space_left_action"
}

Attributes in the returned result:

  • ${ConfigPath}: the Audit service configuration file.

  • ${ActionKey} = ${ActionValue}: a high-risk parameter configuration that causes the automatic operating system shutdown when the file system runs out of space.

Modify the configuration items in the Audit service configuration as needed. For more information, see How do I modify the auditd service configuration to prevent automatic shutdown due to insufficient disk space?
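
To look for this kind of setting yourself, you can scan the content of the Audit service configuration file. The following Python sketch flags every key that ends in `_action` and is set to `halt`. This is a broader rule than the single `space_left_action` key shown in the sample and is an assumption of this example, as is the function name.

```python
def risky_audit_actions(conf_text: str):
    """Return (key, value) pairs in auditd.conf text whose action is 'halt'."""
    findings = []
    for raw in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.lower().endswith("_action") and value.lower() == "halt":
            findings.append((key, value))
    return findings


# Hypothetical file content used only for this illustration.
example = """
space_left = 75
space_left_action = halt
admin_space_left_action = SUSPEND
# disk_full_action = halt
"""
print(risky_audit_actions(example))
```

On an instance, you could pass the content of /etc/audit/auditd.conf to the function.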

GuestOS.LimitsFile.UnreasonableConfig

Some configurations in the /etc/security/limits.conf system file of the instance exceed the default values, which may prevent connections to the instance.

Examples of abnormal parameters:

[
  {
    "LimitDomain": "unused",
    "SysctlValue": 1048576,
    "LimitItem": "nofile",
    "LimitType": "hard",
    "LimitValue": 1048577
  }
]

Attributes in the returned result:

  • ${LimitItem}: the limit item, such as nofile.

  • ${LimitDomain}: the domain.

  • ${LimitValue}: the value.

  • ${LimitType}: the type.

  • ${SysctlValue}: the system configuration value (nr_open).

Modify the configurations in the limits.conf file. For more information, see Resolve remote connection failures or "Too many open files" errors on a Linux instance
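
In the sample above, the hard nofile limit (1048577) is greater than the nr_open value (1048576). Assuming that this comparison is what the diagnostic item reports, the check can be sketched in Python as follows. The function name is hypothetical.

```python
def invalid_nofile_limits(entries):
    """Return the nofile entries whose LimitValue is greater than SysctlValue (nr_open)."""
    return [
        e for e in entries
        if e["LimitItem"] == "nofile" and e["LimitValue"] > e["SysctlValue"]
    ]


sample = [
    {"LimitDomain": "unused", "SysctlValue": 1048576,
     "LimitItem": "nofile", "LimitType": "hard", "LimitValue": 1048577},
]
for entry in invalid_nofile_limits(sample):
    print(entry["LimitType"], entry["LimitValue"], ">", entry["SysctlValue"])
```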

GuestOS.EnormousPageSize.UnreasonableConfig

The number of huge pages in the system file /etc/sysctl.conf of the instance is incorrectly configured, which may prevent connections to the instance.

{
  "SysctlNrenormouspages": 10,
  "Enormouspagesize": 100,
  "TotalMemory": 1024000
}

Attributes in the returned result:

  • ${SysctlNrenormouspages}: the number of huge pages.

  • ${Enormouspagesize}: the size of each memory page, in KB.

  • ${TotalMemory}: the total memory size of the instance, in KB.

Change the number of huge pages as needed. For more information, see How do I adjust the huge page size on a Linux ECS instance?
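
The three values in the result are enough to estimate how much memory the huge page reservation claims. The following Python sketch multiplies the number of huge pages by the page size and divides the product by the total memory. The function name is hypothetical, and the sketch does not state the threshold at which the diagnostic item is reported because the source does not specify one.

```python
def hugepage_share(info: dict) -> float:
    """Return the fraction of total memory reserved for huge pages."""
    reserved_kb = info["SysctlNrenormouspages"] * info["Enormouspagesize"]
    return reserved_kb / info["TotalMemory"]


sample = {"SysctlNrenormouspages": 10, "Enormouspagesize": 100, "TotalMemory": 1024000}
print(f"{hugepage_share(sample):.2%} of memory is reserved for huge pages")
```

A share close to or above 100% means that the reservation leaves little or no memory for the rest of the system.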

GuestOS.SELinuxService.Enabled

The SELinux service is enabled on the instance, which may prevent SSH connections to the instance.

Temporarily or permanently disable the SELinux service. For more information, see What do I do if an SSH connection to a Linux ECS instance becomes abnormal when SELinux is enabled?

GuestOS.NvmeIOTimeout.UnreasonableConfig

A short I/O read/write timeout period is configured for Non-Volatile Memory Express (NVMe) disks in the system file of the instance. This may cause the NVMe disks to become read-only after an I/O timeout, resulting in data write failures.

{
 "File": "/proc/sys/nvme_core/io_timeout",
 "CurrentSetting": 100
}

Attributes in the returned result:

  • ${File}: the configuration file.

  • ${CurrentSetting}: the I/O timeout period.

Change the value to 4294967295 as needed. For more information, see What do I do if a NVMe disk on a Linux ECS instance is unavailable due to an invalid I/O timeout parameter?

GuestOS.SysctlUnknownNmiPanic.Enabled

The non-maskable interrupt configuration in the kernel of the instance is inappropriate. This can cause unexpected kernel panic and instance restart when the instance encounters a non-maskable interrupt.

{
 "File": "/proc/sys/kernel/unknown_nmi_panic",
 "CurrentSetting": 100
}

Attributes in the returned result:

  • ${File}: the configuration file.

  • ${CurrentSetting}: the parameter value.

Change the value to 0 as needed. For more information, see Why does the setting of the kernel parameter kernel.unknown_nmi_panic cause an abnormal restart of a Linux instance?

GuestOS.NetworkInterfaceMultiQueue.Disabled

The multi-queue feature is disabled for one or more NICs on the instance, which may affect network performance.

[
  {
    "InterfaceName": "eth1",
    "Status": "disable"
  }
]

Attributes in the returned result:

  • ${InterfaceName}: the NIC name.

  • ${Status}: the multi-queue status.

Enable the multi-queue feature as needed. For more information, see NIC multi-queue.

GuestOS.SysctlIPv4TCPSACK.Disabled

The tcp_sack feature is not enabled on the instance, which may affect the network performance.

[
  {
    "File": "/proc/sys/net/ipv4/tcp_sack",
    "Value": 0
  }
]

Attributes in the returned result:

  • ${File}: the configuration file.

  • ${Value}: the parameter value.

Change the value to 1 as needed. For information about how to enable tcp_sack, see Enable TCP SACK on a Linux instance.

GuestOS.SysctlIPv4TCPTWRecycle.Enabled

The NAT-related kernel parameters are incorrectly configured on the instance. This prevents SSH connections to the instance and causes abnormal access to HTTP services on the instance.

[
  {
    "File": "/proc/sys/net/ipv4/tcp_tw_recycle",
    "Value": 1
  }
]

Attributes in the returned result:

  • ${File}: the configuration file path.

  • ${Value}: the parameter value.

Change the value to 0 as needed. For information about how to fix kernel parameters in the NAT environment, see Common kernel network parameters of Linux ECS instances and FAQ.

GuestOS.SysctlIPv4TCPTWReuse.Disabled

The TIME_WAIT socket reuse feature is disabled for the instance. Sockets in the TIME_WAIT state cannot be used for new TCP connections. This may affect the network performance when the instance sends requests.

{
  "CurrentSetting": 0
}

Attribute in the returned result:

${CurrentSetting}: the value of the net.ipv4.tcp_tw_reuse kernel parameter.

Change the net.ipv4.tcp_tw_reuse value to 1 to enable the TIME_WAIT socket reuse feature. For more information, see Common kernel network parameters of Linux ECS instances and FAQ.
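
Several of the preceding items recommend a specific value for a kernel parameter. The following Python sketch collects those recommendations and compares them with the values found under /proc/sys. The function names and the dictionary layout are assumptions of this example. A parameter that does not exist on the running kernel is skipped.

```python
# Recommended values taken from the items above.
RECOMMENDED = {
    "/proc/sys/kernel/unknown_nmi_panic": 0,
    "/proc/sys/net/ipv4/tcp_sack": 1,
    "/proc/sys/net/ipv4/tcp_tw_recycle": 0,
    "/proc/sys/net/ipv4/tcp_tw_reuse": 1,
}


def read_setting(path):
    """Return the integer value stored in path, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def mismatches(current):
    """Return {path: (current, recommended)} for values that differ."""
    return {
        path: (value, RECOMMENDED[path])
        for path, value in current.items()
        if path in RECOMMENDED and value is not None and value != RECOMMENDED[path]
    }


# On a Linux instance this reads the live values; elsewhere it prints {}.
print(mismatches({path: read_setting(path) for path in RECOMMENDED}))
```

The sketch only reports differences. Whether to change a value still depends on your business requirements.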

GuestOS.SysctlNetfilterNfMaxConnections.Unreasonable

The instance's historical system logs contain error logs within a period of time. This issue occurs when the connection tracking hash table of the nf_conntrack kernel module, which tracks network connection entries to support NAT, is full. This may cause intermittent network packet loss on the instance.

{
  "Timeout": 432000,
  "KernelMessages": [
    {
      "source": "dmesg command",
      "message": "[14124341.747244] nf_conntrack: table full, dropping packet"
    },
    {
      "source": "/var/log/messages",
      "message": "Nov 15 23:51:16 iZm5efna3fievtdlq82p1mZ kernel: nf_conntrack: table full, dropping packet"
    }
  ],
  "ConnectionMax": 65536
}

Attributes in the returned result:

  • ${Timeout}: the value of net.netfilter.nf_conntrack_tcp_timeout_established.

  • ${ConnectionMax}: the value of net.netfilter.nf_conntrack_max.

  • ${KernelMessages.source}: the historical system log.

  • ${KernelMessages.message}: the error log content.

Change the values of these two parameters in the kernel configuration file of the instance based on your business requirements and system conditions. For more information, see Common kernel network parameters of Linux ECS instances and FAQ.

GuestOS.PidMax.TooSmall

The number of running processes on the instance exceeds two-thirds of the maximum number of processes (kernel.pid_max), which may prevent the system from creating new processes.

{
  "PidMax": 900,
  "ProcessCount": 615
}

Attributes in the returned result:

  • ${PidMax}: the value of kernel.pid_max.

  • ${ProcessCount}: the number of processes in the system.

Increase the value of kernel.pid_max. For more information, see How do I handle the "task: Cannot allocate memory" error when starting a service in a Linux system?
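
The two-thirds rule for GuestOS.PidMax.TooSmall can be checked against the sample values as in the following Python sketch. The function name is hypothetical. Integer arithmetic is used to avoid rounding at the boundary.

```python
def pid_max_too_small(info: dict) -> bool:
    """Return True if ProcessCount exceeds two-thirds of PidMax."""
    return 3 * info["ProcessCount"] > 2 * info["PidMax"]


sample = {"PidMax": 900, "ProcessCount": 615}
print(pid_max_too_small(sample))
```

In the sample, two-thirds of 900 is 600, and 615 running processes exceed that value.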

GuestOS.SysctlTcpMaxTwBuckets.Unreasonable

The instance's historical system logs contain error logs within a period of time. This issue occurs when the instance has too many connections in the TIME_WAIT state, which may cause unexpected disconnections or failures to respond to new connections and affect instance access or service response.

{
  "TwBuckets": 262144,
  "KernelMessages": [
    {
      "source": "dmesg command",
      "message": "[336877.139205] TCP: time wait bucket table overflow"
    },
    {
      "source": "/var/log/messages",
      "message": "Nov  1 14:08:32 iZbp13lj7h3lh086kdl7kpZ TCP: time wait bucket table overflow"
    }
  ]
}

Attributes in the returned result:

  • ${KernelMessages.source}: the historical system log.

  • ${KernelMessages.message}: the error log content.

  • ${TwBuckets}: the value of net.ipv4.tcp_max_tw_buckets.

This issue typically results from improper configuration of the net.ipv4.tcp_max_tw_buckets kernel parameter. Change the value to accelerate connection closure as needed. For more information, see Common kernel network parameters of Linux ECS instances and FAQ.
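The returned result can be summarized in a script before you decide whether to change net.ipv4.tcp_max_tw_buckets. The following is a minimal sketch, assuming the result is available as a JSON string in the shape shown above. The function name summarize_tw_result is hypothetical.

```python
import json

# Kernel log text that indicates a TIME_WAIT bucket overflow (from the sample above).
OVERFLOW_MARKER = "time wait bucket table overflow"


def summarize_tw_result(result_json):
    """Return (TwBuckets, [(source, message), ...]) for overflow log entries."""
    result = json.loads(result_json)
    matches = [
        (entry["source"], entry["message"])
        for entry in result.get("KernelMessages", [])
        if OVERFLOW_MARKER in entry["message"]
    ]
    return result["TwBuckets"], matches


sample = json.dumps({
    "TwBuckets": 262144,
    "KernelMessages": [
        {"source": "dmesg command",
         "message": "[336877.139205] TCP: time wait bucket table overflow"},
        {"source": "/var/log/messages",
         "message": "Nov  1 14:08:32 iZbp13lj7h3lh086kdl7kpZ TCP: time wait bucket table overflow"},
    ],
})
limit, entries = summarize_tw_result(sample)
print(limit, len(entries))
```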

GuestOS.SystemUserPwd

Check the system account and password settings.

GuestOS.SystemUser.MissingInfo

The system account of the instance does not exist, which may cause instance logon failures.

[
  {
    "MissingUsername": "postfix",
    "Source": "/etc/passwd"
  }
]

Attributes in the returned result:

  • ${Source}: the configuration file path.

  • ${MissingUsername}: the system account.

Add the account information as needed. For information about how to check for missing system users, see A critical system user does not exist in a Linux instance.

GuestOS.SystemUserFile.NotUnixFormat

The format of the system account file on the instance is incorrect, which may cause instance logon failures.

[
  {
    "File": "/etc/passwd"
  }
]

Attribute in the returned result:

${File}: the invalid file path.

Modify the file format as needed. For more information, see Critical files in non-Unix formats on a Linux instance.
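The following sketch illustrates one way to detect and normalize a non-Unix file. It assumes that "non-Unix format" refers to Windows-style CRLF line endings, which this topic does not state explicitly. The function names are hypothetical. Back up the file before you rewrite it on a real instance.

```python
def has_dos_line_endings(data):
    """Return True if the byte content contains CRLF line endings."""
    return b"\r\n" in data


def to_unix_line_endings(data):
    """Convert CRLF line endings to LF and leave all other bytes unchanged."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical /etc/passwd content that was saved with CRLF line endings.
sample = (
    b"root:x:0:0:root:/root:/bin/bash\r\n"
    b"bin:x:1:1:bin:/bin:/sbin/nologin\r\n"
)
print(has_dos_line_endings(sample))
print(has_dos_line_endings(to_unix_line_endings(sample)))
```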

GuestOS.SystemUserFile.InvalidExtensionAttribute

The extended attributes of the system account file on the instance are incorrect. This may prevent some instance features from working as expected. For example, changes to the root account password in the ECS console may not take effect.

[
  {
    "CorrectAttribute": "e",
    "File": "/etc/passwd",
    "CurrentAttribute": "ie"
  }
]

Attributes in the returned result:

  • ${File}: the file path.

  • ${CurrentAttribute}: the current parameter value.

  • ${CorrectAttribute}: the correct parameter value.

Modify the file format as needed. For more information, see Critical files in non-Unix formats on a Linux instance.

GuestOS.FileSystems

Check the file system status.

GuestOS.Filesystems.UUIDConflicts

The instance contains file systems with duplicate UUIDs, which may cause the system to automatically mount unexpected file systems during boot. This can lead to boot failures or unexpected behavior.

Example of file systems with the identical UUID:

[
  {
    "CorrectAttribute": "e",
    "File": "/etc/passwd",
    "CurrentAttribute": "ie"
  }
]

Attributes in the returned result:

  • ${FirstDevice}: the conflicting device 1.

  • ${SecondDevice}: the conflicting device 2.

  • ${UUID}: the conflicting UUID.

Check the virtio driver version of the instance.

For information about how to modify the UUID of a file system, see Modify the UUID of a disk.
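Duplicate UUIDs can also be found by grouping devices by UUID. The following is a minimal sketch, assuming that you have captured the text output of the blkid command, in which each line has the form device: UUID="..." TYPE="...". The device names and UUIDs in the sample are hypothetical, and the function name find_uuid_conflicts is not part of any ECS API.

```python
import re
from collections import defaultdict

# Matches the UUID field only; the word boundary skips PARTUUID="...".
UUID_PATTERN = re.compile(r'\bUUID="([^"]+)"')


def find_uuid_conflicts(blkid_output):
    """Map each UUID that is shared by several devices to those devices."""
    devices_by_uuid = defaultdict(list)
    for line in blkid_output.splitlines():
        device, separator, attributes = line.partition(":")
        match = UUID_PATTERN.search(attributes)
        if separator and match:
            devices_by_uuid[match.group(1)].append(device.strip())
    return {
        uuid: devices
        for uuid, devices in devices_by_uuid.items()
        if len(devices) > 1
    }


# Hypothetical blkid output with one duplicated UUID.
SAMPLE_BLKID = """\
/dev/vda1: UUID="aaaa-1111" TYPE="ext4" PARTUUID="0001"
/dev/vdb1: UUID="aaaa-1111" TYPE="ext4"
/dev/vdc1: UUID="bbbb-2222" TYPE="xfs"
"""
print(find_uuid_conflicts(SAMPLE_BLKID))
```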

GuestOS.FstabFile.InvalidFormatExists

The /etc/fstab file on the instance contains format errors that may prevent the instance from starting.

Example:

[
  {
    "Line": 10,
    "File": "/dev/vdb1"
  }
]

Attributes in the returned result:

  • ${File}: the file path.

  • ${Line}: the number of the line with a format error.

Modify the /etc/fstab file as needed.

For information about how to modify the /etc/fstab file, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance?

Windows Firewall Status Check

A device configured in the /etc/fstab file of the instance does not exist, which may prevent the instance from starting.

[
  {
    "MountPoint": "/mnt",
    "Device": "UUID=48609326-10e3-40c2-93b3-3f0d9798d7a9"
  }
]

Attributes in the returned result:

  • ${Device}: the nonexistent device UUID.

  • ${MountPoint}: the device mount point.

Remove non-existent devices from /etc/fstab as needed.

For more information about how to modify the /etc/fstab file format, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance?

GuestOS.FstabFile.LossMountDevice

The instance has disks for which automatic mounting is disabled in the /etc/fstab file, which may prevent the instance from starting.

[
  {
    "Device": "z",
    "MountAttribute": "-rw"
  }
]

Attributes in the returned result:

  • ${Device}: the device for which automatic mounting is disabled.

  • ${MountAttribute}: the recommended mount attributes.

Modify the recommended mount attributes for the disk. For more information, see What do I do if system startup exceptions occur due to configuration errors in the /etc/fstab file on a Linux instance?

GuestOS.FileSystems.PartitionUnaligned

The disk of the instance has partitions that are not aligned to the recommended 2,048 sectors. When the disk is resized, the automatic partition extension operation in Linux may fail due to unaligned partitions, resulting in no increase in the available space of the file system.

[
  {
    "DeviceStart": 512,
    "Unit": "kB",
    "DeviceName": "/dev/vdb"
  }
]

Attributes in the returned result:

  • ${DeviceName}: the disk device name.

  • ${DeviceStart}: the starting position of the first partition on this device.

  • ${Unit}: the unit.

Fix the unaligned disk partition issue based on your business requirements. For more information, see How do I handle the failure to extend GPT partitions using growpart after resizing a cloud disk?
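The alignment rule above is a simple divisibility check. The following is a minimal sketch, assuming that the start offset of the partition is known in sectors. The function name is hypothetical, and the sketch does not convert the ${DeviceStart} and ${Unit} values of the returned result into sectors.

```python
# Recommended alignment from the item description, in sectors.
ALIGNMENT_SECTORS = 2048


def is_partition_aligned(start_sector, alignment=ALIGNMENT_SECTORS):
    """Return True if the partition starts on a multiple of the alignment."""
    return start_sector % alignment == 0


# Hypothetical start sectors: two aligned partitions and one legacy offset.
for start in (2048, 63, 4096):
    print(start, is_partition_aligned(start))
```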

GuestOS.FstabFile.IncorrectType

The file system configured for a device in the /etc/fstab file of the instance does not match the device's actual file system. This mismatch may prevent the instance from starting or cause the device to fail to mount.

{
  "ConfigFileSystem": "extext",
  "Device": "UUID=b9a7ad07-b910-4ba6-9582-e88bf440479c",
  "RealFileSystem": "ext4"
}

Attributes in the returned result:

  • ${Device}: the device.

  • ${RealFileSystem}: the actual file system.

  • ${ConfigFileSystem}: the configured file system.

Modify the file system configured in the /etc/fstab file to match the actual file system of the device. For more information, see How do I fix system startup exceptions caused by incorrect /etc/fstab file configurations in Linux instances?

GuestOS.Mountpoint.Multiple

The /etc/fstab file of the instance contains configuration records in which the same file system is mounted to multiple mount points, which may cause file system read or write conflicts.

[
  {
    "Device": {
      "filesystemFeatures": [ "has_journal", "ext_attr", "resize_inode", "dir_index", "filetype", "needs_recovery", "extent", "64bit", "flex_bg", "sparse_super", "large_file", "enormous_file", "uninit_bg", "dir_nlink", "extra_isize" ],
      "name": "/dev/vdb1",
      "type": "ext4",
      "uuid": "b055d7bb-2801-40d2-9ddb-1b6fd9b208bc"
    },
    "ConfigPath": "/etc/fstab",
    "Entries": [
      {
        "mountPoint": "/usr/local/attachment",
        "options": "defaults",
        "name": "/dev/vdb1",
        "passNumberOnParallelFsck": 0,
        "type": "ext4",
        "dumpFrequency": 0
      },
      {
        "mountPoint": "/home/sunmooc",
        "options": "defaults",
        "name": "/dev/vdb1",
        "passNumberOnParallelFsck": 0,
        "type": "ext4",
        "dumpFrequency": 0
      }
    ]
  }
]

Attributes in the returned result:

  • ${ConfigPath}: the file path.

  • ${Device.name}: the device name.

  • ${Entries.mountPoint}: the mount point.

Modify the /etc/fstab file to ensure a one-to-one correspondence between file systems and mount points. For more information, see How do I fix system startup exceptions caused by incorrect /etc/fstab file configurations in Linux instances?
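The one-to-one check between file systems and mount points can be sketched as a scan over the /etc/fstab content. The following example assumes that the file content is passed in as text and compares device fields literally, so the same disk referenced once by path and once by UUID is not detected. The function name find_multiple_mount_points is hypothetical.

```python
from collections import defaultdict

# Only block-device style entries are considered. Pseudo file systems such as
# tmpfs can legitimately appear at several mount points.
DEVICE_PREFIXES = ("/dev/", "UUID=", "LABEL=")


def find_multiple_mount_points(fstab_text):
    """Map each device that has more than one fstab entry to its mount points."""
    mount_points = defaultdict(list)
    for raw_line in fstab_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2 or not fields[0].startswith(DEVICE_PREFIXES):
            continue
        mount_points[fields[0]].append(fields[1])
    return {
        device: points
        for device, points in mount_points.items()
        if len(points) > 1
    }


# Sample content modeled on the returned result above.
SAMPLE_FSTAB = """\
# /etc/fstab
/dev/vda1  /                      ext4   defaults  1 1
/dev/vdb1  /usr/local/attachment  ext4   defaults  0 0
/dev/vdb1  /home/sunmooc          ext4   defaults  0 0
tmpfs      /tmp                   tmpfs  defaults  0 0
"""
print(find_multiple_mount_points(SAMPLE_FSTAB))
```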

GuestOS.NetworkStatus

Check the network configurations and status.

GuestOS.Network.InvalidNetmask

The IPv4 address or corresponding subnet mask of the instance is incorrectly configured, making the IP address configuration invalid and causing instance connection failures.

[
  {
    "IP": "10.0.0.1"
  }
]

Attribute in the returned result:

${IP}: the IP address without a configured subnet mask.

Modify the subnet mask as needed. For more information, see How do I configure a static IP address for a Linux ECS instance?
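Before you write an address and subnet mask to a NIC configuration file, you can validate the pair locally. The following is a minimal sketch that uses the Python standard ipaddress module. It illustrates a generic validity check and is an assumption about what counts as invalid, not a description of how the diagnostic itself works. The function name is hypothetical.

```python
import ipaddress


def is_valid_ipv4_config(address, netmask):
    """Return True if the IPv4 address and subnet mask form a valid pair."""
    try:
        # IPv4Interface raises a ValueError subclass for a malformed address
        # or a mask that is not a contiguous run of leading one bits.
        ipaddress.IPv4Interface(f"{address}/{netmask}")
    except ValueError:
        return False
    return True


print(is_valid_ipv4_config("10.0.0.1", "255.255.255.0"))
print(is_valid_ipv4_config("10.0.0.1", "255.0.255.0"))
```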

GuestOS.Network.InvalidDefaultRoute

No default route is configured for the instance, which may cause instance access failures.

{
  "Status": "unconfigured"
}

Attribute in the returned result:

${Status}: the default route configuration status.

Modify the NIC configuration or system routing configuration to add the necessary routing rules. For more information, see What do I do if the "Network is unreachable" error message appears when I access a public IP address from a Linux instance?

GuestOS.DHCPService.Disabled

The DHCP service process for the NICs on the instance is disabled. This may cause the instance's IP address to fail to renew after the lease expires, resulting in network interruption.

The DHCP configuration for the {InterfaceName} NIC is abnormal.

[
  {
    "Status": "enable",
    "InterfaceName": "eth0"
  }
]

Attributes in the returned result:

  • ${Status}: the NIC's DHCP status.

  • ${InterfaceName}: the NIC name.

Check the DHCP service-related configuration. For more information, see What do I do if network service exceptions occur on a Linux ECS instance?

GuestOS.Udev.MacAddressNotExist

The udev rules for dynamic device management in the kernel of the instance contain residual entries in which MAC addresses do not match the actual configuration of the NICs. This inconsistency may cause the instance's network to malfunction or result in unexpected network device naming.

{
  "MacAddress": "00:00:00:01:01:02",
  "DeviceName": "eth${fakeMaxInterfaceNumber}"
}

Attributes in the returned result:

  • ${MacAddress}: the MAC address configured in the udev rule.

  • ${DeviceName}: the network interface name.

Modify the udev rules to delete inconsistent configurations, such as the MAC addresses and network interface names. For more information, see How do I resolve network interface name drift in Linux instances with multiple network interfaces?

GuestOS.DHCPService.CustomPort

ECS instances running specific versions of CentOS or RHEL 7 include DHClient versions earlier than 4.2.5-60. These earlier versions contain bugs and may listen on ports other than 67, 68, 546, and 547. If other services or processes on the ECS instances also use these ports, conflicts may occur, causing other services or processes to fail to start or become unavailable.

[
  {
    "OccupiedPort": 31045,
    "DhclientVersion": "isc-dhclient-4.2.5"
  },
  {
    "OccupiedPort": 38964,
    "DhclientVersion": "isc-dhclient-4.2.5"
  }
]

Attributes in the returned result:

  • ${OccupiedPort}: the non-default port that the instance's DHClient service is using.

  • ${DhclientVersion}: the version of the instance's DHClient service.

Upgrade the DHClient service version at the earliest opportunity. For more information, see A port conflict occurs when you start a service or a process on a CentOS or RHEL 7 instance.

GuestOS.NetworkConfig.InvalidInterface

The network configuration file of the instance specifies a nonexistent network interface, which may cause the system network service to fail to start or run abnormally.

This issue occurs because a non-existent network interface is specified in the network configuration file. Possible causes include the following:

  • The specified number of ENIs is not configured when you create or configure the instance.

  • The corresponding configuration file is not deleted after a secondary ENI is detached from the instance.

  • The corresponding configuration file is not deleted when a custom image is created.

[
  {
    "ConfigFile": "/etc/sysconfig/network-scripts/ifcfg-eth101",
    "DeviceName": "eth101"
  }
]

Attributes in the returned result:

  • ${ConfigFile}: the ENI configuration file.

  • ${DeviceName}: the specified ENI.

Add the required ENIs or delete the configuration files of nonexistent ENIs.

GuestOS.Firewall

Check the system firewall status.

GuestOS.NetworkFirewall.Enabled

The firewall (iptables settings) of the instance is enabled. If the instance has enabled the firewall and set rules to block external access, connections to the instance may fail.

Modify the firewall configuration as needed. For more information, see Manage the system firewall on a Linux instance.

GuestOS.CloudInitService

Check the cloud-init status.

GuestOS.CloudinitService.BadDriverStatus

The cloud-init driver of the instance is abnormal, which may prevent system configurations from being correctly executed during the system initialization phase, resulting in instance access failures.

{
  "CloudinitEnabled": "enabled",
  "CloudInitSupport": "vpc",
  "GrowpartInstall": "installed",
  "CloudinitInstall": "installed"
}

Attributes in the returned result:

  • ${CloudinitInstall}: the cloud-init installation status.

  • ${CloudinitEnabled}: whether cloud-init is enabled.

  • ${GrowpartInstall}: the GrowPart installation status.

  • ${CloudInitSupport}: the network type supported by cloud-init.

Check and start cloud-init as needed. For more information, see Install cloud-init.

GuestOS.CloudinitService.StartFailed

The cloud-init service of the instance does not start as expected, which may cause system configuration failures and make the instance inaccessible.

Log on to the instance by using VNC, check cloud-init system logs, and restart the instance.

GuestOS.SSHServiceStatus

Check the SSH service status.

GuestOS.SSH.ForbiddenRootLogin

The SSH service of the instance prohibits root account logon, preventing the root account from accessing the instance over SSH.

{
 "File" : "/etc/ssh/sshd_config"
}

Attribute in the returned result:

${File}: the configuration file that prohibits root account logon.

Fix the root remote logon issue. For more information, see Resolve the "Permission denied, please try again" error for SSH connections to a Linux instance.

GuestOS.SSH.MissingCriticalFileOrDirectory

Critical files or directories for the SSH service of the instance are missing, which prevents access to the instance over SSH.

{
  "Files": [
    {
      "File": "/var/empty/*"
    }
  ]
}

Attribute in the returned result:

${File}: the missing critical file or directory.

Reconfigure SSH-related directories and files. For more information, see Check Linux instances for the required files or directories required by the SSH service.

GuestOS.SSH.IncorrectSSHFilePermission

The access permissions on files that the SSH service depends on are improperly configured, which prevents SSH access to the instance.

{
  "Files": [
    {
      "File": "/etc/ssh/ssh_host_ecdsa_key",
      "CurrentPermission": "0777"
    }
  ]
}

Attributes in the returned result:

  • ${Files}: the list of files with incorrect permission configuration.

  • ${File}: the path of the file with incorrect permission configuration.

  • ${CurrentPermission}: the current file permission configuration.

Reconfigure SSH-related directories and files as needed. For more information, see Check Linux instances for the required files or directories required by the SSH service.

GuestOS.SSH.ListeningPortMismatchWithConfig

The address and port that the sshd process is listening on do not match those in the configuration file. This mismatch may cause SSH connections to the expected address and port to fail.

The address and port that the sshd process is listening on are not defined in the /etc/ssh/sshd_config sshd configuration file.

[
  {
    "Address": "0.0.0.0",
    "Port": 2223
  }
]

Attributes in the returned result:

  • ${Address}: the listening address.

  • ${Port}: the listening port.

Change the listening address and port in the sshd configuration file based on your actual needs, then restart the sshd process to apply the changes.

For more information, see Failed to remotely connect to a Linux instance due to an SSH access exception.
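The comparison between the listening port and the configuration file can be sketched as follows. This is a simplified illustration: it reads only Port directives, ignores ports embedded in ListenAddress directives, and falls back to port 22 when no Port directive is present, which is the sshd default. The function names are hypothetical.

```python
DEFAULT_SSH_PORT = 22


def configured_ssh_ports(sshd_config_text):
    """Return the set of ports declared by Port directives in sshd_config."""
    ports = set()
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        # Keywords in sshd_config are case-insensitive.
        if len(parts) >= 2 and parts[0].lower() == "port" and parts[1].isdigit():
            ports.add(int(parts[1]))
    return ports or {DEFAULT_SSH_PORT}


def port_matches_config(listening_port, sshd_config_text):
    """Return True if the listening port is declared in the configuration."""
    return listening_port in configured_ssh_ports(sshd_config_text)


# Hypothetical configuration; 2223 is the listening port from the sample result.
SAMPLE_CONFIG = """\
#Port 2222
Port 22
PermitRootLogin yes
"""
print(port_matches_config(2223, SAMPLE_CONFIG))
```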

GuestOS.TimeSyncService

Check the time synchronization service status.

GuestOS.TimeSyncService.Disabled

The time synchronization service of the instance is not working properly or is incorrectly configured. This may cause the system time to deviate from the actual standard time, affecting the normal operation of some applications on the instance.

[
  {
    "Status": "disabled",
    "ServiceName": "chronyd"
  }
]

Attributes in the returned result:

  • ${ServiceName}: the service name.

  • ${Status}: the service status.

Modify the time synchronization service configuration as needed.

For more information, see Clock synchronization.

GuestOS.OSOOM

Check whether an OOM error occurs.

GuestOS.Memory.OOM

An OOM error occurs within the Guest OS of the instance.

Example of the timestamp and detailed logs for the most recent OOM incident:

[
  {
    "Message": "Mar 25 15:54:50 iZm5ej4ue05oijaudem8shZ user.err: Out of memory testing"
  }
]

Check whether the current memory size of the instance is sufficient to support the business running on it. If necessary, upgrade the configuration to increase the instance memory.

For more information about how to analyze the root cause of an OOM issue and resolve it, see How do I handle OOM errors on a Linux instance?

Diagnostic items and results of Windows operating system configurations

Metric ID

Metric description

Result item ID

Item description

Recommended operation

GuestOS.WinCPUUtil

Check whether CPU utilization is too high.

GuestOS.CPU.HighUtilization

The total CPU utilization of the instance exceeds 80%.

The top five processes with the highest CPU utilization are listed below. Check whether they run as expected.

{
  "ProcessCPUUsageTop5": [
    {
      "Pid": "1234",
      "CommandName": "/usr/bin/cpu_load.py",
      "AverageCPU": 80
    }
  ]
}

Attributes in the returned result:

  • ${ProcessCPUUsageTop5}: the top five processes with the highest CPU utilization in the system.

  • ${Pid}: the process ID.

  • ${CommandName}: the process name.

  • ${AverageCPU}: the average CPU utilization.

Check whether CPU processes are abnormal. If this is caused by normal business operations, we recommend upgrading the ECS configuration.

For more information about how to check high single-CPU utilization, see What do I do if a Windows instance has high CPU utilization?

GuestOS.WinCoreCPU.HighUtilization

The instance has one or more CPUs with utilization exceeding 85%.

Information about CPUs with utilization exceeding 85% is listed below. Check whether the processes run as expected.

{
  "CPUCoreUsage": [
    {
      "Processor": 1,
      "AverageCPU": 80
    }
  ]
}

Attributes in the returned result:

  • ${CPUCoreUsage}: the CPU cores with utilization exceeding 85%.

  • ${Processor}: the index of the CPU core.

  • ${AverageCPU}: the CPU core utilization.

Check whether the processes run as expected. For more information about how to check high single-CPU utilization, see What do I do if a Windows instance has high CPU utilization?

GuestOS.WinMemoryUtil

Check whether the memory usage is too high.

GuestOS.WinMemory.HighUtilization

The total memory usage of the instance exceeds 80%.

Example of the top five processes with the highest memory usage:

{
  "TopUtilizationProcesses": [
    {
      "TotalMemory": 134389760,
      "Pid": "4560",
      "CommandName": "powershell"
    }
  ],
  "AverageMemory": 87.0
}

Attributes in the returned result:

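  • ${TopUtilizationProcesses}: the processes with the highest memory usage in the system.

  • ${TotalMemory}: the memory used by the process.

  • ${Pid}: the process ID.

  • ${CommandName}: the process name.

  • ${AverageMemory}: the average memory usage of the instance.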

Disable unnecessary services or processes. For information about how to analyze high memory usage in Windows, see Memory analysis tools for Windows.

GuestOS.WinMemory.LicenseCorrupted

The corruption or misconfiguration of the instance's Windows license database leads to abnormally high hardware-reserved memory displayed in Windows Task Manager compared with the available memory, causing high memory usage of the instance.

{
  "MemoryForHardware": 19723407362
}

Attribute in the returned result:

${MemoryForHardware}: the hardware memory size, in bytes.

Restore the Windows license database and then restart the instance.

For information about how to fix a corrupted or improperly configured Windows license database, see What do I do if a Windows instance stutters due to excessive memory reserved for hardware?

GuestOS.WinSysDiskUtil

Check whether the system disk usage is too high.

GuestOS.WinFileSystem.InsufficientSpace

The remaining space on the instance's system disk (C:) is insufficient, which may cause slow system responses or instance start failures.

[
  {
    "FreeSize": 2860625,
    "FilesystemName": "C:"
  }
]

Attributes in the returned result:

  • ${FilesystemName}: the drive letter of the system disk.

  • ${FreeSize}: the remaining space, in bytes.

Resize the system disk or upgrade the instance type as needed.

For more information, see Overview.

GuestOS.WinSystemConfig

Check whether the critical system configurations are correct.

GuestOS.WinOSVersion.Low

The instance runs an earlier version of the guest operating system that is no longer maintained by Alibaba Cloud or Microsoft.

{
  "Version": "10.0.14393"
}

Attribute in the returned result:

${Version}: the operating system version.

Re-install the system and upgrade to a later version of Windows. For information about how to reinstall the system, see Replace the system disk (operating system) or Operating system migration.

GuestOS.VirtIOVersion.Low

The instance's operating system uses a previous version of the virtio driver, which prevents online disk resizing.

The {Device} device runs the virtio version {Version}, which does not support online disk resizing.

[
  {
    "Version": 58010,
    "Device": "Red Hat VirtIO Ethernet Adapter"
  }
]

Attributes in the returned result:

  • ${Device}: the driver name.

  • ${Version}: the version.

Upgrade the virtio version as needed.

For more information, see Update the virtio driver for a Windows instance.

GuestOS.WinCrashDump.Disabled

The crash dump feature is disabled for the instance. When the system experiences an abnormal restart or blue screen, it cannot save the error information for troubleshooting.

{
    "Status" : "disable"
}

Attribute in the returned result:

${Status}: the crash dump status.

Enable the crash dump feature as needed.

For information about how to enable the feature in Windows, see Enable or disable the kernel crash dump service for an instance.

GuestOS.KMSService.MismatchedKey

The instance uses KMS to activate the Windows operating system, but the activation key used by the KMS client does not match the Windows version, causing activation failures.

{
    "PartialProductKey" : "4M64B"
}

Attribute in the returned result:

${PartialProductKey}: the last five characters of the KMS Client Setup key.

Follow the Windows activation tutorials to select a key that matches your Windows version.

For information about how to activate Windows by using KMS, see Activate a genuine Windows Server system on an ECS instance using KMS domain names.

GuestOS.KMSService.Disconnected

The instance cannot connect to the KMS activation server, causing activation failure.

{
  "KMSServerStatus": "KmsServerStable"
}

Attribute in the returned result:

${KMSServerStatus}: the KMS server status.

Check whether the firewall configuration or third-party software on the instance blocks access to the KMS activation server, and modify the relevant configurations as needed.

For information about how to check the KMS activation server, see Resolve Windows activation failure on an ECS instance.

GuestOS.SPPSVCService.Unhealthy

The instance's Software Protection Platform Service (SPPSVC.exe) cannot start or run, which prevents Windows activation and access to activation settings.

{
  "SppsvcStatus": "Disabled"
}

Attribute in the returned result:

${SppsvcStatus}: the SPPSVC service status.

Follow the Windows activation tutorials to restart the SPPSVC.exe service and change its startup type to Automatic (Delayed Start) to ensure the service starts automatically next time.

GuestOS.SystemPatch.Incorrect

Incorrect system patches are installed on the instance, which may cause abnormal restarts or system crashes.

Example of an incorrect patch:

{
  "IncorrectHotfixName": "KB5009547"
}

Attribute in the returned result:

${IncorrectHotfixName}: the patch name.

Uninstall the incorrect patches during an appropriate period of time as needed.

For more information, see How do I uninstall system patches from a Windows ECS instance?

GuestOS.WinFiles.Missing

Some critical system files are missing from the instance's system directory (C:\Windows\), which may cause a black screen or abnormal operation after instance logon.

{
  "MissingFile": "C:\\Windows\\write.exe"
}

Attribute in the returned result:

${MissingFile}: the missing system file.

Restore the system file as needed. For more information, see What do I do if a black screen appears and I cannot access the desktop when I remotely log on to a Windows instance?

GuestOS.OperatingSystem.Unactivated

The instance's Windows operating system is not activated, which may cause unavailability of specific Windows personalization services.

Follow the Windows activation tutorials to activate the Windows operating system of the instance by using the correct KMS key. For more information, see Windows system ECS instance activation failed.

GuestOS.WinSystemInit

Check the system initialization status.

GuestOS.SysPrepService.Interrupted

The system preparation service (SysPrep) initialization process is interrupted during instance creation because the instance is restarted too early. Some critical configurations of the operating system are incomplete, which may cause instance start failures.

{
  "ImageState": "IMAGE_STATE_COMPLETE1"
}

Attribute in the returned result:

${ImageState}: the image status.

Due to incomplete system initialization, you must replace the system disk to reinstall the system or create another instance to replace this one.

For more information, see Replace the system disk (operating system) or Re-initialize a system disk (reset the operating system).

GuestOS.SysPrepService.InitFailed

The system initialization process is completed abnormally during instance creation, which may prevent the instance from working properly.

The following error message appears:

{
  "Events": "install_virtio_error "
}

Attribute in the returned result:

${Events}: the event.

Replace the system disk to reinstall the system or create another instance to replace this one.

For more information, see Replace the system disk (operating system) or Re-initialize a system disk (reset the operating system).

GuestOS.WinSystemUser

Check the administrator account.

GuestOS.WinAdministrator.NotExist

The Administrator account does not exist, which may cause services to be inaccessible.

{
  "Status": "disable"
}

Attribute in the returned result:

${Status}: the administrator status.

Enable the Administrator account as needed.

GuestOS.WinNetworkStatus

Check the network configuration and status.

GuestOS.WinNetworkInterfaceDriver.Disabled

The instance's NIC is unavailable, which may prevent connections to the instance.

The NIC is disabled.

[
  {
    "Status": "not OK",
    "Device": "Red Hat VirtIO Ethernet Adapter"
  }
]

Attributes in the returned result:

  • ${Device}: the NIC driver name.

  • ${Status}: the status.

Repair the NIC status as needed.

For information about how to check and repair the NIC status, see Step 7: Check the network.

GuestOS.WinRDPPort.Closed

The instance's system port is not open, or the firewall is enabled, preventing access to the instance over RDP.

[
  {
    "Status": "disable",
    "Port": 3387
  },
  {
    "Status": "disable",
    "Port": 3388
  }
]

Attributes in the returned result:

  • ${Port}: the port.

  • ${Status}: the status.

Change the open status of this port as needed.

For information about how to enable port 3389 to allow RDP connections, see How do I enable Remote Desktop Services on a Windows ECS instance?

GuestOS.WinDHCPService.Disabled

The DHCP configuration is disabled on the instance's NIC, which may cause services to be inaccessible.

[
  {
    "Status": "enable",
    "Device": "Red Hat VirtIO Ethernet Adapter"
  }
]

Attributes in the returned result:

  • ${Device}: the device name.

  • ${Status}: the device status.

Enable DHCP on the NIC as needed.

GuestOS.WinNetworkInterface.LackIPV4Address

The instance's NIC has no IPv4 address, which may cause services to be inaccessible.

[
  {
    "Name": "eth0"
  }
]

Attribute in the returned result:

${Name}: the NIC name.

Check whether DHCP is enabled on the instance or a static IP address is configured.

GuestOS.NetworkProxy.Enabled

Network proxies are configured for the instance, which may cause services to be inaccessible.

[
  {
    "Name": "ie"
  }
]

Attribute in the returned result:

${Name}: the proxy configured on the NIC.

Disable the network proxies as needed.

GuestOS.WinPort.Conflict

The instance's RDP port is used by another process, causing a port conflict that may prevent access to the instance over RDP.

{
  "ConflictPort": "3389",
  "ConflictServer": "svchost node"
}

Attributes in the returned result:

  • ${ConflictPort}: the service port.

  • ${ConflictServer}: the service that uses the port.

Log on to the instance by using VNC and modify the port for the Remote Desktop Service to work properly. For more information, see What do I do if port conflicts occur when I connect to a Windows ECS instance?

GuestOS.WinDiskStatus

Check the Windows disk status.

GuestOS.SystemDisk.Corrupted

The instance's system disk (C:) is abnormal, which may cause instance restart failures or driver installation issues.

{
 "Result": "Detection result or error message"
}

Attribute in the returned result:

${Result}: the system disk check result.

Recover the system disk during an appropriate period of time by using one of the following methods:

GuestOS.VirtIODriver.DiskIDConflicts

The instance has duplicate disk IDs due to an outdated virtio driver version, which may cause data loss on disks during disk reset operations.

Examples of disks with the same ID:

{
 "DiskUniqueIds": "List of unique disk IDs"
}

Attribute in the returned result:

${DiskUniqueIds}: the disk IDs.

Upgrade the virtio driver at the earliest opportunity.

For more information, see Update the virtio driver for a Windows instance.

GuestOS.WinFirewall

Check the Windows firewall status.

GuestOS.WinFirewall.Enabled

The firewall of the instance is enabled, which may cause services to be inaccessible.

[
  {
    "Status": "enabled",
    "Name": "Public"
  }
]

Attributes in the returned result:

  • ${Name}: the firewall name.

  • ${Status}: the status.

Modify the relevant firewall policy configurations as needed. For more information, see Configure firewall rules for a Windows ECS instance.

GuestOS.WinDriverStatus

Check the critical Windows driver status.

GuestOS.DiskFilterDriver.Vestigital

The instance has residual disk filter driver files, which may prevent the instance from recognizing newly attached disks.

{
 "UpperFilters": "Test"
}

Attributes in the returned result:

  • ${LowerFilters}: the name of the lower filter driver.

  • ${UpperFilters}: the name of the upper filter driver.

Clear invalid disk filter drivers as needed and restart the instance. For more information, see How do I check for residual disk driver entries in the registry of a Windows ECS instance?

GuestOS.VirtIODriver.Low

The instance's virtio driver version is {VirtioVersion}, which is outdated and may cause issues, such as blue screens, network packet loss, and disk data loss.

{
 "VirtioVersion": "virtio driver version",
 "RecommendedVersion":"Recommended version"
}

Upgrade the virtio driver version during an appropriate period of time.

For more information, see Update the virtio driver for a Windows instance.

Instance.Type.Xen

The instance type is outdated (based on the Xen architecture), which may cause operating system startup failures or device manager issues.

{
 "Status": "disable"
}

Attribute in the returned result:

${Status}: the status of residual Xen drivers. Residual drivers may cause operating system startup failures or device manager issues.

Upgrade to a new-generation instance type as needed.

For more information, see Upgrade the instance types of subscription instances or Change the instance type of a pay-as-you-go instance.

GuestOS.WinSystemProcess

Check the critical Windows system process status.

GuestOS.RDPService.Unavailable

The instance's RDP service is disabled or corrupted, preventing access to the instance over RDP.

Restart or reinstall the RDP service as needed.

For more information, see How do I enable Remote Desktop Services on a Windows ECS instance?

GuestOS.RDP.BlockedByFirewall

The instance's firewall blocks access to the RDP service, which may prevent connections to the instance over RDP.

[
  {
    "Rule": "v2.29|Action=Block|Active=TRUE|Dir=In|Protocol=6|Profile=Public|LPort=3389|Name=RDPPORTLatest-TCP-In|"
  }
]

Attribute in the returned result:

${Rule}: the firewall rule.

Disable the firewall, or add a firewall rule that allows RDP access (port 3389).

For information about how to allow RDP access in Windows, see Configure firewall rules for a Windows ECS instance.

GuestOS.WSUS.Disconnected

The instance cannot connect to Windows Server Update Services (WSUS), which may prevent the operating system from receiving product updates.

Reconfigure WSUS as needed.

GuestOS.Metaserver.Disconnected

The instance cannot connect to the metadata service (metaserver), or the connection times out, which may make the instance's metadata inaccessible.

Check whether the instance's firewall configuration blocks 100.100.100.200. If so, allow it in the firewall settings before accessing the metadata service.

For more information, see Instance metadata.

GuestOS.WinLicence.Expired

The instance's license for the Remote Desktop Service has expired, causing the RDP service to malfunction and preventing access to the instance over RDP.

Log on to the instance by using VNC, and purchase the Microsoft Remote Desktop Service license or uninstall the Remote Desktop Service as needed.

For information about how to fix Windows Remote Desktop license issues, see What do I do if I cannot connect to a Windows ECS instance by using RDP because no valid license is available for Remote Desktop Services?

GuestOS.WinThirdPartSoftware

Check the third-party software installation status.

GuestOS.Operation.InfluencedByAntivirusProcess

Third-party antivirus software is installed on the instance, which may cause management operation failures (such as password reset and remote connection) and instance exceptions.

Example of installed antivirus software:

{
 "AntivirusName": "QQPCRTP"
}

Attribute in the returned result:

${AntivirusName}: the antivirus software name.

Uninstall the corresponding software as needed.
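The ${Rule} value shown for GuestOS.RDP.BlockedByFirewall appears to be a pipe-delimited string of Key=Value fields. The following Python sketch splits such a string and checks whether the rule blocks inbound traffic on port 3389. It assumes the format of the single example above. The helper names parse_firewall_rule and blocks_inbound_rdp are hypothetical and are not part of any Alibaba Cloud SDK. Real rules may specify port ranges or lists in LPort, which this sketch does not handle.

```python
import json


def parse_firewall_rule(rule: str) -> dict:
    """Split a pipe-delimited rule string into a dict of its Key=Value fields."""
    fields = {}
    for part in rule.split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key] = value
        elif part:
            # A token without "=" (for example "v2.29") is stored as the version.
            fields["Version"] = part
    return fields


def blocks_inbound_rdp(fields: dict, port: str = "3389") -> bool:
    """Return True if the rule is an active inbound block on the given port."""
    return (
        fields.get("Action") == "Block"
        and fields.get("Active", "").upper() == "TRUE"
        and fields.get("Dir") == "In"
        and fields.get("LPort") == port
    )


# Sample copied from the GuestOS.RDP.BlockedByFirewall example result.
sample = (
    '[{"Rule": "v2.29|Action=Block|Active=TRUE|Dir=In|Protocol=6|'
    'Profile=Public|LPort=3389|Name=RDPPORTLatest-TCP-In|"}]'
)

for item in json.loads(sample):
    fields = parse_firewall_rule(item["Rule"])
    print(fields["Name"], blocks_inbound_rdp(fields))
```

A rule flagged this way is a candidate for removal or for replacement with an allow rule, as described in the recommended operation for this result item.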

Diagnostic items and results of user behavior tracking

Metric ID

Metric description

Result item ID

Result description

Recommended operation

Instance.UnexpectedSgCreationOrDeletion

Query operations related to creating and deleting security groups within a specified time period based on the Resource Access Management (RAM) role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it.

Instance.UnexpectedSgCreationOrDeletion.Log

Display operations related to creating and deleting security groups.

[
  {
    "accountId": "11111174379****",
    "requestId": "8EB3E59F-878C-5613-8EB3-FE59FDBA****",
    "eventSource": "ecs-unit-share.cn-hangzhou.aliyuncs.com",
    "eventTime": "2022-11-29 14:51:00",
    "eventName": "CreateSecurityGroup",
    "sourceIpAddress": "cloudmonitor.aliyuncs.com",
    "eventType": "ApiCall",
    "referencedResources": "[[i-bp17557glrxatoi4****]]",
    "userName": "AliyunServiceRoleForCloudMonitor:cloudmonitor"
  }
]

View more details using ActionTrail. For more information, see Query events in the ActionTrail console.

Instance.UnexpectedSgMember

Query operations related to instances' association with or disassociation from security groups within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it.

Instance.UnexpectedSgMember.Log

Display operations related to instances' association with or disassociation from security groups.

[
  {
    "accountId": "11111174379****",
    "requestId": "8EB3E59F-878C-5613-8EB3-FE59FDBA****",
    "eventSource": "ecs-unit-share.cn-hangzhou.aliyuncs.com",
    "eventTime": "2022-11-29 14:51:00",
    "eventName": "JoinSecurityGroup",
    "sourceIpAddress": "cloudmonitor.aliyuncs.com",
    "eventType": "ApiCall",
    "referencedResources": "[[i-bp17557glrxatoi4****]]",
    "userName": "AliyunServiceRoleForCloudMonitor:cloudmonitor"
  }
]

View more details using ActionTrail. For more information, see Query events by using the ActionTrail console.

Instance.UnexpectedFee

Query operations related to instance billing within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it.

Instance.UnexpectedFee.Log

Display operations related to instance billing.

[
  {
    "accountId": "11111174379****",
    "requestId": "8EB3E59F-878C-5613-8EB3-FE59FDBA****",
    "eventSource": "ecs-unit-share.cn-hangzhou.aliyuncs.com",
    "eventTime": "2022-11-29 14:51:00",
    "eventName": "RunInstances",
    "sourceIpAddress": "cloudmonitor.aliyuncs.com",
    "eventType": "ApiCall",
    "referencedResources": "[[i-bp17557glrxatoi4****]]",
    "userName": "AliyunServiceRoleForCloudMonitor:cloudmonitor"
  }
]

View more details using ActionTrail. For more information, see Query events in the ActionTrail console.

Instance.UnexpectedCreationOrRelease

Query operations related to creating and releasing instances within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it.

Instance.UnexpectedCreationOrRelease.Log

Display operations related to creating and releasing instances.

[
  {
    "accountId": "11111174379****",
    "requestId": "8EB3E59F-878C-5613-8EB3-FE59FDBA****",
    "eventSource": "ecs-unit-share.cn-hangzhou.aliyuncs.com",
    "eventTime": "2022-11-29 14:51:00",
    "eventName": "RunInstances",
    "sourceIpAddress": "cloudmonitor.aliyuncs.com",
    "eventType": "ApiCall",
    "referencedResources": "[[i-bp17557glrxatoi4****]]",
    "userName": "AliyunServiceRoleForCloudMonitor:cloudmonitor"
  }
]

View more details using ActionTrail. For more information, see Query events in the ActionTrail console.

Instance.UnexpectedRunningStatus

Query operations that affect the instance running status within a specified time period based on the RAM role. If the AliyunServiceRoleForECSSelfService role does not exist, the system automatically creates it.

Instance.UnexpectedRunningStatus.Log

Display operations that affect the instance running status.

[
  {
    "accountId": "11111174379****",
    "requestId": "8EB3E59F-878C-5613-8EB3-FE59FDBA****",
    "eventSource": "ecs-unit-share.cn-hangzhou.aliyuncs.com",
    "eventTime": "2022-11-29 14:51:00",
    "eventName": "RunInstances",
    "sourceIpAddress": "cloudmonitor.aliyuncs.com",
    "eventType": "ApiCall",
    "referencedResources": "[[i-bp17557glrxatoi4****]]",
    "userName": "AliyunServiceRoleForCloudMonitor:cloudmonitor"
  }
]

View more details using ActionTrail. For more information, see Query events in the ActionTrail console.
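The *.Log result items above each return a JSON array of event records. As a rough illustration of how such a result could be triaged before opening ActionTrail, the following Python sketch counts events by operation and caller within a time window. It assumes the field names (eventTime, eventName, userName) and the timestamp format shown in the examples above. The function name summarize_events is hypothetical, and the second sample record is invented for illustration only.

```python
import json
from collections import Counter
from datetime import datetime

# Matches timestamps such as "2022-11-29 14:51:00" in the example results.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def summarize_events(raw: str, since: str) -> Counter:
    """Count events at or after `since`, keyed by (eventName, userName)."""
    cutoff = datetime.strptime(since, TIME_FORMAT)
    counts = Counter()
    for event in json.loads(raw):
        if datetime.strptime(event["eventTime"], TIME_FORMAT) >= cutoff:
            counts[(event["eventName"], event["userName"])] += 1
    return counts


# The first record reuses values from the examples; the second is illustrative.
sample = json.dumps([
    {"eventTime": "2022-11-29 14:51:00", "eventName": "CreateSecurityGroup",
     "userName": "AliyunServiceRoleForCloudMonitor:cloudmonitor"},
    {"eventTime": "2022-11-28 09:00:00", "eventName": "JoinSecurityGroup",
     "userName": "AliyunServiceRoleForCloudMonitor:cloudmonitor"},
])

for (name, user), n in summarize_events(sample, "2022-11-29 00:00:00").items():
    print(f"{name} by {user}: {n}")
```

Grouping by caller makes it easier to tell operations performed by a service-linked role from operations performed by a RAM user, which is often the first question when an operation looks unexpected.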