The comprehensive instance diagnosis feature checks the system, network, and disk status of an instance. This helps you understand the health of your instance and promptly identify and resolve common issues.
Prerequisites
When you use the Instance Fee and Security Behavior Audit diagnosis feature, the system checks whether the current account has the AliyunServiceRoleForECSSelfService service-linked role. If the role does not exist, a prompt appears. After you confirm the prompt, the system automatically creates the AliyunServiceRoleForECSSelfService service-linked role.
The AliyunServiceRoleForECSSelfService role includes the AliyunServiceRolePolicyForECSSelfService system access policy. You cannot add, modify, or delete the permissions granted by this policy.
If you use a Resource Access Management (RAM) user to run Instance Fee and Security Behavior Audit diagnostics, contact the Alibaba Cloud account owner to grant the RAM user permission to create service-linked roles. For more information, see Create custom policies in edit mode and Grant permissions to a RAM user.
The following policy document grants a RAM user the permission to create the service-linked role that the self-service instance troubleshooting feature requires. Replace <account ID> with the UID of your Alibaba Cloud account.
{
  "Statement": [
    {
      "Action": [
        "ram:CreateServiceLinkedRole"
      ],
      "Resource": "acs:ram:*:<account ID>:role/*",
      "Effect": "Allow",
      "Condition": {
        "StringEquals": {
          "ram:ServiceName": [
            "selfservice.ecs.aliyuncs.com"
          ]
        }
      }
    }
  ],
  "Version": "1"
}
If you are running a comprehensive diagnosis or diagnosing a network anomaly, ensure that the instance meets the following conditions:
Instance type: The instance belongs to an instance family that is available for purchase. For more information, see Instance families.
Note: Discontinued instance families do not support the instance health diagnosis feature.
Instance status: The instance is in the Running state.
Operating system: If the selected scenario involves checking configurations within the instance's operating system, ensure that the operating system meets the conditions in the following table.
System architecture: x86 64-bit
Operating system version:
Windows Server 2008 and later
Alibaba Cloud Linux 2/3
AlmaLinux 8.x and later
Anolis OS 7.x/8.x
CentOS 7.x/8.x
CentOS Stream 8 and later
Debian 8.x and later
Fedora 33/34
OpenSUSE 15.x/42.x
Rocky Linux 8.x and later
SUSE Linux Enterprise Server 12.x/15.x
Ubuntu 16.04/18.04/20.04/24.04
Configuration within the operating system:
Python version: Python 3.6 or later
The Cloud Assistant Agent is installed. For more information, see Install the Cloud Assistant Agent.
Note: Operating system distributions that are not listed above are not supported. The diagnostic performance on unsupported distributions is not guaranteed.
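The in-guest requirements above can be pre-checked from inside the instance. The following is a minimal sketch, not an official tool: the function name check_guest_prerequisites and the set of architecture spellings are assumptions made for illustration, and the check covers only the Python version and the CPU architecture. It does not verify the operating system distribution or whether the Cloud Assistant Agent is installed.

```python
import platform
import sys

# Minimum Python version taken from the requirements listed above.
MIN_PYTHON = (3, 6)
# Assumption: common spellings that platform.machine() reports for x86 64-bit.
X86_64_NAMES = {"x86_64", "amd64"}


def check_guest_prerequisites(python_version=None, machine=None):
    """Return a list of problems; an empty list means both checks passed.

    Both arguments are optional overrides that make the function easy to
    test; by default the running interpreter and host are inspected.
    """
    version = tuple(python_version or sys.version_info[:2])
    arch = (machine or platform.machine()).lower()

    problems = []
    if version < MIN_PYTHON:
        problems.append(
            "Python %d.%d found; 3.6 or later is required" % version[:2]
        )
    if arch not in X86_64_NAMES:
        problems.append("architecture %r is not x86 64-bit" % arch)
    return problems


if __name__ == "__main__":
    for problem in check_guest_prerequisites():
        print(problem)
```

Running the script on the instance prints one line per unmet requirement and prints nothing when both checks pass.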
If the scenario is Instance fails to start, ensure that the instance meets the following conditions:
Instance status: The instance is in the Stopped state.
Operating system: The selected scenario involves checking configurations within the instance's operating system. Ensure that the operating system meets the conditions in the following table.
System architecture: x86 64-bit
Operating system version:
Windows Server 2008 and later
Alibaba Cloud Linux 2/3
AlmaLinux 8.x and later
Anolis OS 7.x/8.x
CentOS 7.x/8.x
CentOS Stream 8 and later
Debian 8.x and later
Fedora 33/34
OpenSUSE 15.x/42.x
Rocky Linux 8.x and later
SUSE Linux Enterprise Server 12.x/15.x
Ubuntu 16.04/18.04/20.04/24.04
Note: Operating system distributions that are not listed above are not supported. The diagnostic performance on unsupported distributions is not guaranteed.
Scenarios
Use the comprehensive instance diagnosis feature in the following scenarios to understand the health of your instance:
Troubleshoot issues: Run targeted diagnostics to find solutions for problems you encounter, such as a failed network connection.
Perform regular checks: Understand the overall health of your instance during routine operations and maintenance (O&M). This helps you promptly detect and handle issues to prevent business disruptions.
The instance health diagnosis feature provides problem descriptions and recommended solutions for each diagnostic item. For more information, see Diagnostic items and results.
Procedure
ECS console
Create an instance diagnosis
Log on to the ECS console.
In the navigation pane on the left, choose .
In the upper-left corner of the top menu bar, select a region.
Select a time and an instance ID, and then click Start.
Note: Only one diagnostic task can be in progress for an instance at a time. The interval between two consecutive diagnoses must be more than 5 minutes.
Problem types and their descriptions:
Instance Performance Issues: Diagnose issues such as high CPU load, high memory usage, high bandwidth usage, high disk BPS or IOPS, or degraded performance on an ECS instance.
Instance Connection Errors or Startup Exceptions: Diagnose issues such as failed remote connections over Secure Shell (SSH) or VNC, an instance that is down, or an operating system that fails to start.
Network Issues: Diagnose issues such as degraded network performance or ping failures on an ECS instance.
Ineffective Instance Operation: Diagnose issues where an operation on an ECS instance did not take effect, such as a disk expansion that was not applied.
Insufficient Resource Quota: Diagnose issues that occur because an ECS resource quota is reached. Examples include an insufficient disk capacity quota, an insufficient image quota, or reaching the maximum number of Elastic Network Interfaces (ENIs) or security groups.
Check for Security Risks: Diagnose security risks on an ECS instance, such as system vulnerabilities, security alerts, or malicious processes.
Instance Billing and Security Audit: Audit and trace operations related to ECS instance status, instance fees, and security groups.
Note: To use the instance fee and security behavior audit feature, you must have the service-linked role and permissions for self-service instance troubleshooting. For more information, see Service-linked role AliyunServiceRoleForECSSelfService.
Instance Device Check: Check whether devices such as GPUs on an instance are running properly.
Others: Enter the issue details, the instance ID, and the time range to troubleshoot.
The actual diagnostic items may vary. In the diagnostic report, click the tabs under Diagnostic Item Details to view the items and their progress. The diagnosis takes a few minutes. You can view the progress on the current page or close the dialog box and check the diagnostic task list for the progress and the report.
View the diagnostic report.
The diagnostic report contains the following information:
Basic Information: Includes the diagnosis time range, resource ID, report ID, and diagnosis time.
Diagnosis Result: If all checks are normal, the result is No exceptions are detected on the instance. If any abnormal items are found, the specific items are displayed with recommended solutions. You can follow the recommendations to resolve the issues.
Diagnostic Item Details: Includes the results for each diagnostic item, with severity levels of Critical, Warning, and Passed.
Note: When you use the instance fee and security behavior audit feature, you can also obtain more information in the following ways:
To query more audit information, go to the ActionTrail console.
To query billing information, go to Billing Details.
You can use the diagnostic report to resolve issues.
For common issues, you can find solutions in the documentation. For more information, see Common issues and solutions for the guest OS of an ECS instance.
For instance startup failures, you can log on to the ECS instance and use the attached repair disk to fix the issue.
View diagnostic history
To review the historical health status of an instance, you can view its diagnostic history.
Log on to the ECS console.
View the instance's diagnostic history.
In the navigation pane on the left, choose .
In the top navigation bar, select a region.
On the Instance Troubleshooting tab, click View History.
On the Check History page, click the Instance Health Diagnosis tab, enter a resource ID or report ID, and then click the icon.
Note: In the diagnostic history report list, you can click the icon to the left of Actions and select a status to filter the list. For a single diagnostic history entry, you can click View Report to view the detailed report, or click Re-diagnose to start a new diagnosis.
OpenAPI
You can query diagnostic metrics.
Call DescribeDiagnosticMetrics to query diagnostic metrics. For a list of available diagnostic metrics, see Diagnostic items and results.
You can manage diagnostic metric collections.
There are two types of diagnostic metric collections. You can use them to create diagnostic reports.
Public diagnostic metric collections: Public diagnostic metric collections are based on common user issues and help simplify the diagnosis process.
Public diagnostic metric collections are maintained by Alibaba Cloud. You cannot modify them. You can call DescribeDiagnosticMetricSets to query public diagnostic metric collections. The currently supported public diagnostic metric collections are as follows.
Metric name: dms-instancedefault
Description: Default diagnostic collection
Scenario: Used for a comprehensive check of an ECS instance.
Custom diagnostic metric collections: If you want to check only specific diagnostic metrics, you can call CreateDiagnosticMetricSet to create a custom diagnostic metric collection. After the collection is created, you can call DescribeDiagnosticMetricSets to query it.
The following sample response indicates that a custom diagnostic metric collection named test has been created.
{
  "RequestId": "6AF68D67-601A-5278-AB10-4195CCA7****",
  "MetricSets": [
    {
      "Type": "User",
      "MetricIds": [
        "Instance.ControllerError",
        "Instance.CPUException",
        "Instance.CPUSplitLock"
      ],
      "MetricSetId": "dms-uf6ck3iljpbft15i****",
      "ResourceType": "instance",
      "MetricSetName": "test"
    }
  ]
}
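To use a custom collection in a later call, you typically need its MetricSetId. The following minimal sketch looks up a collection by name in a DescribeDiagnosticMetricSets response that has already been decoded from JSON. The helper name find_metric_sets is hypothetical, and the assumption that custom collections carry "Type": "User" is taken only from the sample response above. The sketch parses the sample text and makes no API call.

```python
import json

# The sample response shown above, embedded as text so the example runs offline.
SAMPLE_RESPONSE = """
{
  "RequestId": "6AF68D67-601A-5278-AB10-4195CCA7****",
  "MetricSets": [
    {
      "Type": "User",
      "MetricIds": [
        "Instance.ControllerError",
        "Instance.CPUException",
        "Instance.CPUSplitLock"
      ],
      "MetricSetId": "dms-uf6ck3iljpbft15i****",
      "ResourceType": "instance",
      "MetricSetName": "test"
    }
  ]
}
"""


def find_metric_sets(response, name, set_type="User"):
    """Return every metric set whose name and type match.

    Assumption: custom collections are marked with Type "User",
    as in the sample response.
    """
    return [
        metric_set
        for metric_set in response.get("MetricSets", [])
        if metric_set.get("MetricSetName") == name
        and metric_set.get("Type") == set_type
    ]


response = json.loads(SAMPLE_RESPONSE)
matches = find_metric_sets(response, "test")
for metric_set in matches:
    print(metric_set["MetricSetId"], metric_set["MetricIds"])
```

The function returns a list rather than a single item because the source does not state whether collection names are unique.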
You can create a diagnostic report.
You can call CreateDiagnosticReport to create a diagnostic report using a custom or public diagnostic metric collection.
The following sample response indicates that the diagnostic report was successfully created.
{
  "RequestId": "A1283ACE-2F19-54B9-9464-401EBD1A****",
  "ReportId": "dr-uf6aacg5g2fjp64i****"
}
You can query a diagnostic report.
You can call DescribeDiagnosticReports to query the details of a diagnostic report. The response returns the diagnosis result for each diagnostic metric in the collection. For more information about the results of diagnostic items, see Diagnostic items and results.
The following sample response indicates that the diagnosis is normal and no issues were found.
{
  "RequestId": "20381C19-C31B-52AE-AC9B-8AD672E4****",
  "NextToken": "",
  "Reports": [
    {
      "Status": "Finished",
      "EndTime": "2022-09-07T15:36Z",
      "ResourceId": "i-uf653eye7pkftni****",
      "MetricSetId": "dms-uf6ck3iljpbft15i****",
      "Issues": [],
      "StartTime": "2022-09-05T15:36Z",
      "CreationTime": "2022-09-07T15:36Z",
      "ReportId": "dr-uf6aacg5g2fjp64i****",
      "ResourceType": "instance",
      "Severity": "Normal",
      "FinishedTime": "2022-09-07T15:36Z"
    }
  ]
}
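Because a diagnosis takes a few minutes, a script usually has to query the report more than once before the result is available. The following minimal sketch summarizes a decoded DescribeDiagnosticReports response and polls until a report reaches the Finished status seen in the sample above. The names summarize_reports and wait_until_finished are hypothetical, fetch is a caller-supplied function that performs the actual API call, and the field names are taken only from the sample response. Other status values are not listed in this topic, so the sketch treats anything other than Finished as "not done yet" and gives up after a fixed number of attempts.

```python
import time

# Trimmed copy of the sample response above, already decoded from JSON.
SAMPLE_RESPONSE = {
    "Reports": [
        {
            "Status": "Finished",
            "ResourceId": "i-uf653eye7pkftni****",
            "Issues": [],
            "ReportId": "dr-uf6aacg5g2fjp64i****",
            "Severity": "Normal",
        }
    ]
}


def summarize_reports(response):
    """Return one (report_id, status, severity, issue_count) tuple per report."""
    return [
        (
            report.get("ReportId"),
            report.get("Status"),
            report.get("Severity"),
            len(report.get("Issues") or []),
        )
        for report in response.get("Reports", [])
    ]


def wait_until_finished(fetch, report_id, interval_s=10.0, max_attempts=30,
                        sleep=time.sleep):
    """Call fetch() repeatedly until the given report has Status "Finished".

    fetch is a hypothetical zero-argument callable that returns a decoded
    DescribeDiagnosticReports response. Raises TimeoutError if the report
    is not finished after max_attempts calls.
    """
    for attempt in range(max_attempts):
        for report in fetch().get("Reports", []):
            if (report.get("ReportId") == report_id
                    and report.get("Status") == "Finished"):
                return report
        if attempt + 1 < max_attempts:
            sleep(interval_s)
    raise TimeoutError(
        "report %s not finished after %d attempts" % (report_id, max_attempts)
    )


print(summarize_reports(SAMPLE_RESPONSE))
```

Passing fetch and sleep as parameters keeps the polling logic independent of any particular SDK and lets you test it without network access.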
References
DescribeDiagnosticMetrics - Query a list of diagnostic metrics.
DescribeDiagnosticReportAttributes - Query the details of a resource diagnostic report.
DeleteDiagnosticReports - Delete resource diagnostic reports.
ModifyDiagnosticMetricSet - Modify a resource diagnostic metric collection.
DeleteDiagnosticMetricSets - Delete resource diagnostic metric collections.