This topic provides answers to frequently asked questions in the following categories:
Log collection
Application monitoring
Alibaba Cloud Prometheus monitoring
Open-source Prometheus monitoring (ack-prometheus-operator component)
Alert management
Other issues
Log collection
How do I troubleshoot container log collection exceptions?
Issue
Container log collection exceptions occur, preventing new log content from being reported.
Solution
Check whether the machine group heartbeat is abnormal.
You can check the heartbeat status of a machine group to determine whether Logtail is installed as expected.
Check the heartbeat status of the machine group.
Log on to the Simple Log Service console.
In the Projects section, click the one you want to manage.

In the left-side navigation pane, choose .
In the Machine Groups list, click the machine group whose heartbeat status you want to check.
On the Machine Group Configurations page, check the machine group status and record the number of nodes whose heartbeat status is OK.
Count the number of worker nodes in the cluster to which your container belongs.
Run the following command to view the number of worker nodes in the cluster:
kubectl get node | grep -v master
The system returns information that is similar to the following code:
NAME                                 STATUS    ROLES     AGE       VERSION
cn-hangzhou.i-bp17enxc2us3624wexh2   Ready     <none>    238d      v1.10.4
cn-hangzhou.i-bp1ad2b02jtqd1shi2ut   Ready     <none>    220d      v1.10.4
Check whether the number of nodes whose heartbeat status is OK is equal to the number of worker nodes in the cluster. Then, select a troubleshooting method based on the check result.
The heartbeat status of all nodes in the machine group is Failed.
If you collect logs from standard Docker containers, check whether the values of the ${your_region_name}, ${your_aliyun_user_id}, and ${your_machine_group_user_defined_id} parameters are valid. For more information, see Collect logs from standard Docker containers.
If you use a Container Service for Kubernetes (ACK) cluster, submit a ticket. For more information, see Install Logtail.
If you use a self-managed Kubernetes cluster, check whether the values of the {your-project-suffix}, {regionId}, {aliuid}, {access-key-id}, and {access-key-secret} parameters are valid. For more information, see Collect text logs from Kubernetes containers in Sidecar mode.
If the values are invalid, run the helm del --purge alibaba-log-controller command to delete the installation package and then reinstall the package.
The number of nodes whose heartbeat status is OK in the machine group is less than the number of worker nodes in the cluster.
Check whether a YAML file is used to deploy the required DaemonSet.
Run the following command. If a response is returned, the DaemonSet is deployed by using the YAML file.
kubectl get po -n kube-system -l k8s-app=logtail
Download the latest version of the Logtail DaemonSet template.
Configure the ${your_region_name}, ${your_aliyun_user_id}, and ${your_machine_group_name} parameters based on your business requirements.
Run the following command to update the YAML file:
kubectl apply -f ./logtail-daemonset.yaml
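After the update is applied, you can verify that a Logtail pod runs on each worker node. A minimal check, using the logtail-ds DaemonSet name and the k8s-app=logtail pod label that appear elsewhere in this topic:
kubectl get ds logtail-ds -n kube-system
kubectl get pods -n kube-system -l k8s-app=logtail -o wide
Compare the READY count of the DaemonSet with the number of worker nodes that you counted earlier.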
In other cases, submit a ticket.
Check whether container log collection is abnormal.
If no logs exist in the Consumption Preview section or on the query and analysis page of the related Logstore when you query data in the Simple Log Service console, Simple Log Service does not collect logs from your container. In this case, check the status of your container and perform the following operations.
Important: Take note of the following items when you collect logs from container files:
Logtail collects only incremental logs. If a log file on your server is not updated after a Logtail configuration is delivered and applied to the server, Logtail does not collect logs from the file. For more information, see Read log files.
Logtail collects logs only from files in the default storage of containers or in the file systems that are mounted on containers. Other storage methods are not supported.
After logs are collected to a Logstore, you must create indexes. Then, you can query and analyze the logs in the Logstore. For more information, see Create indexes.
Check whether the heartbeat status of your machine group is normal. For more information, see Troubleshoot an error that occurs due to the abnormal heartbeat status of a machine group.
Check whether the Logtail configuration is valid.
Check whether the settings of the following parameters in the Logtail configuration meet your business requirements: IncludeLabel, ExcludeLabel, IncludeEnv, and ExcludeEnv.
Note: Container labels are retrieved by running the docker inspect command. Container labels are different from Kubernetes labels.
To check whether logs can be collected as expected, you can temporarily remove the settings of the IncludeLabel, ExcludeLabel, IncludeEnv, and ExcludeEnv parameters from the Logtail configuration. If logs can be collected, the preceding parameters are invalid.
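To confirm the actual labels and environment variables of a running container before you adjust these parameters, you can inspect the container on its node. A minimal sketch; the container ID is a placeholder:
docker inspect --format '{{json .Config.Labels}}' <container-id>
docker inspect --format '{{json .Config.Env}}' <container-id>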
Other O&M operations.
View Logtail logs
The logs of Logtail are stored in the ilogtail.LOG and logtail_plugin.LOG files in the /usr/local/ilogtail/ directory of a Logtail container.
Log on to the Logtail container. For more information, see Log on to a Logtail container.
Go to the /usr/local/ilogtail/ directory.
cd /usr/local/ilogtail
View the ilogtail.LOG and logtail_plugin.LOG files.
cat ilogtail.LOG
cat logtail_plugin.LOG
Description of the standard output (stdout) of the Logtail container
The stdout of the container is not useful for troubleshooting. Ignore the following stdout:
start umount useless mount points, /shm$|/merged$|/mqueue$
umount: /logtail_host/var/lib/docker/overlay2/3fd0043af174cb0273c3c7869500fbe2bdb95d13b1e110172ef57fe840c82155/merged: must be superuser to unmount
umount: /logtail_host/var/lib/docker/overlay2/d5b10aa19399992755de1f85d25009528daa749c1bf8c16edff44beab6e69718/merged: must be superuser to unmount
umount: /logtail_host/var/lib/docker/overlay2/5c3125daddacedec29df72ad0c52fac800cd56c6e880dc4e8a640b1e16c22dbe/merged: must be superuser to unmount
......
xargs: umount: exited with status 255; aborting
umount done
start logtail
ilogtail is running
logtail status: ilogtail is running
View the status of Simple Log Service-related components in a Kubernetes cluster
Run the following command to view the status and information of the alibaba-log-controller Deployment:
kubectl get deploy alibaba-log-controller -n kube-system
Result:
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
alibaba-log-controller   1/1     1            1           11d
Run the following command to view the status and information of the logtail-ds DaemonSet:
kubectl get ds logtail-ds -n kube-system
Result:
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
logtail-ds   2         2         2       2            2           **ux            11d
View the version number, IP address, and startup time of Logtail
The related information is stored in the /usr/local/ilogtail/app_info.json file of your Logtail container.
kubectl exec logtail-ds-****k -n kube-system cat /usr/local/ilogtail/app_info.json
The system returns information that is similar to the following code:
{
  "UUID" : "",
  "hostname" : "logtail-****k",
  "instance_id" : "0EB****_172.20.4.2_1517810940",
  "ip" : "172.20.4.2",
  "logtail_version" : "0.16.2",
  "os" : "Linux; 3.10.0-693.2.2.el7.x86_64; #1 SMP Tue Sep 12 22:26:13 UTC 2017; x86_64",
  "update_time" : "2018-02-05 06:09:01"
}
Why can't I delete a project?
Issue
How do I delete a project, or what should I do if I receive an "insufficient permissions" error message when I try to delete a project?
Solution
For more information about how to delete a project or Logstore, see Manage a project and Manage a Logstore. If a project fails to be deleted, see What do I do if the "Operation denied, insufficient permissions" error message is returned when I delete a project?
Common error types in Simple Log Service data collection
If you encounter other errors that are not described in this topic, submit a ticket.
Error | Description | Solution |
LOG_GROUP_WAIT_TOO_LONG_ALARM | After a data packet is generated, the system waits for a long period of time to send the packet. | Check whether the system sends packets as expected, the data volume exceeds the default limit, the quota is insufficient, or network errors occur. Note If you need to collect logs from a large number of log files and the log files occupy a large amount of memory, you can modify the startup parameters of Logtail. |
LOGFILE_PERMINSSION_ALARM | Logtail does not have the permissions to read the specified file. | Check whether the startup account of Logtail on the server is root. We recommend that you use the root account. |
SPLIT_LOG_FAIL_ALARM | Logtail failed to split logs into lines because the regular expression that is specified to match the beginning of the first line of a log did not match the content of the logs. | Check whether the regular expression is correct. If you want to collect single-line logs, you can specify |
MULTI_CONFIG_MATCH_ALARM | By default, you can use only one Logtail configuration to collect logs from a log file. If you use multiple Logtail configurations to collect logs from a log file, only one Logtail configuration takes effect. Note: You can use multiple Logtail configurations to collect the standard output (stdout) or standard errors (stderr) of Docker containers. | |
REGEX_MATCH_ALARM | In full regex mode, the log content does not match the specified regular expression. | Copy the sample log in the error details to generate a new regular expression. |
PARSE_LOG_FAIL_ALARM | In modes such as JSON and delimiter, Logtail failed to parse logs because the log format did not conform to the defined format. | Click the error link to view details. |
CATEGORY_CONFIG_ALARM | The Logtail configuration is invalid. | In most cases, this issue occurs if the specified regular expression fails to extract a part of the file path as a topic. If this issue occurs due to other causes, submit a ticket. |
LOGTAIL_CRASH_ALARM | Logtail stops responding because the resource usage of the server on which Logtail runs exceeds the upper limit. | Increase the upper limits of CPU utilization and memory usage. For more information, see Configure the startup parameters of Logtail. |
REGISTER_INOTIFY_FAIL_ALARM | Logtail failed to register the log listener in Linux. This error may occur because Logtail does not have the permissions to access the folder on which Logtail listens or the folder has been deleted. | Check whether Logtail has the permissions to access the folder or the folder is deleted. |
DISCARD_DATA_ALARM | The CPU resources that are configured for Logtail are insufficient, or throttling is triggered when network data is sent. | Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail. |
SEND_DATA_FAIL_ALARM | | |
REGISTER_INOTIFY_FAIL_ALARM | Logtail failed to register the inotify watcher for the log directory. | Check whether the log directory exists and check the permission settings of the directory. |
SEND_QUOTA_EXCEED_ALARM | The log write traffic exceeds the limit. | Increase the number of shards in the Simple Log Service console. For more information, see Split a shard. |
READ_LOG_DELAY_ALARM | Log collection lags behind log generation. In most cases, this error occurs because the CPU resources that are configured for Logtail are insufficient or throttling is triggered when network data is sent. | Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail. When you import historical data, a large amount of data is collected in a short period of time. You can ignore this error. |
DROP_LOG_ALARM | Log collection lags behind log generation, and the number of log files that are generated during rotation and are not parsed exceeds 20. In most cases, this error occurs because the CPU resources that are configured for Logtail are insufficient or throttling is triggered when network data is sent. | Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail. |
LOGDIR_PERMINSSION_ALARM | Logtail does not have the permissions to read the log directory. | Check whether the log directory exists. If the log directory exists, check the permission settings of the directory. |
ENCODING_CONVERT_ALARM | Encoding failed. | Check whether the configuration for log encoding is consistent with the actual implementation of log encoding. |
OUTDATED_LOG_ALARM | The logs are expired. The log time lags behind the collection time for more than 12 hours. Possible causes: | |
STAT_LIMIT_ALARM | The number of files in the log directory that is specified in the Logtail configuration exceeds the limit. | Check whether the log directory contains a large number of files and subdirectories. Reconfigure the log directory for monitoring and the maximum number of levels of subdirectories that you want to monitor. You can also modify the mem_usage_limit parameter. For more information, see Configure the startup parameters of Logtail. |
DROP_DATA_ALARM | When the Logtail process exits, logs are dumped to the local disk. However, the dump operation times out. As a result, logs that are not dumped to the local disk are discarded. | In most cases, this error occurs because collection is severely blocked. Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail. |
INPUT_COLLECT_ALARM | An error occurred when data was collected from the input data source. | Fix the error based on the error details. |
HTTP_LOAD_ADDRESS_ALARM | The specified Addresses parameter in the Logtail configuration that is used to collect HTTP data is invalid. | Specify a valid value for the Addresses parameter. |
HTTP_COLLECT_ALARM | An error occurred when HTTP data was collected. | Fix the error based on the error details. In most cases, this error is caused by timeout. |
FILTER_INIT_ALARM | An error occurred when the filter was initialized. | In most cases, this error is caused by the invalid regular expressions of the filter. Fix the error based on the error details. |
INPUT_CANAL_ALARM | An error occurred in the plug-in that is used to collect MySQL binary logs. | Fix the error based on the error details. When a Logtail configuration is updated, the canal service may restart. If the error is caused by the service restart, you can ignore the error. |
CANAL_INVALID_ALARM | The plug-in that is used to collect MySQL binary logs is abnormal. | In most cases, this error is caused by inconsistent metadata. Metadata inconsistency may occur due to table scheme changes during running. Check whether table schemas are changed in the period during which the error is repeatedly reported. In other cases, submit a ticket. |
MYSQL_INIT_ALARM | An error occurred when MySQL was initialized. | Fix the error based on the error details. |
MYSQL_CHECKPOING_ALARM | The format of the checkpoints that are used to collect MySQL data is invalid. | Check whether to modify the checkpoint-related settings in the Logtail configuration. In other cases, submit a ticket. |
MYSQL_TIMEOUT_ALARM | The MySQL query times out. | Check whether the MySQL server is properly connected to the network. |
MYSQL_PARSE_ALARM | The MySQL query results failed to be parsed. | Check whether the format of the checkpoints that are used to collect MySQL data matches the format of the required fields. |
AGGREGATOR_ADD_ALARM | The system failed to add data to the queue. | Data is sent too fast. If large amounts of data need to be sent, you can ignore this error. |
ANCHOR_FIND_ALARM | An error occurred in the processor_anchor plug-in, an error occurred in the Logtail configuration, or logs that do not match the Logtail configuration exist. | Click the error link to view the sub-type of the error. The following sub-types are available. You can check the settings based on the error details of each sub-type. |
ANCHOR_JSON_ALARM | An error occurred in the processor_anchor plug-in. The plug-in failed to expand the JSON data that is extracted based on the values of the Start and Stop parameters. | Click the error link to view details. View collected logs and related configurations to check whether configuration errors or invalid logs exist. |
CANAL_RUNTIME_ALARM | An error occurred in the plug-in that is used to collect MySQL binary logs. | Click the error link to view details and perform troubleshooting based on the details. In most cases, this error is related to the primary ApsaraDB RDS for MySQL instance that is connected. |
CHECKPOINT_INVALID_ALARM | The system failed to parse checkpoints. | Click the error link to view details and perform troubleshooting based on the details and the key-value pairs of the checkpoints in the details. The values of the checkpoints are indicated by the first 1,024 bytes in the checkpoint file. |
DIR_EXCEED_LIMIT_ALARM | The number of directories on which Logtail listens at the same time exceeds the limit. | Check whether the Logtail configurations whose data is saved to the current Logstore and other Logtail configurations on the server on which Logtail is installed involve a large number of subdirectories. Reconfigure the log directory for monitoring and the maximum number of levels of subdirectories that you want to monitor for each Logtail configuration. |
DOCKER_FILE_MAPPING_ALARM | The system failed to add a Docker file mapping by running a Logtail command. | Click the error link to view details and perform troubleshooting based on the details and the command in the details. |
DOCKER_FILE_MATCH_ALARM | The specified file cannot be found in the Docker container. | Click the error link to view details and perform troubleshooting based on the container information and the file path that is used for the search. |
DOCKER_REGEX_COMPILE_ALARM | An error occurred in the service_docker_stdout plug-in. The system failed to compile data based on the value of the BeginLineRegex parameter provided in the Logtail configuration. | Click the error link to view details and check whether the regular expression in the details is correct. |
DOCKER_STDOUT_INIT_ALARM | The service_docker_stdout plug-in failed to be initialized. | Click the error link to view the sub-type of the error. The following sub-types are available: |
DOCKER_STDOUT_START_ALARM | The stdout size exceeds the limit when the service_docker_stdout plug-in is used to collect data. | In most cases, this error occurs because the stdout already exists the first time you use the plug-in. You can ignore this error. |
DOCKER_STDOUT_STAT_ALARM | The service_docker_stdout plug-in cannot find the stdout. | In most cases, this error occurs because no stdout is available when a container is terminated. You can ignore this error. |
FILE_READER_EXCEED_ALARM | The number of files that Logtail opens at the same time exceeds the limit. | In most cases, this error occurs because Logtail is collecting logs from a large number of files. Check whether the settings of the Logtail configuration are proper. |
GEOIP_ALARM | An error occurred in the processor_geoip plug-in. | Click the error link to view the sub-type of the error. The following sub-types are available: |
HTTP_INIT_ALARM | An error occurred in the metric_http plug-in. The plug-in failed to compile the regular expression that is specified by the ResponseStringMatch parameter in the Logtail configuration. | Click the error link to view details and check whether the regular expression in the details is correct. |
HTTP_PARSE_ALARM | An error occurred in the metric_http plug-in. The plug-in failed to obtain HTTP responses. | Click the error link to view details and check the Logtail configuration or the requested HTTP server based on the details. |
INIT_CHECKPOINT_ALARM | An error occurred in the plug-in that is used to collect binary logs. The plug-in failed to load the checkpoint file and started to process data from the beginning without a checkpoint. | Click the error link to view details and determine whether to ignore the error based on the details. |
LOAD_LOCAL_EVENT_ALARM | Logtail performs local event handling. | In most cases, this error does not occur. If this error is caused by a non-human operation, you must perform troubleshooting. You can click the error link to view details and perform troubleshooting based on the file name, Logtail configuration name, project, and Logstore in the details. |
LOG_REGEX_FIND_ALARM | Errors occur in the processor_split_log_regex and processor_split_log_string plug-ins. The plug-ins cannot find a match for the SplitKey parameter in logs. | Click the error link to view details and check whether configuration errors exist. |
LUMBER_CONNECTION_ALARM | An error occurred in the service_lumberjack plug-in. The server was shut down while the plug-in was stopped. | Click the error link to view details and perform troubleshooting based on the details. In most cases, you can ignore this error. |
LUMBER_LISTEN_ALARM | An error occurred in the service_lumberjack plug-in. The plug-in failed to perform listening during initialization. | Click the error link to view the sub-type of the error. The following sub-types are available: |
LZ4_COMPRESS_FAIL_ALARM | An error occurred when Logtail performed LZ4 compression. | Click the error link to view details and perform troubleshooting based on the values of log lines, project, category, and region in the details. |
MYSQL_CHECKPOINT_ALARM | An error occurred in the plug-in that is used to collect MySQL data. The error is related to checkpoints. | Click the error link to view the sub-type of the error. The following sub-types are available: |
NGINX_STATUS_COLLECT_ALARM | An error occurred in the nginx_status plug-in. The plug-in failed to obtain status information. | Click the error link to view details and perform troubleshooting based on the details and the URLs in the details. |
NGINX_STATUS_INIT_ALARM | An error occurred in the nginx_status plug-in. The plug-in failed to initialize the URLs specified in parsing configurations. | Click the error link to view details and check whether the URLs in the details are correct. |
OPEN_FILE_LIMIT_ALARM | Logtail failed to open the file because the number of opened files exceeded the limit. | Click the error link to view details and perform troubleshooting based on the file path, project, and Logstore in the details. |
OPEN_LOGFILE_FAIL_ALARM | An error occurred when Logtail opened the file. | Click the error link to view details and perform troubleshooting based on the file path, project, and Logstore in the details. |
PARSE_DOCKER_LINE_ALARM | An error occurred in the service_docker_stdout plug-in. The plug-in failed to parse the log. | Click the error link to view the sub-type of the error. The following sub-types are available: |
PLUGIN_ALARM | An error occurred when plug-ins were initialized and called. | Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details. |
PROCESSOR_INIT_ALARM | An error occurred in the processor_regex plug-in. The plug-in failed to compile the regular expression that is specified in the Logtail configuration. | Click the error link to view details and check whether the regular expression in the details is correct. |
PROCESS_TOO_SLOW_ALARM | Logtail parses logs too slowly. | |
REDIS_PARSE_ADDRESS_ALARM | An error occurred in the Redis plug-in. The plug-in failed to parse the value of the ServerUrls parameter that is provided in the Logtail configuration. | Click the error link to view details. Check the URLs for which the error is reported. |
REGEX_FIND_ALARM | An error occurred in the processor_regex plug-in. The plug-in failed to find the fields that are specified by the SourceKey parameter in logs. | Click the error link to view details. Check whether the SourceKey parameter is correctly configured or whether the logs are valid. |
REGEX_UNMATCHED_ALARM | An error occurred in the processor_regex plug-in. The match operation of the plug-in failed. | Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details. |
SAME_CONFIG_ALARM | Duplicate Logtail configurations are found for a Logstore. The most recent Logtail configuration that is found is discarded. | Click the error link to view details. Check whether configuration errors exist based on the details and the Logtail configuration path in the details. |
SPLIT_FIND_ALARM | Errors occurred in the split_char and split_string plug-ins. The plug-ins failed to find the fields that are specified by the SourceKey parameter in logs. | Click the error link to view details. Check whether the SourceKey parameter is correctly configured or whether the logs are valid. |
SPLIT_LOG_ALARM | Errors occurred in the processor_split_char and processor_split_string plug-ins. The number of parsed fields is different from the number of fields that are specified by the SplitKeys parameter. | Click the error link to view details. Check whether the SourceKey parameter is correctly configured or whether the logs are valid. |
STAT_FILE_ALARM | An error occurred when the LogFileReader object was used to collect data from a file. | Click the error link to view details and perform troubleshooting based on the details and the file path in the details. |
SERVICE_SYSLOG_INIT_ALARM | An error occurred in the service_syslog plug-in. The plug-in failed to be initialized. | Click the error link to view details. Check whether the Address parameter in the Logtail configuration is correctly configured. |
SERVICE_SYSLOG_STREAM_ALARM | An error occurred in the service_syslog plug-in. The plug-in failed to collect data over TCP. | Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details. |
SERVICE_SYSLOG_PACKET_ALARM | An error occurred in the service_syslog plug-in. The plug-in failed to collect data over UDP. | Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details. |
PARSE_TIME_FAIL_ALARM | The system failed to parse the log time. | You can use one of the following methods to identify the cause of the error and fix the error: |
Application monitoring
Why is there no monitoring data after installing the agent for an ACK cluster application?
Cause
Application monitoring is suspended.
The ARMS agent is not loaded as expected in the pod where the application resides.
Solution
Check whether application monitoring is suspended.
Log on to the ARMS console. In the left-side navigation pane, choose .
On the Application List page, select a region in the top navigation bar and click the name of the application.
If the application is not found, proceed to Step 2: Check whether the ARMS agent is loaded as expected.
If you are using the new Application Real-Time Monitoring Service (ARMS) console, choose in the top navigation bar of the application details page. In the Probe switch settings section, check whether application monitoring is suspended.
If Pause application monitoring is turned on, turn off the switch and click Save.
If Pause application monitoring is turned off, proceed to Step 2: Check whether the ARMS agent is loaded as expected.
If you are using the old ARMS console, click Application Settings in the left-side navigation pane of the application details page. On the page that appears, click the Custom Configuration tab. In the Agent Switch Settings section, check whether Probe Master Switch is turned on.
If Probe Master Switch is turned off, turn on Probe Master Switch and click Save in the lower part of the page.
If Probe Master Switch is turned on, proceed to Step 2: Check whether the ARMS agent is loaded as expected.
Check whether the agent is correctly loaded.
Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster to go to the cluster details page.
In the left-side navigation pane, choose .
On the Pods page, select the namespace in which your application resides, find the application, and then click Edit YAML in the Actions column.
In the Edit YAML dialog box, check whether the YAML file contains initContainers.

In the left-side navigation pane of the cluster details page, choose . On the page that appears, set the Namespace parameter to ack-onepilot. Check whether any pod named ack-onepilot-* with completed rolling updates exists in the pod list.
If the specified pod exists, perform Step 6.
If the specified pod does not exist, install the ack-onepilot component from the application market. For more information, see How do I install ack-onepilot and uninstall arms-pilot?
In the left-side navigation pane of the cluster details page, choose Workloads > Deployments or StatefulSets. On the page that appears, find the application and choose in the Actions column. In the Edit YAML dialog box, check whether the YAML file contains the following labels in the spec.template.metadata section:
labels:
  armsPilotAutoEnable: "on"
  armsPilotCreateAppName: "<your-deployment-name>"    # Replace <your-deployment-name> with the actual application name.
  armsSecAutoEnable: "on"                             # If you want to connect the application to Application Security, you must configure this parameter.
If the YAML file contains the labels, perform Step 7.
If the YAML file does not contain the labels, perform the following operations: In the Edit YAML dialog box, add the labels to the spec > template > metadata section, replace <your-deployment-name> with the actual application name, and then click Update. A placement sketch is shown below.
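For reference, the following is a minimal sketch of where these labels sit in a Deployment manifest. The Deployment name, namespace, and image are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                 # placeholder Deployment name
  namespace: default             # placeholder namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
        armsPilotAutoEnable: "on"
        armsPilotCreateAppName: "demo-app"   # the application name reported to ARMS
        armsSecAutoEnable: "on"              # required only if you use Application Security
    spec:
      containers:
      - name: demo-app
        image: registry.example.com/demo-app:latest   # placeholder image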
In the left-side navigation pane of the cluster details page, choose . On the page that appears, find the pod and choose in the Actions column to check whether the pod logs of ack-onepilot report a Security Token Service (STS) error in the "Message":"STS error" format.
If the error is reported, authorize the cluster of the application and restart the pod of the application. For more information, see Install the ARMS agent for Java applications deployed in ACK.
If the error is not reported, submit a ticket.
In the left-side navigation pane of the cluster details page, choose , find the pod and click Edit YAML in the Actions column. In the Edit YAML dialog box, check whether the YAML file contains the following javaagent parameter:
-javaagent:/home/admin/.opt/ArmsAgent/aliyun-java-agent.jar
Note: If you use an ARMS agent earlier than 2.7.3.5, replace aliyun-java-agent.jar in the preceding code with arms-bootstrap-1.7.0-SNAPSHOT.jar. We recommend that you upgrade the agent to the latest version at the earliest opportunity.
If the YAML file contains the parameter, find the pod on the Pods page and click Terminal in the Actions column to go to the command line page. Run the following command to check whether the logs directory contains a log file with the .log file extension. Then, submit a ticket.
cd /home/admin/.opt/ArmsAgent/logs
If the YAML file does not contain the parameter, submit a ticket.
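A one-line check for the agent log files, equivalent to running the preceding cd command and then listing the directory:
ls -l /home/admin/.opt/ArmsAgent/logs/*.log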
ARMS Addon Token does not exist in the cluster
Issue
ARMS Addon Token does not exist in the cluster.
Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster to go to the cluster details page.
In the left-side navigation pane, choose .
In the upper part of the page, select kube-system from the Namespace drop-down list and check whether the addon.arms.token secret exists.

Solution
Grant ARMS access permissions to Container Service for Kubernetes.
Why is monitoring data abnormal after changing the cluster or namespace of an application?
Issue
The value displayed in the namespace column on the custom dashboard is not updated after you change the namespace of your application.

After you change the cluster of your application, the data for rate, errors, and duration (RED) metrics is displayed normally but no data is displayed for container monitoring metrics, such as CPU and memory.
Cause
Container-related parameters, such as Namespace and ClusterId, are configured when the application is created and the values of these parameters cannot be automatically updated. If you change the cluster or namespace of your application, the container-related data may fail to be queried or displayed.
Solution
Delete the application, recreate the application, and then report monitoring data again. For more information, see Delete applications.
This method causes the loss of historical data.
Submit a ticket.
How do I customize the Java agent mount path?
Background
Typically, the ack-onepilot component specifies the mount path for Application Real-Time Monitoring Service (ARMS) agents for Java by injecting the environment variable JAVA_TOOL_OPTIONS. However, you may need to customize this path for scenarios such as:
Centralized configuration management
Manage the mount path through a Kubernetes ConfigMap to ensure environment consistency.
Persistent storage
Store agent files in a custom persistent volume claim (PVC) to meet enterprise security or O&M requirements.
Solution
To customize the mount path for ARMS agents for Java, these version requirements must be met:
ack-onepilot: V4.1.0 or later.
ARMS agent for Java: V4.2.2 or later. You can control the version of your ARMS agent for Java.
This configuration also applies to Microservice Engine (MSE) due to shared ack-onepilot integration.
Add the disableJavaToolOptionsInjection label to the Kubernetes workload, such as a Deployment, that requires a custom mount path. The ack-onepilot component then no longer automatically sets the mount path or other Java Virtual Machine (JVM) parameters by using the JAVA_TOOL_OPTIONS environment variable.
To view the YAML file of the Deployment, run the following command:
kubectl get deployment {Deployment name} -o yaml
Note: If you are not sure about the Deployment name, run the following command to list all Deployments:
kubectl get deployments --all-namespaces
Then, find the one you want in the results and view its YAML file.
Run the following command to edit the YAML file:
kubectl edit deployment {Deployment name} -o yaml
In the YAML file, add the following labels to spec.template.metadata:
labels:
  armsPilotAutoEnable: "on"
  armsPilotCreateAppName: "<your-deployment-name>"    # The name of your Deployment.
  disableJavaToolOptionsInjection: "true"             # Set this parameter to true to customize the mount path for the ARMS agent for Java.
Replace the default mount path /home/admin/.opt/AliyunJavaAgent/aliyun-java-agent.jar in your Java startup script or command with your custom path:
java -javaagent:/home/admin/.opt/AliyunJavaAgent/aliyun-java-agent.jar ... ... -jar xxx.jar
Other information such as the reporting region and license key is provided by ack-onepilot through environment variables.
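For example, if you mount the agent files from a custom PVC at a hypothetical path such as /custom/agent/, the startup command might look like the following. The path and the application JAR name are placeholders:
java -javaagent:/custom/agent/aliyun-java-agent.jar -jar app.jar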
How do I report data across regions in an ACK cluster?
Issue
How do you report data from Region A to Region B across regions?
Solution
Update the ack-onepilot component to V4.0.0 or later.
Add the ARMS_REPORT_REGION environment variable to the ack-onepilot-ack-onepilot application in the ack-onepilot namespace. The value must be the ID of a region where ARMS is available. Example: cn-hangzhou or cn-beijing.
Restart the existing application or deploy a new application to report data across regions.
Note: After the environment variable is added, all applications deployed in the cluster report data to the region specified in the previous step.
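A minimal sketch of adding the environment variable from the command line, assuming that the workload is a Deployment named ack-onepilot-ack-onepilot in the ack-onepilot namespace as described above, and using cn-hangzhou as an example region:
kubectl set env deployment/ack-onepilot-ack-onepilot -n ack-onepilot ARMS_REPORT_REGION=cn-hangzhou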
How do I uninstall arms-pilot and install ack-onepilot?
Background
The old Application Monitoring agent arms-pilot is no longer maintained. You can install the new agent ack-onepilot to monitor your applications. ack-onepilot is fully compatible with arms-pilot. You can seamlessly install ack-onepilot without the need to modify application configurations. This topic describes how to uninstall arms-pilot and install ack-onepilot.
Solution
You must install ack-onepilot in an ACK cluster V1.16 or later. If your cluster is earlier than V1.16, upgrade the cluster first. For more information, see Update the Kubernetes version of an ACK cluster.
You must uninstall arms-pilot before installing ack-onepilot. If you have both ack-onepilot and arms-pilot installed, the ARMS agent cannot be mounted. If arms-pilot is not completely uninstalled, ack-onepilot does not work because it determines that arms-pilot is still running in the environment.
When you uninstall arms-pilot and install ack-onepilot, the configurations of arms-pilot cannot be automatically synchronized to ack-onepilot. We recommend that you record the configurations and then manually configure ack-onepilot.
Uninstall arms-pilot.
Log on to the ACK console. On the Clusters page, click the name of the cluster.
In the left-side navigation pane, choose .
On the Helm page, find arms-pilot and click Delete in the Actions column.
In the Delete message, click OK.
Check whether arms-pilot is uninstalled.
Go to the cluster details page of the ACK console. In the left-side navigation pane, choose . On the Deployments page, select arms-pilot from the Namespace drop-down list, and check whether the pods of the namespace are deleted as expected.
Note: If you have modified the namespace to which arms-pilot belongs, select the new namespace.
Install ack-onepilot.
Log on to the ACK console. On the Clusters page, click the name of the cluster.
In the left-side navigation pane, click . On the Add-ons page, search for ack-onepilot.
Click Install on the ack-onepilot card.
Note: By default, the ack-onepilot component supports 1,000 pods. For every additional 1,000 pods in the cluster, you must add 0.5 CPU cores and 512 MB memory for the component.
In the dialog box that appears, configure the parameters and click OK. We recommend that you use the default values.
Note: After you install ack-onepilot, you can upgrade, configure, or uninstall it on the Add-ons page.
Check whether ack-onepilot is installed.
Navigate to the cluster details page of the ACK console. In the left-side navigation pane, choose . On the Deployments page, select ack-onepilot from the Namespace drop-down list, and check whether the pods of the namespace are running as expected.
Alibaba Cloud Prometheus monitoring
No relevant dashboard found on the Prometheus monitoring page
Issue
After you enable Managed Service for Prometheus, log on to the ACK console and choose in the left-side navigation pane. On the Prometheus Monitoring page, the "No dashboard is found" error message is displayed. To resolve the issue, perform the following steps:

Solution
Reinstall the ack-arms-prometheus component.
Uninstall the ack-arms-prometheus component.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the one you want to manage and click its name. In the left-side navigation pane, click Add-ons.
On the Add-ons page, click the Logs and Monitoring tab and find the ack-arms-prometheus component. Click Uninstall. In the dialog box that appears, click OK.
Reinstall the ack-arms-prometheus component.
Click Install. In the dialog box that appears, click OK.
Wait for the installation to complete. Go to the Prometheus Monitoring page to check whether the issue is resolved.
If the issue persists, perform the following steps.
Check whether the Prometheus instance is connected.
Log on to the ARMS console.
In the left-side navigation pane, click Integration Management.
On the Integrated Environments tab, check whether a container environment whose name is the same as the cluster exists in the Container Service list.
If you cannot find the container environment, view the dashboards in the ARMS or Managed Service for Prometheus console.
If you can find the container environment, click Configure Agent in the Actions column. The Configure Agent page appears.
Check whether the installed agents run as expected. If an error is reported, submit a ticket.
If the issue persists, submit a ticket to contact technical support.
Why is Managed Service for Prometheus data abnormal and cannot be displayed?
Cause
Managed Service for Prometheus data cannot be displayed because the synchronization job with Alibaba Cloud Prometheus cloud service failed, which resulted in resource registration failure, or because the Prometheus instance is not properly connected. You can perform the following steps to troubleshoot the issue.
Solution
Check the status of the job for accessing Managed Service for Prometheus.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose .
At the top of the Jobs page, set Namespace to arms-prom, and then click o11y-init-environment to verify whether the task is successful.
If the task is not successful, data may fail to be synchronized to Managed Service for Prometheus and resource registration may fail. You can view the pod logs to get the specific failure reason. For more information, see Troubleshoot pod issues.
If no pod exists, continue with the following steps.
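In addition to checking in the console, you can inspect the job from the command line. A minimal sketch, assuming the job name and namespace shown above:
kubectl get job o11y-init-environment -n arms-prom
kubectl logs job/o11y-init-environment -n arms-prom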
Reinstall the Prometheus monitoring component.
Uninstall the ack-arms-prometheus component.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the one you want to manage and click its name. In the left-side navigation pane, click Add-ons.
On the Add-ons page, click the Logs and Monitoring tab and find the ack-arms-prometheus component. Click Uninstall. In the dialog box that appears, click OK.
Reinstall the ack-arms-prometheus component.
Click Install. In the dialog box that appears, click OK.
Wait for the installation to complete. Go to the Prometheus Monitoring page to check whether the issue is resolved.
If the issue persists, perform the following steps.
Check whether the Prometheus instance is connected.
Log on to the ARMS console.
In the left-side navigation pane, click Integration Management.
On the Integrated Environments tab, check whether a container environment whose name is the same as the cluster exists in the Container Service list.
If you cannot find the container environment, view the dashboards in the ARMS or Managed Service for Prometheus console.
If you can find the container environment, click Configure Agent in the Actions column. The Configure Agent page appears.
Check whether the installed agents run as expected. If an error is reported, submit a ticket.
If the preceding steps do not resolve the issue, submit a ticket to contact technical support for help.
Error "rendered manifests contain a resource that already exists" when reinstalling Alibaba Cloud Prometheus monitoring
Issue
When you uninstall and reinstall the Prometheus agent, the following error message appears:
rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: kind: ClusterRole, namespace: , name: arms-pilot-prom-k8s
Cause
After you run commands to manually uninstall the Prometheus agent, resources such as roles may fail to be deleted.
Solution
Run the following command to find the ClusterRoles of the Prometheus agent:
kubectl get ClusterRoles --all-namespaces | grep prom
Run the following command to delete the ClusterRoles that are queried in the previous step:
kubectl delete ClusterRole [$Cluster_Roles] -n arms-prom
Note: The [$Cluster_Roles] parameter specifies the ClusterRoles that are queried in the previous step.
If the issue persists after you delete the ClusterRoles, view the value of kind in the error message to check whether resources other than ClusterRoles exist. Perform the preceding operations to delete them in sequence.
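To check for residual resources of other kinds, you can list the cluster-scoped RBAC resources that match the conflicting name. A minimal sketch; the grep pattern is only an example:
kubectl get clusterrole,clusterrolebinding -o name | grep arms-pilot-prom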
How do I check the version of the ack-arms-prometheus component?
Background
You need to check the version of the ack-arms-prometheus component deployed in your cluster and whether it needs to be updated.
Solution
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the one you want to manage and click its name. In the left-side navigation pane, click Add-ons.
On the Add-ons page, click the Logs and Monitoring tab and find the ack-arms-prometheus component.
The version number is displayed in the lower part of the component card. If a new version is available, click Upgrade on the right side to update the component.
Note: The Upgrade button is displayed only if the component is not updated to the latest version.
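You can also check the version from the command line by inspecting the image tag of the Prometheus agent Deployment. A minimal sketch, assuming the Deployment name and namespace used elsewhere in this topic (arms-prometheus-ack-arms-prometheus in arms-prom); the actual names may differ between versions:
kubectl get deployment arms-prometheus-ack-arms-prometheus -n arms-prom -o jsonpath='{.spec.template.spec.containers[0].image}'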
Why can't GPU monitoring be deployed?
Cause
Managed Service for Prometheus may be unable to monitor GPU-accelerated nodes that are configured with taints. You can perform the following steps to view the taints of a GPU-accelerated node.
Solution
Run the following command to view the taints of a GPU-accelerated node. If you added custom taints to the node, the taint information is displayed. In this example, a taint whose key is set to test-key, value is set to test-value, and effect is set to NoSchedule is added to the node.
kubectl describe node cn-beijing.47.100.***.***
Expected output:
Taints: test-key=test-value:NoSchedule
Use one of the following methods to handle the taint:
Run the following command to delete the taint from the GPU-accelerated node:
kubectl taint node cn-beijing.47.100.***.*** test-key=test-value:NoSchedule-
Add a toleration rule that allows pods to be scheduled to the GPU-accelerated node with the taint, as shown in the following example:
# 1. Run the following command to modify ack-prometheus-gpu-exporter:
kubectl edit daemonset -n arms-prom ack-prometheus-gpu-exporter
# 2. Add the following fields to the YAML file to tolerate the taint.
#    Other fields are omitted. The tolerations field must be added above the
#    containers field, and both fields must be at the same level.
tolerations:
- key: "test-key"
  operator: "Equal"
  value: "test-value"
  effect: "NoSchedule"
containers:
# Irrelevant fields are not shown.
How do I completely manually delete ARMS-Prometheus when manual resource deletion causes reinstallation failure?
Background
If you delete only the namespace of Managed Service for Prometheus, resource configurations are retained. In this case, you may fail to reinstall ack-arms-prometheus. You can perform the following operations to delete the residual resource configurations:
Solution
Run the following command to delete the arms-prom namespace:
kubectl delete namespace arms-prom
Run the following commands to delete the related ClusterRoles:
kubectl delete ClusterRole arms-kube-state-metrics
kubectl delete ClusterRole arms-node-exporter
kubectl delete ClusterRole arms-prom-ack-arms-prometheus-role
kubectl delete ClusterRole arms-prometheus-oper3
kubectl delete ClusterRole arms-prometheus-ack-arms-prometheus-role
kubectl delete ClusterRole arms-pilot-prom-k8s
kubectl delete ClusterRole gpu-prometheus-exporter
kubectl delete ClusterRole o11y:addon-controller:role
kubectl delete ClusterRole arms-aliyunserviceroleforarms-clusterrole
Run the following commands to delete the related ClusterRoleBindings:
kubectl delete ClusterRoleBinding arms-node-exporter
kubectl delete ClusterRoleBinding arms-prom-ack-arms-prometheus-role-binding
kubectl delete ClusterRoleBinding arms-prometheus-oper-bind2
kubectl delete ClusterRoleBinding arms-kube-state-metrics
kubectl delete ClusterRoleBinding arms-pilot-prom-k8s
kubectl delete ClusterRoleBinding arms-prometheus-ack-arms-prometheus-role-binding
kubectl delete ClusterRoleBinding gpu-prometheus-exporter
kubectl delete ClusterRoleBinding o11y:addon-controller:rolebinding
kubectl delete ClusterRoleBinding arms-kube-state-metrics-agent
kubectl delete ClusterRoleBinding arms-node-exporter-agent
kubectl delete ClusterRoleBinding arms-aliyunserviceroleforarms-clusterrolebinding
Run the following commands to delete the related Roles and RoleBindings:
kubectl delete Role arms-pilot-prom-spec-ns-k8s
kubectl delete Role arms-pilot-prom-spec-ns-k8s -n kube-system
kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s
kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s -n kube-system
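Optionally, before you reinstall the component, verify that no residual resources remain. A minimal sketch; the grep pattern is only an example:
kubectl get clusterrole,clusterrolebinding -o name | grep -E 'arms|o11y'
kubectl get namespace arms-prom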
After you delete the residual resource configurations, go to the ACK console, choose Operations > Add-ons, and reinstall the ack-arms-prometheus component.
Error "xxx in use" when installing the ack-arms-prometheus component
Cause
When you deploy the ack-arms-prometheus component, an error message indicating "xxx in use" appears. This suggests that resources are being used or residual resources exist, causing the component installation to fail.
Solution
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage. Then, click the name of the cluster or click Details in the Actions column.
In the left-side navigation pane of the cluster details page, choose . On the Helm page, check whether the ack-arms-prometheus application is displayed.
If the ack-arms-prometheus application is displayed on the Helm page, delete the ack-arms-prometheus application and then install ack-arms-prometheus on the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.
If the ack-arms-prometheus application is not displayed on the Helm page, residual data may remain from a previous deletion. Perform the following steps:
Manually delete the residual data. For more information about how to delete the residual data related to ack-arms-prometheus, see FAQ.
Install ack-arms-prometheus on the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.
If the issue persists, submit a ticket.
Installation failure after "Component Not Installed" prompt when installing the ack-arms-prometheus component
Issue
When you try to install the ack-arms-prometheus component, you first see a "Component Not Installed" prompt, but subsequent installation attempts still fail.
Solution
Check whether ack-arms-prometheus is already installed.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage. Then, click the name of the cluster or click Details in the Actions column.
Go to the cluster details page in the ACK console and choose Applications > Helm in the left-side navigation pane.
Check whether the ack-arms-prometheus application is displayed on the Helm page.
If ack-arms-prometheus is displayed on the Helm page, delete ack-arms-prometheus on the Helm page and then install ack-arms-prometheus from the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.
If the ack-arms-prometheus application is not displayed on the Helm page, residual data may remain from a previous deletion. Perform the following operations:
Manually delete the residual data. For more information about how to delete the residual data related to ack-arms-prometheus, see FAQ.
Install ack-arms-prometheus on the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.
If the issue persists, submit a ticket.
Check whether errors are reported in the log of ack-arms-prometheus.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose .
In the upper part of the Deployments page, set Namespace to arms-prom and then click arms-prometheus-ack-arms-prometheus.
Click the Logs tab and check whether errors are reported in the log.
If errors are reported in the log, submit a ticket.
Check whether installation errors are reported by the Prometheus agent.
Log on to the ARMS console.
In the left-side navigation pane, click Integration Management.
On the Integrated Environments tab, view the environment list on the Container Service tab. Find the ACK environment instance and click Configure Agent in the Actions column. The Configure Agent page appears.
Check whether the installed agents run as expected. If an error is reported, submit a ticket.
Open-source Prometheus monitoring
How do I configure DingTalk alert notifications?
Issue
After deploying open-source Prometheus, you need to configure alert notifications through DingTalk.
Solution
Obtain the webhook URL of your DingTalk chatbot. For more information, see Event monitoring.
On the Parameters wizard page, find the dingtalk section, set enabled to true, and then specify the webhook URL of your DingTalk chatbot in the token field. For more information, see Configure DingTalk alert notifications in Alert configurations.
Error when deploying prometheus-operator
Issue
Can't install release with errors: rpc error: code = Unknown desc = object is being deleted: customresourcedefinitions.apiextensions.k8s.io "xxxxxxxx.monitoring.coreos.com" already exists
Solution
The error message indicates that the cluster fails to clear custom resource definition (CRD) objects of the previous deployment. Run the following commands to delete the CRD objects. Then, deploy prometheus-operator again:
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
Email alerts are not working
Issue
After deploying open-source Prometheus, your configured email alerts are not sending alert notifications.
Solution
Make sure that the value of smtp_auth_password is the SMTP authorization code instead of the logon password of the email account. Make sure that the SMTP server endpoint includes a port number.
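For reference, a minimal sketch of the relevant global section of an Alertmanager configuration; the SMTP server address, account, and authorization code are placeholders:
global:
  smtp_smarthost: 'smtp.example.com:465'            # SMTP server endpoint, including the port number
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: '<SMTP-authorization-code>'   # the SMTP authorization code, not the logon password
  smtp_require_tls: false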
Error "Cluster cannot be accessed, please try again or submit a ticket" when clicking YAML update
Issue
After deploying open-source Prometheus, when you click YAML update, the error "The current cluster cannot be accessed temporarily. Please try again later or submit a ticket for feedback" appears.
Solution
If the configuration file of Tiller is too large, the cluster cannot be accessed. To solve this issue, you can delete some annotations in the configuration file and mount the file to a pod as a ConfigMap. You can specify the name of the ConfigMap in the configMaps fields of the prometheus and alertmanager sections. For more information, see the second method in Mount a ConfigMap to Prometheus.
How do I enable features after deploying prometheus-operator?
Issue
After deploying open-source Prometheus, you may need to further configure it to enable specific features.
Solution
After prometheus-operator is deployed, you can perform the following steps to enable its features. Go to the cluster details page and choose in the left-side navigation pane. On the Helm page, find ack-prometheus-operator and click Update in the Actions column. In the Update Release panel, modify the code block to enable the features that you need. Then, click OK.
How to choose between TSDB and Alibaba Cloud disk
Issue
When selecting a storage solution, how do you choose between TSDB and Alibaba Cloud disk, and how do you configure the data reclamation policy?
Solution
TSDB storage is available only in certain regions, whereas disk storage is supported in all regions. Configure the data retention policy based on your storage choice and business requirements.
Issues with Grafana Dashboard display
Issue
After deploying open-source Prometheus, there are display issues with the Grafana Dashboard.
Solution
Go to the cluster details page and choose in the left-side navigation pane. On the Helm page, find ack-prometheus-operator and click Update in the Actions column. In the Update Release panel, check whether the value of the clusterVersion field is correct. If the Kubernetes version of your cluster is earlier than 1.16, set clusterVersion to 1.14.8-aliyun.1. If the Kubernetes version of your cluster is 1.16 or later, set clusterVersion to 1.16.6-aliyun.1.
Failure to reinstall ack-prometheus-operator after deleting its namespace
Cause
After you delete the ack-prometheus namespace, the related resource configurations may be retained. In this case, you may fail to install ack-prometheus again. You can perform the following operations to delete the related resource configurations:
Solution
Delete role-based access control (RBAC)-related resource configurations.
Run the following commands to delete the related ClusterRoles:
kubectl delete ClusterRole ack-prometheus-operator-grafana-clusterrole kubectl delete ClusterRole ack-prometheus-operator-kube-state-metrics kubectl delete ClusterRole psp-ack-prometheus-operator-kube-state-metrics kubectl delete ClusterRole psp-ack-prometheus-operator-prometheus-node-exporter kubectl delete ClusterRole ack-prometheus-operator-operator kubectl delete ClusterRole ack-prometheus-operator-operator-psp kubectl delete ClusterRole ack-prometheus-operator-prometheus kubectl delete ClusterRole ack-prometheus-operator-prometheus-pspRun the following commands to delete the related ClusterRoleBindings:
kubectl delete ClusterRoleBinding ack-prometheus-operator-grafana-clusterrolebinding
kubectl delete ClusterRoleBinding ack-prometheus-operator-kube-state-metrics
kubectl delete ClusterRoleBinding psp-ack-prometheus-operator-kube-state-metrics
kubectl delete ClusterRoleBinding psp-ack-prometheus-operator-prometheus-node-exporter
kubectl delete ClusterRoleBinding ack-prometheus-operator-operator
kubectl delete ClusterRoleBinding ack-prometheus-operator-operator-psp
kubectl delete ClusterRoleBinding ack-prometheus-operator-prometheus
kubectl delete ClusterRoleBinding ack-prometheus-operator-prometheus-psp
Run the following command to delete the related CRD objects:
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
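Optionally, run the following command to verify that no prometheus-operator CRDs remain before you reinstall the component. An empty result indicates that the cleanup succeeded.
kubectl get crd | grep monitoring.coreos.com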
Alert management
Alert rule synchronization failure with error message "The Project does not exist : k8s-log-xxx"
Issue
The alert rule synchronization status in the alert center shows the error message The Project does not exist : k8s-log-xxx.
Cause
You did not create an event center in Simple Log Service for your cluster.
Solution
In the Simple Log Service console, check whether you have reached the project quota limit. For more information about resources, see Basic resources.
If you have reached the quota limit, delete unnecessary projects or submit a ticket to request an increase in the project resource quota limit. For information about how to delete a project, see Manage projects.
If you have not reached the quota limit, perform the following steps.
Reinstall ack-node-problem-detector.
When you reinstall the component, a default project named k8s-log-xxxxxx is created.
Uninstall ack-node-problem-detector.
In the left-side navigation pane of the details page of the target cluster in the ACK console, open the add-on management page.
Click the Logs & Monitoring tab. In the ack-node-problem-detector card, click Uninstall. In the dialog box that appears, click OK.
After the uninstallation is complete, install ack-node-problem-detector.
In the left-side navigation pane, open the Alert Configuration page.
On the Alert Configuration page, click Start Installation. The console automatically creates a project and installs and upgrades the components.
On the Alert Configuration page, turn off the switch in the Enabled column for the corresponding alert rule set. Wait until Alert Rule Status changes to Rule Disabled, and then turn on the switch to retry.
Alert rule synchronization failure with error message "this rule have no xxx contact groups reference"
Issue
The alert rule fails to be synchronized and an error message similar to "this rule have no xxx contact groups reference" is returned.
Cause
No contact group subscribes to the alert rule.
Solution
Create a contact group and add contacts.
Click Edit Notification Object on the right side of the corresponding alert rule set and configure a contact group that subscribes to the alert rule set.
Other issues
Why is there no data when running kubectl top pod/node?
Issue
When you run kubectl top pod or kubectl top node in the command line, no data is returned.
Solution
Run the following command to check whether the metrics-server API service works normally:
kubectl get apiservices | grep metrics-server
If v1beta1.metrics.k8s.io shows True in the returned result, the metrics-server API service is working normally.
Optional: If the metrics-server API service is not working normally, run the following commands on the cluster node to check whether metrics-server can be accessed through port 443 and port 8082 within the cluster:
curl -v <metrics-server_Pod_IP>:8082/apis/metrics/v1alpha1/nodes
curl -v <metrics-server_Pod_IP>:443/apis/metrics/v1alpha1/nodes
If data is returned after you run the preceding commands, metrics-server can be accessed through port 443 and port 8082 within the cluster.
Optional: If metrics-server cannot be accessed through port 443 and port 8082 within the cluster, restart metrics-server.
You can delete the pod that runs metrics-server to restart metrics-server.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, open the Stateless page.
At the top of the Stateless page, set Namespace to kube-system, and click metrics-server.
On the Pods tab, choose More > Delete from the Actions column of the metrics-server pod, and then click OK in the dialog box.
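If you prefer the command line, the following sketch restarts metrics-server in the same way. It assumes that the metrics-server pods carry the common k8s-app=metrics-server label; verify the label on your cluster before you run the commands.
kubectl -n kube-system get pods -l k8s-app=metrics-server      # confirm the pod name and label
kubectl -n kube-system delete pods -l k8s-app=metrics-server   # delete the pod so that it is recreated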
If you still cannot identify the issue after performing the preceding checks, submit a ticket and provide the information in the following template.
Ticket template:
Does the metrics-server API service work normally?
Yes
Can metrics-server be accessed through port 443 and port 8082 within the cluster?
Yes
Provide the cluster ID.
Why is there partial data missing when running kubectl top pod/node?
Issue
When you run kubectl top pod or kubectl top node in the command line, some data is missing.
Solution
Perform the following checks:
Check whether the data of all pods on a node is missing or only the data of some pods is missing. If the data of all pods on a node is missing, check whether a time zone difference exists on the node. You can use the date command to check the time zone against the NTP server.
Check whether the pod that runs metrics-server can connect to the node through port 10255.
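A quick way to test the connectivity is a TCP check against the node, run from a host inside the cluster network. <node_IP> is a placeholder for the internal IP address of the node, and the command assumes that the netcat utility is available.
nc -zv <node_IP> 10255     # succeeds if the kubelet read-only port on the node is reachable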
If you still cannot identify the issue after performing the preceding checks, submit a ticket and provide the information in the following template.
Ticket template:
Is the missing data about all pods on a specific node?
Yes
Does a timezone difference exist on the node?
No
Can metrics-server connect to the specific node?
Yes
What should I do if HPA cannot obtain metrics data?
Issue
When you use the Kubernetes Horizontal Pod Autoscaler (HPA), it may fail to obtain metrics data.
Solution
Perform the following checks:
Check the result of running kubectl top pod for the corresponding pod. If the data is abnormal, see Why is there no data when running kubectl top pod/node? and Why is there partial data missing when running kubectl top pod/node? for troubleshooting.
If you still cannot identify the issue after performing the preceding checks, submit a ticket and provide the information in the following template.
Ticket template:
Does the monitoring data show anomalies?
No
Run the kubectl describe hpa <hpa_name> command and submit the metadata information.
Why does HPA create extra pods during rolling updates?
Issue
When performing a Kubernetes rolling update, you may notice that HPA (Horizontal Pod Autoscaler) unexpectedly launches additional pods.
Solution
Perform the following checks:
Check whether metrics-server is upgraded to the latest version. If the version is correct, you can use the kubectl edit deployments -n kube-system metrics-server command to add the following startup parameters to the command section:
--metric-resolution=15s
--enable-hpa-rolling-update-skipped=true
If you still cannot identify the issue after performing the preceding checks, submit a ticket and provide the information in the following template.
Ticket template:
Is metrics-server upgraded to the latest version?
Yes
Are startup parameters added to prevent excess pods?
Yes
Run the kubectl describe hpa <hpa_name> command and submit the HPA description.