Container Service for Kubernetes: Observability FAQ

Last Updated: Jun 26, 2025

This topic provides answers to frequently asked questions in the following categories:

Log collection

Application monitoring

Alibaba Cloud Prometheus monitoring

Open-source Prometheus monitoring (ack-prometheus-operator component)

Alert management

Other issues

Log collection

How do I troubleshoot container log collection exceptions?

Issue

Container log collection exceptions occur, preventing new log content from being reported.

Solution

  • Check whether the machine group heartbeat is abnormal.

    You can check the heartbeat status of a machine group to determine whether Logtail is installed as expected.

    1. Check the heartbeat status of the machine group.

      1. Log on to the Simple Log Service console.

      2. In the Projects section, click the one you want to manage.

      3. In the left-side navigation pane, choose Resources > Machine Groups.

      4. In the Machine Groups list, click the machine group whose heartbeat status you want to check.

      5. On the Machine Group Configurations page, check the machine group status and record the number of nodes whose heartbeat status is OK.

    2. Count the number of worker nodes in the cluster to which your container belongs.

      1. Connect to the cluster.

      2. Run the following command to view the number of worker nodes in the cluster:

        kubectl get node | grep -v master

        The system returns information that is similar to the following code:

        NAME                                 STATUS    ROLES     AGE       VERSION
        cn-hangzhou.i-bp17enxc2us3624wexh2   Ready     <none>    238d      v1.10.4
        cn-hangzhou.i-bp1ad2b02jtqd1shi2ut   Ready     <none>    220d      v1.10.4
    3. Check whether the number of nodes whose heartbeat status is OK is equal to the number of worker nodes in the cluster. Then, select a troubleshooting method based on the check result.

      • The heartbeat status of all nodes in the machine group is Failed.

        • If you collect logs from standard Docker containers, check whether the values of the ${your_region_name}, ${your_aliyun_user_id}, and ${your_machine_group_user_defined_id} parameters are valid. For more information, see Collect logs from standard Docker containers.

        • If you use a Container Service for Kubernetes (ACK) cluster, submit a ticket. For more information, see Install Logtail.

        • If you use a self-managed Kubernetes cluster, check whether the values of the {your-project-suffix}, {regionId}, {aliuid}, {access-key-id}, and {access-key-secret} parameters are valid. For more information, see Collect text logs from Kubernetes containers in Sidecar mode.

          If the values are invalid, run the helm del --purge alibaba-log-controller command to delete the installation package and then re-install the package.

      • The number of nodes whose heartbeat status is OK in the machine group is less than the number of worker nodes in the cluster.

        • Check whether a YAML file is used to deploy the required DaemonSet.

          1. Run the following command. If a response is returned, the DaemonSet is deployed by using the YAML file.

            kubectl get po -n kube-system -l k8s-app=logtail
          2. Download the latest version of the Logtail DaemonSet template.

          3. Configure the ${your_region_name}, ${your_aliyun_user_id}, and ${your_machine_group_name} parameters based on your business requirements.

          4. Run the following command to update the YAML file:

            kubectl apply -f ./logtail-daemonset.yaml
        • In other cases, submit a ticket.

  • Check whether container log collection is abnormal.

    If no logs exist in the Consumption Preview section or on the query and analysis page of the related Logstore when you query data in the Simple Log Service console, Simple Log Service does not collect logs from your container. In this case, check the status of your container and perform the following operations.

    Important
    • Take note of the following items when you collect logs from container files:

      • Logtail collects only incremental logs. If a log file on your server is not updated after a Logtail configuration is delivered and applied to the server, Logtail does not collect logs from the file. For more information, see Read log files.

      • Logtail collects logs only from files in the default storage of containers or in the file systems that are mounted on containers. Other storage methods are not supported.

    • After logs are collected to a Logstore, you must create indexes. Then, you can query and analyze the logs in the Logstore. For more information, see Create indexes.

    1. Check whether the heartbeat status of your machine group is normal. For more information, see Troubleshoot an error that occurs due to the abnormal heartbeat status of a machine group.

    2. Check whether the Logtail configuration is valid.

      Check whether the settings of the following parameters in the Logtail configuration meet your business requirements: IncludeLabel, ExcludeLabel, IncludeEnv, and ExcludeEnv.

      Note
      • Container labels are retrieved by running the docker inspect command. Container labels are different from Kubernetes labels.

      • To check whether logs can be collected as expected, you can temporarily remove the settings of the IncludeLabel, ExcludeLabel, IncludeEnv, and ExcludeEnv parameters from the Logtail configuration. If logs can be collected after the settings are removed, the original settings of these parameters are incorrect.

  • Other O&M operations.

    • View Logtail logs

      The logs of Logtail are stored in the ilogtail.LOG and logtail_plugin.LOG files in the /usr/local/ilogtail/ directory of a Logtail container.

      1. Log on to the Logtail container. For more information, see Log on to a Logtail container.

      2. Go to the /usr/local/ilogtail/ directory.

        cd /usr/local/ilogtail
      3. View the ilogtail.LOG and logtail_plugin.LOG files.

        cat ilogtail.LOG
        cat logtail_plugin.LOG
    • Description of the standard output (stdout) of the Logtail container

      The stdout of the Logtail container is not useful for troubleshooting. You can ignore the following stdout:

      start umount useless mount points, /shm$|/merged$|/mqueue$
      umount: /logtail_host/var/lib/docker/overlay2/3fd0043af174cb0273c3c7869500fbe2bdb95d13b1e110172ef57fe840c82155/merged: must be superuser to unmount
      umount: /logtail_host/var/lib/docker/overlay2/d5b10aa19399992755de1f85d25009528daa749c1bf8c16edff44beab6e69718/merged: must be superuser to unmount
      umount: /logtail_host/var/lib/docker/overlay2/5c3125daddacedec29df72ad0c52fac800cd56c6e880dc4e8a640b1e16c22dbe/merged: must be superuser to unmount
      ......
      xargs: umount: exited with status 255; aborting
      umount done
      start logtail
      ilogtail is running
      logtail status:
      ilogtail is running
    • View the status of Simple Log Service-related components in a Kubernetes cluster

      Run the following command to view the status and information of the alibaba-log-controller Deployment:

      kubectl get deploy alibaba-log-controller -n kube-system

      Result:

      NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
      alibaba-log-controller   1/1     1            1           11d

      Run the following command to view the status and information of the logtail-ds DaemonSet:

      kubectl get ds logtail-ds -n kube-system

      Result:

      NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR  AGE
      logtail-ds   2         2         2       2            2           **ux           11d
    • View the version number, IP address, and startup time of Logtail

      The related information is stored in the /usr/local/ilogtail/app_info.json file of your Logtail container.

      kubectl exec logtail-ds-****k -n kube-system cat /usr/local/ilogtail/app_info.json

      The system returns information that is similar to the following code:

      {
         "UUID" : "",
         "hostname" : "logtail-****k",
         "instance_id" : "0EB****_172.20.4.2_1517810940",
         "ip" : "172.20.4.2",
         "logtail_version" : "0.16.2",
         "os" : "Linux; 3.10.0-693.2.2.el7.x86_64; #1 SMP Tue Sep 12 22:26:13 UTC 2017; x86_64",
         "update_time" : "2018-02-05 06:09:01"
      }
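
    • Compare the number of Logtail pods with the number of worker nodes

      As a supplement to the machine group heartbeat check, you can compare the number of running Logtail pods with the number of worker nodes. The following commands are a minimal sketch that assumes the default logtail-ds DaemonSet and the k8s-app=logtail label used earlier in this topic:

        # Count the running Logtail pods.
        kubectl get pods -n kube-system -l k8s-app=logtail --field-selector=status.phase=Running --no-headers | wc -l
        # Count the worker nodes.
        kubectl get node --no-headers | grep -v master | wc -l

      If the two numbers differ, Logtail is not running on some nodes, and the heartbeat status of those nodes is typically Failed.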

Why can't I delete a project?

Issue

How do I delete a project, or what should I do if I receive an "insufficient permissions" error message when I try to delete a project?

Solution

For more information about how to delete a project or Logstore, see Manage a project and Manage a Logstore. If a project fails to be deleted, see What do I do if the "Operation denied, insufficient permissions" error message is returned when I delete a project?

Common error types in Simple Log Service data collection

If you encounter other errors that are not described in this topic, submit a ticket.

Error

Description

Solution

LOG_GROUP_WAIT_TOO_LONG_ALARM

After a data packet is generated, the system waits for a long period of time to send the packet.

Check whether packets are sent as expected, and check whether the data volume exceeds the default limit, the quota is insufficient, or network errors occur.

Note

If you need to collect logs from a large number of log files and the log files occupy a large amount of memory, you can modify the startup parameters of Logtail.

LOGFILE_PERMINSSION_ALARM

Logtail does not have the permissions to read the specified file.

Check whether the startup account of Logtail on the server is root. We recommend that you use the root account.

SPLIT_LOG_FAIL_ALARM

Logtail failed to split logs into lines because the regular expression that is specified to match the beginning of the first line of a log did not match the content of the logs.

Check whether the regular expression is correct.

If you want to collect single-line logs, you can specify .* as the regular expression.

MULTI_CONFIG_MATCH_ALARM

By default, you can use only one Logtail configuration to collect logs from a log file. If you use multiple Logtail configurations to collect logs from a log file, only one Logtail configuration takes effect.

Note

You can use multiple Logtail configurations to collect the standard output (stdout) or standard errors (stderr) of Docker containers.

REGEX_MATCH_ALARM

In full regex mode, the log content does not match the specified regular expression.

Copy the sample log in the error details to generate a new regular expression.

PARSE_LOG_FAIL_ALARM

In modes such as JSON and delimiter, Logtail failed to parse logs because the log format did not conform to the defined format.

Click the error link to view details.

CATEGORY_CONFIG_ALARM

The Logtail configuration is invalid.

In most cases, this issue occurs if the specified regular expression fails to extract a part of the file path as a topic. If this issue occurs due to other causes, submit a ticket.

LOGTAIL_CRASH_ALARM

Logtail stops responding because the resource usage of the server on which Logtail runs exceeds the upper limit.

Increase the upper limits of CPU utilization and memory usage. For more information, see Configure the startup parameters of Logtail.

REGISTER_INOTIFY_FAIL_ALARM

Logtail failed to register the log listener in Linux. This error may occur because Logtail does not have the permissions to access the folder on which Logtail listens or the folder has been deleted.

Check whether Logtail has the permissions to access the folder or the folder is deleted.

DISCARD_DATA_ALARM

The CPU resources that are configured for Logtail are insufficient, or throttling is triggered when network data is sent.

Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail.

SEND_DATA_FAIL_ALARM

  • No AccessKey pair is created for your Alibaba Cloud account.

  • The server on which Logtail runs failed to connect to Simple Log Service, or the network quality is poor.

  • The write quota on Simple Log Service is insufficient.

  • Create an AccessKey pair for your Alibaba Cloud account.

  • Check the local configuration file /usr/local/ilogtail/ilogtail_config.json and run the curl <Server address> command to check whether content is returned.

  • Increase the number of shards for the Logstore to allow more data to be written to the Logstore.

REGISTER_INOTIFY_FAIL_ALARM

Logtail failed to register the inotify watcher for the log directory.

Check whether the log directory exists and check the permission settings of the directory.

SEND_QUOTA_EXCEED_ALARM

The log write traffic exceeds the limit.

Increase the number of shards in the Simple Log Service console. For more information, see Split a shard.

READ_LOG_DELAY_ALARM

Log collection lags behind log generation. In most cases, this error occurs because the CPU resources that are configured for Logtail are insufficient or throttling is triggered when network data is sent.

Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail.

When you import historical data, a large amount of data is collected in a short period of time. You can ignore this error.

DROP_LOG_ALARM

Log collection lags behind log generation, and the number of log files that are generated during rotation and are not parsed exceeds 20. In most cases, this error occurs because the CPU resources that are configured for Logtail are insufficient or throttling is triggered when network data is sent.

Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail.

LOGDIR_PERMINSSION_ALARM

Logtail does not have the permissions to read the log directory.

Check whether the log directory exists. If the log directory exists, check the permission settings of the directory.

ENCODING_CONVERT_ALARM

Encoding failed.

Check whether the configuration for log encoding is consistent with the actual implementation of log encoding.

OUTDATED_LOG_ALARM

The logs are expired. The log time lags behind the collection time for more than 12 hours. Possible causes:

  • The log parsing progress lags behind the expected time for more than 12 hours.

  • The custom time field is incorrectly configured.

  • The time output of the log recording program is invalid.

  • Check whether the READ_LOG_DELAY_ALARM error is reported.

    If the error is reported, fix the error. If the error is not reported, check the configuration of the time field.

  • Check the configuration of the time field. If the time field is correctly configured, check whether the time output of the log recording program is valid.

STAT_LIMIT_ALARM

The number of files in the log directory that is specified in the Logtail configuration exceeds the limit.

Check whether the log directory contains a large number of files and subdirectories. Reconfigure the log directory for monitoring and the maximum number of levels of subdirectories that you want to monitor.

You can also modify the mem_usage_limit parameter. For more information, see Configure the startup parameters of Logtail.

DROP_DATA_ALARM

When the Logtail process exits, logs are dumped to the local disk. However, the dump operation times out. As a result, logs that are not dumped to the local disk are discarded.

In most cases, this error occurs because collection is severely blocked. Increase the upper limits of CPU utilization or concurrent operations that can be performed to send network data. For more information, see Configure the startup parameters of Logtail.

INPUT_COLLECT_ALARM

An error occurred when data was collected from the input data source.

Fix the error based on the error details.

HTTP_LOAD_ADDRESS_ALARM

The specified Addresses parameter in the Logtail configuration that is used to collect HTTP data is invalid.

Specify a valid value for the Addresses parameter.

HTTP_COLLECT_ALARM

An error occurred when HTTP data was collected.

Fix the error based on the error details. In most cases, this error is caused by timeout.

FILTER_INIT_ALARM

An error occurred when the filter was initialized.

In most cases, this error is caused by the invalid regular expressions of the filter. Fix the error based on the error details.

INPUT_CANAL_ALARM

An error occurred in the plug-in that is used to collect MySQL binary logs.

Fix the error based on the error details.

When a Logtail configuration is updated, the canal service may restart. If the error is caused by the service restart, you can ignore the error.

CANAL_INVALID_ALARM

The plug-in that is used to collect MySQL binary logs is abnormal.

In most cases, this error is caused by inconsistent metadata. Metadata inconsistency may occur due to table scheme changes during running. Check whether table schemas are changed in the period during which the error is repeatedly reported. In other cases, submit a ticket.

MYSQL_INIT_ALARM

An error occurred when MySQL was initialized.

Fix the error based on the error details.

MYSQL_CHECKPOING_ALARM

The format of the checkpoints that are used to collect MySQL data is invalid.

Check whether to modify the checkpoint-related settings in the Logtail configuration. In other cases, submit a ticket.

MYSQL_TIMEOUT_ALARM

The MySQL query times out.

Check whether the MySQL server is properly connected to the network.

MYSQL_PARSE_ALARM

The MySQL query results failed to be parsed.

Check whether the format of the checkpoints that are used to collect MySQL data matches the format of the required fields.

AGGREGATOR_ADD_ALARM

The system failed to add data to the queue.

Data is sent too fast. If large amounts of data need to be sent, you can ignore this error.

ANCHOR_FIND_ALARM

An error occurred in the processor_anchor plug-in, an error occurred in the Logtail configuration, or logs that do not match the Logtail configuration exist.

Click the error link to view the sub-type of the error. The following sub-types are available. You can check the settings based on the error details of each sub-type.

  • anchor cannot find key: The SourceKey parameter is configured in the Logtail configuration, but the specified fields are not found in logs.

  • anchor no start: The system cannot find a match for the Start parameter in the value of SourceKey.

  • anchor no stop: The system cannot find a match for the Stop parameter in the value of SourceKey.

ANCHOR_JSON_ALARM

An error occurred in the processor_anchor plug-in. The plug-in failed to expand the JSON data that is extracted based on the values of the Start and Stop parameters.

Click the error link to view details. View collected logs and related configurations to check whether configuration errors or invalid logs exist.

CANAL_RUNTIME_ALARM

An error occurred in the plug-in that is used to collect MySQL binary logs.

Click the error link to view details and perform troubleshooting based on the details. In most cases, this error is related to the primary ApsaraDB RDS for MySQL instance that is connected.

CHECKPOINT_INVALID_ALARM

The system failed to parse checkpoints.

Click the error link to view details and perform troubleshooting based on the details and the key-value pairs of the checkpoints in the details. The values of the checkpoints are indicated by the first 1,024 bytes in the checkpoint file.

DIR_EXCEED_LIMIT_ALARM

The number of directories on which Logtail listens at the same time exceeds the limit.

Check whether the Logtail configurations whose data is saved to the current Logstore and other Logtail configurations on the server on which Logtail is installed involve a large number of subdirectories. Reconfigure the log directory for monitoring and the maximum number of levels of subdirectories that you want to monitor for each Logtail configuration.

DOCKER_FILE_MAPPING_ALARM

The system failed to add a Docker file mapping by running a Logtail command.

Click the error link to view details and perform troubleshooting based on the details and the command in the details.

DOCKER_FILE_MATCH_ALARM

The specified file cannot be found in the Docker container.

Click the error link to view details and perform troubleshooting based on the container information and the file path that is used for the search.

DOCKER_REGEX_COMPILE_ALARM

An error occurred in the service_docker_stdout plug-in. The system failed to compile data based on the value of the BeginLineRegex parameter provided in the Logtail configuration.

Click the error link to view details and check whether the regular expression in the details is correct.

DOCKER_STDOUT_INIT_ALARM

The service_docker_stdout plug-in failed to be initialized.

Click the error link to view the sub-type of the error. The following sub-types are available:

  • host...version...error: Check whether the Docker engine that is specified in the Logtail configuration can be accessed.

  • load checkpoint error: The system failed to load the checkpoint file. If this error does not affect your business, you can ignore this error.

  • container...: The specified container has invalid label values. Only stdout and stderr are supported. Perform troubleshooting based on the error details.

DOCKER_STDOUT_START_ALARM

The stdout size exceeds the limit when the service_docker_stdout plug-in is used to collect data.

In most cases, this error occurs because the stdout already exists the first time you use the plug-in. You can ignore this error.

DOCKER_STDOUT_STAT_ALARM

The service_docker_stdout plug-in cannot find the stdout.

In most cases, this error occurs because no stdout is available when a container is terminated. You can ignore this error.

FILE_READER_EXCEED_ALARM

The number of files that Logtail opens at the same time exceeds the limit.

In most cases, this error occurs because Logtail is collecting logs from a large number of files. Check whether the settings of the Logtail configuration are proper.

GEOIP_ALARM

An error occurred in the processor_geoip plug-in.

Click the error link to view the sub-type of the error. The following sub-types are available:

  • invalid ip...: The system failed to obtain an IP address. Check whether the SourceKey parameter in the Logtail configuration is correctly configured or whether invalid logs exist.

  • parse ip...: The system failed to parse an IP address into a city. Perform troubleshooting based on the error details.

  • cannot find key...: The system cannot find a match for the SourceKey parameter in logs. Check whether the Logtail configuration is correct or whether invalid logs exist.

HTTP_INIT_ALARM

An error occurred in the metric_http plug-in. The plug-in failed to compile the regular expression that is specified by the ResponseStringMatch parameter in the Logtail configuration.

Click the error link to view details and check whether the regular expression in the details is correct.

HTTP_PARSE_ALARM

An error occurred in the metric_http plug-in. The plug-in failed to obtain HTTP responses.

Click the error link to view details and check the Logtail configuration or the requested HTTP server based on the details.

INIT_CHECKPOINT_ALARM

An error occurred in the plug-in that is used to collect binary logs. The plug-in failed to load the checkpoint file and started to process data from the beginning without a checkpoint.

Click the error link to view details and determine whether to ignore the error based on the details.

LOAD_LOCAL_EVENT_ALARM

Logtail performs local event handling.

In most cases, this error does not occur. If this error is caused by a non-human operation, you must perform troubleshooting. You can click the error link to view details and perform troubleshooting based on the file name, Logtail configuration name, project, and Logstore in the details.

LOG_REGEX_FIND_ALARM

Errors occur in the processor_split_log_regex and processor_split_log_string plug-ins. The plug-ins cannot find a match for the SplitKey parameter in logs.

Click the error link to view details and check whether configuration errors exist.

LUMBER_CONNECTION_ALARM

An error occurred in the service_lumberjack plug-in. The server was shut down while the plug-in was stopped.

Click the error link to view details and perform troubleshooting based on the details. In most cases, you can ignore this error.

LUMBER_LISTEN_ALARM

An error occurred in the service_lumberjack plug-in. The plug-in failed to perform listening during initialization.

Click the error link to view the sub-type of the error. The following sub-types are available:

  • init tls error...: Check whether TLS-related configurations are correct based on the error details.

  • listen init error...: Check whether address-related configurations are correct based on the error details.

LZ4_COMPRESS_FAIL_ALARM

An error occurred when Logtail performed LZ4 compression.

Click the error link to view details and perform troubleshooting based on the values of log lines, project, category, and region in the details.

MYSQL_CHECKPOINT_ALARM

An error occurred in the plug-in that is used to collect MySQL data. The error is related to checkpoints.

Click the error link to view the sub-type of the error. The following sub-types are available:

  • init checkpoint error...: Checkpoint initialization failed. Check the checkpoint column specified by the Logtail configuration based on the error details and check whether the obtained values are correct.

  • not matched checkpoint...: Logs do not match the checkpoint information. Check whether the mismatch is caused by user operations such as configuration update based on the error details. If the mismatch is caused by user operations, you can ignore the error.

NGINX_STATUS_COLLECT_ALARM

An error occurred in the nginx_status plug-in. The plug-in failed to obtain status information.

Click the error link to view details and perform troubleshooting based on the details and the URLs in the details.

NGINX_STATUS_INIT_ALARM

An error occurred in the nginx_status plug-in. The plug-in failed to initialize the URLs specified in parsing configurations.

Click the error link to view details and check whether the URLs in the details are correct.

OPEN_FILE_LIMIT_ALARM

Logtail failed to open the file because the number of opened files exceeded the limit.

Click the error link to view details and perform troubleshooting based on the file path, project, and Logstore in the details.

OPEN_LOGFILE_FAIL_ALARM

An error occurred when Logtail opened the file.

Click the error link to view details and perform troubleshooting based on the file path, project, and Logstore in the details.

PARSE_DOCKER_LINE_ALARM

An error occurred in the service_docker_stdout plug-in. The plug-in failed to parse the log.

Click the error link to view the sub-type of the error. The following sub-types are available:

  • parse docker line error: empty line: The log is empty.

  • parse json docker line error...: The system failed to parse the log into the JSON format. Perform troubleshooting based on the error details and the first 512 bytes of the log.

  • parse cri docker line error...: The system failed to parse the log into the CRI format. Perform troubleshooting based on the error details and the first 512 bytes of the log.

PLUGIN_ALARM

An error occurred when plug-ins were initialized and called.

Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details.

  • init plugin error...: The system failed to initialize a plug-in.

  • hold on error...: The system failed to suspend a plug-in.

  • resume error...: The system failed to resume a plug-in.

  • start service error...: The system failed to start a plug-in of the service input type.

  • stop service error...: The system failed to stop a plug-in of the service input type.

PROCESSOR_INIT_ALARM

An error occurred in the processor_regex plug-in. The plug-in failed to compile the regular expression that is specified in the Logtail configuration.

Click the error link to view details and check whether the regular expression in the details is correct.

PROCESS_TOO_SLOW_ALARM

Logtail parses logs too slowly.

  1. Click the error link to view details and check whether the parsing speed is acceptable based on the log quantity, buffer size, and parsing time in the details.

  2. If the speed is too slow, check whether other processes on the server on which Logtail is installed occupy excessive CPU resources or whether inappropriate parsing configurations exist, such as inefficient regular expressions.

REDIS_PARSE_ADDRESS_ALARM

An error occurred in the Redis plug-in. The plug-in failed to parse the value of the ServerUrls parameter that is provided in the Logtail configuration.

Click the error link to view details. Check the URLs for which the error is reported.

REGEX_FIND_ALARM

An error occurred in the processor_regex plug-in. The plug-in failed to find the fields that are specified by the SourceKey parameter in logs.

Click the error link to view details. Check whether the SourceKey parameter is correctly configured or whether the logs are valid.

REGEX_UNMATCHED_ALARM

An error occurred in the processor_regex plug-in. The match operation of the plug-in failed.

Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details.

  • unmatch this log content...: The system failed to match logs against the regular expression specified in the Logtail configuration.

  • match result count less...: The number of matched fields is less than the number of fields specified by Keys in the Logtail configuration.

SAME_CONFIG_ALARM

Duplicate Logtail configurations are found for a Logstore. The most recent Logtail configuration that is found is discarded.

Click the error link to view details. Check whether configuration errors exist based on the details and the Logtail configuration path in the details.

SPLIT_FIND_ALARM

Errors occurred in the split_char and split_string plug-ins. The plug-ins failed to find the fields that are specified by the SourceKey parameter in logs.

Click the error link to view details. Check whether the SourceKey parameter is correctly configured or whether the logs are valid.

SPLIT_LOG_ALARM

Errors occurred in the processor_split_char and processor_split_string plug-ins. The number of parsed fields is different from the number of fields that are specified by the SplitKeys parameter.

Click the error link to view details. Check whether the SourceKey parameter is correctly configured or whether the logs are valid.

STAT_FILE_ALARM

An error occurred when the LogFileReader object was used to collect data from a file.

Click the error link to view details and perform troubleshooting based on the details and the file path in the details.

SERVICE_SYSLOG_INIT_ALARM

An error occurred in the service_syslog plug-in. The plug-in failed to be initialized.

Click the error link to view details. Check whether the Address parameter in the Logtail configuration is correctly configured.

SERVICE_SYSLOG_STREAM_ALARM

An error occurred in the service_syslog plug-in. The plug-in failed to collect data over TCP.

Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details.

  • accept error...: An error occurs when the Accept command is run. The plug-in waits for a while and tries again.

  • setKeepAlive error...: The system failed to configure the Keep Alive parameter. The plug-in skips this error and continues.

  • connection i/o timeout...: A TCP read times out. The plug-in modifies the timeout period and continues to read data.

  • scan error...: A TCP read error occurs. The plug-in waits for a while and tries again.

SERVICE_SYSLOG_PACKET_ALARM

An error occurred in the service_syslog plug-in. The plug-in failed to collect data over UDP.

Click the error link to view the sub-type of the error. The following sub-types are available. You can perform troubleshooting based on the error details.

  • connection i/o timeout...: A UDP read times out. The plug-in modifies the timeout period and continues to read data.

  • read from error...: A UDP read error occurs. The plug-in waits for a while and tries again.

PARSE_TIME_FAIL_ALARM

The system failed to parse the log time.

You can use one of the following methods to identify the cause of the error and fix the error:

  • Check whether the time field that is extracted by using a regular expression is correct.

  • Check whether the value of the specified time field matches the time expression specified in the Logtail configuration.

Application monitoring

Why is there no monitoring data after installing the agent for an ACK cluster application?

Cause

  1. Application monitoring is suspended.

  2. The ARMS agent is not loaded as expected in the pod where the application resides.

Solution

  1. Check whether application monitoring is suspended.

    1. Log on to the ARMS console. In the left-side navigation pane, choose Application Monitoring > Application List.

    2. On the Application List page, select a region in the top navigation bar and click the name of the application.

      If the application is not found, proceed to Step 2: Check whether the ARMS agent is loaded as expected.

    3. If you are using the new Application Real-Time Monitoring Service (ARMS) console, choose Configuration > Custom Configurations in the top navigation bar of the application details page. In the Probe switch settings section, check whether application monitoring is suspended.

    4. If you are using the old ARMS console, click Application Settings in the left-side navigation pane of the application details page. On the page that appears, click the Custom Configuration tab. In the Agent Switch Settings section, check whether Probe Master Switch is turned on.

  2. Check whether the agent is correctly loaded.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster to go to the cluster details page.

    2. In the left-side navigation pane, choose Workloads > Pods.

    3. On the Pods page, select the namespace in which your application resides, find the application, and then click Edit YAML in the Actions column.

    4. In the Edit YAML dialog box, check whether the YAML file contains initContainers.

      • If the YAML file does not contain initContainers, the one-pilot-initcontainer has not been injected into the pod. Perform Step 5.

      • If the YAML file contains initContainers, the one-pilot-initcontainer has been injected into the pod. Perform Step 8.

    5. In the left-side navigation pane of the cluster details page, choose Workloads > Pods. On the page that appears, set the Namespace parameter to ack-onepilot. Check if any pod named ack-onepilot-* with completed rolling updates exists in the pod list.

    6. In the left-side navigation pane of the cluster details page, choose Workloads > Deployments or StatefulSets. On the page that appears, find the application and click Edit YAML in the Actions column. In the Edit YAML dialog box, check whether the YAML file contains the following labels in the spec.template.metadata section:

      labels:
        armsPilotAutoEnable: "on"
        armsPilotCreateAppName: "<your-deployment-name>"    # Replace <your-deployment-name> with the actual application name. 
        armsSecAutoEnable: "on"    # If you want to connect the application to Application Security, you must configure this parameter.
      • If the YAML file contains the labels, perform Step 7.

      • If the YAML file does not contain the labels, perform the following operations: In the Edit YAML dialog box, add the labels to the spec > template > metadata section and replace <your-deployment-name> with the actual application name. Then, click Update.

    7. In the left-side navigation pane of the cluster details page, choose Workloads > Pods. On the page that appears, find the pod and choose More > Logs in the Actions column to check whether the pod logs of ack-onepilot report a Security Token Service (STS) error in the "Message":"STS error" format.

    8. In the left-side navigation pane of the cluster details page, choose Workloads > Pods, find the pod and click Edit YAML in the Actions column. In the Edit YAML dialog box, check whether the YAML file contains the following javaagent parameter:

      -javaagent:/home/admin/.opt/ArmsAgent/aliyun-java-agent.jar
      Note

      If you use an ARMS agent earlier than 2.7.3.5, replace aliyun-java-agent.jar in the preceding code with arms-bootstrap-1.7.0-SNAPSHOT.jar. We recommend that you upgrade the agent to the latest version at the earliest opportunity.

      • If the YAML file contains the parameter, find the pod on the Pods page and click Terminal in the Actions column to go to the command line page. Run the following command to check whether the logs directory contains log files with the .log extension. Then, submit a ticket.

        cd /home/admin/.opt/ArmsAgent/logs
      • If the YAML file does not contain the parameter, submit a ticket.
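
      You can also check for the javaagent parameter from the command line before you submit a ticket. This is a minimal sketch; <pod-name> and <namespace> are placeholders for your application pod and its namespace:

        kubectl get pod <pod-name> -n <namespace> -o yaml | grep javaagent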

ARMS Addon Token does not exist in the cluster

Issue

ARMS Addon Token does not exist in the cluster.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster to go to the cluster details page.

  2. In the left-side navigation pane, choose Configurations > Secrets.

  3. In the upper part of the page, select kube-system from the Namespace drop-down list and check whether the addon.arms.token secret exists.

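    You can run an equivalent check from the command line. This is a minimal sketch that assumes kubectl access to the cluster; if the secret does not exist, the command returns a NotFound error:

      kubectl get secret addon.arms.token -n kube-system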

Solution

Grant ARMS access permissions to Container Service for Kubernetes.

Manually add permission policies

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster.

  2. On the Basic Information tab of the Cluster Information page, click the link next to Worker RAM Role in the Cluster Resources section.

  3. On the page that appears, click Grant Permission on the Permissions tab.

  4. In the Grant Permission panel, add the following policies and click Grant permissions.

    • AliyunTracingAnalysisFullAccess: full access to Managed Service for OpenTelemetry.

    • AliyunARMSFullAccess: full access to ARMS.
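
    If you prefer the command line, the same policies can be attached with the Alibaba Cloud CLI. This is a sketch that assumes the aliyun CLI is configured with sufficient permissions; replace <worker-ram-role-name> with the Worker RAM role name that you found in step 2:

      aliyun ram AttachPolicyToRole --PolicyType System --PolicyName AliyunTracingAnalysisFullAccess --RoleName <worker-ram-role-name>
      aliyun ram AttachPolicyToRole --PolicyType System --PolicyName AliyunARMSFullAccess --RoleName <worker-ram-role-name>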

Why is monitoring data abnormal after changing the cluster or namespace of an application?

Issue

  • The value displayed in the namespace column on the custom dashboard is not updated after you change the namespace of your application.

  • After you change the cluster of your application, the data for rate, errors, and duration (RED) metrics is displayed normally but no data is displayed for container monitoring metrics, such as CPU and memory.

Cause

Container-related parameters, such as Namespace and ClusterId, are configured when the application is created and the values of these parameters cannot be automatically updated. If you change the cluster or namespace of your application, the container-related data may fail to be queried or displayed.

Solution

  • Delete the application, recreate the application, and then report monitoring data again. For more information, see Delete applications.

    This method causes the loss of historical data.

  • Submit a ticket.

How do I customize the Java agent mount path?

Background

Typically, the ack-onepilot component specifies the mount path for Application Real-Time Monitoring Service (ARMS) agents for Java by injecting the environment variable JAVA_TOOL_OPTIONS. However, you may need to customize this path for scenarios such as:

  • Centralized configuration management

    Manage the mount path through a Kubernetes ConfigMap to ensure environment consistency.

  • Persistent storage

    Store agent files in a custom persistent volume claim (PVC) to meet enterprise security or O&M requirements.

Solution

To customize the mount path for the ARMS agent for Java, make sure that the related components meet the version requirements, and then perform the following steps:

Important

This configuration also applies to Microservice Engine (MSE) due to shared ack-onepilot integration.

  1. Add the disableJavaToolOptionsInjection label to the Kubernetes workload, such as a Deployment, that requires a custom mount path.

    The ack-onepilot component will not automatically set the mount path or other Java Virtual Machine (JVM) parameters using the environment variable JAVA_TOOL_OPTIONS.

    1. To view the YAML file of the deployment, run the following command:

      kubectl get deployment {Deployment name} -o yaml
      Note

      If you're not sure about the deployment name, run the following command to list all deployments:

      kubectl get deployments --all-namespaces

      Then, find the one you want in the results and view its YAML file.

    2. Run the following command to edit the YAML file:

      kubectl edit deployment {Deployment name} -o yaml
    3. In the YAML file, add the following labels to spec.template.metadata:

      labels:
        armsPilotAutoEnable: "on"
        armsPilotCreateAppName: "<your-deployment-name>"    # The name of your deployment.
        disableJavaToolOptionsInjection: "true" # If you want to customize the mount path for the ARMS agent for Java, set this parameter to true.
  2. Replace the default mount path /home/admin/.opt/AliyunJavaAgent/aliyun-java-agent.jar in your Java startup script or command with your custom path:

    java -javaagent:/home/admin/.opt/AliyunJavaAgent/aliyun-java-agent.jar ... ... -jar xxx.jar

    Other information such as the reporting region and license key is provided by ack-onepilot through environment variables.
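
    To verify that ack-onepilot no longer injects the agent path, you can check the environment variables of a restarted application pod. This is a minimal sketch; <pod-name> and <namespace> are placeholders:

      kubectl exec <pod-name> -n <namespace> -- env | grep JAVA_TOOL_OPTIONS

    If the label takes effect, the output no longer contains a JAVA_TOOL_OPTIONS entry that points to the agent.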

How do I report data across regions in an ACK cluster?

Issue

How do you report data from Region A to Region B across regions?

Solution

  1. Update the ack-onepilot component to V4.0.0 or later.

  2. Add the ARMS_REPORT_REGION environment variable to the ack-onepilot-ack-onepilot application in the ack-onepilot namespace. The value must be the ID of a region where ARMS is available. Example: cn-hangzhou or cn-beijing.

  3. Restart the existing application or deploy a new application to report data across regions.

    Note

    After the environment variable is added, all applications deployed in the cluster report data to the region specified in the previous step.
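
  For step 2, the following command is a minimal sketch that assumes the component runs as a Deployment named ack-onepilot-ack-onepilot in the ack-onepilot namespace; the region ID is an example:

    kubectl set env deployment/ack-onepilot-ack-onepilot -n ack-onepilot ARMS_REPORT_REGION=cn-hangzhou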

How do I uninstall arms-pilot and install ack-onepilot?

Background

The old Application Monitoring agent arms-pilot is no longer maintained. You can install the new agent ack-onepilot to monitor your applications. ack-onepilot is fully compatible with arms-pilot. You can seamlessly install ack-onepilot without the need to modify application configurations. This topic describes how to uninstall arms-pilot and install ack-onepilot.

Solution

  • You must install ack-onepilot in an ACK cluster V1.16 or later. If your cluster is earlier than V1.16, upgrade the cluster first. For more information, see Update the Kubernetes version of an ACK cluster.

  • You must uninstall arms-pilot before you install ack-onepilot. If both ack-onepilot and arms-pilot are installed, the ARMS agent cannot be mounted. If arms-pilot is not completely uninstalled, ack-onepilot does not work because it detects that arms-pilot is still running in the environment.

  • When you uninstall arms-pilot and install ack-onepilot, the configurations of arms-pilot cannot be automatically synchronized to ack-onepilot. We recommend that you record the configurations and then manually configure ack-onepilot.

  1. Uninstall arms-pilot.

    1. Log on to the ACK console. On the Clusters page, click the name of the cluster.

    2. In the left-side navigation pane, choose Applications > Helm.

    3. On the Helm page, find arms-pilot and click Delete in the Actions column.

    4. In the Delete message, click OK.

  2. Check whether arms-pilot is uninstalled.

    Go to the cluster details page of the ACK console. In the left-side navigation pane, choose Workloads > Deployments. On the Deployments page, select arms-pilot from the Namespace drop-down list, and check whether the pods of the namespace are deleted as expected.

    Note

    If you have modified the namespace to which arms-pilot belongs, select the new namespace.

  3. Install ack-onepilot.

    1. Log on to the ACK console. On the Clusters page, click the name of the cluster.

    2. In the left-side navigation pane, click Add-ons. On the Add-ons page, search for ack-onepilot.

    3. Click Install on the ack-onepilot card.

      Note

      By default, the ack-onepilot component supports 1,000 pods. For every additional 1,000 pods in the cluster, you must add 0.5 CPU cores and 512 MB memory for the component.

    4. In the dialog box that appears, configure the parameters and click OK. We recommend that you use the default values.

      Note

      After you install ack-onepilot, you can upgrade, configure, or uninstall it on the Add-ons page.

  4. Check whether ack-onepilot is installed.

    Go to the cluster details page of the ACK console. In the left-side navigation pane, choose Workloads > Deployments. On the Deployments page, select ack-onepilot from the Namespace drop-down list, and check whether the pods of the namespace are running as expected.
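
    You can also verify both the uninstallation and the installation from the command line. This is a minimal sketch that assumes the default release and namespace names:

      # arms-pilot should no longer be listed as a Helm release.
      helm list -A | grep arms-pilot
      # The ack-onepilot pods should be in the Running state.
      kubectl get pods -n ack-onepilot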

Alibaba Cloud Prometheus monitoring

No relevant dashboard found on the Prometheus monitoring page

Issue

After you enable Managed Service for Prometheus, log on to the ACK console and choose Operations > Prometheus Monitoring in the left-side navigation pane. On the Prometheus Monitoring page, the "No dashboard is found" error message is displayed. To resolve the issue, perform the following steps:

Solution

  1. Reinstall the ack-arms-prometheus component.

    1. Uninstall the ack-arms-prometheus component.

      1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

      2. On the Clusters page, find the one you want to manage and click its name. In the left-side navigation pane, click Add-ons.

      3. On the Add-ons page, click the Logs and Monitoring tab and find the ack-arms-prometheus component. Click Uninstall. In the dialog box that appears, click OK.

    2. Reinstall the ack-arms-prometheus component.

      1. Click Install. In the dialog box that appears, click OK.

      2. Wait for the installation to complete. Go to the Prometheus Monitoring page to check whether the issue is resolved.

        If the issue persists, perform the following steps.

  2. Check whether the Prometheus instance is connected.

    1. Log on to the ARMS console.

    2. In the left-side navigation pane, click Integration Management.

    3. On the Integrated Environments tab, check whether a container environment whose name is the same as the cluster exists in the Container Service list.

If the issue persists, submit a ticket to contact technical support.
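
Before you submit a ticket, you can also confirm from the command line that the component pods are running. This is a minimal sketch that assumes the component is deployed in the arms-prom namespace:

  kubectl get pods -n arms-prom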

Why is Managed Service for Prometheus data abnormal and cannot be displayed?

Cause

Managed Service for Prometheus data cannot be displayed if the job that synchronizes the cluster with Managed Service for Prometheus fails, which causes resource registration to fail, or if the Prometheus instance is not properly connected. Perform the following steps to troubleshoot the issue.

Solution

  1. Check the status of the job for accessing Managed Service for Prometheus.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Jobs.

    3. At the top of the Jobs page, set Namespace to arms-prom, and then click o11y-init-environment to check whether the job is successful. A command-line sketch for this check is provided at the end of this section.

      If the job is not successful, data may fail to be synchronized to Managed Service for Prometheus and resource registration may fail. You can view the pod logs to find the specific failure reason. For more information, see Troubleshoot pod issues.

      If no pod exists, continue with the following steps.

  2. Reinstall the Prometheus monitoring component.

    1. Uninstall the ack-arms-prometheus component.

      1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

      2. On the Clusters page, find the one you want to manage and click its name. In the left-side navigation pane, click Add-ons.

      3. On the Add-ons page, click the Logs and Monitoring tab and find the ack-arms-prometheus component. Click Uninstall. In the dialog box that appears, click OK.

    2. Reinstall the ack-arms-prometheus component.

      1. Click Install. In the dialog box that appears, click OK.

      2. Wait for the installation to complete. Go to the Prometheus Monitoring page to check whether the issue is resolved.

        If the issue persists, perform the following steps.

  3. Check whether the Prometheus instance is connected.

    1. Log on to the ARMS console.

    2. In the left-side navigation pane, click Integration Management.

    3. On the Integrated Environments tab, check whether a container environment whose name is the same as the cluster exists in the Container Service list.

If the preceding steps do not resolve the issue, submit a ticket to contact technical support for help.
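
Before you submit a ticket, you can also inspect the o11y-init-environment job mentioned in step 1 from the command line. This is a minimal sketch:

  kubectl get jobs -n arms-prom
  kubectl logs job/o11y-init-environment -n arms-prom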

Error "rendered manifests contain a resource that already exists" when reinstalling Alibaba Cloud Prometheus monitoring

Issue

When you uninstall and reinstall the Prometheus agent, the following error message appears:

rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: kind: ClusterRole, namespace: , name: arms-pilot-prom-k8s

Cause

After you run commands to manually uninstall the Prometheus agent, resources such as roles may fail to be deleted.

Solution

  1. Run the following command to find the ClusterRoles of the Prometheus agent:

    kubectl get ClusterRoles --all-namespaces | grep prom

  2. Run the following command to delete the ClusterRoles that are queried in the previous step:

     kubectl delete ClusterRole [$Cluster_Roles] -n arms-prom
    Note

    The [$Cluster_Roles] parameter specifies the ClusterRoles that are queried in the previous step.

  3. If the issue persists after you delete the ClusterRoles, view the value of kind in the error message to check whether resources other than ClusterRoles exist. Perform the preceding operations to delete them in sequence.
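
    For example, if the value of kind in the error message is ClusterRoleBinding, the cleanup is similar. The [$Cluster_Role_Bindings] placeholder below is hypothetical and follows the same convention as [$Cluster_Roles]:

    kubectl get ClusterRoleBinding | grep prom
    kubectl delete ClusterRoleBinding [$Cluster_Role_Bindings]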

How do I check the version of the ack-arms-prometheus component?

Background

You need to check the version of the ack-arms-prometheus component deployed in your cluster and whether it needs to be updated.

Solution

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the one you want to manage and click its name. In the left-side navigation pane, click Add-ons.

  3. On the Add-ons page, click the Logs and Monitoring tab and find the ack-arms-prometheus component.

    The version number is displayed in the lower part of the component. If a new version is available, click Upgrade on the right side to update the component.

    Note

    The Upgrade button is displayed only if the component is not updated to the latest version.

Why can't GPU monitoring be deployed?

Cause

Managed Service for Prometheus may be unable to monitor GPU-accelerated nodes that are configured with taints. You can perform the following steps to view the taints of a GPU-accelerated node.

Solution

  1. Run the following command to view the taints of a GPU-accelerated node:

    kubectl describe node cn-beijing.47.100.***.***

    If you added custom taints to the GPU-accelerated node, the output shows information about the custom taints. In this example, a taint whose key is test-key, value is test-value, and effect is NoSchedule is added to the node.

    Expected output:

    Taints: test-key=test-value:NoSchedule
  2. Use one of the following methods to handle the taint:

    • Run the following command to delete the taint from the GPU-accelerated node:

      kubectl taint node cn-beijing.47.100.***.*** test-key=test-value:NoSchedule-
    • Add a toleration rule that allows pods to be scheduled to the GPU-accelerated node with the taint.

      # 1. Run the following command to modify ack-prometheus-gpu-exporter:
      kubectl edit daemonset -n arms-prom ack-prometheus-gpu-exporter

      # 2. Add the following fields to the YAML file to tolerate the taint:
      # Other fields are omitted.
      # The tolerations field must be added above the containers field and both fields must be of the same level. 
      tolerations:
      - key: "test-key"
        operator: "Equal"
        value: "test-value"
        effect: "NoSchedule"
      containers:
       # Irrelevant fields are not shown.

How do I completely delete Managed Service for Prometheus resources when residual resources cause reinstallation to fail?

Background

If you delete only the namespace of Managed Service for Prometheus, resource configurations are retained. In this case, you may fail to reinstall ack-arms-prometheus. You can perform the following operations to delete the residual resource configurations:

Solution

  • Run the following command to delete the arms-prom namespace:

    kubectl delete namespace arms-prom
  • Run the following commands to delete the related ClusterRoles:

    kubectl delete ClusterRole arms-kube-state-metrics
    kubectl delete ClusterRole arms-node-exporter
    kubectl delete ClusterRole arms-prom-ack-arms-prometheus-role
    kubectl delete ClusterRole arms-prometheus-oper3
    kubectl delete ClusterRole arms-prometheus-ack-arms-prometheus-role
    kubectl delete ClusterRole arms-pilot-prom-k8s
    kubectl delete ClusterRole gpu-prometheus-exporter
    kubectl delete ClusterRole o11y:addon-controller:role
    kubectl delete ClusterRole arms-aliyunserviceroleforarms-clusterrole
  • Run the following commands to delete the related ClusterRoleBindings:

    kubectl delete ClusterRoleBinding arms-node-exporter
    kubectl delete ClusterRoleBinding arms-prom-ack-arms-prometheus-role-binding
    kubectl delete ClusterRoleBinding arms-prometheus-oper-bind2
    kubectl delete ClusterRoleBinding arms-kube-state-metrics
    kubectl delete ClusterRoleBinding arms-pilot-prom-k8s
    kubectl delete ClusterRoleBinding arms-prometheus-ack-arms-prometheus-role-binding
    kubectl delete ClusterRoleBinding gpu-prometheus-exporter
    kubectl delete ClusterRoleBinding o11y:addon-controller:rolebinding
    kubectl delete ClusterRoleBinding arms-kube-state-metrics-agent
    kubectl delete ClusterRoleBinding arms-node-exporter-agent
    kubectl delete ClusterRoleBinding arms-aliyunserviceroleforarms-clusterrolebinding
  • Run the following commands to delete the related Roles and RoleBindings:

    kubectl delete Role arms-pilot-prom-spec-ns-k8s
    kubectl delete Role arms-pilot-prom-spec-ns-k8s -n kube-system
    kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s
    kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s -n kube-system

After you delete the residual resource configurations, go to the ACK console, choose Operations > Add-ons, and reinstall the ack-arms-prometheus component.
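
Before you reinstall the component, you can confirm that the cleanup is complete. This is a minimal sketch; if the namespace has already been deleted, the second command returns a NotFound error, which is expected:

  kubectl get ClusterRole,ClusterRoleBinding | grep -E 'arms|o11y|gpu-prometheus'
  kubectl get namespace arms-prom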

Error "xxx in use" when installing the ack-arms-prometheus component

Cause

When you deploy the ack-arms-prometheus component, an error message indicating "xxx in use" appears. This suggests that resources are being used or residual resources exist, causing the component installation to fail.

Solution

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage. Then, click the name of the cluster or click Details in the Actions column.

  3. In the left-side navigation pane of the cluster details page, choose Applications > Helm. On the Helm page, check whether the ack-arms-prometheus application is displayed.

    • If the ack-arms-prometheus application is displayed on the Helm page, delete the ack-arms-prometheus application and then install ack-arms-prometheus on the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.

    • If the ack-arms-prometheus application is not displayed on the Helm page, perform the following steps:

      1. Residual data exists after the ack-arms-prometheus application was deleted. You must manually delete the residual data. For more information about how to delete the residual data related to ack-arms-prometheus, see FAQ.

      2. Install ack-arms-prometheus on the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.

      3. If the issue persists, submit a ticket.

Installation of the ack-arms-prometheus component fails even though the console shows "Component Not Installed"

Issue

When you try to install the ack-arms-prometheus component, you first see a "Component Not Installed" prompt, but subsequent installation attempts still fail.

Solution

  • Check whether ack-arms-prometheus is already installed.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage. Then, click the name of the cluster or click Details in the Actions column.

    3. In the left-side navigation pane of the cluster details page, choose Applications > Helm.

      Check whether the ack-arms-prometheus application is displayed on the Helm page.

      • If ack-arms-prometheus is displayed on the Helm page, delete ack-arms-prometheus on the Helm page and then install ack-arms-prometheus from the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.

      • If the ack-arms-prometheus application is not displayed on the Helm page, perform the following operations:

        1. Residual data exists after ack-arms-prometheus was deleted. You must manually delete the residual data. For more information about how to delete the residual data related to ack-arms-prometheus, see FAQ.

        2. Install ack-arms-prometheus on the Add-ons page. For more information about how to install ack-arms-prometheus, see Manage components.

        3. If the issue persists, submit a ticket.

  • Check whether errors are reported in the log of ack-arms-prometheus.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Deployments.

    3. In the upper part of the Deployments page, set Namespace to arms-prom and then click arms-prometheus-ack-arms-prometheus.

    4. Click the Logs tab and check whether errors are reported in the log.

      If errors are reported in the log, submit a ticket.

  • Check whether installation errors are reported by the Prometheus agent.

    1. Log on to the ARMS console.

    2. In the left-side navigation pane, click Integration Management.

    3. On the Integrated Environments tab, view the environment list on the Container Service tab. Find the ACK environment instance and click Configure Agent in the Actions column. The Configure Agent page appears.

    4. Check whether the installed agents run as expected. If an error is reported, submit a ticket.

Open-source Prometheus monitoring

How do I configure DingTalk alert notifications?

Issue

After deploying open-source Prometheus, you need to configure alert notifications through DingTalk.

Solution

  1. Obtain the webhook URL of your DingTalk chatbot. For more information, see Event monitoring.

  2. On the Parameters wizard page, find the dingtalk section, set enabled to true, and then specify the webhook URL of your DingTalk chatbot in the token field. For more information, see Configure DingTalk alert notifications in Alert configurations.
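
The following snippet is a minimal sketch of the dingtalk section described in the preceding step, shown as it would appear in the release values on the Parameters wizard page. Only the enabled and token fields are taken from this document; the webhook URL is a placeholder and the surrounding structure may differ in your chart version.

dingtalk:
  enabled: true                                                                    # turn on DingTalk notifications
  token: "https://oapi.dingtalk.com/robot/send?access_token=<your-access-token>"   # webhook URL of your chatbot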

Error when deploying prometheus-operator

Issue

Can't install release with errors: rpc error: code = Unknown desc = object is being deleted: customresourcedefinitions.apiextensions.k8s.io "xxxxxxxx.monitoring.coreos.com" already exists

Solution

The error message indicates that the cluster fails to clear custom resource definition (CRD) objects of the previous deployment. Run the following commands to delete the CRD objects. Then, deploy prometheus-operator again:

kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com

Email alerts are not working

Issue

After deploying open-source Prometheus, your configured email alerts are not sending alert notifications.

Solution

Make sure that the value of smtp_auth_password is the SMTP authorization code instead of the logon password of the email account. Make sure that the SMTP server endpoint includes a port number.
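
For reference, the following is a minimal sketch of the relevant global fields in an Alertmanager configuration. The host, addresses, and code are placeholders, and the field names follow the standard open-source Alertmanager configuration; adjust them to match how your release exposes the Alertmanager settings.

global:
  smtp_smarthost: 'smtp.example.com:465'            # SMTP endpoint, including the port number
  smtp_from: 'alerts@example.com'                   # sender address
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: '<SMTP-authorization-code>'   # authorization code, not the logon password
  smtp_require_tls: false                           # set according to your SMTP provider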

Error "Cluster cannot be accessed, please try again or submit a ticket" when clicking YAML update

Issue

After deploying open-source Prometheus, when you click YAML update, the error "The current cluster cannot be accessed temporarily. Please try again later or submit a ticket for feedback" appears.

Solution

If the Tiller configuration file is too large, the cluster cannot be accessed. To solve this issue, delete some annotations from the configuration file and mount the file to the pod as a ConfigMap. You can specify the name of the ConfigMap in the configMaps field of the prometheus and alertmanager sections. For more information, see the second method in Mount a ConfigMap to Prometheus.
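
The sketch below assumes that the configMaps field sits under prometheusSpec and alertmanagerSpec, as in the upstream prometheus-operator chart, and uses a hypothetical ConfigMap named special-config; verify the exact nesting in your release values.

prometheus:
  prometheusSpec:
    configMaps:
      - special-config        # ConfigMap that holds the externalized configuration file
alertmanager:
  alertmanagerSpec:
    configMaps:
      - special-config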

How do I enable features after deploying prometheus-operator?

Issue

After deploying open-source Prometheus, you may need to further configure it to enable specific features.

Solution

After prometheus-operator is deployed, perform the following steps to enable its features: Go to the cluster details page and choose Applications > Helm in the left-side navigation pane. On the Helm page, find ack-prometheus-operator and click Update in the Actions column. In the Update Release panel, modify the configuration to enable the features that you need. Then, click OK.

How do I choose between TSDB and Alibaba Cloud disk storage?

Issue

When selecting a storage solution, how do you choose between TSDB and Alibaba Cloud disk, and how do you configure the data reclamation policy?

Solution

TSDB storage is available only in certain regions, whereas disk storage is supported in all regions. The data reclamation (retention) policy is configured in the release values, as outlined in the sketch below.
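
The following sketch illustrates one way to configure disk storage and the retention period. It assumes the upstream prometheus-operator chart layout (retention and storageSpec under prometheusSpec) and the alicloud-disk-ssd storage class; the TSDB option and the exact field names in your release may differ, so treat this only as a starting point.

prometheus:
  prometheusSpec:
    retention: 15d                          # how long data is kept before it is reclaimed
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: alicloud-disk-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi                 # Alibaba Cloud disks require at least 20 GiB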

Issues with Grafana Dashboard display

Issue

After deploying open-source Prometheus, there are display issues with the Grafana Dashboard.

Solution

Go to the cluster details page and choose Applications > Helm in the left-side navigation pane. On the Helm page, find ack-prometheus-operator and click Update in the Actions column. In the Update Release panel, check whether the value of the clusterVersion field is correct. If the Kubernetes version of your cluster is earlier than 1.16, set clusterVersion to 1.14.8-aliyun.1. If the Kubernetes version of your cluster is 1.16 or later, set clusterVersion to 1.16.6-aliyun.1.
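
For example, assuming clusterVersion is a top-level key in the release values shown in the Update Release panel, the setting for a cluster that runs Kubernetes 1.16 or later would look like the following sketch.

clusterVersion: 1.16.6-aliyun.1   # use 1.14.8-aliyun.1 if the cluster runs a Kubernetes version earlier than 1.16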

Failure to reinstall ack-prometheus-operator after deleting its namespace

Cause

After you delete the ack-prometheus namespace, the related resource configurations may be retained. In this case, you may fail to install ack-prometheus again. You can perform the following operations to delete the related resource configurations:

Solution

  1. Delete role-based access control (RBAC)-related resource configurations.

    1. Run the following commands to delete the related ClusterRoles:

      kubectl delete ClusterRole ack-prometheus-operator-grafana-clusterrole
      kubectl delete ClusterRole ack-prometheus-operator-kube-state-metrics
      kubectl delete ClusterRole psp-ack-prometheus-operator-kube-state-metrics
      kubectl delete ClusterRole psp-ack-prometheus-operator-prometheus-node-exporter
      kubectl delete ClusterRole ack-prometheus-operator-operator
      kubectl delete ClusterRole ack-prometheus-operator-operator-psp
      kubectl delete ClusterRole ack-prometheus-operator-prometheus
      kubectl delete ClusterRole ack-prometheus-operator-prometheus-psp
    2. Run the following commands to delete the related ClusterRoleBindings:

      kubectl delete ClusterRoleBinding ack-prometheus-operator-grafana-clusterrolebinding
      kubectl delete ClusterRoleBinding ack-prometheus-operator-kube-state-metrics
      kubectl delete ClusterRoleBinding psp-ack-prometheus-operator-kube-state-metrics
      kubectl delete ClusterRoleBinding psp-ack-prometheus-operator-prometheus-node-exporter
      kubectl delete ClusterRoleBinding ack-prometheus-operator-operator
      kubectl delete ClusterRoleBinding ack-prometheus-operator-operator-psp
      kubectl delete ClusterRoleBinding ack-prometheus-operator-prometheus
      kubectl delete ClusterRoleBinding ack-prometheus-operator-prometheus-psp
  2. Run the following commands to delete the related CRD objects:

    kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
    kubectl delete crd alertmanagers.monitoring.coreos.com
    kubectl delete crd podmonitors.monitoring.coreos.com
    kubectl delete crd probes.monitoring.coreos.com
    kubectl delete crd prometheuses.monitoring.coreos.com
    kubectl delete crd prometheusrules.monitoring.coreos.com
    kubectl delete crd servicemonitors.monitoring.coreos.com
    kubectl delete crd thanosrulers.monitoring.coreos.com

Alert management

Alert rule synchronization failure with error message "The Project does not exist : k8s-log-xxx"

Issue

The alert rule synchronization status in the alert center shows the error message The Project does not exist : k8s-log-xxx.

Cause

You did not create an event center in Log Service for your cluster.

Solution

  1. In the Simple Log Service console, check whether the number of projects has reached the quota limit. For more information about resource quotas, see Basic resources.

    1. If you have reached the quota limit, delete unnecessary projects or submit a ticket to request an increase in the project resource quota limit. For information about how to delete a project, see Manage projects.

    2. If you have not reached the quota limit, perform the following steps.

  2. Reinstall ack-node-problem-detector.

    When you reinstall the component, a default project named k8s-log-xxxxxx is created.

    1. Uninstall ack-node-problem-detector.

      1. In the left-side navigation pane of the details page of the target cluster in the ACK console, choose Operations > Components.

      2. Click the Logs & Monitoring tab. In the ack-node-problem-detector card, click Uninstall. In the dialog box that appears, click OK.

    2. After the uninstallation is complete, install ack-node-problem-detector.

      1. In the left-side navigation pane, choose Operations > Alert Configuration.

      2. On the Alert Configuration page, click Start Installation. The console automatically creates a project and installs and upgrades the components.

  3. On the Alert Configuration page, turn off the switch in the Enabled column for the corresponding alert rule set. Wait until Alert Rule Status changes to Rule Disabled, and then turn on the switch to retry.

Alert rule synchronization failure with error message "this rule have no xxx contact groups reference"

Issue

The alert rule fails to be synchronized and an error message similar to this rule have no xxx contact groups reference is returned.

Cause

No contact group subscribes to the alert rule.

Solution

  1. Create a contact group and add contacts.

  2. Click Edit Notification Object on the right side of the corresponding alert rule set and configure a contact group that subscribes to the alert rule set.

Other issues

Why is there no data when running kubectl top pod/node?

Issue

When you run kubectl top pod or kubectl top node in the command line, no data is returned.

Solution

  1. Run the following command to check whether the metrics-server API service works as expected:

    kubectl get apiservices | grep metrics-server

    If v1beta1.metrics.k8s.io shows True in the returned result, the metrics-server API service is working normally.

  2. Optional: If the metrics-server API service is not working normally, run the following commands on the cluster node to check whether metrics-server can be accessed through port 443 and port 8082 within the cluster:

    curl -v <metrics-server_Pod_IP>:8082/apis/metrics/v1alpha1/nodes
    
    curl -v <metrics-server_Pod_IP>:443/apis/metrics/v1alpha1/nodes

    If data is returned after you run the preceding commands, metrics-server can be accessed through port 443 and port 8082 within the cluster.

  3. Optional: If metrics-server cannot be accessed through port 443 and port 8082 within the cluster, restart metrics-server.

    You can delete the pod that runs metrics-server to restart metrics-server.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Workloads > Deployments.

    3. In the upper part of the Deployments page, set Namespace to kube-system and then click metrics-server.

    4. On the Pods tab, choose More > Delete in the Actions column of the metrics-server pod, and then click OK in the dialog box that appears.

If you still cannot identify the issue after performing the preceding checks, submit a ticket by using the following template.

Ticket template:

  1. Does the metrics-server API service work as expected?

    Yes

  2. Can metrics-server be accessed through port 443 and port 8082 within the cluster?

    Yes

  3. Provide the cluster ID.

Why is there partial data missing when running kubectl top pod/node?

Issue

When you run kubectl top pod or kubectl top node in the command line, some data is missing.

Solution

Perform the following checks:

  • Check whether the data of all pods on a node is missing or only the data of some pods is missing. If the data of all pods on a node is missing, check whether a timezone difference exists on the node. You can run the date command on the node to check its time and time zone.

  • Check whether the pod that runs metrics-server can connect to the node through port 10255.

If you still cannot identify the issue after performing the preceding checks, submit a ticket by using the following template.

Ticket template:

  1. Is the missing data about all pods on a specific node?

    Yes

  2. Does a timezone difference exist on the node?

    No

  3. Can metrics-server connect to the specific node?

    Yes

What should I do if HPA cannot obtain metrics data?

Issue

When using Kubernetes Horizontal Pod Autoscaler (HPA), you may encounter situations where it cannot obtain metrics data.

Solution

Perform the following checks:

Check the result of running kubectl top pod for the corresponding pod. If the data is abnormal, see Why is there no data when running kubectl top pod/node? and Why is there partial data missing when running kubectl top pod/node? for troubleshooting.

If you still cannot identify the issue after performing the preceding checks, submit a ticket by using the following template.

Ticket template:

  1. Does the monitoring data show anomalies?

    No

  2. Run kubectl describe hpa <hpa_name> and submit the metadata information.

Why does HPA create extra pods during rolling updates?

Issue

When performing a Kubernetes rolling update, you may notice that HPA (Horizontal Pod Autoscaler) unexpectedly launches additional pods.

Solution

Perform the following checks:

Check whether metrics-server is upgraded to the latest version. If the version is correct, you can use the kubectl edit deployments -n kube-system metrics-server command to add the following startup parameters to the command section:

--metric-resolution=15s
--enable-hpa-rolling-update-skipped=true
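
The following sketch shows where the two parameters would land in the metrics-server Deployment after you run kubectl edit deployments -n kube-system metrics-server. The container name and the existing command entries are assumptions; keep whatever your Deployment already defines and only append the two flags.

spec:
  template:
    spec:
      containers:
        - name: metrics-server                          # name may differ in your Deployment
          command:
            - /metrics-server                           # keep the existing entries unchanged
            - --metric-resolution=15s                   # set the metric collection interval to 15 seconds
            - --enable-hpa-rolling-update-skipped=true  # prevent HPA from creating extra pods during rolling updates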

If you still cannot identify the issue after performing the preceding checks, submit a ticket by using the following template.

Ticket template:

  1. Is metrics-server upgraded to the latest version?

    Yes

  2. Are startup parameters added to prevent excess pods?

    Yes

  3. Run kubectl describe hpa <hpa_name> and submit the HPA description.