When a component operation fails in the Container Service for Kubernetes (ACK) console, the console displays an error code. This topic describes the component error codes, their causes, and their solutions.
Error code reference
| Error code | Description |
|---|---|
| AddonOperationFailed.ResourceExists | A resource required by the component already exists in the cluster |
| AddonOperationFailed.ReleaseNameInUse | A Helm release with the same name as the component already exists |
| AddonOperationFailed.WaitForAddonReadyTimeout | Component pods cannot reach the Ready state after the update request is submitted |
| AddonOperationFailed.APIServerUnreachable | ACK cannot access the Kubernetes API server |
| AddonOperationFailed.ResourceNotFound | Resources required by the component cannot be found |
| AddonOperationFailed.TillerUnreachable | Helm V2 Tiller is inaccessible |
| AddonOperationFailed.FailedCallingWebhook | A mutating webhook for a component resource cannot be called |
| AddonOperationFailed.UserForbidden | Tiller lacks the required role-based access control (RBAC) permissions |
| AddonOperationFailed.TillerNotFound | No Tiller pod is running in the cluster |
| AddonOperationFailed.ErrPatchingClusterRoleBinding | A ClusterRoleBinding required by the component exists but has a conflicting configuration |
| AddonOperationFailed.ErrApplyingPatch | The component's YAML manifests are incompatible between versions |
AddonOperationFailed.ResourceExists
Symptoms
The console displays an error message similar to:
Addon status not match, failed upgrade helm addon arms-cmonitor for cluster c3cf94b952cd34b54b71b10b7********, err: rendered manifests contain a resource that already exists. Unable to continue with update: ConfigMap "otel-collector-config" in namespace "arms-prom" exists and cannot be imported into the current release
Cause
A resource required by the component already exists in the cluster, preventing installation. This typically happens when:
Another version of the component (such as the open-source version) is already installed by a different method.
The component was installed with Helm V2, and its resources were not removed before migrating to Helm V3.
A resource with the same name as one required by the component was created manually.
Solution
Delete the conflicting resources identified in the error message, then retry the installation or update.
The error message tells you which resource to delete. The following sections list the exact commands for specific components.
arms-prometheus
For arms-prometheus, delete the namespace in which arms-prometheus is installed (in most cases, the arms-prom namespace), then run the following commands to delete the cluster-scoped resources left behind by the release. After the resources are deleted, install or update arms-prometheus again.
kubectl delete ClusterRole arms-kube-state-metrics
kubectl delete ClusterRole arms-node-exporter
kubectl delete ClusterRole arms-prom-ack-arms-prometheus-role
kubectl delete ClusterRole arms-prometheus-oper3
kubectl delete ClusterRole arms-prometheus-ack-arms-prometheus-role
kubectl delete ClusterRole arms-pilot-prom-k8s
kubectl delete ClusterRoleBinding arms-node-exporter
kubectl delete ClusterRoleBinding arms-prom-ack-arms-prometheus-role-binding
kubectl delete ClusterRoleBinding arms-prometheus-oper-bind2
kubectl delete ClusterRoleBinding kube-state-metrics
kubectl delete ClusterRoleBinding arms-pilot-prom-k8s
kubectl delete ClusterRoleBinding arms-prometheus-ack-arms-prometheus-role-binding
kubectl delete Role arms-pilot-prom-spec-ns-k8s
kubectl delete Role arms-pilot-prom-spec-ns-k8s -n kube-system
kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s
kubectl delete RoleBinding arms-pilot-prom-spec-ns-k8s -n kube-system
ack-node-local-dns
For ack-node-local-dns, delete the conflicting MutatingWebhookConfiguration:
kubectl delete MutatingWebhookConfiguration ack-node-local-dns-admission-controller
After the resource is deleted, update ack-node-local-dns. Your workloads are not affected by the deletion. However, do not add pods between the deletion and the update. If you do, delete and recreate those pods after the component update to reinject the DNS cache.
arms-cmonitor
For arms-cmonitor, run the following commands to delete the conflicting resources.
kubectl delete ConfigMap otel-collector-config -n arms-prom
kubectl delete ClusterRoleBinding arms-prom-cmonitor-role-binding
kubectl delete ClusterRoleBinding arms-prom-cmonitor-install-init-role-binding
kubectl delete ClusterRole arms-prom-cmonitor-role
kubectl delete ClusterRole arms-prom-cmonitor-install-init-role
kubectl delete ServiceAccount cmonitor-sa-install-init -n kube-system
After the resources are deleted, install or update arms-cmonitor.
AddonOperationFailed.ReleaseNameInUse
Cause
A Helm release with the same name as the component already exists in the cluster. This prevents Helm from installing or updating the component. Common causes:
Another version of the component is installed by a different method.
A leftover Helm release remains from a previous installation attempt.
Solution
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left pane, choose Applications > Helm.
Find the Helm release named after the component. In the Actions column, click Delete. In the dialog box, select Clear Release Records and click OK.
Install or update the component.
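If you prefer the command line, you can locate and remove the conflicting release with Helm. This is a sketch: the release name and namespace are placeholders that you must replace with the values from the error message.

```shell
# List releases in all namespaces and look for one named after the component.
helm list --all-namespaces

# Uninstall the conflicting release. Replace <release> and <namespace>
# with the values found above.
helm uninstall <release> -n <namespace>
```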
AddonOperationFailed.WaitForAddonReadyTimeout
Cause
The update request was submitted, but the component pods cannot reach the Ready state within the timeout period.
Troubleshooting
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left pane, choose Operations > Event Center.
On the Events (Cluster Resource Events) tab, set Level to Warning, select the namespace where the component is deployed, and set Type to Pod. Review the event details to identify the cause. The following section describes common causes and their solutions.
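The same events can also be inspected with kubectl. A sketch, assuming the component runs in the kube-system namespace (adjust -n as needed):

```shell
# Show Warning events for pods in the component's namespace,
# sorted so the most recent events appear last.
kubectl get events -n kube-system \
  --field-selector type=Warning,involvedObject.kind=Pod \
  --sort-by=.lastTimestamp

# Describe a specific pod to see its full event history.
kubectl describe pod <pod-name> -n kube-system
```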
Common causes and solutions
Cause 1: Pods cannot be scheduled (FailedScheduling)
The nodes in the cluster do not meet the scheduling requirements for the component pods. Check the event details for one of the following messages:
| Event message | Cause | Solution |
|---|---|---|
| Insufficient memory or Insufficient cpu | Nodes lack sufficient resources | Delete unneeded pods, add nodes to the cluster, or upgrade node configurations |
| the pod didn't tolerate | Node taints are not tolerated by the component pods | Remove the taints from the nodes |
| didn't match pod anti-affinity rules | Anti-affinity rules cannot be satisfied | Add nodes to the cluster |
After resolving the scheduling issue, update the component again.
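For example, if the events contain "the pod didn't tolerate", you can list and remove node taints with kubectl. The node name and taint key below are placeholders:

```shell
# List the taints on every node.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Remove a taint from a node (the trailing "-" deletes the taint).
# Replace <node-name> and <taint-key> with your values.
kubectl taint nodes <node-name> <taint-key>-
```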
Cause 2: Pod sandbox cannot be created (FailedCreatePodSandBox)
The network plugin cannot allocate IP addresses to pods. Check the event details:
If the message contains vSwitch have insufficient IP, the pod vSwitches have run out of IP addresses. Add pod vSwitches for the cluster in Terway mode.
If the message contains transport: Error while dialing, troubleshoot the pod to check whether the cluster's network plugin is working correctly.
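To check whether the network plugin pods are healthy, you can inspect them in the kube-system namespace. A sketch; the label selectors assume the default Terway and Flannel deployments and may differ in your cluster:

```shell
# Terway clusters: check the Terway pods.
kubectl get pods -n kube-system -l app=terway -o wide

# Flannel clusters: check the Flannel DaemonSet pods.
kubectl get pods -n kube-system -l app=flannel -o wide

# Inspect the logs of an unhealthy network plugin pod.
kubectl logs <pod-name> -n kube-system
```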
AddonOperationFailed.APIServerUnreachable
Cause
ACK cannot reach the Kubernetes API server. The most common cause is that the Server Load Balancer (SLB) instance exposing the API server is misconfigured or not working as expected.
Solution
Check the status and configuration of the SLB instance that exposes the API server and restore it to a working state. After the API server is reachable again, retry the component operation.
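You can confirm whether the API server is reachable before retrying the operation:

```shell
# Verify that kubectl can reach the API server through the address
# in your kubeconfig (the SLB address for ACK clusters).
kubectl cluster-info

# A timeout or connection-refused error here points to the SLB instance
# or its listener configuration.
kubectl get --raw /healthz
```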
AddonOperationFailed.ResourceNotFound
Cause
Resources required by the component are missing—likely deleted or modified externally—so the component cannot be updated in place.
Solution
Uninstall the component and install the latest version.
AddonOperationFailed.TillerUnreachable
Cause
The component uses Helm V2, which depends on Tiller for installation and updates. Tiller has encountered an error and is inaccessible.
Solution
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left pane, choose Workloads > Pods.
Select the kube-system namespace. Find the Tiller pod and delete it. The system automatically recreates the pod.
After the Tiller pod reaches the Ready state, retry the component operation.
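The same steps can be performed with kubectl. A sketch, assuming the default labels that Helm V2 applies to the Tiller pod:

```shell
# Find the Tiller pod in the kube-system namespace.
kubectl get pods -n kube-system -l app=helm,name=tiller

# Delete it; the tiller-deploy Deployment recreates it automatically.
kubectl delete pod -n kube-system -l app=helm,name=tiller

# Watch until the new pod is Ready before retrying the operation.
kubectl get pods -n kube-system -l app=helm,name=tiller -w
```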
AddonOperationFailed.FailedCallingWebhook
Symptoms
The console displays an error message similar to:
failed to create: Internal error occurred: failed calling webhook "rancher.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s": no endpoints available for service "rancher-webhook"
Cause
A mutating webhook configured for a component resource cannot be called, blocking resource updates.
Solution
Troubleshoot the failing webhook and fix the issue, then update the component again. You can identify the webhook that cannot be called from the error message.
In the example above, the rancher-webhook webhook in the cattle-system namespace is unavailable.
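To confirm which webhook is failing and whether its backing service has endpoints, you can run the following commands. The service name and namespace are taken from the example error message; substitute your own:

```shell
# List mutating webhook configurations in the cluster.
kubectl get mutatingwebhookconfigurations

# Check whether the webhook's service has any ready endpoints.
# "no endpoints available" in the error means this list is empty.
kubectl get endpoints rancher-webhook -n cattle-system

# Check the pods that should back the webhook service.
kubectl get pods -n cattle-system
```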
AddonOperationFailed.UserForbidden
Cause
The cluster uses Helm V2, but Tiller lacks the RBAC permissions needed to query and update resources, preventing component installation or updates.
Solution
Grant the required RBAC permissions to Tiller. For details, see Role-based access control.
AddonOperationFailed.TillerNotFound
Cause
The cluster uses Helm V2, but no Tiller pod is running normally in the cluster.
Solution
Troubleshoot the tiller-deploy pod in the kube-system namespace. After the pod runs normally, retry the component operation. For troubleshooting steps, see Troubleshoot pod issues.
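A sketch of the first troubleshooting steps for the tiller-deploy pod, assuming the default Helm V2 labels:

```shell
# Check whether a Tiller pod exists and is Running.
kubectl get pods -n kube-system -l app=helm,name=tiller

# If the pod is missing or not Ready, inspect the Deployment and pod events.
kubectl describe deployment tiller-deploy -n kube-system
kubectl describe pod -n kube-system -l app=helm,name=tiller
```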
AddonOperationFailed.ErrPatchingClusterRoleBinding
Cause
A ClusterRoleBinding required by the component already exists in the cluster, but its configuration conflicts with what the component expects. This typically occurs when an open-source version of the component is installed separately.
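To see which installation created the existing ClusterRoleBinding and how its configuration conflicts, you can dump it. The name below is a placeholder taken from the error message:

```shell
# Inspect the conflicting ClusterRoleBinding. Helm-managed resources carry
# "app.kubernetes.io/managed-by" labels and "meta.helm.sh" annotations that
# show which release created them.
kubectl get clusterrolebinding <name> -o yaml
```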
Solution
Uninstall the open-source version of the component from the cluster:
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left pane, choose Applications > Helm.
Find the Helm release named after the component. In the Actions column, click Delete. In the dialog box, select Clear Release Records and click OK.
Install or update the component.
AddonOperationFailed.ErrApplyingPatch
Symptoms
The console displays an error message similar to:
spec.template.spec.initContainers[1].name: Duplicate value: "install-cni"
Cause
The YAML manifests of the currently installed component version are incompatible with the target version, preventing the update. This can happen when:
Another version of the component (such as the open-source version) is installed by a different method.
The component's YAML manifests were modified manually.
The currently installed version is no longer supported.
Solution
Modify the component's YAML manifests based on the error message. If you need assistance, submit a ticket.
Example: Flannel container name conflict
If a discontinued Flannel version is installed, the update may fail with:
spec.template.spec.initContainers[1].name: Duplicate value: "install-cni"
To fix this, edit the Flannel DaemonSet manifest:
kubectl -n kube-system edit ds kube-flannel-ds
In the manifest, find the install-cni container definition under spec.template.spec.containers and delete it (lines 7 to 21 in the example below):
containers:
- name: kube-flannel
image: registry-vpc.{{.Region}}.aliyuncs.com/acs/flannel:{{.ImageVersion}}
command: [ "/opt/bin/flanneld", "--ip-masq", "--kube-subnet-mgr" ]
...
# Irrelevant lines are not shown. Delete comment lines 7 to 21.
# - command:
# - /bin/sh
# - -c
# - set -e -x; cp -f /etc/kube-flannel/cni-conf.json /etc/cni/net.d/10-flannel.conf;
# while true; do sleep 3600; done
# image: registry-vpc.cn-beijing.aliyuncs.com/acs/flannel:v0.11.0.1-g6e46593e-aliyun
# imagePullPolicy: IfNotPresent
# name: install-cni
# resources: {}
# terminationMessagePath: /dev/termination-log
# terminationMessagePolicy: File
# volumeMounts:
# - mountPath: /etc/cni/net.d
# name: cni
# - mountPath: /etc/kube-flannel/
#   name: flannel-cfg
# Irrelevant lines are not shown. Delete comment lines 7 to 21.
...
Deleting these lines does not interrupt running workloads. A rolling update starts automatically. After it completes, update Flannel from the ACK console. For details, see Manage components.
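After saving the edited manifest, you can watch the rolling update and confirm that every Flannel pod comes back. The label selector assumes the default app=flannel label:

```shell
# Watch the DaemonSet rolling update until it completes.
kubectl -n kube-system rollout status ds kube-flannel-ds

# Confirm all Flannel pods are Running and Ready.
kubectl get pods -n kube-system -l app=flannel
```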