Container Compute Service: Configure automatic rotation for instances with hardware exceptions

Last Updated: Mar 03, 2026

If a hardware exception occurs on the underlying infrastructure, Alibaba Cloud Container Compute Service (ACS) reports it through mechanisms such as Kubernetes events and conditions. For more information, see GPU Fault Diagnosis and Recovery. To avoid service interruptions, you can configure the acs-instance-helper component to handle such faults automatically. The component automates workload scale-out and eviction, which improves O&M efficiency and service stability.

How it works

When a planned O&M event or a hardware fault, such as a damaged GPU, occurs on the underlying infrastructure of an ACS instance, it can affect service stability and performance or pose a risk of downtime. You can configure the acs-instance-helper component to achieve fully automated fault handling:

  1. Automatic fault monitoring: The component continuously watches for fault Conditions on pods. The underlying infrastructure reports this signal automatically for events such as GPU failures, whole-machine failures, and planned node restart O&M events.

  2. Aligned window processing: The component determines the processing time based on the fault handling deadline reported by the underlying node and the optional O&M window configuration. If the deadline permits, the component waits until the preset O&M window begins before it starts processing.

  3. Triggered rotation: For stateless applications such as Deployments and CloneSets, the component automatically applies the online scale-out policy (scale out first, then destroy) to smoothly rotate pods off the faulty node.

    Important

    For non-online applications, acs-instance-helper evicts the instance directly after it detects the abnormal state.
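The fault signal in step 1 arrives as a condition in the pod's status. The following sketch shows its shape, matching the condition used in the simulation section later in this topic; the reason value here is illustrative only.

```yaml
# Sketch of a hardware-fault condition in a pod's status.
# The type and message format match the simulation example in this topic;
# the reason value is a hypothetical illustration.
status:
  conditions:
    - type: Interruption.HardwareFault
      status: "True"
      reason: HardwareFault
      message: "Underlying infrastructure issue [Reboot] scheduled at 2099-03-12T09:00:00.000+08:00"
```

The time embedded in the message field is the fault handling deadline that the component weighs against the configured O&M window.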

Applicable scope

  • Your ACS cluster version is 1.28 or later.

  • The ACK Virtual Node component is installed, and its version is v2.16.0 or later. For more information, see ACK Virtual Node.

Install the component

  1. In the ACS console, click the name of your target cluster. In the navigation pane on the left, choose Applications > Helm.

  2. On the Helm page, click Create.

    1. Basic Information: In the Chart search box, enter acs-instance-helper and select it from the search results.

    2. Parameters: For Chart Version, select the latest version.

Configure global settings for the acs-instance-helper component (Optional)

You can configure additional supported workload types and O&M windows for the component.

Console

  1. In the left navigation pane of the target cluster's details page, choose Configurations > ConfigMaps.

  2. On the ConfigMaps page, click Create from YAML. Then, copy the following manifest into the Templates area and click Create.

kubectl

  1. Obtain a cluster kubeconfig file and use kubectl to connect to the cluster.

  2. Save the following YAML manifest as a file named acs-instance-helper-global-configmap.yaml. Then, run the kubectl apply -f acs-instance-helper-global-configmap.yaml command.

apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-instance-helper-global-config
  namespace: kube-system
data:
  customOnlineWorkloads: foo.io/SomeWorkload,bar.io/AnotherWorkload
  hardwareFaultEvictionSeconds: "60"
  maintenanceTime: "2025-10-09T10:00:00+08:00"
  maintenanceDuration: "4h"
  maintenanceWeeklyPeriod: "Saturday,Sunday"
  # maintenanceRecurrence: "FREQ=WEEKLY;BYDAY=SA,SU"  # O&M window: every Saturday and Sunday

Configuration item descriptions

customOnlineWorkloads

Marks an application as an "online service", which uses the online scale-out policy (scale out first, then destroy) to smoothly rotate pods off faulty nodes.

The component supports Deployment workloads by default. Configure this item to add support for other custom workload types.

Important
  • Ensure that the custom workload controller can maintain the number of pod replicas by automatically creating replacements when the count is insufficient.

  • A lossless rotation during a fault cannot be guaranteed for all custom workloads. Test thoroughly before you enable this feature for a custom workload.

Example value: foo.io/SomeWorkload,bar.io/AnotherWorkload

hardwareFaultEvictionSeconds

For services that use the "scale out and evict" policy, the waiting time in seconds from the completion of scale-out to the start of eviction of the faulty instance. The value must be a string, for example, "60". The default value is "300" (5 minutes).

Example value: "60"

maintenanceTime

The start time of the O&M window in RFC3339 format. Configuring this item enables the O&M window for the cluster. Use an explicit time zone identifier, such as +08:00 or UTC.

Example value: 2025-10-09T10:00:00+08:00 (the O&M window starts at 10:00 AM on October 9, 2025, UTC+8).

maintenanceDuration

The duration of each O&M window.

  • Takes effect only after maintenanceTime is configured.

  • The value must be a string. Formats such as "3", "3h", and "3H" are supported.

  • The default value is "3", which means 3 hours.

Example value: "4h"

maintenanceWeeklyPeriod

A simplified way to specify which days of the week are O&M days.

  • Takes effect only after maintenanceTime is configured.

  • Valid values are Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday. Separate multiple values with commas.

  • If you configure this item, it overrides maintenanceRecurrence.

Example value: Saturday,Sunday

maintenanceRecurrence

Defines the O&M cycle using RFC 5545 recurrence rule syntax, providing greater flexibility.

  • Takes effect only after maintenanceTime is configured.

  • Only FREQ=WEEKLY is currently supported. The COUNT and UNTIL parameters are not supported.

  • If maintenanceWeeklyPeriod is also configured, this item is ignored.

Example value: FREQ=WEEKLY;BYDAY=SA,SU
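For example, to express the same Saturday-and-Sunday window using the RFC 5545 syntax instead of maintenanceWeeklyPeriod, the global ConfigMap could contain only the window-related keys. This is a sketch; the values mirror the examples above, and maintenanceWeeklyPeriod is omitted so that maintenanceRecurrence takes effect.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-instance-helper-global-config
  namespace: kube-system
data:
  maintenanceTime: "2025-10-09T10:00:00+08:00"       # window anchor in RFC3339 format with an explicit offset
  maintenanceDuration: "4h"                          # each window lasts four hours
  maintenanceRecurrence: "FREQ=WEEKLY;BYDAY=SA,SU"   # recurs every Saturday and Sunday
```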

The fault processing time is determined by both the fault handling deadline and the configured O&M window. If an O&M window is available before the deadline, the acs-instance-helper component prioritizes running the fault repair within that window. If the entire process is not completed within a single O&M window, the remaining operations continue in the next O&M window.


Create and configure a workload

You can enable the fault handling feature for a workload by configuring its annotations.

The fault handling feature rotates instances by repeatedly trying to evict them using the Eviction API, rather than directly deleting them. You can configure a PodDisruptionBudget (PDB) policy to control the concurrency of evictions and prevent service interruptions. For more information, see Control pod eviction concurrency using a PDB resource.
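For example, to keep at least one replica of the sample application in this topic available during eviction, you could apply a PDB like the following sketch. The label selector matches the example Deployment below; note that the Deployment would need more than one replica for the budget to permit any eviction.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hardware-fault-helper-example-pdb
  namespace: default
spec:
  minAvailable: 1                      # at least one pod must stay available during eviction
  selector:
    matchLabels:
      app: hardware-fault-helper-example
```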

Console

  1. In the navigation pane on the left of the target cluster's details page, choose Workloads > Deployments.

  2. On the Deployments page, click Create from YAML. Copy the following content into the Template area and then click Create.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hardware-fault-helper-example
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: hardware-fault-helper-example
      template:
        metadata:
          labels:
            app: hardware-fault-helper-example
          annotations:
            # Key annotation: Enables the fault handling feature for the workload
            "ops.alibabacloud.com/enable-hardware-fault-helper": "true"
        spec:
          containers:
            - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/hello-world:v1
              name: main-container
              resources:
                limits:
                  cpu: 100m
                  memory: 100Mi
          restartPolicy: Always
  3. In the pop-up window, find the target stateless application and click View. Confirm that the pod status is Running.

kubectl

  1. Save the following YAML content as a file named `app.yaml` and then run the `kubectl apply -f app.yaml` command.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hardware-fault-helper-example
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: hardware-fault-helper-example
      template:
        metadata:
          labels:
            app: hardware-fault-helper-example
          annotations:
            # Key annotation: Enables the fault handling feature for the workload
            "ops.alibabacloud.com/enable-hardware-fault-helper": "true"
        spec:
          containers:
            - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/hello-world:v1
              name: main-container
              resources:
                limits:
                  cpu: 100m
                  memory: 100Mi
          restartPolicy: Always
  2. Confirm that the pod status of the destination application is Running.

    kubectl get pods -l app=hardware-fault-helper-example

Simulate a fault scenario and observe the results

In a real-world scenario, the underlying control system adds the Condition automatically. The following steps simulate a fault scenario by manually injecting a Condition into one of the pods.

  1. Simulate the fault: Replace POD_NAME with the actual pod name to inject a hardware fault Condition into the destination pod.

    The fault handling deadline is the time specified in the message field.
    kubectl patch pod POD_NAME --type='merge' --subresource=status -p='{
      "status": {
        "conditions": [
          {
            "type": "Interruption.HardwareFault",
            "status": "True",
            "reason": "MockForTest",
            "message": "Underlying infrastructure issue [Reboot] scheduled at 2099-03-12T09:00:00.000+08:00",
            "lastProbeTime": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'",
            "lastTransitionTime": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'"
          }
        ]
      }
    }'
  2. Observe the scale-out: After the fault is injected, acs-instance-helper triggers a scale-out based on the O&M window configuration (or immediately if no configuration exists). A new pod is created, and the original workload's status remains unaffected.

    kubectl get pods -l app=hardware-fault-helper-example

    Expected output:

    NAME                                             READY   STATUS    RESTARTS   AGE
    hardware-fault-helper-example-7cf4cf96c5-xxxxx   1/1     Running   0          2m21s
    hardware-fault-helper-example-7cf4cf96c5-yyyyy   1/1     Running   0          36s # New scaled-out pod
  3. View the scale-out event: Check the events of the faulty pod. You can see a NewInstanceCreationTriggered event, which confirms that the scale-out operation was triggered by hardware-fault-helper.

    kubectl describe po POD_NAME

    Expected output:

    ...
      Normal  NewInstanceCreationTriggered  62s    hardware-fault-helper  controller default/hardware-fault-helper-example-7cf4cf96c5 (apiVersion:apps/v1, kind:ReplicaSet) will create a new instance
  4. View the eviction event: After the time specified by hardwareFaultEvictionSeconds elapses, the faulty pod is taken offline (it enters the Terminating state and is then deleted). You can also observe an offline event at this time.

    kubectl describe po POD_NAME

    Expected output:

    ...
      Warning  InstanceEvictedGracefully     2s     hardware-fault-helper  pod is deleted due to hardware fault
      Normal   Killing                       1s     kubelet                Stopping container main-container
  5. Confirm recovery: Finally, the faulty pod is completely replaced, leaving only the newly created pod.

    kubectl get pods -l app=hardware-fault-helper-example

    Expected output:

    NAME                                             READY   STATUS      RESTARTS   AGE
    hardware-fault-helper-example-7cf4cf96c5-yyyyy   1/1     Running     0          5m5s

Billing

Installing the acs-instance-helper component deploys a Deployment with two replicas in your cluster. Each replica is allocated 1 vCPU and 2 GiB of memory. These resources are drawn from your cluster and incur fees. For more information about billing, see ACS computing power billing.

FAQ

Control pod eviction concurrency using a PDB resource

When pods must be evicted due to external events such as node draining or automatic scale-in, you can configure a PodDisruptionBudget (PDB) policy to ensure high availability for your services. This policy controls the concurrency of evictions using the following parameters:

  • maxUnavailable: Defines the maximum number of pods that can be unavailable during the eviction process.

  • minAvailable: Defines the minimum number of pods that must remain available during the eviction process.

The following example ensures that at least one pod remains available during eviction.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
  namespace: YOUR_NAMESPACE # Specify the namespace where the policy takes effect. If not specified, it defaults to 'default'
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: app
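Alternatively, to cap the number of pods that can be disrupted concurrently rather than setting a floor on availability, use maxUnavailable. This is a sketch under the same assumptions as the example above.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb-max
  namespace: YOUR_NAMESPACE # Specify the namespace where the policy takes effect. If not specified, it defaults to 'default'
spec:
  maxUnavailable: 1                    # at most one pod may be unavailable at a time
  selector:
    matchLabels:
      app: app
```

A PDB may set either minAvailable or maxUnavailable, but not both; choose the form that matches how you reason about your service's capacity.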