If a hardware exception occurs on the underlying infrastructure, Alibaba Cloud Container Compute Service (ACS) reports it using methods such as Kubernetes events and conditions. For more information, see GPU Fault Diagnosis and Recovery. To avoid service interruptions, you can configure the acs-instance-helper component to enable fault handling. This feature automates workload scale-out and eviction, which improves O&M efficiency and ensures service stability.
How it works
When a planned O&M event or a hardware fault, such as a damaged GPU, occurs on the underlying infrastructure of an ACS instance, it can affect service stability and performance or pose a risk of downtime. You can configure the acs-instance-helper component to achieve fully automated fault handling:
Automatic fault monitoring: The component continuously listens for the pod's fault Condition. The underlying infrastructure automatically reports this signal for events such as GPU failures, full machine failures, or planned node restart O&M events.
Aligned window processing: The component determines the processing time based on the fault handling deadline reported by the underlying node and the optional O&M window configuration. If the deadline permits, the component waits until the preset O&M window begins before it starts processing.
Triggered rotational updates: For stateless applications, such as Deployments and CloneSets, the component automatically uses the online scale-out policy (scale out first, then destroy) to perform a smooth rotation of pods on the faulty node.
Important: For non-online applications, acs-instance-helper directly evicts the instance after it detects the abnormal state.
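The aligned-window decision described above can be sketched as follows. This is an illustrative Python sketch of the decision logic, not the component's actual code; the timestamps and the `decide_processing_time` helper are assumptions for demonstration only.

```python
from datetime import datetime, timedelta, timezone

def decide_processing_time(deadline, window_start, window_duration, now):
    """Illustrative logic: wait for the O&M window if the fault handling
    deadline permits, otherwise process immediately so the deadline is
    never missed."""
    if window_start is None:
        return now  # No O&M window configured: process right away.
    window_end = window_start + window_duration
    if window_start <= now <= window_end:
        return now  # Already inside the window: process now.
    if now < window_start and window_start <= deadline:
        return window_start  # Deadline permits: wait until the window opens.
    return now  # Waiting would miss the deadline: process immediately.

# Example values (assumed for illustration).
tz = timezone(timedelta(hours=8))
now = datetime(2025, 10, 9, 8, 0, tzinfo=tz)
window = datetime(2025, 10, 9, 10, 0, tzinfo=tz)      # O&M window start
late_deadline = datetime(2025, 10, 9, 12, 0, tzinfo=tz)
tight_deadline = datetime(2025, 10, 9, 9, 0, tzinfo=tz)

print(decide_processing_time(late_deadline, window, timedelta(hours=4), now))   # waits for the window
print(decide_processing_time(tight_deadline, window, timedelta(hours=4), now))  # processes immediately
```

The key property is that the deadline always wins: the window only delays processing when doing so is safe.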
Applicable scope
Your ACS cluster version is 1.28 or later.
The ACK Virtual Node component is installed, and its version is v2.16.0 or later. For more information, see ACK Virtual Node.
Install the component
In the ACS console, click the name of your target cluster. In the navigation pane on the left, choose Applications > Helm.
On the Helm page, click Create.
Basic Information: In the Chart search box, enter acs-instance-helper and select it from the search results.
Parameters: For Chart Version, select the latest version.
Configure global settings for the acs-instance-helper component (Optional)
You can configure additional supported workload types and O&M windows for the component.
Console
In the left navigation pane of the target cluster's details page, choose Configurations > ConfigMaps.
On the ConfigMaps page, click Create from YAML. Then, copy the following manifest into the Templates area and click Create.
kubectl
Obtain a cluster kubeconfig file and use kubectl to connect to the cluster.
Save the following YAML manifest as a file named `acs-instance-helper-global-configmap.yaml`. Then, run the `kubectl apply -f acs-instance-helper-global-configmap.yaml` command.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-instance-helper-global-config
  namespace: kube-system
data:
  customOnlineWorkloads: foo.io/SomeWorkload,bar.io/AnotherWorkload
  hardwareFaultEvictionSeconds: "60"
  maintenanceTime: "2025-10-09T10:00:00+08:00"
  maintenanceDuration: "4h"
  maintenanceWeeklyPeriod: "Saturday,Sunday"
  # maintenanceRecurrence: "FREQ=WEEKLY;BYDAY=SA,SU" # O&M window: every Saturday and Sunday
```

Expand the following section to view descriptions of the configuration items.
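To make the window-related settings concrete, the following Python sketch shows one plausible reading of `maintenanceTime`, `maintenanceDuration`, and `maintenanceWeeklyPeriod`: a window that opens at the configured time of day, lasts for the configured duration, and is active only on the listed weekdays. This is an illustration under those assumptions, not the component's implementation.

```python
from datetime import datetime, timedelta, timezone, time

def in_maintenance_window(now, start_of_day, duration, weekdays):
    """Check whether `now` falls inside the recurring O&M window:
    the window opens at `start_of_day`, lasts `duration`, and is
    active only on the listed English weekday names."""
    if now.strftime("%A") not in weekdays:
        return False
    window_start = now.replace(hour=start_of_day.hour, minute=start_of_day.minute,
                               second=0, microsecond=0)
    return window_start <= now < window_start + duration

# Values mirroring the example ConfigMap above.
tz = timezone(timedelta(hours=8))
weekdays = {"Saturday", "Sunday"}   # maintenanceWeeklyPeriod
start = time(10, 0)                 # time of day from maintenanceTime
duration = timedelta(hours=4)       # maintenanceDuration

print(in_maintenance_window(datetime(2025, 10, 11, 11, 0, tzinfo=tz), start, duration, weekdays))  # Saturday 11:00, inside
print(in_maintenance_window(datetime(2025, 10, 13, 11, 0, tzinfo=tz), start, duration, weekdays))  # Monday 11:00, outside
```

A window that crosses midnight is not handled in this sketch; it only illustrates how the three settings combine.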
Create and configure a workload
You can enable the fault handling feature for a workload by configuring its annotations.
The fault handling feature rotates instances by repeatedly trying to evict them using the Eviction API, rather than directly deleting them. You can configure a PodDisruptionBudget (PDB) policy to control the concurrency of evictions and prevent service interruptions. For more information, see Control pod eviction concurrency using a PDB resource.
Console
In the navigation pane on the left of the destination cluster, choose Workloads > Deployments.
On the Deployments page, click Create from YAML. Copy the following content into the Template area and then click Create.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hardware-fault-helper-example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hardware-fault-helper-example
  template:
    metadata:
      labels:
        app: hardware-fault-helper-example
      annotations:
        # Key annotation: enables the fault handling feature for the workload
        "ops.alibabacloud.com/enable-hardware-fault-helper": "true"
    spec:
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/hello-world:v1
        name: main-container
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Always
```

In the pop-up window, find the destination stateless application and click View. Confirm that the pod status is Running.
kubectl
Save the following YAML content as a file named `app.yaml` and then run the `kubectl apply -f app.yaml` command.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hardware-fault-helper-example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hardware-fault-helper-example
  template:
    metadata:
      labels:
        app: hardware-fault-helper-example
      annotations:
        # Key annotation: enables the fault handling feature for the workload
        "ops.alibabacloud.com/enable-hardware-fault-helper": "true"
    spec:
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/hello-world:v1
        name: main-container
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Always
```

Confirm that the pod status of the destination application is Running:

```shell
kubectl get pods -l app=hardware-fault-helper-example
```
Simulate a fault scenario and observe the results
In a real-world scenario, the underlying control system automatically adds the Condition. Here, we simulate a fault scenario by manually injecting a Condition into one of the pods.
1. Simulate the fault: Replace `POD_NAME` with the actual pod name to inject a hardware fault Condition into the destination pod. The fault handling deadline is the time specified in the `message` field.

   ```shell
   kubectl patch pod POD_NAME --type='merge' --subresource=status -p='{
     "status": {
       "conditions": [
         {
           "type": "Interruption.HardwareFault",
           "status": "True",
           "reason": "MockForTest",
           "message": "Underlying infrastructure issue [Reboot] scheduled at 2099-03-12T09:00:00.000+08:00",
           "lastProbeTime": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'",
           "lastTransitionTime": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'"
         }
       ]
     }
   }'
   ```

2. Observe the scale-out: After the fault is injected, acs-instance-helper triggers a scale-out based on the O&M window configuration (or immediately if no window is configured). A new pod is created, and the original workload's status remains unaffected.

   ```shell
   kubectl get pods -l app=hardware-fault-helper-example
   ```

   Expected output:

   ```
   NAME                                             READY   STATUS    RESTARTS   AGE
   hardware-fault-helper-example-7cf4cf96c5-xxxxx   1/1     Running   0          2m21s
   hardware-fault-helper-example-7cf4cf96c5-yyyyy   1/1     Running   0          36s    # New scaled-out pod
   ```

3. View the scale-out event: Check the events of the faulty pod. You can see a `NewInstanceCreationTriggered` event, which confirms that the scale-out operation was triggered by hardware-fault-helper.

   ```shell
   kubectl describe po POD_NAME
   ```

   Expected output:

   ```
   ...
   Normal  NewInstanceCreationTriggered  62s  hardware-fault-helper controller  default/hardware-fault-helper-example-7cf4cf96c5 (apiVersion:apps/v1, kind:ReplicaSet) will create a new instance
   ```

4. View the eviction event: After the time specified by `hardwareFaultEvictionSeconds` elapses, the faulty pod is taken offline (it enters the Terminating state and is then deleted). You can also observe an offline event at this time.

   ```shell
   kubectl describe po POD_NAME
   ```

   Expected output:

   ```
   ...
   Warning  InstanceEvictedGracefully  2s  hardware-fault-helper  pod is deleted due to hardware fault
   Normal   Killing                    1s  kubelet                Stopping container main-container
   ```

5. Confirm recovery: Finally, the faulty pod is completely replaced, leaving only the newly created pod.

   ```shell
   kubectl get pods -l app=hardware-fault-helper-example
   ```

   Expected output:

   ```
   NAME                                             READY   STATUS    RESTARTS   AGE
   hardware-fault-helper-example-7cf4cf96c5-yyyyy   1/1     Running   0          5m5s
   ```
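The fault handling deadline used in the simulation is embedded in the Condition's `message` field after the words "scheduled at". The following minimal Python sketch shows how such a timestamp could be extracted; the `parse_fault_deadline` helper is hypothetical and the component's real parser may differ.

```python
import re
from datetime import datetime

# Example message from the injected Condition above.
MESSAGE = "Underlying infrastructure issue [Reboot] scheduled at 2099-03-12T09:00:00.000+08:00"

def parse_fault_deadline(message):
    """Extract the ISO 8601 timestamp that follows 'scheduled at' in the
    Condition message; return None if no deadline is present."""
    m = re.search(r"scheduled at (\S+)", message)
    if m is None:
        return None
    return datetime.fromisoformat(m.group(1))

deadline = parse_fault_deadline(MESSAGE)
print(deadline.isoformat())  # 2099-03-12T09:00:00+08:00
```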
Billing
Installing the acs-instance-helper component deploys a Deployment with two replicas in your cluster. Each replica is allocated 1 vCPU and 2 GiB of memory. These resources are consumed from your cluster and will incur fees. For more information about billing, see ACS computing power billing.
FAQ
Control pod eviction concurrency using a PDB resource
When pods must be evicted due to external events such as node draining or automatic scale-in, you can configure a PodDisruptionBudget (PDB) policy to ensure high availability for your services. This policy controls the concurrency of evictions using the following parameters:
maxUnavailable: Defines the maximum number of pods that can be unavailable during the eviction process.
minAvailable: Defines the minimum number of pods that must remain available during the eviction process.
The following example ensures that at least one pod remains available during eviction.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
  namespace: YOUR_NAMESPACE # Specify the namespace where the policy takes effect. If not specified, it defaults to 'default'
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: app
```