If a hardware exception occurs on the underlying infrastructure, Alibaba Cloud Container Compute Service (ACS) reports it using methods such as Kubernetes events and conditions. For more information, see GPU Fault Diagnosis and Recovery. To avoid service interruptions, you can configure the acs-instance-helper component to enable fault handling. This feature automates workload scale-out and eviction, which improves O&M efficiency and ensures service stability.
How it works
When a planned O&M event or a hardware fault, such as a damaged GPU, occurs on the underlying infrastructure of an ACS instance, it can affect service stability and performance or pose a risk of downtime. You can configure the acs-instance-helper component to achieve fully automated fault handling:
Automatic fault monitoring: The component continuously listens for the pod's fault Condition. The underlying infrastructure automatically reports this signal for events such as GPU failures, full machine failures, or planned node restart O&M events.
Aligned window processing: The component determines the processing time based on the fault handling deadline reported by the underlying node and the optional O&M window configuration. If the deadline permits, the component waits until the preset O&M window begins before it starts processing.
Triggered rotational updates: For stateless applications, such as Deployments and CloneSets, the component automatically uses the online scale-out policy (scale out first, then destroy) to perform a smooth rotation of pods on the faulty node.
Important: For non-online applications, acs-instance-helper directly evicts the instance after it detects the abnormal state.
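The aligned-window decision described above can be sketched as follows. This is an illustrative Python sketch of the decision logic, not the component's actual code; the timestamps and the `decide_processing_time` helper are assumptions for demonstration only.

```python
from datetime import datetime, timedelta, timezone

def decide_processing_time(deadline, window_start, window_duration, now):
    """Illustrative logic: wait for the O&M window if the fault handling
    deadline permits, otherwise process immediately so the deadline is
    never missed."""
    if window_start is None:
        return now  # No O&M window configured: process right away.
    window_end = window_start + window_duration
    if window_start <= now <= window_end:
        return now  # Already inside the window: process now.
    if now < window_start and window_start <= deadline:
        return window_start  # Deadline permits: wait until the window opens.
    return now  # Waiting would miss the deadline: process immediately.

# Example values (assumed for illustration).
tz = timezone(timedelta(hours=8))
now = datetime(2025, 10, 9, 8, 0, tzinfo=tz)
window = datetime(2025, 10, 9, 10, 0, tzinfo=tz)      # O&M window start
late_deadline = datetime(2025, 10, 9, 12, 0, tzinfo=tz)
tight_deadline = datetime(2025, 10, 9, 9, 0, tzinfo=tz)

print(decide_processing_time(late_deadline, window, timedelta(hours=4), now))   # waits for the window
print(decide_processing_time(tight_deadline, window, timedelta(hours=4), now))  # processes immediately
```

The key property is that the deadline always wins: the window only delays processing when doing so is safe.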
Applicable scope
Your ACS cluster version is 1.28 or later.
The ACK Virtual Node component is installed, and its version is v2.16.0 or later. For more information, see ACK Virtual Node.
Install the component
In the ACS console, click the name of your target cluster. In the navigation pane on the left, choose Applications > Helm.
On the Helm page, click Create.
Basic Information: In the Chart search box, enter acs-instance-helper and select it from the search results.
Parameters: For Chart Version, select the latest version.
Configure global settings for the acs-instance-helper component (Optional)
You can configure additional supported workload types and O&M windows for the component.
Console
In the left navigation pane of the target cluster's details page, choose Configurations > ConfigMaps.
On the ConfigMaps page, click Create from YAML. Then, copy the following manifest into the Templates area and click Create.
kubectl
Obtain a cluster kubeconfig file and use kubectl to connect to the cluster.
Save the following YAML manifest as a file named `acs-instance-helper-global-configmap.yaml`. Then, run the `kubectl apply -f acs-instance-helper-global-configmap.yaml` command.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-instance-helper-global-config
  namespace: kube-system
data:
  customOnlineWorkloads: foo.io/SomeWorkload,bar.io/AnotherWorkload
  hardwareFaultEvictionSeconds: "60"
  maintenanceTime: "2025-10-09T10:00:00+08:00"
  maintenanceDuration: "4h"
  maintenanceWeeklyPeriod: "Saturday,Sunday"
  # maintenanceRecurrence: "FREQ=WEEKLY;BYDAY=SA,SU" # O&M window: every Saturday and Sunday
```

Expand the following section to view descriptions of the configuration items.
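To make the window-related settings concrete, the following Python sketch shows one plausible reading of `maintenanceTime`, `maintenanceDuration`, and `maintenanceWeeklyPeriod`: a window that opens at the configured time of day, lasts for the configured duration, and is active only on the listed weekdays. This is an illustration under those assumptions, not the component's implementation.

```python
from datetime import datetime, timedelta, timezone, time

def in_maintenance_window(now, start_of_day, duration, weekdays):
    """Check whether `now` falls inside the recurring O&M window:
    the window opens at `start_of_day`, lasts `duration`, and is
    active only on the listed English weekday names."""
    if now.strftime("%A") not in weekdays:
        return False
    window_start = now.replace(hour=start_of_day.hour, minute=start_of_day.minute,
                               second=0, microsecond=0)
    return window_start <= now < window_start + duration

# Values mirroring the example ConfigMap above.
tz = timezone(timedelta(hours=8))
weekdays = {"Saturday", "Sunday"}   # maintenanceWeeklyPeriod
start = time(10, 0)                 # time of day from maintenanceTime
duration = timedelta(hours=4)       # maintenanceDuration

print(in_maintenance_window(datetime(2025, 10, 11, 11, 0, tzinfo=tz), start, duration, weekdays))  # Saturday 11:00, inside
print(in_maintenance_window(datetime(2025, 10, 13, 11, 0, tzinfo=tz), start, duration, weekdays))  # Monday 11:00, outside
```

A window that crosses midnight is not handled in this sketch; it only illustrates how the three settings combine.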
Create and configure a workload
You can enable the fault handling feature for a workload by configuring its annotations.
The fault handling feature rotates instances by repeatedly trying to evict them using the Eviction API, rather than directly deleting them. You can configure a PodDisruptionBudget (PDB) policy to control the concurrency of evictions and prevent service interruptions. For more information, see Control pod eviction concurrency using a PDB resource.
Console
In the navigation pane on the left of the destination cluster, choose Workloads > Deployments.
On the Deployments page, click Create from YAML. Copy the following content into the Template area and then click Create.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hardware-fault-helper-example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hardware-fault-helper-example
  template:
    metadata:
      labels:
        app: hardware-fault-helper-example
      annotations:
        # Key annotation: enables the fault handling feature for the workload
        "ops.alibabacloud.com/enable-hardware-fault-helper": "true"
    spec:
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/hello-world:v1
        name: main-container
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Always
```

In the pop-up window, find the destination stateless application and click View. Confirm that the pod status is Running.
kubectl
Save the following YAML content as a file named `app.yaml` and then run the `kubectl apply -f app.yaml` command.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hardware-fault-helper-example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hardware-fault-helper-example
  template:
    metadata:
      labels:
        app: hardware-fault-helper-example
      annotations:
        # Key annotation: enables the fault handling feature for the workload
        "ops.alibabacloud.com/enable-hardware-fault-helper": "true"
    spec:
      containers:
      - image: registry-cn-hangzhou.ack.aliyuncs.com/dev/hello-world:v1
        name: main-container
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Always
```

Confirm that the pod status of the destination application is Running:

```shell
kubectl get pods -l app=hardware-fault-helper-example
```
Simulate a fault scenario and observe the results
In a real-world scenario, the underlying control system automatically adds the Condition. Here, we simulate a fault scenario by manually injecting a Condition into one of the pods.
1. Simulate the fault: Replace `POD_NAME` with the actual pod name to inject a hardware fault Condition into the destination pod. The fault handling deadline is the time specified in the `message` field.

   ```shell
   kubectl patch pod POD_NAME --type='merge' --subresource=status -p='{
     "status": {
       "conditions": [
         {
           "type": "Interruption.HardwareFault",
           "status": "True",
           "reason": "MockForTest",
           "message": "Underlying infrastructure issue [Reboot] scheduled at 2099-03-12T09:00:00.000+08:00",
           "lastProbeTime": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'",
           "lastTransitionTime": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'"
         }
       ]
     }
   }'
   ```

2. Observe the scale-out: After the fault is injected, acs-instance-helper triggers a scale-out based on the O&M window configuration (or immediately if no window is configured). A new pod is created, and the original workload's status remains unaffected.

   ```shell
   kubectl get pods -l app=hardware-fault-helper-example
   ```

   Expected output:

   ```
   NAME                                             READY   STATUS    RESTARTS   AGE
   hardware-fault-helper-example-7cf4cf96c5-xxxxx   1/1     Running   0          2m21s
   hardware-fault-helper-example-7cf4cf96c5-yyyyy   1/1     Running   0          36s    # New scaled-out pod
   ```

3. View the scale-out event: Check the events of the faulty pod. You can see a `NewInstanceCreationTriggered` event, which confirms that the scale-out operation was triggered by hardware-fault-helper.

   ```shell
   kubectl describe po POD_NAME
   ```

   Expected output:

   ```
   ...
   Normal  NewInstanceCreationTriggered  62s  hardware-fault-helper controller  default/hardware-fault-helper-example-7cf4cf96c5 (apiVersion:apps/v1, kind:ReplicaSet) will create a new instance
   ```

4. View the eviction event: After the time specified by `hardwareFaultEvictionSeconds` elapses, the faulty pod is taken offline (it enters the Terminating state and is then deleted). You can also observe an offline event at this time.

   ```shell
   kubectl describe po POD_NAME
   ```

   Expected output:

   ```
   ...
   Warning  InstanceEvictedGracefully  2s  hardware-fault-helper  pod is deleted due to hardware fault
   Normal   Killing                    1s  kubelet                Stopping container main-container
   ```

5. Confirm recovery: Finally, the faulty pod is completely replaced, leaving only the newly created pod.

   ```shell
   kubectl get pods -l app=hardware-fault-helper-example
   ```

   Expected output:

   ```
   NAME                                             READY   STATUS    RESTARTS   AGE
   hardware-fault-helper-example-7cf4cf96c5-yyyyy   1/1     Running   0          5m5s
   ```
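The fault handling deadline used in the simulation is embedded in the Condition's `message` field after the words "scheduled at". The following minimal Python sketch shows how such a timestamp could be extracted; the `parse_fault_deadline` helper is hypothetical and the component's real parser may differ.

```python
import re
from datetime import datetime

# Example message from the injected Condition above.
MESSAGE = "Underlying infrastructure issue [Reboot] scheduled at 2099-03-12T09:00:00.000+08:00"

def parse_fault_deadline(message):
    """Extract the ISO 8601 timestamp that follows 'scheduled at' in the
    Condition message; return None if no deadline is present."""
    m = re.search(r"scheduled at (\S+)", message)
    if m is None:
        return None
    return datetime.fromisoformat(m.group(1))

deadline = parse_fault_deadline(MESSAGE)
print(deadline.isoformat())  # 2099-03-12T09:00:00+08:00
```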
Billing
Installing the acs-instance-helper component deploys a Deployment with two replicas in your cluster. Each replica is allocated 1 vCPU and 2 GiB of memory. These resources are consumed from your cluster and will incur fees. For more information about billing, see ACS computing power billing.
FAQ
Control pod eviction concurrency using a PDB resource
When pods must be evicted due to external events such as node draining or automatic scale-in, you can configure a PodDisruptionBudget (PDB) policy to ensure high availability for your services. This policy controls the concurrency of evictions using the following parameters:
maxUnavailable: Defines the maximum number of pods that can be unavailable during the eviction process.
minAvailable: Defines the minimum number of pods that must remain available during the eviction process.
The following example ensures that at least one pod remains available during eviction.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
  namespace: YOUR_NAMESPACE # Specify the namespace where the policy takes effect. If not specified, it defaults to 'default'
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: app
```