【Product Change】Change announcement for ACK GPU automatic isolation feature
Apr 10, 2026
Container Service for Kubernetes
Affected time
The canary release begins on 2026-05-14.
Change and Impact
To enhance flexibility and configurability for handling GPU anomalies, Container Service for Kubernetes (ACK) is updating its GPU automatic isolation mechanism. This change allows you to customize how your clusters respond to GPU issues, better suiting the diverse fault-tolerance requirements of different business scenarios.
Change Details
ACK provides a GPU anomaly detection and automatic isolation feature. When a GPU anomaly is detected, the faulty GPU can be cordoned off to prevent new workloads from being scheduled onto it, minimizing business impact. Isolation does not repair the GPU, so manual intervention is still required to fix or replace it.
Starting with ACK Node Problem Detector (ACK NPD) component version 1.2.35 and ACK NVIDIA Device Plugin component version 0.7.0, the trigger mechanism for automatic GPU isolation will change from enabled by default to opt-in via configuration. Here are the details:
- ACK NPD is only responsible for anomaly detection and generating reports.
- ACK NVIDIA Device Plugin determines whether to isolate a faulty GPU based on both the anomaly detection report from ACK NPD and the specific trigger conditions you configured.
- Under this new mechanism, automatic GPU isolation is disabled by default. To enable this feature, you must configure the specific anomalies that should trigger an isolation.
Behavior comparison
Previous mechanism
When ACK NPD detected a GPU anomaly, it generated an isolation file. ACK NVIDIA Device Plugin would read this file and automatically isolate all listed GPUs. You could only enable or disable the entire feature by controlling the generation of this file.
New mechanism
When ACK NPD detects a GPU anomaly, it generates an anomaly report. ACK NVIDIA Device Plugin then checks this report against a user-defined list of trigger conditions. This list is empty by default, so automatic GPU isolation stays disabled until you configure it. You now have granular control and can define which anomalies trigger an automatic isolation.
Note: To maintain compatibility, the new version of ACK NPD will continue to generate the old-format GPU isolation file. However, the new version of ACK NVIDIA Device Plugin no longer reads this file. Isolation behavior is determined entirely by its own configuration.
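The difference between the two mechanisms can be illustrated with a minimal Python sketch. This is not the add-ons' actual implementation or configuration format: the `Anomaly` type, the function names, and the anomaly names are all hypothetical and exist only to show the decision logic described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    """One entry in an anomaly report (hypothetical shape)."""
    gpu_id: str
    kind: str  # hypothetical anomaly name, e.g. "ExampleAnomalyA"


def isolate_previous(isolation_file_gpus):
    """Previous mechanism: every GPU listed in the isolation file is isolated."""
    return set(isolation_file_gpus)


def isolate_new(report, triggers=frozenset()):
    """New mechanism: isolate a GPU only if its anomaly is a configured trigger.

    The trigger set is empty by default, so nothing is isolated
    until at least one anomaly is explicitly configured.
    """
    return {anomaly.gpu_id for anomaly in report if anomaly.kind in triggers}


report = [
    Anomaly("gpu0", "ExampleAnomalyA"),
    Anomaly("gpu1", "ExampleAnomalyB"),
]

default_result = isolate_new(report)                        # no triggers configured
opt_in_result = isolate_new(report, {"ExampleAnomalyA"})    # one trigger configured
previous_result = isolate_previous(["gpu0", "gpu1"])        # old all-or-nothing behavior
```

With no triggers configured, `default_result` is empty even though the report lists two anomalies. With one trigger configured, only the GPU reporting that anomaly is isolated.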
Impact Scope
- The new mechanism only applies to ACK clusters running Kubernetes version 1.32 and above.
- Clusters on Kubernetes versions below 1.32 will continue to use the previous isolation mechanism.
The behavior of the automatic GPU isolation feature varies based on the combination of add-on versions:
- ACK NPD version < 1.2.24: GPU anomaly detection is not available.
- ACK NPD version ≥ 1.2.24 and ACK NVIDIA Device Plugin version < 0.7.0: Follows the previous isolation behavior.
- 1.2.24 ≤ ACK NPD version < 1.2.35 and ACK NVIDIA Device Plugin version ≥ 0.7.0: The automatic GPU isolation feature does not function. Other features function normally.
Note: This is because earlier ACK NPD versions do not generate the anomaly reports that the new NVIDIA Device Plugin relies on to identify faulty GPUs. Without these reports, no isolation can be performed.
- ACK NPD version ≥ 1.2.35 and ACK NVIDIA Device Plugin version ≥ 0.7.0: Follows the new isolation behavior. Isolation is triggered based on user-configured conditions, and is disabled by default.
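To check which case applies to a given cluster, the add-on version combinations above can be expressed as a small Python helper. This is a sketch only: the function names and return labels are hypothetical, it assumes plain dotted numeric version strings, and it does not model the Kubernetes version condition.

```python
def parse_version(text):
    """Turn a dotted version such as '1.2.35' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def isolation_behavior(npd_version, plugin_version):
    """Map an add-on version combination to the behavior listed above."""
    npd = parse_version(npd_version)
    plugin = parse_version(plugin_version)

    if npd < (1, 2, 24):
        # GPU anomaly detection is not available at all.
        return "detection-unavailable"
    if plugin < (0, 7, 0):
        # The older plugin still reads the isolation file.
        return "previous"
    if npd < (1, 2, 35):
        # The new plugin needs anomaly reports that this NPD does not generate.
        return "isolation-inactive"
    # Both add-ons are new: isolation is opt-in via configured triggers.
    return "new"


current = isolation_behavior("1.2.30", "0.7.0")
upgraded = isolation_behavior("1.2.35", "0.7.0")
```

In this example, `current` evaluates to "isolation-inactive" and `upgraded` to "new", which shows why both add-ons need to be upgraded together.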
Action required
To use the new configurable automatic isolation feature, complete the following steps.
1. Upgrade your add-ons
 Ensure that your cluster is running ACK NPD version 1.2.35 or later and ACK NVIDIA Device Plugin version 0.7.0 or later. During the canary release, if the new versions are not yet available on the Components and Add-ons page, submit a ticket to be added to the allowlist. We recommend performing upgrades during off-peak hours.
- To upgrade ACK NPD: See Upgrade the ack-node-problem-detector add-on
- To upgrade ACK NVIDIA Device Plugin: See Upgrade NVIDIA Device Plugin add-on
2. Configure automatic isolation triggers
 Based on your business requirements, configure the specific anomalies that will trigger automatic isolation. For detailed instructions, see the updated GPU anomaly detection and automatic isolation documentation.
3. (Recommended) Configure GPU anomaly alerts
 We also recommend configuring GPU anomaly alerts. This will ensure that you are promptly notified when an anomaly is reported, allowing you to address the issue before it impacts your business. For more information, see Observability best practices for GPU or AI training scenarios.
Before upgrading, review how this change may affect your operations and make the necessary adjustments to prevent unexpected behavior caused by the new isolation mechanism.
If you have any questions, please feel free to contact us via our support hotline or by submitting a ticket.