[Product Change] GPU automatic isolation feature changes - Container Service for Kubernetes

To make GPU exception handling more flexible and configurable for different business needs, Container Service for Kubernetes (ACK) is optimizing its GPU automatic isolation mechanism.

Timeline

The phased rollout begins on May 14, 2026.

Changes

ACK supports GPU exception detection and automatic isolation. When ACK detects a GPU exception, it can isolate the faulty GPU so that no new workloads are scheduled to it, which minimizes the impact on your business operations. Automatic isolation is not automatic repair, however: you must still repair or otherwise handle the faulty GPU. Starting with ACK Node Problem Detector (ACK NPD) component version 1.2.35 and ACK NVIDIA Device Plugin component version 0.7.0, GPU automatic isolation is no longer triggered by default. It is triggered only by the isolation triggers that you configure, as follows:

- The ACK NPD component is responsible only for exception detection.
- The ACK NVIDIA Device Plugin component determines whether to isolate a faulty GPU based on the exception detection report from ACK NPD and the isolation triggers that you configure.
- Under the new mechanism, GPU automatic isolation is disabled by default. To enable this feature, you must configure isolation triggers based on your specific requirements.

Mechanism

Previous mechanism
When the ACK NPD component detected a GPU exception, it generated a GPU isolation file. The ACK NVIDIA Device Plugin component would then isolate all GPUs listed in this file. This meant that GPUs were automatically isolated by default upon detection of specific exceptions. You could enable or disable this feature by configuring whether ACK NPD would generate the isolation file.
New mechanism
When the ACK NPD component detects a GPU exception, it generates an exception detection report. The ACK NVIDIA Device Plugin component then uses this report and your configured isolation triggers to decide whether to isolate the GPU. By default, the ACK NVIDIA Device Plugin component has no triggers configured, which means GPU automatic isolation is not triggered by default. You can define which exceptions trigger automatic isolation by configuring specific triggers.
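
The decision described above can be illustrated with a minimal sketch. It is not the component's actual implementation: the `GpuException` type, the `gpus_to_isolate` function, and the exception labels are hypothetical names chosen for illustration, and the real report format and trigger configuration are described in the GPU exception detection and automatic isolation documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuException:
    """One entry in a hypothetical exception detection report."""
    gpu_id: str
    kind: str  # hypothetical exception label, not a real ACK identifier


def gpus_to_isolate(report: list[GpuException], triggers: set[str]) -> set[str]:
    """Return the GPUs whose reported exception matches a configured trigger."""
    return {entry.gpu_id for entry in report if entry.kind in triggers}


# A made-up report containing two faulty GPUs.
report = [
    GpuException(gpu_id="gpu-0", kind="example-exception-a"),
    GpuException(gpu_id="gpu-1", kind="example-exception-b"),
]

# Default: no triggers are configured, so nothing is isolated.
print(gpus_to_isolate(report, triggers=set()))  # set()

# With one trigger configured, only the matching GPU is isolated.
print(gpus_to_isolate(report, triggers={"example-exception-a"}))  # {'gpu-0'}
```

The point of the sketch is that an empty trigger set always yields an empty result, which is why detection alone no longer leads to isolation under the new mechanism.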

Note

For backward compatibility, the new version of the ACK NPD component continues to generate the GPU isolation file in the legacy format. However, the new version of the ACK NVIDIA Device Plugin component no longer reads this file. Whether a GPU is isolated is now determined solely by the isolation triggers configured for the ACK NVIDIA Device Plugin component.

Impact scope

The new mechanism applies only to ACK clusters running Kubernetes 1.32 or later.
For clusters running Kubernetes versions earlier than 1.32, GPU automatic isolation still uses the previous mechanism.

The following table summarizes the GPU automatic isolation behavior for different component versions:

ACK NPD version	ACK NVIDIA Device Plugin version	Isolation behavior
< 1.2.24	N/A	GPU exception detection is not available.
≥ 1.2.24	< 0.7.0	Isolation is performed based on the previous mechanism.
≥ 1.2.24 and < 1.2.35	≥ 0.7.0	GPU automatic isolation is inactive; other features function as expected. These ACK NPD versions do not generate an exception detection report, so the new ACK NVIDIA Device Plugin receives no fault information and does not perform automatic isolation.
≥ 1.2.35	≥ 0.7.0	Isolation is performed based on the new mechanism.
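
To check which row of the table above applies to a given pair of component versions, the comparison can be sketched as follows. This is an illustrative helper, not an ACK tool: the function names and returned labels are assumptions, versions are assumed to be plain dotted numbers such as `1.2.35`, and the Kubernetes version requirement described earlier in this section is not checked.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '1.2.35' or 'v0.7.0'."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Thresholds taken from the table above.
NPD_DETECTION_MIN = (1, 2, 24)  # first ACK NPD version with GPU exception detection
NPD_REPORT_MIN = (1, 2, 35)     # first ACK NPD version that generates the report
PLUGIN_NEW_MIN = (0, 7, 0)      # first Device Plugin version using isolation triggers


def isolation_behavior(npd_version: str, plugin_version: str) -> str:
    """Map a component version pair to the matching row of the table."""
    npd = parse_version(npd_version)
    plugin = parse_version(plugin_version)
    if npd < NPD_DETECTION_MIN:
        return "detection unavailable"
    if plugin < PLUGIN_NEW_MIN:
        return "previous mechanism"
    if npd < NPD_REPORT_MIN:
        return "isolation inactive"
    return "new mechanism"


print(isolation_behavior("1.2.30", "0.7.0"))  # isolation inactive
print(isolation_behavior("1.2.35", "0.7.0"))  # new mechanism
```

The "isolation inactive" case is the one to watch for during an upgrade: upgrading only the ACK NVIDIA Device Plugin while ACK NPD stays below 1.2.35 leaves faulty GPUs unisolated.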

Recommendations

To use the new configurable automatic isolation feature, complete the following steps:

1. Upgrade component versions.
Ensure that the ACK NPD component is version 1.2.35 or later and the ACK NVIDIA Device Plugin component is version 0.7.0 or later. During the phased rollout, if the new versions are not available on the Component Management page, submit a ticket to be added to the allowlist. We recommend performing upgrades during off-peak hours.
For instructions on how to upgrade the components, see the following topics:
- ACK NPD upgrade: Upgrade the ack-node-problem-detector component.
- ACK NVIDIA Device Plugin upgrade: Upgrade the NVIDIA Device Plugin component.
2. Configure automatic isolation triggers.
Configure the automatic isolation triggers based on your business requirements. For detailed instructions, see the updated GPU exception detection and automatic isolation documentation.
3. (Recommended) Configure GPU exception alerts.
With alerts configured, you are notified as soon as an exception occurs, so you can address the issue promptly and prevent a faulty GPU from affecting your operations for a prolonged period. For more information, see Best practices for observability in GPU or AI training scenarios.

Before upgrading the components, assess the impact of this change on your services and make any necessary adjustments. This helps prevent unexpected behavior resulting from the change in the isolation mechanism.