GPU Error Correction Code (ECC) detects and corrects memory errors at the cost of some GPU memory. In memory-intensive workloads where every gigabyte counts, disabling ECC reclaims that memory and makes it available to your workload.
With ECC disabled, GPU memory errors are no longer detected or corrected. This can cause task interruptions and data loss. Alibaba Cloud does not restore tasks or data affected by ECC-related issues. Verify that your workload can tolerate memory errors before disabling ECC.
Prerequisites
Before you begin, ensure that you have:
An ACS cluster with GPU nodes
Account permission to disable GPU ECC. If you don't have it, submit a ticket to request access.
Supported GPU models
The following GPU models support disabling ECC mode.
| Card type | Compute class |
|---|---|
| G49E | gpu |
Deploy a pod with ECC mode disabled
ECC mode is enabled by default on all GPUs. To disable it, set the alibabacloud.com/gpu-ecc-mode-disabled annotation to "true" on the pod. Omitting the annotation or setting it to "false" keeps ECC mode enabled.
Create a file named pod-disable-gpu-ecc.yaml with the following content.

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/compute-qos: default
        # Specify the GPU model. Change this value as needed.
        alibabacloud.com/gpu-model-series: G49E
      annotations:
        # Disable ECC mode.
        alibabacloud.com/gpu-ecc-mode-disabled: "true"
      name: pod-disable-gpu-ecc
      namespace: default
    spec:
      containers:
      - command:
        - sleep
        - '3600000000'
        # The sample image has the GPU driver pre-installed. Replace cn-hangzhou with your region.
        image: acs-registry-vpc.cn-hangzhou.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.09-vllm0.10.2-pytorch2.8-cu128-20250922-serverless
        imagePullPolicy: IfNotPresent
        name: test
        resources:
          limits:
            cpu: '8'
            ephemeral-storage: 30Gi
            memory: 64Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '8'
            ephemeral-storage: 30Gi
            memory: 64Gi
            nvidia.com/gpu: '1'

Deploy the pod.

    kubectl apply -f pod-disable-gpu-ecc.yaml

Wait for the pod to reach the Running state.

    kubectl get pod | grep pod-disable-gpu-ecc

Expected output:

    pod-disable-gpu-ecc   1/1   Running   0   2m16s
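If you want to script the Running check instead of reading the output by eye, the following is a minimal sketch. The helper name pod_is_running is hypothetical and not part of kubectl or ACS. It assumes the default kubectl get pod column layout, where the pod name is the first column and STATUS is the third, and it reads that output from standard input.

```shell
# Hypothetical helper: succeeds (exit 0) only when the named pod's
# STATUS column is "Running". Reads `kubectl get pod` output on stdin.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
pod_is_running() {
  awk -v name="$1" '
    $1 == name && $3 == "Running" { found = 1 }
    END { exit !found }
  '
}
```

One possible way to use it, assuming the pod was created in the default namespace as in the manifest above:

    kubectl get pod -n default | pod_is_running pod-disable-gpu-ecc && echo "pod is running"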
Verify ECC mode is disabled
Log in to the pod and run the following command to check the ECC mode status.
    nvidia-smi -q | grep "ECC Mode" -A 2

Expected output:

    ECC Mode
        Current : Disabled
        Pending : Disabled

Both Current and Pending show Disabled, confirming that ECC mode is off.
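The same check can be automated, for example in a startup or CI script. The sketch below is illustrative only: ecc_mode_disabled is a hypothetical helper name, and it assumes the nvidia-smi -q layout shown in the expected output, where an "ECC Mode" line is followed by Current and Pending lines. It reads that output from standard input and succeeds only if every ECC Mode block reports Disabled for both fields.

```shell
# Hypothetical helper: reads `nvidia-smi -q` output on stdin and exits 0
# only if at least one "ECC Mode" block was seen and every such block
# reports both Current and Pending as "Disabled".
ecc_mode_disabled() {
  awk '
    $1 == "ECC" && $2 == "Mode" { in_block = 1; seen = 1; next }
    in_block && ($1 == "Current" || $1 == "Pending") {
      if ($NF != "Disabled") bad = 1        # any other value fails the check
      if ($1 == "Pending") in_block = 0     # Pending closes the block
      next
    }
    END { exit !(seen && !bad) }
  '
}
```

Inside the pod you could run it as follows. From outside the pod, the nvidia-smi call could instead be wrapped in kubectl exec pod-disable-gpu-ecc -- nvidia-smi -q.

    nvidia-smi -q | ecc_mode_disabled && echo "ECC is disabled"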