Container Compute Service: Disable GPU ECC mode

Last Updated: Mar 25, 2026

GPU Error Correction Code (ECC) detects and corrects memory errors at the cost of some GPU memory. In memory-intensive workloads where every gigabyte counts, disabling ECC reclaims that memory and makes it available to your workload.

Warning

With ECC disabled, GPU memory errors are no longer detected or corrected. This can cause task interruptions and data loss. Alibaba Cloud does not restore tasks or data affected by ECC-related issues. Verify that your workload can tolerate memory errors before disabling ECC.

Prerequisites

Before you begin, ensure that you have:

  • An ACS cluster with GPU nodes

  • Account permissions to disable GPU ECC. If you don't have this access, submit a ticket to request it.

Supported GPU models

The following GPU models support disabling ECC mode.

    Card type    Compute class
    G49E         gpu

Deploy a pod with ECC mode disabled

ECC mode is enabled by default on all GPUs. To disable it, set the alibabacloud.com/gpu-ecc-mode-disabled annotation to "true" on the pod. Omitting the annotation or setting it to "false" keeps ECC mode enabled.

  1. Create a file named pod-disable-gpu-ecc.yaml with the following content.

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/compute-qos: default
        # Specify the GPU model. Change this value as needed.
        alibabacloud.com/gpu-model-series: G49E
      annotations:
        # Disable ECC mode.
        alibabacloud.com/gpu-ecc-mode-disabled: "true"
      name: pod-disable-gpu-ecc
      namespace: default
    spec:
      containers:
        - command:
            - sleep
            - '3600000000'
          # The sample image has the GPU driver pre-installed. Replace cn-hangzhou with your region.
          image: acs-registry-vpc.cn-hangzhou.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.09-vllm0.10.2-pytorch2.8-cu128-20250922-serverless
          imagePullPolicy: IfNotPresent
          name: test
          resources:
            limits:
              cpu: '8'
              ephemeral-storage: 30Gi
              memory: 64Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '8'
              ephemeral-storage: 30Gi
              memory: 64Gi
              nvidia.com/gpu: '1'
  2. Deploy the pod.

    kubectl apply -f pod-disable-gpu-ecc.yaml
  3. Wait for the pod to reach the Running state.

    kubectl get pod | grep pod-disable-gpu-ecc

    Expected output:

    pod-disable-gpu-ecc   1/1     Running   0          2m16s
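
Instead of polling with kubectl get, a script can block until the pod is ready. The following is a minimal sketch, not part of the official procedure: wait_for_pod is a hypothetical helper name, and the 300s default timeout is an assumption that you may need to raise if the image takes a long time to pull.

```shell
# Hypothetical helper: block until the pod reports the Ready condition.
# Usage: wait_for_pod <pod-name> [namespace] [timeout]
wait_for_pod() {
  local pod="$1"
  local namespace="${2:-default}"
  local timeout="${3:-300s}"   # assumed default; adjust for image pull time
  kubectl wait --for=condition=Ready "pod/${pod}" -n "${namespace}" --timeout="${timeout}"
}

# Example (requires kubectl access to the cluster):
#   wait_for_pod pod-disable-gpu-ecc default 600s
```

kubectl wait exits with a non-zero status if the timeout elapses first, so the helper can gate later steps in a script.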

Verify ECC mode is disabled

Open a shell in the pod (for example, with kubectl exec -it pod-disable-gpu-ecc -- /bin/sh) and run the following command to check the ECC mode status.

    nvidia-smi -q | grep "ECC Mode" -A 2

Expected output:

    ECC Mode
        Current                           : Disabled
        Pending                           : Disabled

Both Current and Pending show Disabled, confirming that ECC mode is off.
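
For automated checks, the same condition can be tested in a script rather than read by eye. The sketch below assumes the nvidia-smi -q output has the layout shown above; ecc_fully_disabled is a hypothetical helper name, not a tool shipped with the image. It reads the report on standard input and succeeds only if at least one ECC Mode section is present and every Current and Pending value in those sections is Disabled.

```shell
# Hypothetical helper: reads `nvidia-smi -q` output on stdin.
# Exit status 0 means every "ECC Mode" section shows Current and Pending
# as Disabled; any other value, or no ECC Mode section at all, yields 1.
ecc_fully_disabled() {
  awk '
    /ECC Mode/          { in_ecc = 1; seen = 1; next }
    in_ecc && /Current/ { if ($NF != "Disabled") bad = 1; next }
    in_ecc && /Pending/ { if ($NF != "Disabled") bad = 1; in_ecc = 0; next }
    END                 { exit (seen && !bad) ? 0 : 1 }
  '
}

# Example (requires kubectl access to the cluster):
#   kubectl exec pod-disable-gpu-ecc -- nvidia-smi -q | ecc_fully_disabled \
#     && echo "ECC mode is disabled"
```

Treating a missing ECC Mode section as a failure is a deliberate choice here, so that an empty or truncated report is not mistaken for a successful check.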