
Container Service for Kubernetes:Work with GPU sharing

Last Updated:Mar 26, 2026

GPU sharing lets multiple pods run on the same GPU card in an ACK Lingjun managed cluster. Depending on whether your workloads need strict memory boundaries, choose between two modes:

| Mode | How it works | GPU memory isolated? | Use when |
| --- | --- | --- | --- |
| Sharing without isolation | Pods share the GPU; memory is not partitioned between them | No | Workloads that manage their own memory limits (for example, Java apps with -Xmx) |
| Sharing with isolation (eGPU) | Pods share the GPU; each pod gets a hard memory boundary enforced by the eGPU module | Yes | Multiple containers on one GPU where one must not starve others |

Prerequisites

Before you begin, ensure that you have:

  • An ACK Lingjun managed cluster with at least one GPU-accelerated Lingjun node. See Create a Lingjun cluster with ACK activated.

  • The GPU sharing component, which is installed by default in ACK Lingjun managed clusters

Enable GPU sharing without isolation

Use this mode when your workloads handle GPU memory limits at the application layer.

Step 1: Label the node

  1. Confirm the node is a Lingjun node by checking whether /etc/lingjun_metadata exists on the node. If the file exists, run nvidia-smi to verify the GPU is accessible. If the file does not exist, the node is not a Lingjun node and you cannot enable GPU sharing for it. Create a Lingjun node pool first. See Overview of Lingjun node pools.

  2. Add the GPU sharing label to the node:

    kubectl label node <NODE_NAME> ack.node.gpu.schedule=share

Step 2: Submit a GPU-sharing job

  1. Create a file named tensorflow.yaml with the following content:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-mnist-share
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-mnist-share
        spec:
          containers:
          - name: tensorflow-mnist-share
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=100000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits:
                aliyun.com/gpu-mem: 4  # Request 4 GiB of GPU memory
            workingDir: /root
          restartPolicy: Never

    The key field is aliyun.com/gpu-mem: 4 under resources.limits, which requests 4 GiB of GPU memory for the pod.

  2. Submit the job:

    kubectl apply -f tensorflow.yaml

Step 3: Verify GPU sharing without isolation

  1. Get the pod name:

    kubectl get pod | grep tensorflow
  2. Run nvidia-smi inside the pod:

    kubectl exec -ti tensorflow-mnist-share-xxxxx -- nvidia-smi

    Expected output:

    Wed Jun 14 06:45:56 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

    Key fields to check:

    • Memory-Usage shows 334MiB / 16384MiB — the pod sees the full 16,384 MiB of GPU memory, not just the 4 GiB it requested. This confirms isolation is not active.

    • If the GPU isolation module were installed, the memory field would show only the requested 4 GiB.
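This check can be automated. The sketch below is a hypothetical helper (not part of the GPU sharing component) that parses the Memory-Usage field from nvidia-smi output and compares the total the pod sees against its requested allocation:

```python
import re

def parse_memory_usage(nvidia_smi_line):
    """Extract (used_mib, total_mib) from a field like '334MiB / 16384MiB'."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_line)
    if match is None:
        raise ValueError("no Memory-Usage field found")
    return int(match.group(1)), int(match.group(2))

def isolation_active(total_mib, requested_gib):
    """If the pod sees only its requested memory, isolation is enforced."""
    return total_mib <= requested_gib * 1024

line = "| N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |"
used, total = parse_memory_usage(line)
print(used, total)                 # 334 16384
print(isolation_active(total, 4))  # False: the pod sees the full card
```

If the same helper were run against vgpu-smi output in the isolated mode, it would report isolation as active, because the visible total drops to roughly the requested amount.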

In this mode, GPU memory limits are not enforced at the driver level; a pod can use more than it requested. The scheduler tracks memory allocations using two environment variables injected into the container:

ALIYUN_COM_GPU_MEM_CONTAINER=4   # GPU memory allocated to this pod (GiB)
ALIYUN_COM_GPU_MEM_DEV=16        # Total GPU memory per card (GiB)

Applications that need to respect the allocation can calculate the permitted fraction: 4 / 16 = 0.25 (25% of total GPU memory).
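For example, a job could read these variables and cap its own usage at the application layer. The sketch below computes the permitted fraction; the TensorFlow 1.x self-limiting call is shown only as a comment, as one illustration of how an application might apply it:

```python
import os

# Values injected by the GPU sharing scheduler; the defaults here
# mirror the example above and are only for illustration.
allocated_gib = int(os.environ.get("ALIYUN_COM_GPU_MEM_CONTAINER", "4"))
total_gib = int(os.environ.get("ALIYUN_COM_GPU_MEM_DEV", "16"))

fraction = allocated_gib / total_gib
print(fraction)  # 0.25 with the example values

# A TensorFlow 1.x application could then self-limit, e.g.:
# config = tf.ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = fraction
# session = tf.Session(config=config)
```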

Enable GPU sharing with isolation (eGPU)

Use this mode to enforce hard GPU memory boundaries between pods on the same GPU card.

Step 1: Label the node

  1. Confirm the node is a Lingjun node by checking whether /etc/lingjun_metadata exists on the node. If the file exists, run nvidia-smi to verify the GPU is accessible. If the file does not exist, the node is not a Lingjun node and you cannot enable GPU sharing for it. Create a Lingjun node pool first. See Overview of Lingjun node pools.

  2. Add the GPU sharing label to the node. Choose the label value based on which isolation you need:

    | Label value | What it enables |
    | --- | --- |
    | egpu_mem | GPU memory isolation only |
    | egpu_core_mem | GPU memory isolation and computing power isolation |

    To enable memory isolation:

    kubectl label node <NODE_NAME> ack.node.gpu.schedule=egpu_mem

    Note: GPU computing power must always be requested together with GPU memory. Requesting computing power alone is not supported.

Step 2: Confirm node resources are ready

After labeling the node, wait for the node to report its GPU resources, then verify:

kubectl get node <NODE_NAME> -oyaml

Look for aliyun.com/gpu-mem and aliyun.com/gpu-count in the allocatable and capacity sections:

allocatable:
  aliyun.com/gpu-count: "1"
  aliyun.com/gpu-mem: "80"
  ...
  nvidia.com/gpu: "0"
  ...
capacity:
  aliyun.com/gpu-count: "1"
  aliyun.com/gpu-mem: "80"
  ...
  nvidia.com/gpu: "0"
  ...

Key fields to check:

  • `aliyun.com/gpu-count: "1"` — the node has one GPU card.

  • `aliyun.com/gpu-mem: "80"` — the node has 80 GiB of total GPU memory.

  • `nvidia.com/gpu: "0"` — the whole GPU is not exposed as an independent schedulable resource; memory is allocated via aliyun.com/gpu-mem.

To schedule a pod to an entire GPU device, add the label ack.gpushare.placement=require-whole-device to the pod and specify the amount of GPU memory using aliyun.com/gpu-mem.
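Because the scheduler treats aliyun.com/gpu-mem as a countable resource, you can reason about placement with simple arithmetic. A sketch for the 80 GiB node above (the per-pod request size is a hypothetical example, not from the node output):

```python
node_capacity_gib = 80    # aliyun.com/gpu-mem reported by the node
pod_request_gib = 10      # aliyun.com/gpu-mem in each pod's resources.limits

# Integer division gives the number of pods that fit on this card.
max_pods = node_capacity_gib // pod_request_gib
leftover_gib = node_capacity_gib - max_pods * pod_request_gib
print(max_pods, leftover_gib)  # 8 pods fit, 0 GiB left over
```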

Step 3: Run a benchmarking job to verify isolation

  1. Create a file named benchmark.yaml with the following content:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: benchmark-job
    spec:
      parallelism: 1
      template:
        spec:
          containers:
          - name: benchmark-job
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
            command:
            - bash
            - run.sh
            - --num_batches=500000000
            - --batch_size=8
            resources:
              limits:
                aliyun.com/gpu-mem: 10  # Request 10 GiB of GPU memory
            workingDir: /root
          restartPolicy: Never
          hostNetwork: true
          tolerations:
            - operator: Exists
  2. Submit the job:

    kubectl apply -f benchmark.yaml
  3. After the pod starts, open a shell in the pod:

    kubectl exec -ti benchmark-job-xxxx -- bash
  4. Run vgpu-smi to check the GPU isolation status:

    vgpu-smi

    Expected output:

    +------------------------------------------------------------------------------+
    |    VGPU_SMI 460.91.03     DRIVER_VERSION: 460.91.03     CUDA Version: 11.2   |
    +-------------------------------------------+----------------------------------+
    | GPU  Name                Bus-Id           |        Memory-Usage     GPU-Util |
    |===========================================+==================================|
    |   0  xxxxxxxx            00000000:00:07.0 |  8307MiB / 10782MiB   100% /  100% |
    +-------------------------------------------+----------------------------------+

    Key fields to check:

    • Memory-Usage shows 8307MiB / 10782MiB — the pod is capped at approximately 10 GiB, confirming that GPU memory isolation is active.

    • Unlike nvidia-smi in the non-isolated mode, vgpu-smi shows only the memory allocated to this pod, not the total GPU memory.
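The total reported by vgpu-smi (10782 MiB) is close to, but not exactly, the requested 10 GiB (10240 MiB). A rough sanity check — the 10% tolerance here is an assumption for illustration, not a documented guarantee:

```python
requested_gib = 10
reported_total_mib = 10782  # total from the vgpu-smi Memory-Usage field

requested_mib = requested_gib * 1024
deviation = abs(reported_total_mib - requested_mib) / requested_mib
print(round(deviation, 3))  # 0.053 — within a 10% tolerance
assert deviation < 0.10
```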

FAQ

How do I check whether the GPU sharing component is installed?

Run the following command:

kubectl get ds -nkube-system | grep gpushare

If the component is installed, the output lists the following DaemonSets:

NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                    AGE
gpushare-egpu-device-plugin-ds       0         0         0       0            0           <none>
gpushare-egpucore-device-plugin-ds   0         0         0       0            0           <none>

What's next