This topic describes how to use eGPU to schedule and isolate GPU resources on the Lingjun nodes in a Container Service for Kubernetes (ACK) Lingjun managed cluster.
Prerequisites
An ACK Lingjun managed cluster is created and the cluster contains GPU-accelerated Lingjun nodes.
By default, an eGPU-based GPU sharing and scheduling component is installed in ACK Lingjun managed clusters to allow you to directly use the GPU sharing and scheduling feature. For more information about how to check whether the eGPU-based GPU sharing and scheduling component is installed, see How do I check whether the eGPU-based GPU sharing and scheduling component is installed in my cluster?
The eGPU-based GPU sharing and scheduling component does not limit the instance types that you can use. However, because eGPU does not support all features of H800 Lingjun nodes, H800 Lingjun nodes do not support GPU memory isolation or computing power isolation. If you want to use GPU memory isolation and computing power isolation, use other types of Lingjun nodes.
Step 1: Enable GPU sharing and scheduling
To enable GPU sharing and scheduling for a Lingjun node, perform the following steps:
Check whether the /etc/lingjun_metadata file exists on the node.
If the file exists, run the nvidia-smi command. If no error is returned, the node for which you want to enable GPU sharing and scheduling is a Lingjun node, and you can proceed to the next step.
If the file does not exist, the node is not a Lingjun node and you cannot enable GPU sharing and scheduling for it. In this case, create a Lingjun node first. For more information, see Lingjun node pools.
Run the following command to add the ack.node.gpu.schedule label to the node and enable GPU sharing and scheduling:
kubectl label node <NODE_NAME> ack.node.gpu.schedule=<SHARE_MODE>
Note: If the value of the label is egpu_mem, only GPU memory is isolated. If the value of the label is egpu_core_mem, both GPU memory and GPU computing power are isolated. GPU computing power must be requested together with GPU memory, but you can request only GPU memory on its own.
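The following is a minimal sketch of these two steps, assuming shell access to the node and a placeholder node name cn-example-node. Replace the node name with the name of your own node.
# Confirm that the node is a Lingjun node
ls /etc/lingjun_metadata
nvidia-smi

# Enable GPU memory and computing power isolation (egpu_core_mem) on the node
kubectl label node cn-example-node ack.node.gpu.schedule=egpu_core_mem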
Step 2: Use shared GPU resources
In this example, the value of the label is set to egpu_core_mem.
Wait until the node reports the GPU information.
Run the following command to query the resources on the node:
kubectl get node <NODE_NAME> -oyaml
Expected output:
allocatable:
  aliyun.com/gpu-core.percentage: "100"
  aliyun.com/gpu-count: "1"
  aliyun.com/gpu-mem: "80"
  ...
  nvidia.com/gpu: "0"
  ...
capacity:
  aliyun.com/gpu-core.percentage: "100"
  aliyun.com/gpu-count: "1"
  aliyun.com/gpu-mem: "80"
  ...
  nvidia.com/gpu: "0"
  ...
The output indicates that the aliyun.com/gpu-mem and aliyun.com/gpu-core.percentage resources are available.
Use the shared GPU resources. For more information, see Configure the GPU sharing component.
Note: If you want to allocate an entire GPU to a pod when you schedule the pod, add the ack.gpushare.placement=require-whole-device label to the pod and specify the requested amount of GPU memory in gpu-mem. Then, a GPU that can provide the requested amount of GPU memory is automatically allocated to the pod.
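The following is a minimal pod sketch that requests only GPU memory from the shared resources. The pod name, container name, and image are placeholders, and the commented label shows where you would add ack.gpushare.placement=require-whole-device if you want an entire GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-example           # placeholder pod name
  # To allocate an entire GPU to this pod, uncomment the following label:
  # labels:
  #   ack.gpushare.placement: require-whole-device
spec:
  containers:
  - name: main                    # placeholder container name
    image: <YOUR_IMAGE>           # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 10    # request 10 GB of GPU memory; no computing power requested
  restartPolicy: Never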
Step 3: Run a job to verify GPU sharing and scheduling
Use the following YAML file to submit a Benchmark job:
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-job
spec:
  parallelism: 1
  template:
    spec:
      containers:
      - name: benchmark-job
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=500000000
        - --batch_size=8
        resources:
          limits:
            aliyun.com/gpu-mem: 10
            aliyun.com/gpu-core.percentage: 60
        workingDir: /root
      restartPolicy: Never
      hostNetwork: true
      tolerations:
      - operator: Exists
Run the following command to submit the job:
kubectl apply -f benchmark.yaml
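Before you access the pod, you can confirm that it has entered the Running state. This is only a sketch; the job-name label shown here is added to the pod automatically by the Job controller.
kubectl get pods -l job-name=benchmark-job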
Run the following command to access the pod after the pod enters the Running state:
kubectl exec -ti benchmark-job-xxxx bash
Run the following command in the pod to query the GPU isolation information:
vgpu-smi
Expected output:
+------------------------------------------------------------------------------+
| VGPU_SMI 460.91.03       DRIVER_VERSION: 460.91.03       CUDA Version: 11.2  |
+-------------------------------------------+----------------------------------+
| GPU  Name            Bus-Id               | Memory-Usage          GPU-Util   |
|===========================================+==================================|
|   0  xxxxxxxx        00000000:00:07.0     | 8307MiB / 10782MiB    60% / 60%  |
+-------------------------------------------+----------------------------------+
The output indicates that 10 GB of GPU memory and 60% of computing power are allocated to the pod.
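You can also check how the request is accounted for on the node from outside the pod by inspecting the Allocated resources section of the node description. This is a sketch; <NODE_NAME> is the node that runs the pod, and the grep window is only an approximation.
kubectl describe node <NODE_NAME> | grep -A 10 "Allocated resources"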
FAQ
How do I check whether the eGPU-based GPU sharing and scheduling component is installed in my cluster?
Run the following command to check whether the eGPU-based GPU sharing and scheduling component is installed:
kubectl get ds -nkube-system | grep gpushare
Expected output:
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
gpushare-egpu-device-plugin-ds       0         0         0       0            0           <none>
gpushare-egpucore-device-plugin-ds   0         0         0       0            0           <none>
The output indicates that the eGPU-based GPU sharing and scheduling component is installed.