Starting from version 1.8, Kubernetes supports hardware acceleration devices such as NVIDIA GPUs, InfiniBand, and FPGAs by using device plugins. The GPU solution of the Kubernetes open source community will be deprecated in version 1.10 and removed from the main code base in version 1.11.

We recommend that you use an Alibaba Cloud Kubernetes cluster with GPU instances to run compute-intensive tasks such as machine learning and image processing. With this method, you can implement one-click deployment, elastic scaling, and other functions without installing NVIDIA drivers or the Compute Unified Device Architecture (CUDA) toolkit beforehand.
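As a minimal illustration of how the device plugin mechanism is consumed, a container requests GPUs through the nvidia.com/gpu resource in the same way that it requests CPU and memory. The following sketch assumes that the NVIDIA device plugin is already running on the GPU nodes (which is the case for Alibaba Cloud GPU clusters); the pod name and image are examples only:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test                   # example name for illustration only
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:9.0-base          # any CUDA-enabled image works here
        command: ["nvidia-smi"]              # print the GPUs visible to the container
        resources:
          limits:
            nvidia.com/gpu: 1                # request one GPU through the device plugin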

Background information

During cluster creation, Container Service performs the following operations:

  • Creates Elastic Compute Service (ECS) instances, sets the public key used for SSH logon from the management node to other nodes, and installs and configures the Kubernetes cluster by using CloudInit.
  • Creates a security group that allows inbound ICMP traffic within the VPC.
  • Creates a new VPC and VSwitch if you do not use an existing VPC, and creates an SNAT entry for the VSwitch.
  • Creates VPC routing rules.
  • Creates a NAT gateway and Elastic IP (EIP).
  • Creates a Resource Access Management (RAM) user and AccessKey (AK). This RAM user has the permissions to query, create, and delete ECS instances, add and delete cloud disks, and all relevant access permissions for Server Load Balancer (SLB) instances, CloudMonitor, VPC, Log Service, and Network Attached Storage (NAS) services. The Kubernetes cluster dynamically creates the SLB instances, cloud disks, and VPC routing rules according to your configurations.
  • Creates an intranet SLB instance and exposes port 6443.
  • Creates an Internet SLB instance and exposes ports 6443, 8443, and 22. (If you enable the SSH logon for Internet access when creating the cluster, port 22 is exposed. Otherwise, port 22 is not exposed.)

Prerequisites

You have activated Container Service, Resource Orchestration Service (ROS), and RAM.

You have logged on to the Container Service console, ROS console, and RAM console to activate the corresponding services.

Note The deployment of Container Service Kubernetes clusters depends on the application deployment capabilities of Alibaba Cloud ROS. Therefore, you need to activate ROS before creating a Kubernetes cluster.

Limits

  • The SLB instances created along with the Kubernetes cluster support only the Pay-As-You-Go billing method.
  • The Kubernetes cluster supports only Virtual Private Cloud (VPC).
  • By default, each account has a quota on the number of cloud resources that it can create. If the quota has been reached, the account cannot create a cluster. Make sure that you have a sufficient resource quota before creating a cluster. You can open a ticket to increase your quota.
    • By default, each account can create up to 5 clusters across all regions and add up to 40 nodes to each cluster. You can open a ticket to create more clusters or nodes.
    • By default, each account can create up to 100 security groups.
    • By default, each account can create up to 60 Pay-As-You-Go SLB instances.
    • By default, each account can create up to 20 EIPs.
  • The limits for ECS instances are as follows:

Create a GN5 Kubernetes cluster

  1. Log on to the Container Service console.
  2. In the left-side navigation pane under Kubernetes, click Clusters.
  3. Click Create Kubernetes Cluster in the upper-right corner.


    By default, the Create Kubernetes Cluster page is displayed.
    Note To create a GPU cluster, the Worker nodes must use GPU ECS instance types. For information about other parameter settings, see Create a Kubernetes cluster.


  4. Set the Worker nodes. In this example, a gn5 GPU instance type is selected so that the Worker nodes serve as GPU worker nodes.
    1. If you choose to create Worker instances, you must select the instance type and the number of Worker nodes. In this example, two GPU nodes are created.


    2. If you choose to add existing instances, you must have already created GPU ECS instances in the region where the cluster is to be created.
  5. After you have completed all required settings, click Create to start cluster deployment.
  6. After the cluster is created, choose Clusters > Nodes in the left-side navigation pane.
  7. To view the GPU devices attached to a node, select the created cluster from the clusters drop-down list, select one of the created Worker nodes, and choose More > Details in the Action column.

Create a GPU experimental environment to run TensorFlow

Jupyter is a popular tool that data scientists use to build experimental environments for TensorFlow. This topic uses a Jupyter application as an example to describe the deployment procedure.

  1. Log on to the Container Service console.
  2. In the left-side navigation pane under Kubernetes, choose Applications > Deployments.
  3. Click Create by Template in the upper-right corner.
  4. Select the target cluster and namespace, and then select a sample template or a custom template from the resource type drop-down list. After you orchestrate your template, click DEPLOY.


    In this example, a Jupyter application template is orchestrated. The template includes a deployment and a service.

    ---
    # Define the tensorflow deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tf-notebook
      labels:
        app: tf-notebook
    spec:
      replicas: 1
      selector: # define how the deployment finds the pods it manages
        matchLabels:
          app: tf-notebook
      template: # define the pods specifications
        metadata:
          labels:
            app: tf-notebook
        spec:
          containers:
          - name: tf-notebook
            image: tensorflow/tensorflow:1.4.1-gpu-py3
            resources:
              limits:
                nvidia.com/gpu: 1                      # Specify the number of NVIDIA GPUs requested by the application
            ports:
            - containerPort: 8888
              hostPort: 8888
            env:
              - name: PASSWORD                         #specify the password used to access the Jupyter service. You can modify the password as needed.
                value: mypassw0rd
    
    # Define the tensorflow service
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: tf-notebook
    spec:
      ports:
      - port: 80
        targetPort: 8888
        name: jupyter
      selector:
        app: tf-notebook
      type: LoadBalancer                           #set Alibaba Cloud SLB service for the application so that its services are accessible from the Internet.

    If your cluster uses a GPU deployment solution for Kubernetes versions earlier than 1.9.3, you must also define the following volumes, which contain the NVIDIA driver files:

    volumes:
        - name: bin
          hostPath:
            path: /usr/lib/nvidia-375/bin
        - name: lib
          hostPath:
            path: /usr/lib/nvidia-375
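
    For completeness, a pre-1.9.3 template would also mount these volumes into the container. The following is a minimal sketch; the container-side mount paths shown here are assumptions and must match the paths where your image expects the driver binaries and libraries:

    volumeMounts:
        - name: bin
          mountPath: /usr/local/nvidia/bin           # driver executables such as nvidia-smi (assumed path)
        - name: lib
          mountPath: /usr/local/nvidia/lib           # driver shared libraries (assumed path)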

    When you orchestrate a deployment template for a cluster that uses a GPU deployment solution of Kubernetes earlier than 1.9.3, the template is tightly coupled to that cluster and is therefore not portable. In Kubernetes 1.9.3 and later, you do not need to specify these hostPaths because the NVIDIA device plugin automatically discovers the library links and executable files required by the drivers.

  5. In the left-side navigation pane under Container Service-Kubernetes, choose Discovery and Load Balancing > Services. Select the target cluster and namespace, and view the external endpoint of the tf-notebook service.


  6. Access the Jupyter application in a browser at http://EXTERNAL-IP. Enter the password that you set in the template.
  7. Run the following program to verify that the Jupyter application can use the GPU. The program lists all devices that TensorFlow can use:
    from tensorflow.python.client import device_lib

    def get_available_devices():
        # List all devices (CPUs and GPUs) that TensorFlow can see in this container
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos]

    print(get_available_devices())
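
    If the GPU has been allocated to the container, the output includes a GPU device (for example, /device:GPU:0, or /gpu:0 in older TensorFlow releases) in addition to the CPU device.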