This topic describes how to use GPU container instances, using image recognition with TensorFlow as an example. This feature is applicable to Serverless Kubernetes clusters and to virtual nodes in Kubernetes clusters.

Background information

Based on elastic container instances (ECIs), ACK Serverless (Serverless Kubernetes) supports GPU container instances, so users can quickly run AI computing tasks in a serverless manner. This facilitates the operations and maintenance of the AI platform and significantly improves the computing efficiency.

AI computing depends on GPU resources. However, building a GPU cluster environment is a complicated task that includes choosing GPU specifications, preparing machines, and installing drivers and the container environment. The serverless delivery of GPU resources provides users with standard, out-of-the-box resources. Users can use GPU resources without purchasing machines or logging on to nodes to install GPU drivers. This simplifies deployment of the AI platform and allows users to focus on developing AI models and applications instead of building and maintaining infrastructure. GPU/CPU resources are ready to use and easy to obtain. Compared with the subscription billing method, the pay-as-you-go billing method reduces both costs and resource consumption.

To create a GPU-mounted pod in ACK Serverless, use an annotation to specify the GPU type, and specify the number of GPU instances in resources.limits. Each pod occupies its GPU exclusively. GPU instances are billed at the same rates as the corresponding ECS GPU types, and no extra charges are incurred.
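If you generate or validate pod manifests in code, the two settings described above (GPU type in the annotation, GPU count in resources.limits) can be sketched in Python. The dict mirrors the YAML manifest used later in this topic; the gpu_request helper is purely illustrative and not part of any Kubernetes SDK:

```python
# Minimal sketch (not an SDK call): represent the pod manifest from this topic
# as a plain dict and read back the GPU type and count it requests.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "tensorflow",
        # The GPU type is selected through this ECI annotation.
        "annotations": {"k8s.aliyun.com/eci-gpu-type": "P4"},
    },
    "spec": {
        "containers": [
            {
                "name": "tensorflow",
                "image": "registry-vpc.cn-hangzhou.aliyuncs.com/ack-serverless/tensorflow",
                # The number of GPU instances goes into resources.limits.
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        ],
        "restartPolicy": "OnFailure",
    },
}

def gpu_request(pod):
    """Return (gpu_type, gpu_count) requested by a pod manifest."""
    gpu_type = pod["metadata"].get("annotations", {}).get("k8s.aliyun.com/eci-gpu-type")
    gpu_count = sum(
        int(c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0))
        for c in pod["spec"]["containers"]
    )
    return gpu_type, gpu_count

print(gpu_request(pod))  # -> ('P4', 1)
```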

Note vGPU-mounted pods are not supported for the time being.

Prerequisites

You have created a Serverless Kubernetes cluster, or a virtual node has been created in a Kubernetes cluster.

The following describes how to recognize images with TensorFlow in a Serverless Kubernetes cluster.
  1. Log on to the Container Service console.
  2. In the left-side navigation pane under Container Service - Kubernetes, choose Applications > Deployments. On the Deployments page that appears, click Create from Template in the upper-right corner.
  3. Select the cluster and namespace, select a sample template or Custom from the Sample Template drop-down list, and click Create.
    You can use the following YAML template to create a pod. In this example, the specified GPU type in the pod is P4 and the number of GPU instances is 1.
    apiVersion: v1
    kind: Pod
    metadata:
      name: tensorflow
      annotations:
        k8s.aliyun.com/eci-gpu-type: "P4"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/ack-serverless/tensorflow
        name: tensorflow
        command:
        - "sh"
        - "-c"
        - "python models/tutorials/image/imagenet/classify_image.py"
        resources:
          limits:
            nvidia.com/gpu: "1"
      restartPolicy: OnFailure
  4. Wait a few minutes. In the left-side navigation pane, choose Applications > Pods. If the following content appears, the pod was created successfully.
    Pods
  5. Click the target pod. On the Pods - tensorflow page, click the Logs tab. If the following content appears, the image recognition is successful.
    Pods - tensorflow
If you want to use this feature on a virtual node in an ACK cluster, see Virtual nodes. You must schedule the pod to the virtual node, or create the pod in a namespace that has the virtual-node-affinity-injection=enabled label. Then, replace the YAML file in step 3 with the following file. Example:
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
  annotations:
    k8s.aliyun.com/eci-gpu-type: "P4"
spec:
  containers:
  - image: registry-vpc.cn-hangzhou.aliyuncs.com/ack-serverless/tensorflow
    name: tensorflow
    command:
    - "sh"
    - "-c"
    - "python models/tutorials/image/imagenet/classify_image.py"
    resources:
      limits:
        nvidia.com/gpu: "1"
  restartPolicy: OnFailure
  nodeName: virtual-kubelet
Note The virtual node-based method supports multiple deep learning frameworks, including Kubeflow, arena, and other custom resource definitions (CRDs).
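As an alternative to setting nodeName, pods created in a namespace that carries the virtual-node-affinity-injection=enabled label (mentioned above) are scheduled to the virtual node automatically. A minimal sketch of such a namespace follows; the name vk-demo is a hypothetical example:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vk-demo                               # hypothetical name; use your own
  labels:
    virtual-node-affinity-injection: "enabled" # pods in this namespace go to the virtual node
```

With this namespace in place, the nodeName field in the pod YAML is not needed for pods created in vk-demo.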