
Container Service for Kubernetes: Submit an inference task to use shared GPU resources

Last Updated: Oct 10, 2023

In some scenarios, you may want to share a GPU among multiple inference tasks to improve GPU utilization. This topic describes how to use Arena to submit an inference task to use shared GPU resources.

Prerequisites

Procedure

  1. Run the following command to query the available GPU resources in the cluster:

    arena top node

    Expected output:

    NAME                      IPADDRESS       ROLE    STATUS    GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.1.108  192.168.20.255  <none>  Ready     0           0
    cn-beijing.192.168.8.10   192.168.8.10    <none>  Ready     0           0
    cn-beijing.192.168.1.101  192.168.1.101   <none>  Ready     1           0
    cn-beijing.192.168.1.112  192.168.1.112   <none>  Ready     1           0
    cn-beijing.192.168.8.252  192.168.8.252   <none>  Ready     1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    0/3 (0.0%)

    The preceding output shows that the cluster has three GPUs and that none of them are allocated (0/3, 0.0%).

  2. Use Arena to submit an inference task.

    Important
    • In this example, a TensorFlow inference task is submitted. The model file is built into the Docker image when the image is created.

    • If the model file is not built into the image, you must mount it from a shared NAS volume instead. For more information, see Configure a shared NAS volume.

    Run the following command to submit an inference task:

    arena serve tensorflow \
        --name=mymnist2 \
        --model-name=mnist \
        --gpumemory=3 \
        --gpucore=10 \
        --image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:latest-gpu-mnist \
        --model-path=/tfmodel/mnist \
        --version-policy=specific:2 \
        --data=mydata=/mnt/data

    The parameters in the preceding command are described as follows.

    --name
    The name of the task.

    --model-name
    The name of the model.

    --gpumemory
    The amount of GPU memory that is requested by the task, in GiB. For example, if a GPU has 8 GiB of memory and the first task requests 3 GiB of GPU memory (--gpumemory=3), the GPU still has 5 GiB of memory left. If the second task requests 4 GiB of GPU memory (--gpumemory=4), the two tasks can run on the same GPU.

    --gpucore
    The percentage of the GPU's computing power that is requested by the task. For example, if the first task requests 10% of the computing power of a GPU (--gpucore=10), the GPU still has 90% of the computing power left. If the second task requests 50% of the computing power (--gpucore=50), the two tasks can run on the same GPU. A combined example of two tasks sharing one GPU is shown after these descriptions.

    --image
    The image that is used to run the task.

    --model-path
    The path of the model in the container.

    --version-policy
    The version of the model that you want to use. For example, --version-policy=specific:2 specifies that version 2 of the model is used. Version 2 of the model must exist in the path that is specified by --model-path.

    --data
    The volume to mount and the path to which it is mounted in the container. In this example, the mydata volume is mounted to /mnt/data.
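
    Based on the --gpumemory and --gpucore descriptions above, a second task can be scheduled to the same GPU as long as the combined requests fit within the GPU's memory and computing power. The following command is a hypothetical example: the task name mymnist3 is illustrative, and the requested 4 GiB of memory and 50% of the computing power reuse the numbers from the descriptions above.

    arena serve tensorflow \
        --name=mymnist3 \
        --model-name=mnist \
        --gpumemory=4 \
        --gpucore=50 \
        --image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:latest-gpu-mnist \
        --model-path=/tfmodel/mnist \
        --version-policy=specific:2 \
        --data=mydata=/mnt/data

    Together with mymnist2 (3 GiB of memory and 10% of the computing power), the two tasks request 7 GiB of memory and 60% of the computing power in total, so they can share a single GPU that has 8 GiB of memory.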

  3. Run the following command to query all tasks:

    arena serve list

    Expected output:

    NAME      TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
    mymnist1  Tensorflow  202101162119  1        0          172.16.3.123  GRPC:8500,RESTFUL:8501
    mymnist2  Tensorflow  202101191447  1        1          172.16.1.147  GRPC:8500,RESTFUL:8501
  4. Run the following command to query the details of the submitted task:

    arena serve get mymnist2

    Expected output:

    Name:           mymnist2
    Namespace:      default
    Type:           Tensorflow
    Version:        202101191447
    Desired:        1
    Available:      1
    Age:            20m
    Address:        172.16.1.147
    Port:           GRPC:8500,RESTFUL:8501
    GPUMemory(GiB): 3
    
    Instances:
      NAME                                                       STATUS   AGE  READY  RESTARTS  GPU(Memory/GiB)  NODE
      ----                                                       ------   ---  -----  --------  ---------------  ----
      mymnist2-202101191447-tensorflow-serving-7f64bf9749-mtnpc  Running  20m  1/1    0         3                cn-beijing.192.168.1.112
    Note

    If the value of Desired equals the value of Available, the task is ready.
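
    If you want to wait for the task to become ready in a script instead of checking manually, you can poll the same command. The following is a minimal sketch that assumes the Desired and Available fields appear in the output of arena serve get exactly as shown above:

    # Poll until the Desired and Available counts of mymnist2 match.
    while true; do
      desired=$(arena serve get mymnist2 | awk '$1=="Desired:"{print $2}')
      available=$(arena serve get mymnist2 | awk '$1=="Available:"{print $2}')
      if [ -n "$desired" ] && [ "$desired" = "$available" ]; then
        echo "mymnist2 is ready"
        break
      fi
      sleep 10
    done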

  5. Optional: Run the following command to print task logs:

    arena serve logs mymnist2 -t 10
    Note

    -t 10 specifies that only the last 10 log entries are returned.

    Expected output:

    2021-01-18 13:21:58.482985: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
    2021-01-18 13:21:58.483673: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500005000 Hz
    2021-01-18 13:21:58.508734: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /tfmodel/mnist/2
    2021-01-18 13:21:58.513041: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 798017 microseconds.
    2021-01-18 13:21:58.513263: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /tfmodel/mnist/2/assets.extra/tf_serving_warmup_requests
    2021-01-18 13:21:58.513467: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: mnist2 version: 2}
    2021-01-18 13:21:58.516620: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
    [warn] getaddrinfo: address family for nodename not supported
    2021-01-18 13:21:58.521317: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:8501 ...
    [evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
  6. Deploy and verify the TensorFlow inference service.

    1. Create a file named tfserving-test-client.yaml with the following content:

      cat <<EOF > tfserving-test-client.yaml
      kind: Pod
      apiVersion: v1
      metadata:
        name: tfserving-test-client
      spec:
        containers:
        - name: test-client
          image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow-serving-test-client:curl
          command: ["sleep","infinity"]
          imagePullPolicy: IfNotPresent
      EOF
    2. Run the following command to deploy a pod:

      kubectl apply -f tfserving-test-client.yaml
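
      Before you continue, you can check that the test pod has started, for example:

      kubectl get pod tfserving-test-client

      The pod is available when its STATUS is Running.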
    3. Run the following command to query the IP address and port of the service:

      arena serve list

      The expected output indicates that the IP address of the mymnist2 task is 172.16.1.147 and the RESTful port is 8501.

      NAME      TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
      mymnist1  Tensorflow  202101162119  1        0          172.16.3.123  GRPC:8500,RESTFUL:8501
      mymnist2  Tensorflow  202101191447  1        1          172.16.1.147  GRPC:8500,RESTFUL:8501
    4. Run the following command to check whether the TensorFlow service is available:

      kubectl exec -ti tfserving-test-client -- bash validate.sh 172.16.1.147 8501

      Expected output:

      {
          "predictions": [[2.04608277e-05, 1.72721537e-09, 7.74099826e-05, 0.00364777911, 1.25222937e-06, 2.27521796e-05, 1.14668763e-08, 0.99597472, 3.68833389e-05, 0.000218785644]
          ]
      }

      The output indicates the following information:

      • The data sent by the validate.sh script is the pixel values of an image from the MNIST dataset.

      • Among the digits 0 to 9, the model predicts that the input image is the digit 7 with the highest probability, 0.99597472.
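
      The validate.sh script sends a request to the TensorFlow Serving RESTful endpoint of the service. If you want to send a request manually from the test pod, you can call the standard TensorFlow Serving predict API as sketched below, assuming the test client image includes curl. Replace <PIXEL_VALUES> with input that matches the model's signature, for example the pixel values of a 28 x 28 MNIST image.

      # Manually call the TensorFlow Serving RESTful predict API of the mnist model.
      kubectl exec -ti tfserving-test-client -- \
        curl -s -X POST http://172.16.1.147:8501/v1/models/mnist:predict \
        -H "Content-Type: application/json" \
        -d '{"instances": [<PIXEL_VALUES>]}'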