Container Service for Kubernetes: Submit an inference task that uses shared GPU resources

Last Updated: Dec 25, 2025

In some scenarios, you may want to share a GPU among multiple inference tasks to improve GPU utilization. This topic describes how to use Arena to submit an inference task that uses shared GPU resources.

Prerequisites

Procedure

  1. Run the following command to query the available GPU resources in the cluster:

    arena top node

    Expected output:

    NAME                      IPADDRESS       ROLE    STATUS    GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.1.108  192.168.20.255  <none>  Ready     0           0
    cn-beijing.192.168.8.10   192.168.8.10    <none>  Ready     0           0
    cn-beijing.192.168.1.101  192.168.1.101   <none>  Ready     1           0
    cn-beijing.192.168.1.112  192.168.1.112   <none>  Ready     1           0
    cn-beijing.192.168.8.252  192.168.8.252   <none>  Ready     1           0
    ---------------------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    0/3 (0.0%)

    The preceding output shows that the cluster has three GPUs, none of which are allocated (0/3, 0.0%).
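    If you prefer plain kubectl, you can cross-check the GPU resources that each node advertises. This is a minimal sketch: the exact extended resource names depend on how GPU scheduling is configured in your cluster, for example aliyun.com/gpu-mem for shared GPUs or nvidia.com/gpu for exclusive GPUs.

    # List every GPU-related resource line from the node descriptions.
    kubectl describe nodes | grep -i gpu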

  2. Use Arena to submit an inference task.

    Important
    • This example submits a TensorFlow inference task. The trained model was built into the Docker image when the image was created.

    • If you have not added the model file to the image, you need to configure a shared NAS volume. For more information, see Configure a shared NAS volume.

    Run the following command to submit an inference task:

    arena serve tensorflow \
        --name=mymnist2 \
        --model-name=mnist \
        --gpumemory=3 \
        --gpucore=10 \
        --image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:latest-gpu-mnist \
        --model-path=/tfmodel/mnist \
        --version-policy=specific:2 \
        --data=mydata=/mnt/data

    The following list describes the parameters:

    • --name: The name of the inference task.

    • --model-name: The name of the model.

    • --gpumemory: The amount of GPU memory to request, in GiB. For example, if a GPU has 8 GiB of memory and the first task requests 3 GiB (--gpumemory=3), 5 GiB remains, so a second task that requests 4 GiB (--gpumemory=4) can run on the same GPU.

    • --gpucore: The percentage of the GPU's computing power to request. A GPU provides 100 units of computing power in total. For example, if the first task requests 10% of the computing power (--gpucore=10), 90% remains, so a second task that requests 50% (--gpucore=50) can run on the same GPU. See the example after this list.

    • --image: The image that is used to run the task.

    • --model-path: The path of the model in the container.

    • --version-policy: The model version to serve. For example, --version-policy=specific:2 specifies that version 2 of the model is used. A folder named 2 must exist in the path specified by --model-path, as shown in the layout sketch after this list.

    • --data: The volume to mount and the path where it is mounted in the container. In this example, the volume mydata is mounted to /mnt/data.
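    For --version-policy=specific:2 to load, the model path must contain a numeric version subfolder. The following layout is a sketch based on the standard TensorFlow SavedModel format; the exact files depend on your model:

    /tfmodel/mnist
    └── 2
        ├── saved_model.pb
        └── variables/

    Following the figures in the --gpumemory and --gpucore descriptions, a second task can share the same GPU as mymnist2. The command below is a sketch: the task name mymnist3 is hypothetical, and the other values reuse this topic's example.

    arena serve tensorflow \
        --name=mymnist3 \
        --model-name=mnist \
        --gpumemory=4 \
        --gpucore=50 \
        --image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:latest-gpu-mnist \
        --model-path=/tfmodel/mnist \
        --version-policy=specific:2 \
        --data=mydata=/mnt/data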

  3. Run the following command to query all tasks:

    arena serve list

    The following is an example of the output:

    NAME      TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
    mymnist1  Tensorflow  202101162119  1        0          172.16.3.123  GRPC:8500,RESTFUL:8501
    mymnist2  Tensorflow  202101191447  1        1          172.16.1.147  GRPC:8500,RESTFUL:8501
  4. Run the following command to query the details of the submitted task:

    arena serve get mymnist2

    Expected output:

    Name:           mymnist2
    Namespace:      default
    Type:           Tensorflow
    Version:        202101191447
    Desired:        1
    Available:      1
    Age:            20m
    Address:        172.16.1.147
    Port:           GRPC:8500,RESTFUL:8501
    GPUMemory(GiB): 3
    
    Instances:
      NAME                                                       STATUS   AGE  READY  RESTARTS  GPU(Memory/GiB)  NODE
      ----                                                       ------   ---  -----  --------  ---------------  ----
      mymnist2-202101191447-tensorflow-serving-7f64bf9749-mtnpc  Running  20m  1/1    0         3                cn-beijing.192.168.1.112
    Note

    If the value of Desired equals the value of Available, the task is ready.
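    You can also cross-check with plain kubectl. As the instance list above shows, the serving pod name starts with the task name and version:

    # Filter the pod list for the serving pod of the mymnist2 task.
    kubectl get pods | grep mymnist2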

  5. Optional: Run the following command to print task logs:

    arena serve logs mymnist2 -t 10
    Note

    -t 10 displays the last 10 lines of the log.

    The system returns output similar to the following:

    2021-01-18 13:21:58.482985: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
    2021-01-18 13:21:58.483673: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500005000 Hz
    2021-01-18 13:21:58.508734: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /tfmodel/mnist/2
    2021-01-18 13:21:58.513041: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 798017 microseconds.
    2021-01-18 13:21:58.513263: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /tfmodel/mnist/2/assets.extra/tf_serving_warmup_requests
    2021-01-18 13:21:58.513467: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: mnist2 version: 2}
    2021-01-18 13:21:58.516620: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
    [warn] getaddrinfo: address family for nodename not supported
    2021-01-18 13:21:58.521317: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:8501 ...
    [evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
  6. Deploy and verify the TensorFlow inference service.

    1. Create a file named tfserving-test-client.yaml that contains the following content:

      kind: Pod
      apiVersion: v1
      metadata:
        name: tfserving-test-client
      spec:
        containers:
        - name: test-client
          image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow-serving-test-client:curl
          command: ["sleep","infinity"]
          imagePullPolicy: IfNotPresent
    2. Run the following command to deploy a pod:

      kubectl apply -f tfserving-test-client.yaml
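      Optionally, wait for the test pod to become ready before you continue. kubectl wait is a standard command; the 120-second timeout is an arbitrary choice:

      # Block until the test client pod reports the Ready condition.
      kubectl wait --for=condition=Ready pod/tfserving-test-client --timeout=120s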
    3. Run the following command to query the IP address and port of the service:

      arena serve list

      The output is similar to the following. The IP address of mymnist2 is 172.16.1.147, and the RESTful port is 8501.

      NAME      TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
      mymnist1  Tensorflow  202101162119  1        0          172.16.3.123  GRPC:8500,RESTFUL:8501
      mymnist2  Tensorflow  202101191447  1        1          172.16.1.147  GRPC:8500,RESTFUL:8501
    4. Run the following command to verify that the TensorFlow inference service is available:

      kubectl exec -ti tfserving-test-client -- bash validate.sh 172.16.1.147 8501

      Expected output:

      {
          "predictions": [
              [2.04608277e-05, 1.72721537e-09, 7.74099826e-05, 0.00364777911, 1.25222937e-06, 2.27521796e-05, 1.14668763e-08, 0.99597472, 3.68833389e-05, 0.000218785644]
          ]
      }

      The output indicates the following information:

      • The request data in the validate.sh script is a list of pixel values from an image in the MNIST test dataset.

      • Among the digits 0 to 9, the model assigns the highest probability, 0.99597472 (the eighth value in the list), to the digit 7.
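      If you want to query the service without the validate.sh script, you can call the TensorFlow Serving RESTful API directly from a pod in the cluster, because 172.16.1.147 is a cluster-internal address. The endpoint format below is standard for TensorFlow Serving, but the request body is a sketch: 784 zero-valued pixels stand in for a real MNIST image, so this checks connectivity rather than producing a meaningful prediction, and a model that expects a different input shape returns a shape error instead.

      # Build a dummy 784-pixel instance (requires python3 in the shell where
      # you run this) and POST it to the model's predict endpoint.
      PIXELS=$(python3 -c 'print("[" + ",".join(["0.0"] * 784) + "]")')
      curl -s -X POST "http://172.16.1.147:8501/v1/models/mnist:predict" \
           -d "{\"instances\": [${PIXELS}]}"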