
Container Service for Kubernetes:Submit an inference task that uses shared GPU resources

Last Updated: Mar 26, 2026

Shared GPU scheduling lets multiple inference tasks run on a single GPU simultaneously, improving GPU utilization. This guide walks you through submitting a TensorFlow inference task using Arena with shared GPU resources.

Prerequisites

Before you begin, ensure that the prerequisites for shared GPU scheduling are met in your cluster.

How shared GPU scheduling works

When you submit an inference task with --gpumemory and --gpucore, Arena schedules the task on a GPU that has enough free memory and computing power to accommodate the request—without requiring a full GPU.

The following table shows how two tasks can share an 8 GiB GPU with 100 units of computing power:

Resource dimension            Task A request  Task B request  Fits on one GPU?
Memory (--gpumemory)          3 GiB           4 GiB           Yes (3 + 4 = 7 GiB, within 8 GiB)
Computing power (--gpucore)   10%             50%             Yes (10 + 50 = 60%, within 100%)
Note

--gpucore specifies the percentage of computing power to request.
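
The fit check in the table can be sketched with simple shell arithmetic. This is only an illustration of the scheduling rule: the 8 GiB and 100% capacities are the example GPU from the table, not values Arena reports.

```shell
# Sketch: check whether the two requests from the table above fit on one
# example GPU with 8 GiB of memory and 100 units of computing power.
mem_a=3;  mem_b=4      # GiB, from --gpumemory
core_a=10; core_b=50   # percent, from --gpucore
if [ $((mem_a + mem_b)) -le 8 ] && [ $((core_a + core_b)) -le 100 ]; then
    echo "fits on one GPU"
else
    echo "needs separate GPUs"
fi
```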

Submit an inference task

Step 1: Check available GPU resources

Run the following command to see how many GPUs are available in the cluster:

arena top node

Expected output:

NAME                      IPADDRESS       ROLE    STATUS    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.1.108  192.168.20.255  <none>  Ready     0           0
cn-beijing.192.168.8.10   192.168.8.10    <none>  Ready     0           0
cn-beijing.192.168.1.101  192.168.1.101   <none>  Ready     1           0
cn-beijing.192.168.1.112  192.168.1.112   <none>  Ready     1           0
cn-beijing.192.168.8.252  192.168.8.252   <none>  Ready     1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0.0%)

The cluster has 3 GPUs, all unallocated.
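
The summary line can also be recomputed from the per-node columns. The following awk sketch runs over the sample output above, inlined as a heredoc; in a live cluster you would pipe the real `arena top node` output instead.

```shell
# Sketch: sum the GPU(Total) and GPU(Allocated) columns of the sample
# `arena top node` output (inlined here as a heredoc for illustration).
summary=$(awk 'NR > 1 { total += $5; alloc += $6 } END { printf "%d/%d", alloc, total }' <<'EOF'
NAME                      IPADDRESS       ROLE    STATUS    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.1.108  192.168.20.255  <none>  Ready     0           0
cn-beijing.192.168.8.10   192.168.8.10    <none>  Ready     0           0
cn-beijing.192.168.1.101  192.168.1.101   <none>  Ready     1           0
cn-beijing.192.168.1.112  192.168.1.112   <none>  Ready     1           0
cn-beijing.192.168.8.252  192.168.8.252   <none>  Ready     1           0
EOF
)
echo "Allocated/Total GPUs: $summary"   # 0/3 for the sample above
```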

Step 2: Submit the inference task

Run the following command to submit a TensorFlow inference task:

arena serve tensorflow \
    --name=mymnist2 \
    --model-name=mnist \
    --gpumemory=3 \
    --gpucore=10 \
    --image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:latest-gpu-mnist \
    --model-path=/tfmodel/mnist \
    --version-policy=specific:2 \
    --data=mydata=/mnt/data
Important

This example uses a TensorFlow model bundled into the Docker image at build time. If your model file is stored separately, mount it via a shared NAS volume before submitting the task. See Configure a shared NAS volume.

The following table describes the key parameters.

Parameter         Type             Description
--name            String           Name of the inference task.
--model-name      String           Name of the model.
--gpumemory       Integer (GiB)    GPU memory to request.
--gpucore         Integer (0-100)  Percentage of GPU computing power to request.
--image           String           Container image used to run the task.
--model-path      String           Path to the model inside the container.
--version-policy  String           Model version to load. specific:2 loads version 2, which must exist as a subfolder under --model-path.
--data            String           Volume mount in <volume-name>=<mount-path> format.
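
The specific:2 policy assumes the model directory contains a numeric version subfolder. The following sketch illustrates that layout in a temporary local directory; the paths are illustrative, not the real paths inside the image.

```shell
# Sketch: the directory layout that --version-policy=specific:2 expects
# under --model-path, illustrated in a temporary directory.
model_path=$(mktemp -d)/mnist
mkdir -p "$model_path/2"   # version 2 lives in a subfolder named "2"
ls "$model_path"           # prints: 2
```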

Step 3: Verify the task is running

Run the following command to list all inference tasks:

arena serve list

Example output:

NAME      TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
mymnist1  Tensorflow  202101162119  1        0          172.16.3.123  GRPC:8500,RESTFUL:8501
mymnist2  Tensorflow  202101191447  1        1          172.16.1.147  GRPC:8500,RESTFUL:8501

Run the following command to get details for mymnist2:

arena serve get mymnist2

Expected output:

Name:           mymnist2
Namespace:      default
Type:           Tensorflow
Version:        202101191447
Desired:        1
Available:      1
Age:            20m
Address:        172.16.1.147
Port:           GRPC:8500,RESTFUL:8501
GPUMemory(GiB): 3

Instances:
  NAME                                                       STATUS   AGE  READY  RESTARTS  GPU(Memory/GiB)  NODE
  ----                                                       ------   ---  -----  --------  ---------------  ----
  mymnist2-202101191447-tensorflow-serving-7f64bf9749-mtnpc  Running  20m  1/1    0         3                cn-beijing.192.168.1.112

If the value of Desired equals the value of Available, the task is ready.
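
That readiness rule can be sketched as a small script. The sample `arena serve get` fields are inlined here; in a live cluster you would pipe the real command output instead of the heredoc.

```shell
# Sketch: parse Desired and Available from `arena serve get` output and
# compare them (sample output inlined for illustration).
status=$(cat <<'EOF'
Desired:        1
Available:      1
EOF
)
desired=$(printf '%s\n' "$status" | awk '/^Desired:/ {print $2}')
available=$(printf '%s\n' "$status" | awk '/^Available:/ {print $2}')
[ "$desired" = "$available" ] && echo "task is ready"
```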

Step 4: (Optional) View task logs

Run the following command to print the last 10 lines of logs:

arena serve logs mymnist2 -t 10

Example output:

2021-01-18 13:21:58.482985: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2021-01-18 13:21:58.483673: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500005000 Hz
2021-01-18 13:21:58.508734: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /tfmodel/mnist/2
2021-01-18 13:21:58.513041: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 798017 microseconds.
2021-01-18 13:21:58.513263: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /tfmodel/mnist/2/assets.extra/tf_serving_warmup_requests
2021-01-18 13:21:58.513467: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: mnist2 version: 2}
2021-01-18 13:21:58.516620: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2021-01-18 13:21:58.521317: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
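
One line in those logs is the signal that the model actually loaded. A grep sketch over two of the sample log lines above (in a live cluster, pipe `arena serve logs mymnist2` instead of the heredoc):

```shell
# Sketch: count the log lines confirming a servable loaded successfully
# (sample log lines inlined for illustration).
loaded=$(grep -c "Successfully loaded servable" <<'EOF'
2021-01-18 13:21:58.513467: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: mnist2 version: 2}
2021-01-18 13:21:58.516620: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
EOF
)
echo "loaded servables: $loaded"   # 1
```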

Verify the inference service

  1. Create a test client pod. Save the following content as tfserving-test-client.yaml:

    cat <<EOF > tfserving-test-client.yaml
    kind: Pod
    apiVersion: v1
    metadata:
      name: tfserving-test-client
    spec:
      containers:
      - name: test-client
        image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow-serving-test-client:curl
        command: ["sleep","infinity"]
        imagePullPolicy: IfNotPresent
    EOF
  2. Deploy the pod:

    kubectl apply -f tfserving-test-client.yaml
  3. Get the IP address and port of the inference service:

    arena serve list

    From the output, mymnist2 is available at 172.16.1.147:8501.

  4. Send a test request to verify the service:

    kubectl exec -ti tfserving-test-client -- bash validate.sh 172.16.1.147 8501

    Expected output:

    {
        "predictions": [
            [2.04608277e-05, 1.72721537e-09, 7.74099826e-05, 0.00364777911, 1.25222937e-06, 2.27521796e-05, 1.14668763e-08, 0.99597472, 3.68833389e-05, 0.000218785644]
        ]
    }

    The validate.sh script sends the pixel values of an MNIST test image to the service. The highest probability in the response, 0.99597472, is at index 7, so the model predicts the digit 7, confirming that the service is working correctly.
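
Reading the predicted digit out of the response is an argmax over the probability vector. An awk sketch over the prediction values returned above:

```shell
# Sketch: find the predicted digit as the index (0-9) of the highest
# probability in the predictions vector from the sample response.
digit=$(echo "2.04608277e-05 1.72721537e-09 7.74099826e-05 0.00364777911 1.25222937e-06 2.27521796e-05 1.14668763e-08 0.99597472 3.68833389e-05 0.000218785644" |
    awk '{ best = 1; for (i = 2; i <= NF; i++) if ($i > $best) best = i; print best - 1 }')
echo "predicted digit: $digit"   # 7
```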

What's next

  • To submit more inference tasks that share the same GPU, repeat Step 2 with different --name values and adjust --gpumemory and --gpucore so the combined requests fit within the GPU's available resources.