
Container Service for Kubernetes:Submit an inference task that uses shared GPU resources

Last Updated: Mar 26, 2026

Shared GPU scheduling lets multiple inference tasks run on a single GPU simultaneously, improving GPU utilization. This guide walks you through submitting a TensorFlow inference task using Arena with shared GPU resources.

Prerequisites

Before you begin, ensure that the prerequisites for shared GPU scheduling are met in your cluster.

How shared GPU scheduling works

When you submit an inference task with --gpumemory and --gpucore, Arena schedules the task on a GPU that has enough free memory and computing power to accommodate the request—without requiring a full GPU.

The following table shows how two tasks can share an 8 GiB GPU with 100 units of computing power:

Resource dimension            Task A request  Task B request  Fits on one GPU?
Memory (--gpumemory)          3 GiB           4 GiB           Yes (3 + 4 = 7 GiB, within 8 GiB)
Computing power (--gpucore)   10%             50%             Yes (10 + 50 = 60%, within 100%)
Note

--gpucore specifies the percentage of computing power to request.
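
The fit check in the table can be sketched with simple shell arithmetic. This is only an illustration of the scheduling rule: the 8 GiB and 100% capacities are the example GPU from the table, not values Arena reports.

```shell
# Sketch: check whether the two requests from the table above fit on one
# example GPU with 8 GiB of memory and 100 units of computing power.
mem_a=3;  mem_b=4      # GiB, from --gpumemory
core_a=10; core_b=50   # percent, from --gpucore
if [ $((mem_a + mem_b)) -le 8 ] && [ $((core_a + core_b)) -le 100 ]; then
    echo "fits on one GPU"
else
    echo "needs separate GPUs"
fi
```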

Submit an inference task

Step 1: Check available GPU resources

Run the following command to see how many GPUs are available in the cluster:

arena top node

Expected output:

NAME                      IPADDRESS       ROLE    STATUS    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.1.108  192.168.20.255  <none>  Ready     0           0
cn-beijing.192.168.8.10   192.168.8.10    <none>  Ready     0           0
cn-beijing.192.168.1.101  192.168.1.101   <none>  Ready     1           0
cn-beijing.192.168.1.112  192.168.1.112   <none>  Ready     1           0
cn-beijing.192.168.8.252  192.168.8.252   <none>  Ready     1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0.0%)

The cluster has 3 GPUs, all unallocated.
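
The summary line can also be recomputed from the per-node columns. The following awk sketch runs over the sample output above, inlined as a heredoc; in a live cluster you would pipe the real `arena top node` output instead.

```shell
# Sketch: sum the GPU(Total) and GPU(Allocated) columns of the sample
# `arena top node` output (inlined here as a heredoc for illustration).
summary=$(awk 'NR > 1 { total += $5; alloc += $6 } END { printf "%d/%d", alloc, total }' <<'EOF'
NAME                      IPADDRESS       ROLE    STATUS    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.1.108  192.168.20.255  <none>  Ready     0           0
cn-beijing.192.168.8.10   192.168.8.10    <none>  Ready     0           0
cn-beijing.192.168.1.101  192.168.1.101   <none>  Ready     1           0
cn-beijing.192.168.1.112  192.168.1.112   <none>  Ready     1           0
cn-beijing.192.168.8.252  192.168.8.252   <none>  Ready     1           0
EOF
)
echo "Allocated/Total GPUs: $summary"   # 0/3 for the sample above
```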

Step 2: Submit the inference task

Run the following command to submit a TensorFlow inference task:

arena serve tensorflow \
    --name=mymnist2 \
    --model-name=mnist \
    --gpumemory=3 \
    --gpucore=10 \
    --image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:latest-gpu-mnist \
    --model-path=/tfmodel/mnist \
    --version-policy=specific:2 \
    --data=mydata=/mnt/data
Important

This example uses a TensorFlow model bundled into the Docker image at build time. If your model file is stored separately, mount it via a shared NAS volume before submitting the task. See Configure a shared NAS volume.

The following table describes the key parameters.

Parameter         Type             Description
--name            String           Name of the inference task.
--model-name      String           Name of the model.
--gpumemory       Integer (GiB)    GPU memory to request.
--gpucore         Integer (0-100)  Percentage of GPU computing power to request.
--image           String           Container image used to run the task.
--model-path      String           Path to the model inside the container.
--version-policy  String           Model version to load. specific:2 loads version 2, which must exist as a subfolder under --model-path.
--data            String           Volume mount in <volume-name>=<mount-path> format.
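
The specific:2 policy assumes the model directory contains a numeric version subfolder. The following sketch illustrates that layout in a temporary local directory; the paths are illustrative, not the real paths inside the image.

```shell
# Sketch: the directory layout that --version-policy=specific:2 expects
# under --model-path, illustrated in a temporary directory.
model_path=$(mktemp -d)/mnist
mkdir -p "$model_path/2"   # version 2 lives in a subfolder named "2"
ls "$model_path"           # prints: 2
```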

Step 3: Verify the task is running

Run the following command to list all inference tasks:

arena serve list

Example output:

NAME      TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
mymnist1  Tensorflow  202101162119  1        0          172.16.3.123  GRPC:8500,RESTFUL:8501
mymnist2  Tensorflow  202101191447  1        1          172.16.1.147  GRPC:8500,RESTFUL:8501

Run the following command to get details for mymnist2:

arena serve get mymnist2

Expected output:

Name:           mymnist2
Namespace:      default
Type:           Tensorflow
Version:        202101191447
Desired:        1
Available:      1
Age:            20m
Address:        172.16.1.147
Port:           GRPC:8500,RESTFUL:8501
GPUMemory(GiB): 3

Instances:
  NAME                                                       STATUS   AGE  READY  RESTARTS  GPU(Memory/GiB)  NODE
  ----                                                       ------   ---  -----  --------  ---------------  ----
  mymnist2-202101191447-tensorflow-serving-7f64bf9749-mtnpc  Running  20m  1/1    0         3                cn-beijing.192.168.1.112

If the value of Desired equals the value of Available, the task is ready.
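
That readiness rule can be sketched as a small script. The sample `arena serve get` fields are inlined here; in a live cluster you would pipe the real command output instead of the heredoc.

```shell
# Sketch: parse Desired and Available from `arena serve get` output and
# compare them (sample output inlined for illustration).
status=$(cat <<'EOF'
Desired:        1
Available:      1
EOF
)
desired=$(printf '%s\n' "$status" | awk '/^Desired:/ {print $2}')
available=$(printf '%s\n' "$status" | awk '/^Available:/ {print $2}')
[ "$desired" = "$available" ] && echo "task is ready"
```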

Step 4: (Optional) View task logs

Run the following command to print the last 10 lines of logs:

arena serve logs mymnist2 -t 10

Example output:

2021-01-18 13:21:58.482985: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2021-01-18 13:21:58.483673: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500005000 Hz
2021-01-18 13:21:58.508734: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /tfmodel/mnist/2
2021-01-18 13:21:58.513041: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 798017 microseconds.
2021-01-18 13:21:58.513263: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /tfmodel/mnist/2/assets.extra/tf_serving_warmup_requests
2021-01-18 13:21:58.513467: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: mnist2 version: 2}
2021-01-18 13:21:58.516620: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2021-01-18 13:21:58.521317: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
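
One line in those logs is the signal that the model actually loaded. A grep sketch over two of the sample log lines above (in a live cluster, pipe `arena serve logs mymnist2` instead of the heredoc):

```shell
# Sketch: count the log lines confirming a servable loaded successfully
# (sample log lines inlined for illustration).
loaded=$(grep -c "Successfully loaded servable" <<'EOF'
2021-01-18 13:21:58.513467: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: mnist2 version: 2}
2021-01-18 13:21:58.516620: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
EOF
)
echo "loaded servables: $loaded"   # 1
```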

Verify the inference service

  1. Create a test client pod. Save the following content as tfserving-test-client.yaml:

    cat <<EOF > tfserving-test-client.yaml
    kind: Pod
    apiVersion: v1
    metadata:
      name: tfserving-test-client
    spec:
      containers:
      - name: test-client
        image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow-serving-test-client:curl
        command: ["sleep","infinity"]
        imagePullPolicy: IfNotPresent
    EOF
  2. Deploy the pod:

    kubectl apply -f tfserving-test-client.yaml
  3. Get the IP address and port of the inference service:

    arena serve list

    From the output, mymnist2 is available at 172.16.1.147:8501.

  4. Send a test request to verify the service:

    kubectl exec -ti tfserving-test-client -- bash validate.sh 172.16.1.147 8501

    Expected output:

    {
        "predictions": [
            [2.04608277e-05, 1.72721537e-09, 7.74099826e-05, 0.00364777911, 1.25222937e-06, 2.27521796e-05, 1.14668763e-08, 0.99597472, 3.68833389e-05, 0.000218785644]
        ]
    }

    The validate.sh script sends the pixel values of an MNIST test image to the service. The highest probability in the response, 0.99597472, is at index 7, so the model predicts the digit 7, confirming that the service is working correctly.
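
Reading the predicted digit out of the response is an argmax over the probability vector. An awk sketch over the prediction values returned above:

```shell
# Sketch: find the predicted digit as the index (0-9) of the highest
# probability in the predictions vector from the sample response.
digit=$(echo "2.04608277e-05 1.72721537e-09 7.74099826e-05 0.00364777911 1.25222937e-06 2.27521796e-05 1.14668763e-08 0.99597472 3.68833389e-05 0.000218785644" |
    awk '{ best = 1; for (i = 2; i <= NF; i++) if ($i > $best) best = i; print best - 1 }')
echo "predicted digit: $digit"   # 7
```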

What's next

  • To submit more inference tasks that share the same GPU, repeat Step 2 with different --name values and adjust --gpumemory and --gpucore so the combined requests fit within the GPU's available resources.