Shared GPU scheduling lets multiple inference tasks run on a single GPU simultaneously, improving GPU utilization. This guide walks you through submitting a TensorFlow inference task using Arena with shared GPU resources.
Prerequisites
Before you begin, ensure that you have:
- An ACK Pro cluster with Kubernetes 1.18.8 or later. See Create an ACK Pro cluster.
- Arena 0.5.0 or later installed. See Configure the Arena client.
How shared GPU scheduling works
When you submit an inference task with `--gpumemory` and `--gpucore`, Arena schedules the task on a GPU that has enough free memory and computing power to accommodate the request, without requiring a full GPU.
The following table shows how two tasks can share an 8 GiB GPU with 100 units of computing power:
| Resource dimension | Task A request | Task B request | Fits on one GPU? |
|---|---|---|---|
| Memory (`--gpumemory`) | 3 GiB | 4 GiB | Yes (3 + 4 = 7 GiB, within 8 GiB) |
| Computing power (`--gpucore`) | 10% | 50% | Yes (10 + 50 = 60%, within 100%) |
`--gpucore` specifies the percentage of a single GPU's computing power to request, as an integer from 0 to 100.
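The fit check illustrated in the table above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual ACK scheduler logic, and the 8 GiB / 100% totals are taken from the example GPU:

```python
# Simplified sketch of the shared-GPU fit check (assumption: the real
# scheduling decision is made by the ACK GPU-sharing components; this
# only mirrors the memory and computing-power arithmetic shown above).

def fits_on_gpu(requests, total_mem_gib=8, total_core_pct=100):
    """Return True if the combined (memory GiB, core %) requests fit on one GPU."""
    used_mem = sum(mem for mem, _ in requests)
    used_core = sum(core for _, core in requests)
    return used_mem <= total_mem_gib and used_core <= total_core_pct

# Task A requests 3 GiB / 10%, Task B requests 4 GiB / 50%
print(fits_on_gpu([(3, 10), (4, 50)]))  # True: 7 GiB <= 8 GiB, 60% <= 100%
```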
Submit an inference task
Step 1: Check available GPU resources
Run the following command to see how many GPUs are available in the cluster:
```
arena top node
```

Expected output:

```
NAME                      IPADDRESS       ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.1.108  192.168.20.255  <none>  Ready   0           0
cn-beijing.192.168.8.10   192.168.8.10    <none>  Ready   0           0
cn-beijing.192.168.1.101  192.168.1.101   <none>  Ready   1           0
cn-beijing.192.168.1.112  192.168.1.112   <none>  Ready   1           0
cn-beijing.192.168.8.252  192.168.8.252   <none>  Ready   1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0.0%)
```

The cluster has 3 GPUs, all unallocated.
Step 2: Submit the inference task
Run the following command to submit a TensorFlow inference task:
```
arena serve tensorflow \
  --name=mymnist2 \
  --model-name=mnist \
  --gpumemory=3 \
  --gpucore=10 \
  --image=registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:latest-gpu-mnist \
  --model-path=/tfmodel/mnist \
  --version-policy=specific:2 \
  --data=mydata=/mnt/data
```

This example uses a TensorFlow model bundled into the Docker image at build time. If your model file is stored separately, mount it through a shared NAS volume before submitting the task. See Configure a shared NAS volume.
The following table describes the key parameters.
| Parameter | Type | Description |
|---|---|---|
| `--name` | String | Name of the inference task. |
| `--model-name` | String | Name of the model. |
| `--gpumemory` | Integer (GiB) | GPU memory to request. |
| `--gpucore` | Integer (0–100) | Percentage of GPU computing power to request. |
| `--image` | String | Container image used to run the task. |
| `--model-path` | String | Path to the model inside the container. |
| `--version-policy` | String | Model version to load. `specific:2` loads version 2, which must exist as a subfolder under `--model-path`. |
| `--data` | String | Volume mount in `<volume-name>=<mount-path>` format. |
Step 3: Verify the task is running
Run the following command to list all inference tasks:
```
arena serve list
```

Example output:

```
NAME      TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS       PORTS
mymnist1  Tensorflow  202101162119  1        0          172.16.3.123  GRPC:8500,RESTFUL:8501
mymnist2  Tensorflow  202101191447  1        1          172.16.1.147  GRPC:8500,RESTFUL:8501
```

Run the following command to get details for mymnist2:
```
arena serve get mymnist2
```

Expected output:

```
Name:           mymnist2
Namespace:      default
Type:           Tensorflow
Version:        202101191447
Desired:        1
Available:      1
Age:            20m
Address:        172.16.1.147
Port:           GRPC:8500,RESTFUL:8501
GPUMemory(GiB): 3

Instances:
  NAME                                                       STATUS   AGE  READY  RESTARTS  GPU(Memory/GiB)  NODE
  ----                                                       ------   ---  -----  --------  ---------------  ----
  mymnist2-202101191447-tensorflow-serving-7f64bf9749-mtnpc  Running  20m  1/1    0         3                cn-beijing.192.168.1.112
```

If the value of Desired equals the value of Available, the task is ready.
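In a script, the Desired/Available comparison can be automated by parsing the command's output. The following is a hypothetical helper based on the field names in the sample output above, not an Arena API:

```python
# Hypothetical readiness check: parse `arena serve get <name>` output and
# compare the Desired and Available fields. Field names match the sample
# output above; this parsing is an illustration, not part of Arena itself.

def is_ready(get_output: str) -> bool:
    fields = {}
    for line in get_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    desired = fields.get("Desired")
    return desired is not None and desired == fields.get("Available")

sample = "Name: mymnist2\nDesired: 1\nAvailable: 1"
print(is_ready(sample))  # True
```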
Step 4: (Optional) View task logs
Run the following command to print the last 10 lines of logs:
```
arena serve logs mymnist2 -t 10
```

Example output:

```
2021-01-18 13:21:58.482985: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2021-01-18 13:21:58.483673: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500005000 Hz
2021-01-18 13:21:58.508734: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /tfmodel/mnist/2
2021-01-18 13:21:58.513041: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 798017 microseconds.
2021-01-18 13:21:58.513263: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /tfmodel/mnist/2/assets.extra/tf_serving_warmup_requests
2021-01-18 13:21:58.513467: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: mnist2 version: 2}
2021-01-18 13:21:58.516620: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2021-01-18 13:21:58.521317: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
```

Verify the inference service
Create a test client pod. Save the following content as tfserving-test-client.yaml:

```
apiVersion: v1
kind: Pod
metadata:
  name: tfserving-test-client
spec:
  containers:
  - name: test-client
    image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow-serving-test-client:curl
    command: ["sleep","infinity"]
    imagePullPolicy: IfNotPresent
```

Deploy the pod:

```
kubectl apply -f tfserving-test-client.yaml
```

Get the IP address and port of the inference service:

```
arena serve list
```

From the output, mymnist2 is available at 172.16.1.147:8501. Send a test request to verify the service:

```
kubectl exec -ti tfserving-test-client -- bash validate.sh 172.16.1.147 8501
```

Expected output:

```
{ "predictions": [ [2.04608277e-05, 1.72721537e-09, 7.74099826e-05, 0.00364777911, 1.25222937e-06, 2.27521796e-05, 1.14668763e-08, 0.99597472, 3.68833389e-05, 0.000218785644] ] }
```

The validate.sh script sends pixel values from an MNIST test image. Among the predicted probabilities for digits 0 through 9, the highest is 0.99597472, for digit 7, confirming that the service is working correctly.
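On the client side, the prediction can be decoded by taking the index of the largest probability. The following snippet uses the JSON shape from the sample response above:

```python
import json

# The response body returned by validate.sh above: ten probabilities,
# one per digit 0-9.
response = ('{"predictions": [[2.04608277e-05, 1.72721537e-09, 7.74099826e-05, '
            '0.00364777911, 1.25222937e-06, 2.27521796e-05, 1.14668763e-08, '
            '0.99597472, 3.68833389e-05, 0.000218785644]]}')

probs = json.loads(response)["predictions"][0]
predicted_digit = max(range(len(probs)), key=lambda i: probs[i])
print(predicted_digit)  # 7
```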
What's next
To submit more inference tasks that share the same GPU, repeat Step 2 with different `--name` values, and adjust `--gpumemory` and `--gpucore` so that the combined requests fit within the GPU's available resources.
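Before submitting another task, you can sanity-check the remaining headroom on a shared GPU. This is a hypothetical planning helper, not part of Arena; the 8 GiB / 100% totals are assumptions matching the example GPU in this guide:

```python
# Hypothetical planning helper: given (memory GiB, core %) pairs for tasks
# already placed on a GPU, compute the memory and computing power left.

def remaining_capacity(placed, total_mem_gib=8, total_core_pct=100):
    used_mem = sum(mem for mem, _ in placed)
    used_core = sum(core for _, core in placed)
    return total_mem_gib - used_mem, total_core_pct - used_core

# mymnist2 already uses 3 GiB and 10% of the GPU
mem_left, core_left = remaining_capacity([(3, 10)])
print(mem_left, core_left)  # 5 90
# A second task asking for --gpumemory=4 --gpucore=50 still fits
print(mem_left >= 4 and core_left >= 50)  # True
```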