
Container Service for Kubernetes:Deploy a TensorFlow model as an inference service

Last Updated: Mar 26, 2026

After training a TensorFlow model, you need to serve it as a network-accessible API for applications to call. This guide shows you how to use Arena to deploy a TensorFlow SavedModel as a TensorFlow Serving inference service on an ACK cluster, covering model upload to OSS, persistent storage configuration, serving instance launch, and external access through an Ingress.

Prerequisites

Before you begin, ensure that you have an ACK cluster with GPU-accelerated nodes, the Arena client installed, and access to Object Storage Service (OSS).

This guide uses a BERT model trained with TensorFlow 1.15 and exported as a SavedModel.

Prepare model storage

Step 1: Check available GPU resources

Run the following command to verify GPU availability in the cluster:

arena top node

The output lists all GPU nodes and their allocation status:

NAME                      IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-beijing.192.168.0.100  192.168.0.100  <none>  Ready   1           0
cn-beijing.192.168.0.101  192.168.0.101  <none>  Ready   1           0
cn-beijing.192.168.0.99   192.168.0.99   <none>  Ready   1           0
---------------------------------------------------------------------------------------------------
Allocated/Total GPUs of nodes which own resource nvidia.com/gpu In Cluster:
0/3 (0.0%)

The cluster has three GPU nodes, each with one unallocated GPU.

Step 2: Upload the model to OSS

Important

The following steps use ossutil on Linux. For other operating systems, see ossutil.

  1. Install ossutil.

  2. Create a bucket named examplebucket:

    ossutil64 mb oss://examplebucket

    The following output confirms the bucket is created:

    0.668238(s) elapsed
  3. Upload the SavedModel to the bucket. A SavedModel is a directory, so copy it recursively:

    ossutil64 cp -r model.savedmodel oss://examplebucket

Step 3: Create a persistent volume and persistent volume claim

To mount the OSS bucket as a volume inside the serving container, create a persistent volume (PV) and a persistent volume claim (PVC).

  1. Create a file named Tensorflow.yaml with the following content:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: model-csi-pv
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: model-csi-pv   # Must match the PV name above
        volumeAttributes:
          bucket: "Your Bucket"
          url: "Your oss url"
          akId: "Your Access Key Id"
          akSecret: "Your Access Key Secret"
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-pvc
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi

    Replace the following parameters:

      • bucket: The name of the OSS bucket. For naming rules, see Bucket naming conventions.

      • url: The endpoint URL for the bucket. For instructions on getting the URL, see Obtain the URL of a single object or the URLs of multiple objects.

      • akId: The AccessKey ID used to access the OSS bucket. Use a Resource Access Management (RAM) user's AccessKey. For details, see Create an AccessKey pair.

      • akSecret: The AccessKey secret paired with the AccessKey ID above.

      • otherOpts: Custom mount options for the OSS bucket. -o max_stat_cache_size=0 disables metadata caching so the system always reads the latest object metadata from OSS. -o allow_other lets other users on the node access the mounted bucket. For additional options, see Custom parameters supported by ossfs.
  2. Apply the manifest to create the PV and PVC:

    kubectl apply -f Tensorflow.yaml
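After applying the manifest, it is worth confirming that the claim has bound to the volume before launching the serving job. The following sketch writes a small check script using the resource names from the manifest above; run it from a machine where kubectl points at the cluster:

```shell
# Sketch: generate a check script for the PV/PVC created above.
cat > check-storage.sh <<'EOF'
#!/bin/sh
# Both resources should report STATUS "Bound" once the PVC matches the PV.
kubectl get pv model-csi-pv
kubectl get pvc model-pvc
EOF
chmod +x check-storage.sh
```

If the PVC stays in Pending, re-check the bucket name, endpoint URL, and AccessKey pair in Tensorflow.yaml.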

Deploy the inference service

Step 4: Launch TensorFlow Serving

Run the following command to deploy a TensorFlow Serving instance named bert-tfserving:

arena serve tensorflow \
  --name=bert-tfserving \
  --model-name=chnsenticorp \
  --gpus=1 \
  --image=tensorflow/serving:1.15.0-gpu \
  --data=model-pvc:/models \
  --model-path=/models/tensorflow \
  --version-policy=specific:1623831335

Parameter descriptions:

  • --name: The name of the serving job.

  • --model-name: The model name TensorFlow Serving uses to identify the model in API requests.

  • --gpus: The number of GPUs to allocate to the serving instance.

  • --image: The TensorFlow Serving container image. Must match the TensorFlow version used during training.

  • --data: Mounts the PVC into the container. Format: <pvc-name>:<mount-path>.

  • --model-path: The path inside the container where the model is stored.

  • --version-policy: The model version to load. specific:<version> pins serving to a single version.

The following output confirms the job is submitted:

configmap/bert-tfserving-202106251556-tf-serving created
configmap/bert-tfserving-202106251556-tf-serving labeled
configmap/bert-tfserving-202106251556-tensorflow-serving-cm created
service/bert-tfserving-202106251556-tensorflow-serving created
deployment.apps/bert-tfserving-202106251556-tensorflow-serving created
INFO[0003] The Job bert-tfserving has been submitted successfully
INFO[0003] You can run `arena get bert-tfserving --type tf-serving` to check the job status

Step 5: Verify the service is running

List all running inference services:

arena serve list

The output shows the bert-tfserving service with its address and ports:

NAME            TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS        PORTS
bert-tfserving  Tensorflow  202106251556  1        1          172.16.95.171  GRPC:8500,RESTFUL:8501

Get the full details of the service:

arena serve get bert-tfserving
Name:       bert-tfserving
Namespace:  inference
Type:       Tensorflow
Version:    202106251556
Desired:    1
Available:  1
Age:        4m
Address:    172.16.95.171
Port:       GRPC:8500,RESTFUL:8501


Instances:
  NAME                                                             STATUS   AGE  READY  RESTARTS  NODE
  ----                                                             ------   ---  -----  --------  ----
  bert-tfserving-202106251556-tensorflow-serving-8554d58d67-jd2z9  Running  4m   1/1    0         cn-beijing.192.168.0.88

The service is deployed in the inference namespace. Port 8500 serves gRPC requests and port 8501 serves HTTP/RESTful requests.
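Before exposing the service externally, you can reach the cluster IP from a local machine with kubectl port-forward. The sketch below writes a helper script; the service name is taken from the deployment output in Step 4 (the timestamp suffix will differ in your cluster), and the status path follows the TensorFlow Serving REST API:

```shell
# Sketch: generate a script that forwards the REST port and queries
# the model status endpoint.
cat > test-serving.sh <<'EOF'
#!/bin/sh
# Service name from the Step 4 output; 8501 is the RESTful port.
kubectl -n inference port-forward \
  svc/bert-tfserving-202106251556-tensorflow-serving 8501:8501 &
PF_PID=$!
sleep 2
# TensorFlow Serving's model status endpoint: /v1/models/<model-name>
curl -s "http://127.0.0.1:8501/v1/models/chnsenticorp"
kill "$PF_PID"
EOF
chmod +x test-serving.sh
```

Run the script from a machine with kubectl access to the cluster; a healthy service returns the same model_version_status response shown in Step 7 below.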

Access the service externally

arena serve tensorflow assigns a cluster IP by default, which is only reachable from within the cluster. Create an Ingress to expose the service for external access.

Step 6: Create an Ingress

  1. In the ACK console, go to the Clusters page, click the target cluster, and navigate to Network > Ingresses in the left-side navigation pane.

  2. From the Namespace drop-down list at the top of the page, select the inference namespace (the same namespace shown in the service details above).

  3. Click Create Ingress in the upper-right corner. For a full description of Ingress parameters, see Create an NGINX Ingress. Use the following settings:

    • Name: Tensorflow

    • Rules:

      • Domain name: Enter a custom domain, for example test.example.com

      • Path: /

      • Rule: ImplementationSpecific (default)

      • Service name: The service name returned by kubectl get service -n inference (in this example, bert-tfserving-202106251556-tensorflow-serving)

      • Port: 8501

  4. After the Ingress is created, return to the Ingresses page. The Rules column shows the Ingress address.
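The console steps above can also be expressed as a manifest. The following is a minimal sketch: it assumes the service name from the Step 4 output (the timestamp suffix will differ in your cluster) and the test.example.com example domain. Apply it with kubectl apply -n inference.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tensorflow
spec:
  rules:
  - host: test.example.com          # the custom domain from the rule above
    http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            # Service name from the Step 4 output; the timestamp suffix
            # (202106251556) will differ in your cluster.
            name: bert-tfserving-202106251556-tensorflow-serving
            port:
              number: 8501          # the RESTful port
```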


Step 7: Call the inference API

Run the following command to query the model status endpoint of the inference service, replacing <Ingress address> with the address shown in the Rules column. If the domain configured in the Ingress rule does not resolve through DNS, pass it in a Host header so the request matches the rule. For more information about TensorFlow Serving, see TensorFlow Serving API.

curl -H "Host: test.example.com" "http://<Ingress address>/v1/models/chnsenticorp"

A successful response looks like this:

{
 "model_version_status": [
  {
   "version": "1623831335",
   "state": "AVAILABLE",
   "status": {
    "error_code": "OK",
    "error_message": ""
   }
  }
 ]
}

The "state": "AVAILABLE" status confirms the model is loaded and ready to serve inference requests.
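Beyond the status check, predictions go through the :predict endpoint of the TensorFlow Serving REST API. The sketch below only constructs the request URLs from the values used in this guide; the request body depends on the model's serving signature, so the commented curl call uses a placeholder body, not the real BERT input format:

```shell
# Values taken from the steps above.
INGRESS_HOST="test.example.com"
MODEL_NAME="chnsenticorp"

# Model status (the call made in Step 7) and prediction endpoints:
STATUS_URL="http://${INGRESS_HOST}/v1/models/${MODEL_NAME}"
PREDICT_URL="http://${INGRESS_HOST}/v1/models/${MODEL_NAME}:predict"
echo "$STATUS_URL"
echo "$PREDICT_URL"

# The body below is a placeholder; the actual feature names depend on
# the model's serving signature (inspect it with saved_model_cli).
# curl -d '{"instances": [{"text": "..."}]}' "$PREDICT_URL"
```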