
Alibaba Cloud Service Mesh:Deploy a large language model as an inference service

Last Updated: Mar 11, 2026

A large language model (LLM) is a neural network language model with a very large number of parameters, often billions, such as GPT-3, GPT-4, PaLM, and PaLM 2. By deploying an LLM on Alibaba Cloud Service Mesh (ASM) through ModelMesh, you can expose natural language processing (NLP) capabilities, such as text classification, sentiment analysis, and machine translation, as API endpoints. With this LLM-as-a-service approach, you avoid high infrastructure costs, respond quickly to market changes, and scale services on demand to handle traffic spikes, while the cloud-hosted model improves operational efficiency.

This topic walks through three steps: building a custom model serving runtime, deploying the model as an inference service, and sending inference requests through the ASM ingress gateway.

How it works

This deployment uses ModelMesh within ASM to serve a Hugging Face LLM with Parameter-Efficient Fine-Tuning (PEFT) prompt tuning. The following diagram shows how the components fit together:

+---------------------------------------------------------+
|                   ASM ingress gateway                   |
|                    (port 8008, HTTP)                    |
+----------------------------+----------------------------+
                             |
                             v
+---------------------------------------------------------+
|                       ModelMesh                         |
|            (model routing and orchestration)            |
+---------------------------------------------------------+
|  ServingRuntime            InferenceService             |
|  +-------------------+     +-----------------------+    |
|  | peft-model-server |<----| peft-demo             |    |
|  | (MLServer-based)  |     | (model endpoint)      |    |
|  +-------------------+     +-----------------------+    |
|         |                                               |
|         v                                               |
|  +-------------------+     +-----------------------+    |
|  | MLServer          |     | Hugging Face model    |    |
|  | (gRPC :8001,      |     | + PEFT config         |    |
|  |  HTTP :8002)      |     | (bloomz-560m)         |    |
|  +-------------------+     +-----------------------+    |
+---------------------------------------------------------+
  • ServingRuntime: Defines the container image and MLServer configuration for serving the model.

  • InferenceService: Specifies which model to load and which ServingRuntime to use. Serves as the logical endpoint for inference requests.

  • ModelMesh: Routes incoming requests from the ASM ingress gateway to the correct InferenceService. Handles model loading and scaling.

Prerequisites

Before you begin, make sure that you have:

  • An ASM instance with an ingress gateway deployed, and a Kubernetes cluster added to the ASM instance.

  • ModelMesh serving (KServe in ModelMesh mode) installed in the cluster, so that the modelmesh-serving namespace is available.

  • kubectl configured to connect to the cluster.

Step 1: Build a custom runtime

Build a custom ServingRuntime to serve a Hugging Face LLM with PEFT prompt tuning. This involves three parts: implementing the model server class, packaging it into a Docker image, and creating the Kubernetes resource.

Implement the model server class

The model server inherits from the MLServer MLModel base class and implements two handlers:

  • load: Loads the pretrained LLM, applies the PEFT prompt tuning configuration, and initializes a tokenizer. The tokenizer lets the server accept raw text input rather than preprocessed tensors.

  • predict: Tokenizes input text, runs inference, and decodes the output back to readable text.

The full implementation is in peft_model_server.py:

peft_model_server.py

from typing import List

from mlserver import MLModel, types
from mlserver.codecs import decode_args

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

class PeftModelServer(MLModel):
    async def load(self) -> bool:
        self._load_model()
        self.ready = True
        return self.ready

    @decode_args
    async def predict(self, content: List[str]) -> List[str]:
        return self._predict_outputs(content)

    def _load_model(self):
        model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
        peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
        config = PeftConfig.from_pretrained(peft_model_id)
        self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
        self.model = PeftModel.from_pretrained(self.model, peft_model_id)
        self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
        return

    def _predict_outputs(self, content: List[str]) -> List[str]:
        output_list = []
        for text in content:
            # Build the prompt in the same format used for prompt tuning.
            inputs = self.tokenizer(
                f'{self.text_column} : {text} Label : ',
                return_tensors="pt",
            )
            with torch.no_grad():
                outputs = self.model.generate(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    max_new_tokens=10,
                    eos_token_id=3,
                )
                # Decode the generated token IDs back to readable text.
                outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
            output_list.append(outputs[0])
        return output_list

The model server reads configuration from environment variables:

| Environment variable | Default value | Description |
|---|---|---|
| PRETRAINED_MODEL_PATH | bigscience/bloomz-560m | Path or Hugging Face model ID for the base LLM |
| PEFT_MODEL_ID | aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM | PEFT prompt tuning configuration ID |
| DATASET_TEXT_COLUMN_NAME | Tweet text | Column name used for the input text field |
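To see how these variables combine into the prompt that the server sends to the model, the following sketch reproduces the lookup and template logic in isolation (the sample input string is made up for illustration):

```python
import os

# Resolve configuration the same way the model server does:
# environment variables with the published defaults as fallbacks.
model_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")

# The server wraps each raw input in this prompt template before tokenizing.
sample_input = "I love this product"  # illustrative only
prompt = f"{text_column} : {sample_input} Label : "
print(prompt)
```

With the defaults, the prompt becomes "Tweet text : I love this product Label : ", which is the exact format the PEFT prompt tuning configuration was trained against.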

Build the Docker image

Package the model server and its dependencies into a Docker image compatible with ModelMesh.

Dockerfile

# Base image with Python 3.8; install MLServer and the dependencies of
# the MLModel implementation (pin versions as needed for reproducibility)
FROM python:3.8-slim-buster
RUN pip install mlserver peft transformers datasets
# ...

# The custom MLModel implementation should be on the Python search path
# instead of relying on the working directory of the image. If using a
# single-file module, this can be accomplished with:
COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
ENV PYTHONPATH=/opt/

# environment variables to be compatible with ModelMesh Serving
# these can also be set in the ServingRuntime, but this is recommended for
# consistency when building and testing
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=peft-model

# With this setting, the implementation field is not required in the model
# settings which eases integration by allowing the built-in adapter to generate
# a basic model settings file
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer

CMD mlserver start ${MLSERVER_MODELS_DIR}

The image exposes two ports: gRPC on 8001 and HTTP on 8002. Setting MLSERVER_MODEL_IMPLEMENTATION tells MLServer which class to load, so no separate model settings file is required.

Create the ServingRuntime resource

Define a ServingRuntime that points to your Docker image and configures the MLServer environment.

sample-runtime.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: peft-model-server
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: peft-model
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "true"
        - name: MLSERVER_MODEL_NAME
          value: peft-model
        - name: MLSERVER_HOST
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "-1"
        - name: PRETRAINED_MODEL_PATH
          value: "bigscience/bloomz-560m"
        - name: PEFT_MODEL_ID
          value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
        # - name: "TRANSFORMERS_OFFLINE"
        #   value: "1"
        # - name: "HF_DATASETS_OFFLINE"
        #   value: "1"
      resources:
        requests:
          cpu: 500m
          memory: 4Gi
        limits:
          cpu: "5"
          memory: 5Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000

Deploy the ServingRuntime:

kubectl apply -f sample-runtime.yaml

Verify the runtime is available:

kubectl get servingruntimes -n modelmesh-serving

Expected output:

NAME                AGE
peft-model-server   10s

The peft-model-server runtime should appear in the output.

Step 2: Deploy the inference service

Create an InferenceService resource to bind your model to the ServingRuntime from Step 1. The InferenceService is the logical endpoint that ModelMesh uses to route inference requests to the model.

peft-demo-isvc.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: peft-demo
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: peft-model-server
      storage:
        key: localMinIO
        # Placeholder path reused from the ModelMesh quickstart; the actual
        # model weights are loaded by the runtime from the Hugging Face IDs
        # set in the ServingRuntime environment variables.
        path: sklearn/mnist-svm.joblib

Configuration fields:

| Field | Value | Description |
|---|---|---|
| modelFormat.name | peft-model | Must match the format declared in the ServingRuntime |
| runtime | peft-model-server | Tells ModelMesh which runtime serves this model |
| serving.kserve.io/deploymentMode | ModelMesh | Required annotation that instructs KServe to deploy through ModelMesh rather than standalone pods |

Deploy the InferenceService:

kubectl apply -f peft-demo-isvc.yaml

Check that the InferenceService is ready:

kubectl get inferenceservices -n modelmesh-serving

Expected output:

NAME        URL    READY   AGE
peft-demo          True    30s

Wait until the READY column shows True before proceeding. If the status remains False, see Troubleshooting.

Step 3: Send an inference request

Send a POST request to the deployed model through the ASM ingress gateway. The request uses the KServe v2 inference protocol.

MODEL_NAME="peft-demo"
ASM_GW_IP="<IP-address-of-the-ingress-gateway>"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json

Replace the following placeholder with your actual value:

| Placeholder | Description | Example |
|---|---|---|
| <IP-address-of-the-ingress-gateway> | External IP of your ASM ingress gateway | 192.168.1.100 |

The request body (input.json) follows the v2 inference protocol format. Encode input text as Base64 in the bytes_contents field:

{
    "inputs": [
        {
          "name": "content",
          "shape": [1],
          "datatype": "BYTES",
          "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
        }
    ]
}

In this example, bytes_contents is the Base64-encoded form of "Every day is a new beginning, filled with opportunities and hope".
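Producing the Base64 payload by hand is error-prone, so a small script can generate input.json instead. The following sketch builds the request body shown above and writes it to disk:

```python
import base64
import json

text = "Every day is a new beginning, filled with opportunities and hope"

payload = {
    "inputs": [
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            # The v2 protocol carries BYTES inputs as Base64 strings.
            "contents": {"bytes_contents": [base64.b64encode(text.encode("utf-8")).decode("ascii")]},
        }
    ]
}

with open("input.json", "w") as f:
    json.dump(payload, f, indent=2)
```

Any input sentence can be substituted for the text variable; the script takes care of the encoding.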

Expected response

A successful inference returns a JSON response with the model output in bytesContents, also Base64-encoded:

{
 "modelName": "peft-demo__isvc-5c5315c302",
 "outputs": [
  {
   "name": "output-0",
   "datatype": "BYTES",
   "shape": [
    "1",
    "1"
   ],
   "parameters": {
    "content_type": {
     "stringParam": "str"
    }
   },
   "contents": {
    "bytesContents": [
     "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
    ]
   }
  }
 ]
}

Decode the bytesContents value from Base64 to verify the result:

Tweet text : Every day is a new beginning, filled with opportunities and hope Label : no complaint

The model classified the input text with the label no complaint, confirming the inference service works correctly.
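Decoding and label extraction can also be done programmatically. The following sketch pulls the first output out of a v2-style response and returns the text after the final "Label :" marker (the response stub is made up for illustration and abbreviated to the fields used):

```python
import base64

def extract_label(response: dict) -> str:
    """Decode the first output of a v2 inference response and
    return the text after the final 'Label :' marker."""
    b64 = response["outputs"][0]["contents"]["bytesContents"][0]
    decoded = base64.b64decode(b64).decode("utf-8")
    return decoded.rsplit("Label :", 1)[-1].strip()

# Minimal response stub in the shape shown above (contents are
# illustrative, not real model output).
sample_response = {
    "outputs": [
        {
            "contents": {
                "bytesContents": [
                    base64.b64encode(b"Tweet text : some input Label : no complaint").decode("ascii")
                ]
            }
        }
    ]
}

print(extract_label(sample_response))  # no complaint
```

The same function works on the JSON returned by the curl command in this step once it is parsed with json.loads.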

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| ServingRuntime pod stays in Pending | Insufficient CPU or memory in the cluster | Add nodes or reduce the resource requests in the ServingRuntime spec |
| InferenceService never reaches Ready: True | Model loading timeout or download failure | Check pod logs with kubectl logs -n modelmesh-serving <pod-name>. Increase modelLoadingTimeoutMillis for slow downloads. For air-gapped clusters, set TRANSFORMERS_OFFLINE and HF_DATASETS_OFFLINE to "1" and pre-download models to local storage |
| curl returns connection refused | Ingress gateway misconfigured or wrong IP/port | Verify the ASM ingress gateway IP and confirm port 8008 is exposed |
| Unexpected model output | Model and PEFT configuration mismatch | Confirm that PRETRAINED_MODEL_PATH and PEFT_MODEL_ID point to compatible model and tuning configurations |

What to do next