Deploy an LLM inference service with a custom runtime using ModelMesh - Alibaba Cloud Service Mesh

A large language model (LLM) is a neural network language model with hundreds of millions of parameters, such as GPT-3, GPT-4, PaLM, and PaLM 2. By deploying an LLM on Service Mesh (ASM) through ModelMesh, you can expose NLP capabilities -- text classification, sentiment analysis, machine translation -- as API endpoints. With the LLM-as-a-service approach, you avoid high infrastructure costs, respond quickly to market changes, and scale services on demand to handle traffic spikes, all while the model runs on the cloud and improves operational efficiency.

This topic walks through three steps: building a custom model serving runtime, deploying the model as an inference service, and sending inference requests through the ASM ingress gateway.

How it works

This deployment uses ModelMesh within ASM to serve a Hugging Face LLM with Parameter-Efficient Fine-Tuning (PEFT) prompt tuning. The following diagram shows how the components fit together:

+---------------------------------------------------------+
|                   ASM ingress gateway                   |
|                    (port 8008, HTTP)                    |
+----------------------------+----------------------------+
                             |
                             v
+---------------------------------------------------------+
|                       ModelMesh                         |
|            (model routing and orchestration)            |
+---------------------------------------------------------+
|  ServingRuntime             InferenceService            |
|  +------------------+      +-----------------------+   |
|  | peft-model-server |<-----| peft-demo             |   |
|  | (MLServer-based)  |      | (model endpoint)      |   |
|  +------------------+      +-----------------------+   |
|         |                                               |
|         v                                               |
|  +------------------+      +-----------------------+   |
|  | MLServer          |      | Hugging Face model    |   |
|  | (gRPC :8001,      |      | + PEFT config         |   |
|  |  HTTP :8002)      |      | (bloomz-560m)         |   |
|  +------------------+      +-----------------------+   |
+---------------------------------------------------------+

ServingRuntime: Defines the container image and MLServer configuration for serving the model.
InferenceService: Specifies which model to load and which ServingRuntime to use. Serves as the logical endpoint for inference requests.
ModelMesh: Routes incoming requests from the ASM ingress gateway to the correct InferenceService. Handles model loading and scaling.

Prerequisites

Before you begin, make sure that you have:

ModelMesh enabled in your ASM instance with the ASM environment configured. Complete Step 1 and Step 2 in Use ModelMesh to roll out a multi-model inference service before you proceed
Familiarity with creating custom model serving runtimes with ModelMesh. For background, see Use ModelMesh to create a custom model serving runtime

Step 1: Build a custom runtime

Build a custom ServingRuntime to serve a Hugging Face LLM with PEFT prompt tuning. This involves three parts: implementing the model server class, packaging it into a Docker image, and creating the Kubernetes resource.

Implement the model server class

The model server inherits from the MLServer MLModel base class and implements two handlers:

load: Loads the pretrained LLM, applies the PEFT prompt tuning configuration, and initializes a tokenizer. The tokenizer lets the server accept raw text input rather than preprocessed tensors.
predict: Tokenizes input text, runs inference, and decodes the output back to readable text.

The full implementation is in peft_model_server.py:

peft_model_server.py

from typing import List

from mlserver import MLModel, types
from mlserver.codecs import decode_args

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

class PeftModelServer(MLModel):
    async def load(self) -> bool:
        self._load_model()
        self.ready = True
        return self.ready

    @decode_args
    async def predict(self, content: List[str]) -> List[str]:
        return self._predict_outputs(content)

    def _load_model(self):
        model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
        peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
        config = PeftConfig.from_pretrained(peft_model_id)
        self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
        self.model = PeftModel.from_pretrained(self.model, peft_model_id)
        self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
        return

    def _predict_outputs(self, content: List[str]) -> List[str]:
        output_list = []
        for input in content:
            inputs = self.tokenizer(
                f'{self.text_column} : {input} Label : ',
                return_tensors="pt",
            )
            with torch.no_grad():
                inputs = {k: v for k, v in inputs.items()}
                outputs = self.model.generate(
                    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                )
                outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
            output_list.append(outputs[0])
        return output_list

The model server reads configuration from environment variables:

Environment variable	Default value	Description
`PRETRAINED_MODEL_PATH`	`bigscience/bloomz-560m`	Path or Hugging Face model ID for the base LLM
`PEFT_MODEL_ID`	`aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM`	PEFT prompt tuning configuration ID
`DATASET_TEXT_COLUMN_NAME`	`Tweet text`	Column name used for the input text field

Build the Docker image

Package the model server and its dependencies into a Docker image compatible with ModelMesh.

Dockerfile

# TODO: choose appropriate base image, install Python, MLServer, and
# dependencies of your MLModel implementation
FROM python:3.8-slim-buster
RUN pip install mlserver peft transformers datasets
# ...

# The custom MLModel implementation should be on the Python search path
# instead of relying on the working directory of the image. If using a
# single-file module, this can be accomplished with:
COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
ENV PYTHONPATH=/opt/

# environment variables to be compatible with ModelMesh Serving
# these can also be set in the ServingRuntime, but this is recommended for
# consistency when building and testing
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=peft-model

# With this setting, the implementation field is not required in the model
# settings which eases integration by allowing the built-in adapter to generate
# a basic model settings file
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer

CMD mlserver start ${MLSERVER_MODELS_DIR}

The image exposes two ports: gRPC on 8001 and HTTP on 8002. Setting MLSERVER_MODEL_IMPLEMENTATION tells MLServer which class to load, so no separate model settings file is required.

Create the ServingRuntime resource

Define a ServingRuntime that points to your Docker image and configures the MLServer environment.

sample-runtime.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: peft-model-server
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: peft-model
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image:  registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "true"
        - name: MLSERVER_MODEL_NAME
          value: peft-model
        - name: MLSERVER_HOST
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "-1"
        - name: PRETRAINED_MODEL_PATH
          value: "bigscience/bloomz-560m"
        - name: PEFT_MODEL_ID
          value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
        # - name: "TRANSFORMERS_OFFLINE"
        #   value: "1"
        # - name: "HF_DATASETS_OFFLINE"
        #   value: "1"
      resources:
        requests:
          cpu: 500m
          memory: 4Gi
        limits:
          cpu: "5"
          memory: 5Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000

Deploy the ServingRuntime:

kubectl apply -f sample-runtime.yaml

Verify the runtime is available:

kubectl get servingruntimes -n modelmesh-serving

Expected output:

NAME                AGE
peft-model-server   10s

The peft-model-server runtime should appear in the output.

Step 2: Deploy the inference service

Create an InferenceService resource to bind your model to the ServingRuntime from Step 1. The InferenceService is the logical endpoint that ModelMesh uses to route inference requests to the model.

peft-demo-isvc.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: peft-demo
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: peft-model-server
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib

Configuration fields:

Field	Value	Description
`modelFormat.name`	`peft-model`	Must match the format declared in the ServingRuntime
`runtime`	`peft-model-server`	Tells ModelMesh which runtime serves this model
`serving.kserve.io/deploymentMode`	`ModelMesh`	Required annotation that instructs KServe to deploy through ModelMesh rather than standalone pods

Deploy the InferenceService:

kubectl apply -f peft-demo-isvc.yaml

Check that the InferenceService is ready:

kubectl get inferenceservices -n modelmesh-serving

Expected output:

NAME        URL    READY   AGE
peft-demo          True    30s

Wait until the READY column shows True before proceeding. If the status remains False, see Troubleshooting.

Step 3: Send an inference request

Send a POST request to the deployed model through the ASM ingress gateway. The request uses the KServe v2 inference protocol.

MODEL_NAME="peft-demo"
ASM_GW_IP="<IP-address-of-the-ingress-gateway>"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json

Replace the following placeholder with your actual value:

Placeholder	Description	Example
`<IP-address-of-the-ingress-gateway>`	External IP of your ASM ingress gateway	192.168.1.100

The request body (input.json) follows the v2 inference protocol format. Encode input text as Base64 in the bytes_contents field:

{
    "inputs": [
        {
          "name": "content",
          "shape": [1],
          "datatype": "BYTES",
          "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
        }
    ]
}

In this example, bytes_contents is the Base64-encoded form of "Every day is a new beginning, filled with opportunities and hope".

Expected response

A successful inference returns a JSON response with the model output in bytesContents, also Base64-encoded:

{
 "modelName": "peft-demo__isvc-5c5315c302",
 "outputs": [
  {
   "name": "output-0",
   "datatype": "BYTES",
   "shape": [
    "1",
    "1"
   ],
   "parameters": {
    "content_type": {
     "stringParam": "str"
    }
   },
   "contents": {
    "bytesContents": [
     "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
    ]
   }
  }
 ]
}

Decode the bytesContents value from Base64 to verify the result:

Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint

The model classified the input text with the label no complaint, confirming the inference service works correctly.

Troubleshooting

Issue	Cause	Solution
ServingRuntime pod stays in `Pending`	Insufficient CPU or memory in the cluster	Add nodes or reduce the resource `requests` in the ServingRuntime spec
InferenceService never reaches `Ready: True`	Model loading timeout or download failure	Check pod logs with `kubectl logs -n modelmesh-serving <pod-name>`. Increase `modelLoadingTimeoutMillis` for slow downloads. For air-gapped clusters, set `TRANSFORMERS_OFFLINE` and `HF_DATASETS_OFFLINE` to `"1"` and pre-download models to local storage
`curl` returns connection refused	Ingress gateway misconfigured or wrong IP/port	Verify the ASM ingress gateway IP and confirm port `8008` is exposed
Unexpected model output	Model and PEFT configuration mismatch	Confirm that `PRETRAINED_MODEL_PATH` and `PEFT_MODEL_ID` point to compatible model and tuning configurations

What to do next

Use ModelMesh to roll out a multi-model inference service -- Deploy multiple models under one ModelMesh instance to share resources.
Use ModelMesh to create a custom model serving runtime -- Build and customize ServingRuntime resources for other model frameworks.