A large language model (LLM) is a neural network language model with a very large number of parameters, often hundreds of billions, such as GPT-3, GPT-4, PaLM, and PaLM 2. By deploying an LLM on Service Mesh (ASM) through ModelMesh, you can expose NLP capabilities -- text classification, sentiment analysis, machine translation -- as API endpoints. With this LLM-as-a-service approach, you avoid high infrastructure costs, respond quickly to market changes, and scale services on demand to handle traffic spikes, all while running the model in the cloud to improve operational efficiency.
This topic walks through three steps: building a custom model serving runtime, deploying the model as an inference service, and sending inference requests through the ASM ingress gateway.
How it works
This deployment uses ModelMesh within ASM to serve a Hugging Face LLM with Parameter-Efficient Fine-Tuning (PEFT) prompt tuning. The following diagram shows how the components fit together:
+---------------------------------------------------------+
|                   ASM ingress gateway                   |
|                    (port 8008, HTTP)                    |
+----------------------------+----------------------------+
                             |
                             v
+---------------------------------------------------------+
|                        ModelMesh                        |
|            (model routing and orchestration)            |
+---------------------------------------------------------+
|  ServingRuntime               InferenceService          |
|  +-------------------+       +-----------------------+  |
|  | peft-model-server |<------| peft-demo             |  |
|  | (MLServer-based)  |       | (model endpoint)      |  |
|  +-------------------+       +-----------------------+  |
|          |                                              |
|          v                                              |
|  +-------------------+       +-----------------------+  |
|  | MLServer          |       | Hugging Face model    |  |
|  | (gRPC :8001,      |       | + PEFT config         |  |
|  |  HTTP :8002)      |       | (bloomz-560m)         |  |
|  +-------------------+       +-----------------------+  |
+---------------------------------------------------------+

ServingRuntime: Defines the container image and MLServer configuration for serving the model.
InferenceService: Specifies which model to load and which ServingRuntime to use. Serves as the logical endpoint for inference requests.
ModelMesh: Routes incoming requests from the ASM ingress gateway to the correct InferenceService. Handles model loading and scaling.
Prerequisites
Before you begin, make sure that you have:
ModelMesh enabled in your ASM instance and the ASM environment configured. Complete Step 1 and Step 2 in Use ModelMesh to roll out a multi-model inference service before you proceed.
Familiarity with creating custom model serving runtimes in ModelMesh. For background, see Use ModelMesh to create a custom model serving runtime.
Step 1: Build a custom runtime
Build a custom ServingRuntime to serve a Hugging Face LLM with PEFT prompt tuning. This involves three parts: implementing the model server class, packaging it into a Docker image, and creating the Kubernetes resource.
Implement the model server class
The model server inherits from the MLServer MLModel base class and implements two handlers:
load: Loads the pretrained LLM, applies the PEFT prompt tuning configuration, and initializes a tokenizer. The tokenizer lets the server accept raw text input rather than preprocessed tensors.
predict: Tokenizes input text, runs inference, and decodes the output back to readable text.
The full implementation is in peft_model_server.py.
The model server reads configuration from environment variables:
| Environment variable | Default value | Description |
|---|---|---|
| PRETRAINED_MODEL_PATH | bigscience/bloomz-560m | Path or Hugging Face model ID for the base LLM |
| PEFT_MODEL_ID | aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM | PEFT prompt tuning configuration ID |
| DATASET_TEXT_COLUMN_NAME | Tweet text | Column name used for the input text field |
Build the Docker image
Package the model server and its dependencies into a Docker image compatible with ModelMesh.
The image exposes two ports: gRPC on 8001 and HTTP on 8002. Setting MLSERVER_MODEL_IMPLEMENTATION tells MLServer which class to load, so no separate model settings file is required.
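A Dockerfile along these lines could produce such an image. The base image, package list, and file names here are illustrative assumptions, not the exact build used in the source repository:

```dockerfile
# Illustrative Dockerfile; package versions and file names are assumptions.
FROM python:3.10-slim

# Install MLServer and the model dependencies.
RUN pip install --no-cache-dir mlserver transformers peft torch

WORKDIR /app
COPY peft_model_server.py /app/peft_model_server.py

# Tell MLServer which class implements the model, so no separate
# model settings file is required.
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer
ENV MLSERVER_GRPC_PORT=8001
ENV MLSERVER_HTTP_PORT=8002

CMD ["mlserver", "start", "/app"]
```

Build and push the image to a registry that your cluster can pull from before creating the ServingRuntime.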
Create the ServingRuntime resource
Define a ServingRuntime that points to your Docker image and configures the MLServer environment.
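A sample-runtime.yaml manifest might look like the following. This is a sketch: the image reference, resource values, and gRPC management port are placeholders or assumptions, so adjust them to your build and cluster:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: peft-model-server
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: peft-model
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  containers:
    - name: mlserver
      image: <your-registry>/peft-model-server:latest  # replace with your image
      env:
        - name: MLSERVER_MODEL_IMPLEMENTATION
          value: peft_model_server.PeftModelServer
        - name: PRETRAINED_MODEL_PATH
          value: bigscience/bloomz-560m
        - name: PEFT_MODEL_ID
          value: aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
```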
Deploy the ServingRuntime:
kubectl apply -f sample-runtime.yaml
Verify that the runtime is available:
kubectl get servingruntimes -n modelmesh-serving
Expected output:
NAME                AGE
peft-model-server   10s
The peft-model-server runtime should appear in the output.
Step 2: Deploy the inference service
Create an InferenceService resource to bind your model to the ServingRuntime from Step 1. The InferenceService is the logical endpoint that ModelMesh uses to route inference requests to the model.
Configuration fields:
| Field | Value | Description |
|---|---|---|
| modelFormat.name | peft-model | Must match the format declared in the ServingRuntime |
| runtime | peft-model-server | Tells ModelMesh which runtime serves this model |
| serving.kserve.io/deploymentMode | ModelMesh | Required annotation that instructs KServe to deploy through ModelMesh rather than as standalone pods |
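Put together, a peft-demo-isvc.yaml manifest could look like the following sketch. The metadata values come from the table above; depending on your ModelMesh configuration, you may also need a storage section pointing at your model repository, which is omitted here because the runtime in this example fetches the model from Hugging Face:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: peft-demo
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model        # must match the ServingRuntime's supported format
      runtime: peft-model-server
```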
Deploy the InferenceService:
kubectl apply -f peft-demo-isvc.yaml
Check that the InferenceService is ready:
kubectl get inferenceservices -n modelmesh-serving
Expected output:
NAME        URL   READY   AGE
peft-demo         True    30s
Wait until the READY column shows True before proceeding. If the status remains False, see Troubleshooting.
Step 3: Send an inference request
Send a POST request to the deployed model through the ASM ingress gateway. The request uses the KServe v2 inference protocol.
MODEL_NAME="peft-demo"
ASM_GW_IP="<IP-address-of-the-ingress-gateway>"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json
Replace the following placeholder with your actual value:
| Placeholder | Description | Example |
|---|---|---|
| <IP-address-of-the-ingress-gateway> | External IP of your ASM ingress gateway | 192.168.1.100 |
The request body (input.json) follows the v2 inference protocol format. Encode input text as Base64 in the bytes_contents field:
{
"inputs": [
{
"name": "content",
"shape": [1],
"datatype": "BYTES",
"contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
}
]
}
In this example, bytes_contents is the Base64-encoded form of "Every day is a new beginning, filled with opportunities and hope".
Expected response
A successful inference returns a JSON response with the model output in bytesContents, also Base64-encoded:
{
"modelName": "peft-demo__isvc-5c5315c302",
"outputs": [
{
"name": "output-0",
"datatype": "BYTES",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "str"
}
},
"contents": {
"bytesContents": [
"VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
]
}
}
]
}
Decode the bytesContents value from Base64 to verify the result:
Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint
The model classified the input text with the label no complaint, confirming that the inference service works correctly.
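A standard-library snippet like the following decodes bytesContents from a response. The response shown here is a trimmed, made-up example with a short payload; in practice, parse the JSON returned by curl:

```python
import base64
import json

# Trimmed example response; replace with the JSON returned by the gateway.
response = json.loads("""
{
  "outputs": [
    {
      "name": "output-0",
      "contents": {"bytesContents": ["VHdlZXQgdGV4dCA6IGhlbGxv"]}
    }
  ]
}
""")

# Each bytesContents entry is a Base64-encoded string of model output.
for output in response["outputs"]:
    for b64 in output["contents"]["bytesContents"]:
        decoded = base64.b64decode(b64).decode("utf-8")
        print(decoded)  # prints "Tweet text : hello"
```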
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| ServingRuntime pod stays in Pending | Insufficient CPU or memory in the cluster | Add nodes or reduce the resource requests in the ServingRuntime spec |
| InferenceService never reaches Ready: True | Model loading timeout or download failure | Check pod logs with kubectl logs -n modelmesh-serving <pod-name>. Increase modelLoadingTimeoutMillis for slow downloads. For air-gapped clusters, set TRANSFORMERS_OFFLINE and HF_DATASETS_OFFLINE to "1" and pre-download models to local storage |
| curl returns connection refused | Ingress gateway misconfigured or wrong IP/port | Verify the ASM ingress gateway IP and confirm that port 8008 is exposed |
| Unexpected model output | Model and PEFT configuration mismatch | Confirm that PRETRAINED_MODEL_PATH and PEFT_MODEL_ID point to compatible model and tuning configurations |
What to do next
Use ModelMesh to roll out a multi-model inference service -- Deploy multiple models under one ModelMesh instance to share resources.
Use ModelMesh to create a custom model serving runtime -- Build and customize ServingRuntime resources for other model frameworks.