Deploy the Qwen2-1.5B-Instruct model as an inference service on ACK using TensorRT-LLM and Triton Inference Server. This tutorial uses an A10 GPU and walks through the full pipeline: downloading the model from ModelScope, converting it to a TensorRT engine, pre-loading the engine into cache with Fluid, and serving it through Triton.
Background
Qwen2-1.5B-Instruct
Qwen2-1.5B-Instruct is a Transformer-based large language model (LLM) with 1.5 billion parameters, trained on web text, professional books, and code. For more information, see the Qwen2 GitHub repository.
Triton Inference Server
Triton Inference Server is an open source inference serving framework from NVIDIA. It supports multiple ML backends — including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM — and is optimized for real-time, batch, and audio and video streaming inference workloads. For more information, see the Triton Inference Server GitHub repository.
TensorRT-LLM
TensorRT-LLM is an open source library from NVIDIA that compiles LLMs into TensorRT engines optimized for NVIDIA GPUs. It supports both Tensor Parallelism and Pipeline Parallelism, and integrates with Triton as the TensorRT-LLM backend. For more information, see the TensorRT-LLM GitHub repository.
Prerequisites
Before you begin, make sure you have:
- An ACK Managed Cluster Pro Edition (version 1.22 or later) with A10 GPU nodes. GPU nodes must use driver version 525. To pin the driver to version 525.105.17, add the label ack.aliyun.com/nvidia-driver-version:525.105.17 to the GPU node pool. For more information, see Create an ACK managed cluster and Customize the GPU driver version of a node by specifying a version number.
- The cloud-native AI suite installed with the ack-fluid component deployed.
  - If the cloud-native AI suite is not yet installed: deploy Fluid and enable data caching acceleration. See Deploy the cloud-native AI suite.
  - If the cloud-native AI suite is already installed: go to the Marketplace page in the ACK console and deploy the ack-fluid component.
  Important: If you have open source Fluid installed, uninstall it before deploying ack-fluid.
- The latest version of the Arena client. See Configure the Arena client.
- Object Storage Service (OSS) activated with a bucket created. See Activate OSS and Create buckets.
Step 1: Create a Dataset and a JindoRuntime
A Dataset describes the remote storage location of the model files. A JindoRuntime provides a caching layer in front of OSS so that subsequent reads hit memory rather than the network. Together, they significantly reduce model loading time during inference.
- Create a Secret to store your OSS credentials.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: fluid-oss-secret
stringData:
  fs.oss.accessKeyId: <Your AccessKey ID>
  fs.oss.accessKeySecret: <Your AccessKey secret>
EOF

Replace <Your AccessKey ID> and <Your AccessKey secret> with your credentials. To get an AccessKey pair, see Obtain an AccessKey.

Expected output:

secret/fluid-oss-secret created

- Create a file named dataset.yaml with the following content. This configuration creates a Dataset pointing to the OSS path where the model is stored, and a JindoRuntime that caches model data in 20 GiB of shared memory across two replicas.

# Dataset: describes the remote data source and mount configuration.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen2-oss
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/qwen2-1.5b # Replace with your OSS path.
    name: qwen2
    path: /
    options:
      fs.oss.endpoint: <oss_endpoint> # Replace with your OSS endpoint.
    encryptOptions:
      - name: fs.oss.accessKeyId
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeyId
      - name: fs.oss.accessKeySecret
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeySecret
  accessModes:
    - ReadWriteMany
# JindoRuntime: starts a JindoFS cluster that provides caching services.
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen2-oss
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.95"
        low: "0.7"
  fuse:
    properties:
      fs.oss.read.buffer.size: "8388608" # 8 MiB read buffer
      fs.oss.download.thread.concurrency: "200"
      fs.oss.read.readahead.max.buffer.count: "200"
      fs.oss.read.sequence.ambiguity.range: "2147483647"
    args:
      - -oauto_cache
      - -oattr_timeout=1
      - -oentry_timeout=1
      - -onegative_timeout=1

For more information about Dataset and JindoRuntime configuration, see Accelerate access to OSS files using JindoFS.

- Apply the configuration.

kubectl apply -f dataset.yaml

Expected output:

dataset.data.fluid.io/qwen2-oss created
jindoruntime.data.fluid.io/qwen2-oss created

- Verify that the Dataset is bound and ready.

kubectl get dataset qwen2-oss

Expected output:

NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s

The PHASE: Bound status confirms the Dataset is ready.
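When scripting this step, polling for the phase can replace the manual check. The following is a minimal sketch, not part of the original tutorial: the helper name wait_for_phase and the KUBECTL override variable are hypothetical, and the sketch assumes that the resource exposes its phase at .status.phase, which matches the PHASE column shown above.

```shell
# Hypothetical helper: poll a resource until .status.phase equals the wanted value.
# KUBECTL is an assumed override hook (for example, for dry runs); it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_phase() {
  # $1: resource (for example dataset/qwen2-oss)   $2: wanted phase
  # $3: maximum number of attempts (default 30)    $4: seconds between attempts (default 10)
  local resource="$1" want="$2" attempts="${3:-30}" delay="${4:-10}"
  local phase="" i=1
  while [ "$i" -le "$attempts" ]; do
    # Assumption: the phase is reported at .status.phase.
    phase="$("$KUBECTL" get "$resource" -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "$want" ]; then
      echo "$resource is $want"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $resource to reach $want (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For example, wait_for_phase dataset/qwen2-oss Bound returns 0 as soon as the Dataset reports Bound and returns 1 after roughly five minutes with the default settings.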
Step 2: Create a Dataflow
The Dataflow automates the full model preparation pipeline:
- Download: pull Qwen2-1.5B-Instruct from ModelScope into the OSS-backed Dataset.
- Convert and build: convert the model checkpoint to float16 format and compile it into a TensorRT engine.
- Warm up cache: pre-load the engine and Triton backend configuration into the JindoRuntime memory cache.
The convert-and-build step is the most time-consuming. It requires one GPU and up to 30 GiB of memory, and typically takes 15–20 minutes. See Troubleshooting if it fails.
- Create a file named dataflow.yaml with the following content.

# Step 1: Download Qwen2-1.5B-Instruct from ModelScope.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step1-download-model
spec:
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base
      imageTag: ubuntu22.04
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        echo "Downloading model..."
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
          echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct exists. Skipping model download."
        else
          apt update && apt install -y git git-lfs
          git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
          mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
# Step 2: Convert the model checkpoint and build the TensorRT engine.
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-trtllm-convert
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build
      imageTag: 24.07-trtllm-python-py3
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        set -ex

        cd /tensorrtllm_backend/tensorrt_llm/examples/qwen

        # Stage 1: convert the checkpoint unless it already exists on the mount.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
          echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt exists. Skipping checkpoint conversion."
        else
          echo "Converting checkpoint..."
          python3 convert_checkpoint.py --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct --output_dir /root/Qwen2-1.5B-Instruct-ckpt --dtype float16
          echo "Writing TensorRT-LLM model checkpoint to OSS bucket..."
          mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
        fi

        sleep 2

        # Stage 2: build the engine unless it already exists on the mount.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
          echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine exists. Skipping engine build."
        else
          echo "Building TensorRT-LLM engine..."
          trtllm-build --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
                       --gemm_plugin float16 \
                       --paged_kv_cache enable \
                       --output_dir /root/Qwen2-1.5B-Instruct-engine
          echo "Writing TensorRT-LLM engine to OSS bucket..."
          mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
        fi

        # Stage 3: generate the Triton model repository unless it already exists.
        if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
          echo "Directory ${MODEL_MOUNT_PATH}/tensorrtllm_backend exists. Skipping tensorrtllm_backend configuration."
        else
          echo "Configuring model..."
          cd /tensorrtllm_backend
          cp all_models/inflight_batcher_llm/ qwen2_ifb -r
          export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
          export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine
          python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
          python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt triton_max_batch_size:8
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
          echo "Writing TensorRT-LLM configuration to OSS bucket..."
          mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
          mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      resources:
        requests:
          cpu: 2
          memory: 10Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 12
          memory: 30Gi
          nvidia.com/gpu: 1
# Step 3: Pre-load the engine and backend configuration into the JindoRuntime cache.
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-trtllm-convert
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
  loadMetadata: true
  target:
  - path: /Qwen2-1.5B-Instruct-engine
  - path: /tensorrtllm_backend

- Create the Dataflow.

kubectl create -f dataflow.yaml

Expected output:

dataprocess.data.fluid.io/step1-download-model created
dataprocess.data.fluid.io/step2-trtllm-convert created
dataload.data.fluid.io/step3-warmup-cache created

- Monitor progress and wait for all steps to complete.

kubectl get dataprocess

Expected output (both steps show Complete):

NAME                   DATASET     PHASE      AGE   DURATION
step1-download-model   qwen2-oss   Complete   23m   3m2s
step2-trtllm-convert   qwen2-oss   Complete   23m   19m58s

The cache warm-up step (step3-warmup-cache) is a DataLoad rather than a DataProcess, so it does not appear in this list. To check it, run kubectl get dataload step3-warmup-cache.

If a step fails or gets stuck, see Troubleshooting.
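For unattended runs, the table printed by kubectl get dataprocess can be filtered so that only unfinished steps are reported. The sketch below is illustrative only: the function name incomplete_steps is hypothetical, and it assumes the column layout shown in the expected output above (PHASE as the third column, with a header row).

```shell
# Hypothetical helper: read `kubectl get dataprocess` table output on stdin and
# print every step whose PHASE (assumed to be column 3) is not "Complete".
# The first line is treated as the header row and skipped.
incomplete_steps() {
  awk 'NR > 1 && $3 != "Complete" { print $1 " (" $3 ")" }'
}

# Usage (requires cluster access):
#   kubectl get dataprocess | incomplete_steps
```

An empty result means every listed DataProcess has reached Complete. Do not combine this with --no-headers, because the first row would then be skipped.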
Step 3: Deploy the inference service
The three most important parameters for this deployment are:
- --gpus=1: allocates one GPU per pod for model inference.
- --data=qwen2-oss:/mnt/models: mounts the Fluid-managed PersistentVolumeClaim (PVC) containing the TensorRT engine and Triton configuration.
- --image: specifies the Triton container image with TensorRT-LLM backend support.
For descriptions of all parameters, see Appendix: Arena command parameter reference.
- Deploy the inference service using Arena.

arena serve custom \
    --name=qwen2-chat \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
    --data=qwen2-oss:/mnt/models \
    "tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"

Expected output:

service/qwen2-chat-v1 created
deployment.apps/qwen2-chat-v1-custom-serving created
INFO[0003] The Job qwen2-chat has been submitted successfully
INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status

- Wait for the service to become ready.

arena serve get qwen2-chat

Expected output:

Name:       qwen2-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        1m
Address:    192.XX.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                           ------   ---  -----  --------  ---  ----
  qwen2-chat-v1-custom-serving-657869c698-hl665  Running  1m   1/1    0         1    ap-southeast-1.192.XX.XX.XX

Proceed to the next step once the output shows Available: 1 and the instance shows READY 1/1.
Step 4: Validate the inference service
- Set up port forwarding to access the service locally.

Important: Port forwarding is for development and debugging only. It does not provide production-level reliability, security, or scalability. For production networking, see Manage Ingresses.

kubectl port-forward svc/qwen2-chat-v1 8000:8000

Expected output:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

- Send a test inference request.

curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

Expected output:
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is a type of artificial intelligence that allows computer systems to learn from data without being explicitly programmed."}
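If the request fails or hangs, it can help to confirm that Triton itself reports ready before debugging the model. The following is a minimal sketch that queries Triton's standard HTTP readiness endpoint, /v2/health/ready, through the forwarded port. The function name triton_ready and the CURL override variable are hypothetical additions for illustration.

```shell
# Hypothetical helper: succeed only if Triton's readiness endpoint returns HTTP 200.
# CURL is an assumed override hook; it defaults to curl.
CURL="${CURL:-curl}"

triton_ready() {
  # $1: base URL of the service (defaults to the port-forward address used above)
  local base_url="${1:-http://localhost:8000}" code
  # -s silences progress output, -o /dev/null discards the body,
  # -w '%{http_code}' prints only the HTTP status code.
  code="$("$CURL" -s -o /dev/null -w '%{http_code}' "$base_url/v2/health/ready")"
  [ "$code" = "200" ]
}

# Usage (with the port-forward from the first step still running):
#   triton_ready && echo "Triton is ready" || echo "Triton is not ready"
```

Because the helper returns an exit status rather than printing, it can be used directly in a retry loop or as a gate in a deployment script.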
(Optional) Step 5: Clean up the environment
To delete the inference service, run:
arena serve delete qwen2-chat
Troubleshooting
The engine build runs out of GPU memory
Symptom: step2-trtllm-convert fails with an out-of-memory (OOM) error during the trtllm-build phase.
Check the pod logs for the specific error:
kubectl logs -l fluid.io/dataprocess=step2-trtllm-convert --tail=50
The step2-trtllm-convert job in this tutorial requests one A10 GPU (24 GiB of GPU memory). If the GPU node is shared with other workloads, free up GPU memory before retrying. To retry from the failed step without re-running the download, delete only step2-trtllm-convert and re-create it. The script skips already-completed stages by checking for existing directories.
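The skip-if-present behavior that makes retries cheap can be expressed as a small reusable pattern. This is a sketch of the idea only, not code from the tutorial's script: the function name run_stage_once is hypothetical, and the directory in the usage comment is the one the tutorial's script checks.

```shell
# Hypothetical helper: run a stage only if its output directory does not exist yet.
# $1 is the directory that marks the stage as done; the remaining arguments are
# the command to run when that directory is missing.
run_stage_once() {
  local marker_dir="$1"
  shift
  if [ -d "$marker_dir" ]; then
    echo "Directory $marker_dir exists. Skipping."
    return 0
  fi
  "$@"
}

# Usage (illustrative; build_engine stands for whatever command produces the directory):
#   run_stage_once "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" build_engine
```

One consequence of this design is that a partially written output directory also counts as done, so remove any incomplete directory from the OSS path before retrying.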
The OSS path is not found
Symptom: step1-download-model fails, or step2-trtllm-convert cannot find the model files at ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct.
Verify the Dataset mount is correct:
kubectl describe dataset qwen2-oss
Check that mountPoint in dataset.yaml matches your OSS bucket path (format: oss://<bucket-name>/qwen2-1.5b), and that fs.oss.endpoint is set to the endpoint for the region where your bucket is located.
A DataProcess job is stuck in a non-Complete phase
Symptom: kubectl get dataprocess shows a step in Pending or Running state for an unusually long time.
Describe the job to check for scheduling issues or errors:
kubectl describe dataprocess step2-trtllm-convert
Check the Events section for messages such as insufficient GPU resources, image pull failures, or node taints preventing scheduling. The engine build step (step2-trtllm-convert) typically takes 15–20 minutes; allow that time before investigating.
Appendix: Arena command parameter reference
The following table describes all parameters used in the arena serve custom command in Step 3.
| Parameter | Description | Example |
|---|---|---|
| serve custom | Arena subcommand. Deploys a custom model service instead of a preset type such as tfserving or triton. | N/A |
| --name | Service name. A unique name for the service, used in subsequent management operations such as viewing logs and deleting the service. | qwen2-chat |
| --version | Service version. A version label for the service, used in version management and phased releases. | v1 |
| --gpus | GPU count. The number of GPUs allocated to each pod. Required when the model needs GPUs for inference. | 1 |
| --replicas | Replica count. The number of service pods to run. Increasing replicas improves concurrent throughput and availability. | 1 |
| --restful-port | RESTful port. The port on which the service exposes its REST API for inference requests. | 8000 |
| --readiness-probe-action | Readiness probe type. Sets the check method for the Kubernetes readiness probe, which determines when the container is ready to receive traffic. | tcpSocket |
| --readiness-probe-action-option | Probe options. Parameters for the probe type. For tcpSocket, specifies the port to check. | port: 8000 |
| --readiness-probe-option | Additional probe settings. Can be specified multiple times. Sets the initial delay and check interval. | initialDelaySeconds: 30, periodSeconds: 30 |
| --data | Volume mount. Mounts a PVC to a path in the container. Format: PVC-name:mount-path. Used to mount model files from the Fluid-managed Dataset. | qwen2-oss:/mnt/models |
| --image | Container image. The full URL of the container image. Defines the runtime environment for the service. | kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 |
| [COMMAND] | Start command. The command to run after the container starts. Launches the Triton server with the model repository and port configuration. | "tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_" |