Run Qwen2 LLM Inference on ACK with TensorRT-LLM

Deploy the Qwen2-1.5B-Instruct model as an inference service on ACK using TensorRT-LLM and Triton Inference Server. This tutorial uses an A10 GPU and walks through the full pipeline: downloading the model from ModelScope, converting it to a TensorRT engine, pre-loading the engine into cache with Fluid, and serving it through Triton.

Background

Qwen2-1.5B-Instruct

Qwen2-1.5B-Instruct is a Transformer-based large language model (LLM) with 1.5 billion parameters, trained on web text, professional books, and code. For more information, see the Qwen2 GitHub repository.

Triton Inference Server

Triton Inference Server is an open source inference serving framework from NVIDIA. It supports multiple ML backends — including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM — and is optimized for real-time, batch, and audio and video streaming inference workloads. For more information, see the Triton Inference Server GitHub repository.

TensorRT-LLM

TensorRT-LLM is an open source library from NVIDIA that compiles LLMs into TensorRT engines optimized for NVIDIA GPUs. It supports both Tensor Parallelism and Pipeline Parallelism, and integrates with Triton as the TensorRT-LLM backend. For more information, see the TensorRT-LLM GitHub repository.

Prerequisites

Before you begin, make sure you have:

An ACK Managed Cluster Pro Edition (version 1.22 or later) with A10 GPU nodes. GPU nodes must use driver version 525. To pin the driver to version 525.105.17, add the label ack.aliyun.com/nvidia-driver-version:525.105.17 to the GPU node pool. For more information, see Create an ACK managed cluster and Customize the GPU driver version of a node by specifying a version number.
The cloud-native AI suite installed with the ack-fluid component deployed.
- If the cloud-native AI suite is not yet installed: deploy Fluid and enable data caching acceleration. See Deploy the cloud-native AI suite.
- If the cloud-native AI suite is already installed: go to the Marketplace page in the ACK console and deploy the ack-fluid component.
Important
If you have open source Fluid installed, uninstall it before deploying ack-fluid.
The latest version of the Arena client. See Configure the Arena client.
Object Storage Service (OSS) activated with a bucket created. See Activate OSS and Create buckets.
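
Before continuing, it can help to confirm that at least one node carries the driver-version label. The following is a minimal sketch, assuming the label set on the node pool has been propagated to the nodes; the function names are hypothetical helpers, not part of ACK.

```shell
# Label selector taken from the prerequisites above.
DRIVER_LABEL="ack.aliyun.com/nvidia-driver-version=525.105.17"

# Print the names of nodes that carry the driver-version label.
# Assumes the node pool label has been applied to the nodes themselves.
labeled_gpu_nodes() {
  kubectl get nodes -l "$DRIVER_LABEL" -o name
}

# Succeed only if at least one labeled node exists.
has_labeled_gpu_node() {
  [ -n "$(labeled_gpu_nodes)" ]
}
```

For example, run `has_labeled_gpu_node || echo "no node with label $DRIVER_LABEL"` after defining the functions.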

Step 1: Create a Dataset and a JindoRuntime

A Dataset describes the remote storage location of the model files. A JindoRuntime provides a caching layer in front of OSS so that subsequent reads hit memory rather than the network. Together, they significantly reduce model loading time during inference.

Create a Secret to store your OSS credentials.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: fluid-oss-secret
stringData:
  fs.oss.accessKeyId: <Your AccessKey ID>
  fs.oss.accessKeySecret: <Your AccessKey secret>
EOF

Replace <Your AccessKey ID> and <Your AccessKey secret> with your credentials. To get an AccessKey pair, see Obtain an AccessKey.

Expected output:

secret/fluid-oss-secret created
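
As an alternative to pasting the keys into a manifest, the same Secret can be created from environment variables. This is a sketch under the assumption that the credentials are exported as OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET; both variable names and the function name are hypothetical.

```shell
# Create the fluid-oss-secret Secret with the same two keys as the manifest above.
# OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are hypothetical variable names;
# the shell aborts with a message if either one is unset or empty.
create_oss_secret() {
  : "${OSS_ACCESS_KEY_ID:?set OSS_ACCESS_KEY_ID first}"
  : "${OSS_ACCESS_KEY_SECRET:?set OSS_ACCESS_KEY_SECRET first}"
  kubectl create secret generic fluid-oss-secret \
    --from-literal=fs.oss.accessKeyId="$OSS_ACCESS_KEY_ID" \
    --from-literal=fs.oss.accessKeySecret="$OSS_ACCESS_KEY_SECRET"
}
```

Use either this function or the manifest, not both, because the Secret name is the same.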

Create a file named dataset.yaml with the following content. This configuration creates a Dataset pointing to the OSS path where the model is stored, and a JindoRuntime that caches model data in 20 GiB of shared memory across two replicas.

# Dataset: describes the remote data source and mount configuration.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen2-oss
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/qwen2-1.5b   # Replace with your OSS path.
    name: qwen2
    path: /
    options:
      fs.oss.endpoint: <oss_endpoint>            # Replace with your OSS endpoint.
    encryptOptions:
      - name: fs.oss.accessKeyId
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeyId
      - name: fs.oss.accessKeySecret
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeySecret
  accessModes:
    - ReadWriteMany
---
# JindoRuntime: starts a JindoFS cluster that provides caching services.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen2-oss
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.95"
        low: "0.7"
  fuse:
    properties:
      fs.oss.read.buffer.size: "8388608"              # 8 MiB read buffer
      fs.oss.download.thread.concurrency: "200"
      fs.oss.read.readahead.max.buffer.count: "200"
      fs.oss.read.sequence.ambiguity.range: "2147483647"
    args:
      - -oauto_cache
      - -oattr_timeout=1
      - -oentry_timeout=1
      - -onegative_timeout=1

For more information about Dataset and JindoRuntime configuration, see Accelerate access to OSS files using JindoFS.
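
The two placeholders in dataset.yaml can be filled in by hand or with a small filter. The sketch below assumes the values contain neither `|` nor `&`, which are special to sed in this usage; the function name, bucket name, and output file name are hypothetical.

```shell
# Replace the <oss_bucket> and <oss_endpoint> placeholders in a manifest
# read from stdin and write the result to stdout.
# Arguments: bucket name, OSS endpoint.
fill_dataset_placeholders() {
  bucket=$1
  endpoint=$2
  sed -e "s|<oss_bucket>|$bucket|g" -e "s|<oss_endpoint>|$endpoint|g"
}

# Example (hypothetical values):
#   fill_dataset_placeholders my-bucket oss-cn-hangzhou-internal.aliyuncs.com \
#     < dataset.yaml > dataset.filled.yaml
```

If you use the filter, apply the generated file instead of the template.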

Apply the configuration.

kubectl apply -f dataset.yaml

Expected output:

dataset.data.fluid.io/qwen2-oss created
jindoruntime.data.fluid.io/qwen2-oss created

Verify that the Dataset is bound and ready.

kubectl get dataset qwen2-oss

Expected output:

NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s

The PHASE: Bound status confirms the Dataset is ready.
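
In a script, the same check can be polled instead of read by eye. This is a minimal sketch that assumes the Dataset resource exposes its phase at `.status.phase`; the function name and the default retry values are hypothetical.

```shell
# Poll the Dataset until its phase is "Bound" or the attempts run out.
# Arguments: dataset name, max attempts (default 30), delay in seconds (default 5).
# Assumes the phase is available at .status.phase.
wait_for_dataset_bound() {
  ds_name=$1
  ds_attempts=${2:-30}
  ds_delay=${3:-5}
  ds_try=0
  while [ "$ds_try" -lt "$ds_attempts" ]; do
    ds_phase=$(kubectl get dataset "$ds_name" -o jsonpath='{.status.phase}')
    if [ "$ds_phase" = "Bound" ]; then
      echo "Dataset $ds_name is Bound"
      return 0
    fi
    ds_try=$((ds_try + 1))
    sleep "$ds_delay"
  done
  echo "Dataset $ds_name is not Bound after $ds_attempts attempts" >&2
  return 1
}
```

For example, `wait_for_dataset_bound qwen2-oss` waits up to roughly 150 seconds with the defaults.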

Step 2: Create a Dataflow

The Dataflow automates the full model preparation pipeline:

Download — pull Qwen2-1.5B-Instruct from ModelScope into the OSS-backed Dataset.
Convert and build — convert the model checkpoint to float16 format and compile it into a TensorRT engine.
Warm up cache — pre-load the engine and Triton backend configuration into the JindoRuntime memory cache.

The convert-and-build step is the most time-consuming. It requires one GPU and up to 30 GiB of memory, and typically takes 15–20 minutes. See Troubleshooting if it fails.
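
Each stage of the convert-and-build script in dataflow.yaml is guarded by a directory check, so a re-run skips work that already finished. The pattern can be sketched in isolation as follows; `run_stage_once` is a hypothetical helper and is not used by the tutorial's script.

```shell
# Run a stage only when its output directory does not exist yet.
# Arguments: output directory, then the command that produces it.
run_stage_once() {
  stage_dir=$1
  shift
  if [ -d "$stage_dir" ]; then
    echo "Directory $stage_dir exists. Skipping."
    return 0
  fi
  "$@"
}
```

The guard treats any existing directory as a finished stage. The tutorial's script writes its results to a local directory first and moves them to the mounted path only after the command succeeds, which reduces the chance that a failed build leaves a partial directory behind.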

Create a file named dataflow.yaml with the following content.

# Step 1: Download Qwen2-1.5B-Instruct from ModelScope.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step1-download-model
spec:
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base
      imageTag: ubuntu22.04
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        echo "Downloading model..."
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
            echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct exists. Skipping model download."
        else
            apt update && apt install -y git git-lfs
            git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
            mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
---
# Step 2: Convert the model checkpoint and build the TensorRT engine.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-trtllm-convert
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build
      imageTag: 24.07-trtllm-python-py3
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        set -ex

        cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
            echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt exists. Skipping checkpoint conversion."
        else
            echo "Converting checkpoint..."
            python3 convert_checkpoint.py --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct --output_dir /root/Qwen2-1.5B-Instruct-ckpt --dtype float16

            echo "Writing TensorRT-LLM model checkpoint to OSS bucket..."
            mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
        fi

        sleep 2

        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
            echo "Directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine exists. Skipping engine build."
        else
            echo "Building TensorRT-LLM engine..."
            trtllm-build --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
            --gemm_plugin float16 \
            --paged_kv_cache enable \
            --output_dir /root/Qwen2-1.5B-Instruct-engine

            echo "Writing TensorRT-LLM engine to OSS bucket..."
            mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
        fi

        if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
            echo "Directory ${MODEL_MOUNT_PATH}/tensorrtllm_backend exists. Skipping tensorrtllm_backend configuration."
        else
            echo "Configuring model..."
            cd /tensorrtllm_backend
            cp -r all_models/inflight_batcher_llm qwen2_ifb
            export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
            export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine

            python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
            python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
            python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
            python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt triton_max_batch_size:8
            python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

            echo "Writing TensorRT-LLM configuration to OSS bucket..."
            mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
            mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      resources:
        requests:
          cpu: 2
          memory: 10Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 12
          memory: 30Gi
          nvidia.com/gpu: 1
---
# Step 3: Pre-load the engine and backend configuration into the JindoRuntime cache.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-trtllm-convert
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
  loadMetadata: true
  target:
  - path: /Qwen2-1.5B-Instruct-engine
  - path: /tensorrtllm_backend
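
The fill_template.py calls in step 2 take their substitutions as one long comma-separated string of key:value pairs. If you adapt those values, keeping one pair per line and joining them can be easier to review. The helper below is a hypothetical sketch and is not part of tensorrtllm_backend.

```shell
# Join "key:value" lines from stdin into a single comma-separated string.
# Blank lines are dropped so they do not produce empty entries.
join_template_args() {
  grep -v '^[[:space:]]*$' | paste -sd, -
}

# Example: build the argument for the ensemble config.
#   args=$(printf '%s\n' 'triton_max_batch_size:8' | join_template_args)
#   python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt "$args"
```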

Create the Dataflow.

kubectl create -f dataflow.yaml

Expected output:

dataprocess.data.fluid.io/step1-download-model created
dataprocess.data.fluid.io/step2-trtllm-convert created
dataload.data.fluid.io/step3-warmup-cache created

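Monitor progress and wait for all steps to complete. The following command lists the two DataProcess steps; to check the cache warm-up step, run kubectl get dataload step3-warmup-cache.

kubectl get dataprocess

Expected output (both DataProcess steps show Complete):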

NAME                   DATASET     PHASE      AGE   DURATION
step1-download-model   qwen2-oss   Complete   23m   3m2s
step2-trtllm-convert   qwen2-oss   Complete   23m   19m58s

If a step fails or gets stuck, see Troubleshooting.
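
To check the table from a script, the PHASE column can be located by its header name. This is a sketch that assumes the column layout shown above; `incomplete_steps` is a hypothetical helper.

```shell
# Read `kubectl get dataprocess` table output on stdin and print the names of
# steps whose PHASE is not "Complete".
# Exit status: 0 if every step is Complete, 1 if any step is not,
# 2 if the header has no PHASE column (or the input is empty).
incomplete_steps() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "PHASE") col = i; next }
    col && $col != "Complete" { print $1; bad = 1 }
    END { if (!col) exit 2; exit bad + 0 }
  '
}
```

For example, `kubectl get dataprocess | incomplete_steps` prints nothing and succeeds once both steps are done.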

Step 3: Deploy the inference service

The three most important parameters for this deployment are:

--gpus=1 — allocates one GPU per pod for model inference
--data=qwen2-oss:/mnt/models — mounts the Fluid-managed PersistentVolumeClaim (PVC) containing the TensorRT engine and Triton configuration
--image — specifies the Triton container image with TensorRT-LLM backend support

For descriptions of all parameters, see Appendix: Arena command parameter reference.

Deploy the inference service using Arena.

arena serve custom \
  --name=qwen2-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
  --data=qwen2-oss:/mnt/models \
  "tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"

Expected output:

service/qwen2-chat-v1 created
deployment.apps/qwen2-chat-v1-custom-serving created
INFO[0003] The Job qwen2-chat has been submitted successfully
INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status
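
The HTTP port appears three times in the command above: in --restful-port, in the readiness probe, and in the Triton --http-port flag. If you change it, all three must change together. The sketch below derives them from one variable; HTTP_PORT, MODEL_REPO, and the function names are hypothetical, while the flags and values are the ones used in this step.

```shell
# Single source for the HTTP port (8000 in this tutorial).
HTTP_PORT=${HTTP_PORT:-8000}
MODEL_REPO=/mnt/models/tensorrtllm_backend/qwen2_ifb

# Print the Triton start command passed as the last argument to Arena.
triton_command() {
  printf '%s' "tritonserver --model-repository=$MODEL_REPO --http-port=$HTTP_PORT --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"
}

# Submit the service with every port-dependent flag taken from HTTP_PORT.
deploy_qwen2_chat() {
  arena serve custom \
    --name=qwen2-chat \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port="$HTTP_PORT" \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: $HTTP_PORT" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
    --data=qwen2-oss:/mnt/models \
    "$(triton_command)"
}
```

With the default port, `deploy_qwen2_chat` submits the same command as shown in this step.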

Wait for the service to become ready.

arena serve get qwen2-chat

Expected output:

Name:       qwen2-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        1m
Address:    192.XX.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                           ------   ---  -----  --------  ---  ----
  qwen2-chat-v1-custom-serving-657869c698-hl665  Running  1m   1/1    0         1    ap-southeast-1.192.XX.XX.XX

Proceed to the next step once the output shows Available: 1 and the pod shows 1/1 in the READY column.
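
The readiness condition can also be evaluated from a script by comparing the Desired and Available fields. This is a sketch that assumes the output layout shown above; `service_is_available` is a hypothetical helper.

```shell
# Read `arena serve get` output on stdin and succeed when Desired is
# greater than zero and equal to Available.
service_is_available() {
  awk '
    $1 == "Desired:"   { desired = $2 }
    $1 == "Available:" { available = $2 }
    END { exit !(desired > 0 && desired == available) }
  '
}
```

For example, `arena serve get qwen2-chat | service_is_available && echo ready`.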

Step 4: Validate the inference service

Set up port forwarding to access the service locally.

Important
Port forwarding is for development and debugging only. It does not provide production-level reliability, security, or scalability. For production networking, see Manage Ingresses.
```
kubectl port-forward svc/qwen2-chat-v1 8000:8000
```
Expected output:
```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```
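
Before sending an inference request, you can check that Triton reports ready through the forwarded port. The sketch below uses Triton's HTTP health endpoint, /v2/health/ready, which returns status 200 when the server is ready; the function name and default address are hypothetical.

```shell
# Succeed when the Triton server behind the given host:port reports ready.
# Argument: host:port (defaults to localhost:8000, the forwarded port).
triton_ready() {
  ready_code=$(curl -s -o /dev/null -w '%{http_code}' "http://${1:-localhost:8000}/v2/health/ready")
  [ "$ready_code" = "200" ]
}
```

Run it in a second terminal while the port forwarding is active, for example `triton_ready && echo "Triton is ready"`.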

Send a test inference request.

curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

Expected output:

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is a type of artificial intelligence that allows computer systems to learn from data without being explicitly programmed."}
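
To send other prompts without editing JSON by hand, the request body can be generated. This is a minimal sketch: it escapes only backslashes and double quotes, so prompts that contain newlines or other control characters are out of scope. The function names are hypothetical, and the field values mirror the request above.

```shell
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Control characters such as newlines are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print the request body for the ensemble generate endpoint.
# Arguments: prompt text, max_tokens (default 20).
build_generate_payload() {
  printf '{"text_input": "%s", "max_tokens": %s, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}' \
    "$(json_escape "$1")" "${2:-20}"
}

# Example:
#   curl -X POST localhost:8000/v2/models/ensemble/generate \
#     -d "$(build_generate_payload 'What is machine learning?' 20)"
```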

(Optional) Step 5: Clean up the environment

To delete the inference service, run:

arena serve delete qwen2-chat
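
The command above removes only the inference service. If you also want to remove the other resources created in this tutorial, one possible order is the reverse of creation. This is a sketch under the assumption that dataflow.yaml and dataset.yaml are still in the current directory; the function name and the DRY_RUN variable are hypothetical. It prints the commands by default and does not delete any files in the OSS bucket.

```shell
# Print (DRY_RUN=1, the default) or run the cleanup commands in reverse
# order of creation: service, Dataflow steps, Dataset and JindoRuntime, Secret.
cleanup_tutorial_resources() {
  dry_run=${DRY_RUN:-1}
  for cleanup_cmd in \
    "arena serve delete qwen2-chat" \
    "kubectl delete -f dataflow.yaml" \
    "kubectl delete -f dataset.yaml" \
    "kubectl delete secret fluid-oss-secret"
  do
    if [ "$dry_run" = "1" ]; then
      echo "$cleanup_cmd"
    else
      # Intentionally unquoted so the string is split into command and arguments.
      $cleanup_cmd
    fi
  done
}
```

Review the printed commands first, then run `DRY_RUN=0 cleanup_tutorial_resources` to execute them.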

Troubleshooting

The engine build runs out of GPU memory

Symptom: step2-trtllm-convert fails with an out-of-memory (OOM) error during the trtllm-build phase.

Check the pod logs for the specific error:

kubectl logs -l fluid.io/dataprocess=step2-trtllm-convert --tail=50

The step2-trtllm-convert step requests one GPU, which in this tutorial is an A10 with 24 GiB of GPU memory. If the GPU node is shared with other workloads, free up GPU memory before retrying. To retry from the failed step without re-running the download, delete only step2-trtllm-convert and re-create it. The script skips already-completed stages by checking for existing directories.
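
Because dataflow.yaml holds three resources, re-creating only the second one requires extracting its document from the file. The following sketch splits the stream on `---` separator lines; `yaml_doc` is a hypothetical helper.

```shell
# Print the Nth document (1-based) of a multi-document YAML stream on stdin.
# Documents are separated by lines that consist solely of "---".
yaml_doc() {
  awk -v want="$1" '
    BEGIN { n = 1 }
    /^---[[:space:]]*$/ { n++; next }
    n == want { print }
  '
}

# Example: delete the failed step, then re-create it from the same file.
#   kubectl delete dataprocess step2-trtllm-convert
#   yaml_doc 2 < dataflow.yaml | kubectl create -f -
```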

The OSS path is not found

Symptom: step1-download-model fails, or step2-trtllm-convert cannot find the model files at ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct.

Verify the Dataset mount is correct:

kubectl describe dataset qwen2-oss

Check that mountPoint in dataset.yaml matches your OSS bucket path (format: oss://<bucket-name>/qwen2-1.5b), and that fs.oss.endpoint is set to the endpoint for the region where your bucket is located.
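
A quick format check can catch a placeholder that was never replaced. This is a minimal sketch that checks only the shape of the value against the format above; it does not verify that the bucket or path exists. `is_tutorial_oss_path` is a hypothetical helper.

```shell
# Succeed when the argument looks like oss://<bucket-name>/<path> and
# no angle-bracket placeholder is left in it.
is_tutorial_oss_path() {
  case $1 in
    *'<'*|*'>'*) return 1 ;;   # a placeholder was not replaced
    oss://?*/?*) return 0 ;;   # non-empty bucket and non-empty path
    *) return 1 ;;
  esac
}
```

For example, `is_tutorial_oss_path "oss://my-bucket/qwen2-1.5b"` succeeds, while the unedited template value fails.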

A DataProcess job is stuck in a non-Complete phase

Symptom: kubectl get dataprocess shows a step in Pending or Running state for an unusually long time.

Describe the job to check for scheduling issues or errors:

kubectl describe dataprocess step2-trtllm-convert

Check the Events section for messages such as insufficient GPU resources, image pull failures, or node taints preventing scheduling. The engine build step (step2-trtllm-convert) typically takes 15–20 minutes; allow that time before investigating.

Appendix: Arena command parameter reference

The following table describes all parameters used in the arena serve custom command in Step 3.

| Parameter | Description | Example |
| --- | --- | --- |
| `serve custom` | Arena subcommand. Deploys a custom model service instead of a preset type such as `tfserving` or `triton`. | N/A |
| `--name` | Service name. A unique name for the service, used in subsequent management operations such as viewing logs and deleting the service. | `qwen2-chat` |
| `--version` | Service version. A version label for the service, used in version management and phased releases. | `v1` |
| `--gpus` | GPU count. The number of GPUs allocated to each pod. Required when the model needs GPUs for inference. | `1` |
| `--replicas` | Replica count. The number of service pods to run. Increasing replicas improves concurrent throughput and availability. | `1` |
| `--restful-port` | RESTful port. The port on which the service exposes its REST API for inference requests. | `8000` |
| `--readiness-probe-action` | Readiness probe type. Sets the check method for the Kubernetes readiness probe, which determines when the container is ready to receive traffic. | `tcpSocket` |
| `--readiness-probe-action-option` | Probe options. Parameters for the probe type. For `tcpSocket`, specifies the port to check. | `port: 8000` |
| `--readiness-probe-option` | Additional probe settings. Can be specified multiple times. Sets the initial delay and check interval. | `initialDelaySeconds: 30`, `periodSeconds: 30` |
| `--data` | Volume mount. Mounts a PVC to a path in the container. Format: `PVC-name:mount-path`. Used to mount model files from the Fluid-managed Dataset. | `qwen2-oss:/mnt/models` |
| `--image` | Container image. The full URL of the container image. Defines the runtime environment for the service. | `kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3` |
| `[COMMAND]` | Start command. The command to run after the container starts. Launches the Triton server with the model repository and port configuration. | `"tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"` |