This topic uses the Qwen1.5-4B-Chat model and the A10 and T4 GPUs as an example to demonstrate how to use the rtp-llm framework to deploy Qwen inference services in Container Service for Kubernetes (ACK).
Background information
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud based on the Transformer architecture. The model is trained on an ultra-large amount of data that covers a variety of web text, books from specialized sectors, and code. For more information, see Qwen GitHub repository.
rtp-llm
rtp-llm is an inference acceleration engine developed by the Alibaba large language model (LLM) prediction team to improve the efficiency and performance of LLM inference. rtp-llm provides the following features:
Provides high-performance CUDA kernels, including PagedAttention, FlashAttention, and FlashDecoding.
Adopts the WeightOnly INT8 and WeightOnly INT4 quantization technologies.
Supports mainstream quantization algorithms, including GPTQ and Activation-aware Weight Quantization (AWQ).
Uses an adaptive KVCache quantization framework and optimizes the overhead of continuous batching.
Is optimized for the V100 GPU.
For more information, see rtp-llm.
Prerequisites
An ACK Pro cluster that contains GPU-accelerated nodes is created. The Kubernetes version of the cluster is 1.22 or later. Each GPU-accelerated node provides 16 GB of GPU memory or above. For more information, see Create an ACK managed cluster.
We recommend that you install GPU driver version 525. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to specify the GPU driver version as 525.105.17. For more information, see Specify an NVIDIA driver version for nodes by adding a label.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
Step 1: Prepare model data
This section uses the Qwen1.5-4B-Chat model as an example to demonstrate how to download a model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and a persistent volume claim (PVC) in an ACK cluster.
For more information about how to upload a model to Apsara File Storage NAS (NAS), see Mount a statically provisioned NAS volume.
Download the model file.
Run the following command to install Git:
# Run yum install git or apt install git.
yum install git
Run the following command to install the Git Large File Storage (LFS) plug-in:
# Run yum install git-lfs or apt install git-lfs.
yum install git-lfs
Run the following command to clone the Qwen1.5-4B-Chat repository on ModelScope to the local environment:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
Run the following command to enter the Qwen1.5-4B-Chat directory and pull large files managed by LFS:
cd Qwen1.5-4B-Chat
git lfs pull
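Because the repository is cloned with GIT_LFS_SKIP_SMUDGE=1, the large weight files are small Git LFS pointer stubs until git lfs pull completes. The following is a minimal sketch, not part of the original procedure, that lists any files that are still pointer stubs before you upload them. The function names is_lfs_pointer and list_lfs_pointers are hypothetical helpers; the check relies on the fact that a pointer stub begins with the line "version https://git-lfs.github.com/spec/v1".

```shell
#!/bin/sh
# Sketch: detect files that are still Git LFS pointer stubs (not real weights).

# is_lfs_pointer FILE -> exit status 0 if FILE starts with the LFS pointer header.
# Only the first 64 bytes are read, so large binary files are cheap to check.
is_lfs_pointer() {
    head -c 64 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# list_lfs_pointers DIR -> print every pointer stub under DIR, skipping .git.
list_lfs_pointers() {
    find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example (run from the parent directory of the clone):
#   list_lfs_pointers ./Qwen1.5-4B-Chat
# An empty result suggests that all LFS-managed files were pulled.
```

If the command prints any file names, run git lfs pull again before you upload the directory to OSS.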
Upload the Qwen1.5-4B-Chat model file to OSS.
Log on to the OSS console, and view and record the name of the OSS bucket that you created.
For more information about how to create an OSS bucket, see Create a bucket.
Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.
Run the following command to create a directory named Qwen1.5-4B-Chat in OSS:
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
Run the following command to upload the model file to OSS:
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
Configure PVs and PVCs in the destination cluster. For more information, see Mount a statically provisioned OSS volume.
The following table describes the parameters of the PV.

| Parameter | Description |
| --- | --- |
| PV Type | OSS |
| Volume Name | llm-model |
| Access Certificate | Specify the AccessKey ID and AccessKey secret used to access the OSS bucket. |
| Bucket ID | Specify the name of the OSS bucket that you created. |
| OSS Path | Select the path of the model, such as /models/Qwen1.5-4B-Chat. |
The following table describes the parameters of the PVC.

| Parameter | Description |
| --- | --- |
| PVC Type | OSS |
| Volume Name | llm-model |
| Allocation Mode | Select Existing Volumes. |
| Existing Volumes | Click the Existing Volumes hyperlink and select the PV that you created. |
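If you prefer kubectl to the console, the same PV and PVC can be approximated with a manifest. The following sketch writes one to a local file. It is an assumption-laden illustration, not the documented procedure: the CSI driver name (ossplugin.csi.alibabacloud.com), the volumeAttributes keys, and the Secret keys akId and akSecret reflect the ACK OSS CSI plug-in as commonly documented and should be verified against Mount a statically provisioned OSS volume. The Secret name oss-secret, the 30Gi capacity, and the ReadOnlyMany access mode are hypothetical choices, and the angle-bracket placeholders must be replaced with your own values.

```shell
#!/bin/sh
# Sketch: write a Secret + PV + PVC manifest for the llm-model OSS volume.
# All <...> values are placeholders; field names are assumptions to verify.
cat > llm-model-oss.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret            # hypothetical Secret name
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi             # assumed size, not from the original text
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: <Your-OSS-Endpoint>
      path: <Your-Model-Path>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF

# After replacing the placeholders, apply the manifest with:
#   kubectl apply -f llm-model-oss.yaml
```

The PVC binds to the PV through the alicloud-pvname label selector, which plays the role of the Existing Volumes selection in the console.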
Step 2: Deploy an inference service
Run the following command to deploy an inference service from the Qwen1.5-4B-Chat model.
Use Arena to deploy a custom inference service. The name of the service is rtp-llm-qwen and its version is v1. The service uses one GPU and has one replica. Readiness probes are configured for the service. Models are considered a special type of data. Therefore, set the --data parameter to mount the model PVC named llm-model to the /model/Qwen1.5-4B-Chat directory in containers.
Use a single A10 GPU
arena serve custom \
    --name=rtp-llm-qwen \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --restful-port=8000 \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/rtp_llm:0.1.12-cuda12-ubuntu22.04 \
    --data=llm-model:/model/Qwen1.5-4B-Chat \
    "MODEL_TYPE=qwen_2 START_PORT=8000 CHECKPOINT_PATH=/model/Qwen1.5-4B-Chat TOKENIZER_PATH=/model/Qwen1.5-4B-Chat python3 -m maga_transformer.start_server"
Use a single T4 GPU
arena serve custom \
    --name=rtp-llm-qwen \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --restful-port=8000 \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/rtp_llm:0.1.12-cuda12-ubuntu22.04 \
    --data=llm-model:/model/Qwen1.5-4B-Chat \
    "MODEL_TYPE=qwen_2 START_PORT=8000 CHECKPOINT_PATH=/model/Qwen1.5-4B-Chat TOKENIZER_PATH=/model/Qwen1.5-4B-Chat MAX_SEQ_LEN=2048 python3 -m maga_transformer.start_server"
The following table describes the parameters.

| Parameter | Description |
| --- | --- |
| --name | The name of the inference service. |
| --version | The version of the inference service. |
| --gpus | The number of GPUs for each inference service replica. |
| --replicas | The number of inference service replicas. |
| --restful-port | The port of the inference service to be exposed. |
| --readiness-probe-action | The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket. |
| --readiness-probe-action-option | The connection method of readiness probes. |
| --readiness-probe-option | The readiness probe configuration. |
| --data | Mounts a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:). Specify the name of the PVC on the left side of the colon. You can run the arena data list command to query the existing PVCs in the cluster. Specify the path in the runtime environment to which the PVC is mounted on the right side of the colon. Your script can then access the data or model in the PV through this path. |
| --image | The address of the inference service image. |

Expected output:
service/rtp-llm-qwen-v1 created
deployment.apps/rtp-llm-qwen-v1-custom-serving created
INFO[0001] The Job rtp-llm-qwen has been submitted successfully
INFO[0001] You can run `arena serve get rtp-llm-qwen --type custom-serving -n default` to check the job status
The output indicates that the inference service is deployed.
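A common source of startup failures is a mismatch between the mount path given in --data and the CHECKPOINT_PATH or TOKENIZER_PATH passed to the server. The following sketch splits a --data value into its PVC name and mount path and checks that a model path lies under the mount. The helper names data_pvc, data_mount, and path_under_mount are hypothetical and are not part of Arena.

```shell
#!/bin/sh
# Sketch: sanity-check a "--data=<pvc>:<mount path>" value against a model path.

# data_pvc SPEC -> print the PVC name (text before the first colon).
data_pvc() { printf '%s\n' "${1%%:*}"; }

# data_mount SPEC -> print the mount path (text after the first colon).
data_mount() { printf '%s\n' "${1#*:}"; }

# path_under_mount MOUNT PATH -> exit status 0 if PATH equals MOUNT or lies beneath it.
path_under_mount() {
    mount=${1%/}
    case "${2%/}" in
        "$mount"|"$mount"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with the values used in this topic:
#   spec="llm-model:/model/Qwen1.5-4B-Chat"
#   path_under_mount "$(data_mount "$spec")" /model/Qwen1.5-4B-Chat && echo "paths agree"
```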
Run the following command to query the detailed information of the service and wait until the service is ready:
arena serve get rtp-llm-qwen
Expected output:
Name:       rtp-llm-qwen
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        1h
Address:    192.168.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                             STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                             ------   ---  -----  --------  ---  ----
  rtp-llm-qwen-v1-custom-serving-696f699485-mn56v  Running  1h   1/1    0         1    cn-beijing.192.168.XX.XX
The output indicates that a pod (rtp-llm-qwen-v1-custom-serving-696f699485-mn56v) is running for the inference service and is ready to provide services.
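Loading the model can take a while, and the readiness probe only starts after 30 seconds and then runs every 30 seconds, so the service may not be ready on the first query. The following sketch shows one way to poll instead of re-running the command by hand. The retry_until helper is hypothetical. The example in the comments assumes the Deployment name rtp-llm-qwen-v1-custom-serving shown in the deployment output and uses kubectl rollout status as the readiness check.

```shell
#!/bin/sh
# Sketch: retry a command a fixed number of times with a fixed delay.

# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example: check every 30 seconds, up to 20 times, until the rollout completes.
#   retry_until 20 30 kubectl rollout status \
#       deployment/rtp-llm-qwen-v1-custom-serving --timeout=10s
```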
Step 3: Verify the inference service
Run the following command to set up port forwarding between the inference service and local environment.
Important: Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable enough for production environments. Use it only for development and debugging. For more information about networking solutions used for production in ACK clusters, see Ingress overview.
kubectl port-forward svc/rtp-llm-qwen-v1 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Run the following command to send a request to the rtp-llm inference service.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"chat-","object":"chat.completion","created":1717383026,"model":"AsyncModel","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What test do you want me to run?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}
The output indicates that the model can generate a response based on the given prompt. In this example, the prompt is a test request.
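To try other prompts without hand-editing the JSON, the request body can be generated by a small helper. The following sketch is illustrative only: json_escape and chat_payload are hypothetical functions, the escaping handles only backslashes and double quotes in single-line prompts, and the payload includes only the model, messages, and max_tokens fields from the preceding request.

```shell
#!/bin/sh
# Sketch: build a request body for the /v1/chat/completions endpoint.

# json_escape STRING -> escape backslashes and double quotes for a JSON string.
# Assumption: single-line input; control characters are not handled.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# chat_payload PROMPT [MAX_TOKENS] -> print the JSON request body.
# MAX_TOKENS defaults to 10, as in the request shown in this step.
chat_payload() {
    prompt=$(json_escape "$1")
    max_tokens=${2:-10}
    printf '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %s}\n' "$prompt" "$max_tokens"
}

# Example (requires the port forwarding from this step to be active):
#   curl http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" \
#       -d "$(chat_payload 'Test' 10)"
```

If jq is installed, appending | jq -r '.choices[0].message.content' to the curl command prints only the generated text.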
(Optional) Step 4: Clear the environment
If you no longer need the resources, clear the environment promptly.
Run the following command to delete the inference service:
arena serve del rtp-llm-qwen
Run the following command to delete the PV and PVC:
kubectl delete pvc llm-model
kubectl delete pv llm-model