This topic describes the release notes for inference-nv-pytorch 25.04.
Main features and bug fixes
Main features
vLLM upgraded to v0.8.5, supporting Qwen3 models
SGLang image: PyTorch upgraded to 2.6.0 and SGLang upgraded to v0.4.6.post1, supporting Qwen3 models (see the launch sketch below)
Bug fixes
None
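Since both images now support Qwen3 models, here is a minimal launch sketch for each engine. The model name Qwen/Qwen3-8B and the ports are illustrative assumptions, not values from these release notes.
```
# Minimal sketch: serve a Qwen3 model with the upgraded engines.
# Qwen/Qwen3-8B and the ports are assumptions for illustration.

# vLLM v0.8.5 (OpenAI-compatible server, default port 8000):
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-8B

# SGLang v0.4.6.post1:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000
```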
Content
| Image | inference-nv-pytorch | inference-nv-pytorch |
| --- | --- | --- |
| Tag | 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless | 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless |
| Scenarios | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA Driver release >= 550 | NVIDIA Driver release >= 550 |
| System components | vLLM v0.8.5, PyTorch 2.6, CUDA 12.4 | SGLang v0.4.6.post1, PyTorch 2.6, CUDA 12.4 |
Asset
Public network image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu. {image:tag} indicates the name and tag of the image.
Currently, you can pull only images in the China (Beijing) region over a VPC.
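For example, with the placeholders substituted for a pull over a VPC in the China (Beijing) region (the vLLM tag is used here; the SGLang tag works the same way):
```
# {region-id} -> cn-beijing, {image:tag} -> the vLLM image name and tag:
docker pull acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless
```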
The 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless and 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless images are applicable to the ACS product form and the multi-tenant product form of Lingjun, but not to the single-tenant product form of Lingjun.
Driver requirements
NVIDIA Driver release >= 550
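To confirm that a node meets this requirement, you can query the installed driver version, for example:
```
# Print the installed NVIDIA driver version; it must be 550 or later.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```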
Quick Start
The following example uses Docker to pull the inference-nv-pytorch image and tests the inference service with the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, you must select the image on the artifact center page of the console when you create workloads, or specify the image in a YAML file. For more information, refer to the related topics.
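As a minimal sketch of the YAML route, the following creates a workload that uses the vLLM image. The Pod name, command, and GPU resource name are assumptions for illustration; ACS may require additional labels or annotations, so consult the ACS workload documentation for the exact schema.
```
# Minimal sketch only: Pod name, command, and GPU resource name are
# assumptions, not ACS-confirmed specifics.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vllm-inference            # hypothetical name
spec:
  containers:
  - name: inference
    image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    - --model
    - /mnt/Qwen2.5-7B-Instruct
    resources:
      limits:
        nvidia.com/gpu: 1         # GPU resource name may differ on ACS
EOF
```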
Pull the inference container image.
```
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```
Download an open source model in the ModelScope format.
```
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
```
Run the following command to start the container.
```
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```
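Because the container starts in detached mode, open a shell in it before running the inference commands below (the `docker ps -lq` shortcut assumes this is the most recently created container):
```
# Attach a shell to the detached container; `docker ps -lq` prints the ID
# of the most recently created container.
docker exec -it $(docker ps -lq) bash
```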
Run an inference test to verify the conversation feature of vLLM.
Start the server.
```
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1
```
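Before sending chat requests, you can optionally confirm that the server is up; the OpenAI-compatible server exposes a model listing endpoint (this assumes the default port 8000):
```
# Optional readiness check: list the models served by the running server.
curl http://localhost:8000/v1/models
```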
Test on the client.
```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Please introduce deep learning."}
    ]
  }'
```
For more information about how to work with vLLM, see vLLM.
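As a variant of the request above, the OpenAI-compatible API also accepts the standard streaming field, which returns tokens incrementally rather than as a single response:
```
# Same request with "stream": true (a standard OpenAI-compatible field);
# the response arrives as incremental server-sent events.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Please introduce deep learning."}],
    "stream": true
  }'
```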
Known issues
None