This topic provides the release notes for inference-nv-pytorch 25.05.
Main features and bug fixes
Main features
The operating system of the vLLM image has been upgraded to Ubuntu 24.04, the Python version has been upgraded to 3.12, the CUDA version has been upgraded to 12.8, and the vLLM version has been upgraded to v0.8.5.post1.
The SGLang version in the SGLang image has been upgraded to v0.4.6.post4.
Bug fixes
None
Content
|  | inference-nv-pytorch | inference-nv-pytorch |
| --- | --- | --- |
| Tag | 25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless | 25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless |
| Scenarios | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA driver release >= 570 | NVIDIA driver release >= 550 |
| System components |  |  |
Assets
Public network images
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu. {image:tag} indicates the name and tag of the image.
Currently, you can pull only images in the China (Beijing) region over a VPC.
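As an illustration of the substitution rule above, the VPC pull address for the vLLM image in the China (Beijing) region can be assembled as follows. The region ID and image tag are the ones stated in this release; adjust them for your own region and image once more regions are supported.

```shell
# Substitute {region-id} and {image:tag} per the rule above.
REGION_ID="cn-beijing"
IMAGE_TAG="inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless"
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${VPC_IMAGE}"
```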
The 25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless and 25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless images are applicable to ACS services and Lingjun multi-tenant services, but are not applicable to Lingjun single-tenant services.
Driver requirements
For CUDA 12.8 images: NVIDIA driver release >= 570
For CUDA 12.4 images: NVIDIA driver release >= 550
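A minimal sketch of checking an installed driver against these minimums, assuming NVIDIA driver version strings of the form "570.86.10" (as reported by nvidia-smi). The helper name is illustrative, not part of the image.

```python
# Hypothetical helper: does the driver's major release meet the minimum
# required by the image? Assumes version strings like "570.86.10".
def driver_meets_minimum(driver_version: str, minimum_major: int) -> bool:
    major = int(driver_version.split(".")[0])
    return major >= minimum_major

# CUDA 12.8 images need release >= 570; CUDA 12.4 images need >= 550.
print(driver_meets_minimum("570.86.10", 570))  # True
print(driver_meets_minimum("550.54.15", 570))  # False: too old for CUDA 12.8
print(driver_meets_minimum("550.54.15", 550))  # True: fine for CUDA 12.4
```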
Quick Start
The following example uses Docker to pull the inference-nv-pytorch image and the Qwen2.5-7B-Instruct model to test an inference service.
To use the inference-nv-pytorch image in ACS, select the image on the artifact center page of the console when you create workloads, or specify the image in a YAML file.
Pull the inference container image.
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
Download an open source model in the ModelScope format.
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
Run the following command to start the container.
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
Run an inference test to verify the chat completion feature of vLLM.
Start the server.
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1
Test on the client.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Please introduce deep learning."}
    ]
  }'
For more information about how to work with vLLM, see vLLM.
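The curl test can also be issued from Python with only the standard library. This sketch builds the same request as the client test; actually sending it (with urllib.request.urlopen) requires the vLLM server from the previous step to be running on localhost:8000, so here we only construct and inspect the request.

```python
import json
import urllib.request

# Same chat-completion payload as the curl command above.
payload = {
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."},
    ],
}

# Build the POST request against the vLLM OpenAI-compatible endpoint.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method())  # POST
print(req.full_url)
# With the server running: urllib.request.urlopen(req).read()
```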
Known issues
None