The inference-nv-pytorch 25.05 release upgrades the vLLM image stack to Ubuntu 24.04, Python 3.12, CUDA 12.8, and vLLM v0.8.5.post1, and upgrades SGLang to v0.4.6.post4 in the SGLang image.
What's new
Main features
vLLM image: Upgraded to Ubuntu 24.04, Python 3.12, CUDA 12.8, and vLLM v0.8.5.post1.
SGLang image: Upgraded SGLang to v0.4.6.post4.
Bug fixes
None.
Image contents
The following table lists the two images included in this release and their system components.
| | inference-nv-pytorch | inference-nv-pytorch |
|---|---|---|
| Tag | 25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless | 25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless |
| Scenario | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| Driver requirement | NVIDIA driver >= 570 | NVIDIA driver >= 550 |
| System components | Ubuntu 24.04, Python 3.12, CUDA 12.8, PyTorch 2.7, vLLM v0.8.5.post1 | CUDA 12.4, PyTorch 2.6, SGLang v0.4.6.post4 |
Both images are compatible with ACS services and Lingjun multi-tenant services, but are not compatible with Lingjun single-tenant services.
Assets
Public network images
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders with actual values:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The region where your ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | The name and tag of the image | inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless |
Currently, you can pull VPC images only in the China (Beijing) region.
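As an illustration, the substitution for the China (Beijing) region looks like the following sketch. The tag is taken from the public image list above; run the pull from inside a VPC with access to the registry.

```shell
# Build the VPC image reference by substituting the two placeholders.
# REGION_ID and IMAGE_TAG below are illustrative values from this release.
REGION_ID="cn-beijing"
IMAGE_TAG="inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless"
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${VPC_IMAGE}"
# docker pull "${VPC_IMAGE}"   # run this inside the VPC
```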
Driver requirements
CUDA 12.8 images: NVIDIA driver release >= 570
CUDA 12.4 images: NVIDIA driver release >= 550
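Before pulling an image, you can confirm the host driver meets the minimum. A minimal sketch, assuming `nvidia-smi` is installed on the host; the `check_driver` helper and the sample version string are illustrative, not part of the images:

```shell
# check_driver MIN ACTUAL -> exit status 0 if ACTUAL >= MIN.
# sort -V orders version strings numerically, so the minimum must sort
# first (or be equal) when the check passes.
check_driver() {
  min="$1"; actual="$2"
  [ "$(printf '%s\n%s\n' "$min" "$actual" | sort -V | head -n1)" = "$min" ]
}

# On a GPU host, query the real driver version, e.g.:
#   actual=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# The value below is a placeholder for illustration.
if check_driver 570 "570.86.15"; then
  echo "driver OK for CUDA 12.8 images"
fi
```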
Quick start
The following example pulls the inference-nv-pytorch image with Docker and runs a test inference using the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, select the image from the artifact center page in the console when creating workloads, or specify it in a YAML file.
Pull the inference container image.
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the Qwen2.5-7B-Instruct model from ModelScope.
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Start the container.
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Start the vLLM server and test inference.
Start the server.
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1

Send a test request from the client.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Please introduce deep learning."}
    ]
  }'

For more information about working with vLLM, see vLLM.
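The raw JSON response can be verbose. A minimal sketch for pulling out just the assistant's reply, assuming `python3` is available on the client; the `extract_reply` helper and the abbreviated sample response are illustrative:

```shell
# Extract the assistant message text from a chat-completions response.
# In practice, pipe the curl output in, e.g.:
#   curl -s http://localhost:8000/v1/chat/completions ... | extract_reply
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Abbreviated, hypothetical sample response for illustration:
SAMPLE='{"choices":[{"message":{"role":"assistant","content":"Deep learning is a branch of machine learning."}}]}'
echo "$SAMPLE" | extract_reply
```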
Known issues
None.