This topic describes the release notes for inference-nv-pytorch 25.04.
Main features and bug fixes
Main features
vLLM upgraded to v0.8.5, supporting Qwen3 models
SGLang image: PyTorch upgraded to 2.6.0 and SGLang upgraded to v0.4.6.post1, supporting Qwen3 models (see the launch sketch below)
Bug fixes
None
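Since both images now support Qwen3 models, here is a minimal launch sketch for each engine. The model name Qwen/Qwen3-8B and the ports are illustrative assumptions, not values from these release notes.
```
# Minimal sketch: serve a Qwen3 model with the upgraded engines.
# Qwen/Qwen3-8B and the ports are assumptions for illustration.

# vLLM v0.8.5 (OpenAI-compatible server, default port 8000):
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-8B

# SGLang v0.4.6.post1:
python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000
```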
Content
| Image | inference-nv-pytorch | inference-nv-pytorch |
| --- | --- | --- |
| Tag | 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless | 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless |
| Scenarios | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA Driver release >= 550 | NVIDIA Driver release >= 550 |
| System components | vLLM v0.8.5, PyTorch 2.6, CUDA 12.4 | SGLang v0.4.6.post1, PyTorch 2.6, CUDA 12.4 |
Asset
Public network image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu. {image:tag} indicates the name and tag of the image.
Currently, you can pull only images in the China (Beijing) region over a VPC.
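For example, with the placeholders substituted for a pull over a VPC in the China (Beijing) region (the vLLM tag is used here; the SGLang tag works the same way):
```
# {region-id} -> cn-beijing, {image:tag} -> the vLLM image name and tag:
docker pull acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless
```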
The 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless and 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless images are applicable to the ACS product form and the multi-tenant product form of Lingjun, but not to the single-tenant product form of Lingjun.
Driver requirements
NVIDIA Driver release >= 550
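To confirm that a node meets this requirement, you can query the installed driver version, for example:
```
# Print the installed NVIDIA driver version; it must be 550 or later.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```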
Quick Start
The following example uses Docker to pull the inference-nv-pytorch image and tests the inference service with the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, you must select the image on the artifact center page of the console when you create workloads, or specify the image in a YAML file. For more information, refer to the related topics.
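As a minimal sketch of the YAML route, the following creates a workload that uses the vLLM image. The Pod name, command, and GPU resource name are assumptions for illustration; ACS may require additional labels or annotations, so consult the ACS workload documentation for the exact schema.
```
# Minimal sketch only: Pod name, command, and GPU resource name are
# assumptions, not ACS-confirmed specifics.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vllm-inference            # hypothetical name
spec:
  containers:
  - name: inference
    image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    - --model
    - /mnt/Qwen2.5-7B-Instruct
    resources:
      limits:
        nvidia.com/gpu: 1         # GPU resource name may differ on ACS
EOF
```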
Pull the inference container image.
```
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```
Download an open source model in the ModelScope format.
```
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
```
Run the following command to start the container.
```
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```
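Because the container starts in detached mode, open a shell in it before running the inference commands below (the `docker ps -lq` shortcut assumes this is the most recently created container):
```
# Attach a shell to the detached container; `docker ps -lq` prints the ID
# of the most recently created container.
docker exec -it $(docker ps -lq) bash
```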
Run an inference test to verify the conversation feature of vLLM.
Start the server.
```
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1
```
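Before sending chat requests, you can optionally confirm that the server is up; the OpenAI-compatible server exposes a model listing endpoint (this assumes the default port 8000):
```
# Optional readiness check: list the models served by the running server.
curl http://localhost:8000/v1/models
```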
Test on the client.
```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Please introduce deep learning."}
    ]
  }'
```
For more information about how to work with vLLM, see vLLM.
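As a variant of the request above, the OpenAI-compatible API also accepts the standard streaming field, which returns tokens incrementally rather than as a single response:
```
# Same request with "stream": true (a standard OpenAI-compatible field);
# the response arrives as incremental server-sent events.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Please introduce deep learning."}],
    "stream": true
  }'
```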
Known issues
None