The inference-nv-pytorch 25.05 release upgrades the vLLM image stack to Ubuntu 24.04, Python 3.12, CUDA 12.8, and vLLM v0.8.5.post1, and upgrades SGLang to v0.4.6.post4 in the SGLang image.
What's new
Main features
vLLM image: Upgraded to Ubuntu 24.04, Python 3.12, CUDA 12.8, and vLLM v0.8.5.post1.
SGLang image: Upgraded SGLang to v0.4.6.post4.
Bug fixes
None.
Image contents
The following table lists the two images included in this release and their system components.
| | inference-nv-pytorch | inference-nv-pytorch |
|---|---|---|
| Tag | 25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless | 25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless |
| Scenario | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| Driver requirement | NVIDIA driver >= 570 | NVIDIA driver >= 550 |
| System components | Ubuntu 24.04, Python 3.12, CUDA 12.8, PyTorch 2.7, vLLM v0.8.5.post1 | CUDA 12.4, PyTorch 2.6, SGLang v0.4.6.post4 |
Both images are compatible with ACS services and Lingjun multi-tenant services, but are not compatible with Lingjun single-tenant services.
Assets
Public network images
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders with actual values:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The region where your ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | The name and tag of the image | inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless |
Currently, you can pull VPC images only in the China (Beijing) region.
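As an illustration, the substitution for the China (Beijing) region looks like the following sketch. The tag is taken from the public image list above; run the pull from inside a VPC with access to the registry.

```shell
# Build the VPC image reference by substituting the two placeholders.
# REGION_ID and IMAGE_TAG below are illustrative values from this release.
REGION_ID="cn-beijing"
IMAGE_TAG="inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless"
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${VPC_IMAGE}"
# docker pull "${VPC_IMAGE}"   # run this inside the VPC
```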
Driver requirements
CUDA 12.8 images: NVIDIA driver release >= 570
CUDA 12.4 images: NVIDIA driver release >= 550
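Before pulling an image, you can confirm the host driver meets the minimum. A minimal sketch, assuming `nvidia-smi` is installed on the host; the `check_driver` helper and the sample version string are illustrative, not part of the images:

```shell
# check_driver MIN ACTUAL -> exit status 0 if ACTUAL >= MIN.
# sort -V orders version strings numerically, so the minimum must sort
# first (or be equal) when the check passes.
check_driver() {
  min="$1"; actual="$2"
  [ "$(printf '%s\n%s\n' "$min" "$actual" | sort -V | head -n1)" = "$min" ]
}

# On a GPU host, query the real driver version, e.g.:
#   actual=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# The value below is a placeholder for illustration.
if check_driver 570 "570.86.15"; then
  echo "driver OK for CUDA 12.8 images"
fi
```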
Quick start
The following example pulls the inference-nv-pytorch image with Docker and runs a test inference using the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, select the image from the artifact center page in the console when creating workloads, or specify it in a YAML file.
Pull the inference container image.
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the Qwen2.5-7B-Instruct model from ModelScope.
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Start the container.
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Start the vLLM server and test inference.
Start the server.
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1

Send a test request from the client.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Please introduce deep learning."}
    ]
  }'

For more information about working with vLLM, see vLLM.
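The raw JSON response can be verbose. A minimal sketch for pulling out just the assistant's reply, assuming `python3` is available on the client; the `extract_reply` helper and the abbreviated sample response are illustrative:

```shell
# Extract the assistant message text from a chat-completions response.
# In practice, pipe the curl output in, e.g.:
#   curl -s http://localhost:8000/v1/chat/completions ... | extract_reply
extract_reply() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Abbreviated, hypothetical sample response for illustration:
SAMPLE='{"choices":[{"message":{"role":"assistant","content":"Deep learning is a branch of machine learning."}}]}'
echo "$SAMPLE" | extract_reply
```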
Known issues
None.