
Container Compute Service: inference-nv-pytorch 25.07

Last Updated: Mar 26, 2026

This release upgrades vLLM, SGLang, and deepgpu-comfyui, and resolves a multi-node inference bug in DeepSeek-R1 deployments.

What's new

Framework upgrades

Framework          Version
vLLM               v0.9.2
SGLang             v0.4.9.post1
deepgpu-comfyui    v1.1.7

Bug fix

vLLM 0.9.2 encountered a PPMissingLayer error when running the DeepSeek-R1 model in a multi-node (dual-machine) configuration. This release pre-applies the fix from upstream PR #20665, so distributed inference on multi-node clusters works without manual patching.

Image specifications

This release provides two image variants, both targeting Large Language Model (LLM) inference on PyTorch with CUDA 12.8.

vLLM image
  Image tag:           25.07-vllm0.9.2-pytorch2.7-cu128-20250714-serverless
  Use case:            LLM inference
  Framework:           PyTorch
  Driver requirement:  NVIDIA Driver ≥570

SGLang image
  Image tag:           25.07-sglang0.4.9-pytorch2.7-cu128-20250710-serverless
  Use case:            LLM inference
  Framework:           PyTorch
  Driver requirement:  NVIDIA Driver ≥570

System components — vLLM image

Component          Version
Ubuntu             24.04
Python             3.12
Torch              2.7.1+cu128
CUDA               12.8
NCCL               2.27.5
accelerate         1.8.1
diffusers          0.34.0
deepgpu-comfyui    1.1.7
deepgpu-torch      0.0.24+torch2.7.0cu128
flash_attn         2.8.1
imageio            2.37.0
imageio-ffmpeg     0.6.0
ray                2.47.1
transformers       4.53.1
vllm               0.9.3.dev0+ga5dd03c1e.d20250709
xgrammar           0.1.19
triton             3.3.1

System components — SGLang image

Component          Version
Ubuntu             24.04
Python             3.12
Torch              2.7.1+cu128
CUDA               12.8
NCCL               2.27.5
accelerate         1.8.1
diffusers          0.34.0
deepgpu-comfyui    1.1.7
deepgpu-torch      0.0.24+torch2.7.0cu128
flash_attn         2.8.1
flash_mla          1.0.0+9edee0c
flashinfer-python  0.2.7.post1
imageio            2.37.0
imageio-ffmpeg     0.6.0
transformers       4.53.0
sgl-kernel         0.2.4
sglang             0.4.9.post1
xgrammar           0.1.20
triton             3.3.1
torchao            0.9.0

Image access

Public images

Pull either image directly from the public registry:

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.07-vllm0.9.2-pytorch2.7-cu128-20250714-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.07-sglang0.4.9-pytorch2.7-cu128-20250710-serverless

VPC images

For lower-latency pulls within a Virtual Private Cloud (VPC), use:

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the region where your Alibaba Cloud Container Compute Service (ACS) is activated (for example, cn-beijing or cn-wulanchabu), and {image:tag} with the image name and tag.
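As a sketch, the VPC address for the vLLM image in the China (Beijing) region would be assembled like this (the region ID and tag shown are examples; substitute your own values):

```shell
# Assemble the VPC pull address from the template
# acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
REGION_ID="cn-beijing"   # region where your ACS is activated
IMAGE_TAG="inference-nv-pytorch:25.07-vllm0.9.2-pytorch2.7-cu128-20250714-serverless"
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${VPC_IMAGE}"
# docker pull "${VPC_IMAGE}"   # run this from inside the VPC
```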

Important

VPC image pulling is currently supported only in the China (Beijing) region.

Note

Both images are compatible with ACS clusters and Lingjun multi-tenant clusters. They are not supported on Lingjun single-tenant clusters.

Driver requirement

CUDA 12.8 images require NVIDIA Driver 570 or later.
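One quick way to confirm a host meets this requirement is to read the driver version from nvidia-smi and compare its major component against 570. The `driver_ok` helper below is a hypothetical sketch, not part of the image:

```shell
# Hypothetical helper: check a driver version string against the 570 minimum
driver_ok() {
  major="${1%%.*}"          # keep everything before the first dot
  [ "${major}" -ge 570 ]
}

# On a real host, obtain the version string with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
driver_ok "570.86.15" && echo "driver supported"
driver_ok "535.104.05" || echo "driver too old for CUDA 12.8 images"
```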

Quick start

This example pulls the vLLM image, downloads the Qwen2.5-7B-Instruct model, and runs an inference test.

Note

For ACS deployments, select the image from the Artifact Center in the console or specify it in your YAML configuration, then follow the product's deployment guides for end-to-end instructions.

  1. Pull the image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Launch the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
      --ulimit memlock=-1 --ulimit stack=67108864 \
      -v /mnt/:/mnt/ \
      egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Start the vLLM inference server.

    python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
  5. Send a test request from the client.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Please introduce deep learning."}
        ]
      }'

    For more information, see the vLLM documentation.

Known issues

The deepgpu-comfyui plugin for Wanx model video generation supports only gn8is instance types.