
Container Compute Service: inference-nv-pytorch 25.06

Last Updated: Mar 26, 2026

This release updates vLLM to v0.9.0.1 and SGLang to v0.4.7, and introduces the deepgpu-comfyui plug-in for accelerated ComfyUI inference on L20 GPUs. Use this page to confirm what changed, verify that your environment meets the driver requirements, and run the quick start to validate the image.

What's new

Main features

  • vLLM is updated to v0.9.0.1.

  • SGLang is updated to v0.4.7.

  • The deepgpu-comfyui plug-in is introduced to accelerate ComfyUI inference for the Wan2.1 and FLUX models on L20 GPUs. Performance improves by 8%–40% compared with native PyTorch.

Bugs fixed

None.

System components

This release includes two image variants: one based on vLLM and one based on SGLang. The table below lists their image tags, target scenarios, and component versions.

| Item | vLLM image | SGLang image |
| --- | --- | --- |
| Image tag | 25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless | 25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless |
| Scenario | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| NVIDIA driver requirement | >= 570 | >= 550 |
| Ubuntu | 24.04 | 22.04 |
| Python | 3.12 | 3.10 |
| Torch | 2.7.1+cu128 | 2.7.1+cu128 |
| CUDA | 12.8 | 12.8 |
| NCCL | 2.27.3 | 2.27.3 |
| accelerate | 1.7.0 | 1.7.0 |
| diffusers | 0.33.1 | 0.33.1 |
| deepgpu-comfyui | 1.1.5 | 1.1.5 |
| deepgpu-torch | 0.0.21+torch2.7.0cu128 | 0.0.21+torch2.7.0cu128 |
| flash_attn | 2.7.4.post1 | 2.7.4.post1 |
| flash_mla | - | 1.0.0+9edee0c |
| flashinfer-python | - | 0.2.6.post1 |
| imageio | 2.37.0 | 2.37.0 |
| imageio-ffmpeg | 0.6.0 | 0.6.0 |
| ray | 2.46.0 | 2.46.0 |
| transformers | 4.52.4 | 4.52.3 |
| vllm | 0.9.0.2.dev0+g5fbbfe9a4.d20250609 | - |
| sgl-kernel | - | 0.1.7 |
| sglang | - | 0.4.7 |
| xgrammar | 0.1.19 | 0.1.19 |
| triton | 3.3.1 | 3.3.1 |
| torchao | - | 0.9.0 |

Image access

Public images

Pull either image directly from the public registry:

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless

VPC images

For VPC access, use the following address pattern:

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the region where your ACS is activated, such as cn-beijing or cn-wulanchabu. Replace {image:tag} with the image name and tag.
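For example, the address can be composed in a shell script before pulling. The region ID and image tag below are illustrative; substitute the values for your own environment:

```shell
# Compose the VPC registry address from a region ID and an image tag.
REGION_ID="cn-beijing"
IMAGE_TAG="inference-nv-pytorch:25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless"
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "$VPC_IMAGE"
# docker pull "$VPC_IMAGE"   # run this from a host inside the VPC
```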

Important

VPC image pulls are available only in the China (Beijing) region.

Important

Both images (25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless and 25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless) are compatible with ACS services and FLUX multi-tenant services, but are not compatible with FLUX single-tenant services.

Driver requirements

Both images require CUDA 12.8. The minimum NVIDIA driver version differs by image:

Image Minimum NVIDIA driver version
vLLM image 570
SGLang image 550
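Before deploying, you can verify the host driver against the minimum for the image you plan to use. A minimal sketch, assuming a GNU userland with sort -V available (the driver version shown is illustrative; on a real host, query it with nvidia-smi):

```shell
# Return success if the installed driver version meets the minimum.
meets_min() {  # usage: meets_min <installed> <minimum>
    # sort -V orders version strings; the minimum must sort first (or equal).
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Illustrative value; on a real host, query the driver with:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
driver="570.124.06"
if meets_min "$driver" "570"; then
    echo "driver $driver OK for the vLLM image (>= 570)"
else
    echo "driver $driver too old for the vLLM image (>= 570 required)"
fi
```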

Quick start

The following example pulls the inference-nv-pytorch image with Docker and runs an inference test using the Qwen2.5-7B-Instruct model.

To use the inference-nv-pytorch image in ACS, select the image from the artifact center page when creating workloads, or specify it in a YAML file. For details, see:

  • Use ACS GPU compute power to deploy a model inference service from a DeepSeek distilled model

  • Use ACS GPU compute power to deploy a model inference service based on the DeepSeek full version

  • Use ACS GPU compute power to deploy a distributed model inference service based on the DeepSeek full version
  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the Qwen2.5-7B-Instruct model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Run the inference test.

    1. Start the vLLM API server.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Send a test request from the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
              "model": "/mnt/Qwen2.5-7B-Instruct",
              "messages": [
                  {"role": "system", "content": "You are a friendly AI assistant."},
                  {"role": "user", "content": "Please introduce deep learning."}
              ]
          }'

      For more information, see the vLLM documentation.
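The server replies with JSON in the OpenAI chat-completions format. One way to extract the assistant message without extra tooling is the Python json module; the sample response below is illustrative, not real model output:

```shell
# Illustrative response in the OpenAI chat-completions shape; a real one
# comes from the curl request above.
response='{"choices":[{"message":{"role":"assistant","content":"Deep learning is a branch of machine learning."}}]}'

# Extract the assistant message (avoids a jq dependency).
echo "$response" | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```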

Known issues

  • The deepgpu-comfyui plug-in supports accelerated video generation based on the Wan2.1 model only on GN8IS instance types.