
Container Compute Service: inference-nv-pytorch 25.04

Last Updated: Mar 25, 2026

This release upgrades vLLM to v0.8.5 and SGLang to v0.4.6.post1, both with PyTorch 2.6.0 and Qwen3 model support.

What's new

Updated

  • vLLM upgraded to v0.8.5

  • SGLang image PyTorch version upgraded to 2.6.0

  • SGLang upgraded to v0.4.6.post1

Added

  • Qwen3 model support for both vLLM and SGLang images

Bug fixes

None

Image details

Both images target LLM inference on PyTorch with CUDA 12.4 and require NVIDIA Driver release >= 550.
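Before pulling an image, it can be worth confirming that the host driver meets this requirement. A minimal sketch of such a check — the >= 550 threshold comes from this page; the `driver_version` query field is a standard `nvidia-smi` option, and everything else is a generic shell pattern:

```shell
#!/bin/sh
# Sketch: verify the NVIDIA driver major version is >= 550 before pulling the image.
required_major=550

driver_ok() {
    # $1 is a driver version string such as "550.54.15"; compare the major part.
    major=${1%%.*}
    [ "$major" -ge "$required_major" ]
}

# On a real host, feed in the live value:
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# Here we demonstrate the check on a sample value:
if driver_ok "550.54.15"; then
    echo "driver OK"
else
    echo "driver too old"
fi
```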

vLLM image

  • Tag: 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless

  • Scenario: LLM inference

  • Framework: pytorch

  • Driver requirement: NVIDIA Driver release >= 550

  • Ubuntu: 22.04

  • Python: 3.10

  • Torch: 2.6.0+cu124

  • CUDA: 12.4

  • ACCL-N: 2.23.4.12

  • accelerate: 1.6.0

  • diffusers: 0.33.1

  • flash_attn: 2.7.4.post1

  • transformers: 4.51.3

  • vllm: 0.8.5

  • ray: 2.43.0

  • triton: 3.2.0

  • xgrammar: 0.1.18

SGLang image

  • Tag: 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless

  • Scenario: LLM inference

  • Framework: pytorch

  • Driver requirement: NVIDIA Driver release >= 550

  • Ubuntu: 22.04

  • Python: 3.10

  • Torch: 2.6.0+cu124

  • CUDA: 12.4

  • ACCL-N: 2.23.4.12

  • flashinfer-python: 0.2.3

  • transformers: 4.51.1

  • sglang: 0.4.6.post1

  • sgl-kernel: 0.1.0

  • triton: 3.2.0

  • xgrammar: 0.1.17
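The two tags follow a consistent pattern: release, engine and engine version, PyTorch version, CUDA version, build date, and product form. A small sketch that validates a tag string against this pattern — the regex below is my reading of the two tags listed here, not an official naming rule:

```shell
#!/bin/sh
# Sketch: check that an image tag matches the naming pattern seen in this release,
# e.g. 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless.
# The pattern is inferred from the two tags in the table, not an official spec.
tag_matches() {
    echo "$1" | grep -Eq '^[0-9]{2}\.[0-9]{2}-(vllm|sglang)[0-9][0-9a-zA-Z.]*-pytorch[0-9.]+-cu[0-9]+-[0-9]{8}-serverless$'
}

tag_matches "25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless" && echo "vLLM tag OK"
tag_matches "25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless" && echo "SGLang tag OK"
```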

Image registry

Public network

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless

VPC

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the region where your ACS is activated, such as cn-beijing or cn-wulanchabu. Replace {image:tag} with the image name and tag.
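The substitution above can be scripted. A minimal sketch, using the cn-beijing region and the vLLM tag from this page as example values:

```shell
#!/bin/sh
# Sketch: build the VPC registry reference from a region ID and an image:tag,
# following the acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag} template.
region_id="cn-beijing"    # example region from this page
image_tag="inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless"

vpc_ref="acs-registry-vpc.${region_id}.cr.aliyuncs.com/egslingjun/${image_tag}"
echo "$vpc_ref"
# docker pull "$vpc_ref"    # run this from a VPC where your ACS is activated
```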

Important

VPC image pulls are currently supported only in the China (Beijing) region.

Note

The 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless and 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless images are compatible with the ACS product form and the multi-tenant product form of Lingjun. They are not compatible with the single-tenant product form of Lingjun.

Quick start

The following example pulls the inference-nv-pytorch image and runs an inference test using the Qwen2.5-7B-Instruct model with vLLM.

  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download an open-source model in ModelScope format.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Start the vLLM API server inside the container.

    python3 -m vllm.entrypoints.openai.api_server \
    --model /mnt/Qwen2.5-7B-Instruct \
    --trust-remote-code --disable-custom-all-reduce \
    --tensor-parallel-size 1
  5. Send a test request from the client.

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."}
        ]}'

    For more information about working with vLLM, see the vLLM documentation.
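In practice, the request in step 5 only succeeds once the server started in step 4 has finished loading the model. A small polling sketch — vLLM's OpenAI-compatible server exposes a /health endpoint; the retry count and sleep interval here are arbitrary choices, not values from this page:

```shell
#!/bin/sh
# Sketch: wait for the vLLM OpenAI-compatible server to report healthy before
# sending requests. Retry count and interval are arbitrary.
wait_for_server() {
    url="$1"
    retries="${2:-30}"
    i=0
    while [ "$i" -lt "$retries" ]; do
        if curl -sf "$url/health" > /dev/null 2>&1; then
            echo "server ready"
            return 0
        fi
        i=$((i + 1))
        sleep 2
    done
    echo "server did not become ready" >&2
    return 1
}

# Usage, once the server from step 4 is starting up:
#   wait_for_server "http://localhost:8000" && <send the curl request from step 5>
```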

Known issues

None