
Container Compute Service: inference-nv-pytorch 25.05

Last Updated: Mar 25, 2026

The inference-nv-pytorch 25.05 release upgrades the vLLM image to Ubuntu 24.04, Python 3.12, CUDA 12.8, and vLLM v0.8.5.post1, and upgrades the SGLang image to SGLang v0.4.6.post4.

What's new

Main features

  • vLLM image: Upgraded to Ubuntu 24.04, Python 3.12, CUDA 12.8, and vLLM v0.8.5.post1.

  • SGLang image: Upgraded SGLang to v0.4.6.post4.

Bug fixes

None.

Image contents

The following sections list the two images included in this release and their system components.

vLLM image

  • Tag: 25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless

  • Scenario: LLM inference

  • Framework: PyTorch

  • Driver requirement: NVIDIA driver >= 570

  • System components:

      • Ubuntu 24.04

      • Python 3.12

      • Torch 2.7.0+cu128

      • CUDA 12.8

      • NCCL 2.26.5

      • transformers 4.51.3

      • vllm 0.8.5.post2.dev0+g3015d5634.d20250513.cu128

      • ray 2.46.0

      • triton 3.3.0

      • xgrammar 0.1.18

SGLang image

  • Tag: 25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless

  • Scenario: LLM inference

  • Framework: PyTorch

  • Driver requirement: NVIDIA driver >= 550

  • System components:

      • Ubuntu 22.04

      • Python 3.10

      • Torch 2.6.0+cu124

      • CUDA 12.4

      • NCCL 2.26.5

      • accelerate 1.6.0

      • transformers 4.51.1

      • triton 3.2.0

      • xgrammar 0.1.19

      • flashinfer-python 0.2.5

      • sglang 0.4.6.post4

      • sgl-kernel 0.1.2.post1

Note

Both images are compatible with ACS services and Lingjun multi-tenant services, but are not compatible with Lingjun single-tenant services.

Assets

Public network images

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace the placeholders with actual values:

  • {region-id}: The region where your ACS is activated. Examples: cn-beijing, cn-wulanchabu.

  • {image:tag}: The name and tag of the image. Example: inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless.

Important

Currently, you can pull VPC images only in the China (Beijing) region.
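As a sketch, the placeholder substitution above can be scripted. The region and tag values below are the examples from the placeholder descriptions; replace them with your own.

```shell
# Assemble the VPC image reference from the documented placeholders.
# Example values from the table above; adjust for your region and tag.
region_id="cn-beijing"
image_tag="inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless"

vpc_image="acs-registry-vpc.${region_id}.cr.aliyuncs.com/egslingjun/${image_tag}"
echo "${vpc_image}"

# docker pull "${vpc_image}"   # run on a machine inside the VPC
```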

Driver requirements

  • CUDA 12.8 images: NVIDIA driver release >= 570

  • CUDA 12.4 images: NVIDIA driver release >= 550
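A minimal sketch of this version check in Python. The minimum major releases come from the list above; `meets_requirement` is a hypothetical helper, and obtaining the driver string from `nvidia-smi` is an assumption about your environment.

```python
# Hypothetical helper: check an NVIDIA driver version string against the
# minimum major release required for each image's CUDA version.
MIN_DRIVER = {"12.8": 570, "12.4": 550}  # minimums from the list above

def meets_requirement(driver_version: str, cuda_version: str) -> bool:
    """driver_version is a string such as '570.86.15', e.g. as reported
    by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`."""
    major = int(driver_version.split(".")[0])
    return major >= MIN_DRIVER[cuda_version]

# A host running driver 560.xx can serve the CUDA 12.4 (SGLang) image
# but not the CUDA 12.8 (vLLM) image:
print(meets_requirement("560.35.03", "12.4"))  # True
print(meets_requirement("560.35.03", "12.8"))  # False
```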

Quick start

The following example pulls the inference-nv-pytorch image with Docker and runs a test inference using the Qwen2.5-7B-Instruct model.

  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the Qwen2.5-7B-Instruct model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Start the vLLM server and test inference.

    1. Start the server.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Send a test request from the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",
          "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Please introduce deep learning."}
          ]}'

      For more information about working with vLLM, see vLLM.
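The curl request above can also be issued from Python. A minimal sketch using only the standard library; the endpoint and model path match the example above, and it assumes the server from the previous step is running locally.

```python
import json
import urllib.request

# Same request body as the curl example above
# (vLLM's OpenAI-compatible chat completions API).
payload = {
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."},
    ],
}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Requires the vLLM server from the previous step to be running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```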

Known issues

None.