
Container Compute Service: inference-nv-pytorch 25.05

Last Updated: May 16, 2025

This topic describes the release notes for inference-nv-pytorch 25.05.

Main features and bug fixes

Main features

  • The operating system of the vLLM image has been upgraded to Ubuntu 24.04, the Python version has been upgraded to 3.12, the CUDA version has been upgraded to 12.8, and the vLLM version has been upgraded to v0.8.5.post1.

  • The SGLang version in the SGLang image has been upgraded to v0.4.6.post4.

Bug fixes

None

Content

The 25.05 release provides two inference-nv-pytorch image variants.

vLLM image

  • Tag: 25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless

  • Scenarios: LLM inference

  • Framework: PyTorch

  • Requirements: NVIDIA driver release >= 570

  • System components: Ubuntu 24.04, Python 3.12, Torch 2.7.0+cu128, CUDA 12.8, NCCL 2.26.5, transformers 4.51.3, vllm 0.8.5.post2.dev0+g3015d5634.d20250513.cu128, ray 2.46.0, triton 3.3.0, xgrammar 0.1.18

SGLang image

  • Tag: 25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless

  • Scenarios: LLM inference

  • Framework: PyTorch

  • Requirements: NVIDIA driver release >= 550

  • System components: Ubuntu 22.04, Python 3.10, Torch 2.6.0+cu124, CUDA 12.4, NCCL 2.26.5, accelerate 1.6.0, transformers 4.51.1, triton 3.2.0, xgrammar 0.1.19, flashinfer-python 0.2.5, sglang 0.4.6.post4, sgl-kernel 0.1.2.post1

Assets

Public network images

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless

VPC image

  • acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

    {region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu.
    {image:tag} indicates the name and tag of the image.
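As an illustration, the placeholders can be filled in like this (the region and tag values below are examples taken from this page; substitute your own):

```shell
# Build the VPC registry address from the {region-id} and {image:tag} placeholders.
# REGION_ID and IMAGE_TAG are example values; replace them with your own.
REGION_ID=cn-beijing
IMAGE_TAG=inference-nv-pytorch:25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${VPC_IMAGE}"
# Then pull the image over the VPC:
# docker pull "${VPC_IMAGE}"
```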
Important

Currently, you can pull only images in the China (Beijing) region over a VPC.

Note

The 25.05-vllm0.8.5.post1-pytorch2.7-cu128-20250513-serverless and 25.05-sglang0.4.6.post4-pytorch2.6-cu124-20250513-serverless images are applicable to ACS services and Lingjun multi-tenant services, but are not applicable to Lingjun single-tenant services.

Driver requirements

For CUDA 12.8 images: NVIDIA driver release >= 570

For CUDA 12.4 images: NVIDIA driver release >= 550
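Before pulling an image, you can verify that the host driver meets the minimum. The following is a minimal sketch assuming nvidia-smi is available on the host; set MIN_DRIVER to 570 or 550 to match the image you plan to use:

```shell
# Compare the installed NVIDIA driver's major version against the image's minimum.
# Set MIN_DRIVER to 570 (CUDA 12.8 image) or 550 (CUDA 12.4 image).
MIN_DRIVER=570
driver_version=$({ nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || true; } | head -n1)
driver_major=${driver_version%%.*}   # major version: text before the first dot
if [ "${driver_major:-0}" -ge "${MIN_DRIVER}" ]; then
    echo "Driver ${driver_version} satisfies the >= ${MIN_DRIVER} requirement"
else
    echo "Driver ${driver_version:-unknown} does not satisfy >= ${MIN_DRIVER}" >&2
fi
```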

Quick Start

The following example uses only Docker to pull the inference-nv-pytorch image, and tests the inference service with the Qwen2.5-7B-Instruct model.

Note

To use the inference-nv-pytorch image in ACS, select the image on the Artifact Center page of the console when you create workloads, or specify the image in a YAML file. For more information, see the related ACS topics.

  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download an open source model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Run the following command to start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864  \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Test the inference conversation feature of vLLM.

    1. Start the vLLM server.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Test on the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",  
          "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Please introduce deep learning."}
          ]}'

      For more information about how to work with vLLM, see vLLM.
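The curl call above can also be scripted. The following Python sketch builds the same OpenAI-compatible request using only the standard library; the model path and port 8000 (vLLM's default) come from the steps above, and chat() assumes the server started in the previous step is running:

```python
import json
from urllib import request

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM listens on port 8000 by default

def build_chat_request(model: str, user_msg: str,
                       system_msg: str = "You are a friendly AI assistant.") -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
    }

def chat(model: str, user_msg: str) -> str:
    """Send a chat request to the running vLLM server and return the reply text."""
    body = json.dumps(build_chat_request(model, user_msg)).encode("utf-8")
    req = request.Request(API_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example payload; sending it with chat() requires the server from the previous step.
payload = build_chat_request("/mnt/Qwen2.5-7B-Instruct", "Please introduce deep learning.")
print(json.dumps(payload, indent=2))
```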

Known issues

None