inference-nv-pytorch 26.05 - Container Compute Service - Alibaba Cloud Documentation Center

This topic provides the release notes for inference-nv-pytorch 26.05.

Main features and bug fixes

Main features

Starting with version 26.05, support for CUDA 12.8 is discontinued. Only images for CUDA 13.0 are available.
- The CUDA 13.0 image supports both amd64 and aarch64 architectures.
In the vLLM image, Torch is upgraded to 2.11.0, and vLLM is upgraded to v0.20.2.
In the SGLang image, Torch is upgraded to 2.11.0, and SGLang is upgraded to v0.5.11.

Bug fixes

None.

Image name	inference-nv-pytorch
Tag	26.05-vllm0.20.2-pytorch2.11-cu130-20260513-serverless		26.05-sglang0.5.11-pytorch2.11-cu130-20260513-serverless
Supported architecture	amd64	aarch64	amd64	aarch64
Use case	large model inference	large model inference	large model inference	large model inference
Framework	pytorch	pytorch	pytorch	pytorch
Requirements	NVIDIA driver release 580 or later	NVIDIA driver release 580 or later	NVIDIA driver release 580 or later	NVIDIA driver release 580 or later
System components	Ubuntu 24.04, Python 3.12, Torch 2.11.0, CUDA 13.0.2, NCCL 2.29.7, diffusers 0.38.0, flash_attn 2.8.3, flash_attn_3 3.0.0, flashinfer-python 0.6.8, imageio-ffmpeg 0.6.0, ray 2.55.1, transformers 5.8.1, triton 3.6.0, torchvision 0.26.0, vllm 0.20.2, xfuser 0.4.5, xgrammar 0.2.0, ljperf 0.1.0+d0e4a408	Ubuntu 24.04, Python 3.12, Torch 2.11.0, CUDA 13.0.2, NCCL 2.29.7, flash_attn 2.8.4, flashinfer-python 0.6.8, transformers 4.57.6, vllm 0.20.2, triton 3.6.0, torchvision 0.26.0, xgrammar 0.1.33, ljperf 0.1.0+477686c5, ray 2.55.1	Ubuntu 24.04, Python 3.12, Torch 2.11.0, CUDA 13.0.2, NCCL 2.29.7, diffusers 0.38.0, decord 0.6.0, flash_attn 2.8.3, flash_attn_3 3.0.0, flashinfer-python 0.6.8, imageio-ffmpeg 0.6.0, ray 2.55.1, transformers 5.6.0, sgl-kernel 0.4.2, sglang 0.5.11, xgrammar 0.1.32, triton 3.6.0, torchao 0.17.0, torchaudio 2.11.0, torchvision 0.26.0, xfuser 0.4.5, ljperf 0.1.0+d0e4a408	Ubuntu 24.04, Python 3.12, Torch 2.11.0, CUDA 13.0.2, NCCL 2.29.7, diffusers 0.38.0, decord2 3.3.0, flash_attn 2.8.4, flashinfer-python 0.6.8, imageio-ffmpeg 0.6.0, ray 2.55.1, transformers 5.6.0, sgl-kernel 0.4.2, sglang 0.5.11, xgrammar 0.1.32, triton 3.6.0, torchao 0.17.0, torchaudio 2.11.0, torchvision 0.26.0, xfuser 0.4.5, ljperf 0.1.0+477686c5

Asset

Public image

CUDA 13.0 asset

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:26.05-vllm0.20.2-pytorch2.11-cu130-20260513-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:26.05-sglang0.5.11-pytorch2.11-cu130-20260513-serverless
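
The tags above appear to follow a regular layout: release, inference engine and version, PyTorch version, CUDA build, build date, and a serverless suffix. The following Python sketch splits a tag into those fields, which can help when scripts need to select an image by engine or CUDA build. The layout is inferred from the tags listed in this topic and is an assumption, not a documented format; `ImageTag` and `parse_tag` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple


class ImageTag(NamedTuple):
    release: str         # for example "26.05"
    engine: str          # "vllm" or "sglang"
    engine_version: str  # for example "0.20.2"
    pytorch: str         # for example "2.11"
    cuda: str            # for example "cu130"
    build_date: str      # for example "20260513"


# Assumption: this pattern is inferred from the tags shown in this topic.
_TAG_RE = re.compile(
    r"^(?P<release>\d+\.\d+)"
    r"-(?P<engine>vllm|sglang)(?P<engine_version>\d+(?:\.\d+)*)"
    r"-pytorch(?P<pytorch>\d+(?:\.\d+)*)"
    r"-(?P<cuda>cu\d+)"
    r"-(?P<build_date>\d{8})"
    r"-serverless$"
)


def parse_tag(tag: str) -> ImageTag:
    """Split an inference-nv-pytorch tag into its fields, or raise ValueError."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag layout: {tag!r}")
    return ImageTag(**match.groupdict())
```

Tags that do not match the inferred layout raise `ValueError` instead of being partially parsed, so a change in the naming scheme is noticed rather than silently misread.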

VPC image

To pull an ACS AI container image within a VPC, replace the public image URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

{region-id}: The ID of a region where ACS is available, such as cn-beijing or cn-wulanchabu.
{image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
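
The replacement rule above can be sketched in Python as follows. The function name `to_vpc_uri` is hypothetical; the two registry prefixes are taken from this topic, and the sketch assumes the public URI always starts with the prefix shown.

```python
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public image URI to the VPC registry of the given region."""
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError("not a public egslingjun image URI")
    # Everything after the prefix is the {image:tag} part, kept unchanged.
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"
```

For example, calling `to_vpc_uri` with the public vLLM image URI and `cn-beijing` yields a URI that starts with `acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/` and keeps the same image name and tag.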

Note

These images are intended for ACS and for EGS multi-tenant environments. Do not use them in EGS dedicated environments.

Driver requirements

CUDA 13.0: Requires NVIDIA driver release 580 or later
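
A host can be checked against this requirement before an image is pulled. The following Python sketch compares the leading number of a driver version string with 580. It assumes the release number is the first dot-separated field, as in the illustrative string `580.65.06`; the function names are hypothetical. `installed_driver_versions` relies on `nvidia-smi` being on the PATH and is not needed for the comparison itself.

```python
import subprocess

MIN_DRIVER_RELEASE = 580  # from the driver requirement stated in this topic


def driver_release(version: str) -> int:
    """Return the leading release number of a version such as '580.65.06'."""
    return int(version.strip().split(".")[0])


def meets_requirement(version: str, minimum: int = MIN_DRIVER_RELEASE) -> bool:
    """True if the driver release is at least the required minimum."""
    return driver_release(version) >= minimum


def installed_driver_versions() -> list[str]:
    """Query nvidia-smi for the driver version reported by each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a GPU host, `all(meets_requirement(v) for v in installed_driver_versions())` gives a single pass or fail answer.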

Quick start

This example shows how to pull the inference-nv-pytorch image using Docker and test the inference service with the Qwen2.5-7B-Instruct model.

Note

To use the inference-nv-pytorch image in ACS, select the image from the Artifacts Center on the Create Workload page in the console, or specify the image reference in a YAML file. For more information, see the topics about building model inference services with ACS GPU resources.

Pull the inference container image.

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the open-source model from ModelScope.

pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

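Run the following command to start the container in the background. The -d flag runs the container detached, so the command prints the container ID instead of opening a shell.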

docker run -d -t --network=host --privileged --init --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864  \
-v /mnt/:/mnt/ \
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Run an inference test for the vLLM conversational feature.

Start the server.

python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/Qwen2.5-7B-Instruct \
--trust-remote-code --disable-custom-all-reduce \
--tensor-parallel-size 1

Test on the client.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "/mnt/Qwen2.5-7B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Introduce deep learning."}
      ]
    }'
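
The same request can be sent from Python with only the standard library. This is a minimal sketch, not part of the image: `build_chat_request` and `send_chat` are hypothetical helper names, the default URL assumes the server is listening on localhost port 8000 as in the curl command, and the response is assumed to follow the OpenAI chat completions shape that the endpoint path implies.

```python
import json
import urllib.request


def build_chat_request(
    model: str,
    user_prompt: str,
    system_prompt: str = "You are a friendly AI assistant.",
) -> dict:
    """Build the JSON body used in the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response with the text in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat(build_chat_request("/mnt/Qwen2.5-7B-Instruct", "Introduce deep learning."))` returns the text of the model's reply.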

For more information about how to use vLLM, see the vLLM documentation.

Known issues

This image does not support the deepgpu-comfyui plugin.
