Container Compute Service: inference-nv-pytorch 26.04

Last Updated: Apr 30, 2026

These are the release notes for inference-nv-pytorch 26.04.

Main features and bug fixes

Main features

  • This release includes images for two CUDA versions: CUDA 12.8 and CUDA 13.0.

    • The CUDA 12.8 images are for the amd64 architecture only.

    • The CUDA 13.0 images are for the amd64 and aarch64 architectures.

  • In the vLLM images, Torch is upgraded to 2.10.0 and vLLM is upgraded to v0.19.0.

  • In the SGLang images, Torch is upgraded to 2.10.0 and SGLang is upgraded to v0.5.10.post1.

Bug fixes

No bug fixes are included in this release.

Contents

Image name: inference-nv-pytorch

Use case (all images): large model inference

Framework (all images): pytorch

Tags, supported architectures, and requirements

  • 26.04-vllm0.19.0-pytorch2.10-cu128-20260421-serverless

    • Supported architecture: amd64

    • Requirements: NVIDIA Driver release >= 570

  • 26.04-sglang0.5.10.post1-pytorch2.10-cu128-20260421-serverless

    • Supported architecture: amd64

    • Requirements: NVIDIA Driver release >= 570

  • 26.04-vllm0.19.0-pytorch2.10-cu130-20260415-serverless

    • Supported architectures: amd64, aarch64

    • Requirements: NVIDIA Driver release >= 580

  • 26.04-sglang0.5.10.post1-pytorch2.10-cu130-20260415-serverless

    • Supported architectures: amd64, aarch64

    • Requirements: NVIDIA Driver release >= 580

System components

vLLM image, CUDA 12.8 (amd64):

  • Ubuntu 24.04

  • Python 3.12

  • Torch 2.10.0

  • CUDA 12.8

  • NCCL 2.29.7

  • diffusers 0.37.1

  • flash_attn 2.8.4

  • flash_attn_3 3.0.0

  • flashinfer-python 0.6.6

  • imageio-ffmpeg 0.6.0

  • ray 2.55.0

  • transformers 4.57.6

  • triton 3.6.0

  • torchaudio 2.10.0

  • torchvision 0.25.0

  • vllm 0.19.0

  • xfuser 0.4.5

  • xgrammar 0.1.33

  • ljperf 0.1.0+477686c5

SGLang image, CUDA 12.8 (amd64):

  • Ubuntu 24.04

  • Python 3.12

  • Torch 2.10.0

  • CUDA 12.8

  • NCCL 2.29.7

  • torchaudio 2.10.0

  • torchvision 0.25.0

  • diffusers 0.37.1

  • decord 0.6.0

  • flash_attn 2.8.4

  • flash_attn_3 3.0.0

  • flashinfer-python 0.6.7

  • imageio-ffmpeg 0.6.0

  • ray 2.55.0

  • transformers 5.3.0

  • sgl-kernel 0.4.1

  • sglang 0.5.10.post1

  • xgrammar 0.1.32

  • triton 3.6.0

  • torchao 0.9.0

  • xfuser 0.4.5

  • ljperf 0.1.0+477686c5

vLLM image, CUDA 13.0 (amd64):

  • Ubuntu 24.04

  • Python 3.12

  • Torch 2.10.0+cu130

  • CUDA 13.0.2

  • NCCL 2.29.7

  • diffusers 0.37.1

  • flash_attn 2.8.4

  • flash_attn_3 3.0.0

  • flashinfer-python 0.6.6

  • imageio-ffmpeg 0.6.0

  • ray 2.54.1

  • transformers 4.57.6

  • triton 3.6.0

  • torchaudio 2.10.0+cu130

  • torchvision 0.25.0+cu130

  • vllm 0.19.0

  • xfuser 0.4.5

  • xgrammar 0.1.33

  • ljperf 0.1.0+d0e4a408

vLLM image, CUDA 13.0 (aarch64):

  • Ubuntu 24.04

  • Python 3.12

  • Torch 2.10.0+cu130

  • CUDA 13.0.2

  • NCCL 2.29.7

  • flash_attn 2.8.4

  • flashinfer-python 0.6.6

  • transformers 4.57.6

  • vllm 0.19.0

  • triton 3.6.0

  • torchaudio 2.10.0+cu130

  • torchvision 0.25.0+cu130

  • xgrammar 0.1.33

  • ljperf 0.1.0+477686c5

SGLang image, CUDA 13.0 (amd64):

  • Ubuntu 24.04

  • Python 3.12

  • Torch 2.10.0+cu130

  • CUDA 13.0.2

  • NCCL 2.29.7

  • diffusers 0.37.1

  • decord 0.6.0

  • flash_attn 2.8.4

  • flashinfer-python 0.6.7

  • imageio-ffmpeg 0.6.0

  • ray 2.55.0

  • transformers 5.3.0

  • sgl-kernel 0.4.1

  • sglang 0.5.10.post1

  • xgrammar 0.1.32

  • triton 3.6.0

  • torchao 0.9.0

  • torchaudio 2.10.0+cu130

  • torchvision 0.25.0+cu130

  • xfuser 0.4.5

  • ljperf 0.1.0+d0e4a408

SGLang image, CUDA 13.0 (aarch64):

  • Ubuntu 24.04

  • Python 3.12

  • Torch 2.10.0+cu130

  • CUDA 13.0.2

  • NCCL 2.29.7

  • diffusers 0.37.1

  • decord2 3.3.0

  • flash_attn 2.8.4

  • flashinfer-python 0.6.7

  • imageio-ffmpeg 0.6.0

  • ray 2.55.0

  • transformers 5.3.0

  • sgl-kernel 0.4.1

  • sglang 0.5.10.post1

  • xgrammar 0.1.32

  • triton 3.6.0

  • torchao 0.9.0

  • torchaudio 2.10.0+cu130

  • torchvision 0.25.0+cu130

  • xfuser 0.4.5

  • ljperf 0.1.0+477686c5
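
The component lists above can be compared against what is actually installed in a running container. The following is a minimal sketch, assuming the distribution names match the names listed above; the `EXPECTED_VLLM_CU128` subset and the helper names are hypothetical and chosen only for illustration. Local build suffixes such as `+cu130` are ignored in the comparison.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Hypothetical subset of the vLLM CUDA 12.8 component list, for illustration.
EXPECTED_VLLM_CU128: Dict[str, str] = {
    "torch": "2.10.0",
    "vllm": "0.19.0",
    "transformers": "4.57.6",
    "triton": "3.6.0",
    "xgrammar": "0.1.33",
}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatching package to (expected version, installed version)."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, want in expected.items():
        got = lookup(name)
        # Drop local build suffixes ("2.10.0+cu130" -> "2.10.0") before comparing.
        if got is None or got.split("+")[0] != want.split("+")[0]:
            mismatches[name] = (want, got)
    return mismatches
```

Inside a container, `find_mismatches(EXPECTED_VLLM_CU128)` returns an empty dictionary when every listed package is installed at the expected version.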

Asset

Public image

CUDA 12.8 assets

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:26.04-vllm0.19.0-pytorch2.10-cu128-20260421-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:26.04-sglang0.5.10.post1-pytorch2.10-cu128-20260421-serverless

CUDA 13.0 assets

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:26.04-vllm0.19.0-pytorch2.10-cu130-20260415-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:26.04-sglang0.5.10.post1-pytorch2.10-cu130-20260415-serverless
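
The tags in this release appear to follow one pattern: release, inference engine and version, PyTorch version, CUDA variant, build date, and a `serverless` suffix. The sketch below splits a tag into those fields. The pattern is inferred from the four tags listed in this topic and is not a documented format, and the `ImageTag` and `parse_tag` names are hypothetical.

```python
import re
from typing import NamedTuple


class ImageTag(NamedTuple):
    release: str
    engine: str
    engine_version: str
    pytorch: str
    cuda: str
    build_date: str


# Pattern inferred from the tags in this release (an assumption, not a spec).
_TAG_RE = re.compile(
    r"^(?P<release>\d+\.\d+)"
    r"-(?P<engine>vllm|sglang)(?P<engine_version>[0-9][0-9A-Za-z.]*)"
    r"-pytorch(?P<pytorch>\d+\.\d+)"
    r"-cu(?P<cuda>\d+)"
    r"-(?P<build_date>\d{8})"
    r"-serverless$"
)


def parse_tag(tag: str) -> ImageTag:
    """Split an inference-nv-pytorch tag into its component fields."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    digits = match["cuda"]
    cuda = f"{digits[:-1]}.{digits[-1]}"  # "128" -> "12.8", "130" -> "13.0"
    return ImageTag(
        release=match["release"],
        engine=match["engine"],
        engine_version=match["engine_version"],
        pytorch=match["pytorch"],
        cuda=cuda,
        build_date=match["build_date"],
    )
```

For example, `parse_tag("26.04-vllm0.19.0-pytorch2.10-cu130-20260415-serverless").cuda` gives the CUDA version as a dotted string, which can then be used to choose the matching driver requirement.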

VPC image

To quickly pull ACS AI container images from within a VPC, replace the asset URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

  • {region-id}: The ID of a region where ACS is available, for example, cn-beijing or cn-wulanchabu.

  • {image:tag}: The name and tag of the AI container image. For example: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
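
The substitution described above can be sketched as a small helper. This is an illustration only: the `to_vpc_uri` function name is hypothetical, and the code does not check whether the given region ID is an ACS available region.

```python
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public asset URI into the VPC form for the given region."""
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not a public asset URI: {public_uri!r}")
    # Everything after the prefix is the {image:tag} part and is kept as-is.
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"
```

Calling `to_vpc_uri` with one of the public asset URIs listed in this topic and `"cn-beijing"` yields a URI that starts with `acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/`.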

Note

These images are intended for ACS and for EGS multi-tenant environments. Do not use them in EGS dedicated environments.

Driver requirements

  • CUDA 12.8: NVIDIA Driver release >= 570

  • CUDA 13.0: NVIDIA Driver release >= 580
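
The driver requirements can be checked before pulling an image. The sketch below is a rough illustration, assuming that the driver release is the leading number of the driver version string and that the CUDA variant appears in the tag as `cu128` or `cu130`. The function names are hypothetical, and the GU8TF case listed under Known issues is not modeled.

```python
# Minimum driver release per CUDA variant, taken from the list above.
MIN_DRIVER_RELEASE = {"cu128": 570, "cu130": 580}


def cuda_variant(tag: str) -> str:
    """Return the CUDA variant segment ("cu128" or "cu130") found in a tag."""
    for part in tag.split("-"):
        if part in MIN_DRIVER_RELEASE:
            return part
    raise ValueError(f"no known CUDA variant in tag: {tag!r}")


def driver_meets_requirement(tag: str, driver_version: str) -> bool:
    """Check a driver version string such as "570.124.06" against a tag."""
    release = int(driver_version.split(".")[0])
    return release >= MIN_DRIVER_RELEASE[cuda_variant(tag)]
```

On a GPU host, the driver version string can typically be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `driver_meets_requirement`.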

Quick start

This example shows how to pull the inference-nv-pytorch image using Docker and test the inference service with the Qwen2.5-7B-Instruct model.

Note

To use the inference-nv-pytorch image in ACS, select the image from the Artifacts Center on the Create Workload page in the console, or specify the image reference in a YAML file. For more information, see the topics about building model inference services with ACS GPU resources.

  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the open-source model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
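  3. Run the following command to start the container in the background. The -d flag detaches the container, so open a shell in it afterward (for example, with docker exec) before you run the next step.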

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864  \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Test chat inference with vLLM.

    1. Start the vLLM server.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Send a test request from a client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",  
          "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Introduce deep learning."}
          ]}'

      For more information about how to use vLLM, see vLLM.
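
The client test above can also be run from Python. The sketch below mirrors the curl request using only the standard library. It assumes the server from the previous step is listening on localhost:8000 and that the response follows the OpenAI-compatible chat completions format. The `build_chat_request` and `chat` names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    model: str,
    user_text: str,
    base_url: str = "http://localhost:8000",
    system_text: str = "You are a friendly AI assistant.",
) -> urllib.request.Request:
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_text: str, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(model, user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("/mnt/Qwen2.5-7B-Instruct", "Introduce deep learning."))` prints the model's reply.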

Known issues

  • The current image does not support the deepgpu-comfyui plug-in.

  • Driver version 550.90.07 for the ACS GU8TF instance type supports images that use CUDA 13.0.