inference-nv-pytorch 25.12 - Container Compute Service

This topic provides the release notes for inference-nv-pytorch 25.12.

Main features and bug fixes

Main features

Images are provided for two CUDA versions: CUDA 12.8 and CUDA 13.0.
- The CUDA 12.8 images support only the amd64 architecture.
- The CUDA 13.0 images support the amd64 and aarch64 architectures.
PyTorch is upgraded to 2.9.0 for the vLLM images and to 2.9.1 for the SGLang images.
For the CUDA 12.8 images, deepgpu-comfyui is upgraded to 1.3.2 and the deepgpu-torch optimization component is upgraded to 0.1.12+torch2.9.0cu128.
For both the CUDA 12.8 and CUDA 13.0 images, vLLM is upgraded to v0.12.0 and SGLang is upgraded to v0.5.6.post2.

Bug fixes

None

Contents

Image name: inference-nv-pytorch
Application scenario (all images): Large model inference
Framework (all images): pytorch

Tag: 25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless
Supported architecture: amd64
Requirements: NVIDIA Driver release >= 570
System components: Ubuntu 24.04, Python 3.12, Torch 2.9.0+cu128, CUDA 12.8, diffusers 0.36.0, deepgpu-comfyui 1.3.2, deepgpu-torch 0.1.12+torch2.9.0cu128, flash_attn 2.8.3, flashinfer-python 0.5.3, imageio 2.37.2, imageio-ffmpeg 0.6.0, ray 2.52.1, transformers 4.57.3, triton 3.5.0, torchaudio 2.9.0+cu128, torchvision 0.24.0+cu128, vllm 0.12.0, xfuser 0.4.5, xgrammar 0.1.27, ljperf 0.1.0+477686c5

Tag: 25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless
Supported architecture: amd64
Requirements: NVIDIA Driver release >= 570
System components: Ubuntu 24.04, Python 3.12, Torch 2.9.1+cu128, CUDA 12.8, diffusers 0.36.0, decord 0.6.0, decord2 2.0.0, deepgpu-comfyui 1.3.2, deepgpu-torch 0.1.12+torch2.9.0cu128, flash_attn 2.8.3, flash_mla 1.0.0+1408756, flashinfer-python 0.5.3, imageio 2.37.2, imageio-ffmpeg 0.6.0, ray 2.52.1, transformers 4.57.1, sgl-kernel 0.3.19, sglang 0.5.6.post2, xgrammar 0.1.27, triton 3.5.1, torchao 0.9.0, torchaudio 2.9.1, torchvision 0.24.1, xfuser 0.4.5, ljperf 0.1.0+477686c5

Tag: 25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless
Supported architecture: amd64
Requirements: NVIDIA Driver release >= 580
System components: Ubuntu 24.04, Python 3.12, Torch 2.9.0+cu130, CUDA 13.0.2, diffusers 0.36.0, flash_attn 2.8.3, flashinfer-python 0.5.3, imageio 2.37.2, imageio-ffmpeg 0.6.0, ray 2.52.1, transformers 4.57.3, triton 3.5.0, torchaudio 2.9.0+cu130, torchvision 0.24.0+cu130, vllm 0.12.0, xfuser 0.4.5, xgrammar 0.1.27, ljperf 0.1.0+d0e4a408

Tag: 25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless
Supported architecture: aarch64
Requirements: NVIDIA Driver release >= 580
System components: Ubuntu 24.04, Python 3.12, Torch 2.9.0+cu130, CUDA 13.0.2, diffusers 0.36.0, flash_attn 2.8.3, flashinfer-python 0.5.3, transformers 4.57.1, ray 2.53.0, vllm 0.12.0, triton 3.5.0, torchaudio 2.9.0, torchvision 0.24.0, xfuser 0.4.5, xgrammar 0.1.27

Tag: 25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless
Supported architecture: amd64
Requirements: NVIDIA Driver release >= 580
System components: Ubuntu 24.04, Python 3.12, Torch 2.9.1+cu130, CUDA 13.0.2, diffusers 0.36.0, decord 0.6.0, decord2 2.0.0, flash_attn 2.8.3, flashinfer-python 0.5.3, imageio 2.37.2, imageio-ffmpeg 0.6.0, ray 2.52.1, transformers 4.57.3, sgl-kernel 0.3.19, sglang 0.5.6.post2, xgrammar 0.1.27, triton 3.5.1, torchao 0.9.0, torchaudio 2.9.1, torchvision 0.24.1+cu130, xfuser 0.4.5, ljperf 0.1.0+d0e4a408

Tag: 25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless
Supported architecture: aarch64
Requirements: NVIDIA Driver release >= 580
System components: Ubuntu 24.04, Python 3.12, Torch 2.9.1+cu130, CUDA 13.0.2, diffusers 0.36.0, decord2 2.0.0, flash_attn 2.8.3, flashinfer-python 0.5.3, imageio 2.37.2, imageio-ffmpeg 0.6.0, transformers 4.57.1, sgl-kernel 0.3.19, sglang 0.5.6.post2, xgrammar 0.1.27, triton 3.5.1, torchao 0.9.0, torchaudio 2.9.1, torchvision 0.24.1, xfuser 0.4.5
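
The component versions listed above can be spot-checked from inside a running container. The following is a minimal sketch, not an official verification tool: the helper name find_mismatches and the EXPECTED subset are illustrative assumptions, and the subset shown is taken from the vLLM CUDA 12.8 image only.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Illustrative subset of the vLLM CUDA 12.8 component list (assumption:
# the distribution names match the names printed in the release notes).
EXPECTED = {
    "vllm": "0.12.0",
    "transformers": "4.57.3",
    "triton": "3.5.0",
    "ray": "2.52.1",
}


def find_mismatches(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs to (expected, installed).

    The installed value is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling find_mismatches(EXPECTED) inside the container should return an empty dictionary if the installed packages match the listed versions.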

Assets

Public images

CUDA 12.8 assets

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless

CUDA 13.0 assets

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless
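
The four public image references follow one naming pattern, so they can be assembled from the engine and the CUDA version. The sketch below is a hypothetical helper (image_reference is not part of any ACS tooling); the tag fragments and the architecture support are copied from these release notes.

```python
REGISTRY = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun"

# Tag fragments and supported architectures, copied from these release notes.
ENGINE_TAGS = {"vllm": "vllm0.12.0", "sglang": "sglang0.5.6.post2"}
CUDA_TAGS = {"12.8": "cu128", "13.0": "cu130"}
SUPPORTED_ARCHS = {"12.8": {"amd64"}, "13.0": {"amd64", "aarch64"}}


def image_reference(engine: str, cuda: str, arch: str = "amd64") -> str:
    """Return the public 25.12 image reference for an engine and CUDA version."""
    if engine not in ENGINE_TAGS:
        raise ValueError(f"unknown engine: {engine!r}")
    if cuda not in CUDA_TAGS:
        raise ValueError(f"unknown CUDA version: {cuda!r}")
    if arch not in SUPPORTED_ARCHS[cuda]:
        raise ValueError(f"CUDA {cuda} images do not support {arch}")
    tag = (
        f"25.12-{ENGINE_TAGS[engine]}-pytorch2.9-"
        f"{CUDA_TAGS[cuda]}-20251215-serverless"
    )
    return f"{REGISTRY}/inference-nv-pytorch:{tag}"
```

The arch argument only validates the request; the image reference itself is the same for amd64 and aarch64.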

VPC images

To speed up pulling ACS AI container images from within a VPC, replace the asset URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

{region-id}: The ID of the region in which your ACS product is available. Examples: cn-beijing and cn-wulanchabu.
{image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
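
The URI substitution described above can be sketched as a small function. The name to_vpc_uri is a hypothetical helper introduced for illustration; only the two URI patterns come from this topic.

```python
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public asset URI into its VPC form for the given region."""
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not a public asset URI: {public_uri!r}")
    # Everything after the prefix is the {image:tag} part, kept unchanged.
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"
```

For example, passing a public URI together with cn-beijing yields a URI that starts with acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/.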

Note

These images are suitable for ACS products and multi-tenant Lingjun products. They are not suitable for single-tenant Lingjun products, so do not use them in single-tenant Lingjun scenarios.

Driver requirements

CUDA 12.8: NVIDIA Driver release >= 570
CUDA 13.0: NVIDIA Driver release >= 580
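
A host can be checked against these minimums before an image is pulled. The sketch below assumes that "release >= 570" refers to the major number of the driver version string (for example, 570 in 570.86.15); the function names are hypothetical, and installed_driver_version requires nvidia-smi to be on the PATH.

```python
import subprocess

# Minimum driver release per CUDA version, from the requirements above.
MIN_DRIVER_RELEASE = {"12.8": 570, "13.0": 580}


def driver_is_sufficient(driver_version: str, cuda: str) -> bool:
    """Compare the major part of a driver version with the required release."""
    major = int(driver_version.strip().split(".")[0])
    return major >= MIN_DRIVER_RELEASE[cuda]


def installed_driver_version() -> str:
    """Read the driver version of the first GPU reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a GPU host, driver_is_sufficient(installed_driver_version(), "13.0") indicates whether the CUDA 13.0 images can be used.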

Quick start

The following example shows how to pull the inference-nv-pytorch image with Docker and test the inference service with the Qwen2.5-7B-Instruct model.

Note

To use the inference-nv-pytorch image in ACS, select it on the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file. For more information, see the ACS topics about building model inference services with ACS GPU computing power.

Pull the inference container image.

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the open source model from ModelScope.

pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Run the following command to start the container. The -d flag runs it in the background; to enter the container afterward, open a shell in it with docker exec.

docker run -d -t --network=host --privileged --init --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864  \
-v /mnt/:/mnt/ \
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Run an inference test to verify the vLLM chat inference feature.

Start the server.

python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/Qwen2.5-7B-Instruct \
--trust-remote-code --disable-custom-all-reduce \
--tensor-parallel-size 1
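
Loading the model can take a while, so it may help to wait until the server answers before sending requests. The following sketch polls the /health endpoint of the vLLM OpenAI-compatible server; the function name wait_until_ready and the timeout values are illustrative assumptions, and port 8000 matches the client example in this topic.

```python
import time
import urllib.request


def wait_until_ready(
    url: str = "http://localhost:8000/health",
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    request_timeout_s: float = 5.0,
) -> bool:
    """Poll url until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors all mean
            # "not ready yet"; keep polling until the deadline.
            pass
        time.sleep(interval_s)
    return False
```

The function returns False instead of raising an exception, so a caller can decide whether to retry or to inspect the server logs.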

Test from the client.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": "Tell me about deep learning."}
        ]
    }'
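
The same request can be sent from Python with only the standard library. This is a minimal sketch under the assumption that the server from the previous step listens on localhost:8000; build_chat_request and chat are hypothetical helper names, and the response is read using the OpenAI-style chat completion layout.

```python
import json
import urllib.request


def build_chat_request(
    model: str,
    user_message: str,
    system_message: str = "You are a friendly AI assistant.",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST request that mirrors the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat(build_chat_request("/mnt/Qwen2.5-7B-Instruct", "Tell me about deep learning.")) returns the reply of the model as a string.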

For more information about how to use vLLM, see the vLLM documentation.

Known issues

The deepgpu-comfyui plug-in, which accelerates video generation for Wanx models, currently supports only the GN8IS, G49E, and G59 instance types.