inference-nv-pytorch 25.03 - Container Compute Service

This topic describes the release notes for inference-nv-pytorch 25.03.

Main features and bug fixes

Main features

PyTorch in the vLLM image is updated to 2.6.0.
vLLM is updated to v0.8.2.
SGLang is updated to v0.4.4.post1.
ACCL-N is updated to 2.23.4.12, which provides new features and bug fixes.

Bug fixes

None.

Contents

	inference-nv-pytorch	inference-nv-pytorch
Tag	25.03-vllm0.8.2-pytorch2.6-cu124-20250327-serverless	25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless
Scenario	LLM inference	LLM inference
Framework	PyTorch	PyTorch
Requirements	NVIDIA driver release >= 550	NVIDIA driver release >= 550
System components	Ubuntu 22.04, Python 3.10, Torch 2.6.0, CUDA 12.4, ACCL-N 2.23.4.12, accelerate 1.5.2, diffusers 0.32.2, flash_attn 2.7.4.post1, transformer 4.50.1, vllm 0.8.2, ray 2.44.0, triton 3.2.0	Ubuntu 22.04, Python 3.10, Torch 2.5.1, CUDA 12.4, ACCL-N 2.23.4.12, accelerate 1.5.2, diffusers 0.32.2, flash_attn 2.7.4.post1, transformer 4.48.3, vllm 0.7.2, ray 2.44.0, triton 3.2.0, flashinfer-python 0.2.3, sglang 0.4.4.post1, sgl-kernel 0.0.5

Assets

Public images

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} is the ID of the region in which ACS is activated, such as cn-beijing or cn-wulanchabu.
{image:tag} is the name and tag of the image.
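
The address template above can be filled in with a small shell helper. This is only a sketch: the function name vpc_image_ref is hypothetical and is not part of any Alibaba Cloud tool. It simply substitutes the two placeholders.

```shell
# Hypothetical helper: fill in the VPC address template
#   acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
# $1 = region ID (for example cn-beijing), $2 = image name and tag
vpc_image_ref() {
    local region_id="$1"
    local image_tag="$2"
    printf 'acs-registry-vpc.%s.cr.aliyuncs.com/egslingjun/%s\n' "$region_id" "$image_tag"
}
```

For example, `docker pull "$(vpc_image_ref cn-beijing "<image>:<tag>")"` would pull the chosen image over the VPC endpoint, with `<image>:<tag>` replaced by one of the image names and tags listed in this topic.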

Important

Currently, you can pull images over a VPC only in the China (Beijing) region.

Note

The inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless and inference-nv-pytorch:25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless images are suitable for ACS products and Lingjun multi-tenant products. They are not suitable for Lingjun single-tenant products.

Driver requirements

NVIDIA driver release >= 550
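
One way to check this requirement on a GPU host is to compare the major version reported by nvidia-smi against 550. The sketch below assumes that the driver version string starts with the release number (for example 550.54.15). The function name driver_major_ok is hypothetical.

```shell
# Hypothetical helper: succeed when the major part of a driver
# version string (for example "550.54.15") is at least 550.
driver_major_ok() {
    local version="$1"
    local major="${version%%.*}"   # keep everything before the first dot
    [ "$major" -ge 550 ]
}

# Usage on a host with the NVIDIA driver installed (not run here):
#   driver_major_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" \
#       && echo "driver OK" || echo "driver too old"
```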

Quick start

The following example uses Docker to pull the inference-nv-pytorch image and tests the inference service with the Qwen2.5-7B-Instruct model.

Note

To use the inference-nv-pytorch image in ACS, select the image on the Artifact Center page of the console where you create workloads, or specify the image in a YAML file.

Pull the inference container image.

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the open-source model in ModelScope format.

pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Run the following command to start the container in the background.

docker run -d -t --network=host --privileged --init --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864  \
-v /mnt/:/mnt/ \
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Run an inference test to verify the vLLM chat feature.

Start the server.

python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/Qwen2.5-7B-Instruct \
--trust-remote-code --disable-custom-all-reduce \
--tensor-parallel-size 1
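
Loading the model can take a while, so the server may not accept requests immediately. A minimal sketch of a readiness check is shown below. It assumes the server listens on the default port 8000 and polls the /v1/models endpoint of the OpenAI-compatible API. The function name wait_for_server, the retry count, and the interval are illustrative choices, not values from the image.

```shell
# Hypothetical helper: poll an HTTP endpoint until it answers successfully.
# $1 = URL to poll, $2 = maximum number of attempts, $3 = seconds between attempts
wait_for_server() {
    local url="${1:-http://localhost:8000/v1/models}"
    local retries="${2:-60}"
    local interval="${3:-5}"
    local i=0
    while [ "$i" -lt "$retries" ]; do
        # -s: silent, -f: treat HTTP errors as failures
        if curl -sf "$url" > /dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

Running `wait_for_server && echo "server is ready"` in a second terminal returns once the endpoint responds, or fails after the configured number of attempts.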

Test on the client.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."}
    ]}'
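
The server replies with a JSON document in the OpenAI chat completions format, where the generated text is under choices[0].message.content. To print only that text, the response can be piped through a small filter. The sketch below uses the python3 interpreter for JSON parsing. The function name extract_reply is hypothetical.

```shell
# Hypothetical helper: read a chat completions response on stdin
# and print only the assistant's reply text.
extract_reply() {
    python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage (not run here): append "| extract_reply" to the curl command above.
```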

For more information about how to work with vLLM, see the vLLM documentation.

Known issues

None.