This release upgrades vLLM to v0.10.0 and SGLang to v0.4.10.post2. No bug fixes are included.
What's new
- vLLM upgraded to v0.10.0
- SGLang upgraded to v0.4.10.post2
Image variants
Two image variants are available: one built around vLLM and one around SGLang.
| Field | vLLM variant | SGLang variant |
|---|---|---|
| Tag | 25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless | 25.08-sglang0.4.10.post2-pytorch2.7-cu128-20250808-serverless |
| Use case | Large model inference | Large model inference |
| Framework | PyTorch | PyTorch |
| Minimum driver | NVIDIA Driver >= 570 | NVIDIA Driver >= 570 |
System components
vLLM image
| Package type | Package | Version |
|---|---|---|
| OS | Ubuntu | 24.04 |
| Runtime | Python | 3.12 |
| Runtime | Torch | 2.7.1+cu128 |
| Runtime | CUDA | 12.8 |
| Library | NCCL | 2.27.5 |
| Library | flash_attn | 2.8.2 |
| Library | triton | 3.3.1 |
| Library | xformers | 0.0.31 |
| Library | xfuser | 0.4.4 |
| Library | xgrammar | 0.1.21 |
| Library | ray | 2.48.0 |
| Library | transformers | 4.55.0 |
| Library | diffusers | 0.34.0 |
| Library | imageio | 2.37.0 |
| Library | imageio-ffmpeg | 0.6.0 |
| Library | vllm | 0.10.0 |
| DeepGPU | deepgpu-torch | 0.0.24+torch2.7.0cu128 |
| DeepGPU | deepgpu-comfyui | 1.1.7 |
SGLang image
| Package type | Package | Version |
|---|---|---|
| OS | Ubuntu | 24.04 |
| Runtime | Python | 3.12 |
| Runtime | Torch | 2.7.1+cu128 |
| Runtime | CUDA | 12.8 |
| Library | NCCL | 2.27.5 |
| Library | flash_attn | 2.8.2 |
| Library | flash_mla | 1.0.0+41b611f |
| Library | flashinfer-python | 0.2.9rc2 |
| Library | triton | 3.3.1 |
| Library | xgrammar | 0.1.22 |
| Library | torchao | 0.9.0 |
| Library | transformers | 4.54.1 |
| Library | diffusers | 0.34.0 |
| Library | imageio | 2.37.0 |
| Library | imageio-ffmpeg | 0.6.0 |
| Library | sgl-kernel | 0.2.8 |
| Library | sglang | 0.4.10.post2 |
Image registry
Internet images
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.08-sglang0.4.10.post2-pytorch2.7-cu128-20250808-serverless
VPC images
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders with your actual values:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | Region where your ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | Image name and tag | inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless |
VPC image pulls are currently supported only in the China (Beijing) region.
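As a worked example of the substitution described above, the following sketch assembles a pullable VPC image reference. The region and tag are the sample values from the table; adjust them to your own environment.

```shell
# Example only: substitute the placeholders from the table above to form a
# pullable VPC image reference. Values below are the documented samples.
region_id="cn-beijing"
image_tag="inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless"
image_ref="acs-registry-vpc.${region_id}.cr.aliyuncs.com/egslingjun/${image_tag}"
echo "$image_ref"
# docker pull "$image_ref"   # run this from a node inside the VPC
```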
Both images are compatible with ACS products and Lingjun multi-tenant products. They are not compatible with Lingjun single-tenant products.
Driver requirements
NVIDIA Driver release >= 570
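A quick way to verify the requirement above is to compare the major driver version reported by nvidia-smi against 570. This is a minimal sketch, assuming nvidia-smi is on PATH; it does nothing on machines without an NVIDIA driver.

```shell
# check_driver succeeds when the major driver version is >= 570.
check_driver() {
  major=${1%%.*}
  [ "$major" -ge 570 ] 2>/dev/null
}

# Query the live driver; skipped silently on machines without nvidia-smi.
if v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1); then
  if check_driver "$v"; then
    echo "Driver $v meets the >= 570 requirement"
  else
    echo "Driver $v is too old; upgrade to 570 or later" >&2
  fi
fi
```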
Quick start
The following example pulls the vLLM image and runs an inference test using the Qwen2.5-7B-Instruct model.
To use inference-nv-pytorch images in ACS, select the image on the Artifacts page when creating a workload in the console, or specify the image reference in a YAML file. For step-by-step guides, see the ACS documentation.
Pull the container image.
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the Qwen2.5-7B-Instruct model from ModelScope.

pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Start the container.

docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Start the vLLM inference service inside the container.

python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1

Send a test request from the client.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Tell me about deep learning."}
    ]
  }'

For more information about vLLM, see the vLLM documentation.
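The test request can also be scripted. This sketch builds the same chat-completions payload in a variable before posting it; note that the model field must match the path passed to --model when the server was started.

```shell
# Build the chat-completions payload from the quick-start example.
# The "model" value must equal the --model path used to launch the server.
model="/mnt/Qwen2.5-7B-Instruct"
payload=$(cat <<EOF
{
  "model": "$model",
  "messages": [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Tell me about deep learning."}
  ]
}
EOF
)
echo "$payload"
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```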
Known issues
The deepgpu-comfyui plug-in for Wanx model video generation currently supports only the GN8IS and G49E GPU instance types.