This release upgrades vLLM to v0.8.5 and SGLang to v0.4.6.post1, both with PyTorch 2.6.0 and Qwen3 model support.
What's new
Updated
- vLLM upgraded to v0.8.5
- SGLang image PyTorch version upgraded to 2.6.0
- SGLang upgraded to v0.4.6.post1
Added
- Qwen3 model support for both vLLM and SGLang images
Bug fixes
None
Image details
Both images target LLM inference on PyTorch with CUDA 12.4 and require NVIDIA Driver release >= 550.
| | vLLM image | SGLang image |
|---|---|---|
| Tag | 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless | 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless |
| Scenario | LLM inference | LLM inference |
| Frame | pytorch | pytorch |
| Driver requirement | NVIDIA Driver release >= 550 | NVIDIA Driver release >= 550 |
| Ubuntu | 22.04 | 22.04 |
| Python | 3.10 | 3.10 |
| Torch | 2.6.0+cu124 | 2.6.0+cu124 |
| CUDA | 12.4 | 12.4 |
| ACCL-N | 2.23.4.12 | 2.23.4.12 |
| accelerate | 1.6.0 | — |
| diffusers | 0.33.1 | — |
| flash_attn | 2.7.4.post1 | — |
| flashinfer-python | — | 0.2.3 |
| transformers | 4.51.3 | 4.51.1 |
| vllm | 0.8.5 | — |
| sglang | — | 0.4.6.post1 |
| sgl-kernel | — | 0.1.0 |
| ray | 2.43.0 | — |
| triton | 3.2.0 | 3.2.0 |
| xgrammar | 0.1.18 | 0.1.17 |
Image registry
Public network
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless
VPC
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace {region-id} with the region where your ACS is activated, such as cn-beijing or cn-wulanchabu. Replace {image:tag} with the image name and tag.
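The substitutions above can be sketched in shell. The values assigned to REGION_ID and IMAGE_TAG below are illustrative (the vLLM image tag from this release and the cn-beijing region); substitute your own:

```shell
# Illustrative values -- replace with your own region and image tag.
REGION_ID="cn-beijing"
IMAGE_TAG="inference-nv-pytorch:25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless"

VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${VPC_IMAGE}"
# docker pull "${VPC_IMAGE}"   # run from a host inside the VPC
```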
VPC image pulls are currently supported only in the China (Beijing) region.
The 25.04-vllm0.8.5-pytorch2.6-cu124-20250430-serverless and 25.04-sglang0.4.6.post1-pytorch2.6-cu124-20250430-serverless images are compatible with the ACS product form and the multi-tenant product form of Lingjun. They are not compatible with the single-tenant product form of Lingjun.
Quick start
The following example pulls the inference-nv-pytorch image and runs an inference test using the Qwen2.5-7B-Instruct model with vLLM.
To use the inference-nv-pytorch image in ACS, select the image on the artifact center page of the console when you create a workload, or specify the image in a YAML file.
1. Pull the inference container image.

```shell
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```

2. Download an open-source model in ModelScope format.

```shell
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
```

3. Start the container.

```shell
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```

4. Start the vLLM API server inside the container.

```shell
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1
```

5. Send a test request from the client.

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Please introduce deep learning."}
    ]
  }'
```

For more information about working with vLLM, see the vLLM documentation.
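The same test request can also be issued from Python. This sketch uses only the standard library and assumes the vLLM server from the previous step is listening on localhost:8000; the actual network call is left commented out so the snippet is safe to run standalone:

```python
import json
import urllib.request

# OpenAI-compatible chat-completions payload, mirroring the curl example above.
payload = {
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."},
    ],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the vLLM API server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```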
Known issues
None