This release updates vLLM to v0.8.2 and PyTorch to 2.6.0 in the vLLM image, and updates SGLang to v0.4.4.post1.
What's new
- PyTorch in the vLLM image is updated to 2.6.0.
- vLLM is updated to v0.8.2.
- SGLang is updated to v0.4.4.post1.
- ACCL-N is updated to 2.23.4.12, with new features and bug fixes.
Bug fixes
None.
Image content
| | inference-nv-pytorch (vLLM variant) | inference-nv-pytorch (SGLang variant) |
|---|---|---|
| Tag | 25.03-vllm0.8.2-pytorch2.6-cu124-20250327-serverless | 25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless |
| Scenarios | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA driver release >= 550 | NVIDIA driver release >= 550 |
| System components | Ubuntu 22.04, Python 3.10, Torch 2.6.0, CUDA 12.4, ACCL-N 2.23.4.12, accelerate 1.5.2, diffusers 0.32.2, flash_attn 2.7.4.post1, transformers 4.50.1, vllm 0.8.2, ray 2.44.0, triton 3.2.0 | Ubuntu 22.04, Python 3.10, Torch 2.5.1, CUDA 12.4, ACCL-N 2.23.4.12, accelerate 1.5.2, diffusers 0.32.2, flash_attn 2.7.4.post1, transformers 4.48.3, vllm 0.7.2, ray 2.44.0, triton 3.2.0, flashinfer-python 0.2.3, sglang 0.4.4.post1, sgl-kernel 0.0.5 |
Assets
Public images
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless
VPC images
Pull VPC images using the following pattern:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace {region-id} with the ID of the region where your Apsara Container Service (ACS) is activated (for example, cn-beijing or cn-wulanchabu), and replace {image:tag} with the name and tag of the image.
Currently, you can pull VPC images only in the China (Beijing) region.
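The substitution described above can be sketched in a few lines. The helper below is illustrative only (it is not part of any Alibaba Cloud SDK); it assumes the `egslingjun` namespace shown in the pattern.

```python
# Build a VPC image reference from a region ID and an image name:tag.
# vpc_image_ref is an illustrative helper, not an official SDK function.
def vpc_image_ref(region_id: str, image_and_tag: str) -> str:
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"

print(vpc_image_ref(
    "cn-beijing",
    "inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless",
))
```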
The inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless and inference-nv-pytorch:25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless images are compatible with ACS products and Lingjun multi-tenant products. They are not compatible with Lingjun single-tenant products.
Driver requirements
NVIDIA driver release >= 550.
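One way to confirm the requirement is to compare the installed driver's major version against 550. The parsing helper below is a sketch, assuming you pass it the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
# Check that an NVIDIA driver version string meets the minimum (>= 550).
# driver_ok is an illustrative helper; feed it the output of
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
def driver_ok(version: str, minimum: int = 550) -> bool:
    major = int(version.strip().split(".")[0])
    return major >= minimum

print(driver_ok("550.54.15"))   # meets the requirement
print(driver_ok("535.161.08"))  # too old for these images
```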
Quick start
The following example pulls the inference-nv-pytorch image using Docker and runs an inference test with the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, select the image from the artifact center page when creating workloads in the console, or specify it in a YAML file. For step-by-step guidance, see:
Use ACS GPU compute power to deploy a model inference service from a DeepSeek distilled model
Use ACS GPU compute power to deploy a model inference service based on the DeepSeek full version
Use ACS GPU compute power to deploy a distributed model inference service based on the DeepSeek full version
- Pull the inference container image.
  docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
- Download an open-source model in the ModelScope format.
  pip install modelscope
  cd /mnt
  modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
- Start a container using the pulled image.
  docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
- Run an inference test using vLLM.
  - Start the vLLM API server.
    python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
  - Send a request from the client.
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Please introduce deep learning."}
        ]
      }'

  For more information about vLLM, see the vLLM documentation.
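The same request can also be issued from Python using only the standard library. This sketch builds the payload from the curl example above; the actual call is commented out because it assumes the vLLM server from the previous step is already listening on localhost:8000.

```python
import json
import urllib.request

# Build the same chat-completions payload that the curl example sends.
payload = {
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."},
    ],
}

body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Uncomment once the vLLM API server is running:
# with urllib.request.urlopen(req) as resp:
#     answer = json.loads(resp.read())
#     print(answer["choices"][0]["message"]["content"])
print(req.full_url)
```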
Known issues
None.