inference-nv-pytorch 25.02 updates vLLM to v0.7.2, adds SGLang v0.4.3.post2 support, and enables DeepSeek model inference.
What's new
- vLLM updated to v0.7.2
- SGLang v0.4.3.post2 supported
- DeepSeek models supported: run DeepSeek model inference directly in the container.
Bug fixes
None.
System components
Requirements
| Component | Version |
|---|---|
| NVIDIA Driver | >= 550 |
| Ubuntu | 22.04 |
Pre-installed packages
| Package | Version |
|---|---|
| Python | 3.10 |
| PyTorch | 2.5.1 |
| CUDA | 12.4 |
| transformers | 4.48.3 |
| triton | 3.1.0 |
| ray | 2.42.1 |
| vLLM | 0.7.2 |
| sgl-kernel | 0.0.3.post6 |
| SGLang | 0.4.3.post2 |
| flashinfer-python | 0.2.1.post2 |
| ACCL-N | 2.23.4.11 |
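
To confirm that a running container matches the table above, the installed Python package versions can be compared against it. The following is a minimal sketch, not part of the image: the pip distribution names (for example `torch` for PyTorch) are assumptions about how the packages are registered, and Python, CUDA, and ACCL-N are left out because they are not assumed to be pip-installed distributions.

```python
from importlib import metadata

# Expected versions copied from the table above. The keys are assumed pip
# distribution names; adjust them if the image registers packages differently.
EXPECTED = {
    "torch": "2.5.1",
    "transformers": "4.48.3",
    "triton": "3.1.0",
    "ray": "2.42.1",
    "vllm": "0.7.2",
    "sgl-kernel": "0.0.3.post6",
    "sglang": "0.4.3.post2",
    "flashinfer-python": "0.2.1.post2",
}


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(expected, installed):
    """Return {name: (expected, installed)} for packages that differ.

    A local build suffix such as "+cu124" is ignored, so "2.5.1+cu124"
    counts as a match for "2.5.1".
    """
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)
        base = have.split("+", 1)[0] if have else None
        if base != want:
            mismatches[name] = (want, have)
    return mismatches


def report():
    """Print one line per mismatch, or a confirmation if everything matches."""
    mismatches = find_mismatches(EXPECTED, installed_versions(EXPECTED))
    if not mismatches:
        print("All listed packages match the release notes.")
    for name, (want, have) in sorted(mismatches.items()):
        print(f"{name}: expected {want}, found {have or 'not installed'}")
```

Calling `report()` inside the container prints any package whose installed version differs from the table.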
Container images
Public image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.02-vllm0.7.2-sglang0.4.3.post2-pytorch2.5-cuda12.4-20250305-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace {region-id} with the region where your ACS is activated (for example, cn-beijing), and {image:tag} with the image name and tag.
VPC image pulls are currently only available in the China (Beijing) region.
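
As an illustration of the substitution described above, the sketch below fills in the VPC template for a given region and image. The helper name `vpc_image_ref` is hypothetical, and using the public image name and tag as the `{image:tag}` value is an assumption made for the example.

```python
# Template taken from the VPC image address on this page.
VPC_TEMPLATE = "acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_tag}"


def vpc_image_ref(region_id, image_tag):
    """Build a VPC image address from a region ID and an '<image>:<tag>' string."""
    if ":" not in image_tag:
        raise ValueError("image_tag must look like '<image>:<tag>'")
    return VPC_TEMPLATE.format(region_id=region_id, image_tag=image_tag)


# Example values: the region from this page and the public image name and tag.
example = vpc_image_ref(
    "cn-beijing",
    "inference-nv-pytorch:25.02-vllm0.7.2-sglang0.4.3.post2-pytorch2.5-cuda12.4-20250305-serverless",
)
```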
Image compatibility
Two image variants are available. Choose based on your deployment target:
| Image tag | Compatible with |
|---|---|
| ...20250305-serverless | ACS products and Lingjun multi-tenant products |
| ...20250305 (no -serverless suffix) | Lingjun single-tenant products |
The -serverless image is not compatible with Lingjun single-tenant products. Use the image without the -serverless suffix for single-tenant deployments.
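
The choice between the two variants can be expressed as a small lookup. This is a sketch under stated assumptions: the target labels (`acs`, `lingjun-multi-tenant`, `lingjun-single-tenant`) and the helper name are hypothetical, and the only rule encoded is the one given above, namely that single-tenant deployments use the tag without the -serverless suffix.

```python
SERVERLESS_SUFFIX = "-serverless"

# Hypothetical labels for the deployment targets named on this page.
SERVERLESS_TARGETS = {"acs", "lingjun-multi-tenant"}
PLAIN_TARGETS = {"lingjun-single-tenant"}


def tag_for_target(tag, target):
    """Return the tag variant that matches the deployment target.

    Accepts the tag with or without the -serverless suffix and adds or
    removes the suffix as needed.
    """
    if tag.endswith(SERVERLESS_SUFFIX):
        base = tag[: -len(SERVERLESS_SUFFIX)]
    else:
        base = tag
    if target in SERVERLESS_TARGETS:
        return base + SERVERLESS_SUFFIX
    if target in PLAIN_TARGETS:
        return base
    raise ValueError(f"unknown deployment target: {target!r}")
```

For example, passing the public image tag with the target `lingjun-single-tenant` returns the same tag with the trailing `-serverless` removed.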
Quick start
The following steps use Docker to pull the inference-nv-pytorch image and run an inference test with the Qwen2.5-7B-Instruct model.
To deploy this image in ACS, select it on the artifact center page in the ACS console or specify it in a YAML file. Do not use docker pull directly. For ACS deployment guides, see What's next.
Prerequisites
- Docker installed and running
- NVIDIA Driver release >= 550
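
The driver prerequisite can be checked on the host before pulling the image. The sketch below assumes `nvidia-smi` is on the PATH and compares only the major component of the reported version against 550. The function names are illustrative.

```python
import subprocess

MIN_DRIVER_MAJOR = 550  # minimum driver release from the prerequisites


def driver_major(version):
    """Return the major component of a driver version such as '550.54.15'."""
    return int(version.strip().split(".", 1)[0])


def driver_meets_minimum(version, minimum=MIN_DRIVER_MAJOR):
    """True if the driver's major version is at least the required minimum."""
    return driver_major(version) >= minimum


def query_driver_version():
    """Ask nvidia-smi for the driver version reported for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On a GPU host, `driver_meets_minimum(query_driver_version())` returns `True` when the installed driver satisfies the requirement.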
Run an inference test
- Pull the container image.

      docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

  Replace [tag] with the image tag for your target deployment (see Image compatibility).
- Download the Qwen2.5-7B-Instruct model from ModelScope.

      pip install modelscope
      cd /mnt
      modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

- Start the container.

      docker run -d -t --network=host --privileged --init --ipc=host \
        --ulimit memlock=-1 --ulimit stack=67108864 \
        -v /mnt/:/mnt/ \
        egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

- Start the vLLM server inside the container.

      python3 -m vllm.entrypoints.openai.api_server \
        --model /mnt/Qwen2.5-7B-Instruct \
        --trust-remote-code --disable-custom-all-reduce \
        --tensor-parallel-size 1

- Send a test request to the server.

      curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",
          "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": "Please introduce deep learning."}
          ]
        }'

For more information about vLLM, see the vLLM documentation.
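
The same test request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl payload. It assumes the server is reachable at `http://localhost:8000` and that the response follows the OpenAI-style chat completion layout, with the reply under `choices[0].message.content`. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # assumed server address, as in the curl test
MODEL = "/mnt/Qwen2.5-7B-Instruct"   # model path passed to the vLLM server


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build a POST request equivalent to the curl test request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a parsed chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, timeout=120):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(ask("Please introduce deep learning."))` prints the model's answer. Nothing is sent until `ask` is called.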
Known issues
- Illegal memory access for MoE on H20 (#13693): Update vLLM to resolve this issue.
What's next
To deploy inference-nv-pytorch in ACS, see: