Container Compute Service: inference-nv-pytorch 25.02

Last Updated: Mar 26, 2026

inference-nv-pytorch 25.02 updates vLLM to v0.7.2, adds SGLang v0.4.3.post2 support, and enables DeepSeek model inference.

What's new

  • vLLM updated to v0.7.2

  • SGLang v0.4.3.post2 supported

  • DeepSeek models supported: you can run DeepSeek model inference directly in the container.

Bug fixes

None.

System components

Requirements

Component        Version
NVIDIA Driver    >= 550
Ubuntu           22.04

Pre-installed packages

Package              Version
Python               3.10
PyTorch              2.5.1
CUDA                 12.4
transformers         4.48.3
triton               3.1.0
ray                  2.42.1
vLLM                 0.7.2
sgl-kernel           0.0.3.post6
SGLang               0.4.3.post2
flashinfer-python    0.2.1.post2
ACCL-N               2.23.4.11
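
To confirm that a running container matches the table above, the installed versions can be compared against the listed ones. The following is a minimal sketch using only the Python standard library. The distribution names (for example, torch for PyTorch) are assumptions about how each package is registered with pip inside the image, and find_mismatches is a hypothetical helper, not part of the image.

```python
from importlib import metadata

# Versions copied from the table above. The distribution names on the left
# are assumptions about how each package is registered with pip.
EXPECTED = {
    "torch": "2.5.1",
    "transformers": "4.48.3",
    "triton": "3.1.0",
    "ray": "2.42.1",
    "vllm": "0.7.2",
    "sgl-kernel": "0.0.3.post6",
    "sglang": "0.4.3.post2",
    "flashinfer-python": "0.2.1.post2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: found_version} for every package that differs from expected."""
    mismatches = {}
    for name, want in expected.items():
        got = lookup(name)
        # Drop a local build suffix such as "+cu124" before comparing.
        if got is None or got.split("+", 1)[0] != want:
            mismatches[name] = got
    return mismatches


if __name__ == "__main__":
    # Prints an empty dict when every listed package matches.
    print(find_mismatches(EXPECTED))
```

Run the script inside the container. Any entry in the printed dict is a package that is missing (None) or installed at a different version than the one listed.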

Container images

Public image

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.02-vllm0.7.2-sglang0.4.3.post2-pytorch2.5-cuda12.4-20250305-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the ID of the region where ACS is activated (for example, cn-beijing), and {image:tag} with the image name and tag.

Important

VPC image pulls are currently only available in the China (Beijing) region.
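
As an illustration of the substitution described above, the following sketch fills the VPC address template with a region ID and an image name and tag. The function name vpc_image_ref is hypothetical, and the input check is only a basic guard, not a full validation of registry naming rules.

```python
def vpc_image_ref(region_id, image_and_tag):
    """Fill the VPC registry template with a region ID and an 'image:tag' string."""
    if not region_id or ":" not in image_and_tag:
        raise ValueError("expected a region ID and an image in 'name:tag' form")
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"
```

For example, calling vpc_image_ref with "cn-beijing" and the name and tag of the public image yields an address that begins with acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/.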

Image compatibility

Two image variants are available. Choose based on your deployment target:

Image tag                              Compatible with
...20250305-serverless                 ACS products and Lingjun multi-tenant products
...20250305 (no -serverless suffix)    Lingjun single-tenant products

Important

The -serverless image is not compatible with Lingjun single-tenant products. Use the image without the -serverless suffix for single-tenant deployments.
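
The selection rule can be sketched as a small helper that adds or removes the -serverless suffix for a given deployment target. The target labels ("acs", "lingjun-multi-tenant", "lingjun-single-tenant") and the function name are hypothetical, chosen for this example only. The sketch also assumes that the two variants differ only in the suffix, as the table above suggests.

```python
SERVERLESS_SUFFIX = "-serverless"

# Hypothetical labels for the deployment targets listed in the table.
SERVERLESS_TARGETS = {"acs", "lingjun-multi-tenant"}
PLAIN_TARGETS = {"lingjun-single-tenant"}


def tag_for_target(tag, target):
    """Return the tag variant that matches the deployment target."""
    # Normalize to the tag without the suffix, whichever variant was passed in.
    base = tag[: -len(SERVERLESS_SUFFIX)] if tag.endswith(SERVERLESS_SUFFIX) else tag
    if target in SERVERLESS_TARGETS:
        return base + SERVERLESS_SUFFIX
    if target in PLAIN_TARGETS:
        return base
    raise ValueError(f"unknown deployment target: {target!r}")
```

Rejecting unknown targets, rather than falling back to a default, keeps a mistyped target from silently selecting an incompatible image.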

Quick start

The following steps use Docker to pull the inference-nv-pytorch image and run an inference test with the Qwen2.5-7B-Instruct model.

Note

To deploy this image in ACS, select it on the artifact center page of the ACS console or specify it in a YAML file. Do not use docker pull directly. For ACS deployment guides, see What's next.

Prerequisites

  • Docker installed and running

  • NVIDIA Driver release >= 550
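
The driver prerequisite can be checked on the host before pulling the image. The sketch below assumes that nvidia-smi is on the PATH and that the first component of its reported driver version (for example, 550 in 550.54.15) is the release number to compare. The helper names are hypothetical.

```python
import subprocess

MIN_DRIVER_MAJOR = 550  # minimum driver release from the prerequisites


def driver_major(version):
    """Return the leading component of a driver version string such as '550.54.15'."""
    return int(version.strip().split(".")[0])


def meets_requirement(version, minimum=MIN_DRIVER_MAJOR):
    """True if the driver version satisfies the minimum release."""
    return driver_major(version) >= minimum


def query_driver_versions():
    """Ask nvidia-smi for the driver version; it prints one line per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a GPU host, all(meets_requirement(v) for v in query_driver_versions()) evaluates to True when every reported driver is at release 550 or later.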

Run an inference test

  1. Pull the container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

    Replace [tag] with the image tag for your target deployment (see Image compatibility).

  2. Download the Qwen2.5-7B-Instruct model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

  3. Start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
      --ulimit memlock=-1 --ulimit stack=67108864 \
      -v /mnt/:/mnt/ \
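      egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

  4. Open a shell in the running container (for example, with docker exec -it [container-id] bash, where [container-id] is the ID printed by docker run), then start the vLLM server.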

    python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1

  5. Send a test request to the server.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Please introduce deep learning."}
        ]
      }'

    For more information about vLLM, see the vLLM documentation.
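
The curl request in step 5 can also be sent from Python. The following sketch uses only the standard library and mirrors the URL, model path, and messages of the curl example. It additionally polls /v1/models before sending, on the assumption that the server exposes that OpenAI-compatible endpoint and that a 200 response means the model has finished loading. The helper names and the timeout values are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # address used by the curl example
MODEL = "/mnt/Qwen2.5-7B-Instruct"  # must match the --model value from step 4


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL):
    """Build the same POST request as the curl example, with a custom user message."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return response_body["choices"][0]["message"]["content"]


def wait_until_ready(base_url=BASE_URL, timeout_s=300, interval_s=5):
    """Poll GET /v1/models until the server answers or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors are all OSError subclasses.
            pass
        time.sleep(interval_s)
    return False


def chat(user_text):
    """Send one chat request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text), timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the server from step 4 running, call wait_until_ready() first and then chat("Please introduce deep learning.") to get the generated text as a string.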

Known issues

What's next

To deploy inference-nv-pytorch in ACS, see: