
Container Compute Service:Inference-nv-pytorch 25.03

Last Updated:Mar 26, 2026

This release updates vLLM to v0.8.2 and PyTorch to 2.6.0 in the vLLM image, and updates SGLang to v0.4.4.post1 in the SGLang image.

What's new

  • PyTorch in the vLLM image is updated to 2.6.0.

  • vLLM is updated to v0.8.2.

  • SGLang is updated to v0.4.4.post1.

  • ACCL-N is updated to 2.23.4.12, with new features and bug fixes.

Bug fixes

None.

Image content

inference-nv-pytorch (vLLM variant)

  • Tag: 25.03-vllm0.8.2-pytorch2.6-cu124-20250327-serverless

  • Scenarios: LLM inference

  • Framework: PyTorch

  • Requirements: NVIDIA driver release >= 550

  • System components: Ubuntu 22.04, Python 3.10, Torch 2.6.0, CUDA 12.4, ACCL-N 2.23.4.12, accelerate 1.5.2, diffusers 0.32.2, flash_attn 2.7.4.post1, transformers 4.50.1, vllm 0.8.2, ray 2.44.0, triton 3.2.0

inference-nv-pytorch (SGLang variant)

  • Tag: 25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless

  • Scenarios: LLM inference

  • Framework: PyTorch

  • Requirements: NVIDIA driver release >= 550

  • System components: Ubuntu 22.04, Python 3.10, Torch 2.5.1, CUDA 12.4, ACCL-N 2.23.4.12, accelerate 1.5.2, diffusers 0.32.2, flash_attn 2.7.4.post1, transformers 4.48.3, vllm 0.7.2, ray 2.44.0, triton 3.2.0, flashinfer-python 0.2.3, sglang 0.4.4.post1, sgl-kernel 0.0.5

Assets

Public images

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless

VPC images

Pull VPC images using the following pattern:

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the region where your Apsara Container Service (ACS) is activated (for example, cn-beijing or cn-wulanchabu), and replace {image:tag} with the name and tag of the image.
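The substitution above can be sketched as a small helper. The function name is illustrative, and the example region and image:tag values come from this page; adjust them to your own deployment.

```python
def vpc_image_ref(region_id: str, image_and_tag: str) -> str:
    """Build a VPC registry reference from a region ID and an image:tag string."""
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"

# Example: the vLLM variant pulled through the China (Beijing) VPC endpoint.
ref = vpc_image_ref(
    "cn-beijing",
    "inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless",
)
print(ref)
```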

Important

Currently, you can pull VPC images only in the China (Beijing) region.

The inference-nv-pytorch:25.03-vllm0.8.2-pytorch2.6-cu124-20250328-serverless and inference-nv-pytorch:25.03-sglang0.4.4.post1-pytorch2.5-cu124-20250327-serverless images are compatible with ACS products and Lingjun multi-tenant products. They are not compatible with Lingjun single-tenant products.

Driver requirements

NVIDIA driver release >= 550.
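One way to verify this requirement on a host is to query the installed driver version with nvidia-smi and compare the major release number against 550. The sketch below is illustrative; it assumes nvidia-smi is on the host's PATH.

```python
import subprocess

def driver_meets_requirement(version: str, minimum: int = 550) -> bool:
    """Return True if an NVIDIA driver version string (e.g. '550.54.15')
    has a major release number >= minimum."""
    major = int(version.split(".")[0])
    return major >= minimum

def host_driver_version() -> str:
    """Query the installed driver version via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]

print(driver_meets_requirement("550.54.15"))   # True
print(driver_meets_requirement("535.129.03"))  # False
```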

Quick start

The following example pulls the inference-nv-pytorch image using Docker and runs an inference test with the Qwen2.5-7B-Instruct model.

To use the inference-nv-pytorch image in ACS, select the image from the artifact center page when creating workloads in the console, or specify it in a YAML file. For step-by-step guidance, see:

  • Use ACS GPU compute power to deploy a model inference service from a DeepSeek distilled model

  • Use ACS GPU compute power to deploy a model inference service based on the DeepSeek full version

  • Use ACS GPU compute power to deploy a distributed model inference service based on the DeepSeek full version
  1. Pull the inference container image. Replace [tag] with the tag of the image variant you want, as listed in the Assets section.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download an open-source model in the ModelScope format.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Start a container using the pulled image.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864  \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Run an inference test using vLLM.

    1. Start the vLLM API server.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Send a request from the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",
          "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Please introduce deep learning."}
          ]}'

      For more information about vLLM, see the vLLM documentation.
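The curl request above can also be issued from Python. The sketch below builds the same JSON payload with only the standard library; the server address and model path match this example and are assumptions about your deployment, and the actual send is left commented out since it requires the server to be running.

```python
import json
import urllib.request

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completions request for the local vLLM server."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "/mnt/Qwen2.5-7B-Instruct",
    [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."},
    ],
)
# To send (the vLLM API server must be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```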

Known issues

None.