
ENS: Deploy QwQ-32B reasoning model on edge cloud

Last Updated: Mar 01, 2026

This topic covers deployment configurations, performance benchmarks, and environment setup for running the QwQ-32B reasoning model on Alibaba Cloud Edge Node Service (ENS) with vLLM.

QwQ-32B overview

QwQ-32B is an open-source reasoning model built on Qwen2.5-32B. Through reinforcement learning, it achieves performance comparable to the full-version DeepSeek-R1 on core benchmarks (AIME 2024, AIME 2025, LiveCodeBench) and general metrics (IFEval, LiveBench), while using fewer parameters. This makes QwQ-32B a cost-effective choice for reasoning workloads.

Applicable scenarios

| Scenario | Average input length (tokens) | Average output length (tokens) | Typical applications |
| --- | --- | --- | --- |
| Mathematical reasoning | 0.5K-1.5K | 0.8K-3.6K | MATH problem-solving, LSAT logic analysis |
| Knowledge-based conversational search | 1K-4K | 0.2K-1K | MMLU knowledge assessment, medical consultation |
| Multi-round conversation | 2K-8K | 0.5K-2K | Customer service, psychological consultation |
| Long document processing | 8K-16K | 1K-4K | Paper summarization, legal document analysis |
| Code generation and debugging | 0.3K-2K | 1K-5K | Function implementation, debugging |

Performance metrics reference

| Metric | Description |
| --- | --- |
| Model precision | Numerical precision for model weights and computation. Lower precision reduces memory usage and cost but may reduce accuracy on complex tasks. |
| Concurrency | Number of simultaneous user requests. Higher concurrency increases capacity but also increases GPU memory and bandwidth usage. |
| Input length | Number of tokens in the user prompt. Longer inputs increase GPU memory consumption and TTFT. |
| Output length | Number of tokens in the model response. Excessively long outputs can cause truncation or out-of-memory (OOM) errors. |
| TTFT | Time to first token: the delay between sending a request and receiving the first output token. Target: under 1 second. Maximum: 2 seconds. |
| TPOT | Time per output token: the average generation time for each token after the first. Target: under 50 ms. Maximum: 100 ms. |
| Per-request throughput | Token output rate per request (tokens/s). Target: 10-30 tokens/s. |
| GPU memory usage | Percentage of GPU memory used at runtime, including model parameters, KV cache, and intermediate activations. Usage above 95% may cause OOM errors. |
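The latency targets above combine into a simple end-to-end estimate: total response time is roughly TTFT plus (output tokens − 1) × TPOT. A quick sketch at the target values, assuming a hypothetical 1,000-token response:

```shell
# Estimate total response latency from the target metrics in the table above:
# total = TTFT + (output_tokens - 1) * TPOT
awk 'BEGIN {
  ttft_ms = 1000          # target TTFT: 1 second
  tpot_ms = 50            # target TPOT: 50 ms
  tokens  = 1000          # assumed response length in tokens
  total_ms = ttft_ms + (tokens - 1) * tpot_ms
  printf "total latency = %d ms (%.2f s)\n", total_ms, total_ms / 1000
}'
# prints: total latency = 50950 ms (50.95 s)
```

This is why long-output scenarios such as mathematical reasoning are dominated by TPOT rather than TTFT.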

Deployment configurations and benchmarks

ENS provides heterogeneous computing resources across globally distributed edge nodes. Single-card GPU memory ranges from 12 GB to 48 GB. The following sections describe two tested configurations for deploying QwQ-32B at different model precisions.

FP16 precision on dual-card 48 GB GPU instance

This configuration uses a virtual machine (VM) instance with two 48 GB GPU cards.

Environment specifications

| Parameter | Value |
| --- | --- |
| CPU | 96 cores |
| Memory | 384 GB |
| GPU | NVIDIA 48 GB x 2 |
| Operating system | Ubuntu 22.04 |
| Docker version | 26.1.3 |
| GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4 |
| Inference framework | vLLM 0.7.2 |

Performance benchmarks

| Scenario | Input length | Output length | Concurrency | Per-request throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematical reasoning and code generation | 1K | 4K | 4 | 14.5 | 0.6 | 67.4 | 95% |
| Mathematical reasoning and code generation | 1K | 4K | 8 | 13.3 | 1.6 | 71.3 | 95% |
| Knowledge-based conversational search | 4K | 1K | 2 | 14.2 | 1.8 | 68.6 | 95% |
| Knowledge-based conversational search | 4K | 1K | 4 | 13 | 2.7 | 72.7 | 95% |
| Multi-round conversation and long document processing | 4K | 4K | 2 | 14.64 | 1.7 | 71.5 | 95% |
| Multi-round conversation and long document processing | 4K | 4K | 4 | 13.6 | 2.9 | 82.3 | 95% |

Concurrency recommendations

  • Mathematical reasoning and code generation (short input, long output; input 0.3K-2K, output 0.8K-5K): Concurrency 4 provides approximately 15 tokens/s with TTFT under 1 second, balancing user experience and cost. Concurrency 8 increases TTFT but remains acceptable. Increase concurrency to reduce per-request cost.

  • Knowledge-based conversational search (long input, short output; input 1K-4K, output 0.2K-1K): Optimal concurrency is 2. At concurrency 4, TTFT exceeds 2 seconds but remains acceptable when factoring in network latency.

  • Multi-round conversation and long document processing (long input, long output; input 2K-16K, output 1K-4K): Longer inputs increase both memory consumption and TTFT. Optimal concurrency is 2. Adjust input length and concurrency based on your workload.
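A useful cross-check for these recommendations: aggregate capacity equals concurrency times per-request throughput, so raising concurrency can increase total token output even as each individual request slows down. Using the FP16 benchmark numbers from the table above:

```shell
# Aggregate throughput = concurrency x per-request throughput
# (values taken from the FP16 benchmark table for mathematical reasoning).
awk 'BEGIN {
  printf "concurrency 4: %.1f tokens/s aggregate\n", 4 * 14.5
  printf "concurrency 8: %.1f tokens/s aggregate\n", 8 * 13.3
}'
# prints:
# concurrency 4: 58.0 tokens/s aggregate
# concurrency 8: 106.4 tokens/s aggregate
```

Doubling concurrency from 4 to 8 nearly doubles aggregate output at the cost of higher TTFT and TPOT per request.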

INT4 precision on five-card 12 GB GPU instance

This configuration uses a bare metal instance with five 12 GB GPU cards.

Environment specifications

| Parameter | Value |
| --- | --- |
| CPU | 24 cores x 2, 3.0 to 4.0 GHz |
| Memory | 256 GB |
| GPU | NVIDIA 12 GB x 5 |
| Operating system | Ubuntu 20.04 |
| Docker version | 28.0.1 |
| GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4 |
| Inference framework | vLLM 0.7.2 |

Performance benchmarks

The five-card 12 GB configuration delivers strong per-request throughput but has limited TTFT performance due to smaller per-card GPU memory. Use this configuration for mathematical reasoning and code generation scenarios. For knowledge-based conversational search, multi-round conversations, and long document processing, use the dual-card 48 GB GPU instance instead.

| Scenario | Input length | Output length | Concurrency | Per-request throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematical reasoning and code generation | 1K | 4K | 2 | 37 | 1.3 | 26.4 | 96.5% |
| Mathematical reasoning and code generation | 1K | 4K | 4 | 32.5 | 1.7 | 28.7 | 96.5% |
| Mathematical reasoning and code generation | 1K | 4K | 8 | 24.6 | 3.5 | 61.5 | 96.5% |
| Knowledge-based conversational search | 4K | 1K | 1 | 33.5 | 4.7 | 25.1 | 96.5% |
| Multi-round conversation and long document processing | 4K | 4K | 1 | 35.8 | 4.7 | 26.6 | 96.5% |
| Multi-round conversation and long document processing | 8K | 4K | 1 | 21.9 | 9.3 | 43.3 | 96.5% |

Concurrency recommendations

  • Mathematical reasoning and code generation: Concurrency 2 delivers 37 tokens/s with a TTFT of 1.3 seconds, balancing user experience and cost. For better cost-effectiveness, use concurrency 4.

  • Knowledge-based conversational search, multi-round conversation, and long document processing: TTFT reaches approximately 5 seconds at concurrency 1. This is not suitable for production workloads but works for proof-of-concept (POC) environments.

Set up a test environment

This section walks through creating an ENS instance, initializing the disk, installing the inference environment, and running benchmarks. The instructions use the dual-card 48 GB GPU configuration.

Prerequisites

  • An Alibaba Cloud account with ENS access

  • Familiarity with Linux command-line operations

  • An SSH client for connecting to the instance

Step 1: Create an ENS instance

Create an instance through the ENS console or by calling the API.

Option A: Use the ENS console

  1. Log on to the ENS console.

  2. In the left-side navigation pane, choose Resources and Images > Instances.

  3. On the Instances page, click Create Instance. For more information, see Create an instance.

  4. Configure the instance with the following settings.

    | Step | Parameter | Recommended value |
    | --- | --- | --- |
    | Basic Configurations | Billing Method | Subscription |
    | Basic Configurations | Instance type | x86 Computing |
    | Basic Configurations | Instance Specification | NVIDIA 48GB * 2. For detailed specifications, contact your account manager. |
    | Basic Configurations | Image | Ubuntu, ubuntu_22_04_x64_20G_alibase_20240926 |
    | Network and Storage | Network | Self-built Network |
    | Network and Storage | System Disk | Ultra Disk, 80 GB or more |
    | Network and Storage | Data Disk | Ultra Disk, 1 TB or more |
    | System Settings | Set Password | Custom Key or Key Pair |
  5. Click Confirm Order in the lower-right corner. Complete the payment.

After payment, the page redirects to the ENS console. The instance is ready when its status changes to Running.

Option B: Call the RunInstances API

Call the RunInstances operation in the OpenAPI Portal. The following JSON shows reference parameters.

{
  "InstanceType": "ens.gnxxxx",                    // Instance type
  "InstanceChargeType": "PrePaid",
  "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
  "ScheduleAreaLevel": "Region",
  "EnsRegionId": "cn-your-ens-region",             // Edge node
  "Password": "<YOURPASSWORD>",                    // Password
  "InternetChargeType": "95BandwidthByMonth",
  "SystemDisk": {
    "Size": 80,
    "Category": "cloud_efficiency"
  },
  "DataDisk": [
    {
      "Category": "cloud_efficiency",
      "Size": 1024
    }
  ],
  "InternetMaxBandwidthOut": 5000,
  "Amount": 1,
  "NetWorkId": "n-xxxxxxxxxxxxxxx",
  "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
  "InstanceName": "test",
  "HostName": "test",
  "PublicIpIdentification": true,
  "InstanceChargeStrategy": "instance"             // Billing based on instance
}
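Note that the // comments in the sample above are not valid JSON and must be removed before the payload is sent. A minimal sketch for validating a comment-free parameter file before calling the API (the values are placeholders carried over from the sample, trimmed to a subset for brevity):

```shell
# Write a comment-free parameter file and validate it with Python's JSON parser.
cat > runinstances-params.json <<'EOF'
{
  "InstanceType": "ens.gnxxxx",
  "InstanceChargeType": "PrePaid",
  "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
  "EnsRegionId": "cn-your-ens-region",
  "Amount": 1
}
EOF
python3 -m json.tool runinstances-params.json > /dev/null && echo "parameters OK"
# prints: parameters OK
```

Validating locally catches syntax errors (trailing commas, stray comments) before they surface as opaque API failures.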

Step 2: Log on to the instance and initialize the disk

Log on to the instance

For more information, see Connect to an instance.

Expand the root partition

After creating or resizing an instance, expand the root partition online without restarting.

# Install the cloud environment toolkit.
sudo apt-get update
sudo apt-get install -y cloud-guest-utils

# Make sure the GPT partitioning tool sgdisk exists.
type sgdisk || sudo apt-get install -y gdisk

# Expand the physical partition.
sudo LC_ALL=en_US.UTF-8 growpart /dev/vda 3

# Resize the file system.
sudo resize2fs /dev/vda3

# Verify the result.
df -h

Mount the data disk

Format and mount the data disk. Adjust the commands based on your environment.

# Identify the new disk.
lsblk

# Format the disk without partitioning it.
sudo mkfs -t ext4 /dev/vdb

# Configure the mount.
sudo mkdir /data
echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab

# Verify the mount.
sudo mount -a
df -hT /data

# Modify permissions.
sudo chown $USER:$USER /data
Note

To create an image from this instance, first delete the /data entry (the line ending in ext4 defaults,nofail 0 0) from /etc/fstab. Otherwise, instances created from the image will fail to start.
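The entry can be removed with a one-line sed command. The sketch below runs against a sample file rather than the live /etc/fstab; on the real instance, point it at /etc/fstab with sudo after taking a backup:

```shell
# Sample fstab containing the /data entry added in the mount step (UUIDs are placeholders).
cat > fstab.sample <<'EOF'
UUID=aaaaaaaa-root / ext4 defaults 0 1
UUID=bbbbbbbb-data /data ext4 defaults,nofail 0 0
EOF

# Delete any line that mounts /data, then confirm only the root entry remains.
sed -i '/ \/data /d' fstab.sample
cat fstab.sample
# prints: UUID=aaaaaaaa-root / ext4 defaults 0 1
```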

Step 3: Install the vLLM inference environment

Install CUDA

For more information, see CUDA Toolkit 12.4 Downloads.

# Download the CUDA Toolkit.
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_570.124.06_linux.run
chmod +x cuda_12.4.0_570.124.06_linux.run

# Run the installer. This step takes several minutes and requires interactive input.
sudo sh cuda_12.4.0_570.124.06_linux.run

# Add environment variables to ~/.bashrc.
echo 'export PATH="$PATH:/usr/local/cuda-12.4/bin"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"' >> ~/.bashrc
source ~/.bashrc

# Verify the installation.
nvcc -V
nvidia-smi
Note

The installer creates /usr/local/cuda-12.4 by default. If you installed a different CUDA Toolkit version or chose a different directory, adjust the paths in ~/.bashrc to match.

Install auxiliary software (optional)

uv is a Python virtual environment and dependency management tool, useful for clusters that run multiple models. For more information, see Installing uv.

# Install uv. By default, uv installs to ~/.local/bin/.
curl -LsSf https://astral.sh/uv/install.sh | sh

# Add uv to PATH in ~/.bashrc. Use $HOME rather than ~, because a quoted ~ is not expanded.
echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
source ~/.bashrc

# Create a virtual environment.
uv venv myenv --python 3.12 --seed
source myenv/bin/activate

If the CUDA environment variables become invalid after activating the virtual environment (for example, nvcc or nvidia-smi cannot be found), add the CUDA paths to the activation script:

# Edit myenv/bin/activate and add the following lines after the existing export PATH statement:
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"

Install the required packages:

# Install vLLM and ModelScope.
uv pip install vllm==0.7.2
uv pip install modelscope

# Install the GPU monitoring tool (optional). You can also use nvidia-smi.
uv pip install nvitop

Download the QwQ-32B model and benchmark script

# Download the model to /data to avoid running out of space on the system disk.
mkdir -p /data/Qwen/QwQ-32B
cd /data/Qwen/QwQ-32B
modelscope download --model Qwen/QwQ-32B --local_dir .

# Optional: download the dataset.
wget https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json

# Install git if needed.
sudo apt-get update
sudo apt-get install -y git

# Clone the vLLM repository for benchmark scripts.
git clone https://github.com/vllm-project/vllm.git

Step 4: Run the benchmark

Start the vLLM server

vllm serve /data/Qwen/QwQ-32B/ \
  --host 127.0.0.1 \
  --port 8080 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --served-model-name qw \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --max-num-batched-tokens 8192 \
  --max-model-len 8192 \
  --enable-prefix-caching
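Once the server reports that startup is complete, you can smoke-test it through the OpenAI-compatible completions endpoint. The sketch below builds and validates a sample request payload locally; the curl call is commented out because it assumes the server above is already running on 127.0.0.1:8080 with served model name qw:

```shell
# Build a request payload for the OpenAI-compatible completions endpoint and validate it.
payload='{"model": "qw", "prompt": "What is 2+2?", "max_tokens": 32}'
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"
# prints: payload OK

# Send it once the server is up (matches --served-model-name, --host, and --port above):
# curl http://127.0.0.1:8080/v1/completions -H "Content-Type: application/json" -d "$payload"
```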

Run the benchmark

python3 ./vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --served-model-name qw \
  --model /data/Qwen/QwQ-32B \
  --dataset-name random \
  --random-input 1024 \
  --random-output 4096 \
  --random-range-ratio 1 \
  --max-concurrency 4 \
  --num-prompts 10 \
  --host 127.0.0.1 \
  --port 8080 \
  --save-result \
  --result-dir /data/logs/ \
  --result-filename QwQ-32B-4-1-4.log
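With --save-result, the benchmark writes a JSON result file that you can post-process. The field names used below (mean_ttft_ms, mean_tpot_ms, output_throughput) are assumptions based on the vLLM 0.7.x benchmark_serving.py output format and may differ across versions; a sample file stands in for the real /data/logs/QwQ-32B-4-1-4.log:

```shell
# Create a sample result file with assumed field names (check your vLLM version's output).
cat > sample_result.json <<'EOF'
{"mean_ttft_ms": 600.0, "mean_tpot_ms": 67.4, "output_throughput": 58.0}
EOF

# Extract the headline metrics for comparison against the targets in this topic.
python3 - <<'EOF'
import json
r = json.load(open("sample_result.json"))
print(f"TTFT {r['mean_ttft_ms'] / 1000:.1f} s, TPOT {r['mean_tpot_ms']:.1f} ms, {r['output_throughput']:.0f} tokens/s")
EOF
# prints: TTFT 0.6 s, TPOT 67.4 ms, 58 tokens/s
```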

Review the results

The benchmark script prints a summary of the run and, because --save-result is set, writes detailed metrics to the file specified by --result-filename in the --result-dir directory.