This topic covers deployment configurations, performance benchmarks, and environment setup for running the QwQ-32B reasoning model on Alibaba Cloud Edge Node Service (ENS) with vLLM.
QwQ-32B overview
QwQ-32B is an open-source reasoning model built on Qwen2.5-32B. Through reinforcement learning, it achieves performance comparable to the full-version DeepSeek-R1 on core benchmarks (AIME 2024, AIME 2025, LiveCodeBench) and general metrics (IFEval, LiveBench), while using fewer parameters. This makes QwQ-32B a cost-effective choice for reasoning workloads.
Applicable scenarios
| Scenario | Average input length (tokens) | Average output length (tokens) | Typical applications |
|---|---|---|---|
| Mathematical reasoning | 0.5K-1.5K | 0.8K-3.6K | MATH problem-solving, LSAT logic analysis |
| Knowledge-based conversational search | 1K-4K | 0.2K-1K | MMLU knowledge assessment, medical consultation |
| Multi-round conversation | 2K-8K | 0.5K-2K | Customer service, psychological consultation |
| Long document processing | 8K-16K | 1K-4K | Paper summarization, legal document analysis |
| Code generation and debugging | 0.3K-2K | 1K-5K | Function implementation, debugging |
Performance metrics reference
| Metric | Description |
|---|---|
| Model precision | Numerical precision for model weights and computation. Lower precision reduces memory usage and cost but may reduce accuracy on complex tasks. |
| Concurrency | Number of simultaneous user requests. Higher concurrency increases capacity but also increases GPU memory and bandwidth usage. |
| Input length | Number of tokens in the user prompt. Longer inputs increase GPU memory consumption and TTFT. |
| Output length | Number of tokens in the model response. Excessively long outputs can cause truncation or out-of-memory (OOM) errors. |
| TTFT | Time to first token. The delay between sending a request and receiving the first output token. Target: under 1 second. Maximum: 2 seconds. |
| TPOT | Time per output token. The average generation time for each token after the first. Target: under 50 ms. Maximum: 100 ms. |
| Per-request throughput | Token output rate per request (tokens/s). Target: 10-30 tokens/s. |
| GPU memory usage | Percentage of GPU memory used at runtime, including model parameters, KV cache, and intermediate activations. Usage above 95% may cause OOM errors. |
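TPOT and per-request throughput are two views of the same quantity: after the first token arrives, throughput is approximately 1000 / TPOT (ms). A quick check against the target values above (a back-of-the-envelope sketch; measured throughput is slightly lower because TTFT and scheduling overhead also count against the request):

```shell
# Throughput (tokens/s) is approximately 1000 / TPOT (ms).
tpot_ms=50   # the TPOT target from the table above
awk -v t="$tpot_ms" 'BEGIN { printf "%.1f tokens/s\n", 1000 / t }'
```

Hitting the 50 ms TPOT target therefore corresponds to about 20 tokens/s per request, which sits inside the 10-30 tokens/s throughput target.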
Deployment configurations and benchmarks
ENS provides heterogeneous computing resources across globally distributed edge nodes. Single-card GPU memory ranges from 12 GB to 48 GB. The following sections describe two tested configurations for deploying QwQ-32B at different model precisions.
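A back-of-the-envelope check shows why these precisions pair with these GPU sizes. The estimate below covers model weights only (parameter count × bytes per parameter, with GB taken as 10^9 bytes for simplicity); the KV cache and intermediate activations need additional headroom on top of it:

```shell
# Rough weight-memory estimate for a 32B-parameter model.
# FP16 = 2 bytes per parameter, INT4 = 0.5 bytes per parameter.
awk 'BEGIN {
  p = 32e9
  printf "FP16 weights: %.0f GB (fits on 2 x 48 GB with headroom)\n", p * 2   / 1e9
  printf "INT4 weights: %.0f GB (fits on 5 x 12 GB with headroom)\n", p * 0.5 / 1e9
}'
```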
FP16 precision on dual-card 48 GB GPU instance
This configuration uses a virtual machine (VM) instance with two 48 GB GPU cards.
Environment specifications
| Parameter | Value |
|---|---|
| CPU | 96 cores |
| Memory | 384 GB |
| GPU | NVIDIA 48 GB x 2 |
| Operating system | Ubuntu 22.04 |
| Docker version | 26.1.3 |
| GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4 |
| Inference framework | vLLM 0.7.2 |
Performance benchmarks
| Scenario | Input length | Output length | Concurrency | Per-request throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
|---|---|---|---|---|---|---|---|
| Mathematical reasoning and code generation | 1K | 4K | 4 | 14.5 | 0.6 | 67.4 | 95% |
| Mathematical reasoning and code generation | 1K | 4K | 8 | 13.3 | 1.6 | 71.3 | 95% |
| Knowledge-based conversational search | 4K | 1K | 2 | 14.2 | 1.8 | 68.6 | 95% |
| Knowledge-based conversational search | 4K | 1K | 4 | 13 | 2.7 | 72.7 | 95% |
| Multi-round conversation and long document processing | 4K | 4K | 2 | 14.64 | 1.7 | 71.5 | 95% |
| Multi-round conversation and long document processing | 4K | 4K | 4 | 13.6 | 2.9 | 82.3 | 95% |
Concurrency recommendations
Mathematical reasoning and code generation (short input, long output; input 0.3K-2K, output 0.8K-5K): Concurrency 4 provides approximately 15 tokens/s with TTFT under 1 second, balancing user experience and cost. Concurrency 8 increases TTFT but remains acceptable. Increase concurrency to reduce per-request cost.
Knowledge-based conversational search (long input, short output; input 1K-4K, output 0.2K-1K): Optimal concurrency is 2. At concurrency 4, TTFT exceeds 2 seconds but remains acceptable when factoring in network latency.
Multi-round conversation and long document processing (long input, long output; input 2K-16K, output 1K-4K): Longer inputs increase both memory consumption and TTFT. Optimal concurrency is 2. Adjust input length and concurrency based on your workload.
INT4 precision on five-card 12 GB GPU instance
This configuration uses a bare metal instance with five 12 GB GPU cards.
Environment specifications
| Parameter | Value |
|---|---|
| CPU | 24 cores x 2, 3.0 to 4.0 GHz |
| Memory | 256 GB |
| GPU | NVIDIA 12 GB x 5 |
| Operating system | Ubuntu 20.04 |
| Docker version | 28.0.1 |
| GPU driver | Driver Version: 570.124.06, CUDA Version: 12.4 |
| Inference framework | vLLM 0.7.2 |
Performance benchmarks
The five-card 12 GB configuration delivers strong per-request throughput but has limited TTFT performance due to smaller per-card GPU memory. Use this configuration for mathematical reasoning and code generation scenarios. For knowledge-based conversational search, multi-round conversations, and long document processing, use the dual-card 48 GB GPU instance instead.
| Scenario | Input length | Output length | Concurrency | Per-request throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
|---|---|---|---|---|---|---|---|
| Mathematical reasoning and code generation | 1K | 4K | 2 | 37 | 1.3 | 26.4 | 96.5% |
| Mathematical reasoning and code generation | 1K | 4K | 4 | 32.5 | 1.7 | 28.7 | 96.5% |
| Mathematical reasoning and code generation | 1K | 4K | 8 | 24.6 | 3.5 | 61.5 | 96.5% |
| Knowledge-based conversational search | 4K | 1K | 1 | 33.5 | 4.7 | 25.1 | 96.5% |
| Multi-round conversation and long document processing | 4K | 4K | 1 | 35.8 | 4.7 | 26.6 | 96.5% |
| Multi-round conversation and long document processing | 8K | 4K | 1 | 21.9 | 9.3 | 43.3 | 96.5% |
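At concurrency 1, the last two rows suggest TTFT grows roughly in proportion to input length (4.7 s at 4K input versus 9.3 s at 8K). A quick check of that ratio (a rough heuristic for estimating TTFT at other input lengths, not a model of prefill cost):

```shell
# TTFT at 4K and 8K input, concurrency 1, from the table above.
awk 'BEGIN {
  t4 = 4.7; t8 = 9.3
  printf "TTFT ratio 8K/4K: %.2f (pure linear scaling would give 2.00)\n", t8 / t4
}'
```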
Concurrency recommendations
Mathematical reasoning and code generation: Concurrency 2 delivers 37 tokens/s with a TTFT of 1.3 seconds, balancing user experience and cost. For better cost-effectiveness, use concurrency 4.
Knowledge-based conversational search, multi-round conversation, and long document processing: TTFT reaches approximately 5 seconds at concurrency 1. This is not suitable for production workloads but works for proof-of-concept (POC) environments.
Set up a test environment
This section walks through creating an ENS instance, initializing the disk, installing the inference environment, and running benchmarks. The instructions use the dual-card 48 GB GPU configuration.
Prerequisites
An Alibaba Cloud account with ENS access
Familiarity with Linux command-line operations
An SSH client for connecting to the instance
Step 1: Create an ENS instance
Create an instance through the ENS console or by calling the API.
Option A: Use the ENS console
Log on to the ENS console.
In the left-side navigation pane, choose Resources and Images > Instances.
On the Instances page, click Create Instance. For more information, see Create an instance.
Configure the instance with the following settings.
| Step | Parameter | Recommended value |
|---|---|---|
| Basic Configurations | Billing Method | Subscription |
| Basic Configurations | Instance type | x86 Computing |
| Basic Configurations | Instance Specification | NVIDIA 48GB * 2. For detailed specifications, contact the customer manager. |
| Basic Configurations | Image | Ubuntu, ubuntu_22_04_x64_20G_alibase_20240926 |
| Network and Storage | Network | Self-built Network |
| Network and Storage | System Disk | Ultra Disk, 80 GB or more |
| Network and Storage | Data Disk | Ultra Disk, 1 TB or more |
| System Settings | Set Password | Custom Key or Key Pair |

Click Confirm Order in the lower-right corner. Complete the payment.
After payment, the page redirects to the ENS console. The instance is ready when its status changes to Running.
Option B: Call the RunInstances API
Call the RunInstances operation in OpenAPI Portal. The following JSON shows reference parameters.
```json
{
    "InstanceType": "ens.gnxxxx", // Instance type
    "InstanceChargeType": "PrePaid",
    "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
    "ScheduleAreaLevel": "Region",
    "EnsRegionId": "cn-your-ens-region", // Edge node
    "Password": "<YOURPASSWORD>", // Password
    "InternetChargeType": "95BandwidthByMonth",
    "SystemDisk": {
        "Size": 80,
        "Category": "cloud_efficiency"
    },
    "DataDisk": [
        {
            "Category": "cloud_efficiency",
            "Size": 1024
        }
    ],
    "InternetMaxBandwidthOut": 5000,
    "Amount": 1,
    "NetWorkId": "n-xxxxxxxxxxxxxxx",
    "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
    "InstanceName": "test",
    "HostName": "test",
    "PublicIpIdentification": true,
    "InstanceChargeStrategy": "instance" // Billing based on instance
}
```
Step 2: Log on to the instance and initialize the disk
Log on to the instance
For more information, see Connect to an instance.
Expand the root partition
After creating or resizing an instance, expand the root partition online without restarting.
```shell
# Install the cloud environment toolkit.
sudo apt-get update
sudo apt-get install -y cloud-guest-utils
# Make sure the GPT partitioning tool sgdisk exists.
type sgdisk || sudo apt-get install -y gdisk
# Expand the physical partition.
sudo LC_ALL=en_US.UTF-8 growpart /dev/vda 3
# Resize the file system.
sudo resize2fs /dev/vda3
# Verify the result.
df -h
```
Mount the data disk
Format and mount the data disk. Adjust the commands based on your environment.
```shell
# Identify the new disk.
lsblk
# Format the disk without partitioning it.
sudo mkfs -t ext4 /dev/vdb
# Configure the mount.
sudo mkdir /data
echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab
# Verify the mount.
sudo mount -a
df -hT /data
# Modify permissions.
sudo chown $USER:$USER /data
```
To create an image from this instance, first delete the /data mount entry (the UUID=... /data ext4 defaults,nofail 0 0 line) from /etc/fstab. Otherwise, instances created from the image can fail to start because the recorded UUID does not exist on the new instance.
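One way to perform that cleanup is with sed. The sketch below demonstrates the deletion on a sample file rather than the live /etc/fstab; the mount path matches the /data entry added above, and you should review the result before applying the same command to the real file:

```shell
# Demonstrate the cleanup on a sample fstab copy.
cat > /tmp/fstab.sample <<'EOF'
UUID=root-uuid / ext4 defaults 0 0
UUID=data-uuid /data ext4 defaults,nofail 0 0
EOF
# Delete the line that mounts /data (using # as the sed delimiter so the
# path's slashes need no escaping).
sed -i '\# /data #d' /tmp/fstab.sample
cat /tmp/fstab.sample
```

After verifying the output, run the same sed command with sudo against /etc/fstab, create the image, then restore the entry (or rerun the mount configuration) if you keep using the source instance.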
Step 3: Install the vLLM inference environment
Install CUDA
For more information, see CUDA Toolkit 12.4 Downloads.
```shell
# Download the CUDA Toolkit.
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_570.124.06_linux.run
chmod +x cuda_12.4.0_570.124.06_linux.run
# Run the installer. This step takes several minutes and requires interactive input.
sudo sh cuda_12.4.0_570.124.06_linux.run
# Add the CUDA paths to ~/.bashrc, then reload it.
echo 'export PATH="$PATH:/usr/local/cuda-12.4/bin"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"' >> ~/.bashrc
source ~/.bashrc
# Verify the installation.
nvcc -V
nvidia-smi
```
The CUDA Toolkit 12.4 runfile installs to /usr/local/cuda-12.4 by default. If your installation directory differs, adjust the paths accordingly.
Install auxiliary software (optional)
uv is a Python virtual environment and dependency management tool, useful for clusters that run multiple models. For more information, see Installing uv.
```shell
# Install uv. By default, uv installs to ~/.local/bin/.
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add uv to PATH in ~/.bashrc, then reload it. Use $HOME instead of ~ so the
# path expands inside double quotes.
echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
source ~/.bashrc
# Create a virtual environment.
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
```
If the CUDA environment variables become invalid after activating the virtual environment (for example, nvcc or nvidia-smi cannot be found), add the CUDA paths to the activation script:
```shell
# Edit myenv/bin/activate and add the following lines after the existing export PATH statement:
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"
```
Install the required packages:
```shell
# Install vLLM and ModelScope.
uv pip install vllm==0.7.2
uv pip install modelscope
# Install the GPU monitoring tool (optional). You can also use nvidia-smi.
uv pip install nvitop
```
Download the QwQ-32B model and benchmark script
```shell
# Download the model to /data to avoid running out of space on the system disk.
mkdir -p /data/Qwen/QwQ-32B
cd /data/Qwen/QwQ-32B
modelscope download --model Qwen/QwQ-32B --local_dir .
# Optional: download the dataset.
wget https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json
# Install git if needed.
sudo apt-get update
sudo apt-get install -y git
# Clone the vLLM repository for benchmark scripts.
git clone https://github.com/vllm-project/vllm.git
```
Step 4: Run the benchmark
Start the vLLM server
```shell
vllm serve /data/Qwen/QwQ-32B/ \
    --host 127.0.0.1 \
    --port 8080 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --served-model-name qw \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --max-num-batched-tokens 8192 \
    --max-model-len 8192 \
    --enable-prefix-caching
```
Run the benchmark
```shell
python3 ./vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --served-model-name qw \
    --model /data/Qwen/QwQ-32B \
    --dataset-name random \
    --random-input 1024 \
    --random-output 4096 \
    --random-range-ratio 1 \
    --max-concurrency 4 \
    --num-prompts 10 \
    --host 127.0.0.1 \
    --port 8080 \
    --save-result \
    --result-dir /data/logs/ \
    --result-filename QwQ-32B-4-1-4.log
```
Review the results
The benchmark script prints summary metrics to the console, including request throughput, TTFT, and TPOT, and saves the results to /data/logs/QwQ-32B-4-1-4.log. Compare the reported values against the performance benchmark tables earlier in this topic.