This topic describes the features and key metrics of the QwQ-32B model, best practices for deploying it in the edge cloud, and how to set up a test environment. It helps you quickly understand the model's capabilities, deployment requirements, and performance optimization methods so that you can deploy and use the model efficiently in an edge cloud environment, improve inference efficiency, and reduce costs.
About QwQ-32B
Description
QwQ-32B is an open-source inference model trained based on the Qwen2.5-32B model. Reinforcement learning greatly improves its inference capability. On core mathematical and coding benchmarks (AIME 2024 and 2025, LiveCodeBench) and some general benchmarks (IFEval, LiveBench), the model reaches the level of the full DeepSeek-R1 model. This represents a paradigm breakthrough of achieving high performance on inference tasks with fewer parameters and provides a more cost-effective alternative to high-cost large-model deployments.
Scenarios
The QwQ-32B model is suitable for scenarios such as mathematical logic inference, large document processing, and code generation. It also performs well in scenarios such as knowledge-based conversational search and multi-round conversation in Chinese. The following table describes the common inference scenarios.
| Scenario | Average input length (tokens) | Average output length (tokens) | Typical application cases |
| --- | --- | --- | --- |
| Mathematical logic inference | 0.5K-1.5K | 0.8K-3.6K | MATH problem solving, LSAT logic problem analysis |
| Knowledge-based conversational search | 1K-4K | 0.2K-1K | MMLU knowledge assessment, medical consultation |
| Multi-round conversation system | 2K-8K | 0.5K-2K | Customer service conversation, psychological consultation |
| Large document processing | 8K-16K | 1K-4K | Paper abstracts, legal document analysis |
| Code generation and debugging | 0.3K-2K | 1K-5K | Function implementation, debugging |
Key metrics for model inference
| Metric | Description |
| --- | --- |
| Model precision | The numerical precision used in model weights and computation. Lower-precision versions of the model occupy less memory and cost fewer resources, but reduce accuracy on complex tasks. |
| Concurrency | The number of user requests processed at the same time. Higher concurrency indicates greater business capacity, but also increases GPU memory and GPU memory bandwidth usage. |
| Input length | The number of tokens in the prompts provided by users, which affects GPU memory usage. A large input length increases the time to first token (TTFT). |
| Output length | The number of tokens in the response text generated by the model, which affects GPU memory usage. A large output length may lead to truncation or out-of-memory (OOM) errors. |
| TTFT | The time from when a user request is initiated until the first output token is received, which affects the user experience. We recommend that you keep the TTFT below 1s and no more than 2s. |
| Time per output token (TPOT) | The average time required to generate each output token (excluding the first token), which reflects how well the generation speed matches the reading experience. We recommend that you keep the TPOT below 50 ms and no more than 100 ms. |
| Single-channel throughput | The token output rate per channel (tokens/s). A low single-channel throughput leads to a poor user experience. We recommend that you keep the value between 10 tokens/s and 30 tokens/s. |
| GPU memory usage | The percentage of GPU memory used at runtime, which consists of model parameters, the KV cache, and intermediate activations. High GPU memory usage (for example, above 95%) may cause an OOM error, which affects service availability. |
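For a rough sense of how these metrics combine, the end-to-end latency of a single request can be estimated as the TTFT plus the output length multiplied by the TPOT. The following sketch computes this for an illustrative request; the numbers are examples, not measured values:

```bash
# Estimate end-to-end latency for one request (illustrative numbers only).
TTFT_S=1.0          # time to first token, in seconds
TPOT_MS=70          # time per output token, in milliseconds
OUTPUT_TOKENS=2048  # number of generated tokens

# total latency ≈ TTFT + output_tokens * TPOT
TOTAL_S=$(awk -v t="$TTFT_S" -v p="$TPOT_MS" -v n="$OUTPUT_TOKENS" \
  'BEGIN { printf "%.1f", t + n * p / 1000 }')
echo "Estimated end-to-end latency: ${TOTAL_S}s"   # about 144.4s in this example
```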
Best practices for deploying QwQ-32B in the edge cloud
The edge cloud provides heterogeneous computing resources in multiple specifications on globally distributed edge nodes to meet the computing requirements of different scenarios. The GPU memory of a single GPU card ranges from 12 GB to 48 GB. The following tables describe the recommended configurations and inference performance for deploying QwQ-32B inference services at different model precisions in the edge cloud:
Recommended dual-card instance with 48 GB GPU memory for QwQ-32B FP16 precision
The dual-card instance with 48 GB GPU memory is a VM. The following table describes the configurations.
| Environment parameter | Value |
| --- | --- |
| CPU | 96 cores |
| Memory | 384 GB |
| GPU | NVIDIA 48 GB * 2 |
| Operating system | Ubuntu 22.04 |
| Docker version | 26.1.3 |
| GPU driver | Driver version: 570.124.06, CUDA version: 12.4 |
| Inference framework | vLLM 0.7.2 |
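After the instance is ready, you can optionally confirm that the environment matches this table by using standard commands:

```bash
# Check the OS release, Docker version, GPU driver, and visible GPUs.
lsb_release -d                      # expect Ubuntu 22.04
docker --version                    # expect Docker 26.x
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
```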
Performance in different scenarios
| Scenario | Input length | Output length | Concurrency | Single-channel throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematical logic inference and code generation | 1K | 4K | 4 | 14.5 | 0.6 | 67.4 | 95% |
| Mathematical logic inference and code generation | 1K | 4K | 8 | 13.3 | 1.6 | 71.3 | 95% |
| Knowledge-based conversational search | 4K | 1K | 2 | 14.2 | 1.8 | 68.6 | 95% |
| Knowledge-based conversational search | 4K | 1K | 4 | 13 | 2.7 | 72.7 | 95% |
| Multi-round conversation and long document processing | 4K | 4K | 2 | 14.64 | 1.7 | 71.5 | 95% |
| Multi-round conversation and long document processing | 4K | 4K | 4 | 13.6 | 2.9 | 82.3 | 95% |
Mathematical logic inference and code generation:
This scenario features short input and long output. The input length ranges from 0.3K to 2K tokens, and the output length ranges from 0.8K to 5K tokens.
When the concurrency is 4, the single-channel throughput is close to 15 tokens/s, and the TTFT is less than 1s. This balances user experience and cost-effectiveness. When the concurrency is 8, the larger TTFT slightly affects the user experience but is still acceptable. If you want to reduce costs, you can increase the concurrency.
Knowledge-based conversational search:
This scenario features long input and short output. The input length ranges from 1K to 4K tokens, and the output length ranges from 0.2K to 1K tokens.
The optimal concurrency for an instance is 2. When the concurrency increases to 4, the TTFT exceeds 2s. Considering network latency, the impact on the user experience is still acceptable.
Multi-round conversation and long document processing:
This scenario features long input and long output. The input length ranges from 2K to 16K tokens, and the output length ranges from 1K to 4K tokens.
An increase in the input length not only increases memory consumption, but also significantly increases the TTFT. The optimal concurrency for an instance is 2. You can adjust the input length and concurrency based on your business requirements.
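To see why input length and concurrency dominate memory consumption, you can estimate the KV cache size per request. The sketch below assumes the published Qwen2.5-32B architecture (64 layers, 8 key-value heads, head dimension 128) and an FP16 KV cache; treat the result as an approximation rather than a measured value:

```bash
# Rough KV cache estimate for QwQ-32B (assumed architecture: 64 layers,
# 8 KV heads, head_dim 128, 2 bytes per value for FP16).
LAYERS=64; KV_HEADS=8; HEAD_DIM=128; BYTES=2
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES ))     # K and V per token
echo "KV cache per token: $(( PER_TOKEN / 1024 )) KiB"        # 256 KiB

# A 4K-token context at concurrency 4 therefore needs roughly:
TOKENS=$(( 4096 * 4 ))
echo "KV cache total: $(( PER_TOKEN * TOKENS / 1024 / 1024 / 1024 )) GiB"  # ~4 GiB
```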
Recommended five-card instance with 12 GB GPU memory for QwQ-32B INT4 precision
The five-card instance with 12 GB GPU memory is a bare metal instance. The following table describes the configurations.
| Environment parameter | Value |
| --- | --- |
| CPU | 24 cores × 2, 3.0 to 4.0 GHz |
| Memory | 256 GB |
| GPU | NVIDIA 12 GB * 5 |
| Operating system | Ubuntu 20.04 |
| Docker version | 28.0.1 |
| GPU driver | Driver version: 570.124.06, CUDA version: 12.4 |
| Inference framework | vLLM 0.7.2 |
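As with the dual-card instance, you can optionally confirm that all five GPUs are visible before you deploy the model:

```bash
# List the GPUs that the driver can see; expect five 12 GB cards.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```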
Performance in different scenarios
The single-channel throughput of a five-card instance with 12 GB GPU memory can meet the performance requirements of both single-channel and multi-channel concurrency. However, the TTFT performance is not satisfactory due to the limited GPU memory of a single card. We recommend that you deploy mathematical logic inference and code generation services by using this configuration. For scenarios such as knowledge-based conversational search, multi-round conversations, and long document processing, which involve large input lengths, we recommend that you use a dual-card instance with 48 GB GPU memory.
| Scenario | Input length | Output length | Concurrency | Single-channel throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mathematical logic inference and code generation | 1K | 4K | 2 | 37 | 1.3 | 26.4 | 96.5% |
| Mathematical logic inference and code generation | 1K | 4K | 4 | 32.5 | 1.7 | 28.7 | 96.5% |
| Mathematical logic inference and code generation | 1K | 4K | 8 | 24.6 | 3.5 | 61.5 | 96.5% |
| Knowledge-based conversational search | 4K | 1K | 1 | 33.5 | 4.7 | 25.1 | 96.5% |
| Multi-round conversation and long document processing | 4K | 4K | 1 | 35.8 | 4.7 | 26.6 | 96.5% |
| Multi-round conversation and long document processing | 8K | 4K | 1 | 21.9 | 9.3 | 43.3 | 96.5% |
Mathematical logic inference and code generation:
When the concurrency is 2, the single-channel throughput reaches 37 tokens/s, and the TTFT is 1.3s. This balances user experience and cost-effectiveness. When the concurrency is increased to 8, the impact on the user experience is more significant. For better cost-effectiveness, you can increase the concurrency to 4.
Knowledge-based conversational search, multi-round conversation, and long document processing:
Due to the large input length, the GPU memory usage is high, and the TTFT is close to 5s even when the concurrency is 1, which is not suitable for production. However, this configuration can be used to build proof-of-concept (POC) environments.
Build a test environment
Create and initialize a dual-card instance with 48 GB GPU memory
Create an instance in the ENS console
Log on to the ENS console.
In the left-side navigation pane, choose Instances.
On the Instances page, click Create Instance. For information about how to configure the ENS instance parameters, see Create an instance.
You can configure the parameters based on your business requirements. The following table describes the recommended configurations.
| Step | Parameter | Recommended value |
| --- | --- | --- |
| Basic Configurations | Billing Method | Subscription |
| Basic Configurations | Instance type | x86 Computing |
| Basic Configurations | Instance Specification | NVIDIA 48 GB * 2. For detailed specifications, contact your customer manager. |
| Basic Configurations | Image | Ubuntu (ubuntu_22_04_x64_20G_alibase_20240926) |
| Network and Storage | Network | Self-built Network |
| Network and Storage | System Disk | Ultra Disk, 80 GB or more |
| Network and Storage | Data Disk | Ultra Disk, 1 TB or more |
| System Settings | Set Password | Custom Key or Key Pair |
Confirm the order.
After you complete the system settings, click Confirm Order in the lower-right corner. The system calculates and displays the price based on your configuration. After you complete the payment, you are redirected to the ENS console.
You can view the created instance in the ENS console. If the instance is in the Running state, the instance is available.
Create an instance by calling an operation
You can also call an operation to create an instance in OpenAPI Portal.
The following sample shows the reference request parameters of the operation.
{
  "InstanceType": "ens.gnxxxx", <The instance type>
  "InstanceChargeType": "PrePaid",
  "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
  "ScheduleAreaLevel": "Region",
  "EnsRegionId": "cn-your-ens-region", <The edge node>
  "Password": "<YOURPASSWORD>", <The logon password>
  "InternetChargeType": "95BandwidthByMonth",
  "SystemDisk": {
    "Size": 80,
    "Category": "cloud_efficiency"
  },
  "DataDisk": [
    {
      "Category": "cloud_efficiency",
      "Size": 1024
    }
  ],
  "InternetMaxBandwidthOut": 5000,
  "Amount": 1,
  "NetWorkId": "n-xxxxxxxxxxxxxxx",
  "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
  "InstanceName": "test",
  "HostName": "test",
  "PublicIpIdentification": true,
  "InstanceChargeStrategy": "instance" <Billing based on instance>
}

Log on to the instance and initialize the disk
Log on to the instance
For more information about how to log on to an instance, see Connect to an instance.
Initialize the disk
Expand the root partition.
After you create or resize an instance, you need to expand the root partition online without restarting the instance.
# Install the cloud environment toolkit.
sudo apt-get update
sudo apt-get install -y cloud-guest-utils
# Ensure that the GPT partitioning tool sgdisk exists.
type sgdisk || sudo apt-get install -y gdisk
# Expand the physical partition.
sudo LC_ALL=en_US.UTF-8 growpart /dev/vda 3
# Resize the file system.
sudo resize2fs /dev/vda3
# Verify the resizing result.
df -h
Mount the data disk.
You need to format and mount the data disk. The following section provides the sample commands. Run the commands based on your business requirements.
# Identify the new disk.
lsblk
# Format the disk without partitioning it.
sudo mkfs -t ext4 /dev/vdb
# Create the mount point and configure the mount.
sudo mkdir /data
echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab
# Verify the mount.
sudo mount -a
df -hT /data
# Modify permissions.
sudo chown $USER:$USER /data
Note: If you want to create an image based on the instance, you must delete the data disk entry (the line ending with ext4 defaults,nofail 0 0) from the /etc/fstab file. Otherwise, instances created from the image cannot start.
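For example, before you create an image, you could remove the entry with a one-line command. This is a sketch that assumes the data disk was mounted at /data as shown above; review /etc/fstab before and after you run it:

```bash
# Remove the /data mount entry from /etc/fstab before creating an image.
sudo cp /etc/fstab /etc/fstab.bak            # keep a backup
sudo sed -i '\| /data ext4 |d' /etc/fstab    # delete the line that mounts /data
cat /etc/fstab                               # confirm the entry is gone
```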
Install the vLLM inference environment
Install CUDA
For more information about how to install CUDA, see CUDA Toolkit 12.4 Downloads.
# Install the CUDA Toolkit.
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_570.124.06_linux.run
chmod +x cuda_12.4.0_570.124.06_linux.run
# This step takes a period of time and you need to interact with the GUI.
sudo sh cuda_12.4.0_570.124.06_linux.run
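# Optional: for a non-interactive installation, the runfile also supports a
# silent mode (a sketch; verify the flags with `sh <runfile> --help`):
# sudo sh cuda_12.4.0_570.124.06_linux.run --silent --toolkit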
# Add environment variables: append the following lines to ~/.bashrc and save the file.
vim ~/.bashrc
export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"
source ~/.bashrc
# Verify whether the installation is successful.
nvcc -V
nvidia-smi

Install auxiliary software (optional)
uv is a convenient tool for managing Python virtual environments and dependencies, and is suitable for machines that need to run multiple models. For more information about how to install uv, see Installing uv.
# Install uv. By default, uv is installed in ~/.local/bin/.
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add the following line to ~/.bashrc.
export PATH="$PATH:$HOME/.local/bin"
source ~/.bashrc
# Create a clean venv environment
uv venv myenv --python 3.12 --seed
source myenv/bin/activate

If the CUDA environment variables that you configured become invalid after you install uv and the nvcc or nvidia-smi commands cannot be found, perform the following operations:
vim myenv/bin/activate
Add the following lines after the export PATH statement:

export PATH="$PATH:/usr/local/cuda-12.4/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-12.4/lib64"

# Install vLLM and ModelScope.
uv pip install vllm==0.7.2
uv pip install modelscope
# Optional GPU monitoring tool. You can also use the default NVIDIA System Management Interface (nvidia-smi).
uv pip install nvitop

Download the QwQ-32B model and vLLM benchmark script
# Download the model to the data disk /data to avoid errors caused by insufficient disk space.
mkdir -p /data/Qwen/QwQ-32B
cd /data/Qwen/QwQ-32B
modelscope download --model Qwen/QwQ-32B --local_dir .
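# Optional: verify that the download completed. The full-precision weights are
# large (on the order of 60 GB or more), so check the size on the data disk.
du -sh /data/Qwen/QwQ-32B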
# Optional. Download the dataset.
wget https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json
# Install git as needed.
apt update
apt install git -y
# Download vllm.git, which contains the test script.
git clone https://github.com/vllm-project/vllm.git

Test the model online
Start the vLLM server
vllm serve /data/Qwen/QwQ-32B/ \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 2 \
--trust-remote-code \
--served-model-name qw \
--gpu-memory-utilization 0.95 \
--enforce-eager \
--max-num-batched-tokens 8192 \
--max-model-len 8192 \
--enable-prefix-caching

Start the test
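Before you run the benchmark, you can optionally send a single request to confirm that the server is serving the model. This is a minimal check against vLLM's OpenAI-compatible API; qw is the name set by --served-model-name above:

```bash
# Send a simple chat request to the local vLLM server.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qw",
    "messages": [{"role": "user", "content": "What is 12 * 12?"}],
    "max_tokens": 512
  }'
```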
python3 ./vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --served-model-name qw \
    --model /data/Qwen/QwQ-32B \
    --dataset-name random \
    --random-input 1024 \
    --random-output 4096 \
    --random-range-ratio 1 \
    --max-concurrency 4 \
    --num-prompts 10 \
    --host 127.0.0.1 \
    --port 8080 \
    --save-result \
    --result-dir /data/logs/ \
    --result-filename QwQ-32B-4-1-4.log

Finish the test
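After the run completes, you can inspect the saved result file. The sketch below assumes the benchmark wrote a JSON-formatted summary to the path specified by --result-dir and --result-filename:

```bash
# Pretty-print the saved benchmark result for review.
python3 -m json.tool /data/logs/QwQ-32B-4-1-4.log
```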
When the test finishes, the benchmark also prints a summary of the results, including throughput, TTFT, and TPOT, to the console.