
ENS: Best practices for QwQ-32B inference services in the edge cloud

Last Updated: Jun 20, 2025

This topic describes the features and key metrics of the QwQ-32B model, best practices for deploying it in the edge cloud, and how to set up a test environment. It helps you quickly understand the model's features, deployment requirements, and performance optimization methods so that you can efficiently deploy and use the model in an edge cloud environment, improve inference efficiency, and reduce costs.

About QwQ-32B

Description

QwQ-32B is an open-source reasoning model trained on top of Qwen2.5-32B, with its reasoning capability greatly improved through reinforcement learning. On core mathematics and code benchmarks (AIME 2024 and 2025, LiveCodeBench) and some general benchmarks (IFEval, LiveBench), it reaches the level of the full version of DeepSeek-R1. This represents a breakthrough of achieving high performance with far fewer parameters on reasoning tasks, providing a more cost-effective alternative to expensive large-model deployments.

Scenarios

The QwQ-32B model is suitable for scenarios such as mathematical logic inference, large document processing, and code generation. It also performs well in scenarios such as knowledge-based conversational search and multi-round conversation in Chinese. The following table describes the common inference scenarios.

| Scenario | Average input length (tokens) | Average output length (tokens) | Typical application cases |
| --- | --- | --- | --- |
| Mathematical logic inference | 0.5K-1.5K | 0.8K-3.6K | MATH problem-solving, LSAT logic problem analysis |
| Knowledge-based conversational search | 1K-4K | 0.2K-1K | MMLU knowledge assessment, medical consultation |
| Multi-round conversation system | 2K-8K | 0.5K-2K | Customer service conversation, psychological consultation |
| Large document processing | 8K-16K | 1K-4K | Paper abstract, legal document analysis |
| Code generation and debugging | 0.3K-2K | 1K-5K | Function implementation, debugging |

Key metrics for model inference

| Metric | Description |
| --- | --- |
| Model precision | The numerical precision used in model weights and computation. A lower-precision version of the model occupies less GPU memory and consumes fewer resources, but its accuracy on complex tasks decreases. |
| Concurrency | The number of user requests processed at the same time. Higher concurrency indicates greater business capacity, but also increases GPU memory and GPU memory bandwidth usage. |
| Input length | The number of tokens in the prompts provided by users, which affects GPU memory usage. A large input length increases the time to first token (TTFT). |
| Output length | The number of tokens in the response text generated by the model, which affects GPU memory usage. An excessively large output length may lead to truncation or an out-of-memory (OOM) error. |
| TTFT | The time between when a user request is initiated and when the first output token is received, which affects the user experience. We recommend that you keep the TTFT under 1s and no more than 2s. |
| Time per output token (TPOT) | The average time required to generate each output token (excluding the first token), which reflects how well the generation speed matches the reading experience. We recommend that you keep the TPOT under 50 ms and no more than 100 ms. |
| Single-channel throughput | The token output rate per channel (tokens/s). A low single-channel throughput leads to a poor user experience. We recommend that you keep the value between 10 tokens/s and 30 tokens/s. |
| GPU memory usage | The percentage of GPU memory used at runtime, which consists of model parameters, the KV cache, and intermediate activation values. A high GPU memory usage (for example, greater than 95%) may cause an OOM error, which affects service availability. |
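These metrics are related by simple arithmetic, which is useful when you read the performance tables later in this topic. The following sketch uses hypothetical measured values (not results from this document) to derive TPOT and single-channel throughput from the total request latency, the TTFT, and the number of output tokens.

# Hypothetical measured values for a single request (not taken from this topic).
TTFT=0.6            # Time to first token, in seconds.
TOTAL_LATENCY=70    # End-to-end request latency, in seconds.
OUTPUT_TOKENS=1024  # Number of generated tokens.

# TPOT = (total latency - TTFT) / (output tokens - 1), converted to milliseconds.
awk -v t="$TOTAL_LATENCY" -v f="$TTFT" -v n="$OUTPUT_TOKENS" \
  'BEGIN { printf "TPOT: %.1f ms\n", (t - f) * 1000 / (n - 1) }'

# Single-channel throughput = output tokens / total latency, in tokens/s.
awk -v t="$TOTAL_LATENCY" -v n="$OUTPUT_TOKENS" \
  'BEGIN { printf "Single-channel throughput: %.1f tokens/s\n", n / t }'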

Best practices for deploying QwQ-32B in the edge cloud

The edge cloud provides heterogeneous computing resources of multiple specifications on globally distributed edge nodes to meet heterogeneous computing requirements in different scenarios. The GPU memory of a single GPU card ranges from 12 GB to 48 GB. The following sections describe the recommended configurations and inference performance for deploying QwQ-32B inference services at different model precisions in the edge cloud. A rough estimate of the GPU memory footprint at each precision follows the configuration and performance details below.

  • Recommended dual-card instance with 48 GB GPU memory for QwQ-32B FP16 precision

    • The dual-card instance with 48 GB GPU memory is a VM. The following table describes the configurations.

      | Environment parameter | Value |
      | --- | --- |
      | CPU | 96 cores |
      | Memory | 384 GB |
      | GPU | NVIDIA 48 GB * 2 |
      | Operating system | Ubuntu 22.04 |
      | Docker version | 26.1.3 |
      | GPU driver | Driver version: 570.124.06; CUDA version: 12.4 |
      | Inference framework | vLLM 0.7.2 |

    • Performance in different scenarios

      | Scenario | Input length | Output length | Concurrency | Single-channel throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | Mathematical logic inference and code generation | 1K | 4K | 4 | 14.5 | 0.6 | 67.4 | 95% |
      | Mathematical logic inference and code generation | 1K | 4K | 8 | 13.3 | 1.6 | 71.3 | 95% |
      | Knowledge-based conversational search | 4K | 1K | 2 | 14.2 | 1.8 | 68.6 | 95% |
      | Knowledge-based conversational search | 4K | 1K | 4 | 13 | 2.7 | 72.7 | 95% |
      | Multi-round conversation and long document processing | 4K | 4K | 2 | 14.64 | 1.7 | 71.5 | 95% |
      | Multi-round conversation and long document processing | 4K | 4K | 4 | 13.6 | 2.9 | 82.3 | 95% |

      • Mathematical logic inference and code generation:

        This scenario features short input and long output. The input length ranges from 0.3K to 2K tokens, and the output length ranges from 0.8K to 5K tokens.

        When the concurrency is 4, the single-channel throughput is close to 15 tokens/s, and the TTFT is less than 1s. This balances user experience and cost-effectiveness. When the concurrency is 8, a larger TTFT has a slight impact on the user experience, but is still acceptable. If you want to reduce the cost, you can increase the concurrency.

      • Knowledge-based conversational search:

        This scenario features long input and short output. The input length ranges from 1K to 4K tokens, and the output length ranges from 0.2K to 1K tokens.

        The optimal concurrency for an instance is 2. When the concurrency increases to 4, the TTFT is greater than 2s. Considering the network latency, the impact on user experience is still acceptable.

      • Multi-round conversation and long document processing:

        This scenario features long input and long output. The input length ranges from 2K to 16K tokens, and the output length ranges from 1K to 4K tokens.

        An increase in the input length not only increases the memory consumption, but also significantly affects TTFT. The optimal concurrency for an instance is 2. You can adjust the input length and concurrency based on your business situation.

  • Recommended five-card instance with 12 GB GPU memory for QwQ-32B INT4 precision

    • The five-card instance with 12 GB GPU memory is a bare metal instance. The following table describes the configurations.

      | Environment parameter | Value |
      | --- | --- |
      | CPU | 24 cores × 2, 3.0 to 4.0 GHz |
      | Memory | 256 GB |
      | GPU | NVIDIA 12 GB * 5 |
      | Operating system | Ubuntu 20.04 |
      | Docker version | 28.0.1 |
      | GPU driver | Driver version: 570.124.06; CUDA version: 12.4 |
      | Inference framework | vLLM 0.7.2 |

    • Performance in different scenarios

      The single-channel throughput of a five-card instance with 12 GB GPU memory can meet the performance requirements of both single- and multi-channel concurrency. However, the TTFT is not satisfactory because of the limited GPU memory of a single card. We recommend that you deploy mathematical logic inference and code generation services on this configuration. For scenarios with large input lengths, such as knowledge-based conversational search, multi-round conversation, and long document processing, we recommend that you use a dual-card instance with 48 GB GPU memory.

      | Scenario | Input length | Output length | Concurrency | Single-channel throughput (tokens/s) | TTFT (s) | TPOT (ms) | GPU memory usage |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | Mathematical logic inference and code generation | 1K | 4K | 2 | 37 | 1.3 | 26.4 | 96.5% |
      | Mathematical logic inference and code generation | 1K | 4K | 4 | 32.5 | 1.7 | 28.7 | 96.5% |
      | Mathematical logic inference and code generation | 1K | 4K | 8 | 24.6 | 3.5 | 61.5 | 96.5% |
      | Knowledge-based conversational search | 4K | 1K | 1 | 33.5 | 4.7 | 25.1 | 96.5% |
      | Multi-round conversation and long document processing | 4K | 4K | 1 | 35.8 | 4.7 | 26.6 | 96.5% |
      | Multi-round conversation and long document processing | 8K | 4K | 1 | 21.9 | 9.3 | 43.3 | 96.5% |

      • Mathematical logic inference and code generation

        When the concurrency is 2, the single-channel throughput can reach 37 tokens/s and the TTFT is 1.3s, which balances user experience and cost-effectiveness. When the concurrency is increased to 8, the impact on the user experience is greater. If you want better cost-effectiveness, you can increase the concurrency to 4.

      • Knowledge-based conversational search, multi-round conversation, and long document processing

        Due to the large input length, the GPU memory usage is high and the TTFT is close to 5s even when the concurrency is 1, which is not suitable for production use. However, this configuration can be used to build proof of concept (POC) environments.
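As referenced above, the recommended GPU configurations can be sanity-checked with a rough estimate of the memory required for the model weights alone. The following sketch assumes about 32 billion parameters, 2 bytes per parameter for FP16, and about 0.5 bytes per parameter for INT4. The KV cache and intermediate activations consume additional GPU memory on top of the weights, so treat this only as an approximation.

# Rough estimate of weight memory for a 32B-parameter model (approximation only).
PARAMS_B=32   # Model size in billions of parameters.

# FP16: 2 bytes per parameter, about 64 GB of weights, which is why a dual-card
# instance with 48 GB per card (96 GB in total) is recommended.
echo "FP16 weights: ~$((PARAMS_B * 2)) GB"

# INT4: about 0.5 bytes per parameter, about 16 GB of weights, which can be sharded
# across five 12 GB cards (60 GB in total), leaving room for the KV cache.
echo "INT4 weights: ~$((PARAMS_B / 2)) GB"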

Build a test environment

Create and initialize a dual-card instance with 48 GB GPU memory

Create an instance in the ENS console

  1. Log on to the ENS console.

  2. In the left-side navigation pane, choose Resources and Images > Instances.

  3. On the Instances page, click Create Instance. For information about how to configure the ENS instance parameters, see Create an instance.

    1. You can configure the parameters based on your business requirements. The following table describes the recommended configurations.

      | Step | Parameter | Recommended value |
      | --- | --- | --- |
      | Basic Configurations | Billing Method | Subscription |
      | Basic Configurations | Instance type | x86 Computing |
      | Basic Configurations | Instance Specification | NVIDIA 48GB * 2. For detailed specifications, contact your customer manager. |
      | Basic Configurations | Image | Ubuntu (ubuntu_22_04_x64_20G_alibase_20240926) |
      | Network and Storage | Network | Self-built Network |
      | Network and Storage | System Disk | Ultra Disk, 80 GB or more |
      | Network and Storage | Data Disk | Ultra Disk, 1 TB or more |
      | System Settings | Set Password | Custom Key or Key Pair |

    2. Confirm the order.

      After you complete the system settings, click Confirm Order in the lower-right corner. The system configures the instance based on your settings and displays the price. After you complete the payment, you are redirected to the ENS console.

      You can view the created instance in the ENS console. If the instance is in the Running state, the instance is available.

Create an instance by calling an operation

You can also call an API operation in the OpenAPI Portal to create an instance.

The following sample shows the reference request parameters. The inline annotations are comments and are not part of the JSON payload.

{
  "InstanceType": "ens.gnxxxx",                       // The instance type.
  "InstanceChargeType": "PrePaid",
  "ImageId": "ubuntu_22_04_x64_20G_alibase_20240926",
  "ScheduleAreaLevel": "Region",
  "EnsRegionId": "cn-your-ens-region",                // The edge node.
  "Password": "<YOURPASSWORD>",                       // The password.
  "InternetChargeType": "95BandwidthByMonth",
  "SystemDisk": {
    "Size": 80,
    "Category": "cloud_efficiency"
  },
  "DataDisk": [
    {
      "Category": "cloud_efficiency",
      "Size": 1024
    }
  ],
  "InternetMaxBandwidthOut": 5000,
  "Amount": 1,
  "NetWorkId": "n-xxxxxxxxxxxxxxx",
  "VSwitchId": "vsw-xxxxxxxxxxxxxxx",
  "InstanceName": "test",
  "HostName": "test",
  "PublicIpIdentification": true,
  "InstanceChargeStrategy": "instance"                // Billing based on instances.
}
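The sample above shows only the request parameters. As a hedged sketch, a call with the same parameters through the Alibaba Cloud CLI might look like the following. The operation name (RunInstances) and the exact parameter spelling are assumptions here; confirm them against the ENS API reference in the OpenAPI Portal before use.

# Hypothetical example of creating an ENS instance with the Alibaba Cloud CLI.
# Verify the operation name and parameter names in the OpenAPI Portal first.
aliyun ens RunInstances \
  --InstanceType ens.gnxxxx \
  --InstanceChargeType PrePaid \
  --ImageId ubuntu_22_04_x64_20G_alibase_20240926 \
  --ScheduleAreaLevel Region \
  --EnsRegionId cn-your-ens-region \
  --Password '<YOURPASSWORD>' \
  --InternetChargeType 95BandwidthByMonth \
  --SystemDisk.Size 80 \
  --SystemDisk.Category cloud_efficiency \
  --DataDisk.1.Size 1024 \
  --DataDisk.1.Category cloud_efficiency \
  --InternetMaxBandwidthOut 5000 \
  --Amount 1 \
  --NetWorkId n-xxxxxxxxxxxxxxx \
  --VSwitchId vsw-xxxxxxxxxxxxxxx \
  --InstanceName test \
  --HostName test \
  --PublicIpIdentification true \
  --InstanceChargeStrategy instance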

Log on to the instance and initialize the disk

Log on to the instance

For more information about how to log on to an instance, see Connect to an instance.

Initialize the disk

  1. Expand the root directory.

    After you create or resize an instance, you need to expand the root partition online without restarting the instance.

    # Install the cloud environment toolkit.
    sudo apt-get update
    sudo apt-get install -y cloud-guest-utils
    
    # Ensure that the GPT partitioning tool sgdisk exists.
    type sgdisk || sudo apt-get install -y gdisk
    
    # Expand the physical partition.
    sudo LC_ALL=en_US.UTF-8 growpart /dev/vda 3
    
    # Resize the file system.
    sudo resize2fs /dev/vda3
    
    # Verify the resizing result.
    df -h


  2. Mount the data disk.

    You need to format and mount the data disk. The following section provides the sample commands. Run the commands based on your business requirements.

    # Identify the new disk.
    lsblk
    
    # Format the disk without partitioning it.
    sudo mkfs -t ext4 /dev/vdb
    
    # Configure the mount.
    sudo mkdir /data
    echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb) /data ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab
    
    # Verify the mount.
    sudo mount -a
    df -hT /data
    
    # Modify permissions of the mount point.
    sudo chown $USER:$USER /data


    Note

    If you want to create an image based on the instance, you must delete the data disk entry (the line ending with ext4 defaults,nofail 0 0) from the /etc/fstab file. Otherwise, instances created from the image cannot start.
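    A minimal sketch of removing that entry before you create the image, assuming the entry was added exactly as in the mount command above:

    # Remove the data disk entry that was appended to /etc/fstab above, so that
    # instances created from the image can start without the data disk.
    sudo sed -i '\|/data ext4 defaults,nofail 0 0|d' /etc/fstab

    # Verify that the entry is gone.
    grep /data /etc/fstab || echo "Data disk entry removed."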

Install the vLLM inference environment

Install CUDA

For more information about how to install CUDA, see CUDA Toolkit 12.4 Downloads.

# Install the CUDA Toolkit.
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_570.124.06_linux.run
chmod +x cuda_12.4.0_570.124.06_linux.run

# This step takes some time, and you need to interact with the installer interface.
sudo sh cuda_12.4.0_570.124.06_linux.run

# Add environment variables. Append the following two export lines to ~/.bashrc.
vim ~/.bashrc
export PATH="$PATH:/usr/local/cuda/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
source ~/.bashrc

# Verify whether the installation is successful.
nvcc -V
nvidia-smi

Install auxiliary software (optional)

uv is a convenient manager for Python virtual environments and dependencies, and is suitable for clusters that need to run multiple models. For more information about how to install uv, see Installing uv.

# Install uv. By default, uv is installed in ~/.local/bin/.
curl -LsSf https://astral.sh/uv/install.sh | sh

# Append the following line to ~/.bashrc, then reload it.
export PATH="$PATH:~/.local/bin"

source ~/.bashrc

# Create a clean venv environment
uv venv myenv --python 3.12 --seed
source myenv/bin/activate

If the CUDA environment variables that you configured become invalid after you install uv, and nvcc or nvidia-smi cannot be found, perform the following operations:

# Edit the activation script of the virtual environment.
vim myenv/bin/activate

# Add the following lines after the existing "export PATH" line in the script:
export PATH="$PATH:/usr/local/cuda/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"

# Install vLLM and ModelScope.
uv pip install vllm==0.7.2
uv pip install modelscope

# GPU monitoring tool. You can also use the default System Management Interface (nvidia-smi) provided by NVIDIA.
uv pip install nvitop
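Before you download the model, you can run a quick check to confirm that the environment is ready. This is a minimal sketch that only verifies that vLLM can be imported and that PyTorch sees both GPUs.

# Verify that vLLM is importable and that PyTorch detects the expected number of GPUs.
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import torch; print('Visible GPUs:', torch.cuda.device_count())"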

Download the QwQ-32B model and vLLM benchmark script

# Download the model to the data disk /data to avoid errors caused by insufficient space.
mkdir -p /data/Qwen/QwQ-32B
cd /data/Qwen/QwQ-32B
modelscope download --model Qwen/QwQ-32B --local_dir .

# Optional. Download the dataset.
wget https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/resolve/master/ShareGPT_V3_unfiltered_cleaned_split.json

# Install git as needed.
sudo apt-get update
sudo apt-get install -y git

# Download vllm.git, which contains the test script.
git clone https://github.com/vllm-project/vllm.git
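Before you start the server, it is worth confirming that the model weights were downloaded completely. The following is a minimal check; the exact file list depends on the ModelScope release of the model.

# Check that the weight shards and configuration file exist, and that the total size
# is plausible (tens of GB for the FP16 weights).
ls /data/Qwen/QwQ-32B/*.safetensors /data/Qwen/QwQ-32B/config.json
du -sh /data/Qwen/QwQ-32B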

Test the model online

Start the vLLM server

vllm serve /data/Qwen/QwQ-32B/ \
  --host 127.0.0.1 \
  --port 8080 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --served-model-name qw \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --max-num-batched-tokens 8192 \
  --max-model-len 8192 \
  --enable-prefix-caching 
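Before you run the benchmark, you can send a single request to confirm that the server responds. The following sketch assumes that the server was started with the command above (served model name qw, port 8080) and uses vLLM's OpenAI-compatible chat completions endpoint.

# Send one chat completion request to the OpenAI-compatible endpoint.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qw",
        "messages": [{"role": "user", "content": "Briefly explain what a KV cache is."}],
        "max_tokens": 128
      }'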

Start the test

python3 ./vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --served-model-name qw \
  --model /data/Qwen/QwQ-32B \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 4096 \
  --random-range-ratio 1 \
  --max-concurrency 4 \
  --num-prompts 10 \
  --host 127.0.0.1 \
  --port 8080 \
  --save-result \
  --result-dir /data/logs/ \
  --result-filename QwQ-32B-4-1-4.log
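To reproduce multiple rows of the performance tables in this topic, you can wrap the benchmark in a small loop over concurrency levels. This is a sketch for the 1K-input and 4K-output scenario; adjust the input and output lengths and the concurrency list to your own scenario.

# Sweep several concurrency levels and save one result file per run.
for CONC in 2 4 8; do
  python3 ./vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --served-model-name qw \
    --model /data/Qwen/QwQ-32B \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 4096 \
    --random-range-ratio 1 \
    --max-concurrency "$CONC" \
    --num-prompts $((CONC * 5)) \
    --host 127.0.0.1 \
    --port 8080 \
    --save-result \
    --result-dir /data/logs/ \
    --result-filename "QwQ-32B-concurrency-${CONC}.log"
done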

Finish the test

When the benchmark finishes, benchmark_serving.py prints a summary of the measured metrics, including request throughput, TTFT, and TPOT, and saves the detailed results to the file specified by --result-filename in the --result-dir directory.
