Platform for AI: BladeLLM model quantization

Last Updated: Dec 09, 2025

BladeLLM provides efficient and easy-to-use quantization features for large language models (LLMs), including weight-only quantization (weight_only_quant) and joint quantization of weights and activations (act_and_weight_quant). It integrates several mainstream, effective quantization algorithms, such as GPTQ, AWQ, and SmoothQuant, and supports multiple quantization data types, including INT8, INT4, and FP8. This topic describes how to perform model quantization.

Background information

  • Existing issues

    With the rapid development of LLM technology and applications, growing parameter counts and context lengths pose major challenges for inference deployment.

    • Excessive GPU memory consumption: Loading the model weights alone requires substantial GPU memory, and the KV cache, whose size grows with the sequence length and hidden dimension, requires additional GPU memory.

    • Service throughput and latency issues: High GPU memory consumption constrains the batch size during LLM inference, which limits the overall throughput of LLM services. Growing model and context sizes also increase the amount of inference computation, which slows text generation. Together with the batch size constraint, this causes requests to queue under high-concurrency loads and worsens response latency.

  • Solutions

    Compressing model weights and the computation (KV) cache effectively reduces GPU memory consumption during deployment and raises the upper limit on the inference batch size, which improves overall service throughput. In addition, INT8/INT4 quantization reduces the amount of data read from GPU memory during computation, alleviating the memory-bandwidth bottleneck in LLM inference. Further acceleration can be achieved by using INT8/INT4 hardware compute.

    BladeLLM combines quantization algorithms with system-level optimizations to deliver a quantization solution that is both feature-complete and high-performance. The quantization tool provides flexible ways to supply calibration data and supports multi-GPU model quantization. BladeLLM also provides CPU offload, which enables model quantization when GPU memory is limited, and automatic mixed precision, which tunes quantization accuracy by falling back some quantized computations.

Create a quantization task

You can deploy an elastic job service in Elastic Algorithm Service (EAS) of Platform for AI (PAI) to run model quantization calibration and conversion as a job. Perform the following steps:

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section of the Deploy Service page, click Custom Deployment.

  3. On the Custom Deployment page, configure the key parameters. For information about other parameters, see Custom deployment.

    Basic Information

    • Service Name: Specify a name for the service, such as bladellm_quant.

    Environment Information

    • Deployment Method: Select Image-based Deployment.

    • Image Configuration: In the Alibaba Cloud Image list, select blade-llm > blade-llm:0.11.0.

      Note: The image version is frequently updated. We recommend that you select the latest version.

    • Model Settings: Mount the model to be quantized. This example uses OSS mounting; you can also choose other mounting methods. Click OSS and configure the following parameters:

      • Uri: Select the OSS directory that stores the model to be quantized. For information about how to create an Object Storage Service (OSS) directory and upload files, see Get started with the OSS console.

      • Mount Path: Configure the destination path on the service instance, such as /mnt/model.

      • Enable Read-only Mode: Turn off this feature.

    • Command: Configure the model quantization command, for example: blade_llm_quantize --model /mnt/model/Qwen2-1.5B --output_dir /mnt/model/Qwen2-1.5B-qt/ --quant_algo gptq --calib_data ['hello world!']

      Parameters:

      • --model: the input path of the model to be quantized.

      • --output_dir: the output path of the quantized model.

      • --quant_algo: the quantization algorithm. The default algorithm is MinMax; if you use the default algorithm, you do not need to specify --calib_data.

      The input and output paths must match the OSS path mounted in Model Settings. For information about other configurable quantization parameters, see Model quantization parameters.

    • Port Number: After you select an image, the system automatically configures port 8081. No manual modification is required.

    Resource Information

    • Resource Type: In this example, select Public Resources. You can select other resource types based on your business requirements.

    • Deployment Resources: Select an instance type for running the BladeLLM quantization command. The GPU memory of the selected instance type only needs to be slightly larger than the memory required by the model, which can be estimated as parameter count × bytes per parameter. For example, an FP16 model with 7 billion parameters requires approximately 7 billion × 2 bytes ≈ 14 GB of GPU memory (see the estimation sketch after this procedure).

    Features

    • Task Mode: Turn on the switch to create an EAS elastic job service.

  4. After configuring the parameters, click Deploy.

    When the Service Status changes to Completed, model quantization is complete. You can view the generated quantized model in the output path in OSS (your data source), and deploy the model on EAS by referring to Get started with BladeLLM.
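
For reference, the following Python sketch reproduces the GPU memory estimate used above for choosing Deployment Resources. It accounts only for the model weights (parameter count × bytes per parameter) and ignores activations and the KV cache; it is an illustrative calculation, not part of the BladeLLM tooling.

```python
# Rough estimate of the GPU memory needed to hold model weights only.
# Illustrative sketch; activations and the KV cache need additional memory.

BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (decimal) for the given data type."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

if __name__ == "__main__":
    params = 7e9  # an FP16 model with 7 billion parameters
    for dtype in ("fp16", "int8", "int4"):
        print(f"{dtype}: ~{weight_memory_gb(params, dtype):.1f} GB")
    # fp16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB
```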

Introduction to quantization techniques

This section describes the quantization modes and quantization algorithms provided by the BladeLLM quantization tool, their application scenarios, and usage notes for some parameters.

Quantization modes

BladeLLM supports two quantization modes, weight_only_quant and act_and_weight_quant, which you can specify by using the quant_mode parameter. A conceptual sketch that contrasts the two modes follows the list below.

  • weight_only_quant

    • Definition: quantizes only model weights.

    • Characteristics: Compared with act_and_weight_quant, weight_only_quant usually preserves model accuracy more easily. In some cases, the main bottleneck for LLM deployment is GPU memory bandwidth; in these cases, weight_only_quant offers a good balance between model performance and accuracy.

    • Supported data types: supports 8-bit and 4-bit. Both are signed symmetric quantization by default.

    • Quantization granularity: supports per-channel quantization and block-wise quantization.

    • Supported algorithms: includes minmax, gptq, awq, and smoothquant+.

  • act_and_weight_quant

    • Definition: quantizes both model weights and activation values simultaneously.

    • Characteristics: Compared with weight_only_quant, it enables true low-bit dense computation, which significantly improves operator execution speed.

    • Supported data types: supports 8-bit. It is signed symmetric quantization by default.

    • Quantization granularity: uses per-token dynamic quantization for activation values and per-channel static quantization for weights. Block-wise quantization is not currently supported.

    • Supported algorithms: includes minmax, smoothquant, and smoothquant_gptq.
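
To make the difference between the two modes concrete, the following numpy sketch contrasts them on a single linear layer. It is an illustrative simplification (symmetric INT8, per-output-channel weight scales, per-token activation scales), not BladeLLM's implementation.

```python
import numpy as np

# One linear layer y = x @ W, quantized in the two modes described above.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations (4 tokens)
W = rng.standard_normal((64, 32)).astype(np.float32)  # weights

# Per-output-channel INT8 weight quantization (shared by both modes).
w_scale = np.abs(W).max(axis=0) / 127.0
W_q = np.clip(np.round(W / w_scale), -127, 127).astype(np.int8)

# weight_only_quant: weights are stored in INT8 (saving memory and bandwidth),
# but they are dequantized and the matmul still runs in floating point.
y_weight_only = x @ (W_q.astype(np.float32) * w_scale)

# act_and_weight_quant: activations are quantized per token as well,
# so the matmul itself can run as a low-bit (INT8) dense computation.
x_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
y_acc = x_q.astype(np.int32) @ W_q.astype(np.int32)   # INT8 GEMM, INT32 accumulation
y_act_and_weight = y_acc.astype(np.float32) * x_scale * w_scale

y_ref = x @ W
print(np.abs(y_weight_only - y_ref).max(), np.abs(y_act_and_weight - y_ref).max())
```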

The quantization modes support the following hardware and quantization data types.

| Hardware type | GPU | INT8 (weight_only_quant) | INT4 (weight_only_quant) | FP8 (weight_only_quant) | INT8 (act_and_weight_quant) | FP8 (act_and_weight_quant) |
| --- | --- | --- | --- | --- | --- | --- |
| Ampere (SM80/SM86) | GU100/GU30 | Y | Y | N | Y | N |
| Ada Lovelace (SM89) | L20 | N | N | N | Y | Y |
| Hopper (SM90) | GU120/GU108 | N | N | N | Y | Y |

block_wise_quant

Previously, model parameter quantization typically used per-channel quantization, in which all parameters of an output channel share one set of quantization parameters. For weight_only_quant, to further reduce quantization loss, many quantization methods now adopt a finer-grained setting that divides each output channel into multiple small blocks, each with its own quantization parameters. BladeLLM fixes the block size at 64, meaning that every 64 parameters share one set of quantization parameters.

Example: for a weight matrix with K input channels and N output channels:

  • If per-channel quantization is used, there are N sets of quantization parameters (one per output channel).

  • If block-wise quantization is used, each output channel is divided into K/64 blocks, so there are (K/64) × N sets of quantization parameters.

This example shows that block-wise quantization uses finer-grained quantization parameters and can theoretically achieve better quantization accuracy (especially for 4-bit weight-only quantization), but slightly reduces the inference performance of the quantized model.
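
The following numpy sketch, provided for illustration only, shows how the number of quantization parameters (scales) differs between per-channel and block-wise quantization with the fixed block size of 64. The matrix dimensions are arbitrary.

```python
import numpy as np

# A weight matrix with K input channels and N output channels (arbitrary sizes).
K, N = 256, 512
W = np.random.default_rng(0).standard_normal((K, N)).astype(np.float32)

# Per-channel quantization: one scale per output channel -> N scales.
per_channel_scale = np.abs(W).max(axis=0) / 127.0
print(per_channel_scale.shape)   # (512,), i.e. N

# Block-wise quantization: each output channel is split into K/64 blocks of 64
# weights, so every 64 parameters share one scale -> (K/64) x N scales.
block = 64
block_wise_scale = np.abs(W.reshape(K // block, block, N)).max(axis=1) / 127.0
print(block_wise_scale.shape)    # (4, 512), i.e. K/64 x N
```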

Quantization algorithms

BladeLLM provides several quantization algorithms that can be specified by using the quant_algo parameter.

  • MinMax

    MinMax is a simple and direct quantization algorithm, using round-to-nearest (RTN).

    This algorithm is suitable for both weight_only_quant and act_and_weight_quant. It does not require calibration data, and the quantization process is fast.

  • GPTQ

    GPTQ is a weight-only quantization algorithm that uses approximate second-order information to fine-tune quantized weights; it maintains quantization accuracy well and has a relatively efficient quantization process. GPTQ quantizes parameters sequentially within each channel (or block). After each parameter is quantized, GPTQ uses the inverse of the Hessian matrix computed from activation values to adjust the remaining parameters in that channel (or block) to compensate for the accuracy loss caused by quantization.

    This algorithm supports block-wise quantization, requires a certain amount of calibration data, and in most cases has better quantization accuracy than the MinMax algorithm.

  • AWQ

    AWQ is an activation-aware weight-only quantization algorithm. It observes that weights are not equally important: a small fraction (0.1%-1%) of salient weights matters most, and skipping quantization of these parameters significantly reduces quantization loss. Experiments show that the weight channels associated with larger activation values are more important, so salient channels are selected based on the distribution of activation values. Specifically, AWQ reduces the quantization loss of important weights by multiplying them by a relatively large scaling factor before quantization.

    This algorithm supports block-wise quantization, requires a certain amount of calibration data, and in some cases has better accuracy than GPTQ, but takes longer for quantization calibration. For example, when calibrating Qwen-72B on 4 V100-32G GPUs, gptq takes about 25 minutes, while awq takes about 100 minutes.

  • SmoothQuant

    SmoothQuant is an effective post-training quantization algorithm for improving W8A8 quantization of LLMs and is a typical act-and-weight quantization method. Activation values are generally considered harder to quantize than model weights, and outliers are the main challenge in activation quantization. SmoothQuant observes that outliers in LLM activations tend to appear consistently in certain channels and do not vary across tokens. Based on this finding, SmoothQuant applies a mathematically equivalent transformation that transfers quantization difficulty from activations to weights, smoothing the outliers in the activations (a sketch of this transformation follows this list).

    This algorithm requires a certain amount of calibration data, and in most cases has better quantization accuracy than the MinMax algorithm. Currently, it does not support block-wise quantization.

  • SmoothQuant+

    SmoothQuant+ is a weight-only quantization algorithm that reduces quantization loss by smoothing activation outliers. It observes that weight quantization errors are amplified by activation outliers, so SmoothQuant+ first smooths activation outliers along the channel dimension and adjusts the corresponding weights to preserve computational equivalence, and then performs normal weight-only quantization.

    This algorithm requires a certain amount of calibration data and supports block-wise quantization.

  • SmoothQuant-GPTQ

    SmoothQuant-GPTQ refers to using the GPTQ algorithm to quantize model parameters after smoothing outliers in activations using the SmoothQuant principle. This method can, to some extent, combine the advantages of SmoothQuant and GPTQ.
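
The following numpy sketch illustrates the equivalent transformation behind SmoothQuant (and the smoothing step of SmoothQuant+ and SmoothQuant-GPTQ): per-channel factors move activation outliers into the weights without changing the layer output. The migration-strength exponent of 0.5 is an illustrative choice, not a BladeLLM default.

```python
import numpy as np

# SmoothQuant's equivalence: x @ W == (x / s) @ (s[:, None] * W).
# Dividing activations by per-channel factors s shrinks their outliers,
# transferring part of the quantization difficulty to the weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)).astype(np.float32)
x[:, 3] *= 30.0                    # simulate an outlier channel
W = rng.standard_normal((16, 4)).astype(np.float32)

alpha = 0.5                        # migration strength (illustrative value)
s = np.abs(x).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

x_smooth = x / s                   # activations become easier to quantize
W_smooth = s[:, None] * W          # weights absorb the per-channel factors

print(np.allclose(x @ W, x_smooth @ W_smooth, atol=1e-3))  # True: output unchanged
print(np.abs(x).max(), np.abs(x_smooth).max())             # outlier magnitude is reduced
```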

The following table describes the basic support for various quantization algorithms.

| Quantization algorithm | INT8 (weight_only_quant) | INT4 (weight_only_quant) | FP8 (weight_only_quant) | INT8 (act_and_weight_quant) | FP8 (act_and_weight_quant) | Requires calibration data |
| --- | --- | --- | --- | --- | --- | --- |
| minmax | Y | Y | N | Y | Y | N |
| gptq | Y | Y | N | N | N | Y |
| awq | Y | Y | N | N | N | Y |
| smoothquant | N | N | N | Y | Y | Y |
| smoothquant+ | Y | Y | N | N | N | Y |
| smoothquant_gptq | N | N | N | Y | Y | Y |

weight_only_quant supports block-wise quantization; act_and_weight_quant does not.

Recommendations for choosing quantization algorithms:

  • If you want to experiment quickly, we recommend that you try the MinMax algorithm, which requires no calibration data and has a fast quantization process.

  • If the MinMax quantization accuracy does not meet your requirements:

    • For weight-only quantization, you can further try gptq, awq, or smoothquant+.

    • For act_and_weight quantization, you can further try smoothquant or smoothquant_gptq.

    Among these, awq and smoothquant+ take longer for the quantization process compared with gptq, but in some cases have smaller quantization accuracy loss. gptq, awq, smoothquant, and smoothquant+ all require calibration data.

  • If you want to better maintain quantization accuracy, we recommend that you enable block-wise quantization. Block-wise quantization may cause a slight decrease in the performance of the quantized model, but it usually significantly improves quantization accuracy in 4-bit weight-only quantization scenarios.

  • If the above quantization accuracy still does not meet your requirements, you can enable automatic mixed precision quantization by setting the fallback_ratio parameter, which specifies the proportion of quantized layers to fall back to floating-point computation. BladeLLM automatically calculates the quantization sensitivity of each layer and falls back the specified proportion of layers to improve quantization accuracy, as shown in the example below.
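
For example, a mixed-precision quantization command might look like the following. This is a sketch that assumes the parameter is passed on the command line as --fallback_ratio, in line with the command format shown earlier, and that a fractional value such as 0.1 (fall back 10% of the quantized layers) is accepted; see Model quantization parameters for the exact flag and value range.

blade_llm_quantize --model /mnt/model/Qwen2-1.5B --output_dir /mnt/model/Qwen2-1.5B-qt/ --quant_algo gptq --calib_data ['hello world!'] --fallback_ratio 0.1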