BladeLLM performs model quantization through the blade_llm_quantize command-line tool. The quantized model can be directly used for inference and deployment with BladeLLM. This topic describes the configuration parameters supported by blade_llm_quantize.
Usage sample
BladeLLM performs model quantization by running the blade_llm_quantize command. A sample command is as follows:
blade_llm_quantize \
--model Qwen/Qwen-7B-Chat \
--output_dir Qwen-7B-Chat-int8 \
--quant_algo minmax \
--quant_mode weight_only_quant \
--bit 8

The quantized model can be directly used for inference and deployment with BladeLLM. For more information, see Service deployment parameters.
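For reference, a rough sketch of such a deployment is shown below. The blade_llm_server entry point and the flags used here are assumptions based on the service deployment documentation, not parameters defined in this topic:

blade_llm_server \
--model Qwen-7B-Chat-int8 \
--port 8081

The --model flag points at the output_dir produced by blade_llm_quantize; see Service deployment parameters for the authoritative list of serving options.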
Parameter description
Here are the parameters supported by blade_llm_quantize:
Parameter | Type | Required | Description |
model | str | Yes | The directory that contains the original floating-point model. |
output_dir | str | Yes | The directory to store the quantized model. |
bit | int | No | The number of quantization bits. Valid values: [8, 4]. Default value: 8. |
quant_mode | str | No | The quantization mode. Valid values include weight_only_quant, which is used in the sample command above. |
quant_dtype | str | No | Specifies whether the model is quantized to an integer type or a floating-point type. |
quant_algo | str | No | The quantization algorithm. Valid values: minmax, gptq, awq, smoothquant, smoothquant+, and smoothquant_gptq. The minmax algorithm does not require calibration data. Default value: minmax. |
block_wise_quant | bool | No | Specifies whether to enable block-wise quantization (also known as sub-channel quantization). The default block_size is 64 (equivalent to the group_size parameter of the gptq algorithm). Currently, this option is supported only when quant_mode is set to weight_only_quant. |
calib_data | list of str | No | The calibration data. Some quantization algorithms (such as gptq) require calibration data for weight fine-tuning and other processing. You can directly pass the text used for calibration in a list. Default value: ['hello world!']. |
calib_data_file | str | No | The calibration data, passed in a JSONL file; one possible file layout is sketched after this table. Default value: None. |
cpu_offload | bool | No | If the current GPU memory is insufficient to load the floating-point model for quantization, which may cause out-of-memory (OOM) errors, enable this option to offload some parameters to the CPU during quantization. Default value: False. |
max_gpu_memory_utilization | float | No | Takes effect only when cpu_offload is True. Controls the maximum fraction of GPU memory assumed when planning CPU offloading; a smaller value places more model layers on the CPU. If OOM errors still occur after cpu_offload is enabled, reduce this value. Default value: 0.9. |
fallback_ratio | float | No | Specifies the proportion of layers that fall back to pre-quantization floating-point computation. Setting fallback_ratio > 0 enables automatic mixed-precision quantization: the quantization sensitivity of each layer is calculated, and the specified proportion of layers is rolled back to floating point to improve quantization accuracy. Try this parameter when the initial quantization accuracy does not meet your requirements. Default value: 0.0. |
tokenizer_dir | str | No | Specifies the tokenizer directory. If not specified, it is the same as the model directory. Default value: None. |
tensor_parallel_size | int | No | Specifies the degree of tensor parallelism. If the original floating-point model must be loaded across multiple GPUs, set this parameter to the number of GPUs to use. Default value: 1. |
pipeline_parallel_size | int | No | Specifies the degree of pipeline parallelism. If the original floating-point model must be loaded across multiple GPUs, set this parameter to the required number of pipeline stages. Default value: 1. |
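The exact record format expected by calib_data_file is not spelled out in the table above. As a minimal sketch, assuming each JSONL line is a JSON object that holds one piece of calibration text under a text field (the field name is a hypothetical placeholder), a calibration file could be prepared as follows:

# Write two calibration samples, one JSON object per line
cat > calib.jsonl <<'EOF'
{"text": "hello world!"}
{"text": "BladeLLM quantizes large language models for efficient inference."}
EOF

Pass the file with --calib_data_file calib.jsonl when the chosen quant_algo (for example, gptq) requires calibration data.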
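To show how several of these parameters combine, the following is a hedged sketch of a 4-bit GPTQ weight-only quantization run spread across two GPUs. The output directory name and calib.jsonl are placeholders, and the calibration file layout follows the assumption sketched above:

blade_llm_quantize \
--model Qwen/Qwen-7B-Chat \
--output_dir Qwen-7B-Chat-int4-gptq \
--quant_algo gptq \
--quant_mode weight_only_quant \
--bit 4 \
--calib_data_file calib.jsonl \
--tensor_parallel_size 2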