BladeLLM performs model quantization through the blade_llm_quantize command-line tool. The quantized model can be directly used for inference and deployment with BladeLLM. This topic describes the configuration parameters supported by blade_llm_quantize.
Usage sample
BladeLLM performs model quantization by running the blade_llm_quantize command. A sample command is as follows:
blade_llm_quantize \
--model Qwen/Qwen-7B-Chat \
--output_dir Qwen-7B-Chat-int8 \
--quant_algo minmax \
--quant_mode weight_only_quant \
--bit 8

The quantized model can be directly used for inference and deployment with BladeLLM. For more information, see Service deployment parameters.
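For reference, a rough sketch of such a deployment is shown below. The blade_llm_server entry point and the flags used here are assumptions based on the service deployment documentation, not parameters defined in this topic:

blade_llm_server \
--model Qwen-7B-Chat-int8 \
--port 8081

The --model flag points at the output_dir produced by blade_llm_quantize; see Service deployment parameters for the authoritative list of serving options.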
Parameter description
Here are the parameters supported by blade_llm_quantize:
Parameter | Type | Required | Description |
model | str | Yes | The directory that contains the original floating-point model. |
output_dir | str | Yes | The directory to store the quantized model. |
bit | int | No | The number of quantization bits. Valid values: [8, 4]. Default value: 8. |
quant_mode | str | No | The quantization mode. Valid values include weight_only_quant, which is used in the sample command above. |
quant_dtype | str | No | Specifies whether the model is quantized to an integer type or a floating-point type. |
quant_algo | str | No | The quantization algorithm. Valid values: minmax, gptq, awq, smoothquant, smoothquant+, and smoothquant_gptq. The minmax algorithm does not require calibration data. Default value: minmax. |
block_wise_quant | bool | No | Specifies whether to enable block-wise quantization (also known as sub-channel quantization). The default block_size is 64 (equivalent to the group_size parameter of the gptq algorithm). Currently, this option is supported only when quant_mode is set to weight_only_quant. |
calib_data | list of str | No | The calibration data. Some quantization algorithms (such as gptq) require calibration data for weight fine-tuning and other processing. You can directly pass the text used for calibration in a list. Default value: ['hello world!']. |
calib_data_file | str | No | The calibration data, passed in a JSONL file; one possible file layout is sketched after this table. Default value: None. |
cpu_offload | bool | No | If the current GPU memory is insufficient to load the floating-point model for quantization, which may cause out-of-memory (OOM) errors, enable this option to offload some parameters to the CPU during quantization. Default value: False. |
max_gpu_memory_utilization | float | No | Takes effect only when cpu_offload is True. Controls the maximum fraction of GPU memory assumed when planning CPU offloading; a smaller value places more model layers on the CPU. If OOM errors still occur after cpu_offload is enabled, reduce this value. Default value: 0.9. |
fallback_ratio | float | No | Specifies the proportion of layers that fall back to pre-quantization floating-point computation. Setting fallback_ratio > 0 enables automatic mixed-precision quantization: the quantization sensitivity of each layer is calculated, and the specified proportion of layers is rolled back to floating point to improve quantization accuracy. Try this parameter when the initial quantization accuracy does not meet your requirements. Default value: 0.0. |
tokenizer_dir | str | No | Specifies the tokenizer directory. If not specified, it is the same as the model directory. Default value: None. |
tensor_parallel_size | int | No | Specifies the degree of tensor parallelism. If the original floating-point model must be loaded across multiple GPUs, set this parameter to the number of GPUs to use. Default value: 1. |
pipeline_parallel_size | int | No | Specifies the degree of pipeline parallelism. If the original floating-point model must be loaded across multiple GPUs, set this parameter to the required number of pipeline stages. Default value: 1. |
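The exact record format expected by calib_data_file is not spelled out in the table above. As a minimal sketch, assuming each JSONL line is a JSON object that holds one piece of calibration text under a text field (the field name is a hypothetical placeholder), a calibration file could be prepared as follows:

# Write two calibration samples, one JSON object per line
cat > calib.jsonl <<'EOF'
{"text": "hello world!"}
{"text": "BladeLLM quantizes large language models for efficient inference."}
EOF

Pass the file with --calib_data_file calib.jsonl when the chosen quant_algo (for example, gptq) requires calibration data.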
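To show how several of these parameters combine, the following is a hedged sketch of a 4-bit GPTQ weight-only quantization run spread across two GPUs. The output directory name and calib.jsonl are placeholders, and the calibration file layout follows the assumption sketched above:

blade_llm_quantize \
--model Qwen/Qwen-7B-Chat \
--output_dir Qwen-7B-Chat-int4-gptq \
--quant_algo gptq \
--quant_mode weight_only_quant \
--bit 4 \
--calib_data_file calib.jsonl \
--tensor_parallel_size 2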