
Platform For AI: Model quantization parameters

Last Updated: May 28, 2025

BladeLLM performs model quantization through the blade_llm_quantize command. The quantized model can be used directly for inference and deployment with BladeLLM. This topic describes the configuration parameters supported by blade_llm_quantize.

Usage sample

BladeLLM quantizes a model when you run the blade_llm_quantize command. Sample command:

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-int8 \
    --quant_algo minmax \
    --quant_mode weight_only_quant \
    --bit 8

The quantized model can be used directly for inference and deployment with BladeLLM. For more information, see Service deployment parameters.

Parameter description

The following parameters are supported by blade_llm_quantize. Each entry lists the parameter's type, whether it is required, and its description.

model (str, required): The directory that contains the original floating-point model.

output_dir (str, required): The directory in which to store the quantized model.

bit (int, optional): The number of quantization bits. Valid values: 8 and 4. Default value: 8.

quant_mode (str, optional): The quantization mode. Valid values:
  • weight_only_quant (default)
  • act_and_weight_quant
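
A weight-and-activation quantization run might look like the following. This is an illustrative sketch, not a verified recipe: it assumes that the smoothquant algorithm can be combined with the act_and_weight_quant mode, and calib.jsonl is a placeholder calibration file (see calib_data_file below).

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-sq-int8 \
    --quant_algo smoothquant \
    --quant_mode act_and_weight_quant \
    --bit 8 \
    --calib_data_file calib.jsonl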

quant_dtype (str, optional): Specifies whether the model is quantized to an integer type or a floating-point type. Valid values:
  • int (default): The integer type is determined together with the bit parameter. For example, bit=8 quantizes the model to int8.
  • fp8: Quantizes the model to the fp8 e4m3 type, which is equivalent to specifying fp8_e4m3. This type is recommended for fp8 quantization.
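
For example, fp8 quantization might be requested as follows. This is an illustrative sketch based only on the parameters documented on this page; the output directory name is arbitrary.

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-fp8 \
    --quant_mode weight_only_quant \
    --quant_dtype fp8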

quant_algo (str, optional): The quantization algorithm. Valid values: minmax, gptq, awq, smoothquant, smoothquant+, and smoothquant_gptq. The minmax algorithm does not require calibration data. Default value: minmax.

block_wise_quant (bool, optional): Specifies whether to enable block-wise quantization, also known as sub-channel quantization. The default block_size is 64, which corresponds to the group_size parameter of the gptq algorithm. Currently, this option is supported only when quant_mode is set to weight_only_quant. Valid values:
  • False (default)
  • True
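
For example, 4-bit weight-only quantization with block-wise quantization enabled might look like the following sketch. It assumes that the boolean value is passed explicitly as True; the exact boolean syntax may differ in your version.

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-int4-blockwise \
    --quant_algo minmax \
    --quant_mode weight_only_quant \
    --bit 4 \
    --block_wise_quant True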

calib_data (list of str, optional): The calibration data. Some quantization algorithms, such as gptq, require calibration data for weight fine-tuning and related processing. You can pass the calibration text directly as a list. Default value: ['hello world!'].

calib_data_file (str, optional): The calibration data, passed as a JSONL file. Format:
  • Each line in the file contains one data entry in the format {"text": "hello world!"}. For VL models, use the format {"content": [{"image": "https://xxx/demo.jpg"}, {"text": "What is this?"}]}.
  • Provide 100 to 1,000 entries that are close to the actual application scenario. Keep the entries reasonably short to avoid unnecessarily long quantization time.
Default value: None.
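
For example, calibration-based quantization with the gptq algorithm might look like the following sketch; calib.jsonl is a placeholder for a JSONL file in the format described above, and the output directory name is arbitrary.

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-gptq-int8 \
    --quant_algo gptq \
    --quant_mode weight_only_quant \
    --bit 8 \
    --calib_data_file calib.jsonl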

cpu_offload (bool, optional): Specifies whether to offload some parameters to the CPU during quantization. Enable this option if the available GPU memory is insufficient to load the floating-point model for quantization and out-of-memory (OOM) errors may occur. Default value: False.

max_gpu_memory_utilization (float, optional): Takes effect only when cpu_offload is True. Controls the maximum fraction of GPU memory assumed when estimating how much to offload. A smaller value offloads more model layers to the CPU. If OOM errors still occur after cpu_offload is enabled, reduce this value. Default value: 0.9.
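
For example, when quantization runs out of GPU memory, offloading might be enabled as in the following sketch. It assumes boolean values are passed explicitly as True; the 0.8 ratio is an arbitrary illustration.

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-int8 \
    --quant_algo minmax \
    --quant_mode weight_only_quant \
    --bit 8 \
    --cpu_offload True \
    --max_gpu_memory_utilization 0.8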

fallback_ratio (float, optional): The proportion of layers to fall back to the original floating-point computation. When fallback_ratio > 0, automatic mixed-precision quantization is enabled: the quantization sensitivity of each layer is measured, and the specified proportion of layers is rolled back to floating point to improve quantization accuracy. Try this parameter when the initial quantization accuracy does not meet your requirements. Default value: 0.0.
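
For example, rolling back roughly 10% of the most quantization-sensitive layers to floating point might look like the following sketch; the 0.1 ratio and output directory name are arbitrary illustrations.

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-int8-fallback \
    --quant_algo minmax \
    --quant_mode weight_only_quant \
    --bit 8 \
    --fallback_ratio 0.1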

tokenizer_dir (str, optional): The tokenizer directory. If not specified, the model directory is used. Default value: None.

tensor_parallel_size (int, optional): The degree of tensor parallelism. If the original floating-point model needs to be loaded across multiple GPUs, set this parameter accordingly. Default value: 1.

pipeline_parallel_size (int, optional): The degree of pipeline parallelism. If the original floating-point model needs to be loaded across multiple GPUs, set this parameter accordingly. Default value: 1.
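
For example, loading the floating-point model across two GPUs with tensor parallelism during quantization might look like the following sketch; the parallelism degree and output directory name are arbitrary illustrations.

blade_llm_quantize \
    --model Qwen/Qwen-7B-Chat \
    --output_dir Qwen-7B-Chat-int8-tp2 \
    --quant_algo minmax \
    --quant_mode weight_only_quant \
    --bit 8 \
    --tensor_parallel_size 2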