Quantize LLMs with Weight-only Quantization for Efficient Inference - Platform for AI - Alibaba Cloud

Model compression reduces a model's size and computational complexity through techniques such as quantization, with minimal impact on the model's predictive performance. It is well suited to scenarios where GPU memory is limited or where you need to lower deployment costs.

How it works

PAI Model Gallery supports model quantization based on the Weight-only Quantization technique. By using the MinMax-8Bit or MinMax-4Bit strategy, you can quantize a model's floating-point weight parameters into 8-bit or 4-bit integer representations. This reduces model size and GPU memory usage while largely preserving the model's predictive performance.
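
To make the idea concrete, the following is a minimal NumPy sketch of min-max weight-only quantization. It is an illustration under stated assumptions, not the PAI implementation: this page does not describe how MinMax-8Bit and MinMax-4Bit work internally, so the function names `minmax_quantize` and `minmax_dequantize`, the per-row granularity, and the use of a zero point are all hypothetical choices made for the example.

```python
import numpy as np


def minmax_quantize(weights: np.ndarray, bits: int = 8):
    """Quantize a 2-D float weight matrix to unsigned `bits`-bit integers.

    Assumption: one (scale, zero_point) pair per row. The granularity that
    PAI actually uses is not documented on this page.
    """
    qmax = 2 ** bits - 1  # 255 for 8-bit, 15 for 4-bit
    w_min = weights.min(axis=1, keepdims=True)
    w_max = weights.max(axis=1, keepdims=True)
    # Guard against constant rows, where max == min would give a zero scale.
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(weights / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point


def minmax_dequantize(q, scale, zero_point):
    """Recover approximate float weights from the integer representation."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)

for bits in (8, 4):
    q, scale, zp = minmax_quantize(w, bits)
    err = np.abs(w - minmax_dequantize(q, scale, zp)).max()
    print(f"{bits}-bit: max abs reconstruction error = {err:.4f}")
```

In this sketch, each row's value range is divided into 2^bits - 1 equal steps, so the reconstruction error of any weight is at most half a step. Going from 8 bits to 4 bits makes each step 17 times larger (255 / 15), which is why the 4-bit strategy saves more memory but loses more precision.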

Compress a model

Train the model.

A model must be trained before you can compress it. For more information, see Model deployment and training.
After model training completes, click Compression in the upper-right corner of the Task details page.

Configure compression parameters.

The following table describes the key parameters.

Parameter	Description
Compression method	Only model quantization (Weight-only Quantization) is supported. This method converts weight parameters to a lower bit width to reduce GPU memory usage during inference.
Compression strategy	MinMax-8Bit: Quantizes model weights to 8-bit integers by using min-max scaling. MinMax-4Bit: Quantizes model weights to 4-bit integers by using min-max scaling.

For other parameters, see Model deployment and training.
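
As a rough guide when choosing between the two strategies, the memory needed for the weights scales with the bit width. The sketch below estimates it for a hypothetical 7-billion-parameter model. The parameter count, the 16-bit baseline, and the helper name `weight_memory_gib` are assumptions for illustration, and the estimate ignores quantization metadata (such as scales), activations, and the KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024 ** 3


num_params = 7_000_000_000  # hypothetical model size, for illustration only

for label, bits in [("16-bit baseline", 16), ("MinMax-8Bit", 8), ("MinMax-4Bit", 4)]:
    print(f"{label:>16}: ~{weight_memory_gib(num_params, bits):.1f} GiB")
```

For this hypothetical model, the weights take roughly 13.0 GiB at 16 bits, 6.5 GiB at 8 bits, and 3.3 GiB at 4 bits. Actual GPU memory usage during inference is higher because of the overheads that the estimate leaves out.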

Click Compression.

You are redirected to the Task details page, where you can view the compression job's basic information, real-time status, and task logs.

View compression jobs

To view compression jobs, go to PAI Model Gallery > Job Management > Compression Jobs.
