With the rapid development of large language models (LLMs), LLM APIs have become a core component of the infrastructure for AI-native applications. This topic evaluates the performance of multiple compression algorithms, such as ZSTD, GZIP, Brotli, and QATZip, for both streaming and non-streaming responses. The evaluation is based on stress testing data from Alibaba Cloud AI Gateway in LLM API scenarios. The test results show that enabling response compression significantly reduces network bandwidth consumption. For streaming responses, this can reduce outbound bandwidth costs by more than 90%.
Characteristics of LLM APIs
LLM APIs typically have the following characteristics:
Streaming responses: To improve real-time responsiveness, most LLM APIs use streaming data mechanisms, such as Server-Sent Events (SSE).
Asymmetric data volume: In typical scenarios, a request contains only a short prompt, but the response contains the full generated text. Sometimes, the response also includes additional content, such as inference processes and confidence level information. As a result, outbound traffic during response transmission consumes the most network bandwidth.
Highly structured response data: LLM API responses are usually in JSON format. They have a clear field structure and high data redundancy. Structured data can be compressed more efficiently with general-purpose compression algorithms because of its repetitive patterns and fixed syntax elements. In streaming scenarios, consecutive data frames often contain similar structural information, which further improves the compression ratio.
In summary, compressing LLM API response data, especially for streaming responses, can significantly reduce bandwidth usage and improve overall transmission efficiency.
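The structural redundancy described above can be sketched with a few lines of Python. The SSE frame layout below is a hypothetical example (field names modeled on typical chat-completion chunks, not any specific API), compressed with the standard-library gzip module:

```python
import gzip
import json

# Hypothetical SSE-style stream: every frame repeats the same JSON field
# structure, so only the short token payload differs between frames.
frames = [
    "data: " + json.dumps({
        "id": "chatcmpl-123",  # illustrative values only
        "object": "chat.completion.chunk",
        "choices": [{"delta": {"content": tok}, "index": 0}],
    }) + "\n\n"
    for tok in ["Hello", ",", " world", "!"]
]

raw = "".join(frames).encode()
compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)
print(f"original={len(raw)}B compressed={len(compressed)}B ratio={ratio:.0%}")
```

Because the field names and punctuation repeat in every frame, the compressor encodes them once and replaces later occurrences with short back-references.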
Benefits and use cases
Problems solved
High bandwidth costs: LLM application responses often consist of large volumes of text. High call volumes can lead to high Internet outbound bandwidth costs.
Poor experience on weak networks: In mobile environments or on networks with poor connectivity, large, uncompressed responses can cause client-side stuttering and degrade the streaming experience.
Quantifiable benefits
Enabling response compression, especially for streaming APIs, saves costs and improves user experience.
Cost savings:
For streaming responses, you can save 92% to 97% on bandwidth.
For non-streaming responses, you can save about 28% to 40% on bandwidth.
Improved experience: Reduces the amount of data transferred and speeds up response loading on the client. This improvement is especially noticeable on weak networks.
Use cases and recommendations
| Use case | Recommended or not | Explanation |
| --- | --- | --- |
| LLM streaming API | Recommended | For example, a conversational interface based on SSE. The protocol and data structure of streaming responses are highly redundant, which leads to good compression results. You can save up to 97% on bandwidth. |
| Standard API that returns large amounts of text or JSON | Recommended | For example, an API that returns long articles or large lists. When the response body is larger than 100 KB, compression can save bandwidth. |
| API response body is already compressed | Not recommended | For example, images (JPEG/PNG), videos, or binary files. Compressing already compressed data is not effective and increases CPU overhead. |
| API response body is very small | Not recommended | For example, responses smaller than 1 KB. The overhead from compression (CPU and latency) might outweigh the benefits. |
Performance evaluation
Test environment
| Parameter | Configuration |
| --- | --- |
| Gateway node | 1 core, 2 GB memory, single-node deployment |
| Stress testing QPS | Fixed at 30 QPS |
| Test scenario | Real-world LLM API call scenario |
| Compression algorithms | ZSTD, GZIP, Brotli, QATZip |
Compression algorithms
This test compares four compression algorithms:
| Algorithm | Description |
| --- | --- |
| GZIP | The classic DEFLATE-based compression algorithm with excellent compatibility. |
| ZSTD | Open-sourced by Facebook. Provides an excellent balance between compression ratio and speed. |
| Brotli | Open-sourced by Google. Offers a high compression ratio but consumes more CPU. |
| QATZip | Hardware-accelerated GZIP based on Intel QAT technology. Suitable for high-concurrency, low-latency, and low-CPU-consumption scenarios. |
QATZip
QATZip is a GZIP compression solution that uses Intel QuickAssist Technology (QAT) for hardware acceleration. QAT is a dedicated hardware acceleration platform from Intel. It offloads compute-intensive tasks from the CPU to dedicated hardware. By performing compression and decompression on a separate accelerator card, QAT significantly improves system performance.
How it works
The QAT hardware accelerator card has a dedicated compression and decompression engine optimized for data compression tasks. It connects to the host through a PCIe bus and supports parallel operation with the CPU and asynchronous operations. The CPU can submit a compression task to the accelerator card and then immediately return to handle other tasks. This process enables efficient task offloading.
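The submit-and-continue pattern can be illustrated in Python. This is an analogy only: QAT offloads to a dedicated PCIe accelerator card, not a thread pool, and the real QATZip C API is not shown here. A hedged sketch of the pattern the text describes:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# A thread pool stands in for the QAT accelerator card in this sketch.
accelerator = ThreadPoolExecutor(max_workers=1)

payload = b'{"choices":[{"delta":{"content":"Hello"}}]}' * 100

# The "CPU" submits the compression task and returns immediately...
future = accelerator.submit(gzip.compress, payload)

# ...stays free for other work while the task runs...
other_work_done = sum(range(1000))

# ...and collects the compressed result asynchronously.
compressed = future.result()
print(f"in={len(payload)}B out={len(compressed)}B")
```

The design point is that compression never blocks the submitting thread; with real QAT hardware the work leaves the CPU entirely.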
Alibaba Cloud AI Gateway and Cloud Native API Gateway support QATZip. These gateways use hardware acceleration for Gzip compression to effectively reduce CPU utilization and network bandwidth consumption.
Test method
This test compares the performance of the four compression algorithms in two response modes:
Non-streaming response: Returns the complete JSON response at once.
Streaming response: Uses a streaming mode (SSE or Chunked Transfer Encoding).
Metrics
Compression ratio: The ratio of the compressed data size to the original data size. A smaller value is better.
CPU consumption: The CPU usage of the gateway node during compression.
Test results
| Compression algorithm | Response mode | Compression ratio | CPU consumption |
| --- | --- | --- | --- |
| ZSTD | Non-streaming | 72% | 6% |
| ZSTD | Streaming | 2.8% | 23.6% |
| GZIP | Non-streaming | 68.8% | 6% |
| GZIP | Streaming | 7.2% | 30% |
| Brotli | Non-streaming | 60.8% | 6.2% |
| Brotli | Streaming | 7.2% | 34.5% |
| QATZip | Non-streaming | 67% | 4.8% |
| QATZip | Streaming | 2.9% | 20.9% |
Note: Compression ratio = Compressed size / Original size. A smaller compression ratio indicates better compression. For example, a ratio of 2.8% means the compressed data is only 2.8% of the original size, saving 97.2% of the bandwidth.
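The arithmetic behind the note is a one-liner. The helper below is only for illustration and uses two figures from the test results table above:

```python
# Bandwidth saving follows directly from the compression ratio defined above.
def bandwidth_saving(compression_ratio: float) -> float:
    """compression_ratio = compressed size / original size (smaller is better)."""
    return 1.0 - compression_ratio

# Figures from the test results table:
print(f"ZSTD streaming:     {bandwidth_saving(0.028):.1%} saved")  # 97.2% saved
print(f"GZIP non-streaming: {bandwidth_saving(0.688):.1%} saved")  # 31.2% saved
```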
Result analysis
Streaming response compression has clear advantages
The compression efficiency for streaming responses is significantly better than for non-streaming responses. The compression ratio can differ by a factor of more than 10.
Non-streaming scenario: The compression ratio is between 60% and 72%, saving about 28% to 40% on bandwidth.
Streaming scenario: The compression ratio can be as low as 2.8% to 7.2%, saving up to 92% to 97% on bandwidth.
Streaming responses use a chunked transfer mechanism, where data is returned token by token. Each returned chunk contains highly repetitive data structures, such as JSON field names and formatting characters. This results in significant data redundancy. These repetitive patterns are easily identified and efficiently compressed by algorithms. Compressing streaming responses for LLM APIs can significantly reduce outbound network bandwidth consumption.
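The effect of cross-frame redundancy can be reproduced with Python's zlib. This is a hedged sketch, assuming a gzip-wrapped DEFLATE stream (wbits=31) with a sync flush after every SSE frame so the client can decode each frame as it arrives; the frame layout and tokens are illustrative:

```python
import zlib

# Per-frame streaming compression: one compressor for the whole stream,
# flushed after every frame so chunks can be sent immediately.
comp = zlib.compressobj(wbits=31)  # 31 = gzip container

frame = 'data: {{"object":"chat.completion.chunk","choices":[{{"delta":{{"content":"{}"}}}}]}}\n\n'
tokens = ["The", " quick", " brown", " fox"] * 25  # 100 frames

stream = b""
total_in = 0
for tok in tokens:
    chunk = frame.format(tok).encode()
    total_in += len(chunk)
    # Z_SYNC_FLUSH emits all pending bytes while keeping the shared
    # dictionary, so later frames compress against earlier ones.
    stream += comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)

print(f"frames={len(tokens)} in={total_in}B out={len(stream)}B "
      f"ratio={len(stream) / total_in:.1%}")
```

After the first frame, each subsequent frame compresses to a handful of bytes, because everything except the token is a back-reference to earlier data. Compressing each frame independently instead would discard this cross-frame redundancy.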
CPU consumption trade-off
While achieving a higher compression ratio, streaming responses also increase CPU overhead.
In non-streaming scenarios, CPU usage is stable between 4.8% and 6.2%, indicating low resource consumption. In streaming scenarios, CPU usage rises to between 20.9% and 34.5%, which requires more computing resources.
Streaming compression requires real-time chunking of continuous data. This process significantly increases both the algorithm's complexity and the volume of data processing. Achieving a higher compression ratio also typically involves more computational operations. Therefore, the increase in CPU usage is expected.
QATZip has better overall performance
Based on the test results, QATZip delivers better overall performance.
In streaming scenarios, ZSTD and QATZip achieve compression ratios of 2.8% to 2.9%. This is significantly better than the 7.2% ratio of GZIP and Brotli.
At similar compression ratios, QATZip's CPU consumption is 20.9%, which is lower than ZSTD's 23.6%. Thanks to dedicated hardware acceleration, QATZip offers advantages in both compression efficiency and resource usage.
Implementation steps
Create an AI Gateway instance that supports hardware acceleration
Log in to the AI Gateway console and select a Region from the top menu bar.
In the navigation pane on the left, choose AI Gateway > Instance.
Click Create Instance and specify the following configurations on the page that appears:
Set the Product Type to dedicated instance (pay-as-you-go or subscription).
Set the Instance Specification to aigw.medium.x1 or higher.
Select Allocate Gzip hardware compression resources.
Note: Serverless instances do not support hardware acceleration. Only instances of the aigw.medium.x1 specification or higher support hardware acceleration.
Click Buy Now.
Enable compression in gateway parameters
Return to the Instance page and click the target instance ID to open its details page.
In the navigation pane on the left, select Parameters.
In the Gateway Engine Parameters section, click Edit for the EnableGzipHardwareAccelerate parameter and turn on the Parameter Value switch.
After you enable this feature, the client must be able to handle Gzip-compressed data. To opt in, add the Accept-Encoding: gzip header to requests.
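A minimal client round trip can be sketched in Python. The local server below stands in for the gateway (the URL path and payload are illustrative, not the gateway's actual endpoint); note that urllib does not decompress automatically, so the client checks Content-Encoding itself:

```python
import gzip
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Throwaway local server standing in for the gateway: it gzips the body
# only when the client advertises support via Accept-Encoding.
class GzipHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"choices": [{"message": {"content": "Hello"}}]}).encode()
        self.send_response(200)
        if "gzip" in self.headers.get("Accept-Encoding", ""):
            body = gzip.compress(body)
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), GzipHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client opts in to compression with the Accept-Encoding header...
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/v1/chat/completions",
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    raw = resp.read()
    # ...and decompresses the body itself when the server confirms gzip.
    if resp.headers.get("Content-Encoding") == "gzip":
        data = gzip.decompress(raw)
    else:
        data = raw

print(json.loads(data)["choices"][0]["message"]["content"])
server.shutdown()
```

Most HTTP client libraries (and browsers) handle this negotiation transparently; the manual check matters only for low-level clients like urllib.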