By Jingquan
As large-model inference services become mainstream, message queues face unprecedented challenges in fine-grained traffic governance for AI scenarios.
Traditional Internet applications have fixed workflows and short request times, and the message queue throttling mechanisms built for them are relatively mature. In AI inference scenarios, however, the workflow is highly dynamic and a single task can last several minutes or longer. Traditional methods fall short here, which leads to two core pain points: head-of-line blocking and impaired concurrency efficiency.
To address these issues, Apache RocketMQ 5.x provides a lightweight topic model named LiteTopic. It supports the creation of millions of lightweight topics and high-performance dynamic subscriptions. The fine-grained traffic governance solution based on LiteTopic can implement millisecond-level real-time throttling and minute-level busy/idle scheduling.
AI applications differ from traditional Internet applications in execution mode and task duration. Traditional application flows are fixed and predictable, complete quickly (within seconds), and mostly consist of one-way, one-time interactions. AI applications are more autonomous: they break goals down and adjust strategies dynamically. Their flow is uncertain, a single task takes a long and unpredictable time (on the order of minutes), and they often involve multiple rounds of dialogue and interaction.

This difference poses two serious challenges for message queues in AI inference scenarios:
In traditional services, requests from different users take a similar, short amount of time (usually within seconds). Even if multiple tenants share a queue, the head of the queue is never occupied for long, and blocking is not noticeable. A small number of queues is therefore enough to meet demand.
In AI inference scenarios, however, the request time of different users varies greatly and unpredictably, from a few seconds to tens of minutes. When multiple tenants share a queue, one long-running message (such as a complex inference task) occupies the head of the queue and blocks every message behind it. As a result, normal messages from other users in the same queue cannot be processed in a timely manner. If one user submits many slow tasks in quick succession, those tasks may hold the head position of every queue for a long time. This monopolizes resources, causes latency for other users to soar, and undermines the fairness of the system.
In AI inference scenarios, when a user submits a large number of inference requests within a short period of time, the system needs to apply traffic control to that user. However, traditional throttling measures (such as Thread.sleep()) block consumer threads, which causes a serious problem:
Even if messages from other healthy users are waiting in the queue, they cannot be processed, because all consumption threads are blocked while handling the throttled users' requests. As more users are throttled, more threads are blocked, and the concurrent processing capability of the entire system drops sharply.
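The effect can be illustrated with a small virtual-time model. This is only a sketch under simplifying assumptions, not RocketMQ code: the class `BlockingThrottleModel`, its `Msg` type, and all numbers are hypothetical. It compares how long healthy users wait when throttled messages hold a worker thread for a sleep period versus when they are set aside without holding a thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical virtual-time model; it does not use any RocketMQ API. */
final class BlockingThrottleModel {
    static final class Msg {
        final String user;
        final long workMs;
        final boolean throttled;

        Msg(String user, long workMs, boolean throttled) {
            this.user = user;
            this.workMs = workMs;
            this.throttled = throttled;
        }
    }

    /**
     * Returns the virtual time (ms) at which the last non-throttled message finishes.
     * If blockThread is true, a throttled message holds a worker for sleepMs + workMs.
     * If false, the throttled message is set aside and holds no worker in this model.
     */
    static long healthyFinishTime(List<Msg> queue, int threads, long sleepMs, boolean blockThread) {
        PriorityQueue<Long> workerFreeAt = new PriorityQueue<>();
        for (int i = 0; i < threads; i++) {
            workerFreeAt.add(0L);
        }
        long lastHealthyEnd = 0;
        for (Msg msg : queue) {
            if (msg.throttled && !blockThread) {
                continue; // suspended: retried later, no worker occupied now
            }
            long start = workerFreeAt.poll(); // earliest available worker
            long cost = msg.throttled ? sleepMs + msg.workMs : msg.workMs;
            long end = start + cost;
            workerFreeAt.add(end);
            if (!msg.throttled) {
                lastHealthyEnd = Math.max(lastHealthyEnd, end);
            }
        }
        return lastHealthyEnd;
    }

    public static void main(String[] args) {
        List<Msg> queue = Arrays.asList(
            new Msg("noisy", 10, true), new Msg("noisy", 10, true),
            new Msg("alice", 10, false), new Msg("bob", 10, false));
        System.out.println(healthyFinishTime(queue, 2, 1000, true));  // 1020
        System.out.println(healthyFinishTime(queue, 2, 1000, false)); // 10
    }
}
```

With two workers and a 1000 ms sleep, the two throttled messages occupy both workers, so the healthy users finish at 1020 ms in the model instead of 10 ms.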

To cope with traffic peaks in AI inference scenarios, the industry usually falls back on two familiar approaches to limit traffic, but both treat the symptoms rather than the root cause.
The simple and crude approach is to let the message fail so that it is automatically requeued. This sounds clever, but it actually plants a "time bomb":
When the system detects that a user's request frequency or resource consumption is too high within a short period of time, it calls a synchronous blocking API such as Thread.sleep() to suspend the message processing thread, putting it to "sleep for a while". This appears to control the rate of message processing, but it is like drinking poison to quench thirst.
Both traditional methods either over-rely on middleware mechanisms or sacrifice system performance, and neither fundamentally solves the problem of fine-grained traffic control in a multi-tenant environment.
AI inference requests may fluctuate within milliseconds. Millisecond-level fine-grained throttling is required to cope with instantaneous traffic peaks.
Message Queue for Apache RocketMQ provides a fine-grained throttling solution based on LiteTopics.
In practical applications, the traffic processing flow is shown in the following figure:

1. Message offloading: Upstream business messages are routed to a dedicated LiteTopic for each independent user, based on a user identifier (such as userId), to achieve physical isolation.
2. Parallel pulling: The consumer pulls messages from each LiteTopic in parallel by using long polling, and performs a throttling check on each LiteTopic within the throttling window.
3. Throttling judgment:
4. Consumption suspension: The LiteTopic is immediately suspended. The consumer releases the processing thread, and pulling from that LiteTopic is suspended on the server side. The suspension period can be controlled with millisecond precision to keep the throttling policy flexible and responsive.
5. Thread reuse: The released threads immediately process requests from other users, which enables elastic scheduling and efficient reuse of resources.
6. Automatic recovery: Consumption of a suspended LiteTopic resumes automatically after the specified time.
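The steps above can be modeled in memory to make the control flow concrete. The following is a toy sketch only, not the RocketMQ client or server implementation. The class `LiteTopicFlowModel` and its methods are hypothetical, time is passed in explicitly, and the handler returns a suspension duration in milliseconds (0 means "consumed").

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongBiFunction;

/** Hypothetical in-memory model of the flow; it does not use any RocketMQ API. */
final class LiteTopicFlowModel {
    private final Map<String, Deque<String>> liteTopics = new LinkedHashMap<>();
    private final Map<String, Long> suspendedUntilMs = new HashMap<>();

    /** Step 1: route each message to the queue named after its user. */
    void send(String userId, String body) {
        liteTopics.computeIfAbsent(userId, k -> new ArrayDeque<>()).addLast(body);
    }

    /**
     * Steps 2 to 6: visit every per-user queue once. The handler returns 0 to consume
     * the head message, or a positive number of milliseconds to suspend that queue.
     * Returns how many messages were consumed in this pass.
     */
    int dispatchOnce(long nowMs, ToLongBiFunction<String, String> handler) {
        int consumed = 0;
        for (Map.Entry<String, Deque<String>> entry : liteTopics.entrySet()) {
            String userId = entry.getKey();
            Deque<String> queue = entry.getValue();
            if (queue.isEmpty() || nowMs < suspendedUntilMs.getOrDefault(userId, 0L)) {
                continue; // nothing to deliver, or still suspended
            }
            long suspendMs = handler.applyAsLong(userId, queue.peekFirst());
            if (suspendMs > 0) {
                // The message stays at the head and is retried after the suspension.
                suspendedUntilMs.put(userId, nowMs + suspendMs);
            } else {
                queue.pollFirst();
                consumed++;
            }
        }
        return consumed;
    }

    /** Number of messages still waiting for the given user. */
    int pending(String userId) {
        Deque<String> queue = liteTopics.get(userId);
        return queue == null ? 0 : queue.size();
    }

    public static void main(String[] args) {
        LiteTopicFlowModel model = new LiteTopicFlowModel();
        model.send("u1", "a");
        model.send("u1", "b");
        model.send("u2", "c");
        // At t=0, u1 is throttled for 100 ms while u2 is processed normally.
        System.out.println(model.dispatchOnce(0, (user, body) -> user.equals("u1") ? 100 : 0));
        System.out.println(model.dispatchOnce(50, (user, body) -> 0));  // u1 still suspended
        System.out.println(model.dispatchOnce(100, (user, body) -> 0)); // u1 resumes
    }
}
```

In this model a throttled user never occupies the dispatcher: the pass simply skips the suspended queue and moves on to other users, and the skipped queue becomes eligible again once the suspension time has elapsed.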
The following consumption code example shows how to implement this mechanism in actual business:
LitePushConsumer litePushConsumer = PROVIDER.newLitePushConsumerBuilder()
    .setClientConfiguration(clientConfiguration)
    .bindTopic(TOPIC)
    .setConsumerGroup(GROUP)
    .setMessageListener(messageView -> {
        // [Physical isolation] The userId is used as the LiteTopic name to implement user-level physical isolation.
        // Each user has an independent physical queue, so resources are fully isolated and users do not interfere with each other.
        String userId = messageView.getLiteTopic();
        // [Accurate throttling] Determine whether throttling needs to be triggered based on business rules.
        // You can configure different thresholds per user to implement personalized traffic governance.
        if (shouldThrottle(userId)) {
            // [Consumption suspension] Return a suspend result to release the current processing thread immediately.
            // The server suspends pulling data for this user to avoid wasted resource consumption.
            // The suspension is controlled with millisecond precision: the released thread can serve other users right away,
            // and this LiteTopic resumes after 100 ms.
            return ConsumeResultSuspend.of(Duration.ofMillis(100));
        }
        // Process the message normally and return ConsumeResult.SUCCESS.
        processMessage(messageView);
        return ConsumeResult.SUCCESS;
    })
    .build();
The core of the preceding code is the "consumption suspension" mechanism.
Unlike traditional message queues, which support only the consumption success and consumption failure states, LiteTopic adds a third consumption state, Suspend, to implement precise time window control:
With this mechanism, instantaneous throttling no longer blocks threads. It protects system resources and keeps other users' requests flowing normally, which matches the real-time traffic governance requirements of AI inference scenarios.
In addition to millisecond-level instantaneous traffic control, the consumption suspension mechanism of RocketMQ LiteTopic also applies to long time windows of minutes or even hours, which makes it suitable for scheduling latency-insensitive tasks.
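The listener above calls shouldThrottle(userId) without showing how the decision is made. One possible shape for it is a per-user fixed-window counter, sketched below. Everything here is an assumption for illustration: the class `PerUserWindowThrottle`, the limit, the window length, and the injected clock are hypothetical and are not part of the RocketMQ client. A token bucket or sliding window would be an equally valid choice.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical helper: allows at most `limit` messages per user in each fixed window. */
final class PerUserWindowThrottle {
    private static final class Window {
        long startMs;
        int count;
    }

    private final int limit;
    private final long windowMs;
    private final LongSupplier clock; // injected so the logic can be tested without sleeping
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    PerUserWindowThrottle(int limit, long windowMs, LongSupplier clock) {
        this.limit = limit;
        this.windowMs = windowMs;
        this.clock = clock;
    }

    /** Returns true if the user has already used up the quota of the current window. */
    boolean shouldThrottle(String userId) {
        long now = clock.getAsLong();
        boolean[] throttled = {false};
        // compute() runs atomically per key, so concurrent listeners do not race on a user.
        windows.compute(userId, (id, window) -> {
            if (window == null || now - window.startMs >= windowMs) {
                window = new Window(); // start a fresh window for this user
                window.startMs = now;
            }
            if (window.count >= limit) {
                throttled[0] = true;
            } else {
                window.count++;
            }
            return window;
        });
        return throttled[0];
    }

    public static void main(String[] args) {
        PerUserWindowThrottle throttle =
            new PerUserWindowThrottle(2, 1000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            System.out.println("user-1 throttled: " + throttle.shouldThrottle("user-1"));
        }
    }
}
```

Because the check never sleeps, it fits the suspension model: the listener only decides, and the waiting itself is delegated to the suspend result that the listener returns.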
In actual business scenarios, a large number of latency-insensitive tasks may exist, such as:
Such tasks do not need to be processed in real time, but they may occupy a large amount of computing resources. The consumption suspension mechanism lets you schedule these tasks to run during idle hours:
1. Long-time window suspension: Set a suspension period of seconds or even minutes (such as Duration.ofMinutes(30)) to delay the task to off-peak hours.
2. Dynamic sensing of business load: Monitor the system load in real time, and actively suspend the consumption of low-priority tasks when resource constraints are detected.
3. Lightweight task scheduling: Without introducing an additional scheduling system, the message queue itself provides delayed task execution and staggered resource usage, which reduces system complexity.
LitePushConsumer litePushConsumer = PROVIDER.newLitePushConsumerBuilder()
    .setClientConfiguration(clientConfiguration)
    .bindTopic(TOPIC)
    .setConsumerGroup(GROUP)
    .setMessageListener(messageView -> {
        String taskType = messageView.getUserProperty("taskType");
        // Identify latency-insensitive tasks.
        if ("BATCH".equals(taskType) || "LOW_PRIORITY".equals(taskType)) {
            // Check whether the system is busy.
            if (isSystemBusy()) {
                // [Long suspension] Postpone the task to an idle period.
                // Suspend for 30 minutes, after which scheduling resumes automatically.
                return ConsumeResultSuspend.of(Duration.ofMinutes(30));
            }
        }
        // Process the message normally and return ConsumeResult.SUCCESS.
        processMessage(messageView);
        return ConsumeResult.SUCCESS;
    })
    .build();
This busy/idle scheduling capability lets RocketMQ LiteTopic extend the message queue to handle deferred tasks. Without introducing additional scheduling components, you can maximize system resource utilization while protecting the SLA of your core business.
LiteTopic is a lightweight topic model designed by Apache RocketMQ for AI scenarios. LiteTopic features lightweight resources, automated lifecycle management, high-performance subscription, and ordering guarantees.
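The isSystemBusy() check in the listener is left to the application. A minimal sketch of one way to implement it is shown below, assuming that "busy" means the per-core system load average exceeds a threshold. The class `BusyDetector`, the threshold value, and the load-based definition are all hypothetical. Real deployments might instead look at GPU utilization, queue depth, or a time-of-day window.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.function.DoubleSupplier;

/** Hypothetical helper: reports "busy" when the per-core load exceeds a threshold. */
final class BusyDetector {
    private final DoubleSupplier loadPerCore; // injected so the rule can be tested
    private final double threshold;

    BusyDetector(DoubleSupplier loadPerCore, double threshold) {
        this.loadPerCore = loadPerCore;
        this.threshold = threshold;
    }

    /** Builds a detector backed by the JVM's view of the operating system load. */
    static BusyDetector fromOperatingSystem(double threshold) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return new BusyDetector(
            () -> os.getSystemLoadAverage() / os.getAvailableProcessors(), threshold);
    }

    boolean isSystemBusy() {
        double load = loadPerCore.getAsDouble();
        // getSystemLoadAverage() is negative when the platform cannot report it;
        // in that case this sketch treats the system as not busy.
        return load >= 0 && load > threshold;
    }

    public static void main(String[] args) {
        BusyDetector detector = BusyDetector.fromOperatingSystem(0.8);
        System.out.println("busy: " + detector.isSystemBusy());
    }
}
```

Keeping the load source behind a supplier makes it easy to swap in a different signal later without touching the message listener.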
Its underlying layer is based on an innovative storage architecture and distribution mechanism that supports the efficient management of millions of LiteTopics. It implements physical isolation of massive LiteTopic resources without sacrificing performance, providing a solid technical foundation for refined traffic governance in AI scenarios.
Key technical points include:

Send and consume processes with high concurrency performance for millions of LiteTopics
As AI inference becomes more widespread, traditional message queue throttling methods can no longer meet the requirements of fine-grained traffic control.
The fine-grained traffic governance solution based on RocketMQ LiteTopic relies on four core characteristics: physical isolation, elastic scaling, precise flow control, and consumption suspension. Together they systematically address the two major pain points, head-of-line blocking and impaired concurrency efficiency, and give AI inference scenarios a full range of message processing guarantees, from millisecond-level real-time throttling to minute-level busy/idle scheduling.
This solution is also the subject of in-depth cooperation with the gateway of Bailian, Alibaba Cloud's large model service platform, which uses the fine-grained traffic control capability of RocketMQ LiteTopic to better manage peak traffic and resource scheduling for AI inference requests.
Currently, the core capabilities of LiteTopics have been released in Alibaba Cloud Message Queue for Apache RocketMQ 5.x. To use LiteTopics in your business, see the help documentation.
In the future, we will continue to explore more innovative technologies to promote the evolution and development of message queues in the AI era.
You are welcome to join the RocketMQ for AI user community on DingTalk (group ID: 110085036316) to exchange ideas and discuss with us.
