Refined AI Inference Traffic Governance in Practice: RocketMQ LiteTopic's Per-Scenario Traffic Control Solution

RocketMQ LiteTopic enables fine-grained, per-scenario traffic governance for AI inference workloads via millisecond-level throttling and consumption suspension.

By Jingquan

Overview

As large model inference services become mainstream, message queues face unprecedented challenges in fine-grained traffic governance for AI scenarios.

Traditional Internet applications have fixed workflows and short request times, and message queue throttling mechanisms for them are relatively mature. In AI inference scenarios, however, the workflow is highly dynamic and a single task can last for several minutes or more. Traditional methods are inadequate here, which leads to two core pain points:

Queue head blocking: Slow tasks of a single user block messages from other users in the queue.
Impaired concurrency efficiency: Crude throttling measures sharply reduce the throughput of the entire system.

To address these issues, Apache RocketMQ 5.x provides a lightweight topic model named LiteTopic. It supports the creation of millions of lightweight topics and high-performance dynamic subscriptions. The fine-grained traffic governance solution based on LiteTopic implements real-time throttling at millisecond granularity and busy/idle scheduling at minute granularity.

New Message Queue Challenges in AI Inference Scenarios

AI applications differ from traditional Internet applications in execution mode and task duration. Traditional application processes are fixed and predictable, take a short time (seconds), and are mostly one-way, one-time interactions. AI applications are more proactive: they decompose goals and dynamically adjust strategies. The process is uncertain, a single task takes a long time (minutes, and unpredictable), and AI applications often involve multiple rounds of dialogue and interaction.

This difference poses two serious challenges for message queues in AI inference scenarios:

1. Queue Head Blocking

In traditional services, requests from different users take roughly the same short time (usually within seconds). Even if multiple tenants share a queue, no message occupies the queue head for long, and blocking is not noticeable. Therefore, only a few queues are needed to meet the demand.

However, in AI inference scenarios, the request time of different users varies greatly (from a few seconds to tens of minutes) and is unpredictable. When multiple tenants share a queue, a long-running message (such as a complex inference task) occupies the head of the queue and blocks all subsequent messages. As a result, normal messages of other users in the same queue cannot be processed in a timely manner. If one user submits many slow tasks in a burst, those tasks may occupy the head of every queue for a long time. This monopolizes resources, makes the latency of other users soar, and undermines the fairness of the system.
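The effect can be made concrete with a small arithmetic model. The sketch below is illustrative only: the class, the task durations, and the assumption that per-user queues are consumed independently with enough consumer threads are all hypothetical and not taken from RocketMQ.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model: waiting time in one shared FIFO queue versus per-user queues. */
class HeadOfLineDemo {

    /** One queued task: who submitted it and how long it runs (seconds). */
    static final class Task {
        final String user;
        final long seconds;

        Task(String user, long seconds) {
            this.user = user;
            this.seconds = seconds;
        }
    }

    /** Shared queue consumed in order: a task waits for every task ahead of it. */
    static long waitInSharedQueue(List<Task> queue, int index) {
        long wait = 0;
        for (int i = 0; i < index; i++) {
            wait += queue.get(i).seconds;
        }
        return wait;
    }

    /** Per-user queues consumed independently: a task waits only for the same user's earlier tasks. */
    static long waitInPerUserQueue(List<Task> queue, int index) {
        String user = queue.get(index).user;
        long wait = 0;
        for (int i = 0; i < index; i++) {
            if (queue.get(i).user.equals(user)) {
                wait += queue.get(i).seconds;
            }
        }
        return wait;
    }

    public static void main(String[] args) {
        List<Task> arrivals = Arrays.asList(
            new Task("userA", 600),  // complex inference task, 10 minutes
            new Task("userA", 600),
            new Task("userB", 2));   // short request queued behind userA
        System.out.println("shared queue wait for userB:   " + waitInSharedQueue(arrivals, 2) + " s");
        System.out.println("per-user queue wait for userB: " + waitInPerUserQueue(arrivals, 2) + " s");
    }
}
```

In this model the short request of userB waits 1200 seconds in the shared queue and 0 seconds when each user has a dedicated queue.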

2. Concurrency Efficiency Is Impaired

In AI inference scenarios, when a user intensively submits a large number of inference requests within a short period of time, the system needs to implement traffic control for the user. However, traditional throttling measures (such as Thread.sleep()) block consumer threads, which causes a serious problem:

Even if messages from other healthy users are waiting in the queue, they cannot be processed, because all consumer threads are blocked while handling the requests of the throttled users. As more users are throttled, more threads are blocked, and the concurrent processing capability of the entire system decreases sharply.

Why Do Traditional Solutions Fail in AI Inference Scenarios?

When facing traffic peaks in AI inference scenarios, the industry usually falls back on two familiar ways to limit traffic, but both treat the symptoms rather than the root cause.

▍ Solution 1: Retry Method for Consumption Failures

This method simply lets the message fail so that it is automatically requeued. It sounds clever, but it actually plants a "time bomb":

Uncontrollable retry mechanism: The built-in retry mechanism of the middleware offers no precise control over retry timing, which may amplify latency.
Unstable service quality: Timeliness cannot be guaranteed. Messages may go through several retry rounds before being processed, which affects the service SLA.
Serious waste of resources: Every failure and retry consumes additional network, disk, and CPU resources. This increases the overall load of the system and reduces its stability.
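The latency amplification can be illustrated with simple arithmetic. The sketch below is hypothetical: the retry intervals are illustrative values, not the retry policy of any specific RocketMQ version, and the helper name is invented for this example.

```java
/** Hypothetical arithmetic: coarse retry intervals versus a precisely timed suspension. */
class RetryDelayDemo {

    /**
     * Returns the time (ms after the first failure) at which the message is finally processed,
     * given fixed retry intervals and the moment the throttle is lifted.
     * Returns -1 if the retries run out before the throttle is lifted.
     */
    static long processedAtWithRetries(long[] retryIntervalsMs, long throttleLiftedAtMs) {
        long now = 0;
        for (long interval : retryIntervalsMs) {
            now += interval;                  // the message becomes visible again
            if (now >= throttleLiftedAtMs) {
                return now;                   // this redelivery succeeds
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] assumedIntervals = {10_000, 30_000, 60_000}; // illustrative values only
        long liftedAt = 150;                                 // throttle window ends after 150 ms
        System.out.println("retry-based:   processed after "
            + processedAtWithRetries(assumedIntervals, liftedAt) + " ms");
        System.out.println("suspend-based: processed after " + liftedAt + " ms");
    }
}
```

With these assumed intervals, a throttle that lasts only 150 ms still delays the message by 10,000 ms, because the first redelivery cannot happen earlier.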

▍ Solution 2: Thread Blocking and Throttling

When the system detects that a user's request frequency or resource consumption is too high within a short period of time, it calls a synchronous blocking API such as Thread.sleep() to suspend the message processing thread, so that the thread simply "sleeps for a while". This seems to control the frequency of message processing, but it is like drinking poison to quench thirst.

Low resource utilization: A large number of threads are blocked without doing useful work, which occupies memory and increases scheduling overhead. As a result, the concurrency capability decreases and resources may stay exhausted for a long period of time.
Tenant isolation failure: In a shared thread pool, throttling on a queue affects other queues that are processed by the same thread. This breaks the isolation between multiple users.
Impaired throughput: The blocking mechanism runs counter to the high-performance design and severely damages the overall message processing capability of the system.
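The following sketch reproduces the problem with a plain JDK thread pool. It is a minimal demonstration under stated assumptions: the pool stands in for the consumer threads, the sleeping tasks stand in for throttled users, and the measured delay depends on timing, so the printed value varies slightly between runs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: Thread.sleep()-based throttling starves a healthy user's message. */
class BlockingThrottleDemo {

    /** Returns how long (ms) a healthy user's message waited before a consumer thread picked it up. */
    static long healthyMessageDelayMs(int poolSize, final long sleepMs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Throttled users' messages: each one parks a consumer thread with Thread.sleep().
            for (int i = 0; i < poolSize; i++) {
                pool.execute(() -> {
                    try {
                        Thread.sleep(sleepMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Healthy user's message: needs no throttling, but no thread is free to run it.
            final long submittedAt = System.nanoTime();
            Callable<Long> healthyMessage =
                () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submittedAt);
            Future<Long> startedAfter = pool.submit(healthyMessage);
            return startedAfter.get();
        } catch (Exception e) {
            throw new IllegalStateException("simulation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy message waited about " + healthyMessageDelayMs(2, 200) + " ms");
    }
}
```

The healthy message starts only after one of the sleeping threads wakes up, even though it needs no throttling itself.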

Both traditional methods either over-rely on middleware mechanisms or sacrifice system performance, and neither can fundamentally solve the problem of fine-grained traffic control in a multi-tenant environment.

RocketMQ LiteTopic Traffic Governance

▍ 1. Real-Time Throttling in Milliseconds: Each User Has an "Exclusive VIP Channel"

AI inference requests may fluctuate within milliseconds. Millisecond-level fine-grained throttling is required to cope with instantaneous traffic peaks.

RocketMQ provides a fine-grained throttling solution based on LiteTopics.

Physical isolation: An independent LiteTopic is created for each user or session. This physically isolates user-level resources and eliminates cross-interference.
Elastic scaling: LiteTopic allows you to create millions of topics on demand.
Precise throttling: Each LiteTopic can independently implement throttling policies. You can configure different thresholds based on your business needs. This allows you to implement personalized traffic governance.
Consumption suspension: When a user's requests exceed the limit, the request is neither rejected (fail and retry) nor made to wait on a blocked thread. Instead, the user is gracefully "asked to wait for a moment" (suspended), which protects system resources without hurting the user experience.

In practical applications, the traffic processing flow is as follows:

1. Message routing: Upstream business messages are routed to the dedicated LiteTopic of each user based on the user ID (such as userId) to achieve physical isolation.

2. Parallel pulling: The consumer pulls messages from each LiteTopic in parallel by using long polling, and evaluates the throttling rule for each LiteTopic within the throttling window.

3. Throttling judgment:

Below the threshold: If a user's requests do not reach the threshold, the message is consumed normally and the traffic passes through.
Above the threshold: If a user's requests exceed the limit, the Suspend status is returned.

4. Consumption suspension: The LiteTopic is immediately suspended. The consumer releases the processing thread, and the server suspends pulling for that LiteTopic. The suspension period can be precisely controlled in milliseconds to ensure the flexibility and response speed of the throttling policy.

5. Thread reuse: The released threads immediately process requests from other users, which achieves elastic scheduling and efficient reuse of resources.

6. Automatic recovery: The consumption of a suspended LiteTopic is automatically resumed after the specified time.
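The six steps above can be sketched as a single-worker simulation with a virtual clock. This is an in-memory illustration, not the RocketMQ implementation: the class, the "burst then pause" throttling policy, and the fixed processing cost per message are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation of the route, judge, suspend, reuse, and resume flow. */
class SuspendFlowSimulation {
    private final Map<String, Deque<String>> liteTopics = new LinkedHashMap<>(); // one queue per user
    private final Map<String, Long> resumeAt = new HashMap<>();                  // suspended user -> wake-up time
    private final Map<String, Integer> usedInBurst = new HashMap<>();            // messages since last resume
    private final int burst;       // messages a user may process before being suspended
    private final long suspendMs;  // suspension period
    private final long costMs;     // virtual processing time per message
    private long now = 0;          // virtual clock in milliseconds

    SuspendFlowSimulation(int burst, long suspendMs, long costMs) {
        if (burst < 1) {
            throw new IllegalArgumentException("burst must be at least 1");
        }
        this.burst = burst;
        this.suspendMs = suspendMs;
        this.costMs = costMs;
    }

    /** Step 1: route each message to the queue named after its user ID. */
    void send(String userId, String body) {
        liteTopics.computeIfAbsent(userId, k -> new ArrayDeque<>()).add(body);
    }

    /** Steps 2 to 6: returns the processing log as "time:user:body" entries. */
    List<String> run() {
        List<String> log = new ArrayList<>();
        while (hasPending()) {
            boolean progressed = false;
            for (Map.Entry<String, Deque<String>> topic : liteTopics.entrySet()) {
                String user = topic.getKey();
                Deque<String> queue = topic.getValue();
                Long wakeAt = resumeAt.get(user);
                if (wakeAt != null) {
                    if (now < wakeAt) {
                        continue;                 // still suspended: skip without blocking
                    }
                    resumeAt.remove(user);        // step 6: automatic recovery
                    usedInBurst.remove(user);
                }
                while (!queue.isEmpty()) {
                    int used = usedInBurst.getOrDefault(user, 0);
                    if (used >= burst) {          // step 3: over the limit
                        resumeAt.put(user, now + suspendMs); // step 4: suspend this queue
                        break;                    // step 5: the worker moves on to other users
                    }
                    usedInBurst.put(user, used + 1);
                    log.add(now + ":" + user + ":" + queue.poll());
                    now += costMs;
                    progressed = true;
                }
            }
            if (!progressed) {
                // Every pending queue is suspended: jump the clock to the earliest wake-up.
                now = Math.max(now, Collections.min(resumeAt.values()));
            }
        }
        return log;
    }

    private boolean hasPending() {
        for (Deque<String> queue : liteTopics.values()) {
            if (!queue.isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

With a burst of 2, a suspension of 100 ms, and a cost of 10 ms per message, a user with four queued messages is suspended after the second one, and the worker immediately serves the next user instead of waiting.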

The following consumption code example shows how to implement this mechanism in actual business:

LitePushConsumer litePushConsumer = PROVIDER.newLitePushConsumerBuilder()
    .setClientConfiguration(clientConfiguration)
    .bindTopic(TOPIC)
    .setConsumerGroup(GROUP)
    .setMessageListener(messageView -> {
        // [Physical isolation] The userId is used as the LiteTopic name to implement user-level physical isolation.
        // Each user has an independent queue, so resources are fully independent and users do not interfere with each other.
        String userId = messageView.getLiteTopic();
        // [Precise throttling] Determine whether throttling needs to be triggered based on business rules.
        // Thresholds can be configured per user to implement personalized traffic governance.
        if (shouldThrottle(userId)) {
            // [Consumption suspension] Return Suspend to release the current processing thread immediately.
            // The server suspends pulling for this user, which avoids wasted resource consumption.
            // The suspension period is controlled in milliseconds: consumption resumes after 100 ms,
            // and the released thread can serve other users in the meantime.
            return ConsumeResultSuspend.of(Duration.ofMillis(100));
        }
        // The message is processed normally and ConsumeResult.SUCCESS is returned.
        processMessage(messageView);
        return ConsumeResult.SUCCESS;
    })
    .build();
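The listener above calls shouldThrottle(userId) without showing it. One possible shape for such a helper is a per-user fixed-window counter, sketched below. The class name, the constructor parameters, and the injected clock are assumptions for illustration and are not part of the RocketMQ client API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-user fixed-window limiter that could back a shouldThrottle(userId) helper. */
class PerUserWindowLimiter {

    /** Immutable window state of one user. */
    private static final class Window {
        final long startMs;
        final int count;

        Window(long startMs, int count) {
            this.startMs = startMs;
            this.count = count;
        }
    }

    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();
    private final int maxPerWindow;
    private final long windowMs;
    private final LongSupplier clock; // injected so that the limiter can be tested deterministically

    PerUserWindowLimiter(int maxPerWindow, long windowMs, LongSupplier clock) {
        if (maxPerWindow < 1) {
            throw new IllegalArgumentException("maxPerWindow must be at least 1");
        }
        this.maxPerWindow = maxPerWindow;
        this.windowMs = windowMs;
        this.clock = clock;
    }

    /** Returns true when the user has already used up the quota of the current window. */
    boolean shouldThrottle(String userId) {
        final boolean[] throttled = new boolean[1];
        // compute() runs atomically per key, so concurrent listener threads do not lose updates.
        windows.compute(userId, (id, w) -> {
            long now = clock.getAsLong();
            if (w == null || now - w.startMs >= windowMs) {
                return new Window(now, 1);            // first message of a fresh window
            }
            if (w.count >= maxPerWindow) {
                throttled[0] = true;                  // quota used up: the caller should suspend
                return w;
            }
            return new Window(w.startMs, w.count + 1);
        });
        return throttled[0];
    }

    /** Time left until the user's current window ends; a natural choice for the suspension period. */
    long millisUntilReset(String userId) {
        Window w = windows.get(userId);
        if (w == null) {
            return 0;
        }
        return Math.max(0, w.startMs + windowMs - clock.getAsLong());
    }
}
```

A listener could construct the limiter with System::currentTimeMillis as the clock and use millisUntilReset(userId) as the suspension period instead of a fixed 100 ms.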

The core of the preceding code is the "consumption suspension" mechanism.

Unlike traditional message queues, which support only the consumption success and consumption failure states, LiteTopic adds a third consumption state, Suspend, to implement precise time window control:

Status extension: When the consumer returns the ConsumeResultSuspend status, it can carry the next visible timestamp, which specifies how long the message stays invisible.
Resource release: The system immediately releases the processing thread and clears the local cache of the queue to avoid resource occupation.
Automatic recovery: The broker maintains a timer-based scheduler that automatically wakes up the queue when the specified time is reached.
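The automatic recovery step can be pictured as a min-heap of wake-up times that a timer polls. The sketch below is a simplified in-memory model under that assumption. It does not show how the broker actually implements its scheduler, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical wake-up scheduler: suspended queues sit in a min-heap ordered by wake-up time. */
class SuspendScheduler {

    private static final class Entry {
        final String liteTopic;
        final long wakeAtMs; // the "next visible timestamp" carried by the Suspend result

        Entry(String liteTopic, long wakeAtMs) {
            this.liteTopic = liteTopic;
            this.wakeAtMs = wakeAtMs;
        }
    }

    private final PriorityQueue<Entry> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a.wakeAtMs, b.wakeAtMs));

    /** Records when a suspended LiteTopic becomes visible again. */
    void suspend(String liteTopic, long nowMs, long suspendMs) {
        heap.add(new Entry(liteTopic, nowMs + suspendMs));
    }

    /** Called by a periodic timer: returns every LiteTopic whose suspension has expired. */
    List<String> wakeUp(long nowMs) {
        List<String> ready = new ArrayList<>();
        while (!heap.isEmpty() && heap.peek().wakeAtMs <= nowMs) {
            ready.add(heap.poll().liteTopic);
        }
        return ready;
    }

    /** Number of LiteTopics that are still suspended. */
    int suspendedCount() {
        return heap.size();
    }
}
```

Because only the head of the heap is inspected on each tick, the cost of a timer tick depends on the number of expired suspensions rather than on the total number of suspended queues.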

With this mechanism, instantaneous throttling no longer blocks threads. This protects system resources and ensures that requests from other users are processed normally, which matches the real-time traffic governance requirements of AI inference scenarios.

▍ 2. Minute-Level Busy and Idle Scheduling: Make Delayed Tasks "Travel Off-Peak"

In addition to instantaneous traffic control in milliseconds, the consumption suspension mechanism of RocketMQ LiteTopic also applies to long time windows of minutes or even hours, which makes it suitable for scheduling latency-insensitive tasks.

In actual business scenarios, a large number of latency-insensitive tasks may exist, such as:

Batch tasks: batch processing jobs such as data statistics and report generation.
Asynchronous processing: asynchronous notification and log analysis on non-core links.
Resource-consuming tasks: compute-intensive operations such as model training and offline inference.

Such tasks do not need to be processed in real time, but may occupy a large amount of computing resources. The consumption suspension mechanism allows you to intelligently schedule these tasks to be executed during idle hours:

1. Long-time window suspension: Set a suspension period of minutes or even hours (such as Duration.ofMinutes(30)) to delay the task to off-peak hours.

2. Dynamic sensing of business load: Monitor the system load in real time, and actively suspend the consumption of low-priority tasks when resource constraints are detected.

3. Lightweight task scheduling: Without an additional scheduling system, the message queue itself delays task execution and staggers resource usage, which reduces system complexity.

LitePushConsumer litePushConsumer = PROVIDER.newLitePushConsumerBuilder()
    .setClientConfiguration(clientConfiguration)
    .bindTopic(TOPIC)
    .setConsumerGroup(GROUP)
    .setMessageListener(messageView -> {
        String taskType = messageView.getUserProperty("taskType");
        // Identify latency-insensitive tasks.
        if ("BATCH".equals(taskType) || "LOW_PRIORITY".equals(taskType)) {
            // Check whether the system is busy.
            if (isSystemBusy()) {
                // [Long suspension] Postpone the task to an idle period.
                // Consumption is suspended for 30 minutes and then resumes automatically.
                return ConsumeResultSuspend.of(Duration.ofMinutes(30));
            }
        }
        // The message is processed normally and ConsumeResult.SUCCESS is returned.
        processMessage(messageView);
        return ConsumeResult.SUCCESS;
    })
    .build();
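The isSystemBusy() check is left to the business side. One simple possibility, sketched below, compares the operating system load average per CPU core with a threshold by using the JDK's OperatingSystemMXBean. The class name, the threshold, and the choice of metric are assumptions. A production system would more likely rely on its own metrics, such as GPU utilization or the depth of the inference queue.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.function.DoubleSupplier;

/** Hypothetical isSystemBusy() helper based on the load average per CPU core. */
class SystemLoadProbe {
    private final DoubleSupplier loadPerCore; // injected so that the check can be tested
    private final double busyThreshold;

    SystemLoadProbe(DoubleSupplier loadPerCore, double busyThreshold) {
        this.loadPerCore = loadPerCore;
        this.busyThreshold = busyThreshold;
    }

    /** Builds a probe backed by the JVM's view of the operating system load. */
    static SystemLoadProbe fromJvm(double busyThreshold) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return new SystemLoadProbe(
            () -> os.getSystemLoadAverage() / os.getAvailableProcessors(),
            busyThreshold);
    }

    /** Returns true when the load per core exceeds the threshold. */
    boolean isSystemBusy() {
        double load = loadPerCore.getAsDouble();
        if (load < 0) {
            return false; // getSystemLoadAverage() is negative when the platform cannot report it
        }
        return load > busyThreshold;
    }
}
```

A listener could create the probe once, for example with SystemLoadProbe.fromJvm(0.8), and call isSystemBusy() for every low-priority message.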

This busy/idle scheduling capability extends RocketMQ LiteTopic from message delivery to the handling of delayed tasks. Without introducing additional scheduling components, you can maximize system resource utilization while ensuring the SLA of core business.

RocketMQ LiteTopics: How to Achieve Million-Level Physical Isolation?

LiteTopic is a lightweight topic model designed by Apache RocketMQ for AI scenarios. LiteTopic features lightweight resources, automated lifecycle management, high-performance subscription, and order assurance.

Its underlying layer is based on an innovative storage architecture and distribution mechanism that supports the efficient management of millions of LiteTopics. It implements physical isolation of massive LiteTopic resources without sacrificing performance, providing a solid technical foundation for refined traffic governance in AI scenarios.

Key technical points include:

Unified storage and multi-path distribution: All message data is stored in the underlying CommitLog file and only one copy is stored. The append write mode is used to prevent disk fragmentation and ensure high write performance. At the same time, the multi-path distribution mechanism is used to generate independent consumption indexes for different LiteTopics.
RocksDB KV storage engine: LiteTopic uses RocksDB, a high-performance KV storage engine, instead of the traditional file-based ConsumeQueue (CQ) structure. Queue indexes and the physical offsets of messages are stored as key-value pairs, which efficiently manages millions of metadata entries.
Broker-side subscription management: The broker manages the subscription sets of consumers (the LiteTopics that each consumer subscribes to) and supports incremental updates. This enables the broker to proactively track, in real time, which messages match which subscriptions.
Event-driven and ready set maintenance: When a new message is written, subscription matching is immediately triggered to aggregate the messages that meet the conditions into a ready set.
Efficient bulk pulling: Consumers can pull messages from multiple LiteTopics in one poll request. This significantly reduces the frequency of network interaction and ensures low latency and high throughput in scenarios where a large number of subscriptions are made.

[Figure: Send and consume processes with high concurrency performance for millions of LiteTopics]
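The interplay of these points can be sketched with an in-memory model: one append-only log, a sorted map standing in for the KV index, and a ready set that is updated when a message is written. The code is a conceptual illustration only. The class, the key layout, and the data structures are assumptions and do not reflect the actual broker storage format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory model of "one commit log, many lightweight indexes". */
class LiteTopicStoreSketch {
    private final List<String> commitLog = new ArrayList<>();        // single append-only copy of all messages
    private final TreeMap<String, Integer> index = new TreeMap<>();  // KV stand-in: key -> commit log position
    private final Map<String, Long> nextOffset = new HashMap<>();    // next queue offset per LiteTopic
    private final Map<String, Long> consumed = new HashMap<>();      // consumer progress per LiteTopic
    private final Set<String> subscriptions = new HashSet<>();       // LiteTopics the consumer subscribes to
    private final Set<String> readySet = new LinkedHashSet<>();      // subscribed LiteTopics with unread messages

    /** Hypothetical key layout: LiteTopic name plus zero-padded queue offset. */
    private static String key(String liteTopic, long offset) {
        return liteTopic + "#" + String.format("%019d", offset);
    }

    /** Incremental subscription update: add one LiteTopic to the subscription set. */
    void subscribe(String liteTopic) {
        subscriptions.add(liteTopic);
        if (consumed.getOrDefault(liteTopic, 0L) < nextOffset.getOrDefault(liteTopic, 0L)) {
            readySet.add(liteTopic);
        }
    }

    /** Append once to the commit log, then write one index entry for the target LiteTopic. */
    void append(String liteTopic, String body) {
        commitLog.add(body);
        int position = commitLog.size() - 1;
        long offset = nextOffset.merge(liteTopic, 1L, Long::sum) - 1;
        index.put(key(liteTopic, offset), position);
        if (subscriptions.contains(liteTopic)) {
            readySet.add(liteTopic); // event-driven: matching happens at write time
        }
    }

    /** One poll returns up to maxMessages across all ready LiteTopics. */
    List<String> poll(int maxMessages) {
        List<String> batch = new ArrayList<>();
        Iterator<String> it = readySet.iterator();
        while (it.hasNext() && batch.size() < maxMessages) {
            String topic = it.next();
            long next = consumed.getOrDefault(topic, 0L);
            long end = nextOffset.getOrDefault(topic, 0L);
            while (next < end && batch.size() < maxMessages) {
                batch.add(topic + ":" + commitLog.get(index.get(key(topic, next))));
                next++;
            }
            consumed.put(topic, next);
            if (next == end) {
                it.remove(); // fully drained: the LiteTopic leaves the ready set
            }
        }
        return batch;
    }

    /** Total number of messages stored, across all LiteTopics. */
    int commitLogSize() {
        return commitLog.size();
    }
}
```

In this model, a poll touches only the LiteTopics in the ready set, so its cost does not grow with the total number of LiteTopics that exist.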

Conclusion

As AI inference becomes increasingly popular, traditional message queue throttling methods struggle to meet the requirements of refined traffic control.

The refined traffic governance solution based on RocketMQ LiteTopic relies on four core characteristics: physical isolation, elastic scaling, precise throttling, and consumption suspension. It systematically addresses the two major pain points of queue head blocking and impaired concurrency efficiency, and provides AI inference scenarios with a full range of message processing guarantees, from real-time throttling in milliseconds to minute-level busy/idle scheduling.

It is worth mentioning that this solution has been applied in close cooperation with the gateway of Bailian, Alibaba Cloud's large model service platform. The gateway uses the fine-grained traffic control capability of RocketMQ LiteTopic to better manage the peak traffic and resource scheduling of AI inference requests.

Currently, the core capabilities of LiteTopic have been released in ApsaraMQ for RocketMQ 5.x. To use LiteTopic in actual business, see the product help documentation.

In the future, we will continue to explore more innovative technologies to promote the evolution of message queues in the AI era.

You are welcome to search for the DingTalk group (group ID: 110085036316) and join the RocketMQ for AI user community to exchange ideas with us.
