
The Road to Large-Scale Commercialization of Apache RocketMQ on Alibaba Cloud

This article discusses the history, commercialization, and latest developments of Apache RocketMQ.

By Zhou Xinyu – Apache Member, Apache RocketMQ PMC Member, and the R&D leader of AlibabaMQ for Apache RocketMQ

Commercialization History of AlibabaMQ for Apache RocketMQ


RocketMQ was open-sourced upon its release in 2012. From 2012 to 2015, it was refined by serving Alibaba's internal e-commerce business, and its public preview launched on Alibaba Cloud in 2015. In 2016, Alibaba Cloud RocketMQ was commercialized, and the project was donated to the Apache Software Foundation. It was also named China's most popular open-source software in 2016.

During its incubation at the foundation, Apache RocketMQ developed rapidly and graduated as a top-level Apache project (TLP) in 2017. RocketMQ 4.0 was released in the same year. Since then, the commercial edition on Alibaba Cloud and the open-source project have complemented each other and grown hand in hand. Today, they are entering the RocketMQ 5.0 era together.

With RocketMQ 5.0 released, the commercial edition on Alibaba Cloud will continue to adopt the OpenCore development model, adhere to the upstream-first principle of community development, and work with the community to build RocketMQ into a hyper-integrated data processing platform.

Product Matrix of AlibabaMQ


Alibaba Cloud builds a diversified messaging product series based on the RocketMQ messaging base.

RocketMQ is the main messaging brand of Alibaba Cloud and the preferred channel for emerging Internet businesses. Message Queue for Apache Kafka is the preferred channel for big data, AliwareMQ for MQTT is the channel for the mobile Internet and IoT, and AlibabaMQ for Apache RocketMQ is the channel for traditional business domains. MNS is a lightweight version of RocketMQ that is mainly used for application integration and provides simple queue services for platform-based applications. EventBridge is positioned as an event hub on the cloud and serves as the unified event center of Alibaba Cloud.

The AlibabaMQ product matrix is built on RocketMQ and covers the full range of application scenarios, including microservice decoupling, SaaS integration, IoT, big data, and log collection. Internally, it covers all Alibaba businesses. On the cloud, it provides high-quality messaging services for tens of thousands of enterprise customers. Together, these message queuing services give users of cloud-native services an all-in-one solution for scenarios such as the Internet, big data, and the mobile Internet.

Throughout its commercialization on Alibaba Cloud, RocketMQ has been dedicated to exploring best practices for business messaging. It has incubated a large number of business messaging features and contributed them back to the open-source community.

Business Message Exploration of RocketMQ 4.0


In the process of commercialization, RocketMQ has introduced four message types to meet different business scenarios:

Common Message: It provides extreme elasticity and massive accumulation capabilities. Built-in retries and dead-letter queues meet business requirements for retrying failed messages. It features high throughput, high availability, and low latency, and is widely used in scenarios such as application integration, asynchronous decoupling, and peak shaving.

Scheduled Message: It provides second-level precision and 40-day scheduling, mainly designed for distributed timed scheduling, task timeout processing, and other scenarios. It is currently open-source.

Ordered Message: It supports both global and local ordering and guarantees order end to end, across sending, storage, and consumption. It is designed for scenarios such as ordered event processing, transaction matching, and real-time incremental data synchronization.

Transactional Message: It is a distributed, high-performance, and highly available eventual consistency transaction solution, which is widely used in service consistency and coordination scenarios in e-commerce transaction systems and has been open-sourced.
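To make the retry and dead-letter behavior described for common messages more concrete, here is a minimal in-memory sketch. It is not RocketMQ code: the function `deliver_with_retries`, the `MAX_RETRIES` limit, and the sample handler are all hypothetical names chosen for illustration, and a real broker would also apply backoff delays between attempts.

```python
from collections import deque

MAX_RETRIES = 3  # hypothetical limit, chosen only for this illustration


def deliver_with_retries(messages, handler, max_retries=MAX_RETRIES):
    """Deliver each message; re-queue failures and dead-letter them after max_retries attempts."""
    pending = deque((msg, 0) for msg in messages)  # (message, failed attempts so far)
    consumed, dead_letters = [], []
    while pending:
        msg, attempts = pending.popleft()
        try:
            handler(msg)
            consumed.append(msg)
        except Exception:
            if attempts + 1 >= max_retries:
                # Retries exhausted: park the message for manual inspection.
                dead_letters.append(msg)
            else:
                # Put the message back so it is retried later.
                pending.append((msg, attempts + 1))
    return consumed, dead_letters


def flaky_handler(msg):
    """Sample consumer logic that always fails for one specific message."""
    if msg == "bad":
        raise ValueError("cannot process")


consumed, dead_letters = deliver_with_retries(["a", "bad", "b"], flaky_handler)
```

In this sketch, the failing message is attempted three times and then moved aside, so the healthy messages are not blocked behind it.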


During the RocketMQ 4.0 era, both the commercial and open-source versions were committed to expanding message access capabilities in every direction, so RocketMQ could connect easily to the ecosystem of open-source applications and cloud products. For example, multi-language SDKs are available commercially, and open-source SDKs cover Java, Go, Python, and C++. RocketMQ also supports the Spring ecosystem and can be used through Spring Cloud. The commercial version additionally provides a set of easy-to-use HTTP APIs with implementations in six to seven languages.

In addition to SDK access, RocketMQ is actively embracing community standards and providing access to AMQP and MQTT on the cloud product side. MQTT is open-source.

RocketMQ is also vigorously developing its connector ecosystem. Through RocketMQ connectors, it can access many data sources, including systems such as Redis, MongoDB, and Hudi.

Moreover, Alibaba Cloud EventBridge has been open-sourced. Data from Alibaba Cloud's cloud products, SaaS applications, and self-built data platforms can be introduced into RocketMQ through this product.

In short, RocketMQ 4.0 explored many directions and delivered all-around message access capabilities.


While serving Alibaba Group users and going commercial, RocketMQ has accumulated many leading capabilities in business message processing and service. For message subscription, it supports both clustering consumption and broadcasting consumption. For message processing, it supports flexible filtering based on tags and SQL. SQL-based filtering is an important feature in e-commerce transactions because it keeps the delivery ratio low even when subscription ratios are abnormal.
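The two filtering styles mentioned above can be illustrated with a small, self-contained sketch. It assumes the `TagA || TagB` style of tag expression, with `*` matching everything. The helper names, the message layout, and the property predicate (a plain Python stand-in for an SQL-style expression) are hypothetical and do not use the RocketMQ client API.

```python
def matches_tags(expression, tag):
    """Return True if a message tag satisfies a 'TagA || TagB' style expression."""
    if expression.strip() == "*":
        return True  # wildcard subscription: accept every tag
    wanted = {part.strip() for part in expression.split("||")}
    return tag in wanted


def matches_properties(msg):
    """Stand-in for an SQL-style filter such as: region = 'hangzhou' AND amount > 100."""
    props = msg["properties"]
    return props.get("region") == "hangzhou" and props.get("amount", 0) > 100


# Hypothetical messages; the field names are assumptions for this example.
messages = [
    {"tag": "TagA", "properties": {"region": "hangzhou", "amount": 250}},
    {"tag": "TagB", "properties": {"region": "beijing", "amount": 900}},
    {"tag": "TagC", "properties": {"region": "hangzhou", "amount": 500}},
]

by_tag = [m for m in messages if matches_tags("TagA || TagB", m["tag"])]
by_sql = [m for m in messages if matches_properties(m)]
```

Tag filtering only looks at a single label per message, while property-based filtering can combine several attributes, which is why it suits subscriptions that need only a narrow slice of a busy topic.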

Global message routing is high-performance and real-time. In the cloud era, data centers are distributed across regions and isolated by VPC networks. The global message routing feature connects these regions and networks to serve more business scenarios. For example, Alibaba has implemented enterprise-level features such as active geo-redundancy and geo-disaster recovery based on this capability.

Global message routing is also easy to use. It provides a visual task management interface where you can create replication links through simple configuration.

In terms of message governance, RocketMQ provides capabilities (such as access control, namespace, instance throttling, message playback, retry messages, dead-letter messages, and accumulation governance).

In terms of service capabilities, RocketMQ has a long track record. Twelve years of serving the transaction path and ten years of participation in Double 11 ensure that RocketMQ can provide high reliability on Alibaba Cloud. During Double 11, the peak TPS for message sending and receiving exceeds 100 million, and the total daily volume of messages sent and received exceeds 3 trillion, making it one of the largest business messaging clusters in the world. Even under the trillion-level data peak of Double 11, 99.996% of messages are responded to within one millisecond. The average response time of message publishing does not exceed three milliseconds, and the maximum response time does not exceed 20 milliseconds, which delivers low-latency message publishing.


At the beginning of commercialization, the biggest problem customers encountered was how to track asynchronous messages end to end in a distributed environment. In response, we created the industry's first visualized, full-lifecycle message tracing system, which provides a wide range of capabilities, including message query, message download, targeted redelivery, and trace tracking. This observability system helps users solve observability problems in distributed environments.

As shown in the preceding figure, a message is generated, sent to the server for storage, and delivered to the consumer. Both sending and consumption can be traced, including which consumers the message was delivered to, which consumers succeeded or failed to consume it, and when it was redelivered.


In addition to functional features, RocketMQ has invested heavily in stability. We insist that the SLA is the foundation of cloud-native. Therefore, the entire R&D and O&M pipeline has strict stability assurance measures.

  • Architecture Development: Each solution includes a design for failure. The code development phase involves strict code review, followed by unit tests, integration tests, performance tests, and disaster recovery tests.
  • Change Management: There is a strict change system, so each change can be grayscaled, monitored, rolled back, and degraded.
  • Stability Protection: It provides capabilities (such as throttling, degradation, capacity assessment, emergency solutions, and large-scale promotion protection). It regularly conducts fault and plan drills and sorts out risks.
  • Systematic Inspection: A comprehensive black-box inspection of the production environment runs on the cloud. From the user's perspective, a full-function scan covering more than 50 detection items is performed in every region, so any functional problem can be detected immediately. For white-box inspection, JVM runtime metrics, kernel and system metrics, and cluster metrics are inspected.
  • Fault Emergency: A complete fault emergency process is available, including monitoring and alert, fault occurrence, immediate remediation, root cause troubleshooting, and fault review.

An Upgrade to the RocketMQ 5.0 Cloud-Native Architecture

In the cloud-native era, cloud users have higher requirements for the service quality, flexibility, controllability, and resilience of cloud products. Against this background, we upgraded RocketMQ to a cloud-native architecture, which gave rise to RocketMQ 5.0.


Lightweight SDK: A set of lightweight SDKs is developed based on gRPC, the cloud-native communication standard. They complement the existing rich clients.

Stateless Message Gateway: A stateless service node, the Proxy, is introduced on the core data path and exposes services through load balancers (LB). Storage nodes are separated out and are responsible only for core message storage and high availability. The Proxy is deployed separately from the Store nodes and scales independently.

Leaderless High-Availability Architecture: Store nodes are fully peer-to-peer and leaderless. High availability is achieved without depending on ZooKeeper (ZK) or separate HA control nodes. Compared with the traditional Raft consensus protocol, this leaderless architecture can select the number of replicas flexibly, upgrade and downgrade automatically between synchronous and asynchronous replication, and fail over within seconds. The high-availability architecture is now open-source and integrated with DLedger.

Cloud-Native Infrastructure: Observability is cloud-native and standardized on OpenTelemetry. The overall architecture is moving towards Kubernetes, which can make full use of the resource elasticity of the sales zone.


The recommended way to access RocketMQ 4.0 is through rich clients. Rich clients provide a range of enterprise-class features, such as client-side load balancing, message caching, and failover. In the cloud-native era, however, lightweight and high-performance clients are easier to integrate with cloud-native technology stacks.

Therefore, RocketMQ 5.0 launched a new multi-language lightweight SDK with the following benefits:

New Simple API Design: The API is immutable and has thorough error handling. The multi-language SDKs keep their APIs aligned at the native level. A new Simple Consumer is introduced that supports consumption based on the message model. Users no longer need to care about message queues. They only need to pay attention to messages.

The gRPC Protocol Adopted by the Communication Layer: The SDK embraces the cloud-native communication standard. gRPC makes services easier to integrate, and the communication code of the multi-language SDKs can be generated quickly by gRPC, which is more native.

Lightweight Implementation: The stateless consumption mode reduces the complexity of the client implementation. The client is lighter, and applications that use it are easier to move to Serverless and Mesh architectures.

Cloud-Native Observability: The client implements the OpenTelemetry standard and can export Metrics and Tracing in the form of OpenTelemetry.


Another major upgrade of RocketMQ 5.0 is the introduction of a new stateless consumption model, which is built on top of the original queue model. The queue model is a consumption model that mirrors the storage model. Consumers perform load balancing and pull messages based on queues. This model is ideal for fast batch pulling and for scenarios that are insensitive to the status of a single message, such as stream computing.
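As an illustration of queue-based load balancing in the queue model, the sketch below splits a topic's queues evenly among the consumers of a group. It is a simplified stand-in, not the RocketMQ client implementation, and the function name `allocate_queues` and the sample identifiers are hypothetical. Every consumer runs the same deterministic calculation over the same sorted inputs, so all of them agree on the split without coordinating.

```python
def allocate_queues(queue_ids, consumer_ids, me):
    """Evenly split sorted queues among sorted consumers and return the share for `me`."""
    queues = sorted(queue_ids)
    consumers = sorted(consumer_ids)
    index = consumers.index(me)
    # Each consumer gets `base` queues; the first `extra` consumers get one more.
    base, extra = divmod(len(queues), len(consumers))
    start = index * base + min(index, extra)
    size = base + (1 if index < extra else 0)
    return queues[start:start + size]


queues = list(range(8))          # eight queues of one topic
consumers = ["c1", "c2", "c3"]   # three consumers in the same group
assignment = {c: allocate_queues(queues, consumers, c) for c in consumers}
```

Because the unit of assignment is a whole queue, a consumer that stalls on one message holds up everything behind it in that queue, which is one reason a per-message model is attractive for business messaging.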

RocketMQ 5.0 released the PoP mechanism, which builds a message model on top of the queue model. In this message model, the business only needs to care about messages, not queues. The APIs support consuming, retrying, changing the invisible time of, and deleting a single message.

In the message model, a message becomes visible to consumers after it is sent and stored. When a consumer receives the message, it becomes invisible for a set duration. If that duration times out, the message becomes visible again and can be consumed by other consumers. After a consumer acknowledges the message, the server deletes it, and it is no longer visible.
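The visibility lifecycle described above can be modeled with a toy in-memory class. This is a sketch under simplifying assumptions, not the PoP implementation: the class `PopQueue` and its method names are hypothetical, time is passed in as a plain number instead of being read from a clock, and there is no persistence or concurrency control.

```python
class PopQueue:
    """Toy in-memory model of pop, acknowledge, and change-invisible-time semantics."""

    def __init__(self):
        self._messages = {}    # message id -> body
        self._visible_at = {}  # message id -> logical time at which it becomes visible
        self._next_id = 0

    def send(self, body, now):
        """Store a message; it is visible to consumers immediately."""
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = body
        self._visible_at[msg_id] = now
        return msg_id

    def pop(self, now, invisible_duration):
        """Return one visible message and hide it for `invisible_duration`."""
        for msg_id in sorted(self._messages):
            if self._visible_at[msg_id] <= now:
                self._visible_at[msg_id] = now + invisible_duration
                return msg_id, self._messages[msg_id]
        return None  # nothing is visible right now

    def change_invisible_duration(self, msg_id, now, invisible_duration):
        """Extend or shorten the time before an in-flight message reappears."""
        self._visible_at[msg_id] = now + invisible_duration

    def ack(self, msg_id):
        """Acknowledge a message so the server deletes it."""
        self._messages.pop(msg_id, None)
        self._visible_at.pop(msg_id, None)


q = PopQueue()
q.send("order-1", now=0)
first = q.pop(now=0, invisible_duration=30)     # delivered, hidden until t=30
hidden = q.pop(now=10, invisible_duration=30)   # still invisible
retried = q.pop(now=30, invisible_duration=30)  # timed out, visible again
q.ack(retried[0])
after_ack = q.pop(now=100, invisible_duration=30)
```

All state lives on the server side of this model, keyed by message rather than by queue, so a consumer needs no local bookkeeping between calls.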

With consumption based on the message model, the API is message-oriented rather than queue-oriented. Moreover, when the PoP mechanism is combined with the stateless Proxy, every node outside the storage layer is stateless. Clients, connections, and consumption are stateless as well and can drift to any Proxy node, which makes the client truly lightweight.


After this reconstruction, the observability of RocketMQ 5.0 moved toward cloud-native standards.

Metrics Side

  • Wide-Ranging Metrics: Metrics (including the number of messages, number of accumulated messages, and time consumed in each stage) are aggregated and displayed based on dimensions of instance, topic, and consumer group ID.
  • Best Practices for Messaging Teams: It provides continuously updated best-practice templates.
  • Prometheus + Grafana: Metrics are exposed in the standard Prometheus data format and displayed in Grafana. In addition to templates, users can customize their own dashboards.
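To show what metrics aggregated by instance, topic, and consumer group look like in the Prometheus text format, here is a small rendering sketch. The metric name `demo_consumer_lag_messages` and the label values are invented for this example and are not the metric names that RocketMQ actually exports.

```python
def render_gauge(name, help_text, samples):
    """Render gauge samples in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Sort labels so the output is stable from run to run.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_gauge(
    "demo_consumer_lag_messages",  # hypothetical metric name
    "Accumulated messages per consumer group (illustrative only).",
    [({"instance": "inst-1", "topic": "orders", "consumer_group": "billing"}, 42)],
)
```

A Grafana dashboard can then group or filter on the `instance`, `topic`, and `consumer_group` labels, which corresponds to the aggregation dimensions listed above.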

Tracing Side

  • OpenTelemetry Tracing Standard: The RocketMQ tracing standard is merged into the open-source OpenTelemetry standard to provide many scenarios for messaging tracing.
  • Customized Display of Messages: Message Queue for Apache RocketMQ reorganizes abstract request span data based on messages to offer an intuitive display for information regarding one-to-many consumption and repeated consumption.
  • Connection between Upstream and Downstream Trace Information: The context of calls can be inherited and added to the complete trace in message tracing. Message trace data includes the upstream and downstream information about asynchronous traces.

Logging Side

  • Error Code Standardization: A unique error code is assigned to each type of error.
  • Error Message Integrity: An error message includes complete error information and resource information required for troubleshooting.
  • Error Level Standardization: Message Queue for Apache RocketMQ allows users to configure log levels for error messages. This way, users can configure suitable monitoring and alert settings based on log levels (such as Error and Warn).


In terms of elasticity, the commercial edition of RocketMQ 5.0 takes full advantage of pooled cloud computing, storage, and network resources. For computing, all RocketMQ 5.0 workloads are deployed on ACK, which makes full use of the elasticity and elastic resources of ACK. It relies mainly on two ACK technologies: the elastic resource pool, and HPA, which supports fast elastic scaling of computing resources. Cross-zone deployment is implemented on ACK to ensure high availability.

At the network level, RocketMQ 5.0 makes full use of Alibaba Cloud network facilities to provide users with convenient network access. For example, Internet access to a RocketMQ 5.0 instance can be enabled on demand whenever it is needed for testing and turned off immediately afterward, which is safe and convenient. It also supports various private network types, including Single Tunnel and Private Link. A globally interconnected network is built based on CEN.

In terms of storage, the commercial edition of RocketMQ 5.0 is the first to introduce the concept of multi-level storage. It builds secondary storage on OSS, which makes full use of the elasticity of OSS storage. Storage billing has also shifted to pay-as-you-go. Users can customize the message storage duration on RocketMQ. For example, the retention period of messages can be extended from three days to 30 days to turn messages into data assets. Secondary storage separates hot and cold data and provides a consistent cold read SLA for users.

RocketMQ 5.0 Commercial Edition Preview

After five years of development, the open-source and commercial versions of RocketMQ have moved from 4.0 into the 5.0 era together. At the end of July 2022, AlibabaMQ released a new commercial version 5.0 based on the open-source version.


RocketMQ 5.0 instances have the following major changes compared with version 4.0 instances:

  1. The new version costs less. It adopts new billing methods, including annual and monthly subscription, pay-as-you-go, and flexible billing for Internet traffic. The product lineup is more complete, with additions such as professional instances that meet the needs of specific users. A dedicated test environment instance is also added to each product series so users can build their development environment at a low cost.
  2. It is more elastic, which reduces costs and improves efficiency. Storage is fully elastic: it is serverless, used on demand, and billed on a pay-as-you-go basis. Reserved elasticity supports real-time upgrades and downgrades of basic instance specifications, so users can easily scale before traffic arrives. In addition, the professional edition supports burst traffic elasticity, which helps address potential online stability risks.
  3. The new architecture enhances observability and O&M. The stateless message consumption model solves pain points of earlier versions, and observability adopts the cloud-native access stack.

A New Form of Messages: EventBridge

EventBridge is open-source in the RocketMQ community. In the cloud-native era, events are everywhere, cloud computing resources are scattered everywhere, and ecosystem silos can be seen everywhere. Therefore, integrating them in an event-driven way is the general trend.

Based on this, Alibaba Cloud has launched a new event-based product, EventBridge. This product is built on RocketMQ and is an event-driven architecture practice on RocketMQ.


Event sources of EventBridge include control events of Alibaba Cloud services, such as resource change events, audit events, configuration change events, and data events of Alibaba Cloud services, as well as custom applications, SaaS applications, user-created data platforms, and services of other cloud vendors.

After being processed by EventBridge, an event is delivered to an event target that includes function compute, message service, user-created gateway, HTTP(S), SMS, email, and DingTalk.

Between the event source and the event target, EventBridge provides complete event processing. After an event source is connected to EventBridge, events can be filtered, converted, archived, and played back. Events are observable throughout their path in EventBridge, with support for event query and tracing analysis. Events can be accessed in various ways, including OpenAPI, SDKs for seven languages, the CloudEvents SDK, the web console, and Webhooks.

EventBridge has the following features:

① It reduces user development costs. Users do not need additional development. An event-driven architecture can be implemented by creating resources such as event sources, event targets, and event rules. Users can write event rules to filter and convert events.

② It natively supports CloudEvents, embraces the CNCF community, and connects seamlessly to community SDKs. The standard protocol unifies Alibaba Cloud event specifications.

③ Event Schema Support: It supports automatic detection and verification of event schemas, as well as schema binding between the source and the target.

④ Global Event Interoperability: A global event interoperability network is established. It is a cross-region and cross-account event network that supports event routing across clouds and data centers.
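The following sketch illustrates the idea of filtering and converting a CloudEvents-style event with a rule. The event uses the required CloudEvents 1.0 attributes (`specversion`, `id`, `source`, `type`) plus the optional `subject` and `data`. The rule format, the `matches` and `transform` helpers, and all attribute values are hypothetical and do not reflect the actual EventBridge rule syntax.

```python
def matches(pattern, event):
    """Return True if, for every key in the pattern, the event value is one of the allowed values."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def transform(event):
    """Convert the event into the (hypothetical) shape a target expects."""
    return {"title": event["type"], "resource": event["subject"]}


# A CloudEvents 1.0 style event; the values are made up for illustration.
event = {
    "specversion": "1.0",
    "id": "evt-0001",
    "source": "example.storage",
    "type": "example.storage.object.created",
    "subject": "bucket-a/report.csv",
    "data": {"size": 1024},
}

# Hypothetical rule: keep only object-created events from one source.
rule = {
    "source": ["example.storage"],
    "type": ["example.storage.object.created"],
}

delivered = transform(event) if matches(rule, event) else None
```

Keeping the rule as data rather than code is what lets users set up filtering and conversion by creating resources instead of writing and deploying their own glue services.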


EventBridge has begun to take shape in the cloud ecosystem and has integrated more than 255 cloud product event sources and over 1000 event types.

EventBridge takes the lead in integrating the messaging ecosystem. The whole matrix of Alibaba Cloud messaging products is connected through EventBridge, so data can flow between any two messaging products. In addition, EventBridge's global event network gives global message routing capabilities to all messaging products.

Internally, EventBridge is currently connected to DingTalk ISVs and CloudTmall ISVs. Externally, more than 50 SaaS systems can be accessed through Webhooks. In addition, the large number of event sources can reach more than ten event targets, and EventBridge has been connected to all cloud product APIs, so any event can drive any cloud product API.

Join the Apache RocketMQ community: https://github.com/apache/rocketmq
