RocketMQ Schema – Make Messages Streaming Structured Data

This article explains the importance of Schema, its architecture, and more.

By Yibin Xu (Senior R&D Engineer of Alibaba Cloud Intelligence)

Why Do We Need Schema?


Currently, RocketMQ does not place any constraints on the message body. It can be JSON, an object's toString output, a word, or a log line. Serialization and deserialization are left to the users. Upstream and downstream businesses must agree on how to interpret the message body before they can communicate over RocketMQ. This leads to the following two problems.

  1. Type-Safety Risk: If producers and consumers come from different teams and the upstream makes a minor but incompatible change to the data format, the downstream can no longer process the data normally, and recovery is slow.
  2. The Application Extension Problem: In R&D scenarios, although RocketMQ decouples the link, the upstream and downstream still need a lot of communication and coordination during the R&D phase to agree on what messages mean. The coupling remains strong, and refactoring the producer side requires changing the consumer side along with it. In data stream scenarios, if schemas are not defined, the entire data parsing logic needs to be rewritten each time we build an ETL pipeline.


RocketMQ Schema allows you to manage the data structure of messages. It also provides a variety of serialization and deserialization SDKs (such as Avro, JSON, and Protobuf) for native clients. This makes up for the shortcomings of RocketMQ in data governance and in decoupling upstream and downstream services.

As shown in the preceding figure, when you create topics on Kafka commercial edition, you are reminded to maintain the schemas related to the topics. If the schemas are maintained, the upstream and downstream businesses can clearly understand what data needs to be passed in when they see the topics. This improves R&D efficiency.


We hope RocketMQ can be used in app business scenarios, IoT messaging scenarios, and big data scenarios to become the business hub of the entire enterprise.

After RSQLDB is added, users can use SQL to analyze RocketMQ data. RocketMQ can then serve both as a communication pipeline with stream characteristics and as a data store, that is, a database. As RocketMQ moves closer to both a streaming engine and a DB engine, its data definition, standardization, and governance become extremely important.


We expect RocketMQ to have the following benefits in business messaging scenarios (after adding schemas):

  1. Govern Data: Avoid dirty data in messages and prevent producers from generating messages with irregular formats.
  2. Improve R&D Efficiency: Reduce communication costs at the upstream and downstream R&D stage or joint debugging stage of the business.
  3. Host Contract: Decouple the upstream and downstream of the business in a real sense after managing contracts.
  4. Improve the Robustness of the Entire System: Avoid data exceptions (such as sudden failure to parse downstream data).

We expect RocketMQ to have the following benefits in streaming scenarios:

  1. Govern Data: Ensure the smoothness of data parsing throughout the link.
  2. Improve Transfer Efficiency: Improve the transmission efficiency of the entire link as schemas are independently managed without the need to attach them to data.
  3. Promote the Integration of Message-Stream-Table: Topics can become dynamic tables.
  4. Support More Serialization Methods to Save Message Storage Costs: JSON is used to parse data in most business scenarios. Avro, which is commonly used in big data scenarios, can reduce message storage costs.

Overall Architecture


The preceding figure shows the overall architecture after Schema Registry is introduced. Schema Registry is introduced under the original core architecture of producers, brokers, and consumers to host data structures of the message body.

The lower layer consists of the schema management APIs, including creating, updating, deleting, and binding. Producers and consumers interact with these APIs as follows. Producers serialize messages before sending them to brokers. During serialization, producers query metadata from the registry and then parse the schema. Consumers query schemas by ID or by topic and then perform deserialization. When sending and receiving messages, you only need to care about the structured object and do not need to care about how the data is serialized or deserialized.



Schema Registry is deployed in a similar way to NameServer and is deployed separately from brokers. Therefore, brokers do not have to rely heavily on Schema Registry. Registry nodes use a stateless deployment mode and can be dynamically scaled. For persistence, the Compact Topic feature of RocketMQ 5.0 is used by default. You can also implement storage plug-ins based on MySQL or Git. The management interface is RESTful and supports creating, deleting, updating, and querying schemas. It also supports binding a schema to, or unbinding it from, multiple topics.

After the application is started, it provides a built-in Swagger UI for interaction. Versions evolve per SchemaName, each with a corresponding compatibility check, and seven compatibility policies are supported. In terms of metadata, each schema version exposes a globally unique RecordID to you. After you obtain the RecordID, you can go to the registry to find the unique schema version.


The code is designed as shown in the preceding figure. It mainly exposes a RESTful interface as a Spring Boot application. Under the Controller is the Service layer, which involves permission verification, jar package management, and StoreManager. StoreManager includes local cache and remote persistence.


The core concepts of the Schema Registry are aligned with the RocketMQ kernel. For example, clusters in the registry correspond to clusters in the kernel, Tenant corresponds to NameSpace, and subjects correspond to topics in the kernel. Each schema has a unique SchemaName. You can use the Java class name or full path name of your application as the SchemaName to ensure it is globally unique and can be bound to subjects. Each schema has a unique ID generated by the snowflake algorithm on the server. Each update of SchemaVersion does not change the ID but generates a monotonically increasing version number. Therefore, a schema can have multiple different versions.

The ID and version are combined to form a new concept, the record ID, which is exposed to users for uniquely locating a schema version. SchemaType includes common serialization types (such as Avro, JSON, and Protobuf). IDL is used to describe the structured information of a schema.
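One way to picture how an ID and a version can be combined into a single record ID is bit packing. The sketch below is only an illustration: the class name, method names, and the 14-bit version width are assumptions made for this example, not the registry's documented encoding, and it assumes the schema ID is small enough to leave room for the version bits.

```java
// Hypothetical sketch: the bit layout is an assumption for illustration only.
final class RecordIdSketch {
    // Assumed layout: low 14 bits hold the version, the higher bits hold the schema ID.
    static final int VERSION_BITS = 14;
    static final long VERSION_MASK = (1L << VERSION_BITS) - 1;
    static final long MAX_SCHEMA_ID = (1L << (63 - VERSION_BITS)) - 1;

    // Pack a schema ID and a version number into one long.
    static long toRecordId(long schemaId, long version) {
        if (schemaId < 0 || schemaId > MAX_SCHEMA_ID) {
            throw new IllegalArgumentException("schemaId out of range: " + schemaId);
        }
        if (version < 0 || version > VERSION_MASK) {
            throw new IllegalArgumentException("version out of range: " + version);
        }
        return (schemaId << VERSION_BITS) | version;
    }

    // Recover the two parts from a packed record ID.
    static long schemaIdOf(long recordId) {
        return recordId >>> VERSION_BITS;
    }

    static long versionOf(long recordId) {
        return recordId & VERSION_MASK;
    }

    public static void main(String[] args) {
        long recordId = toRecordId(1001L, 3L);
        System.out.println(schemaIdOf(recordId) + " v" + versionOf(recordId));
    }
}
```

With a layout like this, the schema ID stays stable across updates while each new version yields a different record ID, so one number is enough to locate a specific schema version.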


Each schema has an ID. The ID remains unchanged, but versions can be iterated, for example, from version 1 to version 2 to version 3. Each version can be bound to different subjects. Subjects can be understood as Flink tables. For example, in the right figure, Flink SQL is used to create a table. First, create RocketMQ topics and register them with NameServer. Because a table has a structure, you must also create schemas and register them with subjects. Therefore, after schemas are introduced, RocketMQ can be seamlessly compatible with data engines (such as Flink).


Schemas store the following types of information:

  • Meta Information: Type, name, ID, attribution, and compatibility
  • Specific Contents of Each Version: Version number, IDL, fields in IDL, jar package information, and bound subject
  • Naming Information: Cluster, tenant, and subject
  • Audit Information
  • Reserved Attribute


The specific storage design is divided into the following three layers:

  1. Client Cache: If producers and consumers interacted with the registry every time they sent or received a message, performance and stability would suffer. Therefore, RocketMQ implements a layer of cache on the client. Schemas are updated relatively infrequently, so the cache can serve most requests made while sending and receiving messages.
  2. Server Cache: RocksDB is used as a layer of cache on the server. Thanks to RocksDB, service restarts and upgrades do not affect the cached data.
  3. Server-Side Persistence: Remote storage is implemented by plug-ins. The compact topic feature of RocketMQ 5.0 supports KV storage.

The local cache is kept in sync with remote persistence by a PushConsumer in the registry, which listens for changes and applies them.
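The client-side cache layer can be sketched as a map in front of a registry lookup. This is a minimal illustration under stated assumptions: CachingSchemaClient, getIdl, and invalidate are hypothetical names, and the registry call is passed in as a plain function rather than being a real HTTP client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

// Hypothetical client-side cache; the names are illustrative, not the SDK's API.
final class CachingSchemaClient {
    // Record ID -> schema definition (IDL) already fetched from the registry.
    private final Map<Long, String> idlByRecordId = new ConcurrentHashMap<>();
    // Stand-in for the remote call to Schema Registry.
    private final LongFunction<String> registryLookup;

    CachingSchemaClient(LongFunction<String> registryLookup) {
        this.registryLookup = registryLookup;
    }

    // Return the cached IDL, contacting the registry only on a cache miss.
    String getIdl(long recordId) {
        return idlByRecordId.computeIfAbsent(recordId, id -> registryLookup.apply(id));
    }

    // Drop one entry, for example after learning that the schema changed.
    void invalidate(long recordId) {
        idlByRecordId.remove(recordId);
    }
}
```

Because a given record ID always points at one schema version, entries keyed this way rarely need to be refreshed, which is why a simple cache can absorb most lookups.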


Currently, Schema Registry supports seven compatibility policies. The default is backward, and Xiaomi's internal practice has also verified that the default strategy is sufficient. Under this policy, the check verifies that consumers are compatible with producers. After a schema evolves, consumers need to be upgraded first, and a higher-version consumer can read data from a lower-version producer.

If the compatibility policy is backward_transitive, consumers are compatible with all earlier versions of producers.
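The difference between the two policies can be shown with a deliberately simplified model. In the sketch below, a schema version is reduced to a map from field name to "has a default value", and a reader can read a writer's data if every reader field either exists in the writer or has a default. This ignores field types and is not the registry's actual checker; all names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Simplified model of compatibility checks; not the registry's real implementation.
final class BackwardCompatibilitySketch {

    // A reader schema can read data written with a writer schema if each reader
    // field is either present in the writer or carries a default value.
    static boolean canRead(Map<String, Boolean> reader, Map<String, Boolean> writer) {
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean presentInWriter = writer.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (!presentInWriter && !hasDefault) {
                return false;
            }
        }
        return true;
    }

    // backward: the candidate only has to read data written with the latest version.
    static boolean isBackward(Map<String, Boolean> candidate, List<Map<String, Boolean>> history) {
        return history.isEmpty() || canRead(candidate, history.get(history.size() - 1));
    }

    // backward_transitive: the candidate has to read data written with every earlier version.
    static boolean isBackwardTransitive(Map<String, Boolean> candidate, List<Map<String, Boolean>> history) {
        for (Map<String, Boolean> older : history) {
            if (!canRead(candidate, older)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a candidate that requires a field without a default passes the backward check if the latest version already has that field, but fails the transitive check if the first version did not.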


The interface design complies with the Open Schema standard. After the registry service is started, you can open the Swagger UI page on the local host, initiate HTTP requests from it, and manage schemas yourself.

Client Design


During message sending and receiving, MQ clients need to provide SDKs for schema query, message serialization, and message deserialization.

As shown in the figure above, clients previously passed byte arrays when sending and receiving messages. Now, we want both producers and consumers to work with objects. If consumers do not know the class of an object, they can read the message through a generic type (such as a generic record). Therefore, the data that users send and receive is structured data, similar to public class Order.

Schemas can be automatically created or updated on the producer end. Mainstream serialization methods (such as Avro and JSON) are supported.


The design principle is to avoid intruding on the original client code. If schemas are not used, message sending and receiving are unaffected, because the client is not aware of the schema itself and is only aware of the serialization and deserialization types. In addition, the client supports parsing by the latest version or by a specified ID during serialization. Message parsing without Schema Registry is also supported to meet the needs of lightweight scenarios (such as streams).

The preceding code shows the serialization and deserialization of the core APIs of a schema. The parameters are simple. As long as topics and the original message objects are passed in, they can be serialized to message body formats. Deserialization is the same. If subjects and the original byte arrays are passed in, the objects are parsed and passed to clients.
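The shape of such an API can be sketched as a pair of interfaces that take a subject plus either an object or a byte array. The interface names, the Order class, and the comma-separated encoding below are assumptions for illustration. A real schema-aware implementation would look up the schema in the registry and use a format such as Avro or JSON instead of this hand-written one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical API shape: subject plus object in, bytes out, and the reverse.
interface Serializer<T> {
    byte[] serialize(String subject, T originMessage);
}

interface Deserializer<T> {
    T deserialize(String subject, byte[] payload);
}

// Example structured message, in the spirit of "public class Order".
final class Order {
    final String orderId;
    final long amountInCents;

    Order(String orderId, long amountInCents) {
        this.orderId = orderId;
        this.amountInCents = amountInCents;
    }
}

// Stand-in codec using "orderId,amount" text; it ignores the subject, whereas a
// schema-aware codec would use it to find the schema in the registry.
final class OrderCodec implements Serializer<Order>, Deserializer<Order> {
    @Override
    public byte[] serialize(String subject, Order order) {
        String line = order.orderId + "," + order.amountInCents;
        return line.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Order deserialize(String subject, byte[] payload) {
        String line = new String(payload, StandardCharsets.UTF_8);
        // Split on the last comma so order IDs containing commas still parse.
        int split = line.lastIndexOf(',');
        if (split < 0) {
            throw new IllegalArgumentException("malformed payload for subject " + subject);
        }
        return new Order(line.substring(0, split), Long.parseLong(line.substring(split + 1)));
    }
}
```

The point of the design is that application code on both sides only touches Order objects, while the byte-level format stays behind the two interfaces.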


The preceding figure shows a sample producer that integrates schemas. To create the producer, you need to specify the registry URL and the serialization type. When sending, you pass the original object instead of a byte array.


When creating consumers, you must specify the registry URL and serialization types and then use the getMessage method to obtain generic or actual objects.

Implementation in ETL Scenarios


The RocketMQ Flink catalog is mainly used to describe metadata (such as Tables and Databases of RocketMQ Flink). Therefore, some concepts need to be naturally aligned when you implement them based on Schema Registry. For example, catalogs correspond to clusters, databases correspond to tenants, and subjects correspond to tables.


When converting between disparate data sources, an important step is converting their schemas. This process involves converters. ConnectRecord transfers data and schemas together. If converters rely on the registry to manage schemas as a third party, ConnectRecord does not need to carry the original data and schemas together. This improves transmission efficiency. It is also the motivation for integrating RocketMQ Connect with Schema Registry.


The motivation for integrating with RocketMQ Streams is to make the RocketMQ Streams APIs more user-friendly. Without schema integration, you need to convert data into JSON. After the integration, you can work with objects directly during stream analysis, which is closer to the usage habits of Flink or other stream APIs.

In the preceding code, the parameter schemaConfig is added to configure schemas, including the serialization type and the target Java class. The subsequent filter, map, and window operators can then simply operate on objects.

In addition, the integrated streams support parsing of basic types, group-by operations on messages, and custom deserialization optimizers.

Future Planning


In the future, we will continue to work in the following areas:

  1. Developing Community SIG: The group has grown from nothing. There are still many to-do items that have not been completed yet. There are also many good first issues suitable for new community members to try.
  2. Strengthening the Concept of Table: If RocketMQ wants to move closer to a streaming engine, it needs to keep strengthening the concept of Table. Therefore, the introduction of schema is a good opportunity to upgrade the topic concept of RocketMQ to the table concept and promote the deep integration of message-stream-table.
  3. Providing Schema Management with No-Server: The introduced registry adds a dependency on an external component. Therefore, we hope to provide no-server schema management for scenarios that emphasize being lightweight. For example, clients could interact directly with RocketMQ and read and write schemas persisted in compact topics, or store schemas based on Git.
  4. Achieving Column-Based Query: After integrating schemas into streams, we found that messages can be consumed and understood by field. Currently, RocketMQ messages are understood by row, so the entire message body needs to be consumed during parsing and computing. Streams can already consume messages based on fields. In the future, we expect RocketMQ to support querying messages based on conditions and fields.
  5. Handling Data Lineage or Data Map: When RocketMQ uses features (such as hierarchical storage) to extend the message lifecycle, its data can be considered a data asset of an enterprise. The current pain point lies in the dashboards provided by RocketMQ: it is difficult for business personnel to perceive the business semantics behind topics. If we can handle data lineage and clarify the upstream and downstream relationships among data topics (such as who is producing data and which fields and information are provided), the dashboard can offer a business view from the perspective of messages. This leaves a lot of room for imagination.