Distributed End-to-End Tracing Analysis of Message Queue for Apache RocketMQ x OpenTelemetry

This article discusses RocketMQ 5.0 and distributed end-to-end tracing analysis, best practices, and trends/thoughts.

By Yangkun Ai (Apache RocketMQ PMC Member/Committer, CNCF OpenTelemetry Member, and CNCF Envoy Contributor)

In a distributed system, services interact through complex network communication and data transmission, and each service may be developed and maintained by a different team or organization. A single request is therefore sent to and processed by multiple services, and when a problem or error occurs, it is difficult to locate the root cause quickly. Distributed end-to-end tracing analysis helps solve this problem. It tracks and records how a request travels through the system and provides detailed performance data and logs, so developers can quickly diagnose and locate the problem. It plays an important role in the reliability, performance, and maintainability of distributed systems.

RocketMQ 5.0 and Distributed End-to-End Tracing Analysis

RocketMQ 5.0 is the largest iteration of Apache RocketMQ in recent years and brings many improvements to overall observability. Among them, support for standardized distributed end-to-end tracing analysis is an important feature.

RocketMQ 5.0 Observability

As the official successor of OpenTracing and OpenCensus, CNCF OpenTelemetry (jointly launched by Google, Microsoft, Uber, and LightStep) has become the de facto standard in the observability field. The distributed end-to-end tracing analysis of RocketMQ is also developed around OpenTelemetry.

The origins of distributed tracing analysis systems can be traced back to Google's 2010 paper, Dapper, a Large-Scale Distributed Systems Tracing Infrastructure [1]. The paper details Dapper, the tracing analysis system used internally by Google. Its span concept was widely adopted and became one of the basic concepts of later open-source tracing analysis systems.

Dapper Trace Tree

In Dapper, a span is created whenever a request or transaction is tracked. It records the time and status information of each component and operation involved in processing that request or transaction. Spans can be nested to form a tree, which represents the dependencies and calling relationships between the components involved. Later, many open-source tracing analysis systems (such as Zipkin and OpenTracing) adopted a similar span concept to describe tracing information in distributed systems. CNCF OpenTelemetry, which merges OpenTracing and OpenCensus, naturally adopted the span concept and developed it further.
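To make the tree structure concrete, here is a minimal sketch of a span tree in plain Java. It is not Dapper's or OpenTelemetry's actual data model: the SpanRecord class, its fields, and the span names are hypothetical and exist only for illustration. Each span stores the ID of its parent, and the tree is rebuilt by grouping spans by parent ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpanTreeSketch {

    // One node of the trace tree (hypothetical model). parentId is null for the root span.
    static final class SpanRecord {
        final String spanId;
        final String parentId;
        final String name;
        final long startMs;
        final long endMs;

        SpanRecord(String spanId, String parentId, String name, long startMs, long endMs) {
            this.spanId = spanId;
            this.parentId = parentId;
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    // Indexes spans by parent ID so the tree can be walked from the root down.
    static Map<String, List<SpanRecord>> childrenByParent(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> index = new LinkedHashMap<>();
        for (SpanRecord span : spans) {
            if (span.parentId != null) {
                index.computeIfAbsent(span.parentId, k -> new ArrayList<>()).add(span);
            }
        }
        return index;
    }

    // Appends the subtree rooted at `span`, one line per span, indented by depth.
    static void render(SpanRecord span, Map<String, List<SpanRecord>> index, int depth, StringBuilder out) {
        out.append("  ".repeat(depth)).append(span.name)
           .append(" (").append(span.endMs - span.startMs).append(" ms)\n");
        for (SpanRecord child : index.getOrDefault(span.spanId, List.of())) {
            render(child, index, depth + 1, out);
        }
    }

    // Renders every root span (parentId == null) together with its descendants.
    static String renderTrace(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> index = childrenByParent(spans);
        StringBuilder out = new StringBuilder();
        for (SpanRecord span : spans) {
            if (span.parentId == null) {
                render(span, index, 0, out);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<SpanRecord> spans = List.of(
            new SpanRecord("1", null, "frontend.request", 0, 100),
            new SpanRecord("2", "1", "backend.call", 10, 60),
            new SpanRecord("3", "2", "database.query", 20, 40),
            new SpanRecord("4", "1", "cache.lookup", 65, 80));
        System.out.print(renderTrace(spans));
    }
}
```

Running main prints the root span first and each child indented under its parent, which is the same shape as the trace tree shown in tracing UIs.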

OpenTelemetry defines a set of semantic conventions [2] for the span related to messaging, which aims to develop a set of specifications independent of a specific messaging system. OpenTelemetry's development is driven by specifications.

Specification Driven Development

Messaging Span Definition

The specification describes the topological relationships of messaging spans, including the parent-child and link relationships between the spans that represent message sending, receiving, and processing. Please refer to Semantic Conventions of Messaging [3] for specific definitions. There are three types of spans in Message Queue for Apache RocketMQ.

Span | Description
send | The sending of a message. The span starts when sending begins and ends when sending succeeds or fails with an exception. Each internal retry of message sending is recorded as a separate span.
receive | The long-polling request through which a consumer receives messages. The lifecycle of the span matches that of the long-polling request.
process | The processing of a message in the MessageListener of PushConsumer. The span starts when the MessageListener is entered and ends when the MessageListener is left.

The receive span is not enabled by default. The relationships between spans differ depending on whether the receive span is enabled.

Span Relationships before and after Enabling Receive Span

If the receive span is not enabled, the process span is used as the child of the send span. If the receive span is enabled, the process span is used as the child of the receive span and is linked to the send span.
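The two cases can be written down as a small decision function. This is only a sketch of the rule stated above, not code from the RocketMQ instrumentation; the class and method names are hypothetical, and spans are represented by plain strings.

```java
import java.util.List;

class MessagingSpanTopology {

    // Describes how the process span is attached to the rest of the trace.
    static final class ProcessSpanContext {
        final String parent;       // the span that the process span is a child of
        final List<String> links;  // spans that the process span is linked to

        ProcessSpanContext(String parent, List<String> links) {
            this.parent = parent;
            this.links = links;
        }

        @Override
        public String toString() {
            return "parent=" + parent + ", links=" + links;
        }
    }

    static ProcessSpanContext forProcessSpan(boolean receiveSpanEnabled) {
        if (receiveSpanEnabled) {
            // The receive span becomes the parent; the send span is only linked.
            return new ProcessSpanContext("receive", List.of("send"));
        }
        // Without a receive span, the process span hangs directly under the send span.
        return new ProcessSpanContext("send", List.of());
    }

    public static void main(String[] args) {
        System.out.println("receive disabled: " + forProcessSpan(false));
        System.out.println("receive enabled:  " + forProcessSpan(true));
    }
}
```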

Messaging Attributes Definition

The semantic convention specifies the uniform names of the common attributes carried with the span, including (but not limited to):

  • messaging.message.id: The unique identifier of the message
  • messaging.destination: The destination of the message, usually a queue or topic name
  • messaging.operation: The type of operation on the message (such as sending, receiving, and confirming)

Please see Messaging Attributes [4] for more information.

In addition, different messaging systems may have their own specific behaviors and attributes. RocketMQ, together with Kafka and RabbitMQ, has contributed its system-specific attributes to the community specification [5], including:

Attribute | Type | Description
messaging.rocketmq.namespace | string | The RocketMQ resource namespace (currently not enabled).
messaging.rocketmq.client_group | string | The RocketMQ producer/consumer load balancing group. In RocketMQ 5.0, it only applies to consumers.
messaging.rocketmq.client_id | string | The unique identifier of the client.
messaging.rocketmq.message.delivery_timestamp | int | The scheduled delivery time of a scheduled message. Only applies to RocketMQ 5.0.
messaging.rocketmq.message.delay_time_level | int | The delay level of a scheduled message. Only applies to RocketMQ 4.0.
messaging.rocketmq.message.group | string | The message group of an ordered message. Only applies to RocketMQ 5.0.
messaging.rocketmq.message.type | string | The message type: normal, fifo, delay, or transaction. Only applies to RocketMQ 5.0.
messaging.rocketmq.message.tag | string | The message tag.
messaging.rocketmq.message.keys | string[] | The message keys (a message can have multiple keys).
messaging.rocketmq.consumption_model | string | The message consumption model: clustering or broadcasting. Broadcasting was dropped in RocketMQ 5.0.
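As an illustration of how these attribute names fit together, the following sketch assembles the attributes of a send span as a plain map. It does not use the OpenTelemetry API, and the helper name, the sample values, and the rule "a message with a group is fifo, otherwise normal" are assumptions made for this example (delay and transaction messages are left out).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RocketMqSpanAttributes {

    // Builds the attribute map for a send span (hypothetical helper).
    static Map<String, Object> forSend(String messageId, String topic, String clientId,
                                       String messageGroup, String tag, List<String> keys) {
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("messaging.message.id", messageId);
        attrs.put("messaging.destination", topic);
        attrs.put("messaging.rocketmq.client_id", clientId);
        // Assumption for this sketch: only normal and fifo messages are distinguished.
        if (messageGroup != null) {
            attrs.put("messaging.rocketmq.message.type", "fifo");
            attrs.put("messaging.rocketmq.message.group", messageGroup);
        } else {
            attrs.put("messaging.rocketmq.message.type", "normal");
        }
        if (tag != null) {
            attrs.put("messaging.rocketmq.message.tag", tag);
        }
        if (!keys.isEmpty()) {
            attrs.put("messaging.rocketmq.message.keys", List.copyOf(keys));
        }
        return attrs;
    }

    public static void main(String[] args) {
        // All values below are made up for the example.
        Map<String, Object> attrs = forSend(
            "msg-0001", "example-topic", "client-1", "order-42", "tagA", List.of("key1", "key2"));
        attrs.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```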

Getting Started

There are two different ways to add observability information to an application in OpenTelemetry.

  1. Automatic Instrumentation: You do not need to write any code. Simple configuration is enough to generate observability information automatically, including for the class libraries and frameworks used in the application, so you can more easily obtain basic performance and behavior data.
  2. Manual Instrumentation: You need to write code to create and manage observability data and export it to a specified destination through an exporter. This allows you to control the logic and functions you want more flexibly.

In Java, the former is the more common approach. Tracing in the Message Queue for Apache RocketMQ 5.0 client is also implemented through automatic instrumentation, which in a Java program takes the form of attaching a Java agent. Over the past year, we have contributed the RocketMQ 5.0 client instrumentation [6] to the OpenTelemetry community. Now, you only need to attach the OpenTelemetry agent when the Java program starts to get distributed end-to-end tracing analysis that is transparent to the application.

In addition, Automatic Instrumentation does not conflict with Manual Instrumentation. The key objects used in Automatic Instrumentation are registered as global objects, so they can be easily obtained from Manual Instrumentation code. Because the two kinds of instrumentation share one set of configurations, combining them is flexible and convenient.

Prepare a Message Queue for Apache RocketMQ 5.0 client for Java. Please see the example [7] for more information. Please refer to the rocketmq-clients repository [8] and the official RocketMQ website [9] for more details about RocketMQ 5.0.


Then, prepare the OpenTelemetry agent jar. You can download the latest agent [10] from OpenTelemetry and add -javaagent:yourpath/opentelemetry-javaagent.jar when the application starts.

You can set the access point of the OpenTelemetry collector by setting the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

By default, only the send and process spans are enabled, in line with the OpenTelemetry messaging specification. To enable the receive span, you need to manually set -Dotel.instrumentation.messaging.experimental.receive-telemetry.enabled=true.
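The three settings above (the -javaagent option, the OTEL_EXPORTER_OTLP_ENDPOINT environment variable, and the receive-span system property) can be combined in one launch command. The sketch below assembles that command with ProcessBuilder. The class name, the app.jar name, and the http://localhost:4317 endpoint are placeholders chosen for this example, and the process is only prepared, not started.

```java
import java.util.ArrayList;
import java.util.List;

class AgentLaunchSketch {

    // Assembles the JVM command line for an application with the OpenTelemetry agent attached.
    static List<String> buildCommand(String agentJar, String appJar, boolean enableReceiveSpan) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-javaagent:" + agentJar);
        if (enableReceiveSpan) {
            // The receive span is off by default and must be enabled explicitly.
            command.add("-Dotel.instrumentation.messaging.experimental.receive-telemetry.enabled=true");
        }
        command.add("-jar");
        command.add(appJar);
        return command;
    }

    // Attaches the collector endpoint as an environment variable of the child process.
    static ProcessBuilder prepare(List<String> command, String otlpEndpoint) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().put("OTEL_EXPORTER_OTLP_ENDPOINT", otlpEndpoint);
        return builder.inheritIO();
    }

    public static void main(String[] args) {
        // Placeholder paths and endpoint; replace them with your own values.
        List<String> command = buildCommand("yourpath/opentelemetry-javaagent.jar", "app.jar", true);
        ProcessBuilder builder = prepare(command, "http://localhost:4317");
        System.out.println(String.join(" ", builder.command()));
        // Calling builder.start() would launch the application; it is omitted in this sketch.
    }
}
```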

Best Practices

Currently, mainstream cloud service providers provide good support for OpenTelemetry. Both SLS and ARMS on Alibaba Cloud provide distributed end-to-end tracing analysis services based on OpenTelemetry.

This code example (rocketmq-opentelemetry [11]) demonstrates the process of distributed end-to-end tracing analysis. It starts three different processes, which involve mutual calls between three different class libraries and business logic. This represents a typical case of fairly complex middleware interaction in a distributed environment.

First, a gRPC client sends a request to a gRPC server. After receiving the request, the gRPC server uses a RocketMQ 5.0 producer to send a message and then returns a response to the client.

After receiving the message, the PushConsumer of RocketMQ 5.0 uses Apache HttpClient in the MessageListener to send a GET request to Taobao.com.

Sample Code Call Link

In particular, gRPC clients initiate specific calls within the lifecycle of an upstream service span, which we call ExampleUpstreamSpan.

After receiving a message, the RocketMQ 5.0 PushConsumer also performs other business operations in the MessageListener. The corresponding span is called ExampleDownstreamSpan. By default, if the receive span is not enabled, seven spans exist in order of start time. They are:

  • ExampleUpstreamSpan
  • The span of the gRPC client request
  • The span of the response from the gRPC server
  • The send span of the RocketMQ 5.0 producer
  • The process span of the RocketMQ 5.0 consumer
  • The span of the HTTP request
  • ExampleDownstreamSpan

Connect RocketMQ 5.0 to Log Service Trace Service

Create a trace service in Alibaba Cloud Log Service. Then, obtain the endpoint, project, and instance name. Please see Use OpenTelemetry to connect to Log Service Trace Service [12] for more information.

After you add this information, wait a moment, and the corresponding trace data will appear in the SLS trace service.

Distributed End-to-End Display of the Log Service Trace Service

The Trace service stores relevant data in logs, so these data can be queried using the SQL syntax of SLS.

Trace data allows you to easily see the user's operating system environment, Java version, and other basic information. It also carries a series of useful details (such as the message sending latency, whether sending failed, whether the message was delivered to the client on time, the local consumption time on the client, and whether consumption failed) that help troubleshoot problems effectively.

In addition, the demo page of the SLS trace service provides a message middleware dashboard customized for RocketMQ 5.0. It uses trace data to display a series of metrics (such as the sending success rate and end-to-end latency).

  • Message Middleware Analysis Tab [13]: Display a series of metrics obtained from trace data, including the sending latency, sending success rate, consumption success rate, and end-to-end latency
  • View RocketMQ Trace Tab [14]: You can perform fine-grained queries based on the error message id obtained in the previous step.
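To show how such metrics can be derived from trace data, here is a minimal sketch in plain Java. It is not how SLS computes its dashboard: the MessageTrace class is hypothetical, and end-to-end latency is assumed here to mean the time from the start of the send span to the end of the process span.

```java
import java.util.List;

class TraceMetricsSketch {

    // Per-message summary extracted from the send and process spans (hypothetical model).
    static final class MessageTrace {
        final String messageId;
        final long sendStartMs;
        final boolean sendSucceeded;
        final boolean processed;   // false if no process span was recorded
        final long processEndMs;   // only meaningful when processed is true

        MessageTrace(String messageId, long sendStartMs, boolean sendSucceeded,
                     boolean processed, long processEndMs) {
            this.messageId = messageId;
            this.sendStartMs = sendStartMs;
            this.sendSucceeded = sendSucceeded;
            this.processed = processed;
            this.processEndMs = processEndMs;
        }
    }

    // Fraction of messages whose send span ended successfully; NaN for an empty input.
    static double sendSuccessRate(List<MessageTrace> traces) {
        if (traces.isEmpty()) {
            return Double.NaN;
        }
        long succeeded = traces.stream().filter(t -> t.sendSucceeded).count();
        return (double) succeeded / traces.size();
    }

    // Mean time from send start to process end, over messages that were sent and processed.
    static double averageEndToEndLatencyMs(List<MessageTrace> traces) {
        return traces.stream()
            .filter(t -> t.sendSucceeded && t.processed)
            .mapToLong(t -> t.processEndMs - t.sendStartMs)
            .average()
            .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        List<MessageTrace> traces = List.of(
            new MessageTrace("msg-1", 0, true, true, 120),
            new MessageTrace("msg-2", 10, true, true, 50),
            new MessageTrace("msg-3", 20, false, false, 0));
        System.out.println("send success rate: " + sendSuccessRate(traces));
        System.out.println("average end-to-end latency (ms): " + averageEndToEndLatencyMs(traces));
    }
}
```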

Message Middleware Analysis

Connect RocketMQ 5.0 to Application Real-Time Monitoring Service (ARMS)

Log on to the ARMS console, click OpenTelemetry in the access center, select Auto Detection under Java Applications, obtain the startup parameters, and adapt them to your Java application. Please see Use OpenTelemetry to access ARMS [15] for more information.

After configuring the parameters, start your related application. After a while, you can see the corresponding data in the ARMS Trace Explorer.

Trace Explorer

You can view the timing relationships between spans.

ARMS Trace Explorer Distributed End-to-End Tracing Analysis Display

Specifically, you can click each span to view detailed information (such as attributes, resources, and events). In addition, ARMS allows you to forward trace data from applications using OpenTelemetry Collector.

Trends and Thoughts

With the continuous evolution of modern application architecture, the importance of observability has become increasingly prominent. It helps us quickly find and solve problems in the system and improves the reliability and performance of the application. It is also a key part of implementing DevOps. Star companies such as Datadog and Dynatrace have also emerged in this field.

In recent years, some emerging technologies, such as Extended Berkeley Packet Filter (eBPF) and Service Mesh, have also provided new ideas for observability.

eBPF runs at the kernel level and can dynamically inject code to monitor system behavior. It is widely used in real-time network and system performance monitoring, security auditing, and debugging, and it has little performance impact. It may also become an option for continuous profiling in the future. Service Mesh implements traffic management, security, and observability by injecting a proxy layer between applications. The proxy layer can collect and report various metrics and metadata about traffic, which helps us understand the behavior and performance of the components in the system.

Many of the technical trends reflected in Service Mesh have been applied to the RocketMQ 5.0 proxy, and we are also moving more observability metrics into the proxy. In the future, we are considering associating the current client-side trace with the server to build an all-around tracing analysis system that covers the user side, the O&M side, and multiple applications. In addition, technologies such as Exemplars can link trace data with metrics data, so troubleshooting can move from an overall view to a single link and then to a single point.
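The idea behind exemplars can be sketched in a few lines: each histogram bucket of a latency metric keeps the trace ID of one representative observation, so a spike seen in the metric leads directly to a concrete trace. The class below is a simplified illustration, not the OpenTelemetry or RocketMQ implementation. The bucket bounds are arbitrary, and keeping the slowest observation as the exemplar is a policy assumed for this example.

```java
class ExemplarHistogramSketch {

    private final long[] upperBoundsMs;      // ascending, inclusive upper bounds
    private final long[] counts;             // one extra slot for the overflow bucket
    private final long[] exemplarValuesMs;   // latency of the stored exemplar per bucket
    private final String[] exemplarTraceIds; // trace ID of the stored exemplar per bucket

    ExemplarHistogramSketch(long... upperBoundsMs) {
        this.upperBoundsMs = upperBoundsMs.clone();
        int bucketCount = upperBoundsMs.length + 1;
        this.counts = new long[bucketCount];
        this.exemplarValuesMs = new long[bucketCount];
        this.exemplarTraceIds = new String[bucketCount];
    }

    // Finds the first bucket whose upper bound is not exceeded by the latency.
    private int bucketIndex(long latencyMs) {
        int i = 0;
        while (i < upperBoundsMs.length && latencyMs > upperBoundsMs[i]) {
            i++;
        }
        return i;
    }

    // Records one observation and keeps the slowest one per bucket as its exemplar.
    void record(long latencyMs, String traceId) {
        int i = bucketIndex(latencyMs);
        counts[i]++;
        if (exemplarTraceIds[i] == null || latencyMs > exemplarValuesMs[i]) {
            exemplarValuesMs[i] = latencyMs;
            exemplarTraceIds[i] = traceId;
        }
    }

    long countFor(long latencyMs) {
        return counts[bucketIndex(latencyMs)];
    }

    // Returns the trace ID to open when investigating the bucket that contains latencyMs.
    String exemplarTraceIdFor(long latencyMs) {
        return exemplarTraceIds[bucketIndex(latencyMs)];
    }

    public static void main(String[] args) {
        ExemplarHistogramSketch histogram = new ExemplarHistogramSketch(10, 100);
        histogram.record(5, "trace-a");
        histogram.record(50, "trace-b");
        histogram.record(80, "trace-c");
        histogram.record(500, "trace-d");
        System.out.println("exemplar for the 10-100 ms bucket: " + histogram.exemplarTraceIdFor(60));
    }
}
```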

In the observability field, RocketMQ is constantly exploring more advanced observability methods to help developers and customers find hidden dangers in the system faster and more easily.

Related Links

[1] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf

[2] A set of semantic conventions

[3] Semantic Conventions of Messaging

[4] Messaging Attributes

[5] System-specific attributes of RocketMQ, Kafka, and RabbitMQ in the community specification

[6] Instrumentation based on RocketMQ 5.0 client

[7] Example

[8] rocketmq-clients repository

[9] Official RocketMQ website

[10] Download the latest agent

[11] rocketmq-opentelemetry

[12] Use OpenTelemetry to access the Log Service trace service

[13] Message middleware analysis Tab

[14] View RocketMQ Trace Tab

[15] Connect to ARMS by using OpenTelemetry


Message Queue for Apache RocketMQ 5.0 Client:

OpenTelemetry Instrumentation for RocketMQ 5.0:

An Example of OpenTelemetry in Message Queue for Apache RocketMQ:
