Symptoms

When I use my Message Queue for Apache RocketMQ instance, I receive a message accumulation alert. I then log on to the Message Queue for Apache RocketMQ console and find the following problems:
  • On the Group Details page, the value of the Real-time Accumulated Messages field for the group ID is higher than expected.
  • In the left-side navigation pane, I click Message Tracing. On the page that appears, I click Create Query Task. In the Create Query Task dialog box, I click the Query by Message ID tab and set the parameters. The query results show that some messages are sent to the broker but are not delivered to consumers.

Causes

In Message Queue for Apache RocketMQ, messages are first sent to the broker. Then, the client that is configured with the group ID pulls messages from the broker to the local machine for consumption based on the current consumption offset. In most cases, pulling messages from the broker does not cause accumulation. However, if each message takes a long time to consume or the consumption concurrency is low, the client cannot consume messages as fast as they arrive, and messages accumulate. For more information about the consumption mechanism and message accumulation causes, see Message accumulation and latency.
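
The relationship between consumption time, concurrency, and accumulation can be illustrated with a rough back-of-the-envelope estimate. The following sketch assumes a simplified model in which every consumption thread handles one message at a time and each message takes the same average time. The function names estimate_capacity and backlog_growth are hypothetical helpers for this illustration and are not part of any Message Queue for Apache RocketMQ tool.

```shell
# Simplified model (an assumption for illustration, not a RocketMQ formula):
#   capacity (messages/second) = nodes * threads_per_node * 1000 / avg_consume_ms
estimate_capacity() {
  nodes=$1
  threads_per_node=$2
  avg_consume_ms=$3
  echo $(( nodes * threads_per_node * 1000 / avg_consume_ms ))
}

# Prints how many messages per second pile up when the send rate exceeds
# the consumption capacity. Prints 0 when the consumers keep up.
backlog_growth() {
  send_rate=$1
  capacity=$2
  if [ "$send_rate" -gt "$capacity" ]; then
    echo $(( send_rate - capacity ))
  else
    echo 0
  fi
}
```

Under this model, estimate_capacity 2 20 100 (2 nodes, 20 threads each, 100 ms per message) prints 400, and backlog_growth 500 400 prints 100. Either shortening the consumption time or raising the concurrency increases the capacity, which matches the two causes described above.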

Solutions

If messages are accumulated, perform the following operations for troubleshooting:

  1. Determine whether messages are accumulated on the Message Queue for Apache RocketMQ broker or client.
    Search the local log file ons.log of the client for the following information:
    the cached message count exceeds the threshold
    • If the preceding information is found, the local buffer queue on the client is full and messages are accumulated on the client. Go to Step 2.
    • If the preceding information is not found, messages are not accumulated on the client. Then, you can submit a ticket to contact Alibaba Cloud Customer Services.
  2. Check whether the message consumption time is reasonable.
    • If the consumption is time-consuming, go to Step 3 to view the client stack information and troubleshoot the specific business logic.
    • If the consumption time is normal, messages may be accumulated due to low consumption concurrency. You must gradually increase the number of consumption threads or add nodes.
    You can view the consumption time by using one of the following methods:
  3. View the client stack information. You only need to take note of the threads named ConsumeMessageThread, because these threads run the business logic that consumes messages. For more information about how to determine the thread status and modify the business logic based on specific problems, see the official Java documentation.
    You can obtain the client stack information by using one of the following methods:
    • Log on to the Message Queue for Apache RocketMQ console and view the consumer status. In the Connection Information section, view the stack information. For more information, see View the status of consumers.
    • Use the Jstack tool to print stack information.
      1. Obtain the host IP address of the consumer instance that has accumulated messages, and log on to the host. For more information, see View the consumer status.
      2. Run one of the following commands to view the PID of the Java process and note it:
        ps -ef | grep java
        jps -lm
      3. Run the following command to view the stack information:
        jstack -l pid > /tmp/pid.jstack
      4. Run the following command to view information about the thread named ConsumeMessageThread:
        cat /tmp/pid.jstack|grep ConsumeMessageThread -A 10 --color
    Common stack information is similar to the following examples:
    • Example 1: The stack is idle and has no accumulated messages.

      When the consumption thread is idle, it is in the WAITING state and waits to obtain messages from the consumption task queue.

      Stack example 1
    • Example 2: The consumption logic is waiting, for example, on a lock or in a sleep call.
      The consumption thread is blocked on an internal sleep() method, which causes slow consumption.
      Stack example 2
    • Example 3: The consumption logic is stuck on calls to external systems, such as databases or HTTP services.
      The consumption thread is blocked on an external HTTP call, which causes slow consumption.
      Stack example 3
  4. If the accumulated messages have affected your business and can be discarded, you can reset the consumer offset to skip the accumulated messages and resume consumption. For more information, see Reset consumer offsets. The consumer client must be online when you reset the consumer offset.
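
The checks in Step 1 and Step 3 can be wrapped in small shell helpers, as in the following sketch. The function names client_backlog_found and summarize_consume_threads are hypothetical. The sketch assumes that the dump was produced by jstack -l, that each thread header line starts with the quoted thread name, and that the java.lang.Thread.State line immediately follows the header.

```shell
# Step 1: succeeds (exit status 0) if the client log shows that the local
# buffer queue is full, which means messages are accumulated on the client.
client_backlog_found() {
  log_file=$1
  grep -q "the cached message count exceeds the threshold" "$log_file"
}

# Step 3: prints STATE=count for every Java thread state found among the
# ConsumeMessageThread threads in a jstack dump.
summarize_consume_threads() {
  dump_file=$1
  # Take each ConsumeMessageThread header plus the line after it, keep only
  # the state lines, and then count how often each state occurs.
  grep -A 1 '^"ConsumeMessageThread' "$dump_file" \
    | grep "java.lang.Thread.State:" \
    | awk '{print $2}' \
    | sort \
    | uniq -c \
    | awk '{print $2 "=" $1}'
}
```

For example, you might run client_backlog_found with the path of your ons.log file, and summarize_consume_threads /tmp/pid.jstack with the dump from Step 3. If most threads are reported as WAITING, the consumer is likely idle, as in Example 1. Many TIMED_WAITING or BLOCKED threads suggest that the business logic is sleeping or waiting, so inspect the full stack of those threads.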