Monitoring Kafka with Prometheus

About Kafka

What is Kafka?

Kafka is a distributed, high throughput, scalable real-time data stream platform.

Kafka is widely used in big data fields such as log collection, monitoring data aggregation, streaming data processing, online and offline analysis, and has become an indispensable part of big data ecology.

Producer: sends messages to Kafka Broker in push mode. The message sent can be the page access of the website, the server log, or the system resource information related to CPU and memory.

Kafka Broker: The server used to store messages. Kafka Broker supports horizontal expansion. The more Kafka Broker nodes, the higher the cluster throughput.

Group: Subscribe and consume messages from Kafka Broker through pull mode.

ZooKeeper: manages cluster configuration, elects leader partitions, and performs load balancing when the group changes.

Kafka Features


1. Communication mode: two communication modes are supported: queuing and publish/subscribe.

2. High throughput and low latency: On cheaper hardware, Kafka can also process hundreds of thousands of messages per second, with a minimum latency of only a few milliseconds.

3. Persistence: Kafka can persist messages to ordinary disks.

4. Scalability: Kafka cluster supports hot expansion and can dynamically add new nodes to the cluster.

5. Fault tolerance: nodes in the cluster are allowed to fail (if the number of replicas is n, n-1 nodes are allowed to fail).

Problems needing attention

1. Too many topics/partitions lead to rapid performance degradation: Too many Kafka topics/partitions (for example, for ordinary disks, there are more than 500 topics/partitions on a single machine) will lead to fragmentation of storage, and the load will increase significantly. The more topics/partitions, the higher the load, and the longer the response time for sending messages.

2. Message loss: The following two scenarios may lead to message loss, which should be avoided according to business scenarios.

• Production message: if acks= All or the number of message replicas is not greater than 1, the message may be lost when the Kafka Broker machine goes down abnormally.

• Consumer message: The consumer submits an offset when the message is not completely processed, which may result in the loss of some messages when the consumer exceptions.

3. Repeated consumption: For some reason (such as network jitter or Kafka broker outage), the producer may not receive a successful confirmation from Kafka broker, and then repeatedly send messages, eventually causing the consumer to receive multiple identical business messages. This scenario requires the idempotence of messages supported by consumers.

4. Message out of order: Kafka can only ensure the order of messages in the same partition, but not between different partitions.

5. Transaction not supported

Typical application scenarios of Kafka

1. Big data: website behavior analysis, log aggregation, application monitoring, streaming data processing, online and offline data analysis, etc.

2. Data integration: import messages into MaxCompute, OSS, RDS, Hadoop, HBase and other offline data warehouses.

3. Stream computing integration: integration with StreamCompute, E-MapReduce, Spark, Storm and other stream computing engines.

Kafka Core Concepts

Broker: a Kafka server node.

Cluster: A collection of brokers.

Message: also called Record, the carrier of information transmission in Kafka. Messages can be website page accesses, server logs, or system resource information related to CPU and memory. For the Kafka version of the message queue, messages are byte arrays.

Producer: an application that sends messages to Kafka.

Consumer: an application that receives messages from Kafka.

Consumer Group: A group of consumers with the same Group ID. When a Topic is consumed by multiple Consumer in the same Group, each message will be delivered to only one Consumer to achieve load balancing of consumption. Through the Group, you can ensure that the messages of a topic are consumed in parallel to improve the throughput of Kafka.

Topic: the topic of a message, used to classify messages. Multiple topics can be created on each broker.

Replica: Each partition has multiple replicas. When the primary partition (Leader) fails, a standby partition (Follower) will be selected as the leader. In Kafka, the default maximum number of replicas is 10, and the number of replicas cannot be greater than the number of brokers. Follower and Leader are on different machines, and the same machine can only store one replica for the same partition.

Partition: An orderly and unchanging message sequence used to store messages. A topic consists of one or more partitions, and messages in each partition are stored on one or more brokers. The order of messages in a partition is the order in which the producer sends messages.

Offset: The location information of each message in the partition is a monotonically increasing and unchanging value.

Consumption Site: the maximum site of the message consumed by the current consumer.

Stacking Amount: the total message stacking amount under the current partition, that is, the maximum bit minus the consumption point. Accumulation is a key indicator. If the accumulation is large, the Consumer may be blocked or the consumption speed cannot keep up with the production speed. At this time, it is necessary to analyze the consumer's health and try to improve the consumption speed. You can clear all stacked messages, consume from the largest point, or reset the point by time point.

Rebalance: After a consumer instance in a consumer group fails, other consumer instances automatically reallocate subscription topic partitions. Rebalance is an important means for Kafka consumer to achieve high availability.

Zookeeper: Kafka clusters rely on Zookeeper to save the cluster's meta information to ensure the availability of the system.

Introduction to main Kafka versions

• 0. x: Early incubation version.

• 1. x: Optimize Streams API, enhance observability and debugging, support Java 9, optimize SASL authentication, optimize Controller management, etc.

• 2. x: significantly improved performance, enhanced ACL support, support for OAuth2 bearers, support for dynamic SSL updates, enhanced observability, support for Java 11 (no longer Java 7), support for incremental collaboration rebalancing, etc.

• 3. x: Zookeeper dependency is removed, Java 17 is supported (Java 8 and Scala 12 are no longer supported), v0 and v1 messages are no longer supported, and performance is greatly improved.

2. x and 3. x versions are recommended.

Kafka Metric Monitoring Reference Model

In combination with Kafka's architecture and usage scenarios, we define the reference model of Kafka Metric monitoring from three aspects: metrics collection, monitoring market, and alarm indicators.

• Metrics acquisition

That is, which Kafka related metrics need to be collected to complete business monitoring. The following is a discussion of Producer, Broker and Consumer.

1. Producer indicator

Producers are applications that push messages to the Broker Topic. If producers fail, consumers will not get new information. We suggest paying attention to the following main indicators of Producer.

2. Broker indicators

Because all messages can only be used through Kafka Broker, monitoring the brokers in the cluster is the core. We suggest focusing on the following main indicators:

3. Consumer indicator

Consumer is the end point of Kafka message. We suggest focusing on the following main indicators:

4. Zookeeper indicators

ZooKeeper is a key component of Kafka (before v3. x). ZooKeeper downtime will stop Kafka.

• Kafka monitoring market

It is recommended that the default monitoring market should include at least the following indicator panels:

1. Producer

• Change of topic message production over time: It is convenient for us to quickly determine the source of traffic and provide a basis for changing the configuration of infrastructure.

• Change of request/response rate over time: paying close attention to peak and drop is critical to ensure continuous service availability.

• Change of average request delay over time: Since delay is strongly related to throughput, observing this change can help us determine whether we need to modify the batch.size parameter.

• The change of IO waiting time: If we find that the waiting time is too long, it means that the producer cannot get the required data fast enough.

2. Broker

• Change in the number of partitions with invalid replicas: If this indicator increases suddenly, it is likely that an exception occurred to a broker.

• ISR quantity change: Except that changes in broker or partition quantity will trigger changes in ISR quantity, changes in current indicators under other circumstances require our attention.

• The number of effective brokers changes.

• Change in the number of active Controller.

• Change in the number of offline partitions: If this indicator is greater than 0, it means that these partitions are unavailable, and consumers and producers of this partition will be blocked.

• Changes in leader election rate and time consumption: election means the loss of a leader; If it takes too long, the message cannot receive the producer message and the consumer request during this period.

• Request time consuming: Generally, this value should be fairly stable with little fluctuation.

• Network traffic: provide information about the location of potential bottlenecks, and provide a basis for us to determine whether to enable end-to-end compression of messages.

• Number of purgery messages produced/pulled: By observing the number of purgery messages, it helps us determine the reason for the long message production or pulling time.

3. Broker JVM

Full GC frequency and time consumption: High GC frequency or long GC time have a significant impact on broker performance. Based on this, we can determine whether the memory needs to be expanded.

4. Consumer

• Change in the number of group consumption delays: the larger the indicator, the more message stacks.

• Change of consumption traffic: display the change of network traffic/message traffic size of consumption messages.

• Change in pull data rate: an important indicator of whether consumers are healthy.

5. Zookeeper

• Changes in the number of requests to be processed.

• Change in average request response time: If the time consumption increases suddenly, the coordination mechanism of the entire Kafka cluster may be blocked.

• Changes in the number of client connections: sudden changes in the number of connections are usually accompanied by the broker joining, exiting or losing.

• Changes in the number of open file handles and the remaining number: If the remaining number is insufficient, the broker may be unable to connect to Zookeeper.

• Changes in the number of pending synchronization requests.

• Kafka alarm rules

We recommend configuring the following alarm rules:

1. Producer

• Number of failed messages to be sent: an alarm is given when the number of failed messages to be sent reaches a certain number.

• Number of retried messages sent: An alarm is given when the number of retried messages sent per unit time reaches the threshold.

• Long sending time: the sending time is more than x milliseconds.

2. Broker

• Controller is normal: the number of effective Controller is not 1.

• No offline partitions: The number of offline partitions is greater than 0.

• No Unclean Leader election: The election rate of Unclean Leader is greater than 0.

• Broker unavailable: the number of effective Broker decreases.

• ISR shrinkage: The number of ISRs of topic partitions is less than n.

• Partition unavailable: Topic partition is in Under Replicated state.

• Topic/partition capacity: The number of topics/partitions is greater than n.

• Instance message inflow/outflow: alarm when the current instance traffic exceeds or falls below a certain threshold.

• Topic message inflow/outflow: alarm when the current topic traffic exceeds or falls below the specified threshold.

• Disk capacity: the disk utilization rate is greater than x% (reference value: 85%).

• CPU utilization: more than 80%.

3. Broker JVM:

FullGC frequency: frequent FGC alarms.

4. Consumer

Message stacking: the number of group consumption delays is greater than n (the size of n is appropriately configured according to the business flow).

5. Zookeeper

The number of pending synchronization requests is greater than n.

Typical Kafka problem scenarios and troubleshooting/solutions

• Topic message sending is slow and concurrency is low

1. Scenario Description

The concurrent message sending performance of one or several topics is low. In terms of indicators, the average request latency of the Producer is large and the average production throughput is small.

2. Causes

There are several typical reasons for slow message sending:

• Insufficient network bandwidth leads to IO waiting.

• The message is not compressed, causing network traffic overload.

• Messages are not sent in batches, or the batch threshold is not properly configured, resulting in slow sending rate.

• The number of topic partitions is insufficient, resulting in a backlog of messages received by the broker.

• Broker disk performance is low, resulting in slow disk synchronization.

• The total number of broker partitions is too large, resulting in fragmentation and disk read/write overload.

3. Troubleshooting

Based on the above possible causes, we check the corresponding monitoring indicators/market one by one to locate and solve the problems:

• Confirm whether the producer's "message production" indicator has increased significantly? If there is an increase, we need to increase the number of consumers correspondingly.

• Confirm whether the "message consumption traffic" indicator of the Consumer has dropped significantly? If there is a decrease, it indicates that the time consumption of business processing of consumers has increased. We need to confirm the business consumption or increase the number of consumers.

• Through the command provided by Kafka Broker, confirm whether the number of consumers corresponding to the topic is consistent with the actual number of consumers? If they are inconsistent, it means that some consumers are not connected to the broker correctly, and you need to check whether the consumers are running normally.

• Observe whether the number of consumers changes frequently to trigger repeated rebalancing. If yes, we need to check whether each consumer is running normally.

• Due to network or other reasons, the connection between the consumer and the broker may be unstable. The consumer can continue to consume messages, but the broker always thinks that the messages are not confirmed, resulting in the same consumption point. At this time, it may be necessary to confirm the network stability between the consumer and the broker, or even restart the consumer.

4. Self built Prometheus to monitor the pain points of Kafka

The typical problems we will face in the self built Prometheus observation Kafka are:

• Due to security, organization management and other factors, user businesses are usually deployed in multiple VPCs that are isolated from each other. Prometheus needs to be deployed repeatedly and independently in multiple VPCs, resulting in high deployment and operation and maintenance costs.

• Each complete set of self built observation system needs to install and configure Prometheus, Grafana, AlertManager, etc. The process is complex and the implementation cycle is long.

• Open source Kafka JMX Agent takes up high CPU in some scenarios, which may interfere with self built Kafka services.

• For Alibaba Cloud Message Queuing Kafka (Alibaba Cloud Kafka for short), the self built Prometheus cannot monitor, which makes it impossible to achieve one-stop monitoring from a global perspective.

• For self built Kafka deployed on ECS, the self built Prometheus lacks a service discovery mechanism that is seamlessly integrated with Alibaba Cloud ECS, and cannot flexibly define and capture targets according to ECS tags. If you implement similar functions yourself, you need to use the GOlang language to develop code (call the Alibaba Cloud ECS POP interface), integrate the open source Prometheus code, compile, package, and deploy it. This is a high threshold, complex process, and difficult to upgrade versions.

• The open-source Grafana Kafka market is not professional enough and lacks in-depth optimization based on Kafka principles/features and best practices.

• There is a lack of Kafka alarm indicator template, which requires users to study and configure alarm rules themselves. The workload is heavy, and it is likely that there is a lack of expertise in the Kafka field.


Alibaba Cloud Prometheus Monitoring is a fully connected open source Prometheus ecosystem that supports a wide range of component observations, provides a variety of out of the box preset observation platforms, and provides fully hosted hybrid cloud/multi cloud Prometheus services. In addition to supporting Alibaba Cloud container services, self built Kubernetes and Remote Write, Alibaba Cloud Prometheus also provides metric observation capabilities for hybrid cloud+multi cloud ECS applications; It also supports the multi instance aggregation observation capability, realizes the unified query of Prometheus indicators, unifies the Grafana data source and unifies the alarm.

Alibaba Cloud Prometheus provides full support for Alibaba Cloud Kafka and self built Kafka.

Use Alibaba Cloud Prometheus to monitor Kafka

The following describes how to use Alibaba Cloud Prometheus to monitor Alibaba Cloud Kafka and self built Kafka.

• Use Alibaba Cloud Prometheus to monitor Alibaba Cloud Kafka

Alibaba Cloud Kafka provides full hosting services for open source Apache Kafka to solve the pain points of open source products. Users only need to focus on business development without deployment and maintenance. Compared with open source Apache Kafka, the Kafka version of message queue has lower cost, stronger elasticity and higher reliability.

Alibaba Cloud Kafka now integrates Alibaba Cloud Prometheus monitoring by default. Since Alibaba Cloud Kafka does not require user deployment and O&M, Prometheus monitoring focuses on the "using Kafka" scenario, with the following indicators:

• Traffic indicators at instance/Group/Topic levels

• Group and Topic message accumulation indicators

• Instance disk utilization indicators

• Rebalance indicators of the Group

• View Alibaba Cloud Kafka monitoring market

Alibaba Cloud Kafka provides three monitoring platforms: instance, group and topic. Through these three markets, users can quickly and clearly understand the production and consumption of Kafka messages, and quickly locate the problems encountered in using Alibaba Cloud Kafka.

Users can log in to the Alibaba Cloud Kafka console and enter the "Prometheus Monitoring" menu or tab page of the Kafka instance details interface to view.

Alibaba Cloud Kafka instance monitoring market

Alibaba Cloud Kafka consumer group monitoring market

Alibaba Cloud Kafka Topic Monitoring Market

• Use Alibaba Cloud Prometheus to configure Alibaba Cloud Kafka alerts

The user can log in to the Alibaba Cloud Prometheus console, enter the "Integration Center" interface of the "Cloud Service Monitoring" instance, select the "Cloud Product Self Monitoring Integration" tab, click the "Message Queue Kafka", and the "Alarms" tab page in the pop-up window is displayed to view and add the Prometheus alarms of Alibaba Cloud Kafka. The detailed operation steps for creating Prometheus alarms are shown in the Prometheus alarm rule document.

Alibaba Cloud Prometheus provides 13 default alarm indicators (being updated continuously), which cover the key indicators of instances, groups and topics, so that users can quickly configure common alarm rules.

• Use Alibaba Cloud Prometheus to monitor self built Kafka

In addition to Alibaba Cloud Kafka, Alibaba Cloud Prometheus also provides self built Kafka monitoring access capabilities. It supports container services (including ACK, ASK, and registered clusters) and Kafka monitoring of ECS environments, and provides basic and advanced versions:

• Basic edition: collect the number of brokers, topic partitions, message group Lag and other basic indicators, and the Kafka server can be configured or restarted without any reason.

• Advanced version: In addition to the basic version capabilities, JMX Agent collects important indicators of producers, servers, consumers and their internal modules to achieve full link, integrated expert level Kafka monitoring. However, JMX Agent injection and process restart are required.

Different from the monitoring scenario of Alibaba Cloud Kafka, users need to pay more attention to the internal indicators of "operation and maintenance Kafka" in addition to the routine indicators of "using Kafka" when building their own Kafka. Therefore, we need to deeply capture the important indicators of Kafka producers, servers, consumers and their internal modules, so as to analyze and troubleshoot the possible problems in each link of Kafka itself. Therefore, we recommend using the advanced version, so as to fully understand the operation status of the self built Kafka.

For ZooKeeper and other basic monitoring (disk, network, etc.), refer to Using Prometheus to Monitor ZooKeeper and Node Exporter Component Access and Configure Prometheus Monitoring.

• Use Alibaba Cloud Prometheus for self built Kafka basic monitoring

1. Deploy the basic monitoring of self built Kafka

Log in to the Alibaba Cloud Prometheus console, access the ARMS access center, click the "Add" button of the component application "Kafka (Basic Edition)", select the environment deployed by Kafka in the pop-up interface (currently supporting "Alibaba Cloud Container Service Environment" and "Alibaba Cloud ECS Environment"), select the Prometheus instance where Kafka is located, and then fill in the configuration information.

Configure the information required to connect Kafka:

Exporter name: the unique name of current Kafka monitoring.

Kafka Address: fill in the connection address of Kafka Broker. Multiple broker addresses are separated by commas or semicolons.

• In the container service, you can use the IP or service address of Kafka Broker.

• In the ECS environment, you can use the IP or DNS address of Kafka Broker.

Metrics collection interval (seconds): monitoring data collection interval.

Kafka Version: Select to determine the version number of the Kafka server. Currently, the maximum support is 3.2.0.

Enable SASL: Select to determine whether the Kafka server uses SASL.

SASL User Name: If SASL is enabled, fill in the corresponding user name.

SASL Password: If SASL is enabled, fill in the corresponding user name and password.

SASL method: Select to determine the SASL method. Currently, it supports plain, scram-sha512, and scram-sha256.

Enable TLS: Select to determine whether the Kafka server uses TLS.

Ignore TLS security check: If the Kafka server enables TLS and is a self signed certificate, you can choose to ignore TLS security check.

2. Check the basic monitoring market of self built Kafka

Enter the integration center of the Prometheus instance, click "Kafka (basic version)", select the "big market" tab page in the pop-up interface, and click the big market thumbnail to view the corresponding Grafana big market. The basic monitoring market mainly shows:

• Number of Kafka brokers.

• Number of partitions for each topic.

• The number of incoming/outgoing/stacked messages for each topic.

• The number of ISRs (In Sync Replicas) for each topic.

3. Configure the basic monitoring alarm of self built Kafka

Enter the integration center of the Prometheus instance, click "Kafka (basic version)", and select the "Alarm" tab page in the pop-up interface to view/add the Kafka basic version alarm rules of the current Prometheus instance. The detailed operation steps for creating Prometheus alarms are shown in the Prometheus alarm rule document.

At present, Kafka basic version monitoring provides four alarm indicators. The user can instantiate the alarm rules according to the actual situation.

• Stacking of Consumer Topic messages.

• Too many partitions: define protection alarms for Kafka Broker to avoid a sharp performance degradation caused by too many partitions.

• There is an Under Replicated partition.

• The number of effective brokers decreases.

Advanced monitoring of self built Kafka using Alibaba Cloud Prometheus

• Deploy advanced monitoring of self built Kafka

First, users need to install and configure JMX Agent on Kafka Producer, Broker and Consumer side according to the deployment and configuration documents of JMX Agent, so as to expose Kafka Metric to Alibaba Cloud Prometheus.

Then visit the ARMS access center, click the "Add" button of the component application "Kafka (Advanced Edition)", select the environment deployed by Kafka in the pop-up interface (currently supporting "Alibaba Cloud Container Service Environment" and "Alibaba Cloud ECS Environment"), select the Prometheus instance where Kafka is located, and fill in the configuration information for monitoring access.

• Exporter name: the unique name of current Kafka monitoring.

• Kafka instance name: Kafka instance name, through which kafka Producer, Broker and Consumer are associated to achieve the overall presentation of the whole Topic link.

• JMX Agent Listening Port: the listening port configured when deploying the JMX Agent.

• Metrics collection path: Prometheus collects the http path of the JMX Agent. The default is/metrics.

• Metrics acquisition interval (seconds): monitoring data acquisition interval.

• Pod/ECS tag/value: When deploying the JMX Agent, Prometheus uses the tag and tag value configured for Pod/ECS for Service Discovery.

• View the advanced monitoring market

Enter the integration center of the Prometheus instance, click "Kafka (Advanced Edition)", select the "Big Market" tab page in the pop-up interface, and click the thumbnail of the big market to view the corresponding Grafana big market. The advanced version of monitoring provides two perspectives, namely, Intrance and Topic.

• Self built Kafka Instance Market

Display the internal indicators of Kafka Broker:

• Core indicators: key information such as the number of brokers, OffLine partitions, Under Replicated partitions, number of controllers, CPU/network, etc.

• JVM indicators: display the memory and GC key information of the JVM.

• Partition indicators: display the number of partitions, ISR, Clean Leader election, Replica Lag, Offline partition, Under Replicated partition and other detailed information.

• Time indicators: display the time indicators of each environment such as Produce, Request, Fetch, etc.

• Cluster traffic indicators: display the overall traffic indicators of the cluster.

• Broker traffic indicators: traffic indicators showing broker granularity.

• Self built Kafka Topic Market

Display the full link indicators of each Kafka Topic:

• Producer: Display the key indicators of the Producer side, including message sending speed, message compression rate, sending delay, etc.

• Server (i.e. Kafka Broker): displays the number of partitions, incoming/outgoing message rate, and incoming/outgoing message traffic corresponding to the topic.

• Consumer: Display message consumption rate, consumption delay, Rebalance, etc.

• Configure advanced monitoring alarm of self built Kafka

Enter the integration center of the Prometheus instance, click "Kafka (Advanced Edition)", and select the "Alarm" tab page in the pop-up interface to view/add the Kafka Advanced Edition alarm rules of the current Prometheus instance. Kafka Advanced Edition provides alarm indicators of Producer, Instance and Consumer for users to select and configure. The detailed operation steps for creating Prometheus alarms are shown in the Prometheus alarm rule document.

• Self built Kafka Producer: It provides three alarm indicators, including message sending failure rate, message sending time consuming, and message sending retry rate, to facilitate users to alarm for exceptions at the Producer end.

• Self built Kafka Instance: 13 alarm indicators are provided, including excessive number of partitions, the existence of OffLine partitions, the existence of Unclean Leader election, the existence of Under Replicated partitions, the decrease in the number of effective brokers, the number of effective controllers, the number of instance message rejections, the amount of instance message inflow/outflow, and the amount of topic message inflow/outflow, covering all aspects of Kafka broker exceptions.

• Self built Kafka Consumer: It provides an alarm indicator for message consumption accumulation. Through this alarm rule, users can immediately master the consumption exceptions.

Concluding remarks

Alibaba Cloud Prometheus's Kafka monitoring is based on Alibaba Cloud's rich operation and maintenance practice of message queue Kafka, combined with Kafka community's operation and maintenance suggestions, and provides an integrated solution for Alibaba Cloud Kafka and self built Kafka monitoring.

According to the scene characteristics of self built Kafka, it provides basic and advanced monitoring access to meet users' Kafka monitoring needs in different scenes and depths. At the same time, it supports Kafka deployment of container service environment and ECS environment to meet the monitoring needs of users in different environments. Kafka Advanced Edition provides 200+effective metrics, 10+key big board panels, 60+auxiliary big board panels, and 17 alarm indicators (continuously updated), providing users with a full link, integrated expert level Kafka monitoring support to ensure stable business operation.

In the future, we will continue to optimize the deployment convenience of self built Kafka Advanced Edition to further simplify the user's operation of deploying JMX Agent. At the same time, the performance of JMX Agent will be further optimized to reduce the CPU consumption of self built Kafka Broker.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us