Use Managed Service for Prometheus to monitor self-managed Kafka clusters and ApsaraMQ for Kafka instances

This topic describes how to use Managed Service for Prometheus to monitor ApsaraMQ for Kafka instances and self-managed Kafka clusters.

Challenges of using a self-managed Prometheus service to monitor ApsaraMQ for Kafka instances and self-managed Kafka clusters

If you use a self-managed Prometheus service to monitor ApsaraMQ for Kafka instances and self-managed Kafka clusters, you may need to handle the following challenges:

For security and organizational reasons, workloads are often deployed in separate virtual private clouds (VPCs). To monitor them with a self-managed Prometheus service, you must deploy the service in each VPC, which increases deployment and O&M costs.
You must configure Prometheus, Grafana, and Alertmanager in each independent self-managed monitoring system, which is complex and requires a long period of time to complete.
In some cases, the JMX agent of open source Apache Kafka consumes a large amount of CPU resources, which can degrade the performance of self-managed Kafka clusters.
You cannot use the self-managed Prometheus service to monitor ApsaraMQ for Kafka instances. As a result, you cannot monitor your messaging clusters in a one-stop and centralized manner.
If your self-managed Kafka cluster is deployed on an Elastic Compute Service (ECS) instance, the self-managed Prometheus service cannot flexibly define and scrape targets based on ECS tags because it has no service discovery mechanism for ECS. To implement a similar mechanism, you must write Golang code that calls the POP API of Alibaba Cloud ECS, integrate the code into open source Prometheus, and then compile, package, and deploy the resulting build. This process is complex and makes version upgrades difficult.
The commonly used open source Grafana dashboards are not designed for specific services. You cannot customize monitoring metrics based on the principles and best practices of Apache Kafka.
No alert template is available for monitoring Apache Kafka. You must write alert rules yourself, which is labor-intensive and requires considerable expertise.

Comparison between a self-managed Prometheus service and Managed Service for Prometheus

The following table compares a self-managed Prometheus service with Managed Service for Prometheus in monitoring ApsaraMQ for Kafka instances and self-managed Kafka clusters.

Item	Self-managed Prometheus service	Managed Service for Prometheus
Deployment costs and O&M costs	You must purchase ECS instances to deploy Prometheus, Grafana, and Alertmanager in multiple VPCs. This incurs high O&M costs.	Managed Service for Prometheus is a fully managed service that is provided for immediate use and integrates Prometheus, Grafana, and Alertmanager.
Availability, performance, and storage capacity	Overall performance and availability are limited, and the storage capacity is small.	Overall performance and availability are high, and the storage capacity is large.
Exporter performance	In some cases, the JMX agent of open source Apache Kafka consumes a large amount of CPU resources, which can degrade the performance of self-managed Kafka clusters.	Managed Service for Prometheus continuously optimizes the performance and improves the stability of JMX agents of open source Apache Kafka.
Service discovery	Service discovery for ECS instances relies on open source static configurations or a third-party service registry. The process is complex and the O&M costs are high.	Managed Service for Prometheus is compatible with open source service discovery features and provides aliyun_sd_configs. Similar to the LabelSelector for Kubernetes service discovery, you can use ECS tags to identify target ECS instances. This simplifies the configuration and O&M of service discovery.
Grafana dashboard	The Grafana dashboard displays only the collected metrics. You cannot customize the monitoring metrics based on the principles and best practices of Apache Kafka.	Managed Service for Prometheus provides a professional dashboard template for monitoring Apache Kafka. You can use the dashboard to quickly and accurately understand the running status of the entire Apache Kafka process and troubleshoot issues.
Alert rule	No alert template is available for monitoring Apache Kafka. You must configure the alert rules.	Managed Service for Prometheus provides professional and flexible alert metric templates based on the best practices of monitoring Apache Kafka. You can configure alert rules on the GUI.
Unified service	The self-managed Prometheus service is deployed in multiple VPCs, and the service cannot be used to monitor ApsaraMQ for Kafka instances. As a result, you cannot monitor your messaging clusters in a one-stop and centralized manner.	Managed Service for Prometheus is a fully managed service that is integrated into ApsaraMQ for Kafka. ApsaraMQ for Kafka provides a native overall monitoring system.
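The tag-based selection mentioned in the Service discovery row can be illustrated with a minimal Python sketch. It is not the aliyun_sd_configs implementation or an Alibaba Cloud SDK call. The `EcsInstance` type, the `select_targets` function, the tag names, the IP addresses, and the port are all hypothetical and only show how matching on ECS tags resembles a Kubernetes LabelSelector.

```python
from dataclasses import dataclass, field


@dataclass
class EcsInstance:
    """Hypothetical, simplified view of an ECS instance (not an SDK type)."""
    instance_id: str
    private_ip: str
    tags: dict = field(default_factory=dict)


def select_targets(instances, required_tags, port):
    """Return "ip:port" scrape targets for instances that carry every required tag."""
    targets = []
    for inst in instances:
        # An instance matches only if each required key is present with the same value.
        if all(inst.tags.get(key) == value for key, value in required_tags.items()):
            targets.append(f"{inst.private_ip}:{port}")
    return targets


# Illustrative inventory; the IDs, IP addresses, and tags are made up.
instances = [
    EcsInstance("i-001", "192.168.0.10", {"app": "kafka", "env": "prod"}),
    EcsInstance("i-002", "192.168.0.11", {"app": "kafka", "env": "test"}),
    EcsInstance("i-003", "192.168.0.12", {"app": "mysql", "env": "prod"}),
]

# Port 9404 is an assumed exporter port, not a value from this topic.
print(select_targets(instances, {"app": "kafka", "env": "prod"}, 9404))
```

With tag-based selection, adding a broker to monitoring only requires tagging the new ECS instance. The scrape configuration itself does not change.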

Use Managed Service for Prometheus to monitor ApsaraMQ for Kafka

Managed Service for Prometheus is integrated into ApsaraMQ for Kafka. The main metrics include:

The traffic of instances, groups, and topics
The message accumulation of groups and topics
The disk usage of instances
The rebalance metrics of groups

View ApsaraMQ for Kafka dashboards

ApsaraMQ for Kafka provides three monitoring dashboards for instances, groups, and topics. You can view data on the dashboards to understand the production and consumption of messages and quickly identify issues.

Instance dashboard

Log on to the ApsaraMQ for Kafka console. In the left-side navigation pane, click Instances.
Click the name of the ApsaraMQ for Kafka instance that you want to view. In the left-side navigation pane, click Prometheus Monitoring to view the monitoring data of the instance.

Consumer group dashboard

Log on to the ApsaraMQ for Kafka console. In the left-side navigation pane, click Instances.
Click the name of the ApsaraMQ for Kafka instance that you want to view. In the left-side navigation pane, click Groups. On the page that appears, click the ID of the group that you want to view and click the Prometheus Monitoring tab to view the monitoring data of the group.

Topic dashboard

Log on to the ApsaraMQ for Kafka console. In the left-side navigation pane, click Instances.
Click the name of the ApsaraMQ for Kafka instance that you want to view. In the left-side navigation pane, click Topics. On the page that appears, click the name of the topic that you want to view and click the Prometheus Monitoring tab to view the monitoring data of the topic.

Use Managed Service for Prometheus to configure alert rules for ApsaraMQ for Kafka

Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Click the Cloud Service Self-monitoring Integration tab and click the ApsaraMQ for Kafka card in the Installed section. In the panel that appears, click the Alerts tab to view Prometheus alerts of ApsaraMQ for Kafka. Managed Service for Prometheus provides 13 key alert metrics for ApsaraMQ for Kafka instances, groups, and topics. You can add alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.

Use Managed Service for Prometheus to monitor self-managed Kafka clusters

You can also use Managed Service for Prometheus to monitor self-managed Kafka clusters that are deployed in an ECS environment or a container service environment, such as Container Service for Kubernetes (ACK) clusters, Serverless Kubernetes (ASK) clusters, and registered clusters. Managed Service for Prometheus provides a basic edition and an advanced edition of the Kafka application component:

Kafka (basic edition): Basic metrics such as the number of brokers, the number of partitions in each topic, and the consumer group lag are collected. To use this edition, you do not need to configure or restart the Kafka broker.
Kafka (advanced edition): The JMX agent collects the basic metrics and the important metrics of producers, brokers, consumers, and internal modules. These metrics let you monitor the entire Apache Kafka message flow in expert-level detail. To use this edition, you must start the JMX agent and restart the Kafka broker process.

When you use Managed Service for Prometheus to monitor self-managed Kafka clusters, you must also track internal O&M metrics. Store the important metrics of Kafka producers, brokers, consumers, and internal modules so that you can analyze and troubleshoot problems in each phase of message processing. We recommend that you use the advanced edition of the Kafka application component to understand the overall status of self-managed Kafka clusters.

Use the Kafka (basic edition) application component provided by Managed Service for Prometheus to monitor self-managed Kafka clusters

Deploy the Kafka (basic edition) application component for self-managed Kafka clusters

Log on to the ARMS console.

In the left-side navigation pane, click Integration Center. In the Application Components section, click + Add on the Kafka (Basic Edition) card and perform the following steps.

In the STEP1 section, select the environment where you want to deploy the Kafka exporter.
In the STEP2 section, select the Prometheus instance where you want the Kafka exporter to reside.

On the Configuration tab in the STEP3 section, configure parameters and click OK. The following table describes the parameters.

Parameter	Description
Exporter Name	The unique name of the exporter.
kafka address	The endpoint of the self-managed Kafka broker. Separate multiple broker addresses with commas (,) or semicolons (;). If your Kafka cluster is deployed in a container service environment, enter the IP address or service address of the Kafka broker. If your Kafka cluster is deployed in an ECS environment, enter the IP address or Domain Name System (DNS) name of the broker.
Metrics scrape interval (seconds)	The interval at which you want the service to collect monitoring data.
kafka version	The version number of the Kafka broker. The latest supported version is V3.2.0.
SASL enabled	Specifies whether to enable the Simple Authentication and Security Layer (SASL) feature on the Apache Kafka broker.
SASL username	The username for SASL authentication. This field is required if you enable SASL.
SASL password	The password for SASL authentication. This field is required if you enable SASL.
SASL mechanism	The SASL mechanism. The following authentication mechanisms are supported: PLAIN, SCRAM-SHA-512, and SCRAM-SHA-256.
TLS enabled	Specifies whether to enable the Transport Layer Security (TLS) feature on the Apache Kafka broker.
insecure skip TLS verify	Set this field to Enabled if TLS is enabled on the Kafka broker and a self-signed TLS certificate is used during authentication.
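The kafka address parameter accepts several broker addresses separated by commas or semicolons. The following minimal Python sketch shows one way to check such a value before you paste it into the console. The `parse_broker_addresses` function is hypothetical and is not part of Managed Service for Prometheus. The addresses and the `host:port` form in the example are assumptions made for illustration.

```python
import re


def parse_broker_addresses(raw):
    """Split a "kafka address" value on commas or semicolons into a clean list."""
    parts = re.split(r"[,;]", raw)
    # Trim surrounding whitespace and drop empty entries left by trailing separators.
    return [part.strip() for part in parts if part.strip()]


# Example value that mixes both separators; the addresses are made up.
raw_value = "192.168.0.10:9092, 192.168.0.11:9092;kafka-0.kafka.svc:9092;"

for address in parse_broker_addresses(raw_value):
    print(address)
```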

View the dashboards of self-managed Kafka clusters

Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Click the Cloud Service Self-monitoring Integration tab and click the Kafka (Basic Edition) card in the Installed section. In the panel that appears, click the Dashboards tab and click the diagram of the Grafana dashboard that you want to view.
The dashboards of the Kafka (basic edition) application component display the following information:
- The number of Kafka brokers.
- The number of partitions in each topic.
- The numbers of inbound messages, outbound messages, and accumulated messages in each topic.
- The number of in-sync replicas (ISRs) in each topic.

Configure alert rules for self-managed Kafka clusters

On the Integration Center page of the Prometheus instance, click the Cloud Service Self-monitoring Integration tab. In the Installed section, click the Kafka (Basic Edition) card. In the panel that appears, click the Alerts tab to view the Prometheus alerts. You can add alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.

Use the Kafka (advanced edition) application component provided by Managed Service for Prometheus to monitor self-managed Kafka clusters

Deploy the Kafka (advanced edition) application component for self-managed Kafka clusters

Log on to the ARMS console.

In the left-side navigation pane, click Integration Center. In the Application Components section, click + Add on the Kafka (Advanced Edition) card and perform the following steps.

In the STEP1 section, select the environment where you want to deploy the Kafka exporter.
In the STEP2 section, select the Prometheus instance where you want the Kafka exporter to reside.

On the Configuration tab in the STEP3 section, configure parameters and click OK. The following table describes the parameters.

Parameter	Description
Instance name	The unique name of the exporter.
Kafka instance name	The name of the Kafka instance that you want to monitor. You can specify an instance name on the dashboard to view the producer, broker, and consumer of a Kafka cluster.
JMX Agent listening port	The listening port that is specified when the JMX agent is deployed.
Metrics path	The HTTP path that is used by Prometheus to collect monitoring data from the JMX agent. Default value: `/metrics`.
Metrics scrape interval (seconds)	The interval at which you want the service to collect monitoring data.
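Pod/ECS Label Key (service discovery)	The label key that is specified for the pod or ECS instance when the JMX agent is deployed. Prometheus uses the label key-value pair for service discovery.
Pod/ECS Label value	The label value that is specified for the pod or ECS instance when the JMX agent is deployed. Prometheus uses the label key-value pair for service discovery.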
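The JMX Agent listening port and Metrics path parameters together determine the HTTP endpoint that is scraped. The following minimal Python sketch shows how you might build that URL and list the metric names that the endpoint exposes, to check the agent before you configure the component. It assumes the JMX agent serves the Prometheus text exposition format over plain HTTP. The function names, the host, port 9404, and the sample metric names are hypothetical.

```python
from urllib.parse import urlunsplit
from urllib.request import urlopen


def build_scrape_url(host, port, metrics_path="/metrics"):
    """Compose the HTTP URL that a scraper would request from the JMX agent."""
    if not metrics_path.startswith("/"):
        metrics_path = "/" + metrics_path
    return urlunsplit(("http", f"{host}:{port}", metrics_path, "", ""))


def extract_metric_names(text):
    """Return the metric names found in text-exposition-format output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name is everything before the first "{" or whitespace.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def fetch_metric_names(url, timeout=5.0):
    """Fetch the endpoint and return the metric names it exposes."""
    with urlopen(url, timeout=timeout) as response:
        return extract_metric_names(response.read().decode("utf-8"))


# Offline example with made-up sample output; no network call is made here.
sample = """# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
jvm_memory_bytes_used{area="heap"} 1.2e+08
example_kafka_messages_in_total 42
"""
print(build_scrape_url("192.168.0.10", 9404))
print(sorted(extract_metric_names(sample)))
```

To check a live agent, call `fetch_metric_names(build_scrape_url(host, port))` from a machine that can reach the broker.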

View the dashboard of self-managed Kafka clusters

Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Click the Cloud Service Self-monitoring Integration tab and click the Kafka (Advanced Edition) card in the Installed section. In the panel that appears, click the Dashboards tab and click the diagram of a Grafana dashboard that you want to view. The Kafka (advanced edition) application component provides dashboards based on instances and topics.
- Instance dashboard
  The metrics of Kafka brokers:
  - Core metrics: the numbers of brokers, offline partitions, under-replicated partitions, and controllers, and information about CPUs and networks
  - Java Virtual Machine (JVM) metrics: the key information about the JVM memory and garbage collection (GC)
  - Partition metrics: the partition information, such as the partition quantity, ISR, unclean leader election, replica lag, offline partitions, and under-replicated partitions
  - Time metrics: the time metrics in the Produce, Request, and Fetch phases
  - Cluster traffic metrics: the overall traffic metrics of the cluster
  - Broker traffic metrics: the traffic details by broker
- Topic dashboard
  The metrics of Apache Kafka topics:
  - Producer: the key metrics of the producer, including the message sending rate, message compression ratio, and message sending latency
  - Server (Kafka broker): the number of partitions in a topic, and the rates and traffic volumes of inbound messages and outbound messages
  - Consumer: the message consumption rate, message consumption latency, and rebalance information
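The partition metrics on the instance dashboard rely on two standard Apache Kafka notions. A partition is offline when it has no leader, and it is under-replicated when it has fewer in-sync replicas than assigned replicas. The following minimal Python sketch applies these two definitions to partition metadata. The dictionary layout and the `classify_partitions` function are hypothetical. They do not describe how the dashboard computes its values.

```python
NO_LEADER = -1  # Kafka reports a leader ID of -1 when a partition has no leader.


def classify_partitions(partitions):
    """Return (offline, under_replicated) lists of (topic, partition) keys."""
    offline, under_replicated = [], []
    for p in partitions:
        key = (p["topic"], p["partition"])
        # Offline: no broker currently leads the partition.
        if p["leader"] == NO_LEADER:
            offline.append(key)
        # Under-replicated: fewer in-sync replicas than assigned replicas.
        if len(p["isr"]) < len(p["replicas"]):
            under_replicated.append(key)
    return offline, under_replicated


# Made-up metadata for a three-partition topic with a replication factor of 3.
partitions = [
    {"topic": "orders", "partition": 0, "leader": 1, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "leader": 2, "replicas": [2, 3, 1], "isr": [2]},
    {"topic": "orders", "partition": 2, "leader": NO_LEADER, "replicas": [3, 1, 2], "isr": []},
]

offline, under_replicated = classify_partitions(partitions)
print("offline:", offline)
print("under-replicated:", under_replicated)
```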

Configure alert rules for self-managed Kafka clusters

Log on to the Managed Service for Prometheus console. Click the name of the Prometheus instance you want to manage. On the Integration Center page that appears, click the Cloud Service Self-monitoring Integration tab. In the Installed section, click the Kafka (Advanced Edition) card. In the panel that appears, click the Alerts tab to view the Prometheus alerts.

Producer: Three alert metrics are provided, including the message sending failure rate, message sending duration, and message sending retry rate. You can use the metrics to identify exceptions on the producer.
Instance: Thirteen alert metrics are provided, including topics with excessive partitions, offline partitions, unclean leader election, under-replicated partitions, a decrease in effective brokers, the number of effective controllers, the number of rejected messages, numbers of inbound messages and outbound messages of the instance, and numbers of inbound messages and outbound messages of topics. You can use the metrics to identify exceptions on the broker.
Consumer: The alert metric for message accumulation is provided. You can use this metric to identify exceptions on consumption.
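Message accumulation is commonly measured as consumer lag: for each partition, the log end offset minus the offset that the consumer group has committed. The following minimal Python sketch shows that arithmetic and a simple threshold check. The function names, offsets, and threshold are hypothetical and do not reflect the built-in alert rule. Treating a partition without a committed offset as starting from offset 0 is also an assumption.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus committed offset, never negative."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


def should_alert(lag_by_partition, threshold):
    """Return True when the total accumulated messages exceed the threshold."""
    return sum(lag_by_partition.values()) > threshold


# Made-up offsets for one consumer group on a three-partition topic.
log_end_offsets = {0: 1500, 1: 1200, 2: 900}
committed_offsets = {0: 1450, 1: 1200, 2: 400}

lag = partition_lag(log_end_offsets, committed_offsets)
print(lag)
print(should_alert(lag, threshold=500))
```

In this example the total lag is 550 messages, most of it on partition 2, so a threshold of 500 would trigger an alert.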

You can also add alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.