ApsaraMQ for Confluent clusters generate metrics across brokers, topics, and consumer groups. Control Center evaluates these metrics against thresholds that you define and sends notifications through email, Slack, or PagerDuty when anomalies occur. This helps you detect broker failures, consumer lag spikes, or cluster outages before they affect production workloads.
How it works
An alert consists of two parts:
Trigger: A rule that evaluates a metric against a threshold. When the condition is met, the trigger fires.
Action: A notification sent when a trigger fires. Each trigger can have one or more actions.
To set up an alert:
Create a trigger by selecting a metric, a condition (such as "greater than"), and a threshold value.
Create an action that specifies the notification channel (email, Slack, or PagerDuty) and link it to a trigger.
When the metric meets the trigger condition, all associated actions run and send notifications.
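The trigger and action relationship described above can be sketched as a small data model. This is an illustrative sketch only, not Control Center's implementation: the class names `Trigger` and `Action`, the metric name `consumer_lag`, and the threshold of 1000 are all hypothetical, and only the "greater than" condition is modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """Hypothetical notification action; channel is email, Slack, or PagerDuty."""
    name: str
    channel: str

    def run(self, message: str) -> str:
        # A real action would call the channel; this sketch only formats the notice.
        return f"[{self.channel}] {self.name}: {message}"


@dataclass
class Trigger:
    """Hypothetical trigger that fires when the metric value exceeds the threshold."""
    name: str
    metric: str
    threshold: float
    actions: List[Action] = field(default_factory=list)

    def evaluate(self, value: float) -> List[str]:
        # When the condition is met, every associated action runs.
        if value > self.threshold:
            message = f"{self.metric}={value} exceeded {self.threshold}"
            return [action.run(message) for action in self.actions]
        return []


lag_trigger = Trigger("high-lag", "consumer_lag", 1000)
lag_trigger.actions.append(Action("notify-oncall", "PagerDuty"))
lag_trigger.actions.append(Action("notify-team", "Slack"))

notifications = lag_trigger.evaluate(1500)  # condition met: both actions run
quiet = lag_trigger.evaluate(200)           # condition not met: nothing is sent
```

One trigger fans out to every linked action, which is why a single condition can notify several channels at once.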
Create a trigger
In the top navigation bar, click the alerts icon.
On the Overview page, click the Triggers tab, and then click Add a trigger.
On the New trigger page, specify the trigger name and trigger condition, and then click Save.
After you create the trigger, click the trigger name on the Triggers tab to modify or delete it.
Create an action
Actions define how notifications are delivered when a trigger fires.
On the Overview page, click the Actions tab, and then click Add an action.
On the New action page, configure the following parameters, and then click Save.
| Parameter | Description |
|---|---|
| Action Name | A name for the action. |
| Triggers | The trigger to associate with this action. |
| Action | The notification channel. Valid values: Send email (delivers notifications by email), Send PagerDuty notification (delivers notifications through PagerDuty; for setup details, see Services and Integrations), and Send Slack notification (delivers notifications through Slack incoming webhooks; for setup details, see Sending messages using incoming webhooks). |
| Subject | Email addresses of one or more alert contacts, separated by commas. Required only when Action is set to Send email. Each time the action runs, an email is sent to the specified addresses. |
| Max send rate | Maximum number of times the action runs within a given frequency. Use with the Frequency parameter. For example, set this to 1 and Frequency to Per day to limit notifications to once per day. |
| Frequency | The time interval for the send rate limit. Valid values: Per minute, Per hour, Per 4 hours, Per 8 hours, Per day. Default: Per hour. |
After you create the action, click the action name on the Actions tab to modify or delete it.
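To illustrate how Max send rate and Frequency combine, the following sketch limits sends to a maximum count per time window. It assumes a sliding window, which may differ from how Control Center actually counts sends; the class `SendRateLimiter` and its method names are hypothetical.

```python
from collections import deque

# Window lengths in seconds for the Frequency values.
FREQUENCY_SECONDS = {
    "Per minute": 60,
    "Per hour": 3600,
    "Per 4 hours": 4 * 3600,
    "Per 8 hours": 8 * 3600,
    "Per day": 24 * 3600,
}


class SendRateLimiter:
    """Hypothetical limiter: allow at most max_send_rate sends per window."""

    def __init__(self, max_send_rate: int, frequency: str = "Per hour"):
        self.max_send_rate = max_send_rate
        self.window = FREQUENCY_SECONDS[frequency]
        self.sent_at = deque()  # timestamps of sends still inside the window

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.sent_at and now - self.sent_at[0] >= self.window:
            self.sent_at.popleft()
        if len(self.sent_at) < self.max_send_rate:
            self.sent_at.append(now)
            return True
        return False


limiter = SendRateLimiter(max_send_rate=1, frequency="Per day")
first = limiter.allow(now=0)             # sent
second = limiter.allow(now=3600)         # one hour later: suppressed
next_day = limiter.allow(now=24 * 3600)  # a full day later: sent again
```

With Max send rate set to 1 and Frequency set to Per day, a trigger that stays in a firing state produces at most one notification per day from this action.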
Pause and resume all alert actions
During maintenance or troubleshooting, pause all alert actions to suppress notifications temporarily. Pausing does not change individual action settings. Each action retains its enabled or disabled state.
While paused, trigger conditions that are met are ignored and all enabled actions associated with those triggers are suppressed.
After you resume actions, triggers fire and send notifications when conditions are met again.
If you stop and restart ApsaraMQ for Confluent or Control Center, paused actions automatically resume and become active.
Pause all actions
On the Overview page, click the Actions tab.
Turn on the Pause all actions switch.
Read the confirmation message and click Confirm.
Resume all actions
On the Overview page, click the Actions tab.
Turn off the Pause all actions switch.
Read the confirmation message and click Confirm.
Disable or enable an alert action
Actions are enabled by default when created. Disable an action to prevent it from running without deleting it. Pausing and resuming respect the disabled state: resuming paused actions does not reactivate disabled actions.
On the Overview page, click the Actions tab.
Click the action to manage.
On the action details page, click Edit and turn off the Enabled switch.
To re-enable the action, repeat these steps and turn on the Enabled switch.
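The interaction between the global pause switch and the per-action Enabled switch can be sketched as two independent flags. This is a hypothetical model for illustration; the names `ActionState` and `AlertActions` are not part of Control Center.

```python
from dataclasses import dataclass


@dataclass
class ActionState:
    name: str
    enabled: bool = True  # actions are enabled by default when created


class AlertActions:
    """Hypothetical model of the global pause switch plus per-action enabled flags."""

    def __init__(self, actions):
        self.actions = {a.name: a for a in actions}
        self.paused = False

    def pause_all(self):
        self.paused = True   # per-action flags are left untouched

    def resume_all(self):
        self.paused = False  # disabled actions stay disabled

    def will_send(self, name: str) -> bool:
        # A notification goes out only if nothing is paused and the action is enabled.
        return not self.paused and self.actions[name].enabled


alerts = AlertActions([ActionState("email-ops"), ActionState("slack-dev", enabled=False)])
alerts.pause_all()
during_pause = alerts.will_send("email-ops")
alerts.resume_all()
after_resume = (alerts.will_send("email-ops"), alerts.will_send("slack-dev"))
```

After resuming, the enabled action sends again while the disabled action stays silent, because the pause switch never modifies the individual flags.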
Alert metrics reference
Control Center provides four categories of trigger metrics. Each metric monitors a specific aspect of your Kafka infrastructure.
Broker metrics
Broker triggers monitor individual broker performance.
| Metric | Description |
|---|---|
| Bytes in | Bytes produced per second. |
| Bytes out | Bytes fetched per second. Internal replication traffic is excluded. |
| Fetch request latency | Latency of fetch requests at the median, 95th, 99th, or 99.9th percentile. Unit: milliseconds. |
| Production request count | Total production requests per minute. |
| Production request latency | Latency of production requests at the median, 95th, 99th, or 99.9th percentile. Unit: milliseconds. |
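
To show why the latency metrics are offered at several percentiles, the sketch below computes the median, 95th, and 99th percentile of a small sample. It uses the nearest-rank method as an assumption; the way Control Center estimates percentiles is not stated here, and the sample values are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 17]

median = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

In this sample the median stays low while the higher percentiles surface the outlier, so a trigger on the 95th or 99th percentile catches tail latency that a median-based trigger would miss.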
Cluster metrics
Cluster triggers monitor overall cluster health and availability.
| Metric | Description | Recommended threshold |
|---|---|---|
| Cluster down | Whether a monitored cluster is shut down. | -- |
| Leader election rate | Number of partition leader elections. | -- |
| Offline topic partitions | Total topic partitions that are offline in the cluster. Partitions go offline when brokers with replicas are down, or when unclean leader election is disabled and no in-sync replica can be elected leader. The latter behavior can be desirable because it ensures that no messages are lost. | Greater than 0 |
| Unclean election count | Number of unclean partition leader elections reported in the last interval. Data loss may occur if messages were not synced before the former leader was lost. If the number of unclean elections is greater than 0, query the broker logs to determine why leaders were re-elected and search for warning or error messages. We recommend that you set the broker configuration parameter unclean.leader.election.enable to false to prevent out-of-sync replicas from being elected leader. | Not equal to 0 |
| Under replicated topic partitions | Total topic partitions where the number of in-sync replicas is less than the replication factor. | Greater than 0 |
| ZK Disconnected | Whether brokers can connect to ZooKeeper. Valid values: Offline, Online. | -- |
| ZooKeeper expiration rate | Rate at which ZooKeeper session expirations occur across brokers. | -- |
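
The under-replicated definition in the table above, combined with the recommended "Greater than 0" threshold, can be illustrated with a short sketch. The partition snapshot below is hypothetical sample data, not output from any real API.

```python
# Hypothetical snapshot: (topic, partition, replication_factor, in_sync_replicas)
partitions = [
    ("orders", 0, 3, 3),
    ("orders", 1, 3, 2),
    ("payments", 0, 3, 1),
]

# A partition is under-replicated when it has fewer in-sync replicas
# than its replication factor.
under_replicated = [(topic, p) for topic, p, rf, isr in partitions if isr < rf]

# Recommended threshold: fire when the count is greater than 0.
should_alert = len(under_replicated) > 0
```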
Consumer group metrics
Consumer group triggers detect consumption delays and performance degradation.
| Metric | Description |
|---|---|
| Average latency | Average latency of a consumer group. Requires a Confluent Monitoring Interceptor configured for clients in the consumer group. Unit: milliseconds. |
| Consumer lag | How far behind consumer applications are from producers. Calculated as the difference between the end offset and the current offset. |
| Consumer lead | How far ahead consumer applications are from the earliest available messages. Calculated as the difference between the current offset and the beginning offset. For example, a consumer at offset 15 in a partition that starts at offset 0 has a lead of 15. A shrinking lead indicates that consumption is approaching the earliest available data. When the lead nears 0, messages may be deleted by retention before the consumer reads them, so use this metric to assess the risk of data loss. |
| Consumption difference | Difference between the expected consumption value and the actual consumption value within a specific time period. A small gap close to real time is normal and diminishes over time. |
| Maximum latency | Maximum latency of a consumer group. Requires a Confluent Monitoring Interceptor configured for clients in the consumer group. Unit: milliseconds. |
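
The lag and lead calculations in the table above are simple offset differences. The sketch below works through them for a single partition; the function names are hypothetical, and the offsets other than the lead example from the table are made up.

```python
def consumer_lag(end_offset: int, current_offset: int) -> int:
    """Messages produced but not yet consumed."""
    return end_offset - current_offset


def consumer_lead(current_offset: int, beginning_offset: int) -> int:
    """Distance between the consumer and the earliest retained message."""
    return current_offset - beginning_offset


# The table's example: a consumer at offset 15 in a partition that starts at 0.
lead = consumer_lead(current_offset=15, beginning_offset=0)

# If producers have written up to offset 40, the same consumer lags by 25.
lag = consumer_lag(end_offset=40, current_offset=15)

# If retention later removes messages below offset 12, the lead shrinks to 3.
lead_after_retention = consumer_lead(current_offset=15, beginning_offset=12)
```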
Topic metrics
Topic triggers monitor data flow and replication health for specific topics.
| Metric | Description |
|---|---|
| Bytes in | Bytes coming into a topic per second. |
| Bytes out | Bytes going out of a topic per second. Internal replication traffic is excluded. |
| Out of sync replica count | Total topic partition replicas in the cluster that are out of sync with the leader. For reference, the total number of replicas for a topic is the number of topic partitions multiplied by the topic replication factor. |
| Production request count | Number of production requests to a topic in the cluster. |
| Under replicated topic partitions | Number of under-replicated topic partitions. Use this metric to determine whether a Kafka broker crash is caused by a specific topic partition. |
Trigger conditions
A trigger fires when the detected metric value meets the configured condition against the threshold.
| Condition | Fires when |
|---|---|
| Equal to | The metric value equals the threshold. |
| Greater than | The metric value exceeds the threshold. |
| Less than | The metric value is below the threshold. |
| Not equal to | The metric value differs from the threshold. |
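
The four conditions map directly onto standard comparison operators. As a minimal sketch, assuming numeric metric values, the helper `trigger_fires` below (a hypothetical name) evaluates a condition against a threshold.

```python
import operator

# Each condition name maps to the comparison it performs.
CONDITIONS = {
    "Equal to": operator.eq,
    "Greater than": operator.gt,
    "Less than": operator.lt,
    "Not equal to": operator.ne,
}


def trigger_fires(condition: str, metric_value: float, threshold: float) -> bool:
    """Return True when the metric value meets the condition against the threshold."""
    return CONDITIONS[condition](metric_value, threshold)


# Offline topic partitions with the recommended "Greater than 0" threshold.
offline_alert = trigger_fires("Greater than", 2, 0)

# Unclean election count with the recommended "Not equal to 0" threshold.
unclean_alert = trigger_fires("Not equal to", 0, 0)
```

Here the first trigger fires because two partitions are offline, while the second stays quiet because no unclean elections were reported.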
Related information
For more information about Control Center alert capabilities, see Control Center Alerts for Confluent Platform.