Microservices Engine (MSE) provides built-in alert rules that monitor CPU utilization, memory usage, garbage collection (GC) performance, and capacity limits across your MSE instances. Enable these rules to notify a contact group when any metric breaches its threshold, so you can detect and resolve issues before they affect production traffic.
Prerequisites
Before you begin, make sure that you have:
An MSE instance (Microservices Registry, Nacos, ZooKeeper, or Ingress gateway)
At least one alert contact group
Enable default alert rules
Log on to the MSE console and select a region in the top navigation bar.
In the left-side navigation pane, choose Microservices Registry > Instances.
On the Instances page, find the target instance and choose More > Configure Default Alert in the Actions column.
In the Configure Default Alert dialog box, select a contact group for Alert Contact Group and click OK.
MSE then adds the default alert rules for the selected contact group. The rules vary by instance type and edition, as described in the following sections.
Default alert rules
Microservices Registry
Applies to Basic Edition, Developer Edition, and Professional Edition instances.
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| Excessively High CPU Load in Instances | CPU utilization > 80% per node | Continuous | High CPU usage may indicate version defects or insufficient capacity. | 1. Check the Risk Management page and follow the suggested fixes. 2. If the alert persists, scale out the instance. |
| Excessively High Memory Usage in Instances | Memory usage > 90% per node | Continuous | High memory usage can lead to Out-of-Memory (OOM) errors and service disruption. | 1. Check the Risk Management page and follow the suggested fixes. 2. If the alert persists, scale out the instance. |
ZooKeeper
Basic Edition, Developer Edition, and Professional Edition
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| Excessive CMS GC Occurrences in ZooKeeper Instances | Concurrent Mark Sweep (CMS) GC count > 5 | 1 minute | Frequent CMS GC cycles indicate memory pressure or insufficient instance capacity. | 1. Scale out the instance. 2. If the alert persists, check whether the instance version has known defects and upgrade if needed. |
| Excessively Long CMS GC Duration in ZooKeeper Instances | CMS GC duration > 6 s | 1 minute | Long GC pauses can cause request timeouts and session disconnections. | 1. Scale out the instance. 2. If the alert persists, check whether the instance version has known defects and upgrade if needed. |
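The GC counts and durations referenced in these rules are standard JVM collector statistics that MSE gathers for you on the server side. Purely for illustration, the following sketch shows how such counters can be read from any JVM through the standard management API; the per-minute thresholds above correspond to the change in these cumulative counters between two samples.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounters {
    public static void main(String[] args) {
        // Each MXBean reports the cumulative collection count and time (ms) for one
        // collector, for example "ConcurrentMarkSweep" on a JVM running the CMS collector.
        // Sampling these values once a minute and taking the difference yields the
        // "count per minute" and "duration per minute" style metrics used above.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, totalTime=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```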
Serverless Edition
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| Snapshot Throttling | Snapshot size > 20 MB (limit: 25 MB) | Continuous | The maximum snapshot size is 25 MB. Exceeding 20 MB means the instance is approaching the limit. | Reduce the data stored in ZooKeeper. If you need a higher limit, submit a ticket. |
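If this alert fires, it helps to know which znodes hold the most data before you start cleaning up. The following is a minimal sketch using the open source Apache ZooKeeper Java client that walks the tree and sums data sizes; the connection address and session timeout are placeholders, and the result only approximates the snapshot size.

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeSizeAudit {
    // Recursively sums the data length of every znode under `path`,
    // so you can see which subtrees contribute most to the stored data.
    static long sizeOf(ZooKeeper zk, String path) throws Exception {
        Stat stat = new Stat();
        zk.getData(path, false, stat);
        long total = stat.getDataLength();
        List<String> children = zk.getChildren(path, false);
        for (String child : children) {
            total += sizeOf(zk, "/".equals(path) ? "/" + child : path + "/" + child);
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder connection address and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        System.out.println("Total data bytes under /: " + sizeOf(zk, "/"));
        zk.close();
    }
}
```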
Nacos
Basic Edition, Developer Edition, and Professional Edition
These rules detect GC performance issues that indicate insufficient heap memory.
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| Excessive Full GC Occurrences in Nacos Instances | Full GC count > 2 | 1 minute | Frequent full GC runs indicate insufficient heap memory or client-side misconfigurations. | 1. Check for connection leaks, duplicate registration, or duplicate subscription caused by client misconfiguration. 2. If no such issues exist, scale out or upgrade the instance. |
| Excessively Long Full GC Duration in Nacos Instances | Full GC duration > 5 s | 1 minute | Long full GC pauses block all application threads, causing request failures. | 1. Check for connection leaks, duplicate registration, or duplicate subscription caused by client misconfiguration. 2. If no such issues exist, scale out or upgrade the instance. |
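The client-side problems named in the Solution column (connection leaks, duplicate registration, and duplicate subscription) typically come from creating a new Nacos client for every call. The following is a minimal sketch of the intended pattern using the open source Nacos Java client; the server address, service name, and IP address are placeholders.

```java
import com.alibaba.nacos.api.NamingFactory;
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingService;

public class NacosClientHolder {
    // Reuse a single NamingService per process. Creating a new client per request
    // leaks connections and can register or subscribe to the same service repeatedly,
    // which shows up on the server as memory pressure and frequent full GC.
    private static volatile NamingService naming;

    public static NamingService get(String serverAddr) throws NacosException {
        if (naming == null) {
            synchronized (NacosClientHolder.class) {
                if (naming == null) {
                    naming = NamingFactory.createNamingService(serverAddr);
                }
            }
        }
        return naming;
    }

    public static void main(String[] args) throws NacosException {
        NamingService client = get("your-nacos-endpoint:8848");                // placeholder address
        client.registerInstance("demo-service", "192.168.0.10", 8080);         // register each instance once
        client.subscribe("demo-service", event -> { /* react to changes */ }); // subscribe once per service
    }
}
```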
Basic Edition, Developer Edition, Professional Edition, and Serverless Edition
These capacity alerts trigger when resource usage approaches the instance limit.
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| Excessively High Nacos Service Usage | Service usage > 90% | Continuous | The number of registered services is approaching the instance quota. | Scale out or upgrade the instance to increase the service quota. |
| Excessively High Nacos Service Provider Usage | Service provider usage > 90% | Continuous | The number of service providers is approaching the instance quota. | Scale out or upgrade the instance to increase the provider quota. |
| Excessively High Nacos Connection Usage | Connection usage > 90% | Continuous | The number of connections is approaching the instance quota. | Scale out or upgrade the instance to increase the connection quota. |
| Excessively High Nacos Configuration Usage | Configuration usage > 90% | Continuous | The number of configurations is approaching the instance quota. | Scale out or upgrade the instance to increase the configuration quota. |
| Excessively High Nacos Long Polling Usage | Long polling usage > 90% | Continuous | The number of long polling connections is approaching the instance quota. | Scale out or upgrade the instance to increase the long polling quota. |
| Excessive Decrease of Proportion of Nacos Service Providers | Provider count drops > 50% vs. 3 min ago | 3 minutes | A sudden drop in provider count may cause upstream services to lose connectivity with downstream providers. | 1. Check whether applications are being released or restarted. 2. If no deployment is in progress, verify that CPU, memory, GC, and network resources are healthy for your applications. |
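For example, if a service had 40 providers three minutes ago and only 18 remain, the decrease is (40 - 18) / 40 = 55%, which exceeds the 50% threshold and triggers the alert.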
Serverless Edition
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| TPS Throttling | TPS throttling triggered | Continuous | Transactions-per-second (TPS) throttling has activated on the instance. | Submit a ticket to request a higher TPS limit. |
| Service Capacity Limit | Service capacity exceeded | Continuous | The number of services exceeds the instance limit. | Submit a ticket to request a higher service capacity. |
| Connection Limit | Connection count exceeded | Continuous | The number of connections exceeds the instance limit. | Submit a ticket to request a higher connection limit. |
| Configuration Capacity Limit | Configuration capacity exceeded | Continuous | The number of configurations exceeds the instance limit. | Submit a ticket to request a higher configuration capacity. |
Ingress gateway
Professional Edition
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| Excessively High CPU Load in Instances | CPU utilization > 80% | Continuous | High CPU usage may indicate plug-in issues or insufficient capacity. | 1. Check for plug-in memory leaks or logic errors. 2. If no such issues exist, scale out the instance. |
| Excessively High Memory Usage in Instances | Memory usage > 80% | Continuous | High memory usage may indicate plug-in issues or insufficient capacity. | 1. Check for plug-in memory leaks or logic errors. 2. If no such issues exist, scale out the instance. |
Professional Edition and Serverless Edition
| Alert rule | Threshold | Timeframe | Description | Solution |
|---|---|---|---|---|
| Low Gateway Accuracy Rate | Accuracy rate < 80% | Continuous | A low accuracy rate indicates that a significant portion of requests are failing. | Check for gateway configuration errors or application-level exceptions. |
| Custom Gateway Plug-in Exception (Recovered) | Plug-in exception detected | Continuous | A custom gateway plug-in encountered an error and was automatically recovered. | Review the plug-in logic and fix the root cause to prevent recurrence. |
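For example, assuming the accuracy rate is the share of requests that succeed, a gateway that serves 1,000 requests in the evaluation window with 250 failures has an accuracy rate of 750 / 1,000 = 75%, which is below the 80% threshold.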