When individual interfaces lack dedicated traffic protection rules, unexpected traffic surges can destabilize your microservice applications. System protection provides node-level safeguards that cover all interfaces on a node, acting as a baseline safety net to maintain application stability.
Microservices Governance offers five system protection types:
| Protection type | Monitors | Applies to | Agent version |
|---|---|---|---|
| Adaptive overload protection | CPU utilization | All server interfaces | 3.1.4+ |
| Total QPS throttling | Sum of QPS across all interfaces on a node | All server interfaces | 4.2.0+ |
| Total concurrency throttling | Sum of concurrent requests across all interfaces on a node | All server interfaces | 4.2.0+ |
| Abnormal call circuit breaking | Error percentage per client interface | All client interfaces | 4.2.0+ |
| Slow call circuit breaking | Slow call percentage per client interface | All client interfaces | 4.2.0+ |
All system protection rules have lower priority than interface-level traffic protection rules. When throttling or circuit breaking triggers, the system returns HTTP status code 429. For a detailed comparison, see System protection vs. traffic protection.
Choose the right protection type
Use this decision matrix to identify which protection types fit your scenario:
| Scenario | Recommended protection | Why |
|---|---|---|
| CPU-sensitive application with traffic-driven load | Adaptive overload protection | Dynamically adjusts throttling based on CPU utilization gap |
| Performance limited by memory, network, or other factors (not CPU) | Total QPS throttling | Caps request rate regardless of CPU load |
| High response times (typically over 1 second) causing request queuing | Total QPS throttling + Total concurrency throttling | QPS throttling alone cannot prevent queue buildup when requests take longer than 1 second to complete |
| Downstream services returning frequent errors | Abnormal call circuit breaking | Enables fast failure to prevent request queuing in the caller |
| Downstream services responding slowly but not timing out | Slow call circuit breaking | Detects slow responses independently of timeout settings |
Best practice: Start with system protection as a baseline safety net, then add interface-level traffic protection rules to minimize unnecessary throttling on individual interfaces.
Configure system protection rules
Before you begin, make sure that you have:
Microservices Governance Enterprise Edition activated. For more information, see Activate Microservices Governance.
Microservices Governance enabled for your application. For more information, see Enable Microservices Governance for Java microservice applications in an ACK or ACS cluster and Enable Microservices Governance for microservice applications on ECS instances.
To configure system protection:
Log on to the MSE console and select a region in the top navigation bar.
In the left-side navigation pane, choose Microservices Governance > Application Governance.
On the Application list page, click the resource card of the target application. In the left-side navigation pane, click Traffic management.
Click the System Protection tab and configure the protection types described in the following sections.
Adaptive overload protection
Requires agent 3.1.4 or later.
How it works
Adaptive overload protection uses CPU utilization to gauge system load. When CPU usage approaches the configured threshold, the system adaptively adjusts the throttling percentage of server traffic -- rejecting a portion of incoming requests to keep CPU utilization within a small range around the threshold.
This protection takes effect on all server interfaces.
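The agent's exact adaptive algorithm is not published; the sketch below illustrates the general idea with a simple proportional controller that nudges a rejection probability toward the CPU target. The `GAIN` constant, the update rule, and the class name are all illustrative assumptions, not the agent's implementation.

```python
import random

class AdaptiveOverloadProtector:
    """Sketch: adjust the rejection probability toward a CPU target.

    The real agent's control loop is not documented; this proportional
    step (GAIN, clamping to [0, 1]) is an illustrative assumption.
    """
    GAIN = 0.5  # how aggressively to react to the CPU gap

    def __init__(self, target_cpu: float):
        self.target_cpu = target_cpu      # e.g. 0.7 for a 70% vCPU target
        self.reject_probability = 0.0

    def update(self, measured_cpu: float) -> None:
        # Positive gap => over target => throttle more; negative => relax.
        gap = measured_cpu - self.target_cpu
        self.reject_probability = min(1.0, max(0.0,
            self.reject_probability + self.GAIN * gap))

    def allow(self) -> bool:
        # Throttled requests would be answered with HTTP 429.
        return random.random() >= self.reject_probability

protector = AdaptiveOverloadProtector(target_cpu=0.7)
protector.update(measured_cpu=0.9)   # 20 points over target
print(protector.reject_probability)  # ~0.1 after one step (0.5 * 0.2)
```

As CPU samples oscillate around the target, the rejection probability settles at whatever level keeps utilization within a small band around the threshold, which is the behavior described above.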
When to use it
Use adaptive overload protection for CPU-sensitive applications where unexpected traffic surges directly increase CPU load and degrade response times.
To determine a threshold, run stress tests or analyze historical data to identify the maximum CPU utilization during steady-state operation, then set a slightly higher value.
Console layout
The Adaptive Overload Protection section has two panels:
Left panel -- Lists adaptive overload protection events. Events are generated when throttling starts, while it remains active, and when it ends. Click View in the Actions column to inspect the CPU utilization for a specific node IP and replay data from the reporting interval.
Right panel -- Shows the average CPU utilization trend for each application node over the previous 5 minutes.
Parameters
| Parameter | Description |
|---|---|
| ON | Close: Disabled. Simulated Execution: Events are generated when protection triggers, but traffic is not throttled. Open: Protection triggers and throttles a percentage of ingress traffic. |
| vCPU Utilization | Target CPU utilization threshold. The system adaptively adjusts throttling probability based on the gap between actual and target utilization. |
| Exception Settings | Interfaces excluded from this rule. See Exception settings. |
Total QPS throttling
Requires agent 4.2.0 or later.
How it works
Total QPS throttling measures the aggregate queries per second (QPS) across all server interfaces on a single node. If the total QPS exceeds the configured threshold, incoming requests are throttled.
This protection takes effect on all server interfaces.
When to use it
Use total QPS throttling when application performance depends on factors other than CPU utilization -- for example, memory, network bandwidth, or connection pools. In these cases, CPU-based adaptive protection may not trigger even though the application is degraded.
To determine a threshold, run stress tests or analyze historical data to identify the total QPS during steady-state operation, then set a slightly higher value.
Console layout
The Total QPS Throttling section has two panels:
Left panel -- Lists throttling events. Events are reported for nodes and interfaces that were throttled within the previous 5 minutes, at a 5-minute reporting interval. Click View to inspect the total QPS for a specific node IP and replay data from the reporting interval.
Right panel -- Shows the average total QPS trend for each application node over the previous 5 minutes.
Parameters
| Parameter | Description |
|---|---|
| ON | Close: Disabled. Enable: Throttles requests when the total QPS exceeds the threshold. |
| Total QPS Threshold | Maximum total QPS allowed on a single node. |
| Exception Settings | Interfaces excluded from this rule. See Exception settings. |
Total concurrency throttling
Requires agent 4.2.0 or later.
How it works
Total concurrency throttling measures the aggregate number of concurrent requests across all server interfaces on a single node. If the total concurrency exceeds the configured threshold, incoming requests are throttled.
This protection takes effect on all server interfaces.
When to use it
Use total concurrency throttling alongside total QPS throttling, especially when interface response times (RT) are high (typically over 1 second). In high-RT scenarios, requests occupy system resources such as thread pools, memory, and connection pools for a long time, so new requests queue and interface RT increases further. QPS throttling alone has a limitation: because each queued request takes longer than one second to complete, even a small number of new requests per second accumulates. This causes request queuing and inflates RT for both existing and new requests.
Concurrency throttling addresses this by rejecting new requests while queued requests are still processing. After the queue drains, subsequent requests are admitted and processed with shorter wait times -- significantly improving both success rates and average RT.
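The queue-buildup argument can be checked with Little's law (concurrency = arrival rate x RT). The numbers below are made up for illustration:

```python
# Illustrative arithmetic (Little's law: concurrency = arrival_rate * RT).
# All numbers are made up for the example.
qps_cap = 50          # requests admitted per second under QPS throttling
rt_seconds = 3.0      # each request takes 3 s to complete
in_flight = qps_cap * rt_seconds
print(in_flight)      # 150 requests occupy threads/connections at once

# A concurrency cap bounds in-flight work directly, regardless of RT:
concurrency_cap = 80
admitted_rate = concurrency_cap / rt_seconds
print(admitted_rate)  # ~26.7 requests/s is what 3 s RT can actually sustain
```

In other words, a QPS cap that looks modest still pins down 150 concurrent requests at 3-second RT, while a concurrency cap keeps in-flight work bounded no matter how long requests take.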
To determine a threshold, run stress tests or analyze historical data to identify the total concurrency during steady-state operation, then set a slightly higher value.
Console layout
The Total Concurrency Throttling section has two panels:
Left panel -- Lists throttling events. Events are reported for nodes and interfaces that were throttled within the previous 5 minutes, at a 5-minute reporting interval. Click View to inspect the total concurrency for a specific node IP and replay data from the reporting interval.
Right panel -- Shows the average total concurrency trend for each application node over the previous 5 minutes.
Parameters
| Parameter | Description |
|---|---|
| ON | Close: Disabled. Enable: Throttles requests when total concurrency exceeds the threshold. |
| Total Concurrency Threshold | Maximum number of concurrent requests allowed on a single node. |
| Exception Settings | Interfaces excluded from this rule. See Exception settings. |
Abnormal call circuit breaking
Requires agent 4.2.0 or later.
How it works
Abnormal call circuit breaking monitors the error percentage of each client interface. When the error percentage exceeds the configured threshold, the circuit breaker opens for that interface. During the circuit breaking period, all requests to the interface fail immediately. The system sends probe requests at regular intervals, and if a probe succeeds, the circuit breaker closes and normal traffic resumes.
This protection applies to all client interfaces, except those that already have interface-level circuit breaking rules configured.
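The open/probe/close cycle described above can be sketched as a small state machine. Window handling and probe pacing in the real agent are not documented, so the class name, the reset behavior, and the parameter names below are illustrative assumptions:

```python
class AbnormalCallBreaker:
    """Sketch of the open -> probe -> close cycle described above.

    The agent's window and probe mechanics are not documented; the
    fields and reset behavior here are illustrative assumptions.
    """
    def __init__(self, error_pct_threshold: float, min_requests: int,
                 break_duration_s: float):
        self.error_pct_threshold = error_pct_threshold
        self.min_requests = min_requests
        self.break_duration_s = break_duration_s
        self.errors = 0
        self.total = 0
        self.open_until = 0.0   # 0 means the breaker is closed

    def record(self, success: bool, now: float) -> None:
        self.total += 1
        self.errors += 0 if success else 1
        error_pct = 100.0 * self.errors / self.total
        # Trip only once the minimum request count is reached.
        if self.total >= self.min_requests and error_pct > self.error_pct_threshold:
            self.open_until = now + self.break_duration_s  # open the breaker
            self.errors = self.total = 0                   # reset the window

    def allow(self, now: float) -> bool:
        return now >= self.open_until  # during the break, fail fast

b = AbnormalCallBreaker(error_pct_threshold=50.0, min_requests=5,
                        break_duration_s=10.0)
for _ in range(5):
    b.record(success=False, now=0.0)  # 100% errors over 5 requests
print(b.allow(now=1.0))    # False: breaker is open, requests fail fast
print(b.allow(now=11.0))   # True: break period over, a probe is admitted
```

The minimum-request guard matches the Minimum number of requests parameter below: a single failed call on a quiet interface does not trip the breaker.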
When to use it
Abnormal call circuit breaking handles two scenarios:
Timeout scenarios -- Frequent timeouts on a client interface usually indicate that the service provider is experiencing issues. Without circuit breaking, requests queue up and eventually affect other interfaces in the calling application. Circuit breaking enables fast failure, preventing request queuing.
Non-timeout error scenarios -- Frequent non-timeout errors on a client interface. Circuit breaking surfaces these errors quickly so that they can be handled, minimizing their impact and preserving the user experience while the issues persist.
Console layout
The Abnormal Call Circuit Breaking section has two panels:
Left panel -- Lists circuit breaking events reported within the previous 5 minutes, at a 5-minute reporting interval.
Right panel -- Shows the top 10 interfaces with the highest abnormal call percentage over the previous 5 minutes.
Parameters
| Parameter | Description |
|---|---|
| ON | Close: Disabled. Enable: Triggers circuit breaking when the abnormal call percentage exceeds the threshold. |
| Circuit Breaking Percentage Threshold (%) | Error percentage that triggers circuit breaking on an interface. |
| Exception Settings | Interfaces excluded from this rule. See Exception settings. |
Advanced settings
| Parameter | Description |
|---|---|
| Statistics Window Duration (s) | Length of the statistics window. Valid range: 1 second to 120 minutes. |
| Circuit Breaking Duration (s) | Duration of the circuit breaking period. During this period, all requests to the affected interface fail immediately. |
| Minimum number of requests | Minimum number of requests required within the statistics window to trigger circuit breaking. If the request count is below this value, circuit breaking does not trigger even if the error percentage exceeds the threshold. |
| Fuse recovery strategy | Controls how the circuit breaker recovers after the circuit breaking period ends. Single detection recovery -- The circuit breaker tests the next request after the circuit breaking period. If the request succeeds (no slow call or abnormal call), the circuit breaker closes. Otherwise, circuit breaking triggers again. Progressive recovery -- Requires the Number of recovery phases and Minimum number of passes per step parameters. The circuit breaker gradually increases the percentage of allowed requests across multiple stages. See Progressive recovery. |
Progressive recovery
After the circuit breaking period ends, the circuit breaker steps through recovery stages, gradually increasing the percentage of requests allowed to pass.
How the percentage is calculated:
Request percentage per stage (T) = 100 / Number of recovery stages (N)
Stage 1 allows T% of requests
Stage 2 allows 2T% of requests
This continues until 100% of requests are allowed
At each stage, a check triggers when the number of requests reaches the Minimum number of passes per step value. If the error percentage stays below the threshold, the circuit breaker advances to the next stage. If the error percentage exceeds the threshold, circuit breaking triggers again.
Example: With 3 recovery stages and a minimum of 5 passes per step:
| Stage | Allowed requests | Check condition |
|---|---|---|
| 1 | 33% | After 5 or more requests |
| 2 | 67% | After 5 or more requests |
| 3 | 100% | Full recovery |
If fewer than 5 requests arrive during a stage, the system advances to the next stage without checking.
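The stage percentages follow directly from the formula T = 100 / N. A minimal calculation (the function name is illustrative):

```python
def recovery_stages(num_stages: int) -> list[int]:
    """Allowed-request percentage per stage: T = 100 / N, cumulative."""
    t = 100 / num_stages
    return [round(t * stage) for stage in range(1, num_stages + 1)]

print(recovery_stages(3))  # [33, 67, 100] -- matches the table above
```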
Slow call circuit breaking
Requires agent 4.2.0 or later.
How it works
Slow call circuit breaking monitors the slow call percentage of each client interface. A call is classified as slow when its response time exceeds the configured Slow Call RT threshold. When the slow call percentage exceeds the Degradation Threshold, the circuit breaker opens for that interface. During the circuit breaking period, all requests fail immediately. Probe requests are sent at regular intervals, and if a probe succeeds, the circuit breaker closes.
This protection applies to all client interfaces, except those that already have interface-level circuit breaking rules configured.
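The slow-call classification reduces to simple percentage arithmetic over observed RTs. A minimal sketch (the function name and sample values are illustrative):

```python
def slow_call_pct(rt_samples_ms: list[float], slow_rt_ms: float) -> float:
    """Percentage of calls whose RT exceeds the Slow Call RT threshold."""
    slow = sum(1 for rt in rt_samples_ms if rt > slow_rt_ms)
    return 100.0 * slow / len(rt_samples_ms)

samples = [120, 80, 950, 1400, 2100]        # milliseconds, made up
pct = slow_call_pct(samples, slow_rt_ms=1000)
print(pct)  # 40.0 -- two of five calls exceed 1000 ms
```

If the Degradation Threshold were set to 30%, this window would trip the breaker; at 50% it would not.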
When to use it
Use slow call circuit breaking in timeout-prone scenarios where abnormal call circuit breaking may also apply. Unlike abnormal call circuit breaking, slow call circuit breaking allows dynamic adjustment of the RT threshold that defines a "slow call," independent of timeout settings.
Console layout
The Slow Call Circuit Breaking section has two panels:
Left panel -- Lists circuit breaking events reported for slow calls within the previous 5 minutes, at a 5-minute reporting interval.
Right panel -- Shows the top 10 average RT values across application interfaces over the previous 5 minutes.
Parameters
| Parameter | Description |
|---|---|
| ON | Close: Disabled. Enable: Triggers circuit breaking when the slow call percentage exceeds the degradation threshold. |
| Slow Call RT (ms) | RT threshold in milliseconds. Calls with RT above this value are classified as slow calls. |
| Degradation Threshold (%) | Percentage of slow calls that triggers circuit breaking. |
| Exception Settings | Interfaces excluded from this rule. See Exception settings. |
Advanced settings
| Parameter | Description |
|---|---|
| Statistics Window Duration (s) | Length of the statistics window. Valid range: 1 second to 120 minutes. |
| Circuit Breaking Duration (s) | Duration of the circuit breaking period. During this period, all requests to the affected interface fail immediately. |
| Minimum number of requests | Minimum number of requests required within the statistics window to trigger circuit breaking. If the request count is below this value, circuit breaking does not trigger even if the slow call percentage exceeds the threshold. |
| Fuse recovery strategy | Controls how the circuit breaker recovers after the circuit breaking period ends. Single detection recovery and Progressive recovery. For details, see Progressive recovery. |
Exception settings
Requires agent 4.2.0 or later.
Exception settings let you exclude specific interfaces from all system protection rules. Requests on excluded interfaces pass through without rule checking.
When to use them
Configure exception settings for:
Health check interfaces -- Prevent system protection rules from throttling health checks, which could affect the health status of nodes.
Key interfaces -- Interfaces that already have separate throttling limits should not also be subject to the system-wide throttling mechanism.
How to configure
In the exception settings dialog:
The Available Interfaces section on the left lists recently called interfaces. If the target interface is not listed, enter its name in the search box and click the search icon.
Add the interface to the Selected Interfaces section on the right.
System protection vs. traffic protection
Both system protection and traffic protection keep applications stable, but they operate at different levels:
| Dimension | System protection | Traffic protection |
|---|---|---|
| Scope | Node-level. Same rules apply to all interfaces of an application. | Interface-level. Different thresholds per interface. |
| Granularity | Coarse-grained -- protects the node as a whole. | Fine-grained -- protects individual interfaces based on importance and load characteristics. |
| Configuration effort | Low. A few thresholds cover the entire application. | Higher. Requires per-interface threshold tuning. |
| Traffic loss | Higher. Throttling applies broadly across all interfaces. | Lower. Only the interfaces that exceed their individual thresholds are throttled. |
| Best for | Baseline safety net against unexpected surges. | Production-grade protection with minimized traffic loss. |
Both system protection and traffic protection return HTTP status code 429 when throttling triggers. Custom status codes are not supported.