With the error detection and correction mechanism, the system automatically monitors RPC calls, downgrades the weights of faulty nodes, and restores the weights of these nodes after they recover. Currently, the Bolt protocol is supported.
In SOFABoot, you only need to configure the parameters for automatic fault elimination in `application.properties`. You can set just the parameters you need; all other parameters use their default values. Note that the `rpc.aft.regulation.effective` parameter is the global switch of this feature. If it is set to `false`, the feature is disabled and the other parameters have no effect.

| Parameter | Description | Default value |
| --- | --- | --- |
| timeWindow | The time window size, that is, the period over which statistics are calculated. | 10s |
| leastWindowCount | The minimum number of calls in the time window. Only IP addresses whose call count reaches this minimum within the time window are included in calculation and control. | 10 calls |
| leastWindowExceptionRateMultiple | The downgrade threshold, that is, the ratio of the exception rate of an IP address in the time window to the average exception rate of the service. The average exception rate of a service is calculated based on all IP addresses with valid calls to the service. If the ratio for an IP address is no less than the threshold value, the weight of the IP address is downgraded. | 6 times |
| weightDegradeRate | The weight downgrade ratio of an IP address. | 1/20 |
| weightRecoverRate | The weight recovery ratio of an IP address. | 2 times |
| degradeEffective | The downgrade switch. If the switch for an app is enabled, the system downgrades the weights of IP addresses that meet the downgrade criteria. Otherwise, it only prints logs. | false (off) |
| degradeLeastWeight | The minimum weight after downgrading. If the weight of an IP address would fall below this value after downgrading, this value is used instead. | 1 |
| degradeMaxIpCount | The maximum number of IP addresses that can be downgraded. The number of downgraded IP addresses for the same service cannot exceed this value. | 2 |
| regulationEffective | The global switch. If the switch for an app is enabled, the system automatically eliminates single points of failure. Otherwise, it does not execute the logic of this feature. | false (off) |

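The downgrade criteria described by leastWindowCount, leastWindowExceptionRateMultiple, and degradeMaxIpCount can be sketched as a small, standalone Java model. This is a simplified illustration, not the SOFARPC implementation: the class `DowngradeCheck`, its nested `WindowStat` type, and the method `selectForDowngrade` are hypothetical names, the average exception rate is assumed to be total exceptions divided by total calls over all valid IP addresses, and the worst offenders are assumed to be picked first when the cap applies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the downgrade criteria; not part of SOFARPC.
final class DowngradeCheck {

    /** Calls and exceptions recorded for one IP address in one time window. */
    static final class WindowStat {
        final long calls;
        final long exceptions;

        WindowStat(long calls, long exceptions) {
            this.calls = calls;
            this.exceptions = exceptions;
        }

        double exceptionRate() {
            return (double) exceptions / calls;
        }
    }

    /** Returns the IP addresses that meet the downgrade criteria in this window. */
    static List<String> selectForDowngrade(Map<String, WindowStat> stats,
                                           long leastWindowCount,
                                           double leastWindowExceptionRateMultiple,
                                           int degradeMaxIpCount) {
        // Only addresses that reached the minimum call count take part.
        Map<String, WindowStat> valid = new LinkedHashMap<>();
        long totalCalls = 0;
        long totalExceptions = 0;
        for (Map.Entry<String, WindowStat> e : stats.entrySet()) {
            if (e.getValue().calls >= leastWindowCount) {
                valid.put(e.getKey(), e.getValue());
                totalCalls += e.getValue().calls;
                totalExceptions += e.getValue().exceptions;
            }
        }

        List<String> selected = new ArrayList<>();
        if (totalExceptions == 0) {
            return selected; // no failures among valid addresses, nothing to downgrade
        }
        // Assumption: the average is pooled over all valid addresses.
        double averageRate = (double) totalExceptions / totalCalls;

        for (Map.Entry<String, WindowStat> e : valid.entrySet()) {
            if (e.getValue().exceptionRate() / averageRate >= leastWindowExceptionRateMultiple) {
                selected.add(e.getKey());
            }
        }
        // Assumption: when the cap applies, the highest exception rates go first.
        selected.sort((a, b) -> Double.compare(
                valid.get(b).exceptionRate(), valid.get(a).exceptionRate()));
        return new ArrayList<>(selected.subList(0, Math.min(selected.size(), degradeMaxIpCount)));
    }

    public static void main(String[] args) {
        Map<String, WindowStat> stats = new LinkedHashMap<>();
        stats.put("10.0.0.1", new WindowStat(100, 2));
        stats.put("10.0.0.2", new WindowStat(100, 2));
        stats.put("10.0.0.3", new WindowStat(100, 50)); // faulty node
        stats.put("10.0.0.4", new WindowStat(5, 5));    // too few calls, ignored
        // Average rate is 54 / 300 = 0.18; only 10.0.0.3 reaches 1.4 times that.
        System.out.println(selectForDowngrade(stats, 30, 1.4, 2)); // prints [10.0.0.3]
    }
}
```

In this model the address with only 5 calls is excluded even though every call failed, which is the effect leastWindowCount is meant to have: a handful of failures on a rarely called node does not trigger a downgrade.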
Consider a configuration in which the automatic fault elimination feature (`regulationEffective`) and the downgrade switch (`degradeEffective`) are both enabled. When a node fails, the system downgrades the weight of the node and restores its weight after the node recovers. The health of nodes is measured every 20s (`timeWindow`). Only nodes that are called more than 30 times within 20s are counted (`leastWindowCount`). If the exception rate of a node is at least 1.4 times the average exception rate (`leastWindowExceptionRateMultiple`), the weight of the node is downgraded at a rate of 0.5 (`weightDegradeRate`). The minimum weight is 1 (`degradeLeastWeight`). Once the exception rate of the node falls below 1.4 times the average exception rate, the system restores its weight at a rate of 1.2 (`weightRecoverRate`). At most two IP addresses are downgraded for a single service (`degradeMaxIpCount`).
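To make the weight arithmetic in this example concrete, the following Java sketch applies the downgrade rate of 0.5, the recovery rate of 1.2, and the minimum weight of 1 to a single node. It is an illustrative model only: the class `WeightRegulation` and its methods are hypothetical, weights are modeled as `double` values to sidestep rounding, the starting weight of 100 is an assumed value, and recovery is assumed to stop at the node's original weight.

```java
// Hypothetical model of the weight arithmetic; not part of SOFARPC.
final class WeightRegulation {
    static final double WEIGHT_DEGRADE_RATE = 0.5;  // downgrade rate from the example
    static final double WEIGHT_RECOVER_RATE = 1.2;  // recovery rate from the example
    static final double DEGRADE_LEAST_WEIGHT = 1.0; // minimum weight from the example

    /** One downgrade step: multiply by the rate, but never drop below the minimum. */
    static double degrade(double weight) {
        return Math.max(DEGRADE_LEAST_WEIGHT, weight * WEIGHT_DEGRADE_RATE);
    }

    /** One recovery step: multiply by the rate, capped at the original weight (assumption). */
    static double recover(double weight, double originalWeight) {
        return Math.min(originalWeight, weight * WEIGHT_RECOVER_RATE);
    }

    /** Number of healthy windows needed to get back to the original weight (weight must be > 0). */
    static int windowsToRecover(double weight, double originalWeight) {
        int windows = 0;
        while (weight < originalWeight) {
            weight = recover(weight, originalWeight);
            windows++;
        }
        return windows;
    }

    public static void main(String[] args) {
        double original = 100.0; // assumed starting weight
        double weight = original;

        // Two consecutive unhealthy windows: 100 -> 50 -> 25.
        weight = degrade(weight);
        weight = degrade(weight);
        System.out.println("after two downgrades: " + weight);

        // Healthy windows afterwards: the weight grows by a factor of 1.2 per window.
        System.out.println("windows to recover: " + windowsToRecover(weight, original));
    }
}
```

Under these assumptions, a node that was downgraded twice needs 8 healthy windows to return from 25 to its original weight of 100, because 1.2 raised to the 8th power is the first power of 1.2 that is at least 4. With a 20s window that is 160s, which shows why a recovery rate close to 1 makes restoration gradual.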