Error Detection and Correction

Last Updated: Aug 21, 2020

With the error detection and correction mechanism, the system automatically monitors RPC calls, downgrades the weights of faulty nodes, and restores the weights of these nodes after they recover. Currently, this feature supports the Bolt protocol.

In SOFABoot, you only need to configure the parameters for automatic fault elimination in application.properties. Set only the parameters you want to change; all other parameters keep their default values. Note that the com.alipay.sofa.rpc.aft.regulation.effective parameter is the global switch of this feature. If it is set to false, the feature is disabled and the other parameters have no effect.

| Parameter | Description | Default value |
| --- | --- | --- |
| timeWindow | The time window size, that is, the period over which statistics are calculated. | 10 s |
| leastWindowCount | The minimum number of calls in the time window. Only IP addresses that reach this number of calls within the time window are included in calculation and control. | 10 calls |
| leastWindowExceptionRateMultiple | The downgrade threshold, that is, the ratio of an IP address's exception rate in the time window to the average exception rate of the service. The average exception rate of a service is calculated based on all IP addresses with valid calls to the service. If this ratio is no less than the threshold, the weight of the IP address is downgraded. | 6 times |
| weightDegradeRate | The weight downgrade ratio of an IP address. | 1/20 |
| weightRecoverRate | The weight recovery ratio of an IP address. | 2 times |
| degradeEffective | The downgrade switch. If the switch is enabled for an app, the system downgrades the weights of IP addresses that meet the downgrade criteria. Otherwise, it only prints logs. | false (off) |
| degradeLeastWeight | The minimum weight after downgrading. If the downgraded weight of an IP address is smaller than this value, this value is used instead. | 1 |
| degradeMaxIpCount | The maximum number of IP addresses that can be downgraded. For a single service, the number of downgraded IP addresses cannot exceed this value. | 2 |
| regulationEffective | The global switch. If the switch is enabled for an app, the system automatically eliminates single points of failure. Otherwise, it does not execute the logic of this feature. | false (off) |
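The downgrade criterion described for leastWindowCount and leastWindowExceptionRateMultiple can be sketched as follows. This is a minimal illustration and not the SOFARPC implementation: the class and method names are hypothetical, and it assumes that the average exception rate is the total number of exceptions divided by the total number of calls across all IP addresses that reach leastWindowCount, which the description above does not spell out. The degradeMaxIpCount limit is not modeled here.

```java
// Illustrative sketch only; names are hypothetical, not SOFARPC APIs.
class DegradeCheckSketch {

    // calls[i] and exceptions[i] are the counters of the i-th IP address
    // collected during one time window.
    static boolean[] findAbnormal(long[] calls, long[] exceptions,
                                  int leastWindowCount, double rateMultiple) {
        long totalCalls = 0;
        long totalExceptions = 0;
        // Only IP addresses with enough calls contribute to the average.
        for (int i = 0; i < calls.length; i++) {
            if (calls[i] >= leastWindowCount) {
                totalCalls += calls[i];
                totalExceptions += exceptions[i];
            }
        }

        boolean[] abnormal = new boolean[calls.length];
        // No valid calls or no exceptions at all: nothing to downgrade.
        if (totalCalls == 0 || totalExceptions == 0) {
            return abnormal;
        }

        // Assumption: average = total exceptions / total calls.
        double average = (double) totalExceptions / totalCalls;
        for (int i = 0; i < calls.length; i++) {
            if (calls[i] < leastWindowCount) {
                continue; // too few calls in this window, skip this IP address
            }
            double rate = (double) exceptions[i] / calls[i];
            // Downgrade candidate when the ratio is no less than the threshold.
            abnormal[i] = rate / average >= rateMultiple;
        }
        return abnormal;
    }
}
```

Under this assumption, a single faulty IP address also raises the average, so a high threshold such as the default of 6 can only be reached when the service has enough healthy IP addresses to keep the average low.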

Example

com.alipay.sofa.rpc.aft.time.window=20
com.alipay.sofa.rpc.aft.least.window.count=30
com.alipay.sofa.rpc.aft.least.window.exception.rate.multiple=1.4
com.alipay.sofa.rpc.aft.weight.degrade.rate=0.5
com.alipay.sofa.rpc.aft.weight.recover.rate=1.2
com.alipay.sofa.rpc.aft.degrade.effective=true
com.alipay.sofa.rpc.aft.degrade.least.weight=1
com.alipay.sofa.rpc.aft.degrade.max.ip.count=2
com.alipay.sofa.rpc.aft.regulation.effective=true

With this configuration, both the automatic fault elimination feature and the downgrade switch are enabled. When a node fails, the system downgrades the weight of the node and restores the weight after the node recovers. The health of nodes is measured every 20 seconds. Only nodes that are called at least 30 times within 20 seconds are counted. If the exception rate of a node is at least 1.4 times the average exception rate, the weight of the node is downgraded at a rate of 0.5. The weight never drops below the minimum weight of 1. When the exception rate of the node falls back below 1.4 times the average exception rate, the system restores its weight at a ratio of 1.2. At most two IP addresses are downgraded for a single service.
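To make the arithmetic of the example concrete, the following sketch applies the configured downgrade rate, recovery ratio, and minimum weight to a node's weight. It is an illustration under stated assumptions, not the SOFARPC implementation: the class and method names are hypothetical, the starting weight of 100 is an assumed value, and the sketch assumes that each adjustment multiplies the current weight by the configured rate, rounds down when downgrading, rounds up when recovering, and never restores a weight above its original value.

```java
// Illustrative sketch only; names and rounding rules are assumptions.
class WeightAdjustSketch {
    // Values taken from the example configuration.
    static final double DEGRADE_RATE = 0.5; // weight.degrade.rate
    static final double RECOVER_RATE = 1.2; // weight.recover.rate
    static final int LEAST_WEIGHT = 1;      // degrade.least.weight

    // Multiply a weight by a rate with exact decimal arithmetic,
    // then round to an integer in the given direction.
    static int scale(int weight, double rate, java.math.RoundingMode mode) {
        return java.math.BigDecimal.valueOf(weight)
                .multiply(java.math.BigDecimal.valueOf(rate))
                .setScale(0, mode)
                .intValue();
    }

    // One downgrade step: shrink the weight, but not below the minimum.
    static int degrade(int weight) {
        int lowered = scale(weight, DEGRADE_RATE, java.math.RoundingMode.FLOOR);
        return Math.max(LEAST_WEIGHT, lowered);
    }

    // One recovery step: grow the weight, but not above the original weight.
    static int recover(int weight, int originalWeight) {
        int raised = scale(weight, RECOVER_RATE, java.math.RoundingMode.CEILING);
        return Math.min(originalWeight, raised);
    }

    public static void main(String[] args) {
        int original = 100; // hypothetical starting weight
        int weight = original;
        // The node is abnormal in three consecutive windows.
        for (int i = 0; i < 3; i++) {
            weight = degrade(weight);
            System.out.println("degraded to " + weight);
        }
        // The node is healthy again; restore step by step.
        while (weight < original) {
            weight = recover(weight, original);
            System.out.println("recovered to " + weight);
        }
    }
}
```

Rounding up during recovery is a deliberate choice in this sketch: with a ratio of 1.2, rounding down would leave a weight of 1 stuck at 1 forever.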