In a microservice framework, service calls are affected when consumers cannot perceive the abnormal application instances of a provider. This further affects consumers' serviceability and availability. The outlier instance removal function monitors the availability of High-speed Service Framework (HSF) applications and service instances and dynamically adjusts them. This ensures successful service calls and improves service stability and quality of service (QoS).
Background information
As shown in the following figure, the system includes Applications A, B, C, and D. Application A can call Applications B, C, and D. If Application A cannot perceive the abnormal instances of Application B, C, or D, some of its calls fail. In this example, Application B has one abnormal instance, and Applications C and D each have two abnormal instances. The more abnormal instances Applications B, C, and D have, the more the performance and serviceability of Application A may be affected.
You can configure outlier instance removal for Application A to ensure its serviceability and availability. Application A can then monitor the instance status of Applications B, C, and D, and instances are dynamically removed or added back to ensure successful service calls.
The process of outlier instance removal is as follows.
- Enterprise Distributed Application Service (EDAS) detects abnormal instances of Application B, C, or D and determines whether to remove each abnormal instance from the application based on the configured Maximum Number of Removed Instances.
- Call requests of Application A are not allocated to the removed instance.
- EDAS checks whether the abnormal instance has recovered after the configured Recovery Detection Unit Time.
- The detection interval is proportional to the number of detection attempts: it increases linearly in steps of Recovery Detection Unit Time, which is 0.5 minutes by default. After the configured Maximum Number of Cumulative Rollbacks is reached, the interval stops increasing and EDAS keeps checking at this maximum interval.
- When the instance has recovered, it is added back to the instance list of the application to process call requests. The detection interval is reset to Recovery Detection Unit Time, for example, 0.5 minutes.
- When the provider has more abnormal instances than the configured maximum number, only the configured maximum number of instances is removed.
- When only one available instance is left among the provider's instances, this instance is not removed even if the error rate exceeds the configured threshold.
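The recovery detection schedule described above can be sketched as a capped linear back-off. This is a minimal illustration, not EDAS code: the class and method names are hypothetical, and it assumes that the first check happens after one unit time and that the n-th check waits n unit times until the configured multiple is reached.

```java
// Hypothetical sketch of the recovery detection interval; not part of HSF or EDAS.
final class RecoveryBackoff {

    /**
     * Returns the wait before the given detection attempt (1-based), in milliseconds.
     * The interval grows by one unit time per attempt and stops growing once
     * maxMultiple (Maximum Number of Cumulative Rollbacks) is reached.
     */
    static long intervalMillis(int attempt, long unitMillis, int maxMultiple) {
        if (attempt < 1 || unitMillis <= 0 || maxMultiple < 1) {
            throw new IllegalArgumentException("attempt, unitMillis and maxMultiple must be positive");
        }
        int multiple = Math.min(attempt, maxMultiple); // cap the linear growth
        return multiple * unitMillis;
    }

    public static void main(String[] args) {
        long unit = 60_000L; // Recovery Detection Unit Time of 1 minute
        int max = 60;        // Maximum Number of Cumulative Rollbacks
        for (int attempt : new int[] {1, 2, 3, 60, 75}) {
            System.out.println("attempt " + attempt + " -> "
                    + intervalMillis(attempt, unit, max) + " ms");
        }
    }
}
```

When the instance recovers, the attempt counter in this sketch would simply restart at 1, which corresponds to resetting the interval to Recovery Detection Unit Time.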
Create an outlier instance removal policy
For HSF applications, you can create application- and service-level outlier instance removal policies.
Parameters for configuring outlier instance removal
You can configure an outlier instance removal policy through properties on the Configuration Management page, or by using the -D JVM parameter. The configuration completed on the Configuration Management page takes precedence over the configuration through the -D parameter. We recommend that you complete configuration on the Configuration Management page.
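The precedence rule can be illustrated with a small lookup sketch. It is only an illustration under assumptions: the `pageConfig` map is a hypothetical stand-in for values set on the Configuration Management page, the class and method names are invented, and the property and -D names are the ones listed in the parameter table of this topic. How HSF actually reads these settings is not described here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the precedence between page configuration and -D parameters.
final class QosSettingResolver {

    /**
     * Resolves one setting: a value from the Configuration Management page wins,
     * then the JVM system property (set with -D), then the documented default.
     */
    static String resolve(Map<String, String> pageConfig, String property,
                          String jvmName, String defaultValue) {
        String fromPage = pageConfig.get(property);
        if (fromPage != null) {
            return fromPage;
        }
        return System.getProperty(jvmName, defaultValue);
    }

    public static void main(String[] args) {
        // Start the JVM with, for example, -Dhsf.qos.error.rate.threshold=0.3
        Map<String, String> pageConfig = new HashMap<>();
        pageConfig.put("qosEnabled", "true"); // assumed to be set on the page
        System.out.println(resolve(pageConfig, "qosEnabled", "hsf.qos.enable", "false"));
        System.out.println(resolve(pageConfig, "errorRateThreshold",
                "hsf.qos.error.rate.threshold", "0.5"));
    }
}
```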
Parameter | Property | -D parameter | Description | Default value |
---|---|---|---|---|
Maximum Number of Calls | requestThreshold | -Dhsf.qos.request.threshold | The outlier instance is removed only when the number of calls in the most recent statistics window exceeds the threshold. | 10 |
Lower Error Rate | errorRateThreshold | -Dhsf.qos.error.rate.threshold | When the error rate of an instance in the called application or service exceeds the threshold, the instance is removed. | 0.5 |
Maximum Number of Removed Instances | maxIsolationRate | -Dhsf.qos.max.isolation.rate | The maximum proportion of an application's instances that can be removed. After this limit is reached, no more abnormal instances are removed. For example, if an application has 6 instances and this parameter is set to 60%, the number of instances that can be removed is 6 × 60% = 3.6, which is rounded down to 3. If the calculation result is less than 1, one instance is removed. | 0.2 |
Recovery Detection Unit Time | isolationTime | -Dhsf.qos.isolation.time | After abnormal instances are removed, EDAS keeps checking whether they have recovered, at an interval that increases by this unit time with each check. The unit is milliseconds (ms). | 60 × 1,000 ms (1 minute) |
Maximum Number of Cumulative Rollbacks | maxIsolationTimeMultiple | -Dhsf.qos.max.isolation.time.multiple | The maximum number of cumulative rollbacks, beyond which the detection interval is no longer increased. For example, Recovery Detection Unit Time is set to 60,000 ms and Maximum Number of Cumulative Rollbacks is set to 60. If the abnormal instance has still not recovered after 60 detections, it is subsequently checked at an interval of 60 minutes (60 × 60,000 ms). If the instance recovers before this threshold is reached, the detection interval is reset to Recovery Detection Unit Time. | 60 |
Enable Outlier Instance Removal | qosEnabled | -Dhsf.qos.enable | Specifies whether to enable outlier instance removal for the application or service. | false |
Time Window for Statistics | timeWindowInSeconds | -Dhsf.qos.time.window.in.seconds | The time window for statistics on Maximum Number of Calls, that is, the statistical period. | 10 s |
Exception Type | bizExceptionPredicateClassName | -Dhsf.qos.biz.exception.class.name | The exception type of instances of the application or service. By default, all service exceptions are counted as exceptions. You can also define specific service exceptions through custom interfaces. | com.taobao.hsf.exception.CountBizExceptionPredicate, which defines all business exceptions as exceptions |
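To show how the thresholds in the table interact, the following sketch combines them into a single removal decision. It is an assumption-laden illustration rather than the EDAS implementation: the class, method, and argument names are hypothetical, "exceeds" is read as strictly greater than, and the rule that the last available instance is never removed is taken from the process description earlier in this topic.

```java
// Hypothetical sketch of the removal decision; not part of HSF or EDAS.
final class OutlierRemovalSketch {

    /**
     * Upper bound on removed instances: total × rate, rounded down,
     * but at least 1 when the result is less than 1.
     */
    static int maxRemovable(int totalInstances, double maxIsolationRate) {
        int byRate = (int) Math.floor(totalInstances * maxIsolationRate);
        return Math.max(1, byRate);
    }

    /** Decides whether one more instance may be removed, given its statistics for the current window. */
    static boolean shouldRemove(long calls, long errors,
                                long requestThreshold, double errorRateThreshold,
                                int alreadyRemoved, int totalInstances,
                                double maxIsolationRate) {
        if (calls <= requestThreshold) {
            return false; // not enough calls in the statistics window
        }
        double errorRate = (double) errors / calls;
        if (errorRate <= errorRateThreshold) {
            return false; // error rate does not exceed the threshold
        }
        if (totalInstances - alreadyRemoved <= 1) {
            return false; // never remove the last available instance
        }
        return alreadyRemoved < maxRemovable(totalInstances, maxIsolationRate);
    }

    public static void main(String[] args) {
        System.out.println(maxRemovable(6, 0.6));
        System.out.println(shouldRemove(20, 15, 10, 0.5, 0, 6, 0.6));
    }
}
```

With 6 instances and a setting of 60%, `maxRemovable` yields 3, which matches the example in the table.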
Verify the result
The outlier instance removal function is enabled after you configure an outlier instance removal policy. You can go to the details page of the application configured with outlier instance removal to view the application monitoring information. For example, a topology shows whether call requests are still forwarded to abnormal instances. You can check whether Error Rate per Minute for application calls is higher than the configured Lower Error Rate. Based on such information, you can determine whether the outlier instance removal policy takes effect.