In a microservice framework, service calls are affected when consumers cannot perceive the abnormal application instances of a provider. This further affects consumers' serviceability and availability. The outlier instance removal function monitors the availability of High-speed Service Framework (HSF) applications and service instances and dynamically adjusts them. This ensures successful service calls and improves service stability and quality of service (QoS).

Background information

As shown in the following figure, the system includes Applications A, B, C, and D, and Application A calls Applications B, C, and D. In this example, Application B has one abnormal instance, and Applications C and D each have two abnormal instances. If Application A cannot perceive these abnormal instances, some of its calls fail. The more abnormal instances Applications B, C, and D have, the more the performance and serviceability of Application A may be affected.

You can configure outlier instance removal for Application A to ensure its serviceability and availability. This allows Application A to monitor the instance status of Applications B, C, and D and dynamically remove or re-add instances so that service calls succeed.

The process of outlier instance removal is as follows.

  1. Enterprise Distributed Application Service (EDAS) detects abnormal instances of Applications B, C, and D and determines whether to remove each one based on the configured Maximum Number of Removed Instances.
  2. Call requests from Application A are no longer allocated to a removed instance.
  3. EDAS checks whether the abnormal instance has recovered at intervals based on the configured Recovery Detection Unit Time.
  4. The detection interval increases linearly with the number of detections, in multiples of the Recovery Detection Unit Time, which is 0.5 minutes by default. After the configured Maximum Number of Cumulative Rollbacks is reached, the interval stops increasing and EDAS continues to check at this maximum interval.
  5. When the instance has recovered, it is added back to the instance list of the application to process call requests. The detection interval is reset to the Recovery Detection Unit Time, for example, 0.5 minutes.
Note
  • When the number of abnormal instances of a provider exceeds the configured maximum, only the configured maximum number of instances are removed.
  • When a provider has only one available instance left, this instance is not removed even if its error rate exceeds the configured threshold.
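
The recovery detection schedule in steps 3 to 5 can be sketched as follows. This is a minimal illustration, not EDAS code: the function name recovery_check_intervals is hypothetical, and it assumes that the n-th check waits n times the Recovery Detection Unit Time, capped at the Maximum Number of Cumulative Rollbacks times that unit.

```python
def recovery_check_intervals(isolation_time_ms, max_multiple, checks):
    """Return the wait in ms before each of the first `checks` recovery checks.

    Assumption: the n-th check waits n * isolation_time_ms, and the
    multiplier stops growing once it reaches max_multiple.
    """
    return [min(n, max_multiple) * isolation_time_ms
            for n in range(1, checks + 1)]


# Unit time of 60,000 ms with a small cap of 3 so the plateau is visible.
print(recovery_check_intervals(isolation_time_ms=60_000, max_multiple=3, checks=5))
# [60000, 120000, 180000, 180000, 180000]
```

If the instance recovers at any point, the schedule starts over from the first entry, which corresponds to resetting the interval to the Recovery Detection Unit Time.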

Create an outlier instance removal policy

For HSF applications, you can create application- and service-level outlier instance removal policies.

  1. In the left-side navigation pane, choose Application Management > Configuration Management.
  2. On the Configuration Management page, select a region and a namespace. On the right, click Create +.
  3. On the Create Configuration page, complete the settings and click Publish at the bottom of the page.

    Parameters for configuring outlier instance removal:

    • Data ID: the ID of your configuration, in the format of <App ID>.QOSCONFIG. You can obtain the app ID on the Application Details page.
    • Group: is set to HSF and cannot be modified.
    • Target Region: is set to the region that you selected before configuration, and cannot be modified.
    • Configuration Body: Enter the policy for removing outlier instances.

      Configure an outlier removal policy for HSF applications based on properties and their values. You can configure an outlier removal policy at the application or service level. The following provides configuration examples at these two levels.

      Note: The service-level configuration takes precedence over the application-level configuration.
      • Example of configuring an application-level outlier instance removal policy
        {
          "DEFAULT": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10,
            "ipDimension": true
          }
        }
      • Example of configuring a service-level outlier instance removal policy
        {
          "DEFAULT": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10
          },
          "service:version": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10
          }
        }

      If you have other requirements, see Parameters for configuring outlier instance removal.
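
      To illustrate how the two levels relate, the following sketch resolves the policy that would apply to one service from a configuration body. It is an illustration only: the service key com.example.DemoService:1.0.0 and the function effective_policy are hypothetical, and the sketch assumes that properties missing from a service-level entry fall back to the DEFAULT entry. How EDAS merges the two levels internally is not described here.

```python
import json

# Hypothetical configuration body: one application-level entry (DEFAULT)
# and one service-level entry keyed as "service:version".
CONFIG_BODY = """
{
  "DEFAULT": {
    "errorRateThreshold": 0.5,
    "isolationTime": 60000,
    "maxIsolationRate": 0.2,
    "maxIsolationTimeMultiple": 15,
    "qosEnabled": true,
    "requestThreshold": 20,
    "timeWindowInSeconds": 10
  },
  "com.example.DemoService:1.0.0": {
    "errorRateThreshold": 0.3,
    "qosEnabled": true
  }
}
"""


def effective_policy(config, service_key):
    """Start from DEFAULT, then let the service-level entry override it.

    Assumption: the fallback is applied per property.
    """
    policy = dict(config.get("DEFAULT", {}))
    policy.update(config.get(service_key, {}))
    return policy


config = json.loads(CONFIG_BODY)
print(effective_policy(config, "com.example.DemoService:1.0.0")["errorRateThreshold"])  # 0.3
print(effective_policy(config, "com.example.OtherService:1.0.0")["errorRateThreshold"])  # 0.5
```

      Parsing the body with a JSON parser before you publish it is also a quick way to catch syntax errors such as a missing comma.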

Parameters for configuring outlier instance removal

You can configure an outlier instance removal policy by setting properties on the Configuration Management page or by specifying -D JVM parameters. A configuration made on the Configuration Management page takes precedence over one made through -D parameters. We recommend that you complete the configuration on the Configuration Management page.

Parameter | Property | -D parameter | Description | Default value
Maximum Number of Calls | requestThreshold | -Dhsf.qos.request.threshold | An outlier instance is removed only when the number of calls in the most recent statistics window exceeds this threshold. | 10
Lower Error Rate | errorRateThreshold | -Dhsf.qos.error.rate.threshold | When the error rate of an instance of the called application or service exceeds this threshold, the instance is removed. | 0.5
Maximum Number of Removed Instances | maxIsolationRate | -Dhsf.qos.max.isolation.rate | The maximum proportion of instances that can be removed. After this limit is reached, no more abnormal instances are removed. For example, if an application has 6 instances and this parameter is set to 60%, the result is 3.6 (6 × 60%), which is rounded down to 3, so at most 3 instances can be removed. If the result is less than 1, one instance is removed. | 0.2
Recovery Detection Unit Time | isolationTime | -Dhsf.qos.isolation.time | After abnormal instances are removed, EDAS repeatedly checks whether they have recovered, at an interval that increases by this unit each time. Unit: milliseconds (ms). | 60,000 ms (1 minute)
Maximum Number of Cumulative Rollbacks | maxIsolationTimeMultiple | -Dhsf.qos.max.isolation.time.multiple | The maximum number of cumulative rollbacks, beyond which the detection interval no longer increases. For example, if Recovery Detection Unit Time is set to 60,000 ms and Maximum Number of Cumulative Rollbacks is set to 60, an instance that is still abnormal after 60 detections is subsequently checked at an interval of 60 minutes (60 × 60,000 ms). If the instance recovers before this threshold is reached, the detection interval is reset to Recovery Detection Unit Time. | 60
Enable Outlier Instance Removal | qosEnabled | -Dhsf.qos.enable | Specifies whether to enable outlier instance removal for the application or service. | false
Time Window for Statistics | timeWindowInSeconds | -Dhsf.qos.time.window.in.seconds | The statistical period over which the number of calls is counted for Maximum Number of Calls. Unit: seconds. | 10 s
Exception Type | bizExceptionPredicateClassName | -Dhsf.qos.biz.exception.class.name | The type of exception that counts as an error for instances of the application or service. By default, all service exceptions are counted. You can also define specific service exceptions through custom interfaces. For example: count all business exceptions as exceptions with com.taobao.hsf.exception.CountBizExceptionPredicate; ignore all service exceptions with com.taobao.hsf.exception.IgnoreBizExceptionPredicate; or set the instance used to implement com.taobao.hsf.Predicate in bizExceptionPredicate. | com.taobao.hsf.exception.CountBizExceptionPredicate, which counts all business exceptions as exceptions
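
The way the thresholds in this table combine can be sketched in a few lines. This is a simplified illustration rather than the EDAS implementation: the function names should_remove and max_removable are hypothetical, the defaults are taken from the table, and the sketch assumes that the rule about keeping the last available instance is applied on top of the rounding rule for Maximum Number of Removed Instances.

```python
import math


def should_remove(calls, errors, request_threshold=10, error_rate_threshold=0.5):
    """An instance qualifies for removal only if both thresholds are exceeded
    within the current statistics window."""
    if calls <= request_threshold:
        return False  # too few calls in the window to judge the instance
    return errors / calls > error_rate_threshold


def max_removable(total_instances, max_isolation_rate=0.2):
    """Upper bound on removed instances: floor(total * rate), but at least 1,
    and never the last available instance."""
    # The small epsilon guards against floating-point results such as 28.9999...
    limit = max(1, math.floor(total_instances * max_isolation_rate + 1e-9))
    return min(limit, max(total_instances - 1, 0))


print(max_removable(6, 0.6))   # 3, matching the 6 x 60% example in the table
print(max_removable(3, 0.2))   # 1, because a result below 1 still allows one removal
print(should_remove(calls=20, errors=15))  # True: 20 > 10 calls and 0.75 > 0.5
```

Note that both comparisons are strict in this sketch, so an instance with exactly 10 calls, or with an error rate of exactly 0.5, is not removed.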

Verify the result

The outlier instance removal function is enabled after you configure an outlier instance removal policy. You can go to the details page of the application configured with outlier instance removal to view the application monitoring information. For example, a topology shows whether call requests are still forwarded to abnormal instances. You can check whether Error Rate per Minute for application calls is higher than the configured Lower Error Rate. Based on such information, you can determine whether the outlier instance removal policy takes effect.