In a microservice framework, service calls fail when service consumers cannot detect abnormal instances of service providers. This in turn degrades the serviceability and availability of the service consumers. The outlier instance removal feature monitors the availability of the instances of High-speed Service Framework (HSF) applications and services, and dynamically removes abnormal instances from and restores recovered instances to the instance list. This ensures successful service calls and improves service stability and quality of service (QoS).

Background information

In the following figure, a system includes Applications A, B, C, and D, where Application A calls Applications B, C, and D. Application B has one abnormal instance, and Applications C and D each have two abnormal instances. If Application A cannot identify these abnormal instances, some of the calls that it initiates fail. If Applications B, C, and D have a large number of abnormal instances, the service performance and availability of Application A may be affected.

To ensure service performance and availability, you can configure an outlier instance removal policy. After the policy is configured, Enterprise Distributed Application Service (EDAS) monitors the instance status of Applications B, C, and D, and dynamically removes abnormal instances or adds recovered instances back to ensure successful service calls.

The following list describes the outlier instance removal process:

  1. EDAS detects whether Applications B, C, and D have abnormal instances. Then, EDAS determines whether to remove the abnormal instances from the applications based on the configured Upper Limit of Instance Removal Ratio.
  2. EDAS does not distribute the call requests of Application A to the removed instances.
  3. EDAS detects whether the abnormal instances are recovered based on the configured Recovery Detection Unit Time.
  4. The detection interval grows linearly with the number of detections, in steps of Recovery Detection Unit Time. For HSF applications, the default value of this parameter is 60,000 ms (1 minute), as listed for isolationTime in the parameter section. After the value of Maximum Cumulative Number of Times Not Restored is reached, EDAS keeps detecting whether the abnormal instances are recovered at the maximum detection interval.
  5. After the abnormal instances are recovered, they are added back to the instance lists of the applications to continue processing call requests. The detection interval is reset to the value of Recovery Detection Unit Time, for example, 1 minute.
Note
  • If the provider has a large number of abnormal instances and the ratio of abnormal instances exceeds the configured Upper Limit of Instance Removal Ratio, EDAS removes only as many instances as the configured upper limit allows.
  • If the provider has only one instance available, this instance is not removed even if the error rate exceeds the configured limit.
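
The removal cap and the recovery detection schedule described above can be sketched as follows. This is a minimal illustration of the documented rules, not EDAS code. The function names max_removable and recovery_intervals_ms are hypothetical. The rounding rule comes from the description of maxIsolationRate in the parameter section, and the sketch assumes that the nth detection interval equals n times Recovery Detection Unit Time until the maximum multiple is reached.

```python
import math
from typing import List


def max_removable(total_instances: int, max_isolation_rate: float) -> int:
    """Upper bound on the number of instances that may be removed."""
    if total_instances <= 1:
        # The only instance of a provider is never removed.
        return 0
    # Round down; the small epsilon guards against float error (assumption).
    limit = math.floor(total_instances * max_isolation_rate + 1e-9)
    # If the result is less than 1, one instance may still be removed.
    limit = max(limit, 1)
    # Assumption: always keep at least one instance in service.
    return min(limit, total_instances - 1)


def recovery_intervals_ms(isolation_time_ms: int,
                          max_multiple: int,
                          detections: int) -> List[int]:
    """Interval before each detection while the instance stays abnormal."""
    return [min(n, max_multiple) * isolation_time_ms
            for n in range(1, detections + 1)]
```

For example, with a unit time of 60000 ms and a maximum multiple of 3, the first five intervals are 60000, 120000, 180000, 180000, and 180000 ms. After the instance recovers, the schedule starts over from the unit time.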

Create an outlier instance removal policy

For HSF applications, you can create application- and service-level outlier instance removal policies.

  1. Log on to the EDAS console.
  2. In the left-side navigation pane, choose Configuration Management > Configurations.
  3. On the Configurations page, select the region and namespace where you want to create an outlier instance removal policy, and click Create configuration.
  4. In the Create configuration panel, set the parameters. Then, click Create in the lower part of the panel.
    Application Configuration Management - Create an outlier instance removal policy

    The following section describes the parameters for creating an outlier instance removal policy:

    • Region: The value is the region that you select before you create the outlier instance removal policy. This parameter value cannot be modified.
    • Namespace: The value is the namespace that you select before you create the outlier instance removal policy. This parameter value cannot be modified.
    • Data ID: Enter the data ID in the format of <App ID>.QOSCONFIG. You can obtain App ID on the Basic Information page of the application.
    • Group: The value is HSF and cannot be modified.
    • Data encryption: Select whether to encrypt the data. If the outlier instance removal policy contains sensitive data, we recommend that you enable the encryption feature to reduce the risk of data leaks.
    • Configuration format: Select a data format for the content of the outlier instance removal policy. The system verifies the data based on the format that you select.
    • Configuration content: Enter the content of the outlier instance removal policy.

      You can create an outlier instance removal policy for an HSF application at the application or service level by using the related properties and the values that you specify for them. The following examples show how to create outlier instance removal policies at these two levels.

      Note: A service-level outlier instance removal policy takes precedence over an application-level outlier instance removal policy.
      • Example on how to create an application-level outlier instance removal policy
        {
          "DEFAULT": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10,
            "ipDimension": true
          }
        }
      • Example on how to create a service-level outlier instance removal policy
        {
          "DEFAULT": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10
          },
          "service:version": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10
          }
        }

      If you have other requirements, see Parameters for creating an outlier instance removal policy.
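
If you generate the Data ID and the configuration content with a script instead of typing them, a sketch such as the following may help. It only assembles the JSON shown in the preceding examples. The helper names data_id and build_config, the app ID my-app-id, and the service key com.example.OrderService:1.0.0 are hypothetical placeholders. The sketch assumes that a service-level key follows the service:version pattern from the example.

```python
import json
from typing import Dict, Optional

# Property values copied from the examples in this topic.
EXAMPLE_POLICY = {
    "errorRateThreshold": 0.5,
    "isolationTime": 60000,
    "maxIsolationRate": 0.2,
    "maxIsolationTimeMultiple": 15,
    "qosEnabled": True,
    "requestThreshold": 20,
    "timeWindowInSeconds": 10,
}


def data_id(app_id: str) -> str:
    """Build the Data ID in the <App ID>.QOSCONFIG format."""
    return f"{app_id}.QOSCONFIG"


def build_config(app_policy: dict,
                 service_policies: Optional[Dict[str, dict]] = None) -> str:
    """Return configuration content as JSON text.

    "DEFAULT" holds the application-level policy. Every other key is
    assumed to be a service-level policy named "service:version".
    """
    content = {"DEFAULT": dict(app_policy)}
    for key, policy in (service_policies or {}).items():
        content[key] = dict(policy)
    return json.dumps(content, indent=2)


if __name__ == "__main__":
    stricter = dict(EXAMPLE_POLICY, errorRateThreshold=0.3)
    print(data_id("my-app-id"))  # hypothetical app ID
    print(build_config(EXAMPLE_POLICY,
                       {"com.example.OrderService:1.0.0": stricter}))
```

Paste the printed JSON into Configuration content, and select a matching Configuration format so that the console can verify it.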

Parameters for creating an outlier instance removal policy

You can create an outlier instance removal policy by using the related properties in configuration management, or by using -D startup parameters of the Java Virtual Machine (JVM). A policy created in configuration management takes precedence over a policy created by using -D parameters. We recommend that you create outlier instance removal policies in configuration management.

The following list describes each parameter, together with its property, its -D parameter, and its default value.

  • Maximum number of calls
    Property: requestThreshold
    -D parameter: -Dhsf.qos.request.threshold
    Description: An abnormal instance is removed only when the number of calls in the most recent statistics window exceeds this threshold.
    Default value: 10

  • Lower error rate limit
    Property: errorRateThreshold
    -D parameter: -Dhsf.qos.error.rate.threshold
    Description: When the error rate of an instance of the called application or service exceeds this threshold, the instance is removed.
    Default value: 0.5

  • Upper limit of instance removal ratio
    Property: maxIsolationRate
    -D parameter: -Dhsf.qos.max.isolation.rate
    Description: The maximum percentage of abnormal instances that can be removed. After this limit is reached, no more abnormal instances are removed. For example, if an application has 6 instances and this parameter is set to 0.6 (60%), the number of instances that can be removed is 6 × 60% = 3.6, which is rounded down to 3. If the calculation result is less than 1, one instance is removed.
    Default value: 0.2

  • Recovery detection unit time
    Property: isolationTime
    -D parameter: -Dhsf.qos.isolation.time
    Description: After abnormal instances are removed, Enterprise Distributed Application Service (EDAS) continuously detects whether they are recovered, at an interval that increases by this unit of time with each detection. Unit: milliseconds (ms).
    Default value: 60000 (60 × 1,000 ms, which is 1 minute)

  • Maximum cumulative number of times not restored
    Property: maxIsolationTimeMultiple
    -D parameter: -Dhsf.qos.max.isolation.time.multiple
    Description: The maximum number of detections over which the detection interval keeps growing. EDAS continuously detects abnormal instances, and the detection interval increases linearly with the number of detections, in steps of Recovery detection unit time. After the specified maximum number of detections is reached, EDAS keeps detecting whether the abnormal instances are recovered at the longest detection interval. For example, Recovery detection unit time is set to 60,000 ms and Maximum cumulative number of times not restored is set to 60. If an instance is still abnormal after 60 detections, it is subsequently detected at an interval of 60 × 60,000 ms = 60 minutes. If the instance recovers before the specified maximum number of detections is reached, the detection interval is reset to the initial interval, which is the value of Recovery detection unit time.
    Default value: 60

  • Enable outlier instance removal
    Property: qosEnabled
    -D parameter: -Dhsf.qos.enable
    Description: Specifies whether to enable outlier instance removal for the application or service.
    Default value: false

  • Time window for statistics
    Property: timeWindowInSeconds
    -D parameter: -Dhsf.qos.time.window.in.seconds
    Description: The statistical period, which is the time window over which calls are counted for the Maximum number of calls threshold. Unit: seconds.
    Default value: 10

  • Exception type
    Property: bizExceptionPredicateClassName
    -D parameter: -Dhsf.qos.biz.exception.class.name
    Description: The class that determines which service exceptions of instances of the application or service are counted as errors. By default, all service exceptions are counted. You can also use a custom implementation to count only specific service exceptions. Valid values:
      • com.taobao.hsf.exception.CountBizExceptionPredicate: counts all service exceptions as errors.
      • com.taobao.hsf.exception.IgnoreBizExceptionPredicate: ignores all service exceptions.
      • A custom class: the name of a bizExceptionPredicate class in the code of the application that is deployed on the instance. This class is implemented based on com.taobao.hsf.Predicate.
    Default value: com.taobao.hsf.exception.CountBizExceptionPredicate, which counts all service exceptions as errors.
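
If you choose the -D route, the property-to-parameter mapping listed in this section can be applied mechanically. The following sketch renders JVM startup flags from a policy dictionary. The function name to_jvm_flags is hypothetical. The flag names are copied from this topic, and writing Boolean values as lowercase true or false is an assumption based on common JVM system property usage.

```python
from typing import List

# Property name -> -D parameter, as listed in this section.
D_PARAMETERS = {
    "requestThreshold": "-Dhsf.qos.request.threshold",
    "errorRateThreshold": "-Dhsf.qos.error.rate.threshold",
    "maxIsolationRate": "-Dhsf.qos.max.isolation.rate",
    "isolationTime": "-Dhsf.qos.isolation.time",
    "maxIsolationTimeMultiple": "-Dhsf.qos.max.isolation.time.multiple",
    "qosEnabled": "-Dhsf.qos.enable",
    "timeWindowInSeconds": "-Dhsf.qos.time.window.in.seconds",
    "bizExceptionPredicateClassName": "-Dhsf.qos.biz.exception.class.name",
}


def to_jvm_flags(policy: dict) -> List[str]:
    """Render a policy as a list of -Dname=value startup flags."""
    flags = []
    for prop, value in policy.items():
        if prop not in D_PARAMETERS:
            raise KeyError(f"no -D parameter is documented for {prop!r}")
        if isinstance(value, bool):
            # Assumption: Booleans are written in lowercase.
            value = "true" if value else "false"
        flags.append(f"{D_PARAMETERS[prop]}={value}")
    return flags


if __name__ == "__main__":
    print(" ".join(to_jvm_flags({"qosEnabled": True, "requestThreshold": 20})))
```

A policy with the same properties in configuration management still takes precedence over flags that are produced this way.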

Verify the result

After you configure and submit an outlier instance removal policy, the outlier instance removal feature is enabled. To determine whether the policy takes effect, go to the details page of the application and view the monitoring information. In the topology, check whether requests are still forwarded to abnormal instances. You can also check whether Error Rate per Minute of the application is higher than the configured Lower error rate limit.
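
When you read the monitoring data, it can help to check by hand whether the numbers for an instance would cross the two thresholds. The following sketch applies the documented conditions for requestThreshold and errorRateThreshold to the counts from one statistics window. The function name exceeds_thresholds is hypothetical, and treating both comparisons as strictly greater than is an assumption based on the word "exceeds" in the parameter descriptions.

```python
def exceeds_thresholds(requests: int,
                       errors: int,
                       request_threshold: int = 10,
                       error_rate_threshold: float = 0.5) -> bool:
    """Return True if one window's counts meet both removal conditions.

    The default values are the documented defaults of requestThreshold
    and errorRateThreshold.
    """
    # Too few calls in the window: the instance is not evaluated.
    if requests <= max(request_threshold, 0):
        return False
    # The error rate must exceed the threshold.
    return errors / requests > error_rate_threshold
```

For example, 20 errors out of 30 calls in a window meet both conditions with the default values, whereas 8 errors out of 8 calls do not, because the call count does not exceed 10.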