In a microservice framework, service calls are affected when service consumers cannot perceive abnormal application instances of service providers. This further affects the serviceability and availability of service consumers. The outlier instance removal feature monitors the availability of High-Speed Service Framework (HSF) applications and service instances and dynamically adjusts them. This ensures successful service calls and improves service stability and quality of service (QoS).

Background information

A system includes Applications A, B, C, and D, where Application A calls Applications B, C, and D. If instances of Application B, C, or D become abnormal and Application A cannot identify them, some of the calls that Application A initiates fail. For example, suppose that Application B has one abnormal instance and Applications C and D each have two abnormal instances. If Applications B, C, and D have a large number of abnormal instances, the service performance and availability of Application A may be affected.

To ensure the service performance and availability of Application A, you can configure an outlier instance removal policy. After the policy is configured, Enterprise Distributed Application Service (EDAS) monitors the instance status of Applications B, C, and D, and dynamically removes or restores instances to ensure successful service calls.

The following list describes the process of outlier instance removal:

  1. EDAS detects whether Applications B, C, and D have abnormal instances. Then, EDAS determines whether to remove the abnormal instances from the applications based on the configured Upper limit of instance removal ratio parameter.
  2. EDAS does not distribute the call requests of Application A to the removed instances.
  3. EDAS detects whether the abnormal instances are recovered based on the configured Recovery detection unit time parameter.
  4. The detection interval increases linearly with the number of detections: each detection that finds an instance still abnormal lengthens the interval by the value of the Recovery detection unit time parameter, which is 0.5 minutes by default. After the number of detections reaches the value of the Maximum cumulative number of times not restored parameter, EDAS continues to detect whether the abnormal instances are recovered at the maximum detection interval.
  5. After the abnormal instances are recovered, they are added to the instance lists of the applications to continue processing call requests. The detection interval is reset to the value of the Recovery detection unit time parameter, such as 0.5 minutes.
Note
  • If the provider has a large number of abnormal instances and the ratio of the abnormal instances exceeds the value of the Upper limit of instance removal ratio parameter, the number of actually removed instances equals the configured upper limit.
  • If the provider has only one instance available, this instance is not removed even if the error rate exceeds the configured limit.
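
The recovery detection schedule in steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration of the described behavior, not EDAS code: the function name is hypothetical, and the assumption that the first detection waits exactly one unit time is ours.

```python
def recovery_check_interval_ms(detection_number, isolation_time_ms=60_000,
                               max_multiple=60):
    """Wait time before the given recovery detection (1-based).

    Hypothetical helper: the interval grows by one unit time per detection
    and stops growing once the maximum multiple is reached.
    """
    if detection_number < 1:
        raise ValueError("detection_number starts at 1")
    return min(detection_number, max_multiple) * isolation_time_ms


# With isolationTime=60000 and maxIsolationTimeMultiple=15 (the values used
# in the policy examples in this topic), the interval is capped at 15 minutes.
intervals = [recovery_check_interval_ms(n, 60_000, 15) for n in (1, 2, 3, 15, 16)]
print(intervals)  # [60000, 120000, 180000, 900000, 900000]
```

After an instance recovers, the counter would start again at 1, which corresponds to the reset described in step 5.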

Create an outlier instance removal policy

For HSF applications, you can create application- and service-level outlier instance removal policies.

  1. Log on to the EDAS console.
  2. In the left-side navigation pane, choose Microservice Configurations > Configurations.
  3. In the top navigation bar, select a region. On the Configurations page, select a microservice namespace from the Microservice Namespace drop-down list. Then, click Create configuration.
  4. In the Create configuration panel, set the parameters. Then, click Create in the lower part of the panel.

    The following section describes the parameters for creating an outlier instance removal policy:

    • Region: The value is the region that you select before you create the outlier instance removal policy and cannot be changed.
    • Micro service space: The value is the namespace that you select before you create the outlier instance removal policy and cannot be changed.
    • Data ID: Enter an ID for the outlier instance removal policy in the format of <Application ID>.QOSCONFIG. You can obtain the ID of an application on the details page of the application.
    • Group: The value is HSF and cannot be changed.
    • Data encryption: Turn on or off the switch to specify whether to encrypt the data. If the outlier instance removal policy contains sensitive data, we recommend that you turn on Data encryption to reduce the risk of data leaks.
    • Configuration format: Select a data format for the content of the outlier instance removal policy. The system verifies the data based on the format that you select.
    • Configuration content: Enter the content of the outlier instance removal policy.

      You can create an outlier instance removal policy for an HSF application at the application or service level by using the related properties and the values that you specify for them. The following examples show how to create outlier instance removal policies at these two levels.

      Note A service-level outlier instance removal policy takes precedence over an application-level outlier instance removal policy.
      • Example on how to create an application-level outlier instance removal policy
        {
          "DEFAULT": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10,
            "ipDimension": true
          }
        }
      • Example on how to create a service-level outlier instance removal policy
        {
          "DEFAULT": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10
          },
          "service:version": {
            "errorRateThreshold": 0.5,
            "isolationTime": 60000,
            "maxIsolationRate": 0.2,
            "maxIsolationTimeMultiple": 15,
            "qosEnabled": true,
            "requestThreshold": 20,
            "timeWindowInSeconds": 10
          }
        }

      If you have other requirements, see Parameters for creating an outlier instance removal policy.
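
To make the precedence between the two levels concrete, the following Python sketch parses a policy and picks the entry that applies to a given service. It is a reading aid under stated assumptions, not the HSF implementation: the application ID, the service name com.example.OrderService, its version, and the differing threshold are all hypothetical, and we assume that a matching service:version entry replaces DEFAULT as a whole rather than being merged field by field.

```python
import json

# Hypothetical policy content: DEFAULT is the application-level policy,
# and the second key is a service-level policy in service:version form.
POLICY_JSON = """
{
  "DEFAULT": {"errorRateThreshold": 0.5, "requestThreshold": 20, "qosEnabled": true},
  "com.example.OrderService:1.0.0": {"errorRateThreshold": 0.3, "requestThreshold": 20, "qosEnabled": true}
}
"""


def data_id(application_id):
    """Build the Data ID in the <Application ID>.QOSCONFIG format."""
    return f"{application_id}.QOSCONFIG"


def effective_policy(config, service, version):
    """Return the service-level entry if present, else the DEFAULT entry."""
    return config.get(f"{service}:{version}", config["DEFAULT"])


config = json.loads(POLICY_JSON)
print(data_id("my-app-id"))  # my-app-id.QOSCONFIG
print(effective_policy(config, "com.example.OrderService", "1.0.0")["errorRateThreshold"])  # 0.3
print(effective_policy(config, "com.example.OtherService", "1.0.0")["errorRateThreshold"])  # 0.5
```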

Parameters for creating an outlier instance removal policy

You can create an outlier instance removal policy by using the related properties in configuration management or by using Java Virtual Machine (JVM) -D parameters. A policy created in configuration management takes precedence over one created by using -D parameters. We recommend that you create outlier instance removal policies in configuration management.
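
As an illustration of the two mechanisms, the Python sketch below renders a policy as JVM -D flags, using the property and -D parameter names listed in this section, and models the stated precedence. The helper names are hypothetical. Rendering booleans as lowercase true or false and resolving precedence per property are assumptions, not documented behavior.

```python
# Property name -> -D parameter name, as listed in this section.
D_PARAMS = {
    "requestThreshold": "hsf.qos.request.threshold",
    "errorRateThreshold": "hsf.qos.error.rate.threshold",
    "maxIsolationRate": "hsf.qos.max.isolation.rate",
    "isolationTime": "hsf.qos.isolation.time",
    "maxIsolationTimeMultiple": "hsf.qos.max.isolation.time.multiple",
    "qosEnabled": "hsf.qos.enable",
    "timeWindowInSeconds": "hsf.qos.time.window.in.seconds",
    "bizExceptionPredicateClassName": "hsf.qos.biz.exception.class.name",
}


def to_jvm_flags(policy):
    """Render a policy dict as a list of -D flags (hypothetical helper)."""
    flags = []
    for prop, value in policy.items():
        if isinstance(value, bool):
            value = str(value).lower()  # assumption: Java-style true/false
        flags.append(f"-D{D_PARAMS[prop]}={value}")
    return flags


def resolve(from_d_params, from_config_management):
    """Values from configuration management win over -D values (assumed per property)."""
    return {**from_d_params, **from_config_management}


print(to_jvm_flags({"qosEnabled": True, "requestThreshold": 20}))
# ['-Dhsf.qos.enable=true', '-Dhsf.qos.request.threshold=20']
print(resolve({"requestThreshold": 20, "isolationTime": 30000}, {"requestThreshold": 50}))
# {'requestThreshold': 50, 'isolationTime': 30000}
```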

| Parameter | Property | -D parameter | Description | Default value |
| --- | --- | --- | --- | --- |
| Maximum number of calls | requestThreshold | -Dhsf.qos.request.threshold | The call-count threshold. An outlier instance is removed only when the number of calls in the most recent statistics window exceeds this threshold. | 10 |
| Lower error rate limit | errorRateThreshold | -Dhsf.qos.error.rate.threshold | The lower limit of the error rate. When the error rate of an instance of the called application or service exceeds this limit, the instance is removed. | 0.5 |
| Upper limit of instance removal ratio | maxIsolationRate | -Dhsf.qos.max.isolation.rate | The maximum proportion of instances that can be removed. After this limit is reached, no more abnormal instances are removed. For example, if an application has six instances and you set this parameter to 60%, at most 6 × 60% = 3.6 instances can be removed, which is rounded down to 3. If the calculation result is less than 1, one instance is removed. | 0.2 |
| Recovery detection unit time | isolationTime | -Dhsf.qos.isolation.time | The unit time, in milliseconds, used to detect whether abnormal instances are recovered. After abnormal instances are removed, EDAS repeatedly detects whether they are recovered, at an interval that grows by this unit time with each detection. | 60000 (60 × 1,000 ms, or 1 minute) |
| Maximum cumulative number of times not restored | maxIsolationTimeMultiple | -Dhsf.qos.max.isolation.time.multiple | The maximum number of detections that lengthen the detection interval. The detection interval increases linearly with the number of detections, by the recovery detection unit time each time. After the specified maximum number of detections is reached, EDAS continues to detect whether abnormal instances are recovered at the longest detection interval. For example, if the recovery detection unit time is 60,000 ms and this parameter is 60, an instance that is still abnormal after 60 detections is subsequently detected every 60 minutes (60 × 60,000 ms = 60 minutes). If the instance recovers before the maximum number of detections is reached, the detection interval is reset to the recovery detection unit time. | 60 |
| Enable outlier instance removal | qosEnabled | -Dhsf.qos.enable | Specifies whether to enable outlier instance removal for the application or service. | false |
| Time window for statistics | timeWindowInSeconds | -Dhsf.qos.time.window.in.seconds | The time window, in seconds, over which the number of calls is counted. This time window is the statistical period. | 10 (seconds) |
| Exception type | bizExceptionPredicateClassName | -Dhsf.qos.biz.exception.class.name | The class that determines which service exceptions are counted as exceptions for the instances of the application or service. By default, all service exceptions are counted. Valid values: com.taobao.hsf.exception.CountBizExceptionPredicate, which counts all service exceptions as exceptions; com.taobao.hsf.exception.IgnoreBizExceptionPredicate, which ignores all service exceptions; or a class in your application code that implements com.taobao.hsf.Predicate, which applies custom logic. | com.taobao.hsf.exception.CountBizExceptionPredicate (counts all service exceptions as exceptions) |
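
The way the thresholds and the removal cap interact can be sketched as follows. This is a simplified model built only from the parameter descriptions above and is not how EDAS is implemented: the function names and instance addresses are hypothetical, the choice to remove the instances with the highest error rate first is an assumption, and the rule that at least one instance always remains is our reading of the note about a provider with a single available instance.

```python
import math
from fractions import Fraction


def max_removable(total_instances, max_isolation_rate=0.2):
    """Cap on removed instances: rounded down, at least 1, never the last instance."""
    if total_instances <= 1:
        return 0  # a sole instance is not removed
    # Fraction(str(...)) avoids binary floating-point surprises when rounding down.
    cap = math.floor(Fraction(str(max_isolation_rate)) * total_instances)
    return min(max(cap, 1), total_instances - 1)


def is_outlier(calls, errors, request_threshold=10, error_rate_threshold=0.5):
    """An instance qualifies only if both the call count and the error rate exceed their thresholds."""
    return calls > request_threshold and calls > 0 and errors / calls > error_rate_threshold


def select_removals(stats, request_threshold=10, error_rate_threshold=0.5,
                    max_isolation_rate=0.2):
    """stats maps an instance address to (calls, errors) for the latest window."""
    outliers = [ip for ip, (calls, errors) in stats.items()
                if is_outlier(calls, errors, request_threshold, error_rate_threshold)]
    # Assumption: prefer the instances with the highest error rate.
    outliers.sort(key=lambda ip: stats[ip][1] / stats[ip][0], reverse=True)
    return outliers[:max_removable(len(stats), max_isolation_rate)]


stats = {
    "10.0.0.1": (50, 40), "10.0.0.2": (50, 30), "10.0.0.3": (50, 1),
    "10.0.0.4": (50, 1), "10.0.0.5": (50, 1), "10.0.0.6": (50, 1),
}
print(max_removable(6, 0.6))                           # 3
print(select_removals(stats))                          # ['10.0.0.1']
print(select_removals(stats, max_isolation_rate=0.6))  # ['10.0.0.1', '10.0.0.2']
```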

Verify the result

The outlier instance removal feature is enabled after you configure and create an outlier instance removal policy. To check whether the policy takes effect, go to the details page of the application for which you configured outlier instance removal and view the application monitoring information. For example, in the topology, check whether call requests are still forwarded to abnormal instances and whether the per-minute error rate of application calls is higher than the value of the Lower error rate limit parameter.