In a microservice framework, service calls fail when consumers cannot detect the abnormal application instances of a provider, which in turn degrades the consumers' own serviceability and availability. The outlier instance removal feature monitors the availability of application instances and dynamically adjusts the set of instances that receive calls. This helps service calls succeed and improves service stability and quality of service (QoS).

Background information

As shown in the following figure, the system includes Applications A, B, C, and D. Application A can call Applications B, C, and D. Some calls fail if Application A cannot detect the abnormal instances of Application B, C, or D. In this example, Application B has one abnormal instance, and Applications C and D each have two abnormal instances. The more abnormal instances Applications B, C, and D have, the more the performance and serviceability of Application A are affected.

You can configure outlier instance removal for Application A to ensure its serviceability and availability. This allows Application A to monitor the instance status of Applications B, C, and D and dynamically remove abnormal instances or add recovered instances back, so that service calls succeed.

The process of outlier instance removal is as follows.

  1. Enterprise Distributed Application Service (EDAS) detects abnormal instances of Application B, C, or D and determines whether to remove each abnormal instance from the application based on the configured Maximum Number of Removed Instances.
  2. Call requests of Application A are not allocated to the removed instance.
  3. EDAS checks whether the removed instance has recovered. The first check runs after the configured Recovery Detection Unit Time.
  4. The detection interval increases linearly with the number of detections, in multiples of the Recovery Detection Unit Time, which is 0.5 minutes by default. After the configured Maximum Number of Cumulative Rollbacks is reached, the interval stops increasing and EDAS continues to check at this maximum interval.
  5. After the instance recovers, it is added back to the instance list of the application to process call requests. The detection interval is reset to the Recovery Detection Unit Time, for example, 0.5 minutes.
  • When the provider has more abnormal instances than the configured maximum number, only the configured maximum number of instances is removed.
  • When only one available instance is left among the provider's instances, this instance is not removed even if its error rate exceeds the configured threshold.
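
The recovery detection schedule described in steps 3 to 5 can be sketched as follows. This is only an illustration of the documented behavior, not EDAS code: the function name `recovery_check_interval_ms` and its parameter names are hypothetical, and it assumes that the wait before the n-th check is n times the Recovery Detection Unit Time, capped at the Maximum Number of Cumulative Rollbacks.

```python
def recovery_check_interval_ms(check_number, unit_ms=30000, max_rollbacks=20):
    """Wait time before the n-th recovery check of a removed instance.

    check_number: 1 for the first check after removal (or after a reset).
    unit_ms: Recovery Detection Unit Time in milliseconds.
    max_rollbacks: Maximum Number of Cumulative Rollbacks.
    """
    if check_number < 1:
        raise ValueError("check_number starts at 1")
    # Assumption: the interval grows by one unit per check and stops
    # growing once the number of checks reaches max_rollbacks.
    return min(check_number, max_rollbacks) * unit_ms


# Intervals in seconds for checks 1, 2, 3, 20, and 21 with the defaults:
schedule = [recovery_check_interval_ms(n) // 1000 for n in (1, 2, 3, 20, 21)]
print(schedule)  # [30, 60, 90, 600, 600]
```

When the instance recovers, the caller would restart from `check_number=1`, which corresponds to resetting the interval to the Recovery Detection Unit Time.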

Create an outlier instance removal policy

  1. In the left-side navigation pane, choose Microservice Governance > Spring Cloud/Dubbo/HSF, and then click Outlier Ejection.
  2. On the Outlier Instance Removal page, click Create Strategy.
  3. In the Create Strategy wizard, set the parameters on the Basic Info page and click Next.
    Set basic information
    • Namespace: Select a region and a namespace from the drop-down lists.
    • Strategy Name: Enter a policy name. The name can be up to 64 characters in length.
    • Framework Used by the Called Service: Select Dubbo or Spring Cloud as needed.
  4. On the Select Effect App page in the Create Strategy wizard, select the target application and click > to add the application to Selected application. Then, click Next.
    Select target applications

    After you select the target application, the policy takes effect on it: abnormal instances of the applications that it calls are removed, and its call requests are no longer sent to the removed instances.

  5. In the Create Strategy wizard, set the parameters on the Configuration Strategy page and click Next.
    Configure a policy
    • Exception Type: Select Network Anomaly or Network Anomaly + Business Anomaly (Dubbo Exception) as needed.
    • QPS Lower Limit: Enter the lower limit of queries per second (QPS) within a statistical time window. The time window is 15s for applications that use Dubbo 2.7, and 10s for applications that use other Dubbo versions or Spring Cloud. When the QPS in a statistical time window, for example, 15s, reaches the specified lower limit, EDAS starts collecting and analyzing error rate statistics.
    • Lower Error Rate: Set the call error rate threshold. If the error rate of an instance of the called application exceeds this value, the instance is removed. Default value: 50%. For example, with this parameter set to 50%, an instance that is called 10 times in the statistical time window and fails 6 of those calls (an error rate of 60%) is removed.
    • Maximum Number of Removed Instances: Set the upper limit on the abnormal instances that can be removed, as a percentage of the application's total instances. After this limit is reached, no more abnormal instances are removed. For example, if an application has 6 instances and this parameter is set to 60%, the calculated limit is 3.6 (6 × 60%), which is rounded down to 3. If the calculated result is less than 1, no instance is removed.
    • Recovery Detection Unit Time: Set the unit interval, in milliseconds, for detecting whether abnormal instances are recovered. After an abnormal instance is removed, EDAS increases the detection interval by this unit with each detection. Default value: 30000 ms, that is, 0.5 minutes.
    • Maximum Number of Cumulative Rollbacks: Set the number of cumulative rollbacks after which the detection interval is no longer increased. For example, you set Recovery Detection Unit Time to 30000 ms and Maximum Number of Cumulative Rollbacks to 20. If the abnormal instance remains unrecovered after being detected 20 times, the instance is subsequently detected at an interval of 10 minutes (20 × 30000 ms). If the instance recovers before the threshold is reached, the detection interval is reset to Recovery Detection Unit Time.
      Note: We recommend that you do not set Maximum Number of Cumulative Rollbacks to a large value. A large value leads to a long detection interval. If the instance recovers early in the detection interval, the recovery cannot be detected in a timely manner. This results in resource waste and postponed processing of service call requests.
  6. In the Create Strategy wizard, confirm the settings on the Complete Creation page and click Submit.
    Confirm the settings
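
To make the interaction of QPS Lower Limit, Lower Error Rate, and Maximum Number of Removed Instances concrete, the following is a minimal sketch of how one statistical time window might be evaluated. It is not EDAS code. The names `InstanceStats` and `select_instances_to_remove` are hypothetical, and two points are assumptions that the description above does not settle: the QPS lower limit is checked per instance, and when more instances qualify than the limit allows, those with the highest error rate are removed first.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    """Calls observed for one provider instance in one statistical time window."""
    address: str
    calls: int
    failures: int


def select_instances_to_remove(stats, window_seconds=10, qps_lower_limit=1,
                               error_rate_percent=50, max_removed_percent=60):
    """Return the addresses of the instances to remove for this window."""
    total = len(stats)
    # Maximum Number of Removed Instances, rounded down (6 * 60% = 3.6 -> 3).
    cap = total * max_removed_percent // 100
    # The last available instance is never removed.
    cap = min(cap, max(total - 1, 0))

    candidates = []
    for s in stats:
        if s.calls == 0:
            continue  # no traffic, so no error rate to evaluate
        # Assumption: the QPS lower limit applies to each instance separately.
        if s.calls < qps_lower_limit * window_seconds:
            continue
        # Integer comparison of failures / calls > error_rate_percent / 100.
        if s.failures * 100 > error_rate_percent * s.calls:
            candidates.append(s)

    # Assumption: remove the instances with the highest error rate first.
    candidates.sort(key=lambda s: s.failures / s.calls, reverse=True)
    return [s.address for s in candidates[:cap]]


window = [
    InstanceStats("10.0.0.1", calls=10, failures=6),
    InstanceStats("10.0.0.2", calls=10, failures=9),
    InstanceStats("10.0.0.3", calls=10, failures=7),
    InstanceStats("10.0.0.4", calls=10, failures=8),
    InstanceStats("10.0.0.5", calls=10, failures=0),
    InstanceStats("10.0.0.6", calls=10, failures=1),
]
removed = select_instances_to_remove(window)
print(removed)  # ['10.0.0.2', '10.0.0.4', '10.0.0.3']
```

In this example four of the six instances exceed the 50% error rate, but only three are removed because 6 × 60% rounds down to 3. The percentages are kept as integers so that the rounding matches the documented arithmetic without floating-point surprises.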

Verify the result

The outlier instance removal function is enabled after you configure an outlier instance removal policy. You can go to the details page of the application configured with outlier instance removal to view the application monitoring information. For example, a topology shows whether call requests are still forwarded to abnormal instances. You can check whether Error Rate per Minute for application calls is higher than the configured Lower Error Rate. Based on such information, you can determine whether the outlier instance removal policy takes effect.