In a microservices architecture, if a consumer cannot detect that some application instances of a provider are abnormal, its calls to those instances fail. This degrades the performance, and possibly the availability, of the services that the consumer itself provides. The outlier ejection feature monitors the availability of application instances, removes abnormal instances from the set of call targets, and adds them back after they recover. This raises the success rate of service calls and improves service stability and quality of service (QoS). This topic describes how to create an outlier ejection policy.

Background information

The following figure shows a system that requires outlier ejection. In this example, the system has Applications A, B, C, and D, and Application A calls the instances of Applications B, C, and D. If some instances of Application B, C, or D become abnormal and Application A does not identify them, some of the calls initiated by Application A fail. In the following figure, Application B has one abnormal instance, and Applications C and D have two abnormal instances each. If Applications B, C, and D have a large number of abnormal instances, the service performance and availability of Application A may be affected.

To ensure the service performance and availability of Application A, you can configure an outlier ejection policy for Application A. After the policy is configured, Enterprise Distributed Application Service (EDAS) can monitor the instance status of Applications B, C, and D, and dynamically add or remove instances to ensure successful service calls.

Outlier ejection

The following content describes the process of outlier ejection:

  1. EDAS detects whether Application B, C, or D has abnormal instances. If abnormal instances are found, EDAS determines whether to remove the abnormal instances from the application based on the Instance Removal Rate Threshold parameter.
  2. EDAS does not distribute the call requests of Application A to the removed instances.
  3. EDAS detects whether the abnormal instances are recovered based on the Recovery Detection Unit Time parameter.
  4. Each time a check finds that an instance is still abnormal, the detection interval increases linearly, in multiples of the value of the Recovery Detection Unit Time parameter. The default value of Recovery Detection Unit Time is 30000 ms, which equals 0.5 minutes. After the number of checks reaches the value of the Max Number of Instance Checked Before Restoration parameter, EDAS continues to check whether the abnormal instances are recovered at the maximum detection interval.
  5. After the abnormal instances are recovered, EDAS adds the instances back to the application to process call requests. The detection interval is reset to the value of the Recovery Detection Unit Time parameter, such as 30000 ms.
Note
  • If the ratio of abnormal instances of a provider exceeds the threshold that is specified by the Instance Removal Rate Threshold parameter, EDAS removes abnormal instances only up to this threshold.
  • If the provider has only one instance available, EDAS does not remove this instance even if the threshold specified by the Error Rate Threshold parameter is exceeded.
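
The recovery detection schedule in steps 3 to 5 can be illustrated with a short Python sketch. This is not EDAS code: the function name detection_interval_ms and its arguments are hypothetical, and the sketch assumes that the n-th consecutive check waits n × Recovery Detection Unit Time, capped at Max Number of Instance Checked Before Restoration × Recovery Detection Unit Time.

```python
def detection_interval_ms(check_number: int, unit_ms: int, max_checks: int) -> int:
    """Return the assumed wait before the given recovery check, in milliseconds.

    check_number: 1 for the first check after removal (or after a reset),
                  2 for the second consecutive check, and so on.
    unit_ms:      value of Recovery Detection Unit Time (documented default: 30000).
    max_checks:   value of Max Number of Instance Checked Before Restoration.
    """
    if check_number < 1 or unit_ms <= 0 or max_checks < 1:
        raise ValueError("check_number and max_checks must be >= 1, unit_ms must be > 0")
    # Assumption: the interval grows by one unit per failed check and stops
    # growing once the check count reaches max_checks.
    return min(check_number, max_checks) * unit_ms


if __name__ == "__main__":
    # Print the assumed schedule for a unit time of 30000 ms and a maximum of 20.
    for n in (1, 2, 3, 20, 21):
        print(n, detection_interval_ms(n, unit_ms=30000, max_checks=20))
```

Under these assumptions, a unit time of 30000 ms and a maximum of 20 give 30000 ms before the first check and 600000 ms (10 minutes) from the 20th check onward. When an instance recovers, the count starts again at 1.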

Create an outlier ejection policy

  1. Log on to the EDAS console.
  2. In the left-side navigation pane, choose Traffic Management > Microservices Governance > Spring Cloud.
  3. In the navigation pane of the Spring Cloud page, click Outlier Ejection.
  4. In the top navigation bar, select a region. On the Outlier Ejection page, select a microservices namespace and click Create Outlier Ejection Policy.
  5. In the Create Outlier Ejection Policy panel, configure the parameters and click OK.
    Parameter | Description
    Microservice Space | In the drop-down lists, select a region and a microservices namespace.
    Policy Name | Enter a name for the policy. The name can be up to 64 characters in length.
    Framework for Called Services | Select Spring Cloud.
    Select a valid application | Select an application and click the > icon to add the application to the Selected Applications list. After the application is selected, the abnormal instances of all the applications that are called by this application can be removed. Call requests from this application are not distributed to the removed instances.
    Error Rate Threshold | Enter the error rate threshold, in percent. If the error rate on an instance of a called application exceeds this value, the instance is removed. Default value: 50. For example, if an instance receives 10 call requests in the statistical time window and 6 of them fail, the error rate is 60%. If this parameter is set to 50, the instance is removed.
    Advanced Settings | Click the Show icon to display the Advanced Settings section.
    Exception Type | Select Network Exception or Network exception + Business exception (HTTP 5xx) based on your business requirements.
    Lower QPS Limit | Enter the lower limit of queries per second (QPS) for the statistical time window. The statistical time window is 15 seconds for applications that are developed based on Dubbo 2.7, and 10 seconds for applications that are developed based on other Dubbo versions and for Spring Cloud applications. When the QPS in a statistical time window, such as a 15-second window, reaches this lower limit, EDAS starts to collect and analyze error rates.
    Instance Removal Rate Threshold | Specify the maximum proportion of an application's instances that can be removed. After this threshold is reached, no more abnormal instances are removed. For example, if an application has 6 instances in total and you set this parameter to 60%, at most 3 instances can be removed: 6 × 60% = 3.6, which is rounded down to 3. If the calculated result is less than 1, no abnormal instances are removed.
    Recovery Detection Unit Time | Specify a unit interval in milliseconds. EDAS uses this unit interval to detect whether abnormal instances are recovered. After the abnormal instances are removed, EDAS linearly increases the detection interval based on the specified unit interval. Default value: 30000. Unit: ms. The default value equals 0.5 minutes.
    Max Number of Instance Checked Before Restoration | Enter the maximum number of consecutive checks in which an abnormal instance is found not recovered before the detection interval stops increasing. Until then, the detection interval increases linearly in multiples of Recovery Detection Unit Time. After the number of checks reaches the value of this parameter, EDAS checks whether the abnormal instance is recovered at the maximum detection interval. For example, you set Recovery Detection Unit Time to 30000 ms and Max Number of Instance Checked Before Restoration to 20. If EDAS detects 20 consecutive times that an abnormal instance is not recovered, EDAS performs subsequent checks at an interval of 10 minutes: 20 × 30000 ms = 600000 ms. If the instance is recovered before the number of checks reaches the value of this parameter, the detection interval is reset to the value of Recovery Detection Unit Time.
    Note: We recommend that you do not set Max Number of Instance Checked Before Restoration to a large value. A large value results in a long maximum detection interval. If an instance recovers shortly after a check, the recovery is not detected until the next check, which leaves the recovered instance idle and delays the processing of service call requests.
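
The arithmetic behind Error Rate Threshold, Lower QPS Limit, and Instance Removal Rate Threshold can be sketched in Python as follows. The function names and arguments are hypothetical and do not correspond to any EDAS API. The sketch only mirrors the worked examples in the table, and it does not model the rule that the last available instance of a provider is never removed.

```python
def is_outlier(requests: int, failures: int, window_seconds: int,
               lower_qps_limit: float, error_rate_threshold: float = 50) -> bool:
    """Decide whether an instance counts as abnormal in one statistical time window."""
    if requests <= 0 or window_seconds <= 0:
        return False
    # Error rates are analyzed only if the QPS in the window reaches the lower limit.
    qps = requests / window_seconds
    if qps < lower_qps_limit:
        return False
    # The instance is removed if its error rate (in percent) exceeds the threshold.
    error_rate = 100.0 * failures / requests
    return error_rate > error_rate_threshold


def max_removable(total_instances: int, removal_rate_percent: int) -> int:
    """Upper bound on removed instances: total × rate, rounded down."""
    # Integer arithmetic avoids floating-point surprises; a result below 1 becomes 0.
    return total_instances * removal_rate_percent // 100


if __name__ == "__main__":
    # 10 requests with 6 failures in a 10-second window, threshold 50: removed.
    print(is_outlier(10, 6, window_seconds=10, lower_qps_limit=1.0))
    # 6 instances with a 60% removal rate threshold: at most 3 can be removed.
    print(max_removable(6, 60))
```

The lower QPS limit of 1.0 in the usage lines is an illustrative value, not a documented default.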

Verify the result

The outlier ejection feature is enabled after you configure and create an outlier ejection policy. To check whether the policy takes effect, go to the details page of the application for which you configured outlier ejection and view the application monitoring information. For example, on the Topology tab, check whether call requests are still forwarded to abnormal instances and whether Error Rate / 1 Min for application calls is higher than the value of the Error Rate Threshold parameter. For more information, see Application overview.

Manage an outlier ejection policy

On the Outlier Ejection page, find the outlier ejection policy that you want to manage and click Edit or Delete in the Operation column.