Implement high availability capabilities of MSE Microservices Registry

You can use the high availability capabilities that are provided by Microservices Registry of Microservices Engine (MSE) to improve the ability of your applications to handle risks. This topic describes how to implement high availability at three levels of Microservices Registry: instances, service discovery, and configuration management. Microservices Registry Professional Edition of MSE is used in this example.

Recommended versions

spring-cloud-alibaba: 2.2.6.RELEASE or later.
dubbo: 2.7.12 or later.
spring-boot: 2.3.x or earlier. We recommend that you do not use 2.4.x versions because they have compatibility issues.

High availability of MSE registries

High-availability architecture
Applications cannot always run without any downtime. If your applications require higher reliability and higher data security, we recommend that you deploy at least three nodes for your instance. If one of the nodes is down, the workloads on the node can be switched to other nodes within a few seconds. The unhealthy node is automatically removed from the instance.
Microservices Registry Professional Edition is developed based on Nacos 2.0. The Nacos 2.0 architecture ensures the high availability of MSE instances while requiring fewer infrastructure resources. It also enhances the disaster recovery capability of MSE instances. For more information, see Select an edition.
Multi-zone deployment
Each region consists of multiple zones. Applications that are deployed in different zones in the same region can communicate with each other at a latency of less than 3 milliseconds. Multi-zone deployment also allows fault isolation. For example, you can deploy an instance across multiple physical servers that are located in different zones. If the physical server in Zone A is down, the workloads on the physical server can be switched to a physical server in another zone within a short period of time. The failover process is transparent to users and does not require you to make application-facing changes. After you configure multiple nodes for multi-zone deployment, MSE automatically deploys your instance across zones.
Figure 1. Active-active zone-redundancy architecture based on three nodes
Figure 2. Hierarchical disaster recovery architecture

High availability of service discovery

Service discovery involves consumers and providers. High availability is implemented by the empty list protection feature on the consumer side and by the disaster recovery feature on the provider side.

Consumers

Consumers subscribe to instance lists that are provided by providers in a service registry. Applications cannot always run without any downtime. When a service registry updates configurations, performs an upgrade or a downgrade, or encounters an exception such as network disconnection or power outage, subscriptions from consumers may be affected. As a result, the availability of consumers is affected.

To handle subscription errors that are caused by exceptions, you can enable the empty list protection feature for consumers.

If you disable empty list protection, your business is interrupted and an error is returned when a consumer subscribes to an empty instance list.
If you enable empty list protection, a consumer ignores an empty instance list that it receives and continues to use the instance list that it already holds. This ensures the high availability of your business.
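
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the idea, not the nacos-java-client implementation: the `ProtectedSubscriber` class, its `on_push` method, and the sample addresses are hypothetical names chosen for this sketch.

```python
from typing import Dict, List


class ProtectedSubscriber:
    """Keeps the last non-empty instance list for each service (illustrative only)."""

    def __init__(self, empty_protection: bool = True) -> None:
        self.empty_protection = empty_protection
        self._instances: Dict[str, List[str]] = {}

    def on_push(self, service: str, instances: List[str]) -> List[str]:
        """Handle an instance list pushed by the registry and return the list in use."""
        if not instances and self.empty_protection:
            # Ignore the empty push and keep whatever was held before.
            return self._instances.get(service, [])
        self._instances[service] = list(instances)
        return self._instances[service]


subscriber = ProtectedSubscriber(empty_protection=True)
subscriber.on_push("my-provider", ["10.0.0.1:8080", "10.0.0.2:8080"])
# The registry pushes an empty list, for example during a network failure.
current = subscriber.on_push("my-provider", [])
print(current)
```

With protection enabled, `current` still holds the two addresses received earlier. With `empty_protection=False`, the same empty push would leave the consumer with no instances to call.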

Enable empty list protection

Note

Only nacos-java-client 1.4.1 and later versions support empty list protection. For more information about the Spring Cloud and Dubbo versions that support empty list protection, see Recommended versions.

Spring Cloud applications
Add the following settings to the configurations of the Spring Cloud application:
```
spring.cloud.nacos.discovery.namingPushEmptyProtection=true
```
Dubbo applications
Add the following parameter to registry.url:
```
namingPushEmptyProtection=true
```

Persistent caching

Empty list protection on the client relies on the instance data that is stored in the local cache directory. If you deploy your applications in pods, the data in the cache directory may be lost after the pods are restarted. To resolve this issue, you can mount a volume to the cache directory of the pods for persistent storage.

The cache directory is ${user.home}/nacos/naming/${namespaceId}.

Providers

Providers provide the disaster recovery feature to prevent service interruptions that are caused by traffic surges.

Note

Providers that are registered by using nacos-java-client 2.x versions do not support the disaster recovery feature.

When disaster recovery is disabled
If the number of requests from consumers surges, some provider nodes may be overloaded and unable to provide services.
1. The service registry removes the unhealthy node and distributes requests to the healthy nodes.
2. The healthy nodes may also be overloaded and fail.
3. As a result, all provider nodes are down and your business is interrupted.
When disaster recovery is enabled
If the number of requests from consumers surges, some provider nodes may be overloaded and unable to provide services.
1. The service registry removes the unhealthy nodes and distributes requests to the healthy nodes.
2. If the number of unhealthy nodes reaches the threshold of disaster recovery, the requests are evenly distributed to all nodes, including the unhealthy nodes. This prevents the remaining healthy nodes from being overloaded.
3. Half of the nodes can still provide services, so your business is not completely interrupted.
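
The threshold check can be sketched as follows. The sketch assumes the behavior of the `protectThreshold` setting in open source Nacos, where the registry returns every instance, healthy or not, once the proportion of healthy instances is at or below the threshold. The `Instance` type and the `select_instances` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    address: str
    healthy: bool


def select_instances(instances: List[Instance], protect_threshold: float) -> List[Instance]:
    """Return the instances that a consumer would receive (assumed behavior)."""
    if not instances:
        return []
    healthy = [i for i in instances if i.healthy]
    healthy_ratio = len(healthy) / len(instances)
    if healthy_ratio <= protect_threshold:
        # Protection triggered: return all instances so that the few
        # healthy ones do not receive the entire load.
        return list(instances)
    return healthy


nodes = [
    Instance("10.0.0.1:8080", True),
    Instance("10.0.0.2:8080", True),
    Instance("10.0.0.3:8080", False),
    Instance("10.0.0.4:8080", False),
]
# 2 of 4 nodes are healthy (ratio 0.5), which is below a threshold of 0.6.
protected = select_instances(nodes, protect_threshold=0.6)
# With a threshold of 0, protection never triggers while a healthy node exists.
unprotected = select_instances(nodes, protect_threshold=0.0)
```

In this example, `protected` contains all four nodes and `unprotected` contains only the two healthy nodes.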

Enable disaster recovery

Supported instance types
- Instances that use persistent storage: Disaster recovery is supported.
- Instances that do not use persistent storage:
  - nacos-java-client 1.x: By default, unhealthy nodes are removed at an interval of 30 seconds. Removed nodes are no longer counted when the system compares the number of unhealthy nodes with the threshold of disaster recovery. As a result, the disaster recovery feature does not work as expected.
  - nacos-java-client 2.x: Disaster recovery is not supported. When the persistent connection to a node is disconnected, the system immediately removes the node, so the disaster recovery feature cannot take effect.
Enable disaster recovery by using the CLI
- Specify the threshold of disaster recovery for a specific service
```
curl -X PUT "${nacos.address}/nacos/v1/ns/service?namespaceId=public&serviceName=my-provider&protectThreshold=0.6"
```
  - ${nacos.address}: the endpoint of the service registry.
  - namespaceId: the ID of the namespace. Default value: public.
  - serviceName: the name of the Spring Cloud application or the API name of the Dubbo application.
  - protectThreshold: the threshold of disaster recovery. Valid values: 0 to 1.
- Query the threshold of disaster recovery
```
curl -X GET "${nacos.address}/nacos/v1/ns/service?namespaceId=public&serviceName=my-provider"
```
- Returned result
```
{"namespaceId":"public","groupName":"DEFAULT_GROUP","name":"my-provider","protectThreshold":0.6,"metadata":{},"selector":{"type":"none"},"clusters":[]}
```
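
If you prefer to script these calls, the following Python sketch builds the same request URLs and reads `protectThreshold` from a response body. It does not send any request. The address `http://registry.example:8848` is a placeholder, and the helper names `build_service_url` and `read_protect_threshold` are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def build_service_url(address: str, service_name: str,
                      namespace_id: str = "public",
                      protect_threshold: Optional[float] = None) -> str:
    """Build the URL for the service endpoint used by the curl commands above."""
    params = {"namespaceId": namespace_id, "serviceName": service_name}
    if protect_threshold is not None:
        # Only the update (PUT) request carries the threshold.
        params["protectThreshold"] = protect_threshold
    return f"{address}/nacos/v1/ns/service?{urlencode(params)}"


def read_protect_threshold(response_body: str) -> float:
    """Extract protectThreshold from the JSON body returned by the query."""
    return float(json.loads(response_body)["protectThreshold"])


address = "http://registry.example:8848"  # placeholder endpoint
update_url = build_service_url(address, "my-provider", protect_threshold=0.6)
query_url = build_service_url(address, "my-provider")

sample_body = '{"namespaceId":"public","name":"my-provider","protectThreshold":0.6}'
threshold = read_protect_threshold(sample_body)
```

To send the update, you could pass `update_url` to `urllib.request.Request(update_url, method="PUT")` and open it with `urllib.request.urlopen`.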

High availability of Microservices Governance

Microservices Governance allows you to implement a variety of features. For example, you can gracefully start or shut down your applications, remove outlier instances, and configure service downgrade. Microservices Governance also helps improve the high availability of your applications.

High availability of configuration management

The high availability capabilities of configuration management lie in the cache directory and backup directory of the configuration management client and the multidimensional throttling capabilities of registries.

Note

By default, the high availability capabilities of configuration management are enabled in Microservices Registry Professional Edition.

Client
- Cache directory: Each time the client exchanges data with a configuration center, the client saves the latest configurations in the local cache directory. If the server is unavailable, the configurations that are stored in the local cache directory are used.
- Backup directory: If the server is unavailable, you can manually update the configurations that are stored in the backup directory. Then, the client pulls the configurations from the backup directory.
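
The way the two directories work together can be sketched as follows. The lookup order (backup file first, then the server, then the local cache) is an assumption modeled on the open source Nacos client and is not stated in this topic. The `load_config` function and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Callable, Optional


def load_config(data_id: str, fetch_from_server: Callable[[str], str],
                cache_dir: Path, backup_dir: Path) -> Optional[str]:
    """Resolve a configuration by using the backup file, the server, or the cache."""
    backup_file = backup_dir / data_id
    if backup_file.is_file():
        # A manually maintained backup file takes priority (assumption).
        return backup_file.read_text(encoding="utf-8")
    cache_file = cache_dir / data_id
    try:
        content = fetch_from_server(data_id)
    except OSError:
        # Server unavailable: fall back to the last cached copy, if any.
        if cache_file.is_file():
            return cache_file.read_text(encoding="utf-8")
        return None
    # Save the latest configuration so that it survives a server outage.
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(content, encoding="utf-8")
    return content
```

Here `fetch_from_server` stands in for the real network call and is expected to raise `OSError` (for example, `ConnectionError`) when the server cannot be reached.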
Configuration center
A configuration center manages and maintains the underlying resources and throttles traffic to applications based on multiple metrics, which helps improve application availability. These metrics include the maximum number of connections per node and the maximum number of connections per client IP address. A configuration center also allows you to limit the number of times that a configuration set can be published per second or per minute, or limit the volume of data that is transferred per second or per minute when a specific configuration set is published. This reduces the risk of server failures during traffic surges.
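
As an illustration of the per-configuration publish limit, the following Python sketch implements a simple fixed-window counter. It is not the algorithm that MSE uses, which this topic does not describe. The `PublishRateLimiter` class and the configuration name are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class PublishRateLimiter:
    """Limits how often each configuration set can be published per time window."""

    def __init__(self, max_per_window: int, window_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._clock = clock
        # data_id -> (window index, number of publishes in that window)
        self._windows: Dict[str, Tuple[int, int]] = {}

    def allow(self, data_id: str) -> bool:
        """Return True if another publish of data_id fits in the current window."""
        window = int(self._clock() // self.window_seconds)
        last_window, count = self._windows.get(data_id, (window, 0))
        if last_window != window:
            count = 0  # A new window has started, so reset the counter.
        if count >= self.max_per_window:
            return False
        self._windows[data_id] = (window, count + 1)
        return True


limiter = PublishRateLimiter(max_per_window=2, window_seconds=1.0)
results = [limiter.allow("app.properties") for _ in range(3)]
```

The same structure applies to a per-minute limit by setting `window_seconds=60`. The `clock` parameter exists only to make the limiter testable with a fake time source.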