Exploration and Practice of Lossless Online and Offline for Large-Scale Spring Cloud Microservices

"Take a routine release as an example. When a system application on the cloud is released, the restart phase causes the response time of a large number of OpenAPI requests and upstream business requests to increase significantly, or even to time out and fail. As the business grows, the number of users and calls keeps increasing, while the system maintains an efficient iteration frequency of twice a week. The impact of releases on the business is increasingly unacceptable, and governing how microservices go offline is more and more urgent."

The development of cloud native architecture has given our microservice system native capabilities such as automatic elastic scaling, rolling upgrades, and batch releases, allowing us to balance resources, cost, and stability. However, during scale-in, release, and similar operations, instances are often not taken offline gracefully. This leads to short periods of service unavailability, and business monitoring reports a large number of IO exceptions in a short time. If the business does not handle transactions properly, data inconsistency can also result, and the wrong data then has to be corrected manually and urgently. In some cases every release even requires a downtime notice, and users cannot use the service for a period of time.

Analysis of Traffic Loss When Microservices Go Offline

Reducing unnecessary API errors gives the best user experience and the best microservice development experience. How can this headache be solved in the microservice field? Before answering, let's understand why microservices may lose traffic when they go offline.

The figure above shows the normal process by which a microservice node goes offline:

1. Before the node goes offline, consumers call the service provider according to the load balancing rules, and the business runs normally.

2. Service provider node A is about to go offline. The operation starts on this node by sending the signal that stops the Java process.

3. While stopping, the service provider node sends a deregistration request to the registry.

4. After receiving the signal that the service provider node list has changed, the service registry notifies consumers that a node in the service provider list has gone offline.

5. After receiving the new service provider node list, the service consumer refreshes its client-side address list cache and recalculates routing and load balancing based on the new address list.

6. Finally, service consumers no longer call the offline node.

Although the offline process of a microservice is relatively complex, it is entirely logical. The microservice architecture achieves node awareness through service registration and discovery, so it is natural to use the same mechanism to become aware of nodes going offline. On the surface, there is nothing wrong with the whole process.

Some simple measured data may change that view. From step 2 to step 6, Eureka takes 2 minutes in the worst case, and even Nacos takes 50 seconds in the worst case. In step 3, all versions of Dubbo before 3.0 use a service-level (interface-level) registration and discovery model, which puts great pressure on the registry when the business volume is large. Suppose that each registration or deregistration action takes 20 to 30 ms; registering or deregistering five or six hundred services then takes nearly 15 seconds. In step 5, the default address cache refresh interval of the Ribbon load balancer used by Spring Cloud is 30 seconds, which means that even if the client receives the offline signal from the registry in real time, it will still load-balance requests to the old node for some time.
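The registration figures above can be checked with quick arithmetic. The following is a minimal sketch that assumes the registry calls are issued one after another, which is a simplification and not a statement about Dubbo internals; the class and method names are hypothetical.

```java
// Hypothetical helper: estimates how long a batch of registry calls takes.
// Assumption: calls are issued sequentially, not in parallel.
class RegistryCost {
    // Total time in milliseconds for `services` calls that take `perCallMs` each.
    static long totalMillis(int services, int perCallMs) {
        return (long) services * perCallMs;
    }

    public static void main(String[] args) {
        // Bounds taken from the text: 500 to 600 services, 20 to 30 ms per call.
        System.out.println("lower bound: " + totalMillis(500, 20) / 1000.0 + " s");
        System.out.println("upper bound: " + totalMillis(600, 30) / 1000.0 + " s");
    }
}
```

Under this assumption the bounds are 10 and 18 seconds, which is consistent with the figure of nearly 15 seconds quoted above.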

As shown in the figure above, requests stop being load-balanced to the offline node only after the client perceives that the server has gone offline and uses the latest address list for routing and load balancing. Between the moment the node starts going offline and the moment requests are no longer sent to it, business requests may fail. This interval can be called the service call error reporting period.
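To get a feel for what this period costs, the sketch below estimates how many requests are routed to one offline node while consumers still hold a stale address list. It assumes traffic is spread evenly across provider nodes; the inputs in main are illustrative values, not measurements, and all names are hypothetical.

```java
// Hypothetical estimate of requests that hit a node during the error reporting period.
// Assumption: traffic is spread evenly across all provider nodes.
class ErrorWindow {
    // Requests routed to one offline node while consumers still use a stale address list.
    static long requestsAtRisk(long totalQps, int providerNodes, long windowSeconds) {
        return totalQps / providerNodes * windowSeconds;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 10,000 QPS over 10 nodes, 30 s stale cache window.
        System.out.println(requestsAtRisk(10_000, 10, 30)); // prints 30000
    }
}
```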

Under the microservice architecture, with traffic peaks of tens of thousands of requests per second, a service call error reporting period of even a few seconds is very painful for an enterprise. In more extreme cases, the period may stretch to several minutes, so many enterprises do not dare to release during the day and end up scheduling every release at two or three o'clock in the morning. For R&D staff, every release is frightening and painful.

Lossless offline technology

From the analysis of the offline process, we can see that the key to solving the problem is twofold: shorten the service call error reporting period as much as possible, and ensure that a node does not go offline until it has processed every request sent to it.

How can the service call error reporting period be shortened? We came up with several strategies:

1. Move step 3, the deregistration of the node from the registry, ahead of step 2, so that the service is deregistered before the application is stopped. Since K8s provides the PreStop hook, we can abstract this step and place it in the K8s PreStop hook to trigger it.

2. If the registry cannot keep up, can we bypass it and have the server node inform clients directly that it is about to go offline? This action can also be triggered in the K8s PreStop hook.

3. Can the client actively refresh its address list cache after receiving the server's notification?

How can we ensure that the server node goes offline only after processing every request sent to it? From the server's perspective, can a waiting mechanism be provided so that, after clients have been informed of the offline signal, all in-transit requests and all requests being processed by the server are completed before the stop process begins?

As shown in the figure above, these strategies let the service consumer perceive the offline behavior of the service provider node as early and as close to real time as possible, while the service provider ensures that all in-transit and in-progress requests are processed before it goes offline. These ideas look sound. Next, let's see how we implement them in the Spring Cloud and Dubbo service frameworks.

First, we start an HttpServer in the service provider process that exposes an /offline interface to accept the active deregistration notification. In the K8s PreStop hook we can configure curl http://localhost:20001/offline to trigger this interface. After receiving the offline command, the interface either calls the registry's instance offline interface or calls ServiceRegistration.stop in the microservice program to deregister the service, so that the node address is removed from the registry before the microservice is stopped.
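As a rough illustration of such an endpoint, here is a minimal sketch built on the JDK's built-in HttpServer. It is not the implementation described in this article: the class name is hypothetical, and the Runnable passed to the constructor stands in for whatever deregistration call the application actually uses.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of an in-process endpoint that a PreStop hook can call.
// The Runnable is a placeholder for the real deregistration call (assumption).
class OfflineEndpoint {
    private final AtomicBoolean offline = new AtomicBoolean(false);
    private final Runnable deregister;
    private HttpServer server;

    OfflineEndpoint(Runnable deregister) {
        this.deregister = deregister;
    }

    // Runs the deregistration once; repeated calls are ignored and return false.
    boolean trigger() {
        if (offline.compareAndSet(false, true)) {
            deregister.run();
            return true;
        }
        return false;
    }

    // Starts the server on the loopback address; port 0 picks a free port.
    int start(int port) throws IOException {
        server = HttpServer.create(
                new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 0);
        server.createContext("/offline", exchange -> {
            trigger();
            byte[] body = "offline\n".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server.getAddress().getPort();
    }

    void stop() {
        if (server != null) {
            server.stop(0);
        }
    }
}
```

After start(20001), the command from the text, curl http://localhost:20001/offline, would reach this handler. Guarding the deregistration with compareAndSet keeps a retried hook from deregistering twice.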

We also need to implement active notification in the PreStop hook. This is easy in the Dubbo framework because Dubbo itself uses a long-connection model. The Dubbo service provider maintains a collection of channels to all connected service consumers. After receiving the offline command, it sends a ReadOnly signal to every channel it maintains, marking the channel as read-only. After receiving the ReadOnly signal, Dubbo consumers send no more requests to that service provider, which achieves active notification. For the Spring Cloud framework the idea is similar. Because Spring Cloud calls have no channel model, after receiving the offline command the provider puts a ReadOnly tag in the response header of each request. After receiving the ReadOnly tag, the service consumer actively refreshes the Ribbon load balancing cache so that no new requests reach the service provider that is going offline.
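The Spring Cloud variant of this idea can be sketched without any framework code. In the example below, the header name, the classes, and the plain maps that stand in for HTTP headers are all hypothetical; a real integration would live in server-side filters and client-side interceptors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the ReadOnly marker flow. Header name and classes are hypothetical.
class ReadOnlyMarker {
    static final String HEADER = "ReadOnly";

    // Provider side: add the marker to response headers once offline was requested.
    static Map<String, String> decorate(Map<String, String> headers, boolean offlineRequested) {
        Map<String, String> out = new HashMap<>(headers);
        if (offlineRequested) {
            out.put(HEADER, "true");
        }
        return out;
    }

    // Consumer side: a local address cache that drops a provider that sent the marker.
    static class AddressCache {
        private final List<String> addresses = new CopyOnWriteArrayList<>();

        AddressCache(List<String> initial) {
            addresses.addAll(initial);
        }

        // Called for every response; removes the provider if the marker is present.
        void onResponse(String providerAddress, Map<String, String> headers) {
            if ("true".equals(headers.get(HEADER))) {
                addresses.remove(providerAddress);
            }
        }

        List<String> current() {
            return new ArrayList<>(addresses);
        }
    }
}
```

Note that a consumer only learns about the offline node when it happens to receive a response from it, so this mechanism depends on traffic actually flowing between the two nodes.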

The service provider must wait until all in-transit requests have been processed before proceeding with the application stop process. Because business behavior varies, request processing time is uncertain. How long does the service provider need to wait? To solve this problem, we designed an adaptive waiting strategy that gives the application an adaptive waiting period before it goes offline. We count all traffic that enters the service provider and all calls that complete. During this period, the application waits until it has processed all the traffic sent to it, and only then continues with the stop process.
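One simple way to express adaptive waiting is an in-flight request counter that the shutdown path polls until it reaches zero. The sketch below assumes a 10 ms polling interval and an upper bound on the wait; both are choices made for this example and are not taken from the text, and the class name is hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of adaptive waiting: count in-flight requests and wait until they drain.
// The polling interval and the upper bound are assumptions for this example.
class DrainGate {
    private final AtomicInteger inFlight = new AtomicInteger();

    // Call when a request enters the provider.
    void requestStarted() {
        inFlight.incrementAndGet();
    }

    // Call when the request has been fully processed.
    void requestFinished() {
        inFlight.decrementAndGet();
    }

    // Returns true if all requests finished before maxWaitMillis elapsed.
    boolean awaitDrained(long maxWaitMillis) {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (inFlight.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: some requests are still running
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The wait ends as soon as the counter reaches zero, so a quiet node stops almost immediately while a busy one waits only as long as its slowest request needs, up to the bound.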

Through the three strategies of early deregistration, active notification, and adaptive waiting, we enable microservices to go offline without loss. This avoids a long service call error reporting period when a microservice node goes offline and solves the problem of business traffic loss during releases.

Large-scale lossless offline practice

So far, the solutions and strategies above seem complete. But when serving cloud customers, especially in large-scale microservice scenarios, the lossless offline solution still ran into many problems in practice. After a Spring Cloud application in one customer's production environment on the cloud adopted our solution, a large number of ErrorCode: ServiceUnavailable errors still appeared during releases. Investigating together with the customer, we found that the root cause was that some Consumers failed to receive the offline notification from the Provider in time. Even though the server node had gone offline, traffic was still reaching it. At large scale, the timeliness of registry notifications cannot be guaranteed. We also realized that the timeliness of the approach "after receiving the offline command, put the ReadOnly tag in the response of the request" cannot be guaranteed either. Especially when QPS is low, RT is long, and the number of application nodes is large, many consumers never receive the offline notification from the provider. We checked the logs of ReadOnly tags received by each consumer node and found that many consumers had no such log records, which confirmed our suspicion.

Proactive notification

To solve the problems found in large-scale practice, we need a more real-time and reliable active notification scheme. Because Spring Cloud calls have no channel model, the Spring Cloud service provider maintains a list of the addresses of service consumers that have called this instance recently. After receiving the offline command, the service provider traverses this in-memory list and initiates a GoAway HTTP call to each consumer. Correspondingly, we add an interface that receives GoAway notifications to the HttpServer exposed by service consumers. After receiving the call, the service consumer actively refreshes the Ribbon load balancing cache of the current node and isolates the provider node that sent the GoAway request during traffic routing, so that this consumer no longer sends requests to that provider node. Once the provider has made a GoAway call to every consumer node, it has notified all active consumers of the offline signal. In this way, we achieve relatively reliable active notification in a large-scale environment.
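The two sides of this scheme can be sketched as follows. The transport is abstracted as a Predicate that returns true when a consumer acknowledges the call, where a real system would issue an HTTP request; the notion of a "recent" time window, the return value listing unacknowledged consumers, and all class and method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Sketch of GoAway notification. All names are hypothetical.
class GoAway {
    // Provider side: remembers which consumers called this instance recently.
    static class RecentConsumers {
        private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

        // Record the caller of every incoming request.
        void record(String consumerAddress, long nowMillis) {
            lastSeen.put(consumerAddress, nowMillis);
        }

        // Sends GoAway to every consumer seen within the window and
        // returns the consumers that did not acknowledge it.
        List<String> notifyRecent(Predicate<String> sendGoAway, long nowMillis, long windowMillis) {
            List<String> unacknowledged = new ArrayList<>();
            for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
                boolean recent = nowMillis - entry.getValue() <= windowMillis;
                if (recent && !sendGoAway.test(entry.getKey())) {
                    unacknowledged.add(entry.getKey());
                }
            }
            return unacknowledged;
        }
    }

    // Consumer side: providers that sent GoAway are skipped when routing.
    static class Isolation {
        private final Set<String> isolated = ConcurrentHashMap.newKeySet();

        // Handler for the GoAway interface; returns true to acknowledge.
        boolean onGoAway(String providerAddress) {
            isolated.add(providerAddress);
            return true;
        }

        // Filters the cached provider list before load balancing.
        List<String> routable(List<String> providers) {
            List<String> out = new ArrayList<>();
            for (String provider : providers) {
                if (!isolated.contains(provider)) {
                    out.add(provider);
                }
            }
            return out;
        }
    }
}
```

Returning the unacknowledged consumers gives the provider something to retry or log, which matters at large scale where a single lost notification is enough to cause errors.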

Building observability

The lossless offline process is complex and involves notification between multiple nodes. Especially at large scale, confirming the integrity and reliability of the process becomes complicated and tedious. We need thorough observability to see whether anything goes wrong during the offline process and, when problems occur, to locate them and their root causes quickly.

How can we judge whether lossless offline is effective for each release of our applications?

The most intuitive way is to look at the business traffic. From the Provider's perspective, we need to see whether business traffic stops before the Provider goes offline and whether any traffic is lost in the process. With this in mind, we provide the traffic of each provider node and associate lossless offline events with it. This makes it easy to see that the lossless offline process is triggered first and that the application is stopped only after no business traffic remains.

• Metrics traffic view

With Metrics-based observability, we can count and display the business traffic of each Pod and associate lossless offline events with the traffic timeline, so that we can see at a glance whether there are problems in the offline process of a microservice node.

How can we judge whether the lossless offline process is executed as expected?

According to our active notification logic, a microservice node that is going offline needs to make a GoAway call to each Consumer node. Imagine a large-scale scenario in which the current application has 5 consumer applications and each of them has 50 nodes. How can we make sure that GoAway reaches every one of these 250 consumer nodes? The lossless offline process is already complex, and at large scale the difficulty of observing it rises sharply. We realized that Tracing could help improve the observability of lossless offline.
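The underlying question is a set comparison: which consumer nodes should have been notified, and which actually were. The sketch below shows that comparison on plain inputs. In practice the two sets would have to come from the registry and from collected trace data, which is assumed here and not shown; the node names are made up.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of a coverage check for GoAway notifications. Names are hypothetical.
class GoAwayCoverage {
    // Returns consumer nodes that should have been notified but were not.
    static Set<String> missing(Set<String> expected, Set<String> notified) {
        Set<String> out = new TreeSet<>(expected);
        out.removeAll(notified);
        return out;
    }

    public static void main(String[] args) {
        // 5 consumer applications with 50 nodes each, as in the text.
        Set<String> expected = new TreeSet<>();
        for (int app = 1; app <= 5; app++) {
            for (int node = 1; node <= 50; node++) {
                expected.add("app" + app + "-node" + node);
            }
        }
        Set<String> notified = new TreeSet<>(expected);
        notified.remove("app3-node7"); // simulate one lost notification
        System.out.println(expected.size() + " expected, missing: "
                + missing(expected, notified));
        // prints: 250 expected, missing: [app3-node7]
    }
}
```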

• A new approach to lossless offline observability based on Tracing

As shown in the figure above, when we scale in 108 nodes, we get a Tracing link that includes active notification, service deregistration, application stop, and other steps, and we can see the required information in each step.

In the active notification phase, we can see which consumers the current provider node sent GoAway requests to. As shown in the figure below, the provider actively notifies two consumer nodes.

After a consumer receives the GoAway call, it refreshes its load balancing list and isolates the route. The load balancing address list shows the latest address list that the current consumer has cached for the current service. In the following figure, only the call address of the service provider node is left in the address list.

We can also see the result of the call in which Spring Cloud deregisters the service from Nacos (the registry); the deregistration succeeded.

We found that abstracting the lossless offline workflow into a Tracing structure reduces the cost of troubleshooting lossless offline problems in large-scale scenarios with complex links, and helps us better solve the problem of traffic loss when large-scale microservices go offline.


Besides risk control during software iteration, another common problem in the microservice field is traffic governance while applications go online and offline. The goal is clear: applications should not lose any business traffic during releases, scaling, restarts, and similar scenarios. Lossless offline technology came into being against this background. It solves the problem of business traffic loss during changes and is an important part of the traffic governance system. The lossless offline function keeps our business traffic smooth and makes microservice development more pleasant.

MSE's lossless offline function keeps evolving and improving as customer scenarios grow richer. It is worth mentioning that, in the course of practicing microservice governance, we open-sourced the OpenSergo project, which aims to turn microservice governance from production practice into a standard. Anyone interested is welcome to join the discussion and co-construction to define the future of microservice governance.
