This topic describes how to gracefully release, start, or shut down a microservice application by using Microservices Governance.

Problem description

When a downstream microservice application is released or restarted, an upstream application may initiate a call to a downstream instance that is being shut down. As a result, service traffic errors such as connection timeouts and business errors are reported.

Possible causes

- The downstream application is shut down after the call is initiated. As a result, the downstream application does not respond to the call request.
- The downstream application takes a long time to shut down because of complex shutdown logic. This delays the deregistration of the application from the registry.
- The downstream application is shut down as expected, but the upstream application fails to obtain the new IP address list of the downstream application from the registry, or fails to use it, in time. This may be due to reasons such as network failures, resource insufficiency, or abnormal processing logic.
- The client in use is of an earlier version whose update mechanism does not work as expected. As a result, the client does not promptly remove the IP addresses of the downstream application that is shut down from its IP address list.

Solutions

The best solution is to enable the graceful rolling deployment feature provided by Microservices Governance. You can use the graceful rolling deployment feature to gracefully release, start, and shut down microservice applications. This helps prevent the issues discussed in this topic. For more information, see Configure graceful rolling deployment.

If you cannot enable the graceful rolling deployment feature for your microservice applications, view the Nacos client logs of the upstream application. Search the logs by the name of the downstream application and the keyword current ips. Then, check whether the following three times are close to each other: the status change time of the downstream application, the error logging time on the Nacos client, and the error reporting time of the upstream application.
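
The log search described above can be sketched in Python as follows. This is a minimal sketch, not a Nacos or MSE tool. The function name find_ip_list_updates, the sample log lines, and the assumption that every line starts with a 23-character timestamp in the form yyyy-MM-dd HH:mm:ss.SSS are hypothetical. Adjust the parsing to the actual format of your Nacos client logs.

```python
from datetime import datetime

KEYWORD = "current ips"


def find_ip_list_updates(log_lines, service_name):
    """Return (timestamp, line) pairs for log entries that contain both
    the keyword and the name of the downstream application."""
    matches = []
    for line in log_lines:
        if KEYWORD in line and service_name in line:
            # Assumption: the first 23 characters hold the timestamp.
            stamp = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
            matches.append((stamp, line.rstrip("\n")))
    return matches


# Hypothetical sample entries; real Nacos client log lines may look different.
sample_log = [
    "2024-03-01 10:15:02.480 INFO current ips:(2) service: order-service -> [10.0.0.5, 10.0.0.6]",
    "2024-03-01 10:15:03.100 INFO current ips:(3) service: stock-service -> [10.0.1.1]",
    "2024-03-01 10:16:41.912 INFO current ips:(1) service: order-service -> [10.0.0.6]",
]

updates = find_ip_list_updates(sample_log, "order-service")
for stamp, line in updates:
    print(stamp.isoformat(), line)
```

The timestamps returned for the downstream application can then be compared with the status change time of the downstream application and the error reporting time of the upstream application.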

The time information helps you determine whether the upstream application reports an error after the status of the downstream application changes, and whether the upstream application stops reporting errors after the Nacos client logs the error information. If the time information is similar, you can use the common solution described in this topic. For more information, see Common solution.
If the status change time of the downstream application and the error logging time on the Nacos client are similar but the error reported by the upstream application persists, this indicates that the Nacos server pushed a valid IP address and the Nacos client received it, but the upstream application does not use the IP address. In this case, use one of the following methods to identify the cause of the issue:
- If you do not use an open source framework, check the application logic to determine whether a cache mechanism is used and whether a cache update fails.
- If you use an open source framework, ask for help from the open source community.
If the status change time of the downstream application and the error logging time on the Nacos client are similar but the error reported by the upstream application is resolved only after an extended period of time, this indicates that the Nacos server pushed a valid IP address and the Nacos client received it, but the upstream application did not start to use the IP address promptly. Use the following methods to identify the cause of the issue:
1. If you do not use an open source framework, check whether a cache mechanism is used and whether a cache update latency occurs.
2. Check whether an auxiliary framework such as Ribbon, Feign, or Spring Cloud LoadBalancer is used. Such frameworks cache the IP address list and may not refresh it immediately after the list changes. Modify the cache update configuration based on the framework that you use.
3. If you use an open source framework, ask for help from the open source community.
4. If the issue persists after you perform the preceding operations, use the common solution described in this topic. For more information, see Common solution.
If the three times differ greatly, this indicates that the Nacos client did not detect the status change of the downstream application. Use the following methods to identify the cause of the issue:
1. Upgrade the Nacos client of the upstream and downstream applications to version 2.X or later.
2. Check whether the upstream application encounters issues such as network failures and resource insufficiency.
3. Check whether blocking logic exists in the shutdown process of the downstream application. If blocking logic exists, the downstream application cannot respond to call requests, but its IP address remains in the registry.
4. If the issue persists after you perform the preceding operations, use the common solution described in this topic. For more information, see Common solution.
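
The comparison of the three times can be sketched as a small decision function. This is only an illustration of the cases described above. The function name classify, the returned labels, and the 10-second tolerance are hypothetical, because this topic does not define how close the times must be to count as similar. The sketch also simplifies the last case by checking only the gap between the status change time and the Nacos client log time.

```python
from datetime import datetime, timedelta


def classify(status_change, client_log, last_error, tolerance=timedelta(seconds=10)):
    """Map the three times to one of the cases in this topic.

    status_change: when the status of the downstream application changed
    client_log:    when the Nacos client logged the new IP address list
    last_error:    when the upstream application last reported an error,
                   or None if it still reports errors
    tolerance:     hypothetical threshold for treating two times as similar
    """
    if abs(client_log - status_change) > tolerance:
        return "client did not detect the status change"
    if last_error is None:
        return "application does not use the pushed IP address"
    if last_error - client_log > tolerance:
        return "application uses the pushed IP address late"
    return "times are similar: use the common solution"


t0 = datetime(2024, 3, 1, 10, 15, 0)
print(classify(t0, t0 + timedelta(seconds=2), t0 + timedelta(seconds=4)))
print(classify(t0, t0 + timedelta(seconds=2), None))
print(classify(t0, t0 + timedelta(seconds=2), t0 + timedelta(seconds=90)))
print(classify(t0, t0 + timedelta(seconds=120), t0 + timedelta(seconds=121)))
```

Choose a tolerance that matches the push and refresh intervals in your environment.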

Common solution

1. Before you shut down the downstream application, call a Nacos API operation to set the enabled parameter of the downstream application to false, or shut down the downstream application in the MSE console. Then, use information such as metric data and logs to confirm that no call requests are still being sent to the downstream application. For more information, see OpenAPI and Shut down an application instance.
2. Shut down the downstream application and change the status of the downstream application.
3. After the downstream application is started again and provides services as expected, call a Nacos API operation to set the enabled parameter of the downstream application to true, or start the downstream application in the MSE console. For more information, see OpenAPI and Start an application instance.
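
The status updates before the shutdown and after the restart can be sketched as follows. This sketch assumes that the registry exposes the instance update endpoint of the open source Nacos v1 HTTP API (PUT /nacos/v1/ns/instance) without authentication. The OpenAPI referenced in this topic may use a different operation and different parameters. The server address, service name, IP address, and port are placeholders, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_instance_update(server, service_name, ip, port, enabled, ephemeral=True):
    """Build, but do not send, a PUT request that sets the enabled flag
    of one instance (assumed open source Nacos v1 endpoint)."""
    params = urlencode({
        "serviceName": service_name,
        "ip": ip,
        "port": port,
        "enabled": "true" if enabled else "false",
        "ephemeral": "true" if ephemeral else "false",
    })
    return Request(f"{server}/nacos/v1/ns/instance?{params}", method="PUT")


def set_instance_enabled(server, service_name, ip, port, enabled):
    """Send the update and return the response body of the server."""
    request = build_instance_update(server, service_name, ip, port, enabled)
    with urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


# Placeholder values: replace them with the values of your own instance.
offline = build_instance_update("http://127.0.0.1:8848", "order-service", "10.0.0.5", 8080, False)
print(offline.get_method(), offline.full_url)
```

Call set_instance_enabled with enabled set to False before the shutdown, and with enabled set to True after the application provides services as expected. Depending on the Nacos version, parameters that are omitted from the request, such as the weight or metadata, may be reset to default values, so verify the behavior in a test environment first.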