How to Build a Traffic-Lossless Online Application Architecture

By Guyi and Shimian

1. Preface

How to Build a Traffic-Lossless Online Application Architecture – Part 1 discussed the following figure. We explained some key technologies for traffic analysis and traffic access to achieve lossless traffic. This article discusses some technical details that will affect online traffic in the formal service process from the perspective of traffic service.

2. Services

Recently, we analyzed the causes of online failures of a large Internet company in 2021. Among them, product quality problems (such as unreasonable design and bugs) accounted for the highest proportion, reaching 37%. Problems caused by online release, product, and configuration changes accounted for 36%. There are also high-availability problems with dependent services, such as equipment failure, upstream and downstream dependent service problems, or resource bottlenecks.

We can see from the cause analysis in the left part of the figure above that properly managing the problems of change, product quality, and high availability are the keys to achieving traffic-lossless service. We divide these reasons into several key stages according to the application lifecycle:

Application Change State: During the process of version release and configuration change of our service (application), we need to have available means to protect it.
Running State: After the change is completed, it is in a cold state at the beginning. How can the system run smoothly before normal traffic or abnormal traffic arrives?
High-Availability Dependency State: Is there a corresponding method when some nodes of our service have problems or external dependencies (such as other microservices, DB, and caches) have bottlenecks?

We will list relevant examples to explore the corresponding solutions and efficiently deal with these scenarios.

2.1 Change State: Graceful Application Going Offline

One step in the application change is to stop the original application. Before the online application in the production environment is stopped, the service of the service traffic needs to be offline. There are two main types of traffic we often refer to – synchronous call traffic (such as RPC and HTTP requests) and asynchronous traffic (such as message consumption and background task scheduling).

If the process is stopped when the server still has unprocessed requests, both of the preceding traffic types will cause traffic losses. We can follow these two steps to solve this situation:

1) Remove the existing node from the corresponding registration service.

Scenarios include removing the RPC service from the node in the registry, removing the HTTP service from the upstream load balancing, and trying to close the thread pool of background tasks (such as message consumption) so there are no new consumption or services performed.

2) Stop for a period (depending on the business situation) to ensure that the incoming traffic in the process can be properly handled before closing the process.

2.2 Change State: Application Scheduling

Another action in the change process is to initiate a deployment after selecting resources (machines or containers). Generally speaking, scheduling is how resources are selected. Traditional physical machines or virtual machines with limited scheduling capability do not have much potential in scheduling, as their resources are virtually fixed. However, the emergence of container technology (especially the popularity of Kubernetes) brought many changes to delivery and scheduling. It ushered in an era of flexible scheduling from traditional resource allocation planning.

text1

2.3 Change State: Application Graceful Go-Live

After the needed resources are scheduled by the application, the next step is to deploy and start the application. Similar to the application stop scenario, our node is very likely to have been registered before the application is fully initialized, and the thread pool of background tasks is likely to have started. At this time, traffic is scheduled for upstream services (such as SLB starting routing and message starting consumption). However, the service quality of traffic cannot be guaranteed until the application is fully initialized. For example, the first few requests after starting a Java application are usually in a lagging state.

How can we solve this problem? How can we solve this problem? We need to wait until the application is fully initialized and then start to deal with the initialization of service registration, background task thread pool, and consumer thread pool of message. If there is an external load balancing routing traffic, the automation tools of application deployment are needed.

2.4 Change State: Application Service Prefetching

After the system go-live is completed, sometimes a surge in traffic may lead to an exceedingly high system water level, causing a system crash. Great promotion activity at midnight is a typical scenario. During peak traffic, a large amount of traffic rushes into application instances within a short time. This triggers underlying resource optimization issues, such as JIT compilation, framework initialization, and class loading. These optimizations cause high loads to the system in a short time, resulting in traffic loss. If we want to solve this problem, we need to control the traffic to make it slowly increase. We can realize the lossless traffic of operations such as scale-out and go-live in large traffic scenarios by enabling the class loader to load classes in parallel, initializing the framework in advance, and using log asynchronization to increase the business capacity of the newly started application.

2.5 Change State: Kubernetes Service Combination

Since 2020, we have seen a trend towards Spring Cloud and Kubernetes, which has become the most popular technology combination in microservice systems. Effectively combining the microservice system with Kubernetes is a challenge in a Kubernetes-based microservice system. Pod lifecycle management in Kubernetes provides two detection points:

RreadinessProbe is used to detect whether a Pod is ready to accept traffic. If the detection fails, the node will be removed from Kubernetes Service, and the node state is set to NotReady.
LivenessProbe is used to detect whether the Pod is healthy. If the detection fails, the Pod will be restarted.

If our application is not configured with readinessProbe, Kubernetes only checks whether the process in the container is started and whether it is running by default. However, the health state of the running businesses in the process is hard to determine. In the release process, if we use a rolling release strategy, when Kubernetes finds that the business process in the new Pod has been started, Kubernetes will start to destroy the old-version Pod. It seems there is no problem with the whole process. However, if we think about it carefully, "the business process in the new Pod has been started" does not mean it is ready. If there is a problem with the business code, our process is started, and the business ports are exposed, the service likely does not have enough time to register after the process is started due to abnormalities (such as the business code). However, the old-version Pod has already been destroyed at this time. Therefore, consumers may encounter a No Provider issue, resulting in a large amount of traffic loss during the release process.

Similarly, if our application is not configured with livenessProbe, Kubernetes only checks whether the process in the container is alive by default. However, when a process of application is in a suspended state due to resource competition, FullGc, full thread pool, or some unexpected logic, the process service quality is low or even 0, although it is alive. At this moment, all traffic entering the current application will report errors, resulting in a large amount of traffic loss. Therefore, our application should use livenessProbe to tell Kubernetes that the Pod of the current application is unhealthy and cannot be restored by itself. The current Pod needs to be restarted.

The configuration of readinessProbe and livenessProbe is to provide timely and sensitive feedback on the health status of the current application. It can ensure that all processes in the Pod are in a healthy state, thus ensuring that no traffic loss is caused in the service.

2.6 Change State: Canary Release

After a version iteration, it is difficult to ensure that the new code does not have any problems although it has been tested. Why are most failures related to releases? This is because the release is the last step of the overall business to go live. Often, some issues accumulated during the development process are only triggered in the final release stage. In other words, an unspoken rule is that bugs are inevitable in online publishing. It is just a matter of their size. Now that bugs cannot be avoided, there is one problem to be solved in the release stage. How can the impact be minimized? The answer is canary release. If there are some problems that have not been tested, and we also have a full batch of online releases, the errors will be amplified due to their spread in the entire network. It results in a large number of long-term online traffic losses. If our system has a canary release (even comprehensive-procedure canary release) capability, we can keep the impact of the problem to a minimum. If the system has complete observability capability in the canary release, the release will be much more stable and safer. If the canary release can support comprehensive procedures, lossless online traffic can be guaranteed even when multiple applications are released at the same time.

2.7 Running State: Service Degradation

A downstream service provider may run into a performance bottleneck that even affects the business during peak hours. In this case, we can use the service degradation feature to degrade some service consumers. Unimportant service consumers do not actually call the service when using this feature. Instead, mocked responses are directly returned, including sample errors. Valuable resources of the downstream service provider are reserved for important service consumers. This way, the overall service stability is improved. We call this process service degradation.

If the downstream services on which the application depends are unavailable, the business traffic will be lost. You can configure the service degradation capability. When an exception occurs to a downstream service, the service degradation enables traffic to fail fast on the calling end, effectively preventing the avalanche effect.

2.8 Running State: Automatic Outlier Removal

Similar to service degradation, automatic outlier removal is the capability to automatically remove nodes when a single service is unavailable during traffic service. It is different from service degradation in two main points:

1) Automatic Completion: Service degradation is an operation and maintenance (O&M) action, which needs to be configured through the console, and the corresponding service name is specified to achieve the corresponding effect. However, the automatic outlier removal is to proactively detect the survival of upstream nodes and perform degradation operation on this procedure as a whole.

2) Removal Granularity: The service degradation target is service and node IP. Let's take Dubbo as an example. A process will publish a microservice with the service interface name as the service name. If the degradation of this service is triggered, this service of this node will not be called next time, but other services will still be called. However, all services of the node will not be called next time in the outlier removal mode.

2.9 High Availability: Registry Disaster Recovery

As the core component of service registration discovery, the registry is an essential part of the microservices architecture. In the CAP model, the registry can abandon a little bit of data consistency. This way, the service address obtained by each node at the same time allows for short-term inconsistency, but the availability must be guaranteed. Once the registry is unavailable due to some problems, the nodes connected to it may cause a catastrophic blow to the entire system because the service address cannot be obtained. In addition to common high-availability methods, the unique disaster recovery methods of the registry include:

Push-Empty Protection: Batch transient connection often occurs in the data center network jitters or in the process of publishing. However, this situation is not the unavailability of business services. If the registry recognizes this is an abnormal situation (batch transient connection or address empty), it should adopt a conservative strategy to avoid when mispush causes no provider issues of all services, resulting in a large amount of traffic loss for all microservices.
Client Cache Disaster Recovery: The logic of push-empty protection is also applicable to the client. We often encounter a situation where the address is updated when the client and the registry have network problems. The client cannot fully believe all the results fed back by the registry. Only when clearly told that the results are normal can the address in memory be updated. This is especially the case for the last address, for which we must take a cautious strategy. After getting the address, we cannot fully believe it because it is very likely that the address pushed down by the registry is not accessible at all. In this case, there must be a policy similar to heartbeat and keepalive mechanisms to dynamically adjust whether the peer service is available to avoid traffic loss caused by sending the service directly to the address that cannot be connected.
Local Cache Disaster Recovery: Is it enough when the registry and the client both have disaster recovery capabilities? Usually, this is enough if we do not make changes, but what will happen if the registry is unavailable when an application is making changes? One issue is that the address cannot be registered. Another issue is that if there is a service initiating dependency call, the no provider problem will occur after the traffic comes in, resulting in traffic loss or failure to start. If we simplify the dependency of the registry and understand it as the address resolution of a service name, we can save the resolution result locally to make a disaster recovery backup. This way, traffic loss caused by the unavailability of the registry can be avoided during the change process.

2.10 High Availability: Zone-Disaster Recovery with Multiple Data Centers

The characteristic of the same zone is that RT is generally within a relatively low latency (less than 3ms). Therefore, we can build a large Local Area Network (LAN) based on different data centers in the same zone by default. Then, we distribute our applications in multiple data centers to deal with the risk of traffic damage when a single data center fails. Compared with active geo-redundancy, the building cost of such infrastructure is lower, and the structural changes needed are also less. However, the links between applications are complicated in the microservice system. As the links go deeper, the complexity of governance will also increase. As shown in the following figure, the frontend traffic is likely to cause a surge in RT due to calls between different data centers, resulting in traffic loss.

The key to solving the problem above is supporting the capability of priority routing in the same data center at the service framework level. This means if the target service is in the same data center as the service provider, the traffic is preferentially routed to the nodes located in the same data center. The method to realize this capability is to report the information of the data center where the service provider is located when registering the service. The information of the data center is also pushed to the caller as meta information. The caller preferentially selects the addresses in the same data center as itself as the routing destination addresses through the Router capability of the customized service framework.

3. Summary

This article is Part 2 of a 3-part series entitled How to Build a Traffic-Lossless Online Application Architecture. It aims to use the simplest language to classify the technical problems that affect the stability of online application traffic. Solutions to these problems are only code-level details. Some require tools to cooperate, and others require expensive costs. If you expect to have a one-stop experience of traffic lossless for your application on the cloud, please pay attention to Alibaba Cloud Enterprise Distributed Application Service (EDS). Part 3 will explain the perspective of data service exchange and two keys to key prevention.

Community

How to Build a Traffic-Lossless Online Application Architecture – Part 2

1. Preface

2. Services

2.1 Change State: Graceful Application Going Offline

2.2 Change State: Application Scheduling

2.3 Change State: Application Graceful Go-Live

2.4 Change State: Application Service Prefetching

2.5 Change State: Kubernetes Service Combination

2.6 Change State: Canary Release

2.7 Running State: Service Degradation

2.8 Running State: Automatic Outlier Removal

2.9 High Availability: Registry Disaster Recovery

2.10 High Availability: Zone-Disaster Recovery with Multiple Data Centers

3. Summary

Read previous post:

Read next post:

Alibaba Cloud Native

You may also like

Comments

Alibaba Cloud Native

Related Products

Bastionhost

Managed Service for Grafana

DevOps Solution

IT Services Solution