In-depth Analysis of Traffic Isolation Technology of Online Application Nodes

By Wenxin Xie (Fengjing)

Why Do We Need to Isolate Traffic?

This question originates from a difficult situation encountered by a customer using Enterprise Distributed Application Service (EDAS). The customer observed abnormal CPU metrics in an online Pod and wanted to diagnose the problem without rebuilding the Pod. However, running the diagnosis while traffic still passes through the abnormal Pod degrades service quality. The customer therefore asked for a way to remove the traffic flowing into the abnormal node, creating an isolated diagnostic environment, and to lift the isolation and restore normal operation once the node is fixed.

In addition to isolating all input traffic for diagnostic purposes, there is a need to isolate specific traffic for simulation drills. To address this, we have conducted thorough research and developed a set of ready-to-use traffic isolation tools. These tools can dynamically isolate specific traffic and easily recover when needed, meeting the traffic isolation requirements for various scenarios.

Which Traffic is Isolated?

The purpose of traffic isolation is to block inbound traffic to application nodes. Let's first identify the types of inbound traffic for microservice application nodes.

Inbound traffic to microservice application nodes can be broadly categorized into two types: service traffic and event traffic. The figure below illustrates the composition of traffic for a typical microservice application.

Service traffic refers to the calls received when all nodes of a microservice application act together as one network entity that provides a set of services to other systems, services, or users. For service traffic, the node itself does not directly determine whether the traffic flows in. Instead, a service registration and discovery mechanism maintains the logical relationship of the traffic path, and the node is registered as an endpoint of the service. When a caller initiates a request, the request targets the logical address of the service. After forwarding and address translation, the request is routed to the entity node behind a service endpoint. Service traffic could be isolated by breaking the communication connections of service invocations, but this would inevitably affect service quality. A more elegant solution is to break the mapping between the service and the entity node while the service as a whole keeps running normally. Routing then directs traffic away from that particular node as intended. Service traffic mainly covers Kubernetes Services and services built with microservice frameworks such as Spring Cloud and Dubbo and published through registries such as Nacos.

Event traffic refers to the traffic generated by the event-driven architecture within an application, including events or messages delivered by middleware to application nodes. This type of communication is usually asynchronous, such as message traffic from the Apache RocketMQ message queue and event traffic triggered by the SchedulerX scheduling framework. Middleware and application nodes typically follow a client-server communication pattern. Therefore, isolating message or event traffic sent by middleware can be achieved by breaking the communication connection.

Service Traffic Isolation

K8s Service

For applications that expose services through Kubernetes Service, the mapping between the services declared by the Service and the application Pods is maintained by Endpoint objects. The subsets field of the Endpoint objects represents a set of endpoints for the Service. Each endpoint corresponds to the network address of an application Pod, which is an instance that actually provides the service. The subsets field contains detailed information about these endpoints, such as IP addresses and ports.

The Endpoint controller listens to pod changes through the API server and synchronously updates the endpoint list. To isolate the traffic of a Kubernetes Service, the endpoints pointing to the pods need to be destroyed, removing the network address of the isolated pod from the endpoint list. Additionally, the Informer mechanism is used to monitor changes in the Endpoint objects, ensuring that the endpoints maintain the expected state during subsequent changes or the controller's reconciliation process.
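The interplay between removing an address and keeping it removed during reconciliation can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the class `EndpointIsolationSketch` and all of its methods are hypothetical stand-ins, not the Kubernetes client API, and the pod IPs are made up. The model shows why a change callback is needed: without it, the controller's next sync would put the isolated address back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified in-memory stand-in for one Service's Endpoints object.
// This is NOT the Kubernetes client API; all names here are hypothetical.
class EndpointIsolationSketch {
    // Pod IPs currently listed as endpoints (the addresses under subsets in the real object).
    private final List<String> addresses = new ArrayList<>();
    // Pod IPs that must stay out of the list while isolation is active.
    private final Set<String> isolated = new LinkedHashSet<>();

    // Simulates the endpoint controller rewriting the list during reconciliation.
    public void controllerSync(List<String> podIps) {
        addresses.clear();
        addresses.addAll(podIps);
        onEndpointsChanged();
    }

    // Informer-style callback: re-apply isolation after every change.
    public void onEndpointsChanged() {
        addresses.removeIf(isolated::contains);
    }

    // Starts isolation and removes the address immediately.
    public void isolate(String podIp) {
        isolated.add(podIp);
        onEndpointsChanged();
    }

    // Lifts isolation; the address returns on the next controller sync.
    public void restore(String podIp) {
        isolated.remove(podIp);
    }

    public List<String> addresses() {
        return new ArrayList<>(addresses);
    }

    public static void main(String[] args) {
        EndpointIsolationSketch endpoints = new EndpointIsolationSketch();
        List<String> pods = Arrays.asList("10.0.0.1", "10.0.0.2");
        endpoints.controllerSync(pods);
        endpoints.isolate("10.0.0.2");
        endpoints.controllerSync(pods); // reconciliation tries to add the pod back
        System.out.println("while isolated: " + endpoints.addresses());
        endpoints.restore("10.0.0.2");
        endpoints.controllerSync(pods);
        System.out.println("after restore: " + endpoints.addresses());
    }
}
```

In a real cluster the callback would patch the Endpoints object through the API server rather than mutate a local list.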

Dubbo

For applications that use a registry to expose services, the registry is responsible for managing the service nodes. As long as the registration relationship exists and the application node is active, the registry routes traffic to the corresponding application node. The process of destroying the service registration relationship is known as service deregistration. When an application node performs service deregistration, the registry stops directing traffic to the deregistered node, thereby achieving traffic isolation.

To implement dynamic deregistration of Dubbo microservices, it is necessary to understand the Dubbo service registration principle at the source code level. Taking Dubbo 2.7.0 as an example, the structure of the service registration module is as follows:

In a Dubbo application, there is an AbstractRegistryFactory singleton responsible for initializing the container for the Registry. The class attribute REGISTRIES maintains the mapping between the microservice list and the Registry instance.
The AbstractRegistry class implements the Registry interface and serves as a template. It implements common public methods such as register and unregister, and also maintains a list of registered service URLs.
FailbackRegistry extends AbstractRegistry and provides a failure retry mechanism. It also declares the abstract methods doRegister and doUnregister. When a register/unregister operation is executed, the corresponding doRegister/doUnregister method is called.
Specific registry implementations such as NacosRegistry and RedisRegistry encapsulate the logic for doRegister and doUnregister.
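The layering described above follows the template method pattern, which the following minimal sketch tries to make concrete. The classes are simplified stand-ins with a `Sketch` suffix, not the Dubbo source: URLs are plain strings instead of Dubbo's URL type, the retry task is reduced to a set of failed URLs, and `InMemoryRegistrySketch` is a hypothetical replacement for a concrete registry such as NacosRegistry.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Stand-in for the Registry interface (URLs simplified to strings).
interface RegistrySketch {
    void register(String url);
    void unregister(String url);
}

// Stand-in for AbstractRegistry: keeps the local list of registered URLs.
abstract class AbstractRegistrySketch implements RegistrySketch {
    protected final Set<String> registered = new LinkedHashSet<>();

    @Override
    public void register(String url) { registered.add(url); }

    @Override
    public void unregister(String url) { registered.remove(url); }
}

// Stand-in for FailbackRegistry: delegates to doRegister/doUnregister
// and remembers failures so that they could be retried later.
abstract class FailbackRegistrySketch extends AbstractRegistrySketch {
    protected final Set<String> failedRegistered = new LinkedHashSet<>();
    protected final Set<String> failedUnregistered = new LinkedHashSet<>();

    @Override
    public void register(String url) {
        super.register(url);
        failedRegistered.remove(url);
        try {
            doRegister(url);
        } catch (RuntimeException e) {
            failedRegistered.add(url); // a retry task would pick this up
        }
    }

    @Override
    public void unregister(String url) {
        super.unregister(url);
        failedUnregistered.remove(url);
        try {
            doUnregister(url);
        } catch (RuntimeException e) {
            failedUnregistered.add(url); // a retry task would pick this up
        }
    }

    protected abstract void doRegister(String url);
    protected abstract void doUnregister(String url);
}

// Hypothetical concrete registry; "remote" plays the role of the registry server.
class InMemoryRegistrySketch extends FailbackRegistrySketch {
    public final Set<String> remote = new LinkedHashSet<>();

    @Override
    protected void doRegister(String url) { remote.add(url); }

    @Override
    protected void doUnregister(String url) { remote.remove(url); }
}

class RegistryLayersDemo {
    public static void main(String[] args) {
        InMemoryRegistrySketch registry = new InMemoryRegistrySketch();
        String url = "dubbo://10.0.0.1:20880/com.example.DemoService"; // made-up URL
        registry.register(url);
        System.out.println("after register: " + registry.remote);
        registry.unregister(url);
        System.out.println("after unregister: " + registry.remote);
    }
}
```

Because callers only see register and unregister, code that triggers them does not need to know which concrete registry sits underneath.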

As can be seen from the source code, Dubbo's service registration module already includes built-in methods that allow dynamic unregistering/re-registering of services. Therefore, Dubbo microservice isolation can be achieved by actively triggering the service unregister method of its registry object. Similarly, if a service node needs to be restored, the service register method can be triggered to update the service mapping relationship in the registration center.

After settling on the approach of "triggering the service unregister method of the registry object," two problems remain: how to obtain the object and how to trigger the method. In a Java environment, it is natural to consider Agent technology to intervene in process behavior. However, conventional bytecode-instrumentation Agents cannot be enabled at an arbitrary moment, because they depend on specific execution paths of the application code. The agent code runs only when execution reaches an instrumented tracking point, where it can obtain the object from the context and call the relevant method through reflection. Registry-related tracking points mostly lie in the early stages of program startup, where registry initialization and service registration occur. Once the program is serving requests, it rarely initiates registry operations, so it is hard to find a tracking point from which to obtain the desired context. As a result, when the Agent is attached dynamically to isolate application traffic, the agent code does not take effect, because no tracking point on the execution path can reach the registry context.

Therefore, we need an out-of-the-box Agent tool that can actively obtain objects and trigger object methods. Here, we introduce the JVMTI technology. JVMTI (JVM Tool Interface) is a native programming interface provided by the virtual machine that allows developers to create Agents to probe the internal operating state of the JVM and even control the execution of JVM applications. JVMTI can retrieve specific class and object information from the Java heap and trigger methods through reflection to fulfill our needs.

Since JVMTI is a set of JVM native programming interfaces, an agent that uses it needs to be written in C/C++. The compiled product is a dynamic link library (.so or .dll file). The agent interacts with Java code through JNI (Java Native Interface), and it is dynamically attached to the target JVM using the Attach API.
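As a rough illustration of the attach step, the sketch below loads a native agent library into a running JVM with the JDK's Attach API (`com.sun.tools.attach.VirtualMachine`). The class name `AttachSketch`, the option string format, and the example service name are assumptions made for this example; a real agent defines its own options, and the library path and process ID must be supplied by the operator.

```java
import com.sun.tools.attach.VirtualMachine;

class AttachSketch {
    // Hypothetical option format; the native agent would have to parse it itself.
    static String agentOptions(String action, String service) {
        return "action=" + action + ",service=" + service;
    }

    // Attaches to the JVM with the given process ID and loads a native agent library.
    static void loadNativeAgent(String pid, String agentLibPath, String options) throws Exception {
        VirtualMachine vm = VirtualMachine.attach(pid);
        try {
            // The target JVM loads the library and calls its Agent_OnAttach function.
            vm.loadAgentPath(agentLibPath, options);
        } finally {
            vm.detach();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: AttachSketch <pid> <absolute path to agent library>");
            return;
        }
        // "com.example.DemoService" is a made-up service name.
        loadNativeAgent(args[0], args[1], agentOptions("unregister", "com.example.DemoService"));
    }
}
```

The attaching process needs a JDK (the `jdk.attach` module) and normally has to run as the same user as the target JVM.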

Thanks to the power of the JVMTI Agent, we can conveniently implement control logic within Java applications. To isolate Dubbo service traffic, the first step is to obtain the static attribute REGISTRIES of the AbstractRegistryFactory class, which contains the list of registered services for the application and their corresponding registry instances. For a specific microservice, simply calling the register/unregister method of its registry allows for dynamic removal and restoration of the service. This solution operates directly at a higher level of abstraction without relying on a specific registry implementation class, making it compatible with all registries.
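The Java side of this step, reading a private static map and invoking a method on each registry, can be sketched with plain reflection. The example is an approximation under several assumptions: `FakeRegistryFactory` and `FakeRegistry` are stand-ins for AbstractRegistryFactory and a Registry implementation, the map key and service URL are made up, and URLs are strings rather than Dubbo's URL type. In the tool described above, a JVMTI agent drives the equivalent calls from native code.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class DubboIsolationSketch {
    // Stand-in for a registry instance; "remote" plays the registry server.
    static class FakeRegistry {
        final Set<String> remote = ConcurrentHashMap.newKeySet();
        public void register(String url) { remote.add(url); }
        public void unregister(String url) { remote.remove(url); }
    }

    // Stand-in for AbstractRegistryFactory with a private static REGISTRIES map.
    static class FakeRegistryFactory {
        private static final Map<String, FakeRegistry> REGISTRIES = new ConcurrentHashMap<>();

        static FakeRegistry getRegistry(String key) {
            return REGISTRIES.computeIfAbsent(key, k -> new FakeRegistry());
        }
    }

    // Reads the static REGISTRIES field and calls the named method on every registry.
    // Returns the number of registries that were invoked.
    static int invokeOnAllRegistries(Class<?> factoryClass, String methodName, String url) {
        try {
            Field field = factoryClass.getDeclaredField("REGISTRIES");
            field.setAccessible(true);
            Map<?, ?> registries = (Map<?, ?>) field.get(null); // static field: no instance
            int count = 0;
            for (Object registry : registries.values()) {
                Method method = registry.getClass().getMethod(methodName, String.class);
                method.setAccessible(true);
                method.invoke(registry, url);
                count++;
            }
            return count;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflection failed", e);
        }
    }

    public static void main(String[] args) {
        String url = "dubbo://10.0.0.1:20880/com.example.DemoService"; // made-up URL
        FakeRegistry registry = FakeRegistryFactory.getRegistry("nacos://127.0.0.1:8848");
        registry.register(url);
        invokeOnAllRegistries(FakeRegistryFactory.class, "unregister", url); // isolate
        System.out.println("isolated: " + registry.remote);
        invokeOnAllRegistries(FakeRegistryFactory.class, "register", url);   // restore
        System.out.println("restored: " + registry.remote);
    }
}
```

The helper looks methods up by name on whatever object it finds in the map, which mirrors the point that the approach does not depend on a specific registry implementation class.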

Spring Cloud

The method of isolating service traffic in Spring Cloud is similar to Dubbo. Once you understand the service registration principle of Spring Cloud, you can obtain the path of the service register/deregister method and then intervene in the application's service register/deregister behavior using JVMTI.

The service registration principle in Spring Cloud is relatively straightforward. When the Spring container starts, AbstractAutoServiceRegistration listens for the startup event and calls the register method of the ServiceRegistry to register the Registration (service instance data) with the registry. For example, the Nacos service registration class, NacosServiceRegistry, implements the ServiceRegistry interface and provides its own register/deregister methods to register and deregister services in the registry.

// The service registration class
public abstract class AbstractAutoServiceRegistration<R extends Registration>...{
    // Registry instance
    private final ServiceRegistry<R> serviceRegistry;

    // Register the service
    protected void register() {
        this.serviceRegistry.register(getRegistration());
    }

    // Deregister the service
    protected void deregister() {
        this.serviceRegistry.deregister(getRegistration());
    }
}

When processing Spring Cloud service traffic isolation, first obtain the AbstractAutoServiceRegistration service registration instance, and then call the register/deregister method to complete the service deregistration and re-registration on the registration center. This approach also does not depend on the specific implementation class of a particular registry, so it is compatible with all registries.
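Because register and deregister are protected, code outside the class hierarchy has to reach them reflectively once it holds the instance. The sketch below illustrates that one step with stand-in types: `FakeAutoServiceRegistration`, `FakeServiceRegistry`, and `DemoRegistration` are hypothetical simplifications of the Spring Cloud classes, and the registration is reduced to a string. How the instance is obtained (through JVMTI in the tool described here) is outside the sketch.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashSet;
import java.util.Set;

class SpringCloudIsolationSketch {
    // Stand-in for ServiceRegistry<R>, with the registration reduced to a string.
    interface FakeServiceRegistry {
        void register(String instance);
        void deregister(String instance);
    }

    // Stand-in registry implementation; "instances" plays the registry server.
    static class InMemoryServiceRegistry implements FakeServiceRegistry {
        final Set<String> instances = new LinkedHashSet<>();
        @Override public void register(String instance) { instances.add(instance); }
        @Override public void deregister(String instance) { instances.remove(instance); }
    }

    // Stand-in for AbstractAutoServiceRegistration: both methods are protected.
    static abstract class FakeAutoServiceRegistration {
        private final FakeServiceRegistry serviceRegistry;

        FakeAutoServiceRegistration(FakeServiceRegistry serviceRegistry) {
            this.serviceRegistry = serviceRegistry;
        }

        protected abstract String getRegistration();
        protected void register() { serviceRegistry.register(getRegistration()); }
        protected void deregister() { serviceRegistry.deregister(getRegistration()); }
    }

    static class DemoRegistration extends FakeAutoServiceRegistration {
        DemoRegistration(FakeServiceRegistry registry) { super(registry); }
        @Override protected String getRegistration() { return "demo-service@10.0.0.1:8080"; }
    }

    // Walks up the class hierarchy to find a no-argument method and invokes it.
    static void invokeNoArg(Object target, String methodName) {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Method method = c.getDeclaredMethod(methodName);
                method.setAccessible(true); // the method is protected
                method.invoke(target);
                return;
            } catch (NoSuchMethodException e) {
                // not declared on this class; keep searching the superclasses
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("invocation failed", e);
            }
        }
        throw new IllegalStateException("no method named " + methodName);
    }

    public static void main(String[] args) {
        InMemoryServiceRegistry registry = new InMemoryServiceRegistry();
        DemoRegistration registration = new DemoRegistration(registry);
        invokeNoArg(registration, "register");
        invokeNoArg(registration, "deregister"); // isolate
        System.out.println("isolated: " + registry.instances);
        invokeNoArg(registration, "register");   // restore
        System.out.println("restored: " + registry.instances);
    }
}
```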

Event Traffic Isolation

Application nodes and middleware typically communicate in client-server mode. For example, RocketMQ and SchedulerX utilize Netty as the underlying network framework to facilitate communication between the client and the server. In this context, we will use RocketMQ as an example to demonstrate how to implement similar traffic isolation for event-driven middleware.

In the Apache RocketMQ client, network communication is implemented mainly by the NettyRemotingClient class. As depicted in the diagram below, NettyRemotingClient stores the Channels used for data transfer in the channelTables property, and lockChannelTables is the lock that guards updates to channelTables. Several invoke methods handle the communication process.

The following figure shows the communication process. The client first tries to get the Channel used for communication from channelTables. If no Channel is available, it reconnects to the server to create one. To keep threads synchronized, the client must acquire the lockChannelTables lock before adding the new Channel to channelTables. If the lock cannot be acquired within the specified time window, a connection exception is thrown.

Based on the above principle analysis, we can prevent the establishment of the Channel by occupying the lockChannelTables lock, and then close the existing Channels. In this case, the client cannot establish a communication connection with the server until the lockChannelTables is released. To resume traffic, you only need to release the lockChannelTables lock. The client will reestablish the channel and resume communication. Because this control is performed at the network client layer, it is not affected by the application message model, and applies to both synchronous and asynchronous messages. It is also independent of the client role, and applies to both consumers and producers.
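The lock-occupation idea can be sketched with a `ReentrantLock` and a stand-in client. This is a simplified model, not the RocketMQ source: `FakeRemotingClient` only imitates the channelTables/lockChannelTables pair, channels are strings, the 100 ms timeout is an arbitrary value chosen for the example, and a failed send returns false where a real client would raise a connection exception. One detail worth noting: a `ReentrantLock` must be released by the thread that acquired it, so isolation and restoration have to run on the same thread, and the client calls must come from other threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class ChannelLockIsolationSketch {
    // Stand-in for NettyRemotingClient: a channel table guarded by a lock.
    static class FakeRemotingClient {
        static final long LOCK_TIMEOUT_MILLIS = 100; // arbitrary value for the sketch
        final Map<String, String> channelTables = new ConcurrentHashMap<>();
        final ReentrantLock lockChannelTables = new ReentrantLock();

        // Returns true if a channel exists or could be created within the timeout.
        boolean send(String addr) {
            if (channelTables.containsKey(addr)) {
                return true;
            }
            try {
                if (!lockChannelTables.tryLock(LOCK_TIMEOUT_MILLIS, TimeUnit.MILLISECONDS)) {
                    return false; // a real client would raise a connection exception here
                }
                try {
                    channelTables.put(addr, "channel-to-" + addr);
                    return true;
                } finally {
                    lockChannelTables.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Occupies the lock, then drops existing channels so nothing can reconnect.
    static void isolate(FakeRemotingClient client) {
        client.lockChannelTables.lock();
        client.channelTables.clear();
    }

    // Must be called from the same thread that called isolate.
    static void restore(FakeRemotingClient client) {
        client.lockChannelTables.unlock();
    }

    // Runs send on a separate thread, as application threads would.
    static boolean sendFromWorker(FakeRemotingClient client, String addr) {
        boolean[] result = new boolean[1];
        Thread worker = new Thread(() -> result[0] = client.send(addr));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result[0];
    }

    public static void main(String[] args) {
        FakeRemotingClient client = new FakeRemotingClient();
        System.out.println("before isolation: " + sendFromWorker(client, "broker-a"));
        isolate(client);
        System.out.println("while isolated: " + sendFromWorker(client, "broker-a"));
        restore(client);
        System.out.println("after restore: " + sendFromWorker(client, "broker-a"));
    }
}
```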

Summary

The traffic isolation tool is available for trial use in the EDAS-cloud-native toolbox. If you are interested in traffic isolation and more cloud-native tools, please leave a comment.
