
An Exploration and Improvement of Dubbo in Proxyless Mesh Mode

This article discusses the advantages, deficiencies, and broad market prospects of Dubbo and Proxyless Service Mesh.

1. Background

With the advent of Docker and Kubernetes, a large monolithic application can be split into multiple independently deployed microservices, each packaged and run in its own container, with different services communicating to implement complete features. The benefits of the microservices model and containerized deployment are clear: the model reduces the coupling between services, facilitates development and maintenance, and makes more efficient use of computing resources. However, the microservices model also has disadvantages:

  • Strong Dependence on SDK: The business module is tightly coupled with the governance module. Beyond declaring the related dependencies, SDK code or configuration must be embedded in the business code.
  • Difficulty in Unified Governance: Each time the framework is upgraded, the SDK must be modified and regression testing performed again to confirm that features work properly before redeploying on every machine. Different services reference SDKs of different versions with different capabilities, which makes unified governance difficult.
  • Lack of a Unified Solution: Currently, no single microservice governance solution covers every feature flawlessly. In actual production environments, multiple governance components often have to be introduced to implement features such as canary release and fault injection.

Service Mesh was created to solve these pain points. Take the classic sidecar mode as an example: Service Mesh injects a sidecar container into each business pod to proxy, govern, and control its traffic. The governance capability of the framework thereby moves into the sidecar container and is decoupled from the business system, making unified traffic control and monitoring across multiple languages and protocols easy to achieve. By stripping the SDK's capabilities out into an independent process, Service Mesh removes the strong dependence on the SDK, lets developers focus on the business, and pushes the basic framework capabilities down into the infrastructure, as shown in the following figure (from Dubbo's official website):

[Figure 1: Classic sidecar mesh deployment architecture (from Dubbo's official website)]
The classic sidecar mesh deployment architecture has many advantages, such as reduced SDK coupling and low intrusion into business code. However, the additional proxy layer brings the following problems:

  • Sidecar proxy degrades performance, especially when the network structure is complex, which causes problems for services that require high performance.
  • The architecture is more complex, which has higher requirements for O&M personnel.
  • The deployment environment is required to support the running of the sidecar proxy.

Proxyless Service Mesh was created to solve these pain points. Traditional Service Mesh intercepts all business network traffic through the proxy. The proxy detects the configuration resources issued by the control plane to control the direction of network traffic as required. Take Istio as an example. The Proxyless mode means the application communicates directly with the Istiod process responsible for the control plane. The Istiod process monitors and obtains Kubernetes resources (such as Service and Endpoint) and distributes these resources to different RPC frameworks through the xDS protocol. Then, the RPC framework forwards requests, enabling capabilities (such as service discovery and service governance).

The Dubbo community is the first community in China to explore Proxyless Service Mesh. Compared with the sidecar mode, the Proxyless mode has a lower cost and is a better choice for small and medium-sized enterprises. Dubbo 3.1 supports the Proxyless mode by parsing the xDS protocol. xDS is the generic name for a family of discovery services: through xDS APIs, applications can dynamically obtain Listener, Route, Cluster, Endpoint, and Secret configurations.

[Figure 2: Proxyless mode — the application communicates directly with Istiod via xDS]

Based on the Proxyless mode, Dubbo can directly establish communication with the control plane to implement unified control over traffic control, service governance, observability, and security. This avoids performance loss and deployment architecture complexity caused by the sidecar mode.

2. A Detailed Explanation of Dubbo xDS Push Mechanism

@startuml

' ====== Adjust style ===============
' Example of a single state definition: state uncommitted #70CFF5 ##Black
' hide footbox hides the boxes at the bottom of the sequence diagram.
' autoactivate on enables automatic activation.
skinparam sequence {
ArrowColor black

LifeLineBorderColor black
LifeLineBackgroundColor #70CFF5

ParticipantBorderColor #black
ParticipantBackgroundColor  #70CFF5
}
' ====== Define process ===============

activate ControlPlane
activate DubboRegistry
autonumber 1


ControlPlane <-> DubboRegistry : config pull and push
activate XdsServiceDiscoveryFactory
activate XdsServiceDiscovery
activate PilotExchanger
       
DubboRegistry -> XdsServiceDiscoveryFactory : request
XdsServiceDiscoveryFactory --> DubboRegistry: get registry configuration

XdsServiceDiscoveryFactory -> XdsChannel: return the list information (if the data has not been imported, it is not visible).
XdsServiceDiscoveryFactory-> XdsServiceDiscovery: init Xds service discovery
XdsServiceDiscovery-> PilotExchanger: init PilotExchanger

alt PilotExchanger
  PilotExchanger -> XdsChannel: init XdsChannel
  XdsChannel --> PilotExchanger: return
  PilotExchanger -> PilotExchanger: get cert pair
  PilotExchanger -> PilotExchanger: init ldsProtocol
  PilotExchanger -> PilotExchanger: init rdsProtocol
  PilotExchanger -> PilotExchanger: init edsProtocol
end

alt PilotExchanger
  XdsServiceDiscovery --> XdsServiceDiscovery: parse xDS protocol
  XdsServiceDiscovery --> XdsServiceDiscovery: init node info based on Eds
  XdsServiceDiscovery --> XdsServiceDiscovery: write the SLB and routing rules of Rds and Cds into the running information of the node.
  XdsServiceDiscovery --> XdsServiceDiscovery: send back to the service introspection framework to build the invoker.
end

deactivate ControlPlane
deactivate XdsServiceDiscovery
deactivate XdsServiceDiscoveryFactory

@enduml

[Figure 3: Interaction sequence between the Istio control plane and Dubbo]

On the whole, the interaction sequence between the Istio control plane and Dubbo is shown above. The main logic of xDS processing in Dubbo lives in PilotExchanger and in the concrete implementations of the corresponding xDS protocols (LDS, RDS, CDS, and EDS). PilotExchanger is responsible for stringing the process together, with three main responsibilities:

  • Obtain the certificate pair
  • Invoke getResource() of the different protocols to obtain resources
  • Invoke observeResource() of the different protocols to listen for resource changes

Take LDS and RDS as an example. PilotExchanger invokes the getResource() method of LDS to establish a communication connection with Istio, sends data, and parses the response from Istio. The parsed resource is then used as the input parameter with which RDS invokes its own getResource() method and sends data to Istio. When LDS changes, its observeResource() method triggers changes in both itself and RDS. The same relationship holds between RDS and EDS. The existing interactions are listed below; this process corresponds to the red line in the diagram.
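This chaining can be sketched roughly as follows. LdsProtocol, RdsProtocol, and exchange() here are illustrative stand-ins, not Dubbo's actual API: the point is only that the parsed LDS result becomes the RDS request parameter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch of the LDS -> RDS chaining described above.
public class PilotExchangerSketch {

    // Stand-in for the LDS client: its parsed response yields route-config names.
    static class LdsProtocol {
        Set<String> getResource() {
            return new HashSet<>(Arrays.asList("routeA", "routeB"));
        }
    }

    // Stand-in for the RDS client: it takes the names parsed from LDS as input.
    static class RdsProtocol {
        Map<String, String> getResource(Set<String> routeNames) {
            Map<String, String> routes = new HashMap<>();
            for (String name : routeNames) {
                routes.put(name, "cluster-of-" + name);
            }
            return routes;
        }
    }

    // The LDS response is parsed first, and the result becomes the RDS request.
    static Map<String, String> exchange() {
        Set<String> routeNames = new LdsProtocol().getResource();
        return new RdsProtocol().getResource(routeNames);
    }

    public static void main(String[] args) {
        System.out.println(exchange());
    }
}
```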

[Figure 4: Existing interaction flow (red line: initial resource fetch; blue line: scheduled polling)]

After successfully obtaining resources for the first time, each DS continuously sends requests to Istio through scheduled tasks, parses the response results, and maintains interaction with Istio, thus realizing traffic control, service governance, and observability control by the control plane. This process corresponds to the blue line in the preceding diagram.

3. Deficiencies in the Current Dubbo Proxyless Implementation

The Dubbo Proxyless mode has been validated and proven to be reliable. However, the existing implementation has the following problems:

  • Currently, Dubbo interacts with Istio in polling mode. As getResource and observeResource use two different streams, a connection must be re-established for each new request. Yet the streams we create are bidirectional: once established, Istio actively pushes resource changes over them, so LDS, RDS, and EDS each only need to maintain a single stream.
  • After the stream mode is changed to a persistent connection, a local cache pool is required to store existing resources. When Istio pushes resource changes, the data in the cache pool must be refreshed.
  • Previously, observeResource() polled Istio using scheduled tasks. The new logic no longer polls regularly; instead, the resources to be monitored are added to the cache pool, and Istio pushes the data automatically. In addition, the data pushed by Istio is split per app to achieve multi-point monitoring. Dubbo can reuse this logic to support other DS modes later.
  • Currently, Dubbo applications hosted by Istio show exceptions once Istio disconnects. After disconnection, the applications cannot reconnect and can only be redeployed, which increases O&M and management complexity. We need a reconnect-on-disconnect feature so that applications reconnect without redeployment once Istio recovers.

The transformed interaction logic is shown below:

[Figure 5: Transformed interaction logic]

4. Implementation Scheme of xDS Monitoring Mode

4.1 Resource Cache Pool

Currently, Dubbo's resources include LDS, RDS, and EDS. For the same process, the full set of resources monitored across these three types corresponds to the resource-listener list that Istio caches for it. Therefore, we design a local resource cache pool for each of the three resources. When Dubbo uses a resource, it first queries the cache pool and returns directly on a hit. Otherwise, Dubbo aggregates the resource names in the local cache pool with the resources to be sent and sends the aggregated result to Istio to update its resource-listener list. The cache pool is shown below, where the key represents a single resource and T is the return type of the respective DS:

protected Map<String, T> resourcesMap = new ConcurrentHashMap<>();
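A minimal sketch of this cache-first lookup might look as follows; sendToIstio() is a hypothetical stand-in for the real xDS request, and strings stand in for the generic resource type:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the cache-first lookup: hit -> return; miss -> aggregate and request.
public class ResourceCacheSketch {

    static final Map<String, String> resourcesMap = new ConcurrentHashMap<>();

    // Simulated Istio call: returns a value for every requested resource name.
    static Map<String, String> sendToIstio(Set<String> names) {
        Map<String, String> result = new HashMap<>();
        for (String n : names) {
            result.put(n, "resource-" + n);
        }
        return result;
    }

    static String getResource(String name) {
        // 1. Cache hit: return directly without touching Istio.
        String cached = resourcesMap.get(name);
        if (cached != null) {
            return cached;
        }
        // 2. Cache miss: aggregate the cached resource names with the new one,
        //    so the request also refreshes Istio's listener list for this stream.
        Set<String> aggregated = new HashSet<>(resourcesMap.keySet());
        aggregated.add(name);
        resourcesMap.putAll(sendToIstio(aggregated));
        return resourcesMap.get(name);
    }

    public static void main(String[] args) {
        System.out.println(getResource("app1")); // miss: goes to "Istio"
        System.out.println(getResource("app1")); // hit: served from the cache pool
    }
}
```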

After the cache pool is built, a structure or container that monitors the cache pool is required. Here, we design it in the form of Map:

protected Map<Set<String>, List<Consumer<Map<String, T>>>> consumerObserveMap = new ConcurrentHashMap<>();

The key is the set of resources to be monitored, and the value is a List so that repeated subscriptions are supported. The items stored in the List are of the Consumer type introduced in JDK 8, which can be used to pass a function or behavior. Its input parameter is a Map whose key corresponds to a single monitored resource, so the resource can easily be retrieved from the cache pool. As mentioned above, PilotExchanger connects the entire process, and the update relationships between different DSs can be transmitted through these consumers. The following code shows how to monitor LDS via observeResource:

// Listener registration interface.
void observeResource(Set<String> resourceNames, Consumer<Map<String, T>> consumer, boolean isReConnect);

// Observe LDS updated
ldsProtocol.observeResource(ldsResourcesName, (newListener) -> {
    // LDS data is inconsistent.
    if (!newListener.equals(listenerResult)) {
        //Update LDS data.
        this.listenerResult = newListener;
        // Trigger an RDS listener.
        if (isRdsObserve.get()) {
            createRouteObserve();
        }
    }
}, false);

After the stream mode is changed to a persistent connection, we also need to store the consumer's behavior in the local cache pool. After receiving a request from Dubbo, Istio refreshes its cached resource list and returns a response. The response returned by Istio is an aggregated result; after Dubbo receives it, it splits the response into finer-grained resources and pushes them to the corresponding Dubbo applications to notify them of the change.

Pitfalls

  • The data pushed by Istio may be an empty string. In this case, the cache pool does not need to store the data. Otherwise, Dubbo will bypass the cache pool and continuously send requests to Istio.
  • Consider the following scenario: a Dubbo application subscribes to two interfaces at the same time, which are provided by app 1 and app 2. When sending data to Istio, it is necessary to aggregate all the resource names to be monitored and initiate monitoring at one time to avoid mutual overwriting between listeners.
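The second pitfall can be illustrated with a small sketch. Here istioListenerList simulates the listener list Istio keeps for one stream (it is not a real API): each observe request replaces the previous list, so the resource names of app 1 and app 2 must be aggregated into a single request.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the listener-overwrite pitfall on a single xDS stream.
public class ObserveAggregationSketch {

    // Simulated Istio-side listener list for one stream: the last request wins.
    static Set<String> istioListenerList = new HashSet<>();

    static void observe(Set<String> resourceNames) {
        istioListenerList = new HashSet<>(resourceNames);
    }

    public static void main(String[] args) {
        // Wrong: the second call overwrites the first, so app1 is silently dropped.
        observe(Set.of("app1"));
        observe(Set.of("app2"));
        System.out.println(istioListenerList); // only app2 is still watched

        // Right: aggregate both apps' resources and observe them in one request.
        observe(Set.of("app1", "app2"));
        System.out.println(istioListenerList); // both apps are watched
    }
}
```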

4.2 Multi-Point Independent Monitoring

When Dubbo sends a request to Istio for the first time, the getResource() method is invoked to query the data in the cache pool. If the data is missing, Dubbo will aggregate resources before requesting data from Istio. Then, Istio will return the corresponding result to Dubbo. We have two implementation solutions for processing responses from Istio:

  1. In getResource(), create a new CompletableFuture. The cache layer checks whether an incoming response contains the required data and, once new data is confirmed, passes the result back through the future's callback.
  2. getResource() registers a listener in consumerObserveMap by defining a consumer that synchronizes the fetched data back to the original thread. When the cache layer receives a push from Istio, it pushes the received data to all listeners of the corresponding resources.

Both methods are feasible, but the biggest difference is whether the caller needs to be aware of getResource when invoking onNext to send data to Istio. In the end, solution 2 was selected. After Dubbo establishes a connection with Istio, Istio pushes the monitored resource list to Dubbo. Dubbo parses the response, splits the data by monitored app, refreshes the local cache pool, and sends an ACK back to Istio. The process is shown below:


[Figure 6: Processing flow of an Istio push — parse, split per app, refresh the cache pool, ACK]

Some of the key code is listed below:

public class ResponseObserver implements XXX {
        ...
        public void onNext(DiscoveryResponse value) {
            // Accept data from Istio and split it.
            Map<String, T> newResult = decodeDiscoveryResponse(value);
            // The local cache pool data.
            Map<String, T> oldResource = resourcesMap;
            // Refresh the cache pool data.
            discoveryResponseListener(oldResource, newResult);
            resourcesMap = newResult;
            // Send the ACK back to Istio.
            requestObserver.onNext(buildDiscoveryRequest(Collections.emptySet(), value));
        }
        ...
        public void discoveryResponseListener(Map<String, T> oldResult,
                                              Map<String, T> newResult) {
            ....
        }
}
// The concrete parsing is implemented separately by LDS, RDS, and EDS;
// the shared distribution logic looks like this.
protected Map<String, T> decodeDiscoveryResponse(DiscoveryResponse response) {
    // Compare the new data with the resources in the cache pool and pick out
    // the resources that differ between the two.
    ...
    for (Map.Entry<Set<String>, List<Consumer<Map<String, T>>>> entry : consumerObserveMap.entrySet()) {
        // Skip this entry if it is not present in the local cache pool.
        ...
        // Aggregate the resources this listener is watching.
        Map<String, T> dsResultMap = entry.getKey()
            .stream()
            .collect(Collectors.toMap(k -> k, v -> newResult.get(v)));
        // Notify the consumers listening on these resources.
        entry.getValue().forEach(o -> o.accept(dsResultMap));
    }
}

Pitfalls

  • In the original multi-stream scenario, streams were multiplexed with incrementing request IDs. After switching to a persistent connection, one resource may carry multiple request IDs, and these IDs may overwrite each other, so this mechanism had to be removed.
  • The initial implementation plan did not split resources. To support other DS types later, the data returned by Istio is split per app, which explains the somewhat unusual shape of consumerObserveMap.
  • The three DS types can share one channel for sending data, but monitoring must use that same channel; otherwise, Istio will not push data when it changes.
  • After the bidirectional stream is established, the initial solution shared one future globally. However, consider two time-adjacent onNext events of the same DS, event A and event B: A is sent first and B follows, yet B's result may return first, and the timing of Istio pushes is uncertain. Therefore, the future must be a local variable rather than globally shared.

4.3 Use Read-Write Lock to Avoid Concurrency Conflict

Concurrency conflicts may occur on both the listener map consumerObserveMap and the cache pool resourcesMap. For resourcesMap, since put operations are concentrated in the getResource() method, a pessimistic lock on the corresponding resource is enough to avoid concurrent monitoring of the same resource.

consumerObserveMap involves put, remove, and traversal operations. Given their timing, a read-write lock avoids conflicts: traversal takes the read lock, while put and remove take the write lock. In summary, resourcesMap is protected by a pessimistic lock, and consumerObserveMap is involved in the following scenarios:

  • When sending a request to Istio remotely, data is added to consumerObserveMap: write lock.
  • When a CompletableFuture returns data across threads and monitoring of the future stops: write lock.
  • When monitoring the cache pool and monitoring data is added to consumerObserveMap: write lock.
  • When the connection is restored and monitoring data is re-added to consumerObserveMap: write lock.
  • When Dubbo parses the data returned by Istio, traverses the cache pool, and refreshes the data: read lock.
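The scenarios above can be sketched as follows, assuming illustrative method names (addListener for the write-lock paths, notifyListeners for the read-lock traversal); this is a sketch of the locking discipline, not Dubbo's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

// Guarding consumerObserveMap: writes (put/remove) take the write lock,
// traversal during an Istio push takes the read lock.
public class ObserveMapLockSketch<T> {

    private final Map<Set<String>, List<Consumer<Map<String, T>>>> consumerObserveMap =
            new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // put path (new subscription, reconnect, future completion): write lock.
    public void addListener(Set<String> resourceNames, Consumer<Map<String, T>> consumer) {
        lock.writeLock().lock();
        try {
            consumerObserveMap
                    .computeIfAbsent(resourceNames, k -> new ArrayList<>())
                    .add(consumer);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // traverse path (pushing a parsed Istio response to listeners): read lock.
    public void notifyListeners(Map<String, T> newResult) {
        lock.readLock().lock();
        try {
            for (List<Consumer<Map<String, T>>> consumers : consumerObserveMap.values()) {
                consumers.forEach(c -> c.accept(newResult));
            }
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Multiple push threads can traverse concurrently under the read lock, while a subscription or reconnect briefly excludes them with the write lock.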

Pitfalls

  • Dubbo and Istio communicate over a bidirectional stream. For two time-adjacent onNext events of the same DS, event A may be sent before event B while B's result returns first, and the timing of Istio pushes is uncertain. Therefore, a lock is required.

4.4 Reconnection

If disconnection occurs, we only need a scheduled task that regularly contacts Istio and tries to obtain a certificate. If the certificate is obtained, Istio is considered to have recovered: Dubbo aggregates its local resources, requests the data from Istio again, parses the response, refreshes the local cache pool, and disables the scheduled task.
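A minimal sketch of this reconnection loop follows; tryGetCertPair() and the cache-refresh callback are hypothetical stand-ins for the real calls, and a dedicated single-thread scheduler is used so that no shared global task pool ever has to be shut down:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Poll Istio for a certificate on a dedicated scheduler; once it is obtained,
// refresh local caches and stop retrying.
public class ReconnectSketch {

    public interface Istio {
        boolean tryGetCertPair();
    }

    private final ScheduledExecutorService retryPool =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "xds-reconnect");
                t.setDaemon(true); // do not keep the JVM alive just for retries
                return t;
            });

    // One polling attempt: true (after refreshing caches) once the cert is back.
    public boolean attemptReconnect(Istio istio, Runnable refreshLocalCache) {
        if (!istio.tryGetCertPair()) {
            return false; // Istio still down; keep the scheduled task running
        }
        refreshLocalCache.run(); // aggregate local resources and re-request them
        return true;
    }

    // Poll regularly until reconnected, then stop this dedicated scheduler.
    public void scheduleRetry(Istio istio, Runnable refreshLocalCache) {
        retryPool.scheduleAtFixedRate(() -> {
            if (attemptReconnect(istio, refreshLocalCache)) {
                retryPool.shutdown(); // safe: this pool is not shared
            }
        }, 0, 3, TimeUnit.SECONDS);
    }
}
```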

Pitfalls

  • The scheduled task pool is globally shared and must not be shut down, or other services may be affected; only the reconnection task itself should be cancelled.

5. Summary

In this feature transformation, the author stepped on many pitfalls and often struggled to locate the bugs. In addition to the pitfalls mentioned above, others include (but are not limited to):

  • Dubbo changed the way the Kubernetes certificate is obtained in one iteration, and authorization failed.
  • The original feature worked fine, but after merging the master code, the gRPC version was incompatible with the Envoy version and various errors were reported; the errors were finally resolved by downgrading the version.
  • The original feature worked fine, but after merging the master code, the latest branch sent the Triple protocol to MetadataService, while the Proxyless mode only supports the Dubbo protocol. After three or four days of debugging, it turned out that an extra configuration was needed.

I have to admit that Proxyless Service Mesh has advantages and broad market prospects. Since Dubbo 3.1.0 was released, Dubbo has implemented Proxyless Service Mesh capabilities. In the future, the Dubbo community will deeply connect with the business to solve more pain points in the actual production environment and improve service mesh capabilities.
