
Soul's Cloud-Native Gateway Best Practices

This article introduces Soul and discusses the cloud-native gateway challenges it faced and how they were solved.

1. Company Introduction

Soul is a virtual social networking platform for young people built around interest graphs and gamified interaction. Founded in 2016, Soul is committed to creating a social metaverse for young people, with the ultimate vision that no one is lonely anymore. On Soul, users can freely express themselves, get to know others, explore the world, exchange interests and opinions, gain spiritual resonance and a sense of identity, obtain information through communication, and build new, high-quality relationships.

2. Challenges


2.1 Long Multi-Layer Gateway Link

Soul began adopting container services in 2020. During the container transformation of its ECS workloads, a container ingress gateway (Ingress-Nginx), a microservice gateway, and SLB + Tengine at the unified access layer all appeared, resulting in a multi-gateway architecture. The long link brings problems of cost and RT: investigating a request exception requires a lot of manpower, and locating problems is costly.


2.2 Ingress-Nginx Open-Source Issue

In 2023, the Ingress-Nginx community received growing feedback about stability and security problems and temporarily stopped accepting new features, which is a huge hidden danger for Soul.

2.3 gRPC Forwarding Load Imbalance

  1. Some internal network services expose gRPC ingresses. gRPC is based on HTTP/2, which is designed around a single long-lived TCP connection over which all requests are multiplexed. This reduces the overhead of managing connections but introduces new problems in load balancing.
  2. Since we cannot balance at the connection level, we need to switch from connection-level balancing to request-level balancing for gRPC load balancing. In other words, we need to open an HTTP/2 connection to each destination and balance requests across these connections.
  3. This means we need Layer-7 load balancing. The Kubernetes Service core uses kube-proxy, which is Layer-4 load balancing, so it cannot meet our requirements.
  4. Currently, an independent Envoy + headless Service solution is used to solve the gRPC forwarding imbalance problem (a sketch of the headless Service follows this list). An SLB exposes the Envoy port for other services to call. However, the maintenance cost is high, and Envoy node resources are easily wasted.
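The following is a minimal sketch of the headless Service used in such an Envoy + headless workaround; the names, namespace, and ports are illustrative assumptions rather than Soul's actual configuration. Because the Service has no cluster IP, DNS returns the individual Pod IPs, which lets an L7 proxy such as Envoy open an HTTP/2 connection to every Pod and balance gRPC requests across them.

```yaml
# Minimal sketch only: names, namespace, and ports are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: demo-grpc-headless      # hypothetical service name
  namespace: demo               # hypothetical namespace
spec:
  clusterIP: None               # headless: DNS resolves to the individual Pod IPs
  selector:
    app: demo-grpc-server       # hypothetical Pod label
  ports:
    - name: grpc                # gRPC (HTTP/2) port exposed by the Pods
      port: 9090
      targetPort: 9090
```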


2.4 Ingress Stability and Limitations

  1. Due to business uncertainty, the number of connections to the Nginx ingress controller surges with fluctuations in business requests, causing health checks to fail. Upstream detection in the Nginx ingress controller requires time and accumulated failures, so a large number of user requests fail or are retried during this stage (as shown below):


  2. HTTP routing only supports host and path matching. There is no general configuration for advanced routing functions, which can only be implemented through annotations. For example, to rewrite a URL with the Nginx Ingress Controller, we must configure the nginx.ingress.kubernetes.io/rewrite-target annotation (see the sketch after this list), which can no longer meet the requirements of programmable routing.
  3. Services in different namespaces often need to be bound to the same gateway, but an ingress gateway cannot be shared across multiple namespaces. This makes it harder to split Ingress-Nginx and the Ingress Controller.
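As a concrete illustration of the annotation-based rewrite mentioned above, here is a minimal sketch of an Ingress that strips an /api prefix before forwarding; the host, paths, and service names are assumptions for illustration only.

```yaml
# Minimal sketch only: host, paths, and service names are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-rewrite
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    # Forward /api/users to the backend as /users
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /api(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: demo-svc
                port:
                  number: 80
```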

2.5 Business Release Jitter

  1. Although Kubernetes has a rolling-release mechanism and health probes (Liveness and Readiness), a service starts receiving requests the instant it is ready, so it is hit by instantaneous traffic and pressure across the link (a sketch of the commonly used probe and graceful-shutdown settings follows this list).
  2. A service release can be divided into multiple batches. However, looking at the release process as a whole, the service RT rises suddenly, and responses of parts of the service slow down in stages. The most direct user perception is lag (a single request is slow, or a request fails and is retried). Users may perceive service degradation or unavailability, which affects their experience.
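Kubernetes alone cannot fully remove this jitter, but probe and graceful-shutdown settings are the usual first line of defense. The following is a minimal sketch with assumed names, image, port, and timings; it only delays traffic until the application reports ready and drains in-flight requests before shutdown, and does not by itself provide the gradual warm-up discussed later.

```yaml
# Minimal sketch only: names, image, port, and timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: registry.example.com/demo-app:latest   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:               # receive traffic only after the app reports healthy
            httpGet:
              path: /healthz            # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          lifecycle:
            preStop:                    # give in-flight requests time to drain before shutdown
              exec:
                command: ["sh", "-c", "sleep 15"]
```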

3. Technology Selection

The open-source Ingress-Nginx ran into many problems, and the probabilistic timeout issue was difficult to locate and solve because of the huge volume of online traffic. So should we invest more R&D personnel to solve this problem ourselves, or choose the Envoy gateway, ASM, or the MSE cloud-native gateway? We made a comprehensive evaluation of these three new technologies.

Nginx vs. Envoy: Benefits and Drawbacks

Nginx Ingress Controller (default)
  1. Nginx Ingress is easy to connect to Kubernetes clusters, and documentation is abundant.
  2. Configuration loading depends on the native Nginx config reload; Lua is used to reduce the number of reloads.
  3. The plug-in capabilities and scalability of Nginx Ingress are poor.
  4. Unreasonable deployment architecture: the control-plane controller and the data-plane Nginx process run together in one container and compete for CPU. When Prometheus metric collection is enabled, OOM occurs under high load, the container is killed, and the health check times out. (The control-plane process is responsible for health checks and monitoring metric collection.)

MSE Cloud-Native Gateway / ASM Ingress Gateway (Envoy)
  1. Envoy is the third project to graduate from the Cloud Native Computing Foundation (CNCF) and is well suited as the data plane of a Kubernetes ingress controller.
  2. Written in C++, it implements many advanced proxy features, such as advanced load balancing, circuit breaking, throttling, fault injection, traffic replication, and observability.
  3. Dynamic hot updates through the xDS protocol; no reload is required.
  4. The control plane and data plane are separated.
  5. The managed components have a high SLA.

ASM Envoy (Istio Ingress Gateway)
  1. Requires installing a complete set of Istio components (at least an ingress gateway and VirtualService).
  2. Relatively complex.
  3. ASM does not support direct HTTP-to-Dubbo conversion but supports HTTP-to-gRPC conversion.
  4. Managed Dubbo services: https://help.aliyun.com/document_detail/253384.html

In summary, Envoy is a good choice for the data plane at this stage (it can solve the performance and stability problems of the existing Nginx ingress controller). Because of our high performance requirements, we prioritized performance stress testing.

3.1 Stress Testing Data

We mainly tested performance and gRPC load-balancing capability by comparing the stress-testing data of three different schemes for online services (SLB + Envoy + headless svc, ALB, and MSE). The data shows that the MSE cloud-native gateway has advantages in RT and success rate and can meet Soul's gRPC forwarding needs. But can MSE meet all of Soul's business requirements? Can it solve the timeout problem of the largest cluster? We conducted a more comprehensive evaluation of MSE.

Dimension: ACK Scheme / ALB Scheme / MSE Scheme

Scenario documentation
· ACK scheme: "Use an existing SLB instance to expose an application"; istio + envoy
· ALB scheme: "Ingress proxy gRPC"; based on the cloud network management platform
· MSE scheme: "Introduction to MSE Ingress"; istio + envoy

Traces
· ACK scheme: client → SLB → Envoy → headless svc → Pod
· ALB scheme: client → ALB ingress → gRPC svc → Pod
· MSE scheme: client → MSE ingress → gRPC svc → Pod

Conclusions
· ACK scheme. Benefits: response time under stress testing is low and error-free; services remain stable and traffic stays uniform during scaling. Drawbacks: you need to build and maintain an Envoy service and a headless Service yourself.
· ALB scheme. Benefits: only a Deployment plus ALB service binding is required. Drawbacks: response time is unstable, with error responses.
· MSE scheme. Benefits: only a Deployment plus MSE service binding is required; response time is further optimized from 6 ms to 2 ms with no errors. Drawbacks: MSE instances and SLB instances are billed separately.

3.2 Comprehensive Technical Evaluation

  • Evaluate the functionality, stability, performance, and security of the MSE cloud-native gateway to see whether it meets Soul's future requirements.
Comparison Items: MSE Cloud-Native Gateway vs. Self-Built Ingress-Nginx

O&M costs
· MSE cloud-native gateway: resources are fully managed and O&M-free; the ingress gateway and microservice gateway are integrated, saving about 50% of costs in container and microservice scenarios; Prometheus monitoring and log analysis capabilities are built in for free.
· Self-built Ingress-Nginx: in microservice scenarios, you must build a microservice gateway separately and pay for additional resources and products to implement metric monitoring and log analysis; manual O&M costs are high.

Ease of use
· MSE cloud-native gateway: you can perform operations by directly connecting to ACK backend services and MSE Nacos.
· Self-built Ingress-Nginx: operations without a connection to microservice registries are error-prone.

HTTP-to-Dubbo conversion
· MSE cloud-native gateway: supported.
· Self-built Ingress-Nginx: not supported; you must either expose both Dubbo and HTTP interfaces (high R&D cost), develop your own HTTP-to-Dubbo gateway (high R&D and O&M costs), or forward through another service.

Performance
· MSE cloud-native gateway: the performance is deeply optimized; TPS is about 80% higher than Ingress-Nginx and Spring Cloud Gateway.
· Self-built Ingress-Nginx: manual performance optimization is required.

Monitoring and alerting
· MSE cloud-native gateway: deeply integrated with Prometheus, Log Service, and Tracing Analysis; provides various dashboards and service-level monitoring data; supports custom alert rules and alert notifications via DingTalk messages, phone calls, and text messages.
· Self-built Ingress-Nginx: not supported; you must build your own monitoring and alerting system.

Gateway security
· MSE cloud-native gateway: multiple authentication methods are supported, including blacklists and whitelists, JWT, OIDC, and IDaaS; custom authentication is also supported.
· Self-built Ingress-Nginx: you must perform complex security and authorization configuration yourself.

Routing
· MSE cloud-native gateway: supports hot updates of rules and provides precheck and canary release of routing rules.
· Self-built Ingress-Nginx: rule updates disrupt long-lived connections, routing rule prechecks take a long time, and rules take effect slowly.

Service governance
· MSE cloud-native gateway: supports lossless online/offline of backend applications and fine-grained governance features (such as service-level timeouts and retries).
· Self-built Ingress-Nginx: not supported.
  • Soul's business scenarios are complex. We evaluated the MSE cloud-native gateway, which combines the traffic gateway, microservice gateway, and security gateway and integrates with 10+ cloud products out of the box, and it meets the business requirements.
  • Soul has high requirements for stability, and any jitter affects a large number of users. Considering that the MSE cloud-native gateway has been verified by Taobao's Double 11 and polished over a long period, it gives us confidence to use it in production.
  • Because Soul's traffic volume is large and the gateway machine scale is large, cost is a key consideration. The stress test shows that the MSE cloud-native gateway uses a software-hardware integrated solution and delivers about twice the performance of self-built gateways.


  • There are a large number of Dubbo services in the Soul backend. Currently, an in-house service gateway is used to convert HTTP to the Dubbo protocol. Since the MSE cloud-native gateway supports HTTP-to-Dubbo protocol conversion and can link Dubbo services directly, this is conducive to future architecture convergence.


3.3 Migration Solution

  • MSE is compatible with the Ingress standard. Therefore, after you create a cloud-native gateway instance and have it watch the existing Ingress resources, the backend routing and forwarding rules can be migrated directly.
  • MSE and Ingress-Nginx can coexist. Therefore, you only need to gradually switch traffic from the upstream Ingress-Nginx to the MSE cloud-native gateway, and you can canary by domain name to reduce the risk of changes.
  • In the Soul scenario, after traffic is switched to MSE, Ingress-Nginx is not taken completely offline. Two nodes are kept, and an HPA configuration is added for emergencies.
  • For gRPC forwarding, MSE replaces the original independent Envoy. Business services only need to modify the exposed protocol and port in their Service definitions (see the sketch after this list), and services can be migrated one by one.
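For the last point, a minimal sketch of what such a Service change might look like follows; the names and port are assumptions. Marking the port as gRPC (via the port name or appProtocol) is a common way to tell an L7 gateway to forward backend traffic over HTTP/2; once the gateway balances at the request level, a plain ClusterIP Service is sufficient and no separate headless Service or self-managed Envoy is needed.

```yaml
# Minimal sketch only: names and port are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: demo-grpc-svc
spec:
  selector:
    app: demo-grpc-server
  ports:
    - name: grpc            # the "grpc" port name marks the backend protocol for L7 gateways
      appProtocol: grpc     # explicit application protocol (Kubernetes 1.20+)
      port: 9090
      targetPort: 9090
```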

3.4 Technical Solutions

3.4.1 Short-Term Solution

Soul's gateway link is relatively long, and the most urgent problems to solve are the timeout issue and the service release warm-up issue. Therefore, in the first phase, Ingress-Nginx is replaced, and the container ingress gateway and microservice gateway are merged.


3.4.2 Final-State Solution

Shorten the gateway link as much as possible: remove the microservice gateway and hand the HTTP-to-RPC forwarding capability over to MSE, and remove Tengine and hand the ECS forwarding capability over to MSE. The final link is SLB → MSE → Pod/ECS.


4. Landing Effect

4.1 Stability and RT Comparison

After switching to MSE, request processing and response time is stable, with the peak value dropping from 500 ms to 50 ms.


4.2 Comparison of Error Codes Generated by Service Release

Comparing error codes between Ingress-Nginx and MSE: 502 errors dropped to 0 during service releases, and 499 errors dropped by 10% on average.


4.3 Warm-Up and Startup RT Issue

The rollout solved most of the timeout problems, but timeouts caused by the slow startup of Java programs during releases remained. Therefore, we enabled the service warm-up function: a service receives traffic gradually, so a newly started Java process is not pushed into timeouts by a burst of traffic.

Effect of enabling warm-up: as shown in the figure, a Pod does not receive full traffic immediately after starting but warms up gradually over five minutes. This can be seen in the number of HTTP ingress requests to the service, the Pod's inbound and outbound network traffic, and the Pod's CPU utilization. With Nginx, monitoring has to be pieced together from the bottom layer up; the cloud-native gateway provides a comprehensive observation view and rich gateway Prometheus metrics, making it easy to observe and troubleshoot complex problems.


5. Future Planning

  1. The cloud-native gateway combines the traffic, security, and microservice gateways to significantly reduce the number of request links and the architecture complexity.
  2. Reduce O&M and troubleshooting costs, reduce the RT of the entire link, and improve customer satisfaction.
  3. Enable HTTP/3 to improve network transmission efficiency and customer experience.
  4. Use service autonomy features (online packet capture, diagnosis, and inspection) to reduce troubleshooting costs.
  5. Use chaos engineering to identify stability risks in advance.

6. Landing Experience

6.1 Service-Weight and Canary-Weight

  • After the cluster was connected to MSE, Ingresses using the service-weight annotation were found to be unavailable.


  • In the ACK console, you can configure canary releases using one of the following methods:

Use the canary-* annotations to configure blue-green and canary releases. The canary-* annotations are the canary release method implemented by the community.

Use the service-* annotations to configure blue-green and canary releases. The service-* annotations are the early implementation in the ACK Nginx Ingress Controller.

  • Root Cause: the service-weight annotation is not a standard annotation recommended by the Nginx community. nginx.ingress.kubernetes.io/canary-weight is the standard Nginx way of routing by weight (as shown in the sketch below). service-weight was added by ACK, is still maintained rather than abandoned, and can be used normally on the ACK Nginx Ingress Controller, but it is not recognized after the import to MSE.
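For reference, a minimal sketch of the community-standard canary annotations follows; the host, weight, and service names are assumptions. A second Ingress marked as the canary routes a configured percentage of traffic to the canary Service.

```yaml
# Minimal sketch only: host, weight, and service names are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"        # mark this Ingress as the canary rule
    nginx.ingress.kubernetes.io/canary-weight: "20"   # send about 20% of traffic to the canary
spec:
  ingressClassName: nginx
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-svc-canary
                port:
                  number: 80
```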

6.2 The X-Real-IP Problem

  • After the service was switched to MSE, the user IP obtained by the service was found to be in the 100 network segment, which is Alibaba Cloud's intranet segment. The anti-abuse product and IP feature library rely on the X-Real-IP field set by Nginx, so inconsistent request header handling was suspected.
  • Ingress-Nginx sets and processes the X-Real-IP and XFF (X-Forwarded-For) request headers.


  • MSE Envoy passes these headers through without special processing.


  • Root Cause: MSE Envoy does not process the X-Real-IP and XFF headers.

6.3 MSE Ingress NAT Forwarding Mode and FullNAT Mode

  • After the xx service was switched to MSE, a single SLB (classic CLB) could no longer meet the traffic requirements due to the increased number of requests, so the MSE ingress was scaled from one SLB instance to four.


  • It was then found that the service request failure rate rose slightly, and the number of dial-test alerts also increased to varying degrees. The errors reported were all HTTPSConnectionPool(host='xxx.xx.cn', port=443): Read timed out. (read timeout=10).
  • According to the request ID, Nginx accepted and forwarded the request normally, and the corresponding request can be found in the backend MSE log with a processing time of 200-300 ms. However, Nginx never received the response packet and kept waiting, and the HTTP/1.1 persistent connection was not closed.


  • Packet capture analysis showed that Nginx actively closed the connection after sending 13 packets. Suspecting a bug in Nginx's connection recycling mechanism, we increased the keepalive free connection pool and reduced connection recycling. After this change, the number of errors decreased but did not disappear.
  • Continuing to observe the Tengine error log, we found only recv failures, and continued to capture and analyze packets: 2022/12/15 21:28:16 [error] 14971#0: *35284778395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 106.119.54.167.


  • The packet capture showed that the connection was reset. Investigation revealed that the reset occurred under high concurrency when the source IP, source port, destination IP, and destination port were identical, and the Linux kernel reset the connection on this four-tuple conflict.


  • Changing the NAT mode of the SLB instances connected to MSE to FullNAT solved the four-tuple conflict problem. The four-tuple conflict occurs not only in the MSE access scenario but also when multiple Layer-4 SLB instances are mounted to the same ECS; it had previously been ignored because the number of errors was relatively low.

6.4 Rewrite-Target Annotation: Differences Between Ingress-Nginx and Envoy

  • After the xx service was connected to MSE, 404 errors appeared, while access was normal when the upstream was the Nginx ingress controller.


  • Looking at the request log, we found that the request path was lost.


  • The test script is as follows:


  • The Ingress configuration file:


  • Root Cause: the Nginx ingress controller treats a rewrite target of / as equivalent to no rewrite, so the generated nginx.conf contains no rewrite rules.
  • In MSE, Envoy does generate a rewrite configuration for rewrite-target: /, so the path appended after the gateway domain name of the accessed service is lost. After the rewrite-target annotation was removed from the gateway Ingress, access returned to normal (see the sketch below). Envoy should also be made compatible with this rewrite and treat rewriting to / as a no-op by default.
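The problematic pattern looks roughly like the following minimal sketch (host, path, and service names are assumptions): Ingress-Nginx treats the annotation as a no-op, while Envoy generates a rewrite from it and the original request path is lost, so removing the annotation is the fix on MSE.

```yaml
# Minimal sketch only: host, path, and service names are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-rewrite-noop
  annotations:
    # No-op for Ingress-Nginx (rewriting to "/" generates no rewrite rule in nginx.conf),
    # but Envoy turns it into a rewrite that drops the request path.
    # Removing this annotation restores normal forwarding through MSE.
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-svc
                port:
                  number: 80
```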