
Kubernetes Gateway Selection: Nginx or Envoy?

This article compares the two open-source implementations from the three aspects of performance and cost, reliability, and security.

By Zhang Tianyi (Chengtan)

First, let's clarify some key definitions to avoid confusion:

  • Traditional Gateway: It is not containerized and does not run on Kubernetes. It consists of a traffic gateway and a service gateway. Traffic gateways (such as Tengine) provide global policy configurations unrelated to backend services, while service gateways provide policy configurations tightly coupled with backend services at the level of an independent business domain. As application architecture evolved from monolithic to distributed microservices, the service gateway gained a new name: microservice gateway.
  • Kubernetes Gateway: Also known as the cloud-native gateway (or next-generation gateway). Ingress has become the gateway standard of the Kubernetes ecosystem, prompting the merging of the traffic gateway and the service gateway. Ingress-based implementations fall mainly into two camps: Nginx and Envoy. The Nginx-based Ingress Controller is currently the choice of most Kubernetes clusters, while Envoy, a rising star, has the potential to catch up.
  • MSE Cloud-Native Gateway: This is a cloud service deeply optimized based on Envoy.
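Both camps consume the same standard Kubernetes resource; which controller serves a route is selected by its ingress class. As a rough sketch, a minimal Ingress looks like the following (the class names, host, and backend Service name are illustrative):

```yaml
# Minimal Ingress resource; the controller that serves it (Nginx-based
# or Envoy-based) is chosen via ingressClassName. Class names depend on
# how the controllers were installed.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress
spec:
  ingressClassName: nginx        # e.g. "mse" for an Envoy/MSE-based controller
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-svc   # assumed backend Service name
                port:
                  number: 80
```

Because the resource is standardized, switching camps is largely a matter of installing a different controller and pointing the ingress class at it.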

This article compares the two open-source implementations from the three aspects of performance and cost, reliability, and security, hoping to provide some reference for enterprises going through Kubernetes gateway selection.

Performance and Cost

The throughput of the MSE cloud-native gateway is almost double that of the Nginx Ingress Controller, especially when transmitting small payloads. The following figure shows the throughput comparison at a gateway CPU usage of 30%:


Gateway Specification: 16 Cores, 32G * 4 Nodes

ECS Model: ecs.c7.8xlarge

When the CPU workload increases, the throughput gap becomes bigger. The following figure shows the throughput comparison when the CPU usage rate reaches 70%:


The throughput drop of the Nginx Ingress Controller under high load is caused by pod restarts. See the Reliability analysis in the next section for details.

With growing attention to network security, HTTPS is now widely used across the Internet industry for encrypted data transmission. The TLS handshake that underpins HTTPS relies on asymmetric cryptography, which consumes most of the CPU resources on the gateway side. For this reason, the MSE cloud-native gateway uses CPU SIMD instructions to hardware-accelerate TLS encryption and decryption:


The pressure-test data in the preceding figure shows that with the TLS hardware acceleration solution applied, the TLS handshake latency for HTTPS requests is halved compared with the unaccelerated case, and the QPS limit increases by more than 80%.

Based on the data above, the MSE cloud-native gateway can match the throughput of the Nginx Ingress Controller with only half the resources, and HTTPS throughput improves further with hardware acceleration.


Reliability

As mentioned earlier, the Nginx Ingress Controller experiences a drop in throughput under high load due to pod restarts. There are two main reasons for these restarts:

  • LivenessProbe is prone to timeout failures under high load. The community reduced redundant probing in version 0.34, but the problem still exists.
  • When Prometheus is enabled to collect monitoring metrics, OOM occurs under high loads, causing the container to be killed. Please see this link for more information.
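A common mitigation, assuming the community ingress-nginx Helm chart's `controller.metrics.enabled` and `controller.resources` keys, is to disable metric collection when it is not needed, or at least give the controller container explicit memory headroom so the OOM kill is less likely. The numbers below are illustrative:

```yaml
# Illustrative ingress-nginx Helm values (key names assume the community
# ingress-nginx chart; tune the numbers for your actual workload).
controller:
  metrics:
    enabled: false        # skip Prometheus metric collection under high load
  resources:
    requests:
      cpu: "1"
      memory: 512Mi
    limits:
      memory: 1Gi         # explicit headroom instead of chart defaults
```

This only reduces the symptom; the underlying contention described below remains.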

These two problems are ultimately caused by the Nginx Ingress Controller's deployment architecture: the control plane (a controller implemented in Go) and the data plane (Nginx) run in the same container. Under high load, the data plane and the control plane compete for CPU. Since the control-plane process answers the livenessProbe and collects monitoring metrics, CPU and memory starvation causes requests to back up in that process, leading to OOM kills and livenessProbe timeouts.
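The probe path can be seen in the controller's Deployment. A simplified sketch follows; the `/healthz` path and port 10254 follow the community defaults, while the image tag and timeout values are illustrative:

```yaml
# Single container runs both the Go controller and nginx; the
# livenessProbe is answered by the Go control-plane process, so CPU
# starvation of that process fails the probe and restarts the pod.
containers:
  - name: controller
    image: registry.k8s.io/ingress-nginx/controller:v1.9.0  # illustrative tag
    livenessProbe:
      httpGet:
        path: /healthz
        port: 10254
      timeoutSeconds: 5      # raising this reduces false restarts under load
      failureThreshold: 5
```

Loosening the probe trades faster restart detection for fewer spurious restarts; it does not remove the CPU contention itself.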

This is extremely dangerous: under high load it can trigger an avalanche effect on the gateway, with serious business impact. The MSE cloud-native gateway uses an architecture in which the data plane and the control plane are isolated, which gives it a reliability advantage:


As the figure above shows, the MSE cloud-native gateway is not deployed in the user's Kubernetes cluster but runs in a fully managed model, which brings additional reliability advantages:

  • It will not mix with business containers on one ECS node.
  • Multiple instances of the gateway will not run on one ECS node.
  • It provides SLA assurance for gateway availability.

Achieving comparably reliable deployment with the Nginx Ingress Controller requires dedicating ECS nodes to it exclusively and deploying across multiple nodes to avoid a single point of failure, which sharply increases resource costs. In addition, since the Nginx Ingress Controller runs inside the user's cluster, there is no SLA assurance for gateway availability.
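Pinning the controller to dedicated nodes and spreading replicas can be sketched as follows (the node label, taint key, and replica count are illustrative):

```yaml
# Deployment fragment: run the ingress controller only on dedicated,
# tainted nodes, and never co-locate two replicas on the same node.
spec:
  replicas: 3
  template:
    spec:
      nodeSelector:
        node-role/ingress: "true"        # illustrative dedicated-node label
      tolerations:
        - key: dedicated                 # illustrative taint on those nodes
          value: ingress
          effect: NoSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: ingress-nginx
```

The anti-affinity rule guarantees at most one replica per node, which is exactly what makes the dedicated nodes expensive: each replica occupies a full ECS node.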


Security

Various versions of the Nginx Ingress Controller carry known CVEs. The following table lists the affected versions:


After switching from the Nginx Ingress Controller to the MSE cloud-native gateway, all known CVEs are resolved at once. In addition, the MSE cloud-native gateway provides a smooth upgrade path: when a new security vulnerability is disclosed, the gateway version can be upgraded quickly while minimizing the impact on the business.

In addition, the MSE cloud-native gateway has a built-in Alibaba Cloud Web Application Firewall (WAF). Compared with a traditional WAF, the request path is shorter and the RT is lower; compared with the Nginx Ingress Controller, it provides fine-grained, route-level protection. Its usage cost is 2/3 of the current Alibaba Cloud WAF architecture.


MSE Cloud-Native Gateway

The MSE cloud-native gateway has been launched on Alibaba Cloud Container Application Market, and it can replace the gateway component Nginx Ingress Controller installed by default.


The MSE cloud-native gateway has been used on a large scale as a gateway middleware within Alibaba Group. Its strong performance and reliable stability have been verified throughout the many years of service during the Double 11 Global Shopping Festival.

In the scenario of the Kubernetes container service, compared with the default Nginx Ingress Controller, the MSE cloud-native gateway has the following advantages:

  • Stronger performance and a better architecture, reducing gateway resource costs by at least 50%
  • Better reliability: SLA assurance, 100% hosting with no O&M, and support from the Alibaba Cloud technical team
  • Better security: one-time resolution of existing CVEs and built-in WAF protection

The MSE cloud-native gateway provides more features in routing strategy, gray governance, and observability. It also supports the development of custom extended plugins in multiple languages. Please see this link for more information.

Seamless Migration Solution

Deploying an MSE cloud-native gateway does not directly affect the original gateway traffic. You can use DNS weight configuration to migrate business traffic smoothly, without backend services being aware of the change. The following figure shows the core traffic migration process:


The complete steps are listed below:

  • Step 1: Find the mse-ingress-controller in the Alibaba Cloud Container Application Market and install it in the target ACK cluster
  • Step 2: Configure MseIngressConfig (configuration guide) in Kubernetes to automatically create an MSE cloud-native gateway with the specified specifications
  • Step 3: Obtain the IP address of the MSE cloud-native gateway from the Address field of the Ingress, bind the host locally, and resolve the service domain name to the IP address to complete the service test
  • Step 4: Modify the DNS weight configuration of the service domain name, add the cloud-native gateway IP address, gradually increase the weight, and perform gray release based on traffic
  • Step 5: After the gray release, remove the original IP address of the service domain name from the DNS configuration to migrate all traffic to the cloud-native gateway
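For step 3, the gateway IP appears in the Ingress status once the MSE controller has reconciled the resource. A sketch of what to look for follows; the names, class, and IP are illustrative:

```yaml
# The ADDRESS column of `kubectl get ingress` (i.e. the
# status.loadBalancer.ingress field below) carries the MSE gateway IP
# used for the local host binding and the later DNS weight change.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress
spec:
  ingressClassName: mse          # illustrative class name for the MSE controller
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-svc   # assumed backend Service name
                port:
                  number: 80
status:
  loadBalancer:
    ingress:
      - ip: 203.0.113.10         # example gateway IP (TEST-NET address)
```

Binding `demo.example.com` to this IP in a local hosts file lets you verify routing before any DNS weight is shifted in step 4.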

Click here to learn more about cloud-native gateway products!
