By Wang Chen
Sealos shared its gateway performance optimization journey with the Reddit r/kubernetes community, covering the production issues driven by its business characteristics, technology selection, performance comparisons, code snippets, flame graphs, and the resulting improvements. The post drew over 40,000 views and praise from many users.

Sometimes I almost feel like I'm doing impactful engineering at work.
Then I see posts like this, and I immediately feel like a three-year-old who just received praise from his father for successfully putting a triangle into a square hole, while his father has just merged his PR into the Linux kernel...
I love having a well-paying and comfortable job, but damn it, I might have to move somewhere that allows me to work on cutting-edge projects.
However, my question is: how much manpower and overtime went into solving this? How many years of experience (YoE) does this team have? I know a few people with 15+ years of experience, but I'm not sure they could contribute to an operation of this scale.

I love these improvements; they help not only large companies but small ones too. Even when each gain looks modest, a little makes a tremendous difference: combined, these incremental improvements let small companies run more services on the same hardware. It's the opposite of desktops and networking, where we always assume computers are getting faster, so nobody cares about optimization.

Finally, there is some good content here—real code snippets, flame graphs, and production results… amazing stuff.

This makes me feel completely ignorant, so that's good.

This is how we make progress! (Learn to nod when talking to smarter people)
Sealos went through two phases of optimizing gateway performance.
Sealos Cloud is built on K8s, where developers can quickly create cloud applications.
During Sealos's development, its ultra-simple product experience and cost-effective pricing model drove explosive user growth: by early 2024 it was supporting over 2,000 tenants and more than 87,000 users. Every user creates applications, and each application needs its own access point, producing a massive number of routing entries across the cluster and requiring the gateway to support hundreds of thousands of Ingress entries.
In addition, offering a shared cluster as a public-facing service imposes extremely strict multi-tenancy requirements: tenants' routes must not interfere with one another, demanding excellent isolation and traffic-control capabilities. Sealos Cloud's public cloud services also expose a large attack surface; attackers may target both the user applications running in the cloud and the platform's network entry point (the gateway) directly, posing significant security challenges.
This placed high performance and stability demands on the gateway controller: with a large number of routing entries, many controllers consume excessive resources, in some cases OOMing and crashing the gateway.
In this kind of high-scale, multi-tenant Kubernetes environment, the limitations of Nginx Ingress became apparent.
Sealos began evaluating other open-source gateways, including APISIX, Cilium Gateway, Envoy Gateway, and Higress, and ultimately found that Higress met its needs for fast configuration delivery, controller stability, and resource efficiency, while remaining compatible with Nginx Ingress syntax. The full evaluation, including performance tests of several open-source gateways and the reasons for rejecting the others, is described in "Which Cloud-Native Gateway is Best: The Blood and Tears of Sealos Gateway".
Sealos also emphasized that migrating away from Nginx Ingress is not a criticism of Nginx, which performs excellently in many scenarios. But Sealos believes the specific challenges it faced, and the lessons learned at this scale, may be a valuable reference for the community.
After more than a year of growth, Sealos Cloud approached 200,000 users and 40,000 running instances, and the gateway became abnormally slow when updating Ingress configurations.
While Higress outperformed Nginx, serious issues still arose when the number of Ingress configurations in a single Kubernetes cluster exceeded 20,000.
As a public cloud provider, Sealos allocates each user subdomains under its primary domain to simplify development and testing, and, for security, terminates user-side (public network) traffic to the cluster gateway with TLS. It also supports custom domains via CNAME resolution, with automatic certificate issuance and renewal. These product choices mean that a huge number of platform subdomains coexist with user-supplied custom domains in the same availability zone.
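To make this concrete, here is a minimal sketch, in Go with the Kubernetes API types, of the kind of per-application access point this model produces: a generated subdomain under the platform's primary domain, with TLS terminated at the gateway. The domain, annotation, and secret names are illustrative assumptions, not Sealos's actual configuration.

```go
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ingressForApp builds one Ingress per application: a generated
// subdomain under the platform domain, TLS terminated at the gateway.
// "cloud.sealos.io", the cert-manager annotation, and the secret name
// are hypothetical placeholders for illustration only.
func ingressForApp(tenant, app string) *networkingv1.Ingress {
	host := fmt.Sprintf("%s-%s.cloud.sealos.io", tenant, app)
	pathType := networkingv1.PathTypePrefix
	return &networkingv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{
			Name:      app,
			Namespace: "ns-" + tenant,
			Annotations: map[string]string{
				// hypothetical: delegate certificate issuance and renewal
				"cert-manager.io/cluster-issuer": "letsencrypt",
			},
		},
		Spec: networkingv1.IngressSpec{
			TLS: []networkingv1.IngressTLS{{Hosts: []string{host}, SecretName: app + "-tls"}},
			Rules: []networkingv1.IngressRule{{
				Host: host,
				IngressRuleValue: networkingv1.IngressRuleValue{
					HTTP: &networkingv1.HTTPIngressRuleValue{
						Paths: []networkingv1.HTTPIngressPath{{
							Path:     "/",
							PathType: &pathType,
							Backend: networkingv1.IngressBackend{
								Service: &networkingv1.IngressServiceBackend{
									Name: app,
									Port: networkingv1.ServiceBackendPort{Number: 80},
								},
							},
						}},
					},
				},
			}},
		},
	}
}

func main() {
	ing := ingressForApp("tenant-a", "web")
	fmt.Println(ing.Spec.Rules[0].Host) // tenant-a-web.cloud.sealos.io
}
```

Multiply one such object by every application of every tenant, and the hundreds of thousands of Ingress entries described above follow directly.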
Sealos is the first public cloud vendor in the industry to serve this many users from a single K8s cluster, which also meant confronting a problem never encountered before: gateway performance bottlenecks in scenarios with massive numbers of domains and Ingresses.
No product in the open-source ecosystem today copes well with massive numbers of tenants and application entry points in a single cluster. In Sealos's case, Ingress configuration sync time deteriorated non-linearly as the domain count grew, so after creating an Ingress, users had to wait a long time before their service became reachable from the public network.
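Why does sync time degrade non-linearly? A common culprit is a controller that rebuilds the entire routing table on every change, so processing n changes costs O(n²) total work, while an incremental controller touches only the affected entry. The Go sketch below models that contrast; it is a simplified illustration, not Sealos's or Higress's actual code.

```go
package main

import "fmt"

type host struct{ name, backend string }

// rebuildAll models a naive controller that scans every virtual host
// whenever any single Ingress changes: O(n) per change.
func rebuildAll(hosts []host) int {
	work := 0
	for range hosts {
		work++
	}
	return work
}

func main() {
	hosts := make([]host, 20000) // roughly the scale described in the article
	fullWork, incrWork := 0, 0
	for i := range hosts {
		hosts[i] = host{name: fmt.Sprintf("app-%d.example.com", i)}
		fullWork += rebuildAll(hosts[:i+1]) // full rebuild per change: O(n^2) total
		incrWork++                          // incremental update: O(n) total
	}
	fmt.Printf("full rebuilds: %d units of work, incremental: %d\n", fullWork, incrWork)
}
```

At 20,000 entries the full-rebuild strategy does roughly 200 million units of work versus 20,000 for the incremental one, which is exactly the kind of non-linear blowup users experience as ever-longer sync delays.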
Therefore, Sealos conducted in-depth research on Higress, Istio, Envoy, and Protobuf to try to find the root of the issue. Perhaps these experiences can help other teams facing similar large Kubernetes Ingress problems.
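For flavor, one common class of control-plane fix in the Istio/Envoy/Protobuf area is to cache serialized Protobuf bytes so that unchanged resources are not re-marshaled on every configuration push. The Go sketch below shows the idea under that assumption; it is a hypothetical illustration of the technique, not the actual patch Sealos shipped.

```go
package main

import (
	"fmt"
	"sync"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// marshalCache memoizes serialized protobuf bytes keyed by resource
// name and version, so a resource is marshaled once per version rather
// than once per push to every connected proxy.
type marshalCache struct {
	mu    sync.Mutex
	items map[string][]byte // key: name + "/" + version
}

func newMarshalCache() *marshalCache {
	return &marshalCache{items: make(map[string][]byte)}
}

func (c *marshalCache) bytes(name, version string, msg proto.Message) ([]byte, error) {
	key := name + "/" + version
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.items[key]; ok {
		return b, nil // cache hit: skip proto.Marshal entirely
	}
	b, err := proto.Marshal(msg)
	if err != nil {
		return nil, err
	}
	c.items[key] = b
	return b, nil
}

func main() {
	cache := newMarshalCache()
	res := wrapperspb.String("route-config-for-tenant-a") // stand-in for a real xDS resource
	for i := 0; i < 3; i++ {
		b, _ := cache.bytes("tenant-a-routes", "v1", res)
		fmt.Printf("push %d: %d bytes (marshaled once)\n", i, len(b))
	}
}
```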
Optimization Results: The response speed of the Istio controller improved by more than 50%.
Optimization Results: Envoy's performance bottlenecks were significantly alleviated.
Production Environment (20,000+ Ingress entries): Ingress configuration sync now completes within seconds.
The complete technical details of the optimization process, code implementation, and flame graph analysis can be found in "We Improved Sealos Gateway Performance by Nearly 100 Times to Support 100,000 Domains in a Single Cluster".
These optimizations apply not only to Higress; they also expose common performance issues with Istio and Envoy in large Kubernetes environments. Along the way, Sealos accumulated valuable experience in diagnosing system performance bottlenecks.
One Reddit commenter summed it up: technology selection should never rest on reputation and gut feeling alone; it requires thorough evaluation and PoC validation before committing to a build.

Indeed, many people choose technologies based on reputation and gut feeling, invest heavily in building on them, and never run any load testing or proof of concept (PoC). Only after going live do they discover that their choice has let them down and left them in a difficult spot.
An engineer from HAProxy strongly agreed that benchmarking should come before any selection, rather than relying on buzzwords.

I work at HAProxy.
Before joining the company, I tested various HAProxy ingress controllers, and HAProxy Tech interested me because it uses Client Native rather than templates, which are the real bottleneck in other solutions.
At Namecheap, we migrated from Lua-based NGINX to HAProxy, and with HAProxy Tech the TTR (time to reconfigure) ultimately dropped from 40 minutes to 7 minutes.
I just finished a talk on stage in San Francisco, where the team publicly shared an improvement that cut configuration time from 3 hours to 26 seconds for a large customer with a huge number of objects.
Sometimes I feel technology choices are driven by buzzwords and trends. Benchmarking should be the first thing you do, free of bias and preconceptions.
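For teams that want to follow this advice, the measurement itself is simple. Below is a minimal Go sketch of a TTR probe: apply a new route out-of-band, then poll the gateway with the new Host header until it answers 200. The gateway URL and hostname are placeholders, not values from the article.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// measureTTR polls gatewayURL with the given Host header until the
// gateway returns 200, i.e. until the newly applied route has actually
// propagated to the data plane. The elapsed time is the TTR.
func measureTTR(gatewayURL, host string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	client := &http.Client{Timeout: 2 * time.Second}
	for time.Since(start) < timeout {
		req, err := http.NewRequest(http.MethodGet, gatewayURL, nil)
		if err != nil {
			return 0, err
		}
		req.Host = host // gateways route on the Host header
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return time.Since(start), nil
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
	return 0, fmt.Errorf("route for %s not live after %s", host, timeout)
}

func main() {
	// Assumes an Ingress/route for this host was just created out-of-band.
	ttr, err := measureTTR("http://gateway.local", "new-app.example.com", 5*time.Minute)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("TTR:", ttr)
}
```

Run against each candidate gateway at realistic route counts, this one number already separates template-regenerating controllers from incremental, API-driven ones.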
Beyond the praise, many commenters hoped these performance breakthroughs would be open-sourced and contributed back upstream so more developers can benefit.

I hope these breakthroughs have been pushed as PRs to the original repository so that upstream projects and the community can benefit from them.
One commenter couldn't wait to share that their own company's domain activation time is even shorter: under 50 ms, in a production environment with 20,000+ Ingress entries!

Great improvements! It's worth noting that the O(n²) code is not in Istio itself but in a fork of Istio. Also, 5 seconds is still too long :-) Stay tuned, and I will share how we achieved this in 50 ms at a scale of 20k domains.
But some commenters also asked why Sealos didn't design the system as multiple smaller clusters instead.

After reading the entire article, I still have a few questions:
By the way, this article is excellent! I'm going to pull a few colleagues into an impromptu seminar to study this example. Truly amazing.
The author replied:

Feel free to leave comments for discussion:
Original post link: https://www.reddit.com/r/kubernetes/comments/1l44d4y/followup_k8s_ingress_for_20k_domains_now_syncs_in/
If you want to learn more about Alibaba Cloud API Gateway (Higress), please click: https://higress.ai/en/