By Wang Chen
Sealos shared its gateway performance optimization journey with the Reddit r/kubernetes community, covering the production issues driven by its business characteristics, technology selection, performance comparisons, code snippets, flame graphs, and the resulting improvements. The post drew over 40,000 views and praise from many users.

Sometimes I almost feel like I'm doing impactful engineering at work.
Then I see posts like this, and I immediately feel like a three-year-old who just received praise from his father for successfully putting a triangle into a square hole, while his father has just merged his PR into the Linux kernel...
I love having a well-paying and comfortable job, but damn it, I might have to move somewhere that allows me to work on cutting-edge projects.
However, my question is: how much manpower and overtime went into solving this? How many years of experience (YoE) does this team have? I know a few people with 15+ years of experience, but I'm not sure they could contribute to an operation of this scale.

I love these improvements; they help not only large companies but small ones too. Even when each gain looks modest, a little makes a tremendous difference: combined, these incremental improvements let small companies run more services on the same hardware. It's the opposite of desktops and networking, where we always assume computers are getting faster, so nobody cares about optimization.

Finally, there is some good content here—real code snippets, flame graphs, and production results… amazing stuff.

This makes me feel completely ignorant, so that's good.

This is how we make progress! (Learn to nod when talking to smarter people)
Sealos went through two phases of optimizing gateway performance.
Sealos Cloud is built on K8s, where developers can quickly create cloud applications.
During Sealos's development, its ultra-simple product experience and cost-effective pricing model drove explosive user growth: by early 2024 it was supporting over 2,000 tenants and more than 87,000 users. Every user creates applications, and each application needs its own access point, producing a massive number of routing entries across the cluster and requiring the gateway to support hundreds of thousands of Ingress entries.
In addition, offering a shared cluster as a public-facing service imposes extremely strict multi-tenancy requirements: tenants' routes must not interfere with one another, demanding excellent isolation and traffic-control capabilities. Sealos Cloud's public cloud services also expose a large attack surface; attackers may target both the user applications running in the cloud and the platform's network entry point (the gateway) directly, posing significant security challenges.
This placed high performance and stability demands on the gateway controller: with a large number of routing entries, many controllers consume excessive resources, in some cases OOMing and crashing the gateway.
In this kind of high-scale, multi-tenant Kubernetes environment, the limitations of Nginx Ingress became apparent.
Sealos began evaluating other open-source gateways, including APISIX, Cilium Gateway, Envoy Gateway, and Higress, and ultimately found that Higress met its needs for fast configuration delivery, controller stability, and resource efficiency, while remaining compatible with Nginx Ingress syntax. The full evaluation, including performance tests of several open-source gateways and the reasons for rejecting the others, is described in "Which Cloud-Native Gateway is Best: The Blood and Tears of Sealos Gateway".
Sealos also emphasized that migrating away from Nginx Ingress is not a criticism of Nginx, which performs excellently in many scenarios. But Sealos believes the specific challenges it faced, and the lessons learned at this scale, may be a valuable reference for the community.
After more than a year of growth, Sealos Cloud approached 200,000 users and 40,000 running instances, and the gateway became abnormally slow when updating Ingress configurations.
While Higress outperformed Nginx, serious issues still arose when the number of Ingress configurations in a single Kubernetes cluster exceeded 20,000.
As a public cloud provider, Sealos allocates each user subdomains under its primary domain to simplify development and testing, and, for security, terminates user-side (public network) traffic to the cluster gateway with TLS. It also supports custom domains via CNAME resolution, with automatic certificate issuance and renewal. These product choices mean that a huge number of platform subdomains coexist with user-supplied custom domains in the same availability zone.
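To make this concrete, here is a minimal sketch, in Go with the Kubernetes API types, of the kind of per-application access point this model produces: a generated subdomain under the platform's primary domain, with TLS terminated at the gateway. The domain, annotation, and secret names are illustrative assumptions, not Sealos's actual configuration.

```go
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ingressForApp builds one Ingress per application: a generated
// subdomain under the platform domain, TLS terminated at the gateway.
// "cloud.sealos.io", the cert-manager annotation, and the secret name
// are hypothetical placeholders for illustration only.
func ingressForApp(tenant, app string) *networkingv1.Ingress {
	host := fmt.Sprintf("%s-%s.cloud.sealos.io", tenant, app)
	pathType := networkingv1.PathTypePrefix
	return &networkingv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{
			Name:      app,
			Namespace: "ns-" + tenant,
			Annotations: map[string]string{
				// hypothetical: delegate certificate issuance and renewal
				"cert-manager.io/cluster-issuer": "letsencrypt",
			},
		},
		Spec: networkingv1.IngressSpec{
			TLS: []networkingv1.IngressTLS{{Hosts: []string{host}, SecretName: app + "-tls"}},
			Rules: []networkingv1.IngressRule{{
				Host: host,
				IngressRuleValue: networkingv1.IngressRuleValue{
					HTTP: &networkingv1.HTTPIngressRuleValue{
						Paths: []networkingv1.HTTPIngressPath{{
							Path:     "/",
							PathType: &pathType,
							Backend: networkingv1.IngressBackend{
								Service: &networkingv1.IngressServiceBackend{
									Name: app,
									Port: networkingv1.ServiceBackendPort{Number: 80},
								},
							},
						}},
					},
				},
			}},
		},
	}
}

func main() {
	ing := ingressForApp("tenant-a", "web")
	fmt.Println(ing.Spec.Rules[0].Host) // tenant-a-web.cloud.sealos.io
}
```

Multiply one such object by every application of every tenant, and the hundreds of thousands of Ingress entries described above follow directly.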
Sealos is the first public cloud vendor in the industry to serve this many users from a single K8s cluster, which also meant confronting a problem never encountered before: gateway performance bottlenecks in scenarios with massive numbers of domains and Ingresses.
No product in the open-source ecosystem today copes well with massive numbers of tenants and application entry points in a single cluster. In Sealos's case, Ingress configuration sync time deteriorated non-linearly as the domain count grew, so after creating an Ingress, users had to wait a long time before their service became reachable from the public network.
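Why does sync time degrade non-linearly? A common culprit is a controller that rebuilds the entire routing table on every change, so processing n changes costs O(n²) total work, while an incremental controller touches only the affected entry. The Go sketch below models that contrast; it is a simplified illustration, not Sealos's or Higress's actual code.

```go
package main

import "fmt"

type host struct{ name, backend string }

// rebuildAll models a naive controller that scans every virtual host
// whenever any single Ingress changes: O(n) per change.
func rebuildAll(hosts []host) int {
	work := 0
	for range hosts {
		work++
	}
	return work
}

func main() {
	hosts := make([]host, 20000) // roughly the scale described in the article
	fullWork, incrWork := 0, 0
	for i := range hosts {
		hosts[i] = host{name: fmt.Sprintf("app-%d.example.com", i)}
		fullWork += rebuildAll(hosts[:i+1]) // full rebuild per change: O(n^2) total
		incrWork++                          // incremental update: O(n) total
	}
	fmt.Printf("full rebuilds: %d units of work, incremental: %d\n", fullWork, incrWork)
}
```

At 20,000 entries the full-rebuild strategy does roughly 200 million units of work versus 20,000 for the incremental one, which is exactly the kind of non-linear blowup users experience as ever-longer sync delays.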
Therefore, Sealos conducted in-depth research on Higress, Istio, Envoy, and Protobuf to try to find the root of the issue. Perhaps these experiences can help other teams facing similar large Kubernetes Ingress problems.
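For flavor, one common class of control-plane fix in the Istio/Envoy/Protobuf area is to cache serialized Protobuf bytes so that unchanged resources are not re-marshaled on every configuration push. The Go sketch below shows the idea under that assumption; it is a hypothetical illustration of the technique, not the actual patch Sealos shipped.

```go
package main

import (
	"fmt"
	"sync"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// marshalCache memoizes serialized protobuf bytes keyed by resource
// name and version, so a resource is marshaled once per version rather
// than once per push to every connected proxy.
type marshalCache struct {
	mu    sync.Mutex
	items map[string][]byte // key: name + "/" + version
}

func newMarshalCache() *marshalCache {
	return &marshalCache{items: make(map[string][]byte)}
}

func (c *marshalCache) bytes(name, version string, msg proto.Message) ([]byte, error) {
	key := name + "/" + version
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.items[key]; ok {
		return b, nil // cache hit: skip proto.Marshal entirely
	}
	b, err := proto.Marshal(msg)
	if err != nil {
		return nil, err
	}
	c.items[key] = b
	return b, nil
}

func main() {
	cache := newMarshalCache()
	res := wrapperspb.String("route-config-for-tenant-a") // stand-in for a real xDS resource
	for i := 0; i < 3; i++ {
		b, _ := cache.bytes("tenant-a-routes", "v1", res)
		fmt.Printf("push %d: %d bytes (marshaled once)\n", i, len(b))
	}
}
```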
Optimization Results: The response speed of the Istio controller improved by more than 50%.
Optimization Results: Envoy's performance bottlenecks were significantly alleviated.
Production Environment (20,000+ Ingress entries): Ingress configuration sync now completes within seconds.
The complete technical details of the optimization process, code implementation, and flame graph analysis can be found in "We Improved Sealos Gateway Performance by Nearly 100 Times to Support 100,000 Domains in a Single Cluster".
These optimizations apply not only to Higress; they also expose common performance issues with Istio and Envoy in large Kubernetes environments. Along the way, Sealos accumulated valuable experience in diagnosing system performance bottlenecks.
One Reddit commenter summed it up: technology selection should never rest on reputation and gut feeling alone; it requires thorough evaluation and PoC validation before committing to a build.

Indeed, many people choose technologies based on reputation and gut feeling, invest heavily in building on them, and never run any load testing or proof of concept (PoC). Only after going live do they discover that their choice has let them down and left them in a difficult spot.
An engineer from HAProxy strongly agreed that benchmarking should come before any selection, rather than relying on buzzwords.

I work at HAProxy.
Before joining the company, I tested various HAProxy ingress controllers, and HAProxy Tech interested me because it uses Client Native rather than templates, which are the real bottleneck in other solutions.
At Namecheap, we migrated from Lua-based NGINX to HAProxy, and with HAProxy Tech the TTR (time to reconfigure) ultimately dropped from 40 minutes to 7 minutes.
I just finished a talk on stage in San Francisco, where the team publicly shared an improvement that cut configuration time from 3 hours to 26 seconds for a large customer with a huge number of objects.
Sometimes I feel technology choices are driven by buzzwords and trends. Benchmarking should be the first thing you do, free of bias and preconceptions.
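For teams that want to follow this advice, the measurement itself is simple. Below is a minimal Go sketch of a TTR probe: apply a new route out-of-band, then poll the gateway with the new Host header until it answers 200. The gateway URL and hostname are placeholders, not values from the article.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// measureTTR polls gatewayURL with the given Host header until the
// gateway returns 200, i.e. until the newly applied route has actually
// propagated to the data plane. The elapsed time is the TTR.
func measureTTR(gatewayURL, host string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	client := &http.Client{Timeout: 2 * time.Second}
	for time.Since(start) < timeout {
		req, err := http.NewRequest(http.MethodGet, gatewayURL, nil)
		if err != nil {
			return 0, err
		}
		req.Host = host // gateways route on the Host header
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return time.Since(start), nil
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
	return 0, fmt.Errorf("route for %s not live after %s", host, timeout)
}

func main() {
	// Assumes an Ingress/route for this host was just created out-of-band.
	ttr, err := measureTTR("http://gateway.local", "new-app.example.com", 5*time.Minute)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("TTR:", ttr)
}
```

Run against each candidate gateway at realistic route counts, this one number already separates template-regenerating controllers from incremental, API-driven ones.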
Beyond the praise, many commenters hoped these performance breakthroughs would be open-sourced and contributed back upstream so more developers can benefit.

I hope these breakthroughs have been pushed as PRs to the original repository so that upstream projects and the community can benefit from them.
One commenter couldn't wait to share that their own company's domain activation time is even shorter: under 50 ms, in a production environment with 20,000+ Ingress entries!

Great improvements! It's worth noting that the O(n²) code is not in Istio itself but in a fork of Istio. Also, 5 seconds is still too long :-) Stay tuned, and I will share how we achieved this in 50 ms at a scale of 20k domains.
But some commenters also asked why Sealos didn't design the system as multiple smaller clusters instead.

After reading the entire article, I still have a few questions:
By the way, this article is excellent! I'm going to pull a few colleagues into an impromptu seminar to study this example. Truly amazing.
The author replied:

Feel free to leave comments for discussion:
Original post link: https://www.reddit.com/r/kubernetes/comments/1l44d4y/followup_k8s_ingress_for_20k_domains_now_syncs_in/
If you want to learn more about Alibaba Cloud API Gateway (Higress), please click: https://higress.ai/en/