4-Step Method for Large-scale Container Deployment

This blog post shares four steps to help you evaluate whether you need large-scale container clusters and how to deploy them.

By Alibaba Container Service

As cloud native technology matures, many enterprises are migrating their traditional IT infrastructure to the cloud to stay competitive. Amid the ongoing pandemic, consumer habits have also shifted markedly, and industries such as online education, audio and video, and public health have grown substantially. Many enterprises have seized this period of rapid growth and achieved leapfrog development by taking advantage of cloud computing and container technologies. As one of the best-known and most widely adopted cloud native technologies, containers help enterprises make their IT architectures more agile, accelerate application innovation, and respond more flexibly to uncertainty in business development.

Large-scale Container Deployment Has Become a Compulsory Course for Enterprises

The pandemic has accelerated enterprise digitalization. Low-latency, high-concurrency online scenarios now appear frequently in daily operations, and the demand for business innovation pushes enterprises to keep adopting emerging technologies. Kubernetes has gradually become the infrastructure of the cloud native era, and container technology is widely used in scenarios such as AI, big data, blockchain, and edge computing. As a lightweight computing carrier, containers bring high elasticity and agility to more scenarios. Under pressure from both daily operations and business innovation, more and more enterprises are trying out large-scale container deployment to keep their business healthy over the long term.

According to the China Cloud Native User Survey Report 2020 issued by the China Academy of Information and Communication Technology (CAICT), over 60% of users have applied container technology in production. Nearly 80% of users need a node size of 1,000 or more to meet their production needs, more than 13% of users run clusters with more than 5,000 nodes, and 9% run clusters with more than 10,000 nodes. As cloud native technology becomes more popular, more and more enterprises are deploying their core business in containers. With the scale of production container clusters growing explosively, large-scale container deployment has become a "compulsory course" for enterprises. Open-source Kubernetes currently supports up to 5,000 nodes and 150,000 pods, which can no longer meet these growing business needs.

Difficulties in Large-scale Container Deployment

Large-scale container clusters can absorb heavier service loads and traffic bursts while making cluster management more efficient. As a practitioner and leader in the cloud native field, Alibaba Cloud has taken the lead in achieving a scale of 10,000 nodes and 1 million pods in a single cluster. Compared with community Kubernetes, that is twice the number of nodes and about 6.7 times the number of pods in a single cluster. Based on its experience serving millions of customers, Alibaba Cloud has developed a "4-step method for large-scale container deployment" to help enterprises overcome deployment difficulties and cope with growing scale demands. We present the four steps as questions so that you can apply the advice to your own situation more easily.
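
The scale comparison above can be checked with a few lines of arithmetic. This is only a sketch that uses the node and pod figures quoted in this post; the constant names are made up for illustration, and the pods-per-node values are simple averages implied by those limits, not recommended settings.

```python
# Figures quoted in this post (single-cluster limits).
COMMUNITY_NODES, COMMUNITY_PODS = 5_000, 150_000
ALIBABA_NODES, ALIBABA_PODS = 10_000, 1_000_000

node_ratio = ALIBABA_NODES / COMMUNITY_NODES
pod_ratio = ALIBABA_PODS / COMMUNITY_PODS

# Average pod density implied by each pair of limits.
community_density = COMMUNITY_PODS / COMMUNITY_NODES
alibaba_density = ALIBABA_PODS / ALIBABA_NODES

print(f"nodes: {node_ratio:.1f}x, pods: {pod_ratio:.1f}x")
print(f"average pods per node: {community_density:.0f} vs {alibaba_density:.0f}")
```

The pod count grows faster than the node count, so the implied average density per node rises as well, which is one reason the later steps look beyond node count alone.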

Step 1: Does My Enterprise Need a Large-scale Container Cluster?

When enterprises face business or IT demands such as traffic bursts, complex computing, and the need to further improve O&M efficiency, the capacity of a single cluster becomes a bottleneck. For example, workloads such as genetic computing and online flash sales generate large loads in a short time, which severely tests the computing resource capacity of a single cluster. These enterprises need a single cluster that can support a large number of nodes and run pods in batches, so they start to consider cluster expansion. However, cluster scale is not an end in itself. Enterprises should optimize cluster capabilities according to the characteristics of their own business so that the cluster delivers business value. Blindly pursuing a larger cluster only increases the risk of faults and suboptimal performance.

Step 2: Container Scaling Is Complex. How Can We Optimize the Entire System?

As the operating system of the cloud native era, Kubernetes is complex and vast, and so is the cloud environment it is deployed in. Scaling containers therefore means optimizing the whole system, from the underlying cloud resources to the upper-layer applications. Enterprises need to focus on three aspects. The first is to lift the restrictions imposed by cloud resource quotas at the cloud product level. The second is to raise the resource scale that cluster components can handle at the cluster component level. The third is to optimize cluster configuration policies at the Kubernetes resource level so that resources can be supported at large scale.
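
To make the three levels concrete, the sketch below checks a planned cluster against a limit at each level and reports which ones would be exceeded. It is illustrative only: the `Plan` type, the function name, the cloud quota values, and the per-node pod cap are hypothetical, and the only figures taken from this post are the community limits of 5,000 nodes and 150,000 pods.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    nodes: int          # worker nodes the business needs
    pods_per_node: int  # expected average pod density


# Level 1: cloud product quotas (hypothetical values; real quotas vary per account).
CLOUD_QUOTAS = {"instances": 8_000, "private_ips": 200_000}
# Level 2: what the cluster components can handle (community figures from this post).
CLUSTER_LIMITS = {"nodes": 5_000, "pods": 150_000}
# Level 3: Kubernetes resource-level policy (hypothetical per-node pod cap).
MAX_PODS_PER_NODE = 64


def find_bottlenecks(plan: Plan) -> list[str]:
    """Return the levels whose limits the plan would exceed."""
    pods = plan.nodes * plan.pods_per_node
    issues = []
    if plan.nodes > CLOUD_QUOTAS["instances"]:
        issues.append("cloud quota: instances")
    if pods > CLOUD_QUOTAS["private_ips"]:  # assumes one private IP per pod
        issues.append("cloud quota: private IPs")
    if plan.nodes > CLUSTER_LIMITS["nodes"]:
        issues.append("cluster components: node count")
    if pods > CLUSTER_LIMITS["pods"]:
        issues.append("cluster components: pod count")
    if plan.pods_per_node > MAX_PODS_PER_NODE:
        issues.append("Kubernetes config: pods per node")
    return issues


print(find_bottlenecks(Plan(nodes=6_000, pods_per_node=40)))
```

With these assumed numbers, a plan of 6,000 nodes at 40 pods per node exceeds limits at two of the three levels at once, which illustrates why optimizing a single level is not enough.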

Step 3: It Is Difficult to Maintain the Original Performance After Large-scale Container Deployment. How Can We Further Improve Performance?

Scaling a container cluster to many times its original size places great demands on storage, the cluster network, and application distribution. For example, network traffic in the data center of a large-scale cluster is relatively high, so network latency and jitter also increase, which affects the transmission efficiency and stability of the cluster network. Another example is the common scenario in which applications are published and updated in batches across a large-scale cluster. When 10,000 nodes pull an image at the same moment, the burst puts heavy pressure on the image service and network bandwidth. Large-scale container deployment should not only preserve the original performance but also improve overall performance. Enterprise customers can optimize performance in four areas: node and pod scaling efficiency, network efficiency (throughput and latency), DNS resolution efficiency, and image acceleration.
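
A rough back-of-the-envelope estimate shows why simultaneous image pulls are a problem. The sketch below is not a measurement: the image size, registry bandwidth, per-node pull rate, and batch size are hypothetical, as are the function names. Only the 10,000-node figure comes from this post.

```python
import math

NODES = 10_000              # from this post
IMAGE_SIZE_GB = 0.5         # hypothetical image size, in gigabytes
REGISTRY_EGRESS_GBPS = 100  # hypothetical registry bandwidth, in gigabits per second
PER_NODE_GBPS = 1.0         # hypothetical pull rate of a single node


def pull_seconds(nodes: int, image_gb: float, egress_gbps: float) -> float:
    """Lower bound on the time for all nodes to pull one image from a shared registry."""
    total_gigabits = nodes * image_gb * 8  # gigabytes to gigabits
    return total_gigabits / egress_gbps


def peak_demand_gbps(concurrent_nodes: int, per_node_gbps: float) -> float:
    """Bandwidth the registry would need if every concurrent pull ran at full speed."""
    return concurrent_nodes * per_node_gbps


def waves(nodes: int, batch_size: int) -> int:
    """Number of rollout waves when at most batch_size nodes update at once."""
    return math.ceil(nodes / batch_size)


print(f"data moved: {NODES * IMAGE_SIZE_GB:.0f} GB")
print(f"registry-bound minimum: {pull_seconds(NODES, IMAGE_SIZE_GB, REGISTRY_EGRESS_GBPS):.0f} s")
print(f"peak demand, all at once: {peak_demand_gbps(NODES, PER_NODE_GBPS):.0f} Gbit/s")
print(f"peak demand, 500 per wave: {peak_demand_gbps(500, PER_NODE_GBPS):.0f} Gbit/s")
print(f"waves needed: {waves(NODES, 500)}")
```

Batching lowers the peak demand but does not reduce the total amount of data moved. That is why image acceleration is listed as its own optimization area in addition to rollout pacing.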

Step 4: How Can We Ensure "Stability" for Large-scale Container Deployments?

If deploying a large-scale cluster is the first difficult step, running a cluster with tens of thousands of nodes stably is even more difficult. The most important thing for a large system is to control the fault domain and prevent a cascading collapse. After scaling, stability matters more than the scale itself, because a large-scale cluster cannot be recovered simply by restarting it. Once a collapse occurs, the whole cluster is likely to crash, which seriously affects business continuity. For enterprises, the stability of large-scale clusters determines the security of their online business. Enterprises must therefore prepare contingency plans in advance and optimize resource metrics and system components. In addition, all nodes should be monitored so that a self-repair process can start at any time.
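
One way to combine node monitoring, self-repair, and fault-domain control is to cap how many nodes are repaired at once and to stop automatic repair altogether when too many nodes look unhealthy at the same time. The sketch below illustrates that idea only. It does not describe how ACK implements self-repair, and the function name, parameters, and thresholds are hypothetical.

```python
def plan_repairs(
    node_health: dict[str, bool],
    max_concurrent: int,
    circuit_breaker_ratio: float,
) -> list[str]:
    """Pick unhealthy nodes to repair now while keeping the fault domain small.

    node_health maps a node name to True (healthy) or False (unhealthy).
    """
    unhealthy = sorted(name for name, ok in node_health.items() if not ok)
    if len(unhealthy) > len(node_health) * circuit_breaker_ratio:
        # Many nodes failing together suggests a systemic fault. Repairing
        # them all automatically could widen the outage, so return nothing
        # and leave the decision to the contingency plan.
        return []
    # Repair only a small batch at a time so that capacity is never
    # removed from the cluster faster than it is restored.
    return unhealthy[:max_concurrent]


# Isolated failures: 20 of 1,000 nodes are unhealthy, so repair 5 at a time.
health = {f"node-{i}": i % 50 != 0 for i in range(1000)}
print(plan_repairs(health, max_concurrent=5, circuit_breaker_ratio=0.1))

# Mass failure: every node reports unhealthy, so automatic repair is paused.
print(plan_repairs({f"node-{i}": False for i in range(10)}, 5, 0.1))
```

The second call returns an empty list, which is the point of the circuit breaker: a contingency plan prepared in advance, not an automatic mass restart, decides what happens next.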

Alibaba Cloud Helps Enterprises Implement All-in-one Large-scale Container Deployment

To address the challenges of large-scale cluster deployment, Alibaba Cloud provides enterprise-level container cluster management based on ACK Pro, with various performance optimizations in the API server and scheduler that lift restrictions on resource scale, enhance performance, and ensure cluster stability. Terway, a high-performance container network developed by Alibaba Cloud, reduces pod latency by 30% and lowers the performance overhead of large-scale services. This addresses the network bottleneck of large-scale clusters and lets the cluster respond faster with cloud-native network performance. In addition, ACR EE, an enterprise-level image repository, supports exclusive storage and on-demand image loading, which reduces startup time by 60% and solves the problem of slow image pulls across large numbers of nodes. By integrating its storage, network, and security capabilities, Alibaba Cloud provides enterprises with one-stop, high-performance support for large-scale container operation. Enterprise customers benefit from more efficient network forwarding, more scalable storage, more efficient application and image distribution, and more stable and secure large-scale cluster management.


It is worth mentioning that at the Cloud Native Industry Conference (CNIC) 2020, Alibaba Cloud was the first cloud service vendor to pass the large-scale container performance test held by CAICT, and it obtained the highest level of certification. In many of the evaluated items, such as the full-load stress test, network latency, and network performance penalty, Alibaba Cloud Container Service was far ahead of the other participating vendors.

In other words, Alibaba Cloud has a "service capability space" with ample elasticity, and container cluster services can be tailored to each enterprise's current needs and business volume. Alibaba Cloud Container Service supports Alibaba Cloud's own products as well as the containerized migration of Alibaba Group's internal core systems to the cloud. In addition, Alibaba Cloud provides products built on years of large-scale container experience to many ecosystem partners and independent software vendors (ISVs) that participate in Double 11. By supporting container workloads on the cloud for industries around the world, Alibaba Cloud Container Service has built a middle platform for cloud native application hosting that supports unitized, global, and flexible architectures. The Container Service has managed more than 10,000 container clusters and provided reliable services for enterprises.

Alibaba Cloud has the largest container cluster, the most diverse cloud native product family, and the most comprehensive open-source achievements in China. It provides more than 100 innovative products, including ECS Bare Metal Instance, PolarDB-X, AnalyticDB, Data Lake Formation (DLF), Container Service, Microservices, DevOps, and Serverless, across fields such as new retail, government, healthcare, transportation, and education. Alibaba Cloud is the only vendor in China included in Gartner's 2019 and 2020 Competitive Landscape: Public Cloud Container Services. It provides nine capabilities, such as Serverless Kubernetes, service mesh, and container images, which puts it on par with Amazon Web Services (AWS) and ahead of Google, Microsoft, IBM, and Oracle in product diversity.

As container technology becomes more popular, container performance evaluation has become a major concern in the industry. To address this pain point, CAICT has released the industry's first performance test results for ultra-large-scale container clusters, which objectively reflect the component-level performance of container clusters. At CNIC 2020, Ding Yu, an Alibaba Cloud researcher and director of Alibaba Cloud native technologies, said, "Alibaba Cloud has always been committed to promoting the popularization of cloud native in China. Alibaba Cloud will work with CAICT to promote the normalized and standardized development of the container market in China."
