By ECS Team.
DingTalk is an enterprise instant messaging application launched by Alibaba Group. It provides enterprise users with online organization, communication, collaboration, business, and ecosystem functions. DingTalk has served more than 10 million enterprise customers and recently expanded into the education market with a rich set of remote classroom solutions. So far, DingTalk has more than 200 million users and is the largest work-on-the-go application in China. On February 5, 2020, DingTalk ranked as the No. 1 free app in Apple's App Store. Its underlying computing resources are provided by Alibaba Cloud.
Due to the recent travel restrictions from the coronavirus outbreak, demand for online education and online office applications has risen all over the world, especially in Mainland China. DingTalk, the preferred online office software for many enterprises, has seen a surge in user traffic, especially from users who need video conferencing and live streaming. In response to the Chinese Education Department's call to keep learning outside physical classrooms, DingTalk provides a free trial of online classes for teachers.
The surging traffic brought DingTalk the technical challenge of scaling up its resources. DingTalk found itself in a predicament when primary and secondary school students vented their anger online about its performance, joking about giving it a five-star rating one star at a time. Among the negative opinions of netizens, a thought-provoking comment popped up on the Internet:
Since January 28, DingTalk has experienced a multifold increase in traffic from audio and video conferencing and live streaming. Because DingTalk is developed on the cloud, it became the first target of Alibaba Cloud's resource scale-up effort, which aimed to give users who work and learn from home a superior experience. So, how did DingTalk achieve this?
A pressing deadline. The DingTalk technical team had to solve the problem of traffic spikes in just a few days. The team has been fully devoted to scaling up in Alibaba Cloud 24 hours a day since January 29. As of February 2, resources had been scaled up only from the initial 20,000 vCPUs to 30,000 vCPUs, which fell seriously short of business demand.
Complex purchase and configuration procedures. Unlike a single Elastic Compute Service (ECS) cluster, the system architecture of DingTalk contains a variety of resources, including Server Load Balancer (SLB), ApsaraDB for MongoDB, ApsaraDB for Redis, and Elastic IP (EIP). These resources need to be purchased separately, and the relationships between them need to be configured manually.
Inefficient manual deployment with a high error rate. DingTalk has a large user base. It takes about 1 hour to manually deploy a cluster, and only 3 to 4 clusters can be deployed at the same time. The deployment process requires many configuration steps, which makes it prone to mistakes.
Complex deployment. Each cluster is self-contained in its service capabilities, so clusters can be added without limit, but this increases deployment complexity. The scale-up project of DingTalk involves 8 regions and 16 zones. The conventional deployment approach may reduce scale-up efficiency and increase the complexity of managing large numbers of clusters. To enable hundreds of millions of workers and students to work and learn from home, the DingTalk team needs to scale up nearly one thousand clusters within a short period of time. It is difficult to manage the relationships among one thousand resources, let alone over one million.
Poor fault tolerance and difficult troubleshooting in manual deployment mode. Configurations often differ from cluster to cluster. For example, one cluster uses port 300 as the SLB listening port, whereas another cluster uses port 3000 for the same purpose. This makes troubleshooting difficult.
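The port mismatch described above is a form of configuration drift. As a minimal sketch of how such drift could be surfaced, the following Python snippet compares each cluster's settings against a baseline. The setting names, cluster names, and the choice of 3000 as the baseline port are hypothetical and chosen only for illustration; the source does not say which port was the intended one.

```python
def find_drift(baseline, clusters):
    """Return {cluster: {setting: (expected, actual)}} for settings that differ."""
    drift = {}
    for name, config in clusters.items():
        diffs = {
            key: (baseline.get(key), config.get(key))
            for key in baseline.keys() | config.keys()
            if baseline.get(key) != config.get(key)
        }
        if diffs:
            drift[name] = diffs
    return drift


# Hypothetical baseline and cluster settings, for illustration only.
BASELINE = {"slb_listener_port": 3000, "zone_count": 2}
CLUSTERS = {
    "cluster-a": {"slb_listener_port": 3000, "zone_count": 2},
    "cluster-b": {"slb_listener_port": 300, "zone_count": 2},
}

if __name__ == "__main__":
    for cluster, diffs in find_drift(BASELINE, CLUSTERS).items():
        for key, (expected, actual) in diffs.items():
            print(f"{cluster}: {key} expected {expected}, found {actual}")
```

With hand-built clusters, a check like this has to run after the fact. Deploying every cluster from one template removes the source of the drift instead.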
The DingTalk team also faces great technical challenges in creating and operating a massive number of clusters.
Before February 2 when traffic spikes emerged, DingTalk used Alibaba Cloud Resource Orchestration Service (ROS) to improve cluster deployment efficiency and implement quick cluster scale-up. ROS helped DingTalk deploy more than 10,000 ECS instances in just 2 hours, setting a new record of rapid scale-up in Alibaba Cloud.
Alibaba Cloud ROS is an orchestration service that helps you automatically create, update, and delete cloud resources. ROS uses stacks to centrally manage cloud resources: a stack is a logical collection of Alibaba Cloud resources that are managed as a group. ROS allows you to create, delete, and clone cloud resources as stacks. By using ROS in DevOps practices, you can easily clone the development, testing, and production environments, and migrate and scale up applications as a whole.
ROS is the infrastructure as code (IaC) solution provided by Alibaba Cloud. IaC is a key component of DevOps, and ROS lets you implement it quickly.
ROS is a fully managed service, so you do not need to purchase any resources to use it. This allows you to focus on the cloud resources for your business, which are defined in the resource template. Managed automation accelerates the process of creating multiple projects that correspond to multiple stacks.
You can use the same templates to deploy resources for development, test, and production environments. You can set a parameter to different values for different environments. For example, you can set the number of ECS instances in the test environment to 2 and that of ECS instances in the production environment to 20. You can also use the same templates to deploy resources to multiple regions. This improves the efficiency of multi-region deployment.
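To make the idea of one template with per-environment parameter values concrete, here is a minimal sketch in Python. The template is written as a plain dictionary in the general shape of an ROS JSON template; the resource name `WebServers`, the parameter name `InstanceCount`, and the use of `ALIYUN::ECS::InstanceGroup` with `MaxAmount` are assumptions for illustration and should be checked against the ROS reference. The `render` helper is hypothetical: it only mimics locally the parameter substitution that ROS performs on the service side.

```python
import json

# Hypothetical template: one ECS instance group whose size is a parameter.
TEMPLATE = {
    "ROSTemplateFormatVersion": "2015-09-01",
    "Parameters": {
        "InstanceCount": {"Type": "Number", "Default": 2},
    },
    "Resources": {
        "WebServers": {
            "Type": "ALIYUN::ECS::InstanceGroup",  # assumed resource type
            "Properties": {"MaxAmount": {"Ref": "InstanceCount"}},
        },
    },
}

# Same template, different parameter values per environment (from the text).
ENVIRONMENTS = {
    "test": {"InstanceCount": 2},
    "production": {"InstanceCount": 20},
}


def resolve(node, params):
    """Recursively replace {"Ref": name} with the matching parameter value."""
    if isinstance(node, dict):
        if set(node) == {"Ref"} and node["Ref"] in params:
            return params[node["Ref"]]
        return {key: resolve(value, params) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, params) for item in node]
    return node


def render(template, overrides):
    """Return the template's resources with parameters filled in."""
    params = {
        name: spec.get("Default")
        for name, spec in template["Parameters"].items()
    }
    unknown = set(overrides) - set(params)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params.update(overrides)
    return resolve(template["Resources"], params)


if __name__ == "__main__":
    for env, values in ENVIRONMENTS.items():
        print(env, json.dumps(render(TEMPLATE, values)))
```

The template itself never changes between environments; only the parameter values do, which is what keeps the test and production deployments structurally identical.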
In practice, subtle differences between environments often lead to complex management, high costs, prolonged troubleshooting, and interference with the normal operation of your business. By using ROS for repeated deployment, you can standardize deployment environments, minimize the differences between them, and capture environment configurations in templates. A rigorous management process similar to code management can ensure standardized deployment practices.
Compared with other similar products, the native IaC service ROS by Alibaba Cloud provides better integration with other Alibaba Cloud services. Integration with Resource Access Management (RAM) provides unified authentication, eliminating the need to establish a separate user authentication system. Operations on all cloud products are called through APIs. In this way, you can use ActionTrail to review all O&M operations, including those on ROS.
ROS helps DingTalk quickly create the templates that describe required resources (such as ECS instances and database instances) in Alibaba Cloud to define the cluster architecture of DingTalk. ROS provides a visual editor that is used to create templates through drag-and-drop. After templates are created, ROS automatically creates and configures the template-described resources to implement IaC.
Upon receiving a stack creation request, ROS parses the template before creating the stack. Parsing includes a syntax check, parameter verification, and dependency analysis.
Dependency analysis aims to analyze the dependencies between resources for two purposes:
After templates are parsed, ROS creates resources one by one based on their roles in dependencies, which is similar to the state machine mechanism.
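The dependency analysis and ordered creation described above can be sketched as a topological sort. The following Python example is an illustration, not ROS's actual implementation: it assumes that dependencies are expressed through `{"Ref": name}` references and an optional `DependsOn` key, and the resource names and property names in `RESOURCES` are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def find_refs(node):
    """Yield every name referenced via {"Ref": name} anywhere inside node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                yield value
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def creation_order(resources):
    """Return resource names ordered so that dependencies come first."""
    graph = {}
    for name, spec in resources.items():
        deps = set(find_refs(spec.get("Properties", {})))
        depends_on = spec.get("DependsOn", [])
        if isinstance(depends_on, str):
            depends_on = [depends_on]
        deps.update(depends_on)
        # Keep only references to other resources (parameters are ignored).
        graph[name] = {dep for dep in deps if dep in resources and dep != name}
    # Raises CycleError if the resources depend on each other in a loop.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical resources: a server in a switch in a network, behind a balancer.
RESOURCES = {
    "Server": {"Properties": {"VSwitchId": {"Ref": "VSwitch"}}},
    "VSwitch": {"Properties": {"VpcId": {"Ref": "Vpc"}}},
    "Vpc": {"Properties": {"CidrBlock": "192.168.0.0/16"}},
    "LoadBalancer": {"Properties": {"VpcId": {"Ref": "Vpc"}}, "DependsOn": "Server"},
}

if __name__ == "__main__":
    try:
        print(creation_order(RESOURCES))
    except CycleError as err:
        print("circular dependency:", err)
```

Detecting a circular dependency at parse time, before any resource is created, is one reason to run this analysis up front rather than discovering the problem halfway through a deployment.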
Resource templates can be quickly and repeatedly deployed, especially in multi-region and multi-zone scenarios. This reduces the differences between environments, standardizes the deployment process and results, and reduces the system problems caused by environmental differences.
Alibaba Cloud Resource Orchestration Service (ROS) helped DingTalk scale up and deploy 100,000 ECS instances quickly and incrementally, with a hundredfold increase in efficiency, setting a record in Alibaba Cloud.
At present, ROS can scale up one cluster per minute on average and more than one million vCPUs every day. It is a tremendous project to reclaim and release millions of resources. ROS provides the one-click destruction function to automatically reclaim all resources in a cluster, which avoids troublesome operations and omissions.
Elasticity is the greatest strength of cloud computing and makes it possible to provide inclusive and convenient services. As Alibaba Cloud's native automatic orchestration and deployment service, ROS maximizes the elasticity of cloud computing and provides powerful support for DingTalk, helping to make it the most popular and streamlined online communication platform.
While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/supports-your-business-anytime