Let's Talk About Why DingTalk Didn't Crash

In this blog post, we'll discuss how DingTalk maintained high availability throughout the COVID-19 outbreak by using Alibaba Cloud networking solutions.

In order to win this inevitable battle and fight against COVID-19, we must work together and share our experiences around the world. Join us in the fight against the outbreak through the Global MediXchange for Combating COVID-19 (GMCC) program. Apply now at https://covid-19.alibabacloud.com/

By Network team

The recent COVID-19 outbreak has led many businesses to go online, and more and more organizations and consumers are carrying out their day-to-day tasks through online channels. This sudden shift has caused many popular websites and applications to crash under traffic spikes. But why hasn't DingTalk crashed despite the recent trend of working from home? Alibaba Cloud answered this question through its official Weibo account by sharing what its major work focus has been recently: "Helping scale out DingTalk for you."


Enabling Remote Work through Cloud Computing

We all know that since this Chinese New Year, there has been strong demand for working and attending classes from home, which has pushed peak online traffic for DingTalk-related systems up by hundreds of times. More than 200 million office workers at 10 million enterprises have started working online, and nearly 50 million students have taken online classes through DingTalk. DingTalk, which is built on Alibaba Cloud, has scaled out by tens of thousands of Elastic Compute Service (ECS) instances with ease, making it the most frequently used and most smoothly performing platform. In fact, scale-out is only part of the picture. Enterprise IT administrators also need to consider how to deal with such large traffic surges, how to build a high-availability business system, and how to keep that system secure. These challenges are nerve-wracking for enterprise IT administrators. As a former IT administrator and now a cloud product manager, I would like to say a few words here as well.

First of all, you need to build your business system based on the public cloud. I think everyone now agrees that the elasticity of the public cloud is the best solution for sudden traffic bursts. This is not a problem for customers whose business system is built on the cloud or has already been completely migrated to the cloud. The key is to choose a cloud service provider that has sufficient resources. If you have also deployed a business system in on-premises data centers, you need to at least connect it to the public cloud in hybrid cloud mode. Only in this way can you immediately leverage the elasticity of the public cloud during peak times. Public cloud service providers usually provide a variety of methods to build a hybrid cloud, such as common Express Connect circuits and VPNs. Alibaba Cloud also provides software-defined wide-area networking (SD-WAN), such as Smart Access Gateway (SAG), for you to connect to the cloud. We recommend that you use Express Connect circuits to build a hybrid cloud in high-traffic scenarios. This allows you to easily cope with traffic surges.


Importance of Deploying in Multiple Regions

We also recommend that you deploy your business system in multiple regions. This not only implements disaster recovery to improve reliability, but also makes full use of the larger elastic resource pool of the public cloud. The public cloud also greatly reduces the cost and complexity of deploying a business system in multiple regions. You can create VPCs in multiple regions to deploy your business system, and use products such as Cloud Enterprise Network (CEN) to connect the VPCs across regions. When network traffic suddenly increases, you can then elastically increase the cross-region interconnection bandwidth of CEN at any time to prevent crashes. Simply put, enterprises can instantly build their own core networks and scale them elastically at any time. This was unimaginable before the public cloud existed. To enable access control after multiple VPCs are connected, you can use the routing policy function of CEN. Once communication between the VPCs where the business system is deployed has been resolved, you need to consider how to process and schedule a large volume of traffic. For this, products such as Server Load Balancer (SLB) are the preferred choice.


Ensuring Data Consistency

The complexity of deploying a business system across regions lies in data synchronization or data consistency.

Considering the technical complexity of active geo-redundancy, I personally recommend that IT administrators make decisions based on their actual situations. For most enterprises, it may be more feasible to deploy the frontend system in multiple regions first. After all, the frontend system is usually the bottleneck under high traffic. Apart from multi-region deployment, you should also adopt multi-zone deployment within the same region wherever possible. This is also a disaster recovery technique: it not only improves reliability, but also expands the resource pool. Multi-zone deployment does not add much complexity, because Alibaba Cloud Virtual Private Cloud (VPC) provides cross-zone disaster recovery, and SLB supports traffic scheduling across zones. Note that although latency is low between different zones in the same region of Alibaba Cloud, the total latency may increase if the frontend system and the backend system are deployed in different zones and a request performs multiple cross-zone operations. In most cases, such an increase is not critical. However, you still need to take it into account in latency-sensitive scenarios and avoid multiple cross-zone calls as far as possible. For a large-scale business system, you also need to consider the capacity of the VPC. A VPC may need to accommodate hundreds of thousands of instances, such as ECS instances, Elastic Network Interfaces (ENIs), and containers.
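To see why repeated cross-zone calls matter, here is a minimal sketch of the arithmetic. It assumes the calls are made one after another, and every number in it is a made-up illustration, not a measured Alibaba Cloud figure. The function name and parameters are hypothetical.

```python
def request_latency_ms(processing_ms, cross_zone_rtt_ms, cross_zone_calls):
    """Estimate end-to-end latency for one request.

    Assumes the cross-zone calls run sequentially, so each call adds one
    full round trip between zones on top of the local processing time.
    """
    if cross_zone_calls < 0:
        raise ValueError("cross_zone_calls must not be negative")
    return processing_ms + cross_zone_rtt_ms * cross_zone_calls


# Hypothetical figures: 20 ms of processing and a 2 ms cross-zone round trip.
single_call = request_latency_ms(20, 2, 1)
chatty_request = request_latency_ms(20, 2, 10)

print(single_call, chatty_request)
```

With these assumed numbers, one cross-zone call barely registers, while ten sequential calls add as much latency as the processing itself. Batching calls or keeping tightly coupled components in the same zone keeps the multiplier small.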

Third, in terms of system architecture, you also need to consider...

Sorry, I am getting off topic. How is it that I can't stop talking about system architecture? It seems that I, yet again, am suffering from the IT administrator's occupational hazard. I just can't help it! Let's go back to where we were. For office applications such as DingTalk, users are distributed all over the world, where the network conditions may vary. Therefore, it is critical to improve the network access quality for these users. In particular, the network quality must be highly reliable in scenarios where video interactions are required. We can think about this issue from two aspects. On the one hand, select high-quality Internet bandwidth for the public cloud. We recommend that you use the Border Gateway Protocol (BGP) bandwidth that is known to all IT administrators. Public cloud service providers usually boast about their BGP bandwidth in terms of access through multiple carrier lines, low costs, 95th percentile billing, and high elasticity. However, I believe that the bandwidth is truly good only if it can withstand traffic bursts created by numerous visitors, regardless of how many carrier lines are connected and how good the quality is.
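For readers who have not met 95th percentile billing before, here is a minimal sketch of the common scheme: bandwidth is sampled at fixed intervals over the billing period, the top 5% of samples are discarded, and the highest remaining sample is billed. The function name, the nearest-rank rounding, and the sample values are assumptions for illustration. Check your provider's billing documentation for the exact rules it applies.

```python
def billable_bandwidth_95th(samples_mbps):
    """Return the billable bandwidth under a 95th percentile scheme.

    Uses the nearest-rank method: sort the samples, then take the value
    at rank ceil(0.95 * n). The top 5% of samples are ignored.
    """
    if not samples_mbps:
        raise ValueError("need at least one bandwidth sample")
    ordered = sorted(samples_mbps)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical period of 100 samples: a 100 Mbps baseline with short bursts.
short_bursts = [100] * 95 + [900] * 5
long_bursts = [100] * 94 + [900] * 6

print(billable_bandwidth_95th(short_bursts))
print(billable_bandwidth_95th(long_bursts))
```

In this made-up data, bursts that fit inside the top 5% of samples do not affect the bill, while bursts that last slightly longer push the billable rate up to the burst level.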


Ensuring Network Quality

Imagine a service provider whose own services are not sensitive to network quality and that has no high-quality BGP bandwidth of its own. How can such a provider deliver high-quality bandwidth to its users? On the other hand, you must make sure that the cloud service provider can guarantee the promised specifications during peak times. A service provider that gives no guarantees cannot be trusted.

As far as I know, to scale out DingTalk, Alibaba has prepared more than 4 Tbps of high-quality BGP bandwidth. This is the same BGP bandwidth that Taobao and Tmall use. In addition, Shared Bandwidth is used for bandwidth management. As a powerful tool for top-tier customers, it simplifies the management of large numbers of public IP addresses and provides guarantees for ultra-high peak bandwidth.

Here I'm getting off topic again. As I just mentioned, for office applications, you need to choose reliable BGP bandwidth and products from a reliable cloud service provider. Next, I would like to tell you that it is also critical to use clients or acceleration products to accelerate access. Generally, a powerful public cloud service provider has great resource advantages. It deploys its resources in many regions around the world, provides services globally, and connects these regions through a network, forming a core network with global coverage. Based on this global network, the service provider develops acceleration products by leveraging its own R&D technologies. Enterprise office service providers can make better use of such products to improve network access quality for their users.


Global Acceleration and SAG App

I recently heard that Alibaba Cloud is going to launch a new version of Global Acceleration that uses many advanced technologies. But I don't want to spoil the surprise, so head over to the official product page to learn more!

Lastly, remote O&M and work-on-the-go are demanding for IT administrators. For these scenarios, I would like to recommend the Alibaba Cloud Smart Access Gateway App (SAG App).

The SAG App allows you to securely and directly connect to the VPC that you have deployed on Alibaba Cloud. Enterprise employees can easily implement secure work-on-the-go and remote O&M by using the SAG App.

The SAG App is expected to support all major operating systems, including Windows, Android, iOS, and macOS, and many more advanced features are due to launch soon.

While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/fight-coronavirus-covid-19
