What Would You Do If Your App Crashes during the Coronavirus Outbreak?

In order to win this inevitable battle and fight against COVID-19, we must work together and share our experiences around the world. Join us in the fight against the outbreak through the Global MediXchange for Combating COVID-19 (GMCC) program. Apply now at https://covid-19.alibabacloud.com/

By Ding Jie, nicknamed Yanshun at Alibaba.

What can ordinary people do in the long battle against the coronavirus? In short, the most important contribution most people can make is simply to stay at home. Some people pass the time by playing video games for hours, binge-watching online video, or browsing their favorite apps. But suppose you run one of those systems and your app crashes under the surge of traffic from everyone using it at home. That could be devastating to your business, which is why planning to protect your application is so important.

Nowadays we rely on some very talented programmers to keep our favorite online games, movies, and apps up and running. But as online traffic surges, as it has throughout this pandemic, many online applications face challenges they have never faced before, with traffic far exceeding normal levels.

To stop your app from crashing under the pressure of high traffic loads and sudden traffic spikes, you need to take the necessary measures to protect it. One of the most important measures is simply determining whether your company's current IT architecture can actually support future business growth. Another is to consider investing in cloud computing and taking advantage of the elasticity it offers.

Of course, it's important to know that not all cloud vendors offer the same package. Here at Alibaba Cloud, we have built up a technical system that has high concurrency, high availability, and support for massive traffic spikes—that's how we're able to support the massive traffic spikes seen every year on Double 11. At Alibaba Cloud, we have enabled several of our customers to get ready for whatever the future may bring. One of these companies is Luogic, which has an entertaining and informative app in China. We helped them implement end-to-end stress testing for their systems.

In this article, we're going to summarize some actual cases where high user volumes and sudden traffic spikes caused difficulties, and then we will enlist the help and practical advice of some of Alibaba's top engineers who were behind Alibaba's high-availability architecture. We hope that this article can inspire enterprises that are currently coping with higher than normal traffic on their web applications.

Why Did Your Application Crash?

Complex Server Environment

Normally, apps themselves are relatively stable, so crashes usually occur on servers or in the cloud, rather than in an app itself. Therefore, server environment complexity is a major concern in this case.

Let's look at a mature cloud-based architecture. At Alibaba Cloud, when it comes to building an online service, you have access to about 200 cloud computing products for building your enterprise's IT infrastructure, security, and application systems. In addition, accessing the server from a client, such as an app or PC, involves many key nodes, such as content delivery networks, dynamic accelerators, anti-DDoS services, application firewalls, layer-4 and layer-7 load balancers, frontend and backend service sets, caches, database storage, middleware, and the infrastructure layer. The system can be complex. For example, there are five product specifications that affect traffic during load balancing, and the scale of backend services is even more complicated and difficult to evaluate. A problem in any of these nodes can lead to your service being unavailable, making end users think that the app has crashed. The same problems impact private cloud, hybrid cloud, and on-premises data center systems.
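
To make the point about key nodes concrete, here is a rough sketch of how end-to-end availability shrinks as a request passes through more nodes. It assumes every node must work for the request to succeed and that node failures are independent; the node names and availability figures are illustrative assumptions, not measurements of any real product.

```python
from math import prod

# Hypothetical per-node availabilities for one request path; the figures are
# illustrative assumptions, not measurements of any real product.
NODE_AVAILABILITY = {
    "cdn": 0.999,
    "anti_ddos": 0.999,
    "load_balancer": 0.999,
    "backend_service": 0.995,
    "cache": 0.999,
    "database": 0.995,
}


def chain_availability(availabilities):
    """Availability of a request that needs every node in the chain to work.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(availabilities)


if __name__ == "__main__":
    total = chain_availability(NODE_AVAILABILITY.values())
    print(f"end-to-end availability: {total:.4f}")
```

Under these assumptions the end-to-end figure is always lower than that of the weakest single node, which is why each additional node on the path deserves its own testing and protection.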

All enterprises need to find a way to effectively and comprehensively test the throughput of their servers, spot all potential problems, and even plan for future scaling capacity. Among such efforts, throttling and scheduling during peak traffic periods is an approach that all companies must consider.

Not Doing the Relevant Planning

If you do not plan the service capabilities and key nodes of your app in advance and do not have any online measures for emergencies, such as elastic scaling, online protection, and fault tolerance, it is difficult to ensure the stability of the core interface of your system in the case of spikes in traffic. Once their app crashes, many enterprises cannot take appropriate measures, and rapid capacity expansion will not solve the problem and can even lead to additional problems.

Beyond crashes caused by gaps in problem detection, capacity planning, throttling, and fault-tolerant degradation, there are hidden O&M risks that involve the fault impact surface, configuration consistency, monitoring and root cause analysis tools, and the high availability of complex personnel organizations. If enterprises do not conduct proper drills and develop verification solutions, their apps can crash at critical times.

Advice on Building a High-availability Architecture

Next, we will look at some high-availability architecture construction practices based on the extensive experience of Alibaba engineers.

Architecture Design

When it comes to building a high-availability architecture, the first thing we should do is implement architecture visualization. With the architecture awareness of Application High Availability Service (AHAS), you can fully understand the cloud system architecture and intuitively display the hierarchical dependencies among cloud resources, containers, and applications. Servers, storage, and networks are the infrastructure of modern cloud platforms. With the popularization of cloud strategies, more and more enterprises are building their businesses, services, and systems on cloud platforms.

The diversity of open-source software and cloud services, the heterogeneity of development languages, and the organizational and capability differences of enterprise IT teams make it very difficult to formulate standards. This led to the development of the AHAS architecture awareness function. This function captures process-level call relationships by collecting and analyzing operating systems and third-party standard interfaces. Then, it uses a feature library algorithm to identify the technical components used by processes. Finally, it visually presents the application architecture, showing its servers, containers, and processes. This gives you a clear and comprehensive map of your cloud architecture. Based on this basic view, AHAS can derive multi-dimensional architectural views of cloud resources, containers, and application architectures as well as scenario-based views, such as site migration, restructuring, and asset management. This real CMDB visualization facilitates problem detection to enable business growth and allow you to take advantage of more of the benefits of the cloud.
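
As a rough illustration of what an architecture map is built from, the sketch below turns a list of process-level call relationships into a dependency map and assigns each process a layer. The process names and edges are hard-coded assumptions, and the functions are hypothetical helpers for this example; this is not how AHAS collects or analyzes data.

```python
from collections import defaultdict

# Hypothetical process-level call edges (caller, callee). In the article these
# relationships are collected by AHAS; here they are hard-coded assumptions.
CALL_EDGES = [
    ("nginx", "order-service"),
    ("nginx", "user-service"),
    ("order-service", "redis"),
    ("order-service", "mysql"),
    ("user-service", "mysql"),
]


def build_dependency_map(edges):
    """Return {process: sorted list of processes it calls}."""
    deps = defaultdict(set)
    for caller, callee in edges:
        deps[caller].add(callee)
        deps[callee]  # touch the key so leaf processes also appear in the map
    return {name: sorted(callees) for name, callees in deps.items()}


def layer_of(deps):
    """Assign layer 0 to entry points, then hops from the nearest entry point."""
    called = {c for callees in deps.values() for c in callees}
    layers = {name: 0 for name in deps if name not in called}
    frontier = list(layers)
    while frontier:
        next_frontier = []
        for name in frontier:
            for callee in deps[name]:
                if callee not in layers:  # first visit is the shortest path
                    layers[callee] = layers[name] + 1
                    next_frontier.append(callee)
        frontier = next_frontier
    return layers


if __name__ == "__main__":
    dependency_map = build_dependency_map(CALL_EDGES)
    for process, layer in sorted(layer_of(dependency_map).items(), key=lambda kv: kv[1]):
        print(layer, process, "->", dependency_map[process])
```

Even a small map like this shows which storage components are shared by several services and would therefore affect the most callers if they failed.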

Concerning the governance of strong and weak dependencies, the mere presence of a strong dependency means that you have tied the stability of one component to the stability of another. Once you have embedded the AHAS SDK, when the maximum throughput of the platform reaches a bottleneck, you can, in addition to throttling peak traffic for portal or web applications, smoothly disable services previously labeled as weak dependencies to free up more resources and preserve core computing capabilities. This also removes the impact of non-core services on core services. Ultimately, you can achieve a balance between business performance and cost through reasonable and efficient service degradation. When using the AHAS SDK, you only need to consider how to define resources in your code. This means you simply need to identify the methods and code blocks that need to be protected, rather than finding a way to protect these resources. Then, you can add rules to protect resources. The added rules take effect immediately.
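
The split between defining resources and adding rules can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the `protect` decorator, `set_rule` function, and rule format are hypothetical names invented here and are not the AHAS SDK or Sentinel API. A weak dependency gets a fallback so it can be switched off at runtime, while the core path keeps running.

```python
import functools

# Rule table keyed by resource name. Everything here is a hypothetical
# stand-in for illustration; it is not the AHAS SDK or its rule format.
RULES = {}


class Degraded(Exception):
    """Raised when a protected resource is switched off and has no fallback."""


def set_rule(resource, *, enabled):
    """Add or update a rule; it applies to the very next call."""
    RULES[resource] = {"enabled": enabled}


def protect(resource, fallback=None):
    """Mark a function as a named resource that rules can switch off."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rule = RULES.get(resource, {"enabled": True})
            if not rule["enabled"]:
                if fallback is not None:
                    return fallback(*args, **kwargs)
                raise Degraded(resource)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@protect("recommendations", fallback=lambda user_id: [])
def load_recommendations(user_id):
    # Weak dependency: nice to have, safe to drop under pressure.
    return [f"item-for-{user_id}"]


@protect("place_order")
def place_order(user_id, sku):
    # Strong dependency: core business path, given no fallback here.
    return {"user": user_id, "sku": sku, "status": "created"}


if __name__ == "__main__":
    print(load_recommendations(7))                 # normal operation
    set_rule("recommendations", enabled=False)     # degrade the weak dependency
    print(load_recommendations(7), place_order(7, "sku-1"))
```

The application code only names the resources. Whether a resource is served, degraded, or rejected is decided by the rule table, which can change while the process is running.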

Capacity Planning

Now, let's discuss some points involved with capacity planning.

Simulated public network stress testing: You can use Performance Testing Service (PTS) to efficiently and quickly simulate business traffic of the same model and magnitude. This service is 100% compatible with popular open-source JMeter scripts. If no script is available, you can use the visual interaction feature developed by PTS for zero-code orchestration. After orchestration is completed, simulated Internet traffic is initiated from the regional Internet carrier to simulate specific business scenarios. This allows you to comprehensively verify and detect bottlenecks and problems anywhere in the cloud or on-premises architecture, including network access, application services, the storage layer, and infrastructure.
End-to-end stress testing: Going a step further, if you want to accurately measure your business capacity in the production environment, you can use the PTS solution to enable the production environment to identify the stress testing traffic and route it to the specified shadow storage area. You'll need to prepare the shadow storage area and then perform business traffic stress testing by using basic data of the same scale in the same production environment. This allows you to precisely evaluate the online production environment. As such, streaming data for stress testing will be isolated and therefore also easily cleaned and managed.
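
The idea of routing marked stress-testing traffic to a shadow storage area can be sketched as follows. This is a simplified illustration, not the PTS mechanism: the header name `x-stress-test` and the `OrderStore` class are assumptions made for this example, and in-memory dictionaries stand in for real production and shadow tables.

```python
# Header name and storage layout are assumptions made for this sketch;
# PTS and the application define the real marker and shadow storage.
STRESS_TEST_HEADER = "x-stress-test"


class OrderStore:
    """Keeps production data and shadow (stress-test) data apart."""

    def __init__(self):
        self.production = {}
        self.shadow = {}

    def _table_for(self, headers):
        # Requests carrying the marker are routed to the shadow area.
        is_test = headers.get(STRESS_TEST_HEADER, "").lower() == "true"
        return self.shadow if is_test else self.production

    def save(self, headers, order_id, payload):
        self._table_for(headers)[order_id] = payload

    def clear_shadow(self):
        """Cleanup after a test run touches only the shadow area."""
        self.shadow.clear()


if __name__ == "__main__":
    store = OrderStore()
    store.save({}, 1, "real order")
    store.save({STRESS_TEST_HEADER: "true"}, 2, "synthetic order")
    store.clear_shadow()
    print(store.production, store.shadow)
```

Because the routing decision is made at the storage boundary, the same production code path handles both kinds of traffic, and test data can be removed without touching real records. In a real system the marker would also need to be propagated across service calls and message queues.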

Business Monitoring

In the face of complex application environments and rapidly growing business, Application Real-Time Monitoring Service (ARMS) can help you quickly build a complete monitoring system in various environments. This allows you to implement end-to-end monitoring from pages to databases and from application performance to infrastructure resources. By using ARMS, you can reduce troubleshooting time, the costs of cross-department communication, and the losses caused by faults and poor user experience.

Online Management

New and existing applications can use the AHAS agent, which is a lightweight solution, for strong traffic control during peak hours and load shifting for message scenarios without having to modify application code. For complicated structures, unstable elements inside and outside the system can be quickly degraded to maintain business stability. In addition, a single-machine overload protection function is available, which dynamically adjusts inbound traffic based on the response time. Even when there is no time to stress test the system or you do not know how to configure the rules, single-machine intelligent overload protection can fill the gap. All of the preceding solutions can be introduced and controlled during app runtime and O&M. You can use the lightweight solution provided by the AHAS switch module to manage online configuration items and business attribute values in a secure and unified manner. This feature will be available soon.
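
One common way to adjust inbound traffic based on response time is an additive-increase, multiplicative-decrease limiter. The sketch below shows that general technique under assumed thresholds; the `AdaptiveLimiter` class is hypothetical and is not how AHAS implements its overload protection. It is single-threaded for brevity, so a real service would need locking around the counters.

```python
class AdaptiveLimiter:
    """Adjusts how many requests may run at once based on observed latency.

    An additive-increase / multiplicative-decrease sketch; the thresholds are
    assumptions and this is not how AHAS implements its protection.
    """

    def __init__(self, target_rt_ms=200.0, min_limit=1, max_limit=100, start=10):
        self.target_rt_ms = target_rt_ms
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.limit = start
        self.in_flight = 0

    def try_acquire(self):
        """Admit the request if there is room, otherwise reject it."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self, response_time_ms):
        """Record a finished request and adapt the limit to its latency."""
        self.in_flight -= 1
        if response_time_ms > self.target_rt_ms:
            # Slow response: back off quickly by halving the limit.
            self.limit = max(self.min_limit, self.limit // 2)
        else:
            # Healthy response: probe for more capacity one slot at a time.
            self.limit = min(self.max_limit, self.limit + 1)


if __name__ == "__main__":
    limiter = AdaptiveLimiter(start=4)
    for rt in (50, 500, 500, 50):
        if limiter.try_acquire():
            limiter.release(rt)
        print(f"response time {rt} ms -> concurrency limit {limiter.limit}")
```

The appeal of this kind of control is that it needs no prior stress-test result: the limit converges on whatever concurrency the machine can sustain while keeping response time near the target.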

Routine Inspection

Early risk exposure: Comprehensive inspection and risk identification for major cloud resources are carried out through Intelligent Advisor, an intelligent consultant. The rules are based on the experience of our Technical Account Management (TAM) team in customer-oriented technical systems and the integration of site reliability engineering (SRE) best practices from the Alibaba ecosystem. Based on the preceding architecture map and user input, you can conduct deeper inspections at the application or business architecture level and receive appropriate recommendations.

Regular Drills

The AHAS fault drill module follows the principles of chaos engineering experiments and integrates Alibaba's internal practices. Based on this, you can build a highly visual fault drill system with a complete set of processes. This allows you to easily orchestrate and customize infrastructure resources, application services, container services, and cloud platforms in a multi-dimensional manner. It also provides a wide range of proven fault experience libraries. This can help you improve the high availability of your architecture, business, and personnel. Fault drills are very important in scenarios such as dependency management, business continuity improvement, and fault correction verification.
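
At its smallest scale, a fault drill injects a failure into a dependency and checks that the caller still behaves acceptably. The sketch below illustrates that loop with a toy injector; the `FaultInjector` class and helper functions are hypothetical and are not the AHAS or ChaosBlade API, and the failure rate and seed are arbitrary assumptions.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a configurable share of calls fail.

    A toy stand-in for a fault drill; it is not the AHAS or ChaosBlade API.
    """

    def __init__(self, func, failure_rate, seed=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a drill can be repeated

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def call_with_fallback(func, fallback_value):
    """The behaviour under test: survive a dependency failure."""
    try:
        return func()
    except ConnectionError:
        return fallback_value


def run_drill(calls=1000, failure_rate=0.3, seed=42):
    """Return how many calls were served from the fallback during the drill."""
    flaky = FaultInjector(lambda: "fresh", failure_rate, seed)
    results = [call_with_fallback(flaky, "cached") for _ in range(calls)]
    return results.count("cached")


if __name__ == "__main__":
    print("calls served from fallback:", run_drill())
```

A drill passes when every call returns either a fresh or a fallback result and none raises. Running the same experiment against real infrastructure is what dedicated fault drill tooling is for.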

Tool List

1. AHAS

AHAS is a cloud tool that improves the high availability of applications. It provides automatic detection of application architecture, fault-injection HA evaluation, and one-click application throttling and degradation, which allow users to quickly enhance application availability in a cost-effective manner.

2. PTS

Performance Testing Service (PTS) is a cloud-based testing tool designed for all technical personnel. It provides various features such as online performance testing, API debugging, and business monitoring. You can use the built-in features as well as open-source features compatible with PTS to simulate any type of workload. The simulation can be performed at any time suitable for your business and removes sophisticated preparation or high maintenance costs. In addition, PTS provides high-availability monitoring and throttling capabilities, helping you test and manage your business performance in an efficient way.

3. Intelligent Advisor

By looking at the customer's situation and drawing on proven Alibaba Cloud best practices and the core capabilities of the Technical Account Management (TAM) service system, Intelligent Advisor provides users with comprehensive diagnosis and optimization suggestions regarding cloud resources, application architectures, business performance, and security. Currently, more and more Alibaba Cloud native customers can easily access professional TAM services through Intelligent Advisor, allowing them to make better use of the cloud. Intelligent Advisor also allows us to provide in-depth TAM services to customers with relevant needs.

4. Enterprise-level High-availability Architecture Solution

Our high-availability technology system originated in Alibaba's e-commerce business and has been tested under peak traffic conditions during the Double 11 Shopping Festival as well as for routine stability. This solution serves the entire Alibaba ecosystem and is now open to external enterprise customers. It provides enterprises with support for marketing activities and overall cost control, including end-to-end stress testing, capacity planning, throttling, and scheduling, as well as emergency response capabilities, which include switches and contingency plans. Finally, it also includes disaster recovery and avoidance capabilities, such as architecture awareness, fault drills, multi-active geo-redundancy, and unitization.

5. ChaosBlade

ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments and is based on Alibaba's practical experience in fault testing and drills over the past decade. It integrates the best ideas and practices of various businesses in the Alibaba Group to provide a wide range of fault scenario implementations. In this way, it helps distributed systems improve their fault tolerance and recovery capabilities.

6. Sentinel

Sentinel is a lightweight throttling framework that helps you protect the stability of your services in a variety of ways, such as traffic throttling, fault tolerance, and system load protection.

While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/supports-your-business-anytime
