Five Conditions and Six Lessons for Elastic Scaling

Preface

Elastic scaling is one of the core technology dividends of the cloud computing era, but in the IT world no system capability can be applied to every scenario without thought. In this article, we systematically review the problems customers of Enterprise Distributed Application Service (EDAS) have run into when designing system architectures for elastic scenarios, and distill them into five conditions and six lessons to share with you.

Five Conditions

1. Start without manual intervention

Whether manual intervention is required is the essential difference between elastic scaling and manual scaling. In traditional application operations, starting a process often requires a series of manual preparations on the machine: building the environment, sorting out the configuration of dependent services, adjusting the local environment, and so on. On the cloud, it may additionally involve manually adjusting security group rules or access control for dependent services. All of these manual steps become infeasible once scaling is automatic: every step of bringing an instance into service must be scripted and driven by injected configuration.
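One common way to eliminate per-machine manual steps is to have the process assemble its entire configuration from injected environment variables at startup. The sketch below is a minimal illustration of that idea; the variable names and defaults are hypothetical, not part of any EDAS API.

```python
import os

def load_config(env=os.environ):
    """Build the full runtime configuration from the environment.

    Everything the process needs at startup (endpoints, ports,
    tuning knobs) comes from variables injected by the platform,
    never from files edited by hand on the machine, so a freshly
    scaled-out instance can start with zero human involvement.
    """
    return {
        "db_url": env.get("DB_URL", "postgres://localhost:5432/app"),
        "cache_url": env.get("CACHE_URL", "redis://localhost:6379"),
        "http_port": int(env.get("HTTP_PORT", "8080")),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```

With this shape, the scaling platform only has to set environment variables on the new instance; no one logs in to edit configuration files.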

2. The process itself is stateless

Strictly speaking, statelessness concerns the data a business system depends on at runtime. A process generates data as it executes, and that data can influence subsequent program behavior. When writing code, a programmer should ask: if the system is restarted in a fresh environment, will this data cause the behavior to become inconsistent? The recommended practice is to keep authoritative data in an external storage system, so that storage and compute are truly separated.
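The separation can be sketched with a hypothetical shopping-cart service that keeps nothing in process memory. A plain dict stands in for a real shared store such as Redis or a database, purely so the example is self-contained.

```python
class ExternalStore:
    """Stand-in for a shared store such as Redis or a database.
    In production this would be a network client; a dict is used
    here only to keep the sketch self-contained."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class CartService:
    """Keeps no cart data in process memory: every read and write
    goes through the external store, so a replacement instance
    started by the autoscaler sees exactly the same state."""
    def __init__(self, store):
        self.store = store
    def add_item(self, user_id, item):
        cart = self.store.get(f"cart:{user_id}") or []
        self.store.put(f"cart:{user_id}", cart + [item])
    def items(self, user_id):
        return self.store.get(f"cart:{user_id}") or []
```

Because no instance owns the data, a scale-in event that kills one process and a scale-out event that starts another are both invisible to the business logic.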

3. Start quickly and exit with "dignity"

Elasticity on the cloud is characterized by how frequently it happens, and bursty traffic in particular carries uncertainty. A freshly started instance is typically in a "cold" state, so warming it up quickly after startup is the key to making elasticity effective. When the burst subsides, automatic scale-in usually follows; since that process is also automatic, we must be able to remove traffic from an instance automatically before it exits. "Traffic" here includes not only HTTP/RPC calls but also messages and scheduled tasks (background thread pools).
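Both halves of the lifecycle, warming up before taking traffic and draining before exit, can be expressed as a single readiness gate. The class below is an illustrative sketch under assumed semantics (a load balancer that honors `ready()`, and a SIGTERM handler that calls `drain()`); it is not a real framework API.

```python
import threading

class LifecycleGate:
    """Sketch of startup warm-up plus graceful drain.

    ready() is meant to gate the load balancer: traffic is admitted
    only after warm-up completes, and refused again once drain
    begins, so in-flight work can finish before the process exits.
    """
    def __init__(self):
        self._ready = False
        self._in_flight = 0
        self._lock = threading.Lock()

    def warm_up(self, preload):
        preload()  # e.g. pre-fill local caches, open connection pools
        self._ready = True

    def begin_request(self):
        """Admit a request only when the instance is warm and not draining."""
        with self._lock:
            if not self._ready:
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._lock:
            self._in_flight -= 1

    def drain(self):
        """Called on shutdown: stop admitting new work and report
        how many requests are still in flight."""
        self._ready = False
        with self._lock:
            return self._in_flight
```

The same gate would need analogues for message consumers and task schedulers, since those channels also carry traffic that must be drained.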

4. Disk data can be lost

During startup, an application may read some startup dependencies from disk; while running, we habitually write logs or record data there as well. But in an elastic scenario the instance can be reclaimed at any moment, and whatever was on its disk disappears with it, so we must be prepared for disk data to be lost. Some people may ask: what about logs? Logs should be shipped by a log collection component to a central service for unified aggregation, cleaning, and viewing. This is also one of the points emphasized by the twelve-factor app methodology.
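The twelve-factor treatment of logs is to write them to the standard output stream and let a collector agent ship them onward. A minimal sketch with Python's standard `logging` module (the logger name and format are arbitrary choices for the example):

```python
import io
import logging
import sys

def make_logger(stream=None):
    """Log to stdout instead of a local file, so nothing is lost
    when the instance (and its disk) is reclaimed; a collector
    agent can then ship the stream to central aggregation."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on re-init
    handler = logging.StreamHandler(stream or sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The same principle applies to any other data the process writes locally: treat the disk as a cache at best, never as the system of record.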

5. Dependent services are fully available

A large business system is rarely the work of a single application; the typical architecture also relies on central services such as caches and databases. After one business component scales out, it is easy to overlook whether those shared dependencies can still keep up. If a dependent service becomes unavailable, the result can be an avalanche across the entire system.
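A simple back-of-envelope check makes the risk concrete: if each instance opens a fixed connection pool against a shared database, scaling out multiplies the connection count. The numbers below are illustrative, not from the article.

```python
def max_safe_replicas(db_max_connections, pool_size_per_instance,
                      reserved=10):
    """Upper bound on replicas before a shared database runs out of
    connections. `reserved` leaves headroom for admin/maintenance
    sessions. All numbers here are illustrative assumptions."""
    return (db_max_connections - reserved) // pool_size_per_instance
```

A database capped at 500 connections, with each instance opening a pool of 20, supports at most 24 replicas; an autoscaler allowed to go higher would take the database down for every consumer, not just the scaled service. Capping `maxReplicas` below this bound (or load-testing the dependency) avoids the avalanche.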

Six Lessons

1. Unreasonable metric threshold settings

End to end, elasticity has three stages: metric collection, rule evaluation, and scaling execution. Metrics are generally obtained from the monitoring system or from components of the PaaS platform; common basic metrics include CPU, memory, and load. Over short windows these basic metrics are unstable, but over a longer window they normally settle into a "steady" state. When setting thresholds, we therefore cannot use short-term behavior as the basis; only by referring to water-level data over a longer period can a reasonable value be set. Moreover, there should not be too many metrics, and there should be a clear numerical gap between the scale-in threshold and the scale-out threshold.
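The rule-evaluation stage can be sketched with the proportional formula Kubernetes' Horizontal Pod Autoscaler uses, plus a dead band between the two thresholds to prevent flapping. The 50/70/30 values are illustrative assumptions, not recommendations from the article.

```python
import math

def desired_replicas(current, metric_value, target):
    """HPA-style rule: scale proportionally to how far the observed
    metric is from its target, never below one replica."""
    return max(1, math.ceil(current * metric_value / target))

def decide(current, cpu, target=50, scale_out_at=70, scale_in_at=30):
    """Act only outside the dead band between the scale-out and
    scale-in thresholds, so short-lived spikes and dips in the
    metric do not cause the replica count to flap."""
    if cpu >= scale_out_at or cpu <= scale_in_at:
        return desired_replicas(current, cpu, target)
    return current  # inside the dead band: hold steady
```

Run against a steady 50% CPU, the rule holds the replica count; a sustained spike to 90% roughly doubles it, and a drop to 20% halves it.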

2. Using "latency" as a metric

In many cases, our rough judgment of system availability is whether the screen is "spinning in circles", that is, whether the system is slow, and it seems reasonable to infer that scale-out should follow. So some of our customers use the system's average RT directly as a scale-out metric. But a system's RT is multi-dimensional: health checks, for example, are usually very fast, and because such APIs are called frequently, they pull the average down. Some customers refine the metric to the API level, but even a single API behaves differently depending on its parameters, producing different RTs. In short, building scaling strategies on latency is very dangerous.
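A small numerical sketch shows how fast health checks mask slow user requests. With 90 health-check responses at 2 ms and 10 user requests at 1500 ms (made-up numbers), the average looks healthy while the tail is terrible.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def percentile(xs, p):
    """Nearest-rank percentile; crude, but enough for illustration."""
    xs = sorted(xs)
    k = max(0, math.ceil(p / 100 * len(xs)) - 1)
    return xs[k]

# 90 fast health checks drown out 10 painfully slow user requests
rts_ms = [2] * 90 + [1500] * 10
```

Here the mean is about 152 ms, comfortably under a hypothetical 200 ms scale-out threshold, so no scaling happens even though every real user is waiting 1.5 seconds. If latency is monitored at all, tail percentiles per endpoint are the signal, and even those are better treated as alerts than as direct scaling triggers.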

3. Specifying a single instance specification

The expansion specification refers to the resource specification. In a cloud scenario, for the same 4c8g shape we can choose memory-optimized, compute-optimized, network-enhanced types, and so on. But the cloud is one large resource pool, and a given specification can sell out. If we pin the scaling group to a single specification, there will be moments when resources cannot be provided and scale-out fails. The most dangerous part is not the scale-out failure itself, but the troubleshooting scramble after the business has already failed.
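The mitigation is to give the scaling group an ordered list of acceptable specifications and fall through on stock-outs. The sketch below assumes a `try_create` callable standing in for the cloud API, and the spec names are hypothetical examples, not a statement about any provider's inventory.

```python
def provision(specs_in_preference_order, try_create):
    """Attempt each acceptable instance spec in turn instead of
    pinning the scaling group to a single one. `try_create` stands
    in for the cloud provisioning call and returns an instance id,
    or None when that spec's pool is sold out."""
    for spec in specs_in_preference_order:
        instance = try_create(spec)
        if instance is not None:
            return spec, instance
    # Only now is it a real failure worth alerting on.
    raise RuntimeError("all candidate specs exhausted")
```

Equally important is that the final failure raises an alert rather than being swallowed, so the team learns about it before the business does.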

4. Only considering the scaling strategy on the RPC link

Sorting out a single application is usually simple; sorting out the whole business scenario is hard. A practical way to organize the work is to follow the inter-application call paths. Calls between applications generally fall into three types: synchronous (RPC, middleware such as Spring Cloud), asynchronous (messages, middleware such as RocketMQ), and tasks (distributed scheduling, middleware such as SchedulerX). We usually sort out the first type quickly but easily neglect the latter two, and when problems surface there, troubleshooting and diagnosis are the most time-consuming.
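A scale-in procedure that covers all three traffic types might drain them in a deliberate order. The sketch below assumes three hypothetical interfaces (an HTTP/RPC server, a message consumer, a task scheduler); none of these method names come from a real middleware API.

```python
def shutdown(http_server, mq_consumer, scheduler, drain):
    """Illustrative scale-in sequence covering all three traffic
    types: deregister synchronous (RPC/HTTP) traffic first, then
    stop pulling messages, then pause scheduled-task intake, and
    only exit after in-flight work has drained."""
    order = []
    http_server.deregister()    # 1. leave the RPC/HTTP registry
    order.append("rpc")
    mq_consumer.stop_polling()  # 2. stop fetching new messages
    order.append("mq")
    scheduler.pause()           # 3. stop accepting scheduled tasks
    order.append("task")
    drain()                     # 4. wait for in-flight work to finish
    return order
```

Forgetting step 2 or 3 is exactly the lesson here: the instance leaves the RPC registry cleanly, yet messages and tasks are still being pulled when the process is killed, and those half-processed items are the hardest failures to diagnose afterwards.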

5. No corresponding visualization strategy

Elastic scaling is a typical background task. When managing background tasks across a large cluster, a dashboard for intuitive visual management pays off. A scale-out failure must never be swallowed silently: if a core service fails to scale out, it can turn directly into a business outage. Yet when an outage actually happens, few people think to check whether the scaling policy took effect, and if the outage was in fact caused by scaling, that root cause is very hard to pin down without visibility.

6. No proper assessment beforehand

Although cloud computing offers a near-inexhaustible resource pool for elasticity, it only relieves users of the work of preparing resources. A microservice system is complex in itself, and a capacity change in one component affects the whole call chain; once one risk is removed, the system's bottleneck may migrate, and hidden constraints gradually surface as capacity changes. So most of the time, a scaling strategy cannot be designed by intuition alone: full-link load testing and verification are needed, with drills that converge on an elastic configuration suitable for the system as a whole. We also suggest understanding the various techniques of high availability across multiple dimensions in advance and preparing multiple sets of contingency plans.

Conclusion

In cloud-native scenarios, elasticity capabilities are richer, and the metrics available for scaling are more open to business customization. An application PaaS platform (such as Enterprise Distributed Application Service EDAS or Serverless App Engine SAE) can combine a cloud vendor's underlying compute, storage, and network capabilities to reduce the cost of using the cloud. It does place some demands on business applications (statelessness, decoupling configuration from code, and so on); viewed more broadly, this is the challenge the cloud-native era poses to application architecture. But the more native an application becomes, the closer the technology dividend of the cloud comes to us.
