Technical Stability Assurance in Practice on the Public Cloud

1、 Infrastructure Governance

How should infrastructure be governed, and how can that governance improve the stability of the foundation?

In the early stage, we focused on two things. First, we implemented service-oriented governance in a way that fit our own situation. Second, we looked for a way to let developers release conveniently and efficiently under controlled, safeguarded conditions, so that even a release during peak hours carries no unforeseen risk.

Our early approach was simple: support rapid development and rapid delivery. Because our R&D engineers work efficiently, the code for a business requirement could be finished one day and be online for testing the next. This supported the fast growth of the business, but the same speed also introduced hidden risks. Once our business, data, and service scale had all grown into the millions, some of the problems caused by moving fast became increasingly prominent.

First, the services on the business link were unreliable. The boundary between critical and non-critical services was unclear and the two depended on each other, so a non-critical service could easily bring down the entire main chain.

Second, some core services were bloated and difficult to maintain.

Third, services had weak self-healing ability, and the system could easily be paralyzed by a sudden surge of several times the peak traffic.

Fourth, troubleshooting was inefficient and the emergency response to faults took a long time. This was the most difficult and darkest period for us.

So we immediately started on service-oriented governance, with the goal of bringing every business service into a governable state and making the entire business link clear. The technology itself is not difficult, and the industry offers many mature open source microservice governance frameworks that can be used directly. The difficulty for us was that many old services were running in production, some implemented in Java and some in PHP. Asking R&D to discard all the old code at once and redevelop it to a standard was not feasible. Our solution was a pan-service-oriented design.

What is pan-service-oriented governance? It means achieving the goals of service governance while keeping the original HTTP URL style of API calls. Developers do not need to modify old services extensively; they only need to lightly adjust how HTTP URL calls are made, changing a few lines of code. The adjusted call path provides the basic mechanisms of service registration, service discovery, and routing. This design achieves service governance without imposing a high cost on the large number of old services.
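The idea of keeping the HTTP URL call shape while adding discovery can be illustrated with a minimal sketch. It assumes a hypothetical in-memory registry; `ServiceRegistry`, `resolve_url`, the service name `order-service`, and the addresses are illustrative and are not the team's actual framework. The caller still writes an ordinary URL, and only the host part is resolved through service discovery.

```python
import itertools
from urllib.parse import urlsplit, urlunsplit


class ServiceRegistry:
    """Hypothetical in-memory registry: service name -> list of 'host:port'."""

    def __init__(self):
        self._instances = {}
        self._cursors = {}

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)
        # Rebuild the round-robin cursor whenever the instance list changes.
        self._cursors[service] = itertools.cycle(list(self._instances[service]))

    def discover(self, service):
        if service not in self._cursors:
            raise LookupError(f"no instance registered for {service!r}")
        return next(self._cursors[service])


def resolve_url(registry, url):
    """Swap the logical service name in the URL for a concrete instance."""
    parts = urlsplit(url)
    address = registry.discover(parts.netloc)
    return urlunsplit((parts.scheme, address, parts.path, parts.query, parts.fragment))


registry = ServiceRegistry()
registry.register("order-service", "10.0.0.1:8080")
registry.register("order-service", "10.0.0.2:8080")

first = resolve_url(registry, "http://order-service/api/orders?id=7")
second = resolve_url(registry, "http://order-service/api/orders?id=7")
```

In this sketch an old service would only wrap its existing URL in `resolve_url` before making the HTTP call, which is one way to read "moving a few lines of code".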

With such a lightweight transformation, we could roll out quickly in production and give all old services basic governance capabilities. It also gave us more governance and emergency measures for safeguarding the stability of the production environment.

Our main business technology stack is Java and PHP, and during the transition period several types of services coexist in the system. These services can communicate with each other easily within the same runtime environment. As old services are gradually converted to a standard microservice architecture, the links eventually consist entirely of standard microservices. By avoiding a one-size-fits-all approach (a full-scale, high-cost rewrite), the service architecture gains governance capabilities at an early stage.

Our first-generation technology stack was based mainly on PHP, which supported our business quickly and efficiently. During the transition period, however, PHP and Java services had to interconnect and communicate within one large service architecture. Our middleware team used a PHP sidecar to handle service registration, discovery, and configuration, allowing PHP and Java services to run in the same pool.

After service governance was complete, the links in the production environment became much clearer, and our monitoring tool can draw the service link clearly. When a technical fault or a "smoke" event (an early sign of trouble) occurs, we also have more emergency measures available, such as degradation, rate limiting, and circuit breaking. We can resolve problems faster and shorten the recovery time.
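As an illustration of circuit breaking with a degraded fallback, here is a minimal sketch. The class `CircuitBreaker`, its thresholds, and the failing `flaky_recommendations` call are assumptions made for the example; the article does not describe the team's actual implementation.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # degraded response while the circuit is open
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


calls = []


def flaky_recommendations():
    # Hypothetical non-critical dependency that keeps timing out.
    calls.append(1)
    raise TimeoutError("non-critical service timed out")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
results = [breaker.call(flaky_recommendations, lambda: []) for _ in range(5)]
```

After two consecutive failures the breaker stops calling the failing service and serves the empty fallback, so a non-critical dependency cannot keep dragging down its callers.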

Once service governance met our expectations, routine production failures could be fixed quickly. From a traffic perspective, however, our initial architecture was single-IDC and single-link, with poor link fault tolerance. For a long time our release efficiency was low, and major version releases often ran into the early morning. To improve release efficiency and technical stability, we evolved the architecture to a single IDC with multiple links (as shown in the middle figure).

Although all services still sit in the same data center, we can create multiple logically distinct traffic links. The links are isolated from each other, and internally we call them swimlanes. This design solves the problem of full-link grayscale release during peak production hours.
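A minimal sketch of swimlane routing follows. The lane names, addresses, and the hash-based assignment rule are assumptions for illustration only; the article does not say how requests are assigned to a swimlane.

```python
import zlib

# Hypothetical deployment map: lane -> service -> instance address.
LANES = {
    "stable": {"gateway": "10.0.1.1", "order": "10.0.1.2", "payment": "10.0.1.3"},
    "gray": {"gateway": "10.0.2.1", "order": "10.0.2.2", "payment": "10.0.2.3"},
}


def assign_lane(user_id, gray_percent):
    """Deterministically send `gray_percent`% of users to the gray lane."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return "gray" if bucket < gray_percent else "stable"


def route_chain(lane, services):
    """Resolve every hop of the call chain inside one lane only."""
    return [LANES[lane][service] for service in services]


lane = assign_lane("user-42", gray_percent=10)
chain = route_chain(lane, ["gateway", "order", "payment"])
```

Because every hop is resolved from the same lane, a new version deployed in the gray lane only ever sees the traffic assigned to that lane, which is what makes a peak-hour release tolerable.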

This traffic scheduling architecture not only makes production releases efficient and low-risk, but also serves as an emergency measure when the system suffers a serious capacity failure. If a fatal capacity bottleneck cannot be recovered quickly and the remaining capacity cannot carry the traffic of the entire network and business, we can urgently change the traffic scheduling strategy to keep business running in important cities and regions.
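One way such an emergency strategy could be computed is sketched below. The function `plan_emergency_traffic`, the city names, the priorities, and the QPS figures are all hypothetical; the sketch only shows the idea of admitting the most important cities first within the remaining capacity.

```python
def plan_emergency_traffic(city_qps, priority, capacity_qps):
    """Admit cities in priority order until the remaining capacity runs out."""
    admitted, shed = [], []
    remaining = capacity_qps
    for city in sorted(city_qps, key=lambda c: priority.get(c, float("inf"))):
        if city_qps[city] <= remaining:
            admitted.append(city)
            remaining -= city_qps[city]
        else:
            shed.append(city)
    return admitted, shed


city_qps = {"city_a": 500, "city_b": 300, "city_c": 400}
priority = {"city_a": 1, "city_b": 2, "city_c": 3}  # lower number = more important
admitted, shed = plan_emergency_traffic(city_qps, priority, capacity_qps=900)
```

With 900 QPS of usable capacity, the two highest-priority cities are kept and the third is shed until capacity recovers.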

2、 Building Technical Support Capabilities

After completing the reconstruction and governance of the service-oriented architecture (including pan-service-oriented governance), we had more means of ensuring technical stability. During this period, the technical team also set up a dedicated global stability team. Its core responsibility is to build and improve the technical support platform; an overview of the current system is shown in the figure.

The system covers preparation before faults, fault detection, fault response, fault containment (stopping the bleeding), and fault recovery. In each area we prioritized the capabilities we needed most urgently. Two aspects matter most.

First, the NOC team. This team is responsible for stability monitoring and on-call duty, and for commanding the emergency response to production failures. It must contain and repair faults quickly. We require the team to identify an issue within 1 minute and respond within 5 minutes. The NOC is the gatekeeper of our technical support system.

Second, organizational support is crucial. We set up a separate organizational structure with clear responsibilities, emphasize closing the loop on fault reviews and remediation, define stability process metrics, and track those metrics continuously.

The monitoring platform is the most important part of our technical support system. In the initial stage we gave it the highest priority, and the internal team has always called it the AI Monitor project.

At the start of the project, everyone agreed to take the harder path: the platform should provide not only the conventional monitoring and alerting capabilities (application access and collection, data storage and computation, query and display, and alerting), but also solve deeper problems such as intelligent discovery, intelligent health scanning, and automatic analysis and localization. That is why AI is in the project name.

So far, the monitoring platform covers all applications (1000+) and all nodes (9000+), and receives more than 5000 alarms per day (before noise reduction). Beyond functional coverage, we also focus on how close to real time the monitoring metric data is and how quickly queries respond, and we treat these as key technical metrics of the platform itself.

Here we focus on the risk prediction function in the platform's AIOps module. The platform continuously runs "health checks" on every business application. It captures the OS metrics of the machine the application runs on (CPU, IO, Mem, Network, etc.), the metrics of the service's own process (CPU, Thread, Mem, TCP, etc.), the application's upstream and downstream services, its throughput (QPS, Latency), the infrastructure and middleware it relies on (DB, Cache, MQ, etc.), and, for Java applications, JVM information. The platform combines these metrics and checks whether any of them falls outside a reasonable range. Some metrics also undergo a period-over-period comparison (for example, week over week), and an expert team keeps revising the reasonable interval for each metric. If any metric looks problematic, the platform alerts our developers so they can diagnose further.
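The range check and week-over-week comparison can be sketched as follows. The metric names, the intervals in `REASONABLE_RANGES`, and the 50% change threshold are assumptions chosen for the example, not values from the platform.

```python
# Hypothetical "reasonable intervals" of the kind an expert team would maintain.
REASONABLE_RANGES = {
    "cpu_percent": (0, 80),
    "latency_ms": (0, 200),
    "jvm_heap_percent": (0, 85),
}
MAX_WEEK_ON_WEEK_CHANGE = 0.5  # assumed: flag changes larger than 50%


def health_check(current, last_week):
    """Return human-readable findings for out-of-range or fast-changing metrics."""
    findings = []
    for name, value in current.items():
        low, high = REASONABLE_RANGES.get(name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            findings.append(f"{name}={value} outside [{low}, {high}]")
        previous = last_week.get(name)
        if previous:  # skip missing or zero baselines
            change = abs(value - previous) / previous
            if change > MAX_WEEK_ON_WEEK_CHANGE:
                findings.append(f"{name} changed {change:.0%} week on week")
    return findings


findings = health_check(
    current={"cpu_percent": 92, "latency_ms": 150, "jvm_heap_percent": 60},
    last_week={"cpu_percent": 70, "latency_ms": 80, "jvm_heap_percent": 58},
)
```

In this example the CPU value is flagged for leaving its interval, while latency is flagged only by the week-over-week comparison even though it is still inside its interval, which shows why the two checks complement each other.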

This risk prediction capability is widely used internally. By our count, more than 70% of technical "smoke" incidents can be detected in advance through these routine predictions, which helps us nip the causes of serious technical failures in the bud.

Another important function of the monitoring platform is automatic root cause analysis, and our internal exploration of it has begun to bear fruit. The platform aggregates all abnormal metrics and, based on the service link relationships it builds automatically, offers preliminary analysis suggestions, which greatly speeds up our engineers' manual troubleshooting. We can quickly locate the root cause of a fault and take the most reasonable measures to contain and recover from it as early as possible, minimizing the impact on the business.
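A very simplified sketch of link-based analysis is shown below. It assumes one plausible heuristic, which the article does not confirm: among the services reporting abnormal metrics, suspect those whose own downstream dependencies all look healthy. The service names and the `dependencies` map are hypothetical.

```python
def root_cause_candidates(dependencies, abnormal):
    """Abnormal services whose own downstream dependencies all look healthy."""
    return sorted(
        service
        for service in abnormal
        if not any(dep in abnormal for dep in dependencies.get(service, []))
    )


# Hypothetical service link: caller -> downstream services it depends on.
dependencies = {
    "gateway": ["order"],
    "order": ["payment", "inventory"],
    "payment": ["payment-db"],
    "inventory": [],
}
abnormal = {"gateway", "order", "payment", "payment-db"}
candidates = root_cause_candidates(dependencies, abnormal)
```

Here the alarms on `gateway`, `order`, and `payment` are treated as symptoms that propagate upstream, and the suggestion narrows to `payment-db`. A real platform would weigh many more signals; this only shows how the service link turns a pile of alarms into a short list.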

Full-link capacity stress testing is the core battlefield of the technical support system. To ensure stability, we need to know how much business volume the system capacity can support and where the capacity weaknesses of the core main chain services are. When implementing the full-link capacity stress testing platform, we made the following design choices.

1、 Choice of technical approach. We run the tests against the real production databases and tables.

2、 Design of data and traffic models. Because the main chain services receive different proportions of traffic, we send additional compensating requests to these services so that every service on the link receives 1.5 or 2 times the traffic pressure, which makes the capacity test of the whole system more comprehensive and accurate.

3、 Capacity governance and drills. We run full-link capacity stress tests every two weeks and follow up on the technical design and performance issues they uncover, which keeps the technical architecture from decaying over long-term iteration. In addition, the technical support team regularly runs production fault drills alongside the stress tests, so that the team's engineers stay sensitive to the stability of the production system.
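The traffic compensation in item 2 can be made concrete with a small calculation. This sketch assumes that the multiplier applies to each service's production peak QPS and that the load reaching each service from the replayed entry traffic is known; the function name `compensation_qps` and all numbers are hypothetical.

```python
def compensation_qps(peak_qps, propagated_qps, multiplier):
    """Extra direct load per service so each one reaches multiplier x its peak."""
    extra = {}
    for service, peak in peak_qps.items():
        target = peak * multiplier
        extra[service] = max(0.0, target - propagated_qps.get(service, 0.0))
    return extra


peak_qps = {"gateway": 1000, "order": 400, "payment": 100}
# Assumed load that reaches each service when only the entry traffic is replayed.
propagated_qps = {"gateway": 1500, "order": 450, "payment": 90}
extra = compensation_qps(peak_qps, propagated_qps, multiplier=1.5)
```

In this example the gateway already receives its full target, while `order` and `payment` need 150 and 60 extra QPS respectively to reach 1.5 times their peak.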

Building the data platform is a highly valued part of the technical support system. We manage and govern basic data services through our self-built DMS platform. In building it, we focus on three capabilities.

First, as business volume grows rapidly and data scale grows exponentially, DMS has automated much of the work in data isolation, service rate limiting, and real-time scaling.

Second, DAL is a database access middleware that we introduced and rolled out quickly. It gives the database an elastic architecture with database sharding, table sharding, and read/write separation, so that it can keep up with the growth of the business.
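What a data access layer does with sharding and read/write separation can be sketched in a few lines. The function `route_sql`, the node naming scheme, and the hash-based choices are assumptions for illustration; they do not describe the actual DAL.

```python
import zlib


def route_sql(sql, shard_key, shard_count=4, replicas_per_shard=2):
    """Pick a shard by key hash; send reads to a replica and writes to the primary."""
    shard = zlib.crc32(str(shard_key).encode("utf-8")) % shard_count
    is_read = sql.lstrip().lower().startswith("select")
    if is_read:
        # Assumed policy: spread reads across replicas with a stable hash.
        replica = zlib.crc32(sql.encode("utf-8")) % replicas_per_shard
        return f"shard{shard}-replica{replica}"
    return f"shard{shard}-primary"


write_node = route_sql("UPDATE orders SET status = 'paid' WHERE user_id = 42", 42)
read_node = route_sql("SELECT * FROM orders WHERE user_id = 42", 42)
```

The application issues plain SQL with a shard key, and the middleware decides which physical node receives it, so the number of shards or replicas can change without touching business code.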

Third, resource ID design. When delivering data services, we do not expose the real physical information (connection strings) of the services to developers. Our frameworks and tools fully support resource IDs, so developers can use them out of the box without needing to understand the cluster, deployment, and configuration details of each environment behind a resource ID. Technical support engineers can also use resource IDs to quickly locate which business application is having trouble with a service (such as DB, Cache, or MQ).
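A minimal sketch of resource ID resolution follows. The catalog contents, the endpoint names, and the `resolve` function are hypothetical; the point is only that applications ask for a logical ID while the platform keeps both the physical details and a record of who uses what.

```python
# Hypothetical mapping maintained by the platform team, one entry per environment.
RESOURCE_CATALOG = {
    ("order-db", "test"): {"kind": "DB", "endpoint": "test-db.internal:3306"},
    ("order-db", "prod"): {"kind": "DB", "endpoint": "prod-db.internal:3306"},
}
USAGE = {}  # resource ID -> set of applications that requested it


def resolve(resource_id, env, app_name):
    """Return connection info for a resource ID and record who is using it."""
    info = RESOURCE_CATALOG[(resource_id, env)]
    USAGE.setdefault(resource_id, set()).add(app_name)
    return info


endpoint = resolve("order-db", "prod", app_name="order-service")["endpoint"]
resolve("order-db", "prod", app_name="billing-service")
users = sorted(USAGE["order-db"])
```

Because every lookup goes through the resource ID, the usage record answers the support question "which applications depend on this database?" without anyone reading connection strings out of application configs.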

3、 Cross-Cloud Thinking and Implementation

Finally, let's talk about how we balance efficiency, cost, and stability in a multi-cloud environment. There are three main aspects: smoothing out the differences between clouds, guarding against cloud jitter, and managing IT costs.

First, smoothing out the differences between clouds. Because the business runs in multi-cloud scenarios, the APIs of the cloud services differ in places. So we built an internal tool platform, LCloud, that connects to all cloud vendors and smooths out the differences between them. Our developers and operations engineers get a consistent operating experience through LCloud.
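The adapter pattern is one common way to hide vendor differences behind a single interface, and the sketch below uses it purely as an illustration. The vendors "A" and "B", their request formats, and every class and function name are invented for the example; they are not LCloud's API or any real cloud SDK, and the adapters only build request payloads instead of calling a cloud.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Uniform interface; each adapter hides one vendor's API differences."""

    @abstractmethod
    def create_instance(self, cpu, memory_gb):
        ...


class VendorAAdapter(CloudProvider):
    def create_instance(self, cpu, memory_gb):
        # Assumed: vendor A expects a named instance type.
        return {"vendor": "A", "request": {"InstanceType": f"c{cpu}.m{memory_gb}"}}


class VendorBAdapter(CloudProvider):
    def create_instance(self, cpu, memory_gb):
        # Assumed: vendor B expects explicit sizes, with memory in MB.
        return {"vendor": "B", "request": {"cores": cpu, "memory_mb": memory_gb * 1024}}


PROVIDERS = {"a": VendorAAdapter(), "b": VendorBAdapter()}


def create_instance(cloud, cpu, memory_gb):
    """Single entry point used by developers and operators on every cloud."""
    return PROVIDERS[cloud].create_instance(cpu, memory_gb)


req_a = create_instance("a", cpu=4, memory_gb=8)
req_b = create_instance("b", cpu=4, memory_gb=8)
```

Callers pass the same arguments regardless of the target cloud, and each adapter translates them into whatever its vendor expects.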

Second, clouds are bound to jitter. If the services we run on the cloud are not resilient enough, jitter on the cloud will certainly affect the business. Even a brief network jitter can increase overall service response latency and make a service unusable. That service's throughput then collapses, and the business link is damaged.

Third, the IT cost of cloud services. Our current cost controls not only use K8s infrastructure for elastic resource scheduling, but also make heavy use of cloud vendor features such as bid-priced (spot) instances and reserved instances. The containerization team is also studying how to free up idle resources during off-peak hours and offer them to offline task workloads.
