How an enterprise builds a cloud operation and maintenance system from 0 to 1

1. Problems and challenges on the cloud

When I first joined the company, I faced four difficulties and three urgent problems. The four difficulties were as follows.
First, a shortage of manpower. The operation and maintenance department had only 4 people, against 200 people on the R&D side. For a young company, a ratio of 1:50 is relatively high.

Second, no mature operation and maintenance tools. Tools existed, but each one was functionally incomplete on its own, and they were not linked together across the overall operation and maintenance workflow, which greatly increased operation and maintenance costs.

Third, rapid business iteration. There was a minor release every week and a major release every two weeks. This followed from how the company's business lines were developing, and the company was also constantly exploring new areas.

Fourth, a lack of unified infrastructure. This led to a variety of access and usage methods, and each department maintained its own independent infrastructure, which also greatly increased operation and maintenance costs.

Given these difficulties, we decided that improving operation and maintenance efficiency in the short term was the primary problem for the department to solve at that stage. Only by improving efficiency could we free up time for other work, including improving business stability and better supporting rapid business iteration.

2. Operation and maintenance efficiency improvement

We set four main directions for improving operation and maintenance efficiency. They were also the main starting points for my hands-on work after joining the company:

01 Solve the efficiency problem on the cloud

Platform construction relied heavily on Alibaba Cloud PaaS products, and daily maintenance generated many work orders. Handing a large share of these work orders to Alibaba Cloud engineers formed a short, fast closed loop. This gave us more time to sort out the company's current situation and problems, and saved a great deal of time.

02 Optimize the publishing platform

We were using Jenkins, the standard choice for small and medium-sized companies, together with some Alibaba Cloud EDAS product capabilities. Neither product is a complete release system, and each lacks some functions. We therefore chose to build a CI/CD system along the following lines:

◾ Release cycle: Because there was no fixed release window, failures lasted much longer and could not be dealt with promptly, which degraded the user's experience of the product. For example, after a release in the early morning, nobody knew about a failure until they arrived at work the next day.

◾ Complete iteration: Linking the full iteration cycle end to end, from the DEV environment to QA, code scanning, grayscale release, and finally production, was what the team urgently needed to do.

◾ Release efficiency: The biggest problem at release time was poor stability. CI and CD were executed as a single step, and the underlying salt and ansible releases ran inefficiently. First, we needed to separate CI from CD. The parameters required for compilation were stored in the CMDB, packaging was gradually moved off Jenkins, and the compilation function was integrated into the release platform. In the CD part, salt was removed and replaced with the ansible API.

◾ Participation: Reduce operation and maintenance involvement in releases and increase direct interaction between R&D and the platform, so that operation and maintenance staff have more time to focus on iteration content, cycles, release counts, and so on.
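The separation of CI from CD described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's actual release platform: the `CMDB_BUILD_PARAMS` dictionary, the `order-service` application, and all function names are hypothetical. The CI side reads its build parameters from the CMDB and produces an artifact record; the CD side only plans rolling batches for an existing artifact. The actual deployment call (for example through ansible) is deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical CMDB content: build parameters keyed by application name.
CMDB_BUILD_PARAMS = {
    "order-service": {
        "build_tool": "mvn",
        "goals": ["clean", "package"],
        "artifact": "target/order-service.jar",
    },
}


@dataclass(frozen=True)
class Artifact:
    """Immutable record handed from CI to CD."""
    app: str
    version: str
    path: str


def ci_build_command(app: str) -> list[str]:
    """CI: assemble the build command from parameters stored in the CMDB."""
    params = CMDB_BUILD_PARAMS[app]
    return [params["build_tool"], *params["goals"]]


def ci_package(app: str, version: str) -> Artifact:
    """CI ends by recording an artifact; nothing is deployed here."""
    return Artifact(app, version, CMDB_BUILD_PARAMS[app]["artifact"])


def cd_plan(artifact: Artifact, hosts: list[str], batch_size: int) -> list[list[str]]:
    """CD: split the target hosts into rolling batches for one artifact."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


artifact = ci_package("order-service", "1.4.2")
batches = cd_plan(artifact, ["host-1", "host-2", "host-3"], batch_size=2)
```

Because CD consumes only the artifact record, the same build can be promoted through DEV, QA, grayscale, and production without being recompiled.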

03 CMDB construction

The CMDB is the most important information source for operation and maintenance, and building it is the first step toward a complete operation and maintenance system.

All information sources are written into the CMDB, including Alibaba Cloud assets, self-built application information, service build parameters, and organizational structure and personnel information.

The value of storing this information lies in directly correlating CI/CD, automated operation and maintenance, cost analysis, alarms, failures, and client performance. The CMDB was not built in one go. It started with some Alibaba Cloud assets, and the build parameters were filled in gradually. As the CMDB became more complete, standardization of the basic environment also began to advance.
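To illustrate why a single source of truth makes these correlations possible, here is a small sketch of a CMDB that links assets to applications and applications to teams and owners. The schema, class names, and sample data are assumptions made for illustration; the article does not describe the real data model.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    asset_id: str
    kind: str            # e.g. "ecs" or "rds"
    app: str             # application that uses the asset
    monthly_cost: float


@dataclass(frozen=True)
class App:
    name: str
    team: str
    owners: tuple[str, ...]


class CMDB:
    """Hypothetical in-memory CMDB joining assets, applications, and teams."""

    def __init__(self, assets: list[Asset], apps: list[App]) -> None:
        self._assets = {a.asset_id: a for a in assets}
        self._apps = {a.name: a for a in apps}

    def owners_of_asset(self, asset_id: str) -> tuple[str, ...]:
        """Alarm routing: who should be notified when this asset misbehaves."""
        return self._apps[self._assets[asset_id].app].owners

    def cost_by_team(self) -> dict[str, float]:
        """Cost analysis: total monthly spend per team."""
        totals: dict[str, float] = defaultdict(float)
        for asset in self._assets.values():
            totals[self._apps[asset.app].team] += asset.monthly_cost
        return dict(totals)


cmdb = CMDB(
    assets=[
        Asset("i-001", "ecs", "order-service", 100.0),
        Asset("rm-001", "rds", "order-service", 250.0),
        Asset("i-002", "ecs", "search-service", 80.0),
    ],
    apps=[
        App("order-service", "trade", ("alice", "bob")),
        App("search-service", "search", ("carol",)),
    ],
)
```

With the joins in one place, the alarm path (`owners_of_asset`) and the cost path (`cost_by_team`) read the same records, so neither drifts from the other.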

04 Building the operation and maintenance work order platform from 0 to 1

The work order platform is the single entry point through which R&D, and indeed the whole company, reaches operation and maintenance. Before it existed, all departments communicated by email, which brought lost and delayed messages and therefore delays in handling requests. Because permission management was unclear, some colleagues also performed operations on their own, which caused various problems.

The work order platform sits close to cloud management and control, and gradually replaces direct permissions on the Alibaba Cloud console. At the same time, we use work orders to gradually standardize R&D engineers' understanding of middleware and Alibaba Cloud's underlying resources. The platform's development is divided into three stages:

Stage 1: Manual execution. The work order platform abstracts high-frequency operations and requests, but they are still executed by hand. At this stage the platform is more like a web portal, without much automation capability.

Stage 2: Manual decision-making, automatic execution. This stage lasts a long time, during which automation capabilities are gradually added to the work orders.

Stage 3: Automatic decision-making and automatic execution. The business's demand for resources is translated into a resource type and resource specification, with both standardized and personalized requirement descriptions. The work order platform thus becomes the entry point that brings operation and maintenance closer to the business.
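The three stages can be pictured as one dispatch function whose behaviour depends on how much automation exists for a given work order type. The sketch below is an assumption-laden illustration: the `create_redis` type, the 4 GB auto-approval rule, and the function names are invented for the example and do not come from the platform described here.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical automation registry: work order type -> function that performs it.
EXECUTORS: dict[str, Callable[[dict], str]] = {
    "create_redis": lambda spec: f"created redis with {spec['memory_gb']} GB",
}


def within_policy(ticket: dict) -> bool:
    """Hypothetical stage-3 rule: small Redis requests need no human approval."""
    return ticket["type"] == "create_redis" and ticket["spec"]["memory_gb"] <= 4


def process(ticket: dict, approved_by: str | None = None) -> tuple[int, str]:
    """Return (stage, outcome) for a work order."""
    executor = EXECUTORS.get(ticket["type"])
    if executor is None:
        # Stage 1: the platform only records the request; a person executes it.
        return 1, "queued for manual handling"
    if within_policy(ticket):
        # Stage 3: the policy decides and the executor runs without a human.
        return 3, executor(ticket["spec"])
    if approved_by is None:
        # Stage 2: a person must decide before the automation runs.
        return 2, "waiting for approval"
    return 2, executor(ticket["spec"])


small = {"type": "create_redis", "spec": {"memory_gb": 2}}
large = {"type": "create_redis", "spec": {"memory_gb": 16}}
other = {"type": "resize_disk", "spec": {}}
```

Moving a work order type from stage 1 to stage 2 means adding an executor; moving it to stage 3 means adding a policy rule, which is why the second stage tends to last longest.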

Through this work, operation and maintenance efficiency improved within half a year, leaving more time to take part in the stability transformation of the architecture. Operation and maintenance staff also began paying more attention to business trends.

3. Stability Construction

01 Problems and challenges faced

"The site must not go down" is the most important OKR for the entire team. The reason is that the company ended up in trending searches twice in 2020, both times because of site abnormalities caused by an unstable architecture, which sparked heated discussion.

The second point is service early warning and rapid recovery. This capability was missing, which made it impossible to locate the root cause promptly after a fault occurred.

The third point is architectural evolution. Business volume keeps rising, and at larger volume the architecture inevitably has to evolve.

02 Fault location and handling

Across the stability life cycle, the ratio of failure time to non-failure time is very important. Non-failure time should be spent strengthening the architecture, running post-incident reviews, and conducting drills, so as to reduce the share of failure time and move business stability in a healthier direction. Over the past year, we have focused on the time invested in stability construction and increased the number of drills. The following sections share representative cases of stability transformation.
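One common way to quantify the share of failure time is an availability ratio and the downtime budget it implies. The sketch below is illustrative only: the 30-day window and the 99.9% target are assumptions chosen for the example, not figures reported by the team.

```python
import math


def availability(total_minutes: float, failure_minutes: float) -> float:
    """Fraction of the period during which the service was not failing."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= failure_minutes <= total_minutes:
        raise ValueError("failure_minutes must lie within the period")
    return 1 - failure_minutes / total_minutes


def downtime_budget(total_minutes: float, target: float) -> float:
    """Failure minutes allowed in the period for a given availability target."""
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return total_minutes * (1 - target)


MINUTES_IN_30_DAYS = 30 * 24 * 60          # 43,200 minutes

observed = availability(MINUTES_IN_30_DAYS, failure_minutes=43.2)
budget = downtime_budget(MINUTES_IN_30_DAYS, target=0.999)
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of failure time per 30 days, which is the quantity that reviews and drills aim to shrink.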

03 Stability transformation

Over the past year, our stability work has broadly covered the six tasks shown in the figure above. We share three of the more representative ones here.

Transformation 1: Evolution of the monitoring system

At the end of 2020, ARMS was the only monitoring tool. It was easy to use but not a good fit. For example, IT governance had not been done well, so host groups, alarm templates, alarm contacts, the organizational structure, and notification delivery could not be linked together. As a result, only operation and maintenance staff received alarms, while R&D engineers did not. Operation and maintenance staff had to forward alarms manually to R&D for handling.

On the K8s platform, Prometheus was in use at the time. Its biggest problems were the lack of a proper web interface and the relatively high barrier to writing query and alarm statements.

In March 2021, we started formal monitoring governance. We used OpenFalcon to monitor the underlying cloud products, and merged the multiple Prometheus deployments inside the company with Alibaba Cloud's Prometheus service to unify the Prometheus data sources. This left the company with only two monitoring products: one for underlying resources and one for K8s.

In June 2021, we began developing our own monitoring system, OWL. As the name suggests, we hope it can stand in for the operation and maintenance staff who are on duty at night.

OWL reuses the OpenFalcon front-end pages and is compatible with Prometheus data. It abstracts Prometheus query and alarm statements, presents them in a structured, form-like layout, and adds alarm functions. This is similar to the Prometheus plug-in in Grafana. At present, all alarms have been integrated into the OWL alarm platform, and the information collection and alarm functions have also been improved.
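Abstracting a Prometheus alarm statement into form fields might look like the following. This is a guess at the general idea, not OWL's implementation: the function name and field layout are hypothetical, and only the simplest PromQL shape (an instant vector selector compared with a threshold) is covered.

```python
from __future__ import annotations

import re

ALLOWED_OPS = {">", "<", ">=", "<=", "==", "!="}
METRIC_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
LABEL_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def render_promql(metric: str, labels: dict[str, str], op: str, threshold: float) -> str:
    """Turn form fields into a PromQL threshold expression."""
    if not METRIC_RE.match(metric):
        raise ValueError(f"invalid metric name: {metric!r}")
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    matchers = []
    for name, value in sorted(labels.items()):
        if not LABEL_RE.match(name):
            raise ValueError(f"invalid label name: {name!r}")
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        matchers.append(f'{name}="{escaped}"')
    selector = metric + ("{" + ",".join(matchers) + "}" if matchers else "")
    return f"{selector} {op} {threshold}"


expr = render_promql("node_load1", {"job": "node", "env": "prod"}, ">", 8)
```

A form built on such a function lets R&D engineers fill in a metric, labels, and a threshold without learning the query syntax, while validation keeps malformed input out of the rule.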

The next stage is the alarm gateway. As the number of alarms grows, alarm messages flood the notification groups. To make it easier for operation and maintenance staff to track problems, the alarm gateway needs to support alarm takeover, resolution, and escalation, reducing repetitive alarms in the group without missing key ones.
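The takeover, resolution, and escalation behaviour of an alarm gateway can be sketched as a small state table keyed by alarm fingerprint. Everything here is an illustrative assumption: the class, the fingerprint strings, and the 300-second escalation delay are hypothetical and are not taken from the gateway described in this article.

```python
from __future__ import annotations


class AlarmGateway:
    """Deduplicate repeated alarms and escalate those nobody has taken over."""

    def __init__(self, escalate_after: float) -> None:
        self.escalate_after = escalate_after      # seconds before escalation
        self._open: dict[str, dict] = {}          # fingerprint -> alarm state

    def receive(self, fingerprint: str, now: float) -> str:
        """Return the action for an incoming alarm: notify, suppress, or escalate."""
        state = self._open.get(fingerprint)
        if state is None:
            self._open[fingerprint] = {"first_seen": now, "owner": None, "escalated": False}
            return "notify"
        if state["owner"] is not None:
            return "suppress"                     # someone is already handling it
        if not state["escalated"] and now - state["first_seen"] >= self.escalate_after:
            state["escalated"] = True
            return "escalate"
        return "suppress"                         # repeat of an open alarm

    def take_over(self, fingerprint: str, owner: str) -> None:
        """Record that an engineer has taken over the alarm."""
        self._open[fingerprint]["owner"] = owner

    def resolve(self, fingerprint: str) -> None:
        """Close the alarm so that a recurrence notifies again."""
        self._open.pop(fingerprint, None)


gateway = AlarmGateway(escalate_after=300)
actions = [gateway.receive("db-cpu-high", t) for t in (0, 60, 300, 360)]
```

In this sketch a repeated alarm reaches the group at most twice (once on arrival and once on escalation) until it is resolved, which is one way to cut noise without dropping an unattended alarm.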

Transformation 2: Access Layer Transformation

In the figure above, the left side shows the 2020 architecture and the right side shows the latest architecture after the transformation.

In the architecture on the left, SLB connects directly to the ACK pods and ECS. The advantage is clear: there are no major failures. The disadvantage is poor maintainability, which results in frequent minor failures. Most of the abnormalities seen on the client side came from the access layer.

On the right is the transformed architecture, with a unified access layer and ingress-based entry. Its advantages are easy control and tracing across the entire request path, cost optimization, and unified Anti-DDoS ("high-defense") and WAF access.

Transformation 3: Switching from MySQL to PolarDB

Switching from MySQL to PolarDB was the first in-depth cooperation project with Alibaba Cloud in 2020. The background was that the MySQL sharding mechanism (splitting databases and tables) had problems, disk and CPU capacity did not match, and a single instance had reached 6 TB of data, exceeding the Alibaba Cloud SLA standard.

The business scenario is read-heavy with few writes, and the team was also considering a distributed database. Both characteristics match PolarDB very well. It took the team one month to migrate all 30 groups of MySQL instances larger than 1 TB to PolarDB. After the migration, there were gains in cost, stability, and usage.

The architecture changes above are not necessarily the best possible, but they are the most practical way to support rapid business iteration in the short term.

4. Future direction

Finally, here are a few things we plan to do in 2022:

1. Continue to explore cloud capabilities. For small and medium-sized companies like ours, the underlying technical capabilities can largely rely on the public cloud. In particular, business scenarios can be explored in multiple directions with the help of the public cloud, supporting the company's own business iteration.

2. Private management. Combining the basic capabilities of the public cloud with customized management and access methods can fully support rapid business iteration. For example, RDS and Redis can each be treated simply as an instance, and ECS as a physical machine. This approach removes the pressure of the IaaS layer.

3. Productize our PaaS. The team will also study Alibaba Cloud's product thinking in depth and make its own PaaS offerings more product-oriented.
