SRE Capacity Building for Cloud Services at the Scale of Tens of Thousands of Nodes

Background and current situation

Introduction to system architecture

The above figure shows the actual system architecture used by Alibaba Cloud. The system's main purpose is to compute and store real-time data streams. Alibaba Cloud Container Service for Kubernetes (ACK) serves as the system base, and container deployment, release, and control all follow the K8s standard. A self-developed Gateway service acts as the traffic entry point and is deployed behind a Service of type LoadBalancer.

This is also a common pattern for building systems on K8s. Through the CCM component provided by ACK, the Service is automatically bound to an underlying SLB instance, which receives external traffic. After receiving a reported data stream, the Gateway writes the data into a Kafka topic for buffering. When a consumer detects the incoming data, it reads from the corresponding topic, performs further computation and processing, and writes the final result to the storage layer.

There are two kinds of storage media: Alibaba Cloud block storage (ESSD), which is fast but relatively expensive, and file storage (NAS), which is mainly used for data with lower performance requirements. Metadata is managed by ACM. The self-developed Consumer and Center components are responsible for querying computation results from storage and returning them to users.

System status

The system currently serves users globally and is deployed in nearly 20 regions around the world. As the figure suggests, the data read and write links are long and pass through multiple components. The monitored objects are also complex, covering infrastructure (scheduling, storage, network, etc.), middleware (Kafka), and business components (Gateway, Center, Consumer).

Scope of work

• Collection of observable data

Observability mainly consists of Metrics, Tracing, and Logging. The three types of data each play their own role and complement one another.

Metrics answers the question of whether the system has a problem, and it is the entry point of the system's self-monitoring. Alerts or troubleshooting based on Metrics can quickly determine whether something is wrong and whether human intervention is required.

Tracing answers where the problem occurs. It provides details such as the call relationships between components, the path of each request, and which call between which two components went wrong.

After the faulty component is found, the root cause is located through the most detailed record of the problem: Logging.

• Data visualization & troubleshooting

After the three types of data are collected, the next step is to display them visually on dashboards, so that problems can be identified at a glance and troubleshooting becomes more efficient.

• Alerting & emergency response

Once a problem is identified, it is handled through the alerting and emergency-response process.

The emergency-response process mainly covers the following points:

First, a hierarchical alert notification and escalation strategy is required, so that the right person can be found quickly to handle an alert. If the original assignee cannot respond in time for some reason, the alert must be escalated to someone else for timely handling.

Second, the problem recognition and handling process should be standardized.

Third, post-incident statistics and reviews. After a failure is resolved, statistics should be collected and a review conducted, so that the same problem can be avoided in the future and the system becomes more stable.

Fourth, O&M operations should be tool-based and GUI-driven ("white-screen"): manual command input should be minimized, and the standard handling actions should be codified into tools.

Practical experience sharing

Observable data collection & access

• Metrics data

The first step of monitoring is collecting observable data, which can be divided into three layers in this system.

The top layer is the application layer, which mainly cares about the health of core business interfaces, measured by the three RED golden signals (Rate, Errors, Duration). Rate refers to the QPS or TPS of an interface, Errors to the error rate or error count, and Duration to how long the interface takes to return. From these golden signals you can define SLOs and allocate an Error Budget. If the Error Budget is exhausted too quickly, the SLO should be lowered for the time being, and raised again only after the system has been sufficiently optimized. The Apdex Score can also be used to measure service health; its calculation formula is shown in the figure above. In addition, the application layer also cares about business-critical metrics such as revenue, number of users, UV, and PV.
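The formula itself is only shown in the figure; for reference, the standard Apdex definition is (satisfied + tolerating/2) / total samples. A minimal Go sketch of the Apdex and Error Budget arithmetic described above (the function names and the shift-rotation of thresholds are illustrative, not from the original system):

```go
package main

import "fmt"

// apdex computes the standard Apdex score:
// (satisfied + tolerating/2) / total.
// "Satisfied" requests finish within the target threshold T,
// "tolerating" requests finish within 4T.
func apdex(satisfied, tolerating, total float64) float64 {
	if total == 0 {
		return 1 // no traffic: treat as healthy
	}
	return (satisfied + tolerating/2) / total
}

// errorBudgetLeft returns the fraction of error budget remaining,
// given an SLO target (e.g. 0.999) and the observed success ratio
// over the same window.
func errorBudgetLeft(slo, successRatio float64) float64 {
	allowed := 1 - slo        // total budget for the window
	spent := 1 - successRatio // failures observed so far
	return (allowed - spent) / allowed
}

func main() {
	fmt.Println(apdex(900, 80, 1000))           // 0.94
	fmt.Println(errorBudgetLeft(0.999, 0.9995)) // ≈ 0.5: about half the budget remains
}
```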

The middle layer is middleware and storage. The main concerns are: on the Kafka client side, the offset commit status of consumers, the producer buffer occupancy (whether the buffer fills up early so that new messages cannot come in), consumption delay, and average message size; on the Kafka broker side, usage levels, read/write traffic, and disk utilization; and for the ESSD cloud disks, mount success rate, IOPS, free disk space, and so on.

The lowest layer is the infrastructure layer, whose metrics are the most varied: the CPU and memory usage of ECS (K8s nodes), restart counts, and scheduled O&M events; metrics of K8s core components such as the API server, etcd, and the scheduler; whether business Pods are stuck in Pending, whether sufficient resources are available for scheduling, OOMKilled events, Error events, and so on; and VPC/SLB metrics such as egress bandwidth and the number of dropped connections.

Monitoring ECS nodes requires deploying the node-exporter DaemonSet. The K8s cloud-native core components expose metrics through a Metrics endpoint for Prometheus to scrape. Because Kafka and the storage components are Alibaba Cloud products, they come with basic monitoring metrics and can be accessed directly. ACK also provides core metrics for the CSI storage plug-in, such as mount success rate, IOPS, and space occupancy, which also need to be collected. The application layer includes Java and Go applications: Java uses Micrometer or Spring Boot Actuator for monitoring, while Go applications are instrumented directly with the official Prometheus SDK.

Based on ACK and the K8s ecosystem, Prometheus is the natural choice of metrics protocol. Self-built open-source Prometheus takes similar effort to access as the cloud-hosted version, but in terms of reliability and O&M cost, cloud-hosted Prometheus is the better option. For example, with the ARMS system you can install a very lightweight probe in the cluster, store the data in a fully managed storage component, and build monitoring, alerting, and visualization around it. The whole process is very convenient, with no self-built open-source components to maintain. Cloud hosting also has advantages in collection and storage capacity.

Prometheus provides one-click access for ECS nodes, K8s core components, Kafka and storage components.

K8s basic monitoring and node status are accessed mainly through the ACK component management center, and are available after simply installing the relevant components.

Kafka is accessed mainly through the Prometheus cloud service instance, which is already integrated with Alibaba Cloud's rich set of PaaS-layer cloud products. When accessing Kafka, the fields to fill in are standard ones, with no extra learning cost.

ESSD cloud disk monitoring is accessed mainly through the CSI storage plug-in. The csi-provisioner component in ACK provides very useful observability: with the information it exposes, you can find out in time which disk failed to mount and which mounted disk fails to meet its IOPS requirement, configure alerts accordingly, and discover cloud disk anomalies promptly.

Through ARMS Prometheus, preset collection jobs at the K8s cluster level and node level, such as PV status, can be monitored easily.

There is no one-click access at the application layer. Instead, the application must expose a Metrics endpoint through instrumentation plus service discovery, making it available for the ARMS Prometheus probe to scrape.

Java applications are instrumented with Micrometer or Spring Boot Actuator.

For example, the code in the above figure shows that much JVM-related information in Micrometer can be exposed with just a few lines of code, and other useful information, such as process-level memory, threads, and system information, can be added just as simply. After the instrumentation is set up, you start an HTTP server inside the application, expose the metrics through an HTTP endpoint, and bind it to a designated port. At that point the instrumentation is complete.

In addition, if there are business metrics, you only need to register them in the global Prometheus Registry.

The ARMS probe scrapes the exposed endpoint once a ServiceMonitor is added. The ServiceMonitor can be defined with a few lines of YAML directly through the console's GUI, which completes collection, storage, and visualization of the whole monitoring pipeline.
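For reference, a minimal ServiceMonitor of the kind described above might look like the following (the names, labels, and port are illustrative assumptions, not taken from the original system):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics          # illustrative name
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app            # must match the target Service's labels
  endpoints:
    - port: metrics          # named port on the Service exposing /metrics
      path: /metrics
      interval: 30s
```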

Go applications are much the same as Java applications; only the instrumentation method differs, mainly using the official Prometheus SDK.

The above figure is an example. Suppose a query component in the system cares about the latency distribution of each query. You can use Prometheus histogram-type metrics to record the latency distribution of each request, specifying the commonly used buckets at instrumentation time. A ServiceMonitor is then written for the Go application's endpoint, and the access of the whole application is completed on the console.
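The original instrumentation code is only in the figure. In practice one would use `prometheus.NewHistogram` from the official client_golang SDK; as a dependency-free sketch of what a histogram metric does under the hood, the cumulative-bucket logic and the text exposition format look roughly like this (metric name and bucket bounds are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// histogram mimics the cumulative-bucket behaviour of a Prometheus
// histogram: each observation increments every bucket whose upper
// bound is >= the observed value, plus a running sum and count.
type histogram struct {
	bounds []float64 // upper bounds in seconds, ascending
	counts []uint64  // cumulative counts per bucket
	sum    float64
	count  uint64
}

func newHistogram(bounds []float64) *histogram {
	sort.Float64s(bounds)
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

func (h *histogram) observe(v float64) {
	for i, ub := range h.bounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.sum += v
	h.count++
}

// expose prints the metric in Prometheus text exposition format,
// which is what a /metrics endpoint would serve.
func (h *histogram) expose(name string) {
	for i, ub := range h.bounds {
		fmt.Printf("%s_bucket{le=%q} %d\n", name, fmt.Sprint(ub), h.counts[i])
	}
	fmt.Printf("%s_bucket{le=\"+Inf\"} %d\n", name, h.count)
	fmt.Printf("%s_sum %g\n", name, h.sum)
	fmt.Printf("%s_count %d\n", name, h.count)
}

func main() {
	h := newHistogram([]float64{0.05, 0.1, 0.5, 1})
	for _, latency := range []float64{0.03, 0.07, 0.4, 2.5} {
		h.observe(latency)
	}
	h.expose("query_duration_seconds")
}
```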

ARMS also provides a non-intrusive observability solution based on eBPF, mainly applicable to scenarios where you do not want to modify the code. It monitors the RED signals of system interfaces non-intrusively, capturing and storing information through the ARMS Cmonitor probe together with eBPF filters in the system kernel.

To use this approach, install the Cmonitor App in the cluster. After installation you can see the Cmonitor Agent in a DaemonSet; an agent is started on each node, registers with the system kernel through eBPF, captures the machine's network traffic, and filters out the topology and other information of the services of interest. As shown in the figure above, it can monitor the QPS, response time distribution, and other information of the whole system.

• Trace and Log data

Trace capabilities are also implemented with Cmonitor. For logs, those of the system components, the K8s control plane, and JVM GC logs are mainly shipped to Loki for long-term storage through arms-Promtail, the probe that collects application logs.

Logging of K8s system events is mainly based on the ARMS event center, which monitors key K8s events, such as OOM and Error events and Pod scheduling events.

• Read and write link monitoring and problem location

For example, on the trace console you can view standard information such as the components each request passes through, receive time, processing time, and response time.

• Component operation log collection and storage

For log collection, by configuring Loki as a data source in Grafana, you can quickly retrieve the logs of a Pod or a service based on keywords or Pod labels, which is a great convenience for troubleshooting.

• K8s operation event collection and monitoring

The above figure is a screenshot of the event center workbench provided by ARMS. The workbench lists key events and allows subscribing to higher-severity ones. A subscription takes only a few simple steps: fill in the basic rules, select the pattern for matching events, and choose the alert recipients. This keeps key events monitored and enables timely response.

Data visualization and troubleshooting

• Grafana dashboard configuration practice

After the data is collected, the next step is to build efficient and usable tools for data visualization and troubleshooting.

Prometheus and Grafana are a golden pair, so we also chose Grafana as the data visualization tool. The figure above lists the key points of dashboard configuration:

• When a dashboard loads, control the number of time series queried by each panel. Rendering too many series puts heavy pressure on the browser, and for troubleshooting, displaying many series at once is not helpful anyway.

• Configure flexible variables on the dashboard, so that data sources and query conditions can be switched at will on a single dashboard.

• Use Transform to let Table panels display statistics flexibly.

• Distinguish between Range and Instant query switches, avoiding useless range queries that slow down dashboard rendering.
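On the last point: the distinction is Grafana's query type rather than the PromQL expression itself. A stat or table panel that only needs the current value can run the same expression as an Instant query (one sample per series), whereas a graph panel genuinely needs the Range version evaluated at every step of the time window. For example (metric name is illustrative):

```promql
sum(rate(http_requests_total[5m]))
```

Run as Instant, this returns a single value per series; run as Range over a wide dashboard window, the same expression is evaluated repeatedly and can become the slowest query on the page.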

Overview of K8s cluster

The above figure shows the node information and Pod information monitored for the K8s cluster. The layout is clear at a glance; key information is marked in different colors, and numbers are displayed in different forms through Grafana's dynamic threshold function, improving troubleshooting efficiency.

Node usage levels

The figure above shows the node usage dashboard, with important information such as disk IOPS, read/write latency, network traffic, memory usage, and CPU usage.

Global SLO

The figure above shows the global SLO dashboard, a custom dashboard configured through the Grafana hosting service, which is ARMS's newest online feature. With a Grafana instance hosted on the cloud, you can log in to the Grafana UI directly with your cloud account; it includes functions customized by Alibaba Cloud.

The dashboard includes global latency, QPS, success rate, error code distribution, and QPS trends, as well as finer-grained information for single shards, load balancing, the gateway components, the center components, and so on. During a release, you can compare against the previous version by filtering on the version number. The data source can be exposed as a variable in the panel, so you can switch between different regions around the world.

Kafka client and server monitoring

The cluster depends on the Kafka client and server, whose monitoring comes from the cloud monitoring integration.

The internal components depend strongly on Kafka. Through metrics, they monitor Kafka and the number of connections to the broker, the average message length, the offset commit rate, and the consumption traffic. The producer side provides information such as buffer occupancy and the number of active producers.

Java application operation monitoring

If a component is written in Java, JVM GC behavior also needs monitoring. The dashboard can show JVM memory, GC activity, CPU utilization, thread counts, and thread states. In addition, for classes that are released or dynamically loaded, the loaded-class statistics reveal whether the count keeps rising.

Tabular error-type statistics

If you want spreadsheet-style statistics, such as the installation status across the cluster or the status of key customers, you can use Grafana's Transform function to give the whole dashboard a spreadsheet-like experience. Transform maps fields into a table, and with filtering enabled you can filter on the various fields to get query results.

• Problem troubleshooting case sharing

Displaying log information requires querying Loki data in Grafana. For example, the center component has a query log containing much raw information, such as the time and UID of each query. In post-processing, first filter the desired log lines, then extract information through pattern matching, compare the extracted fields' values with PromQL-style expressions, and finally re-format the matched log lines.
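A LogQL pipeline of the shape described above might look like the following (the label, field names, and threshold are illustrative assumptions, not from the actual system):

```logql
{app="center"}
  |= "query"
  | pattern `<_> uid=<uid> cost=<cost>ms <_>`
  | cost > 500
  | line_format "uid={{.uid}} took {{.cost}}ms"
```

The stream selector and line filter narrow down the lines, `pattern` extracts fields from each line, the label filter keeps only slow queries, and `line_format` rewrites the displayed output.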

Alarm and hierarchical response mechanism

The above figure shows the alerting and hierarchical response process, which consists of alert configuration, personnel scheduling, alert triggering, emergency handling, post-incident review, and mechanism optimization, in that order. The mechanism optimization finally feeds back into the alert configuration, forming a closed loop.

This process was previously handled by a self-built alerting system: scheduled tasks checked metrics and then called a DingTalk webhook or the webhooks of other O&M systems to send alerts. This approach has several deficiencies:

1. You are responsible for the stability of the self-built system yourself. If the alerting system is less stable than the system it monitors, its existence is meaningless. In addition, as the number of regions in the cluster grows, the configuration becomes more and more complex, making it hard to maintain and to ensure the configuration takes effect globally.

2. Relying on manual scheduling makes it easy to miss a shift or lack a backup.

3. The trigger conditions in the alert-triggering stage are very simple, and it is difficult to add additional business logic to the trigger link, such as dynamic thresholds or dynamic labeling.

4. In the emergency handling stage, notifications are sent very frequently and cannot be claimed or closed. When the system has a problem, the same kind of alert is sent to the group at high density, and the inability to actively mute alerts is another defect.

5. There is no data to support the post-incident review, and the process cannot be optimized based on statistics of past alerts.

Therefore, we chose to build the alerting system on ARMS. ARMS's powerful alerting and hierarchical response mechanism has brought us a lot of convenience:

1. Global effectiveness of alert templates: alert rules only need to be configured once to take effect across different clusters. Without templates, alerts for each metric would have to be configured in every cluster separately; with templates, the PromQL or AlertRule of an alert can be conveniently applied to every region cluster in the world.

2. Alert scheduling and dynamic notification: the system implements shift rotation dynamically, which is more reliable than manual scheduling.

3. Event processing flow and alert enrichment: with the event processing flow of the ARMS alert center and the alert enrichment function, alerts can be dynamically labeled and processed hierarchically after triggering. As shown in the figure above, an alert can be labeled with a priority, a higher-priority alert can be upgraded to P1, and the alert recipient can be modified dynamically.

To achieve this, a data source is needed to provide the basis for labeling. The alert O&M center console has a data source function: when an alert is triggered, the data source is called through an HTTP or RPC request, and the labeling result is fetched from the HTTP URL. The interface is implemented mainly by writing code online with the IFC lightweight tool. The code reads the configuration items in the ACM configuration center and exposes an HTTP interface for the alert O&M center to call dynamically.
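As a sketch of such a labeling data source (the path, request/response fields, and priority rule here are illustrative assumptions, not the actual interface; in the article the rules come from ACM, which a hard-coded map stands in for):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// enrichRequest is an assumed shape for the payload the alert center
// sends when it calls the data source.
type enrichRequest struct {
	AlertName string `json:"alertName"`
	Cluster   string `json:"cluster"`
}

// enrichResponse carries the labeling result back to the alert center.
type enrichResponse struct {
	Priority string `json:"priority"`
}

// priorityFor stands in for reading rules from a config center:
// alerts from critical clusters are escalated to P1.
func priorityFor(req enrichRequest) string {
	critical := map[string]bool{"prod-eu-central": true} // illustrative rule
	if critical[req.Cluster] {
		return "P1"
	}
	return "P3"
}

// enrichHandler decodes the request and returns the priority label.
func enrichHandler(w http.ResponseWriter, r *http.Request) {
	var req enrichRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(enrichResponse{Priority: priorityFor(req)})
}

func main() {
	// Exercise the handler through an in-process test server.
	srv := httptest.NewServer(http.HandlerFunc(enrichHandler))
	defer srv.Close()

	body := strings.NewReader(`{"alertName":"HighErrorRate","cluster":"prod-eu-central"}`)
	resp, _ := http.Post(srv.URL, "application/json", body)
	var out enrichResponse
	json.NewDecoder(resp.Body).Decode(&out)
	fmt.Println(out.Priority) // prints "P1"
}
```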

After that, the event processing flow must also be configured: pass the required information to the above interface through matching and updating, return the priority, and finally stamp it onto the alert.

4. Claiming, closing, and muting alerts: ARMS provides practical functions such as claim, close, follow, and mute, significantly improving alert quality.

5. Statistics on alert claiming and acceptance rate: for the review, it is necessary to know how many alerts each person handled, the handling time, and the mean time to recovery. With the claim, close, recover, and mute mechanisms in place, the ARMS alert center records an event log in the background, and analysis of this log provides useful review information.

After receiving alert information, users want to handle the problem through a GUI, so we introduced a white-screen O&M tool chain based on Grafana. The principle is to introduce dynamic information when configuring a dashboard and splice it into the dashboard in the form of links.

We have various internal systems. Without proper link splicing, we would have to assemble URLs by hand or search manually, which is very inefficient. By embedding links in Grafana, O&M actions are solidified into jumps between links, which helps efficiency greatly. All the work can be completed within the single Grafana tool set, O&M actions are standardized and codified, and the chance of human error is reduced.

Summary and future work

First, optimizing alert accuracy and takeover rate. At present there is no good way to make effective use of duplicate alert information. In the future we will try to adjust unreasonable alert thresholds in time based on alert accuracy and takeover rate statistics, and may also try multi-threshold alerting, for example, one severity level when the value is between A and B and another when it is above B.

Second, linkage between multiple data types. When troubleshooting, besides Metrics, Trace, and Log, there is also profiler information such as CPU flame graphs. At present the linkage between such information and the observable data is weak; improving it can make troubleshooting more efficient.

Third, controlling instrumentation cost. For external customers, instrumentation cost translates directly into the cost of using Alibaba Cloud. We will regularly carry out targeted governance of self-monitoring metrics, divergent label dimensions, and so on, cleaning up useless dimensions so as to keep instrumentation cost low.
