Industry SaaS Microservice Stability Guarantee in Practice


In 2017, Cindy, a Twitter engineer, published an article entitled "Monitoring and Observability", which first brought the term "observability" to developers' attention and half-jokingly drew a distinction between observability and monitoring. In software products and services, monitoring tells us whether a service is functioning properly, while observability tells us why it is not.

As the Google Trends graph shows, interest in observability is increasing year by year. Observability is also regarded as a system attribute, one that systems will increasingly be expected to have from the design and development stage.

Observability development trends

After 2020, search interest in observability surged, largely because site reliability engineering (SRE) became more widespread and major Chinese internet companies created SRE positions with corresponding hiring criteria, which drew more attention to observability in China. It also means that more and more foundational services face stability challenges, and providing observability is an important means of addressing them.

The lower left corner of the image above shows global search trends for observability; search interest in China is high.

Definition of Observability

Observability is a mathematical concept from control theory, introduced by the Hungarian-born engineer Rudolf E. Kálmán. It refers to the extent to which a system's internal state can be inferred from its external outputs. In other words, with observability we should be able to work out the details of what is happening inside a system from the data it outputs.

Difficulties and challenges

Business growth and surging stability requirements

F6 Automotive Technology is an internet platform company focused on digitalizing the automotive aftermarket, and is currently a leader in that industry. As the business has grown rapidly, the number of merchants F6 supports has increased dozens of times over in a short period. At the same time, F6 has gradually launched consumer-facing (C-end) services such as technician analysis and data query, which place significantly higher demands on stability.

The Role of Conway's Law

Conway's Law is a guiding principle in the history of IT for splitting systems into microservices: any organization's system design ends up mirroring its organizational structure. As the business expands, Conway's Law causes the way microservices are split to converge on the organizational structure. Business growth leads to the division of departments, and subsequent microservice design follows the organizational structure closely. Even if the two are not aligned at first, the microservices will gradually conform to the organizational structure over time.

Although this convergence makes communication around the system more efficient, it also brings many distributed-system problems. For example, when microservices interact, no one has a complete, global understanding of the service as a whole. What developers most want is to troubleshoot a distributed system as efficiently as a single-machine system, which pushes us to shift from a server-centric approach to a call-chain-centric approach.

Increased demand for stability

F6 initially built its business systems in a siloed ("chimney") style. Monolithic applications are relatively simple, but they have many scalability and maintainability problems. For example, all development happens in one system, code conflicts are frequent, and it is hard to determine when a release can be made and how much business might be lost because of it. As a result, more and more situations called for splitting into microservices, yet splitting and inter-service invocation produce very complex and cumbersome call chains. As shown on the right of the figure above, the call chain is almost impossible to analyze by hand.

So how can we make online troubleshooting as easy as possible?

Evolution of observability

Traditional monitoring + microservice log collection

• ELK Stack for log collection and query, ElastAlert for log alerting

Traditional monitoring and microservice log collection typically use the ELK Stack. ELK is an acronym for three open source projects: Elasticsearch, Logstash, and Kibana.

We rely heavily on ELK to collect microservice logs. We also use ElastAlert, an open source alerting component built on Elasticsearch (ES), whose main function is to query ES for data that matches configured rules and raise alerts on it.

The figure above shows how day-to-day queries work on top of log collection. For example, developers query online logs through the pipeline, ElastAlert pulls abnormal data out of the ES logs using its matching rules and raises alerts, and Kibana is used to run queries and first narrow down where in the system the anomaly occurred.
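
The rule-matching idea can be illustrated with a small sketch. This is not ElastAlert's implementation or configuration format: an in-memory list stands in for the results of an ES query, and the names `LogEvent`, `frequency_rule`, and the service names are hypothetical. It shows one common kind of rule, alerting when a service logs too many errors within a time window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class LogEvent:
    timestamp: datetime
    service: str
    level: str
    message: str


def frequency_rule(events: List[LogEvent], level: str, threshold: int,
                   window: timedelta) -> List[str]:
    """Return services with at least `threshold` events of `level` inside any `window`."""
    by_service: Dict[str, List[datetime]] = {}
    for e in events:
        if e.level == level:
            by_service.setdefault(e.service, []).append(e.timestamp)

    alerting = []
    for service, stamps in by_service.items():
        stamps.sort()
        start = 0
        for end in range(len(stamps)):
            # Move the left edge until the span fits inside `window`.
            while stamps[end] - stamps[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                alerting.append(service)
                break
    return sorted(alerting)


# Hypothetical sample data standing in for an ES query result.
base = datetime(2023, 1, 1, 12, 0, 0)
sample_events = [
    LogEvent(base + timedelta(seconds=10 * i), "order-service", "ERROR", "timeout")
    for i in range(5)
] + [
    LogEvent(base, "user-service", "ERROR", "npe"),
    LogEvent(base + timedelta(minutes=30), "user-service", "ERROR", "npe"),
    LogEvent(base, "order-service", "INFO", "ok"),
]

alerts = frequency_rule(sample_events, level="ERROR", threshold=3,
                        window=timedelta(minutes=1))
print(alerts)
```

Here only the service with a burst of errors inside one minute is reported; two errors thirty minutes apart do not trigger the rule.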

Architecture upgrade + introduction of observability

• Grafana dashboards + Zorka for JVM monitoring

As the business developed, the demands on logging gradually increased. For example, many teams needed to configure their own alerting rules. We therefore introduced Grafana to gradually replace the query functions of Kibana and Zabbix. Logs can be queried through Grafana's ES plug-in, and Grafana's alerting function replaces the original ElastAlert. Grafana can also drive a more intuitive visual display on a large screen.

Beyond logs, we also wanted to collect Java application metrics, so we introduced the open source Zorka component. Zorka integrates easily with Zabbix: the information it collects is reported to Zabbix for display. Zabbix data can in turn be surfaced through the Grafana Zabbix plug-in, so that all application dashboards and panels end up in the Grafana interface.

Zorka works in a similar way to the Zabbix Java gateway. It is attached to the Java process automatically as a Java agent and collects metrics such as common application container statistics and request counts, which initially met our need to observe Java processes.

Cloud native transformation

• The orchestration capabilities of K8s and microservices complement each other, and trace components are urgently needed

As the number of microservices kept growing, the operation and maintenance cost of the traditional approach became higher and higher. We therefore started a cloud native transformation.

The first step of the cloud native transformation was writing K8s readiness probes and liveness probes. Liveness probes improve a service's ability to self-heal: after an OOM occurs, the service can recover automatically and start a new instance, so that it keeps serving data normally. In addition to K8s, we also introduced Prometheus and ARMS application monitoring.
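
As an illustration of what the application side of such probes might look like, here is a minimal sketch. The paths `/healthz` and `/ready`, the `AppState` class, and its fields are assumptions for this example, not F6's actual endpoints; the point is only that liveness and readiness answer different questions.

```python
from http.server import BaseHTTPRequestHandler


class AppState:
    """Hypothetical application state consulted by the probes."""

    def __init__(self) -> None:
        self.started = False     # dependencies (database, cache) are initialised
        self.unhealthy = False   # set when the process can no longer make progress


def probe_status(path: str, state: AppState) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: a failing answer makes the kubelet restart the container.
        return 503 if state.unhealthy else 200
    if path == "/ready":
        # Readiness: a failing answer removes the Pod from Service endpoints.
        return 200 if state.started and not state.unhealthy else 503
    return 404


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    """HTTP handler that serves the two probe paths."""

    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()
```

The handler could be served with `http.server.HTTPServer(("", 8080), ProbeHandler).serve_forever()` and referenced from the `livenessProbe` and `readinessProbe` `httpGet` fields of the container spec; the port is likewise an assumption.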

Prometheus, the second project hosted by CNCF after K8s, has established itself as an authority across the metrics field. ARMS application monitoring, Alibaba Cloud's flagship commercial APM product, fits the cloud native approach and gives us trace functionality without any code changes from developers. More importantly, the Alibaba Cloud team keeps iterating and supports more and more middleware, so we believe it will become a powerful diagnostic tool.

• JMX Exporter quickly adds JVM information display for cloud native Java components

After the cloud native transformation, the monitoring model also changed. The earliest monitoring model was push: Zorka was deployed on the same machine every time, so it had a fixed host. After moving to the cloud, containerization means Pods are no longer fixed, and applications may scale out and in. We therefore gradually switched the monitoring model from push to pull, which matches Prometheus's collection model, and gradually removed Zorka from the observability system.

We did not use ARMS to collect JMX metrics directly because ARMS does not cover all of our online and offline Java applications. The applications that are not covered also need JVM data collection, and ARMS costs somewhat more. For cost reasons, we therefore did not put everything on ARMS and chose the JMX Exporter component instead.

JMX Exporter is one of the exporters provided by the official Prometheus community. Running as a Java agent, it uses the Java JMX mechanism to read JVM information and converts the data directly into a metrics format that Prometheus understands, so that Prometheus can scrape it. The corresponding ServiceMonitor is registered through the Prometheus Operator to complete metric collection.
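
To make the "metrics format that Prometheus understands" concrete, the sketch below renders one sample in the Prometheus text exposition format, which is what a pull-based scrape of a `/metrics` endpoint returns. It is not JMX Exporter code: the function `render_metric` and the sample value are hypothetical, and label-value escaping is left out for brevity.

```python
from typing import Dict, Optional


def render_metric(name: str, value: float,
                  labels: Optional[Dict[str, str]] = None,
                  help_text: str = "", metric_type: str = "gauge") -> str:
    """Render one sample in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical JVM reading; a real exporter would obtain it from JMX MBeans.
heap_used = render_metric(
    "jvm_memory_bytes_used", 123456.0, {"area": "heap"},
    help_text="Used bytes of a given JVM memory area.",
)
print(heap_used)
```

Because the application only exposes this text and Prometheus decides when to fetch it, a Pod that is rescheduled or scaled needs no fixed host, which is the property the pull model relies on.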

• Use the configuration center to deliver alerts precisely to application owners

As the company's business grew rapidly, headcount surged, microservices multiplied, and the number of developers and alerts rose sharply. To improve alert delivery and response rates, we reused the multi-language SDK of the Apollo configuration center and developed, in Go, a precise application alerting service based on Apollo. The overall process is as follows:

Grafana collects ES alerts and alerts from other scenarios, and each alert is associated with an application through metrics. The alert is then forwarded to the precise alerting service written in Go mentioned above. The service resolves the alert to the corresponding application and, based on the application, obtains information such as the owner's name and mobile phone number from the Apollo configuration center. With this information the alert is sent through DingTalk, which greatly improves the message read rate.
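
The routing step described above can be sketched as follows. The original service is written in Go and reads owners from Apollo; this Python sketch replaces the configuration center with a plain dictionary, and the label key `app`, the owner data, and the message fields are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Owner:
    name: str
    mobile: str


# Stand-in for the configuration center: application name -> owner.
# In the article this data lives in Apollo; here it is a plain dict.
OWNERS: Dict[str, Owner] = {
    "order-service": Owner("Alice", "13800000000"),
}
FALLBACK = Owner("on-call", "13900000000")


def resolve_app(alert: dict) -> Optional[str]:
    """Extract the application name from the alert labels (label key is assumed)."""
    return alert.get("labels", {}).get("app")


def route_alert(alert: dict) -> dict:
    """Build a notification addressed to the owner of the alerting application."""
    app = resolve_app(alert)
    owner = OWNERS.get(app, FALLBACK)
    return {
        "text": f"[{app or 'unknown'}] {alert.get('title', '')}",
        "at_mobiles": [owner.mobile],  # whom to mention in the chat group
        "owner": owner.name,
    }


msg = route_alert({"title": "ES error rate high", "labels": {"app": "order-service"}})
fallback_msg = route_alert({"title": "disk full", "labels": {}})
```

Falling back to an on-call contact when no owner is configured is a design choice of this sketch; it keeps alerts without a known application from being dropped silently.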

• ARMS: non-intrusive trace support

In addition, we introduced ARMS, Alibaba Cloud's Application Real-Time Monitoring Service, which supports most middleware and frameworks, such as Kafka, MySQL, and Dubbo, without any code modifications. After the cloud native transformation, adding annotations to the Deployment is all that is needed to load the relevant probes, which makes microservices very easy to maintain. ARMS also provides a fairly complete trace view that shows the entire trace of a call across online application nodes together with its logs, as well as a Gantt chart view, dependency topology diagrams, upstream and downstream latency charts, and other data.

Observability upgrade

• Log, Trace, Metrics: a conceptual upgrade

Observability for microservices has taken off across China, and F6 has upgraded its thinking on observability accordingly. Observability as widely promoted in the industry rests on three pillars: log events, distributed tracing, and metrics monitoring. Monitoring is needed in every era, but it is no longer the core requirement. As the figure above shows, monitoring covers only alerting and application overviews, whereas observability also needs to cover troubleshooting and dependency analysis.

The earliest users of monitoring were operations staff, but for the most part they could only handle system-level service alerts. Across a microservice landscape, more problems arise between applications, and those need troubleshooting. For example, slow requests in a service may be caused by code issues, lock contention, an undersized thread pool, or too few connections. All of these problems ultimately look the same: the service is slow and cannot respond. With so many possibilities, observability is needed to locate the true root cause. However, locating the root cause is not the real requirement. The real requirement is to use observability to identify the node where the problem lies and then take measures such as replacing the affected component, circuit breaking, or rate limiting to keep the overall SLA as high as possible.

Observability is also a good way to analyze problems, for example which online services are slow, how much time is spent on each node, and how long each request takes. It can also answer dependency questions, such as whether a service's dependencies are reasonable and whether the call chains between dependent services are healthy.

As the number of applications grows, so do the demands for observability and stability. We therefore developed a simple root cause analysis system that classifies and clusters current service logs using a text similarity algorithm.
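
The article does not say which similarity algorithm is used, so the following is only a sketch of the general idea, assuming a simple greedy clustering based on `difflib.SequenceMatcher` from the Python standard library. The function names, the threshold, and the sample log lines are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize(line: str) -> str:
    """Mask numbers and hex ids so that variable parts do not lower similarity."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<N>", line.lower())


def cluster_logs(lines: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedy clustering: a line joins the first cluster whose first member is similar enough."""
    clusters: List[List[str]] = []
    for line in lines:
        for cluster in clusters:
            ratio = SequenceMatcher(None, normalize(cluster[0]), normalize(line)).ratio()
            if ratio >= threshold:
                cluster.append(line)
                break
        else:
            clusters.append([line])
    # Largest clusters first: the most frequent pattern is a natural starting point.
    return sorted(clusters, key=len, reverse=True)


logs = [
    "Timeout after 3000 ms calling inventory-service",
    "Timeout after 5000 ms calling inventory-service",
    "NullPointerException at OrderController line 42",
    "Timeout after 3100 ms calling inventory-service",
]
clusters = cluster_logs(logs)
for group in clusters:
    print(len(group), group[0])
```

Grouping thousands of raw lines into a handful of patterns lets an engineer see at a glance which kind of error dominates during an incident.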

• Simple root cause analysis goes online

The figure above shows a typical failure caused by an upgrade of ONS, a service that our services depend on, as identified by intelligent analysis of captured logs. Long experience in SRE shows that changes also cause significant damage to online stability. If changes could also be collected into the observability system, it would bring great benefits. Likewise, if information about the ONS upgrade could be collected into the observability system and the root cause analyzed by correlating various events, it would greatly help system stability and troubleshooting.

• ARMS supports exposing the traceId in the response header

F6 and the ARMS team have also worked closely together to explore observability best practices. ARMS recently released a new feature that exposes the traceID directly in the HTTP header, so it can be written to the access-layer log and retrieved through Kibana.

When a customer encounters a fault, they report the log information together with the traceID to technical support. Developers can then use the traceID to quickly locate the cause of the problem and the upstream and downstream calls, because the traceID is unique across the entire call chain and works very well as a search condition.

ARMS also supports passing the traceID through directly via MDC. It supports the mainstream Java logging frameworks, including Logback, Log4j, and Log4j2, and can also output the traceID for Python applications.

The figure above shows the ARMS console configuration for a typical log output scenario. You can enable the association between business logs and the traceID, which is supported for various components. You only need to define eagleeye_TraceID in the logging system to output the traceID.
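
The mechanism behind this kind of log and trace correlation can be sketched with the Python standard library. This example does not use ARMS or MDC: a `contextvars.ContextVar` plays the role of the per-request context, and the trace ID value, logger name, and log format are assumptions for illustration.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request being handled (similar in spirit to MDC).
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


stream = io.StringIO()  # in-memory sink so the example is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s [%(trace_id)s] %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("order")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(trace_id: str) -> None:
    # In practice the ID would come from the tracing agent or a request header.
    token = trace_id_var.set(trace_id)
    try:
        logger.info("order created")
    finally:
        trace_id_var.reset(token)


handle_request("abc123")
logger.info("outside any request")
output = stream.getvalue()
print(output)
```

Because the ID is attached in the logging layer, business code keeps calling the logger as before, which is why this style of correlation needs few or no code changes.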

This scheme is relatively simple and requires few changes from developers; in some cases it needs no changes at all. It also greatly strengthens the correlation between logging and tracing, reducing data silos.

• ARMS supports an operation and maintenance alerting platform

ARMS provides a lot of data that helps further reduce MTTR, but how that data reaches SRE, DevOps, or development and operations staff still requires some thought.

ARMS therefore launched an operation and maintenance alerting platform that handles events, such as alert forwarding and routing, in a visual way. It supports silencing and grouping, as well as multiple integration methods. The integrations F6 currently uses include Prometheus, Buffalo, ARMS, and Cloud Monitor; Cloud Monitor covers a lot of data, including ONS and Redis. Developers or SREs can simply claim the corresponding events in the DingTalk group. The platform also supports reports, scheduled reminders, and event escalation, which makes later review and improvement easier.

The figure above is a screenshot of the interface for handling online problems. For example, an alert in the DingTalk group shows who handled the last similar alert, the alert list, and the corresponding event handling flow. The platform also provides a filtering function that can split content and fill content templates through field enrichment or match-and-update, which can be used for precise alerting. For example, once the application name is identified, it can be associated with the owner or alert recipients of that application. This feature will gradually replace the Apollo SDK application written in Go mentioned earlier.

• Modification-free agent injection in the Java ecosystem - JAVA_TOOL_OPTIONS

We also borrowed ARMS's modification-free method of injecting the agent. ARMS injects its agent files through an initContainer and also mounts a volume at home/admin/.opt to store logs. It is precisely because of the initContainer that it can upgrade automatically.

In the initContainer, ARMS calls its own interface to obtain the latest version of the ARMS agent, downloads that version of the Java agent, and places it in the mounted directory. The directory is shared with the application container in the same Pod through a shared volume, which makes the Java agent available to it. The key point in the process is that the Java agent is attached through JAVA_TOOL_OPTIONS.

Following this method, we reproduced the process ourselves, using the WorkloadSpread component of OpenKruise to modify Deployments by patching. The simplest practice is to annotate the corresponding Deployment for WorkloadSpread, so that neither R&D nor SRE teams have to handle it: as long as the corresponding CRD is written, it injects JAVA_TOOL_OPTIONS directly (see the code in the lower right corner of the figure above). The approach has a fairly wide range of uses, such as application traffic replay and automated testing.
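
The shape of such a patch can be sketched as a function over a Deployment manifest held as a Python dictionary. This is not the ARMS or OpenKruise implementation: the volume name, mount path, agent file name, and image names are all hypothetical. What it shows is the combination the text describes, namely an initContainer that provides the agent, a shared volume, and a `JAVA_TOOL_OPTIONS` environment variable carrying the `-javaagent` flag.

```python
import copy

AGENT_VOLUME = "agent-share"            # hypothetical volume name
AGENT_DIR = "/agent"                    # hypothetical mount path
AGENT_JAR = AGENT_DIR + "/agent.jar"    # hypothetical agent file name


def inject_java_agent(deployment: dict, init_image: str) -> dict:
    """Return a copy of the Deployment with an agent initContainer, a shared
    emptyDir volume, and JAVA_TOOL_OPTIONS set on every container."""
    patched = copy.deepcopy(deployment)
    pod_spec = patched["spec"]["template"]["spec"]
    mount = {"name": AGENT_VOLUME, "mountPath": AGENT_DIR}

    pod_spec.setdefault("volumes", []).append({"name": AGENT_VOLUME, "emptyDir": {}})
    pod_spec.setdefault("initContainers", []).append(
        {"name": "agent-init", "image": init_image, "volumeMounts": [mount]}
    )
    flag = f"-javaagent:{AGENT_JAR}"
    for container in pod_spec["containers"]:
        container.setdefault("volumeMounts", []).append(dict(mount))
        env = container.setdefault("env", [])
        existing = next((e for e in env if e["name"] == "JAVA_TOOL_OPTIONS"), None)
        if existing is None:
            env.append({"name": "JAVA_TOOL_OPTIONS", "value": flag})
        elif flag not in existing.get("value", ""):
            # Keep options that were already set and append the agent flag.
            existing["value"] = (existing.get("value", "") + " " + flag).strip()
    return patched


deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "example/app:1.0"}
    ]}}}
}
patched = inject_java_agent(deployment, init_image="example/agent-init:latest")
```

Since the JVM reads `JAVA_TOOL_OPTIONS` at startup, the application image and its start command stay untouched, which is what makes the injection modification-free.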

• Prometheus Exporter

In addition to commercial products such as ARMS, we actively embrace open source and the Prometheus community, and have adopted a number of exporter components, including SSL Exporter and Blackbox Exporter, which greatly improve the observability of the whole system. Blackbox Exporter can be used for black-box probing, such as checking whether HTTP requests, HTTPS requests, DNS, and TCP are working normally. A typical use is to check whether the current service entry address is healthy. SSL certificate problems are more common; with SSL Exporter, you can poll regularly to check whether a certificate has expired, further improving observability.
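
The certificate check itself is simple to sketch. The code below is not SSL Exporter; it only illustrates the expiry calculation, using `ssl.cert_time_to_seconds` from the Python standard library. The function names, the 30-day warning threshold, and the sample dates are assumptions.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """`not_after` uses the format of ssl.SSLSocket.getpeercert()['notAfter']."""
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires - now.timestamp()) / 86400


def certificate_status(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate as expired, expiring_soon, or ok."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "expired"
    return "expiring_soon" if remaining <= warn_days else "ok"


# Fixed reference time so the example is deterministic.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
status_soon = certificate_status("Jan 15 00:00:00 2024 GMT", now)
status_ok = certificate_status("Jun  1 00:00:00 2024 GMT", now)
status_expired = certificate_status("Dec  1 00:00:00 2023 GMT", now)
```

In a live probe, the `notAfter` string would come from the peer certificate of a TLS connection to the service entry address; polling it on a schedule turns a silent expiry into an alert raised weeks in advance.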

• Cost observation

Beyond the observability of day-to-day services, we have also carried out optimization projects such as cost observability. In cloud native environments, the open source Kubecost component can be used for cost optimization: it outputs resource utilization and reports directly, which are fed back to developers for optimization. It can even be used to find out whether CPU and memory are allocated in a sensible proportion, so that resources are allocated as reasonably as possible.
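
The kind of figures such a report contains can be sketched without any Kubecost API. In the example below, the `Workload` fields, the 30% threshold, and the sample numbers are hypothetical; utilization is taken as usage divided by request, and the memory-to-CPU ratio is reported so that unusual proportions stand out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    name: str
    cpu_request_cores: float
    cpu_used_cores: float
    mem_request_gib: float
    mem_used_gib: float


def utilization_report(workloads: List[Workload], low: float = 0.3) -> List[dict]:
    """Compute per-workload utilization and flag likely over-provisioning."""
    report = []
    for w in workloads:
        cpu_util = w.cpu_used_cores / w.cpu_request_cores
        mem_util = w.mem_used_gib / w.mem_request_gib
        report.append({
            "name": w.name,
            "cpu_util": round(cpu_util, 2),
            "mem_util": round(mem_util, 2),
            "gib_per_core": round(w.mem_request_gib / w.cpu_request_cores, 2),
            # Both resources far below their requests: a candidate for downsizing.
            "over_provisioned": cpu_util < low and mem_util < low,
        })
    return report


report = utilization_report([
    Workload("order-service", 4.0, 0.8, 8.0, 2.0),
    Workload("report-job", 2.0, 1.5, 4.0, 3.0),
])
```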

Looking ahead

Kubernetes one-stop observability based on eBPF

As cloud native components move into deeper territory, many issues no longer stay at the application level but occur more often at the system and network levels, which requires lower-level data for tracing and troubleshooting. eBPF makes it easier to answer the questions posed by the golden signals from Google SRE: latency, traffic, errors, and saturation.

For example, when Redis goes through a brief disconnection and failover, half-open TCP connections may form and affect the business. Likewise, when TCP connections are being established, whether the backlog is sized reasonably can be concluded from the data.
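
However the raw data is collected, whether by eBPF or by application-level instrumentation, the four golden signals reduce to a few simple calculations. The sketch below is plain Python rather than eBPF code; the `Request` record, the nearest-rank percentile, and the use of in-flight requests over capacity as the saturation measure are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile; p is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_s: float,
                   in_flight: int, capacity: int) -> dict:
    """Summarize latency, traffic, errors, and saturation for one time window."""
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,  # e.g. busy workers / pool size
    }


# 99 fast successful requests plus one slow failure in a 10-second window.
reqs = [Request(latency_ms=10.0 * i, status=200) for i in range(1, 100)]
reqs.append(Request(latency_ms=5000.0, status=503))
signals = golden_signals(reqs, window_s=10, in_flight=40, capacity=50)
```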


Chaos engineering encourages and makes use of observability, trying to help users proactively find and overcome weaknesses in their systems. In June 2020, CNCF proposed a special interest group on observability. In addition to the three pillars, the group also covers chaos engineering and continuous optimization.

The community still debates whether chaos engineering belongs under observability, yet the CNCF observability special interest group has included both chaos engineering and continuous optimization. We think CNCF's approach is reasonable. Chaos engineering can be regarded as an analysis tool for observability, but observability is its essential prerequisite. Imagine a failure occurring during a chaos engineering experiment: if we cannot even tell whether the experiment caused it, that creates serious trouble.

Therefore, observability can be used to keep the blast radius as small as possible and to locate problems, while chaos engineering continuously improves the system by exposing its weaknesses in advance, so that system stability is better safeguarded.


OpenTelemetry is an open source framework formed by merging multiple projects. We need a more end-user-oriented, unified observability view: for example, we expect to correlate and tag logging, metrics, and tracing data, minimize data silos, and improve overall observability by associating multiple data sources. We then use observability to minimize online troubleshooting time and buy time for restoring business services.
