How to select a monitoring system

Various monitoring systems

Monitoring has always been a core component of IT systems, responsible for discovering problems and helping to locate their causes. Traditional operations teams, SREs, DevOps engineers, and developers all need to pay attention to the monitoring system and participate in building and optimizing it. Monitoring systems appeared as early as the era of mainframe operating systems and basic Linux metrics, and they have been evolving ever since. Today there are no fewer than 100 monitoring systems to be found, and they can be classified in many different ways, for example:

1. Monitoring object: general-purpose (a generic monitoring model applicable to most monitored objects) or special-purpose (customized for a specific function, such as Java JMX systems, CPU over-temperature protection, hard disk power-failure protection, UPS switchover systems, switch monitoring, dedicated-line monitoring, etc.)

2. Data acquisition method: Push (collectd, Zabbix, InfluxDB) or Pull (Prometheus, SNMP, JMX)

3. Deployment mode: coupled (deployed together with the monitored system); standalone (single-instance deployment); distributed (scalable); SaaS (many commercial vendors provide SaaS offerings that require no deployment)

4. Data query method: API only (data is available only through specific APIs); DSL (supports some computation, such as PromQL and GraphQL); SQL (standard SQL or SQL-like)

5. Commercial attributes: open source and free (such as Prometheus and the standalone version of InfluxDB); open source with commercial editions (such as the InfluxDB cluster version and Elasticsearch X-Pack); closed-source commercial (such as DataDog, Splunk, and AWS CloudWatch)

Pull or Push

There are quite a few options for building a company-internal monitoring platform, whether self-built from open source solutions or based on commercial SaaS products. In either case, the actual implementation must consider how the data reaches the monitoring platform, or how the monitoring platform obtains the data. This comes down to the choice of data acquisition mode: Pull or Push?

A Pull-based monitoring system, as the name implies, actively fetches metrics from the monitored objects, which must therefore be remotely accessible. A Push-based monitoring system does not fetch data itself; instead, the monitored objects actively push their metrics to it. The two approaches differ in many respects. When building or selecting a monitoring system, you need to understand the advantages and disadvantages of both in advance and choose the appropriate scheme; a blind implementation can be disastrous for the stability of the monitoring system and for deployment, operation, and maintenance costs.

Pull vs Push Overview

The following sections compare the two models from several aspects; to save readers' time, the comparison is also summarized briefly in a table.

Principle and architecture comparison

As shown in the figure above, the core of data acquisition in the Pull model is the Pull module, which is generally deployed together with the monitoring backend (as in Prometheus). The core components include:

1. A service discovery system, including host discovery (usually relying on the company's own CMDB), application service discovery (such as Consul), and PaaS service discovery (such as Kubernetes); the Pull module needs to be able to connect to these service discovery systems

2. The Pull core module, which, besides integrating with service discovery, generally uses a common protocol to pull data remotely and typically supports configurable pull intervals, timeouts, metric filtering/renaming, and simple processing

3. An application-side SDK that listens on a fixed port and exposes the metrics to be pulled (see the sketch after this list)

4. Because various middleware and other systems do not natively support the Pull protocol, corresponding Exporter agents must be developed to pull metrics from those systems and expose them through a standard Pull interface
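
As a concrete illustration of point 3, here is a minimal sketch of an application exposing a pull-able metrics endpoint with the Prometheus Go client (github.com/prometheus/client_golang); the port :8080 and the metric name app_requests_total are arbitrary choices for this example, not anything prescribed above.

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // requestsTotal is a sample counter; any pull module that speaks the
    // Prometheus exposition format can scrape it from /metrics.
    var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "app_requests_total",
        Help: "Total number of handled requests.",
    })

    func main() {
        http.Handle("/metrics", promhttp.Handler())
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            requestsTotal.Inc()
            w.Write([]byte("ok"))
        })
        // Listen on a fixed port so the Pull module or an exporter can reach us.
        http.ListenAndServe(":8080", nil)
    }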

The Push model is relatively simple:

1. A Push Agent, which collects metric data from various monitored objects and pushes it to the server; it can be deployed together with the monitored system or separately

2. A ConfigCenter (optional), which provides centralized dynamic configuration, such as monitoring targets, collection intervals, metric filtering, metric processing, remote targets, etc.

3. An application-side SDK that sends data directly to the monitoring backend or to a local agent (the local agent usually implements the same interface as the backend); a minimal sketch follows this list
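
As a rough sketch of how an application-side SDK (point 3) might push on a fixed interval, the loop below sends one sample to a local agent. The endpoint http://127.0.0.1:8086/write and the InfluxDB-style line format are just one possible convention assumed for the example; substitute whatever protocol your agent or backend actually accepts.

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    // pushOnce sends one metric sample to a local Push Agent over HTTP.
    // The path and line format are placeholders modeled on the InfluxDB v1
    // write API; adjust them to your own agent's protocol.
    func pushOnce(qps float64) error {
        line := fmt.Sprintf("app_qps value=%f %d\n", qps, time.Now().UnixNano())
        resp, err := http.Post("http://127.0.0.1:8086/write", "text/plain", bytes.NewBufferString(line))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        return nil
    }

    func main() {
        // Push on a fixed interval instead of waiting to be scraped.
        for range time.Tick(10 * time.Second) {
            if err := pushOnce(123.0); err != nil {
                fmt.Println("push failed:", err)
            }
        }
    }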

Summary: in terms of deployment complexity, the Pull model is overly complex for middleware and other third-party systems and carries high maintenance costs, so Push is more convenient there; for applications, the cost of exposing a metrics port versus actively pushing is roughly the same.

Pull's distributed solution

In terms of scalability, data collection in Push mode is naturally distributed and can scale out almost without limit as long as the monitoring backend can keep up. Scaling the Pull mode is more troublesome and requires:

1. Decoupling the Pull module from the monitoring backend and deploying it separately as an agent

2. Distributed coordination among the Pull agents. The simplest approach is sharding: obtain the list of monitored machines from the service discovery system, hash each machine, and take the hash modulo the shard count to determine which agent is responsible for pulling it (see the sketch after this list)

3. Adding a configuration center (optional) to manage each Pull agent
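
A minimal sketch of the hash-mod sharding mentioned in step 2, assuming each agent knows its own shard ID and the total shard count; the target addresses are made up for illustration.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // assignedToMe reports whether this agent (shardID out of shardCount)
    // should pull the given target, using simple hash-mod sharding over the
    // target list obtained from service discovery.
    func assignedToMe(target string, shardID, shardCount int) bool {
        h := fnv.New32a()
        h.Write([]byte(target))
        return int(h.Sum32())%shardCount == shardID
    }

    func main() {
        targets := []string{"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"}
        for _, t := range targets {
            if assignedToMe(t, 0, 2) { // this process is shard 0 of 2
                fmt.Println("pulling", t)
            }
        }
    }

Note that changing shardCount reassigns most targets, which is exactly the duplication/missing-data risk discussed next.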

Attentive readers will have noticed that this distributed approach still has some problems:

1. The single-point bottleneck still exists: all agents need to request the service discovery module

2. When the agents are scaled out, the assignment of monitoring targets changes, which easily leads to duplicated or missing data

Comparison of monitoring capabilities

Monitoring target liveness

Liveness checking is the first and most basic task of monitoring. It is relatively simple in the Pull mode: the Pull module knows directly whether the request to the target succeeded, and on failure it also knows some simple causes, such as a network timeout or the peer refusing the connection.
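
A rough sketch of how such a pull-side probe can classify failures, assuming a plain HTTP scrape with a 3-second timeout; the target URL is hypothetical.

    package main

    import (
        "errors"
        "fmt"
        "net"
        "net/http"
        "time"
    )

    // probe pulls a target's metrics endpoint and classifies the failure,
    // which is how a Pull module gets liveness information "for free".
    func probe(url string) string {
        client := &http.Client{Timeout: 3 * time.Second}
        resp, err := client.Get(url)
        if err != nil {
            var netErr net.Error
            if errors.As(err, &netErr) && netErr.Timeout() {
                return "down: timeout"
            }
            return "down: " + err.Error() // e.g. connection refused
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Sprintf("down: HTTP %d", resp.StatusCode)
        }
        return "up"
    }

    func main() {
        fmt.Println(probe("http://10.0.0.1:8080/metrics"))
    }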

The Push mode is more troublesome: a failed report from an application may mean the application crashed, the network had a problem, or the instance was migrated to another node. Because the Pull module interacts with service discovery in real time while Push does not, in the Push model only the server can consult service discovery again to determine the specific cause of the failure.

Data completeness calculation

The concept of data completeness is very important in large monitoring systems. For example, monitoring the QPS of a trading application with a thousand replicas requires aggregating the data from all thousand replicas. Without the concept of data completeness, an alarm configured to fire when QPS drops by 2% will produce false positives whenever network jitter delays the reports of just twenty-odd replicas by a few seconds. Therefore, alarm evaluation must also take data completeness into account.

The calculation of data completeness also depends on the service discovery module. The Pull model pulls data round by round, so the data is complete after a round of pulls; even if some pulls fail, the percentage of missing data is known.

In the Push model, each agent and application pushes on its own schedule, and the push intervals and network delays of the clients differ, so the server has to estimate data completeness from historical behavior, which is relatively costly.
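
One simple way to gate alarms on completeness, shown as a sketch: the counts and thresholds below are illustrative, not values taken from the article.

    package main

    import "fmt"

    // shouldAlert evaluates a drop-style alarm only when enough replicas have
    // reported in the current window; otherwise it suppresses the alarm to
    // avoid false positives caused by late or missing data points.
    func shouldAlert(received, expected int, qpsDropPercent, dropThreshold, completenessThreshold float64) bool {
        completeness := float64(received) / float64(expected)
        if completeness < completenessThreshold {
            return false // data not complete enough to judge
        }
        return qpsDropPercent >= dropThreshold
    }

    func main() {
        // 980 of 1000 replicas reported; QPS dropped 2.5%, the drop threshold
        // is 2%, and we require at least 99% completeness before alerting.
        fmt.Println(shouldAlert(980, 1000, 2.5, 2.0, 0.99)) // false: only 98% complete
    }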

Short-lifecycle/serverless application monitoring

In real scenarios there are many short-lifecycle/serverless workloads, especially where cost matters: jobs, elastic instances, serverless applications, and so on. For example, an elastic compute instance is started when a rendering task arrives and is destroyed and released immediately afterwards; machine learning training jobs, event-driven serverless workflows, and periodically executed jobs (such as resource cleanup, capacity checks, and security scans) are similar. These workloads usually have very short lifetimes (possibly seconds or even milliseconds), which the periodic Pull model can hardly monitor; generally the application must push its monitoring data actively.

To cope with short-lifecycle applications, pure Pull systems provide an intermediate layer (such as the Prometheus Pushgateway): it accepts active pushes from applications and exposes a Pull endpoint to the monitoring system. However, this adds the management, operation, and maintenance cost of extra intermediate layers; and because Push is being simulated over Pull, reporting latency increases, and the metrics of instances that have already disappeared must also be cleaned up.
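
For reference, a minimal sketch of a short-lived job pushing its result to a Prometheus Pushgateway with the Go client; the gateway address and the metric name are placeholders for the example.

    package main

    import (
        "log"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        // A short-lived job records its result once and exits; Prometheus later
        // pulls the value from the Pushgateway instead of from the job itself.
        duration := prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "batch_job_duration_seconds",
            Help: "Duration of the last batch job run.",
        })
        duration.Set(42.3)

        if err := push.New("http://pushgateway:9091", "batch_job").
            Collector(duration).
            Push(); err != nil {
            log.Fatal("push to gateway failed: ", err)
        }
    }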

Flexibility and coupling

In terms of flexibility, the Pull mode has some advantages: you can configure which metrics you want in the Pull module and apply simple calculations or secondary processing to them. But this advantage is only relative; a Push SDK/agent can be configured with the same parameters, and with the help of a configuration center that configuration is also easy to manage.

In terms of coupling, the Pull model is much more loosely coupled to the backend: the application only needs to expose an interface the backend understands, without caring which backend is connected or which metrics it needs. The division of labor is also clearer: application developers only need to expose the metrics that matter, and SREs (the monitoring system administrators) can fetch them. The Push model is more tightly coupled, since the application must be configured with the backend address and authentication information; however, with a local Push Agent, the application only needs to push to a local address, which keeps the cost low.

Comparison of operations, maintenance, and cost

Resource cost

In terms of overall cost, the difference between the two approaches is not significant, but the costs fall on different owners:

1. In Pull mode, the core consumption is on the monitoring system side, and the cost on the application side is low

2. In Push mode, the core consumption is on the pushing side and the Push Agents; the consumption on the monitoring system side is much smaller than in Pull mode

Operation and maintenance cost

From the operations perspective, the Pull mode costs more. The components that must be operated and maintained in Pull mode include the various exporters, service discovery, the Pull agents, and the monitoring backend; the Push mode only requires the Push Agent, the monitoring backend, and the configuration center (optional, and generally deployed together with the monitoring backend).

One thing to note: because the Pull mode has the server actively sending requests to the clients, cross-cluster network connectivity and the ACLs protecting the application side must be taken into account. The network requirements of Push are simpler: the server only needs to provide a domain name/VIP that every node can reach.

How to choose between Pull and Push

Among current open source solutions, the Pull mode is represented by the Prometheus family (called a family because a default single-instance Prometheus has limited scalability, and the community has built many distributed Prometheus solutions such as Thanos, VictoriaMetrics, and Cortex), while the Push mode is represented by InfluxDB's TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). Each has its advantages and disadvantages. In the cloud-native context, with the popularity of Prometheus under the leadership of CNCF and Kubernetes, much open source software has begun to expose a Prometheus-style Pull endpoint; at the same time, many systems were never designed to provide a Pull endpoint, and for monitoring them it is more reasonable to use a Push Agent.

However, whether the application itself should use Pull or Push has no universal answer; the choice must be based on the actual situation within the company. For example, if the company's cluster network is complex, Push is relatively simpler; short-lifecycle applications need Push; mobile applications can only use Push; and a system that already uses Consul for service discovery can easily expose a Pull endpoint.

Therefore, for a company-internal monitoring system, the best solution is to have both Pull and Push capabilities:

1. Use a Push Agent for host, process, and middleware monitoring

2. Use the Pull mode for Kubernetes and other systems that directly expose a Pull endpoint

3. Let applications choose Pull or Push according to their actual scenarios

SLS's strategy for Pull and Push

SLS currently supports unified storage and analysis of logs, metrics, and traces. Its time series monitoring scheme is compatible with the Prometheus format standard and also provides standard PromQL syntax. Facing hundreds of thousands of SLS users whose application scenarios vary greatly, it is impossible to meet every customer's needs with Pull or Push alone. Therefore, SLS does not follow a single route but supports both the Pull and Push models. In addition, towards the open source community and its agents, SLS's strategy is full compatibility with the open source ecosystem rather than building a closed one:

1. Pull model: fully compatible with Prometheus's pull (scrape) capability. You can use Prometheus's Remote Write to use Prometheus itself as the pull agent; VMAgent, which has the same scrape capability as Prometheus, can be used in the same way; and SLS's own agent, Logtail, can also perform Prometheus-style scraping

2. Push model: Telegraf currently has the most complete Push-Agent ecosystem in the monitoring industry. SLS's Logtail has Telegraf built in and can therefore support all of Telegraf's hundreds of monitoring plug-ins

Compared with agents such as VMAgent, Prometheus, and native Telegraf, SLS additionally provides the most urgently needed capabilities: an agent configuration center and agent monitoring. It can manage the collection configuration of each agent and monitor the running status of those agents on the server side, reducing operation and maintenance costs as much as possible.

Therefore, building a monitoring scheme with SLS in practice is very simple:

1. Create a MetricStore for the monitoring data on the SLS console (web page)
2. Deploy the Logtail agent (a one-line command)
3. Configure the collection of monitoring data on the console (both Pull and Push can be used)
