
Dubbo Metrics: Exploring the Measurement and Statistics Infrastructure of Dubbo

This article introduces Dubbo Metrics, a metrics library developed by Alibaba for Apache Dubbo and other frameworks and systems.

Real-time monitoring of services, to understand their current operational metrics and health status, is an indispensable part of any microservice system. Metrics, as an important component of microservices, provide a comprehensive data foundation for service monitoring. Recently, Dubbo Metrics version 2.0.1 was released. This article traces the origins of Dubbo Metrics and walks through its seven major improvements.

The Origins of Dubbo Metrics

Dubbo Metrics (formerly Alibaba Metrics) is a widely used base library for metric tracking within Alibaba Group. It has two versions, Java and Node.js; currently, the Java version is open source. Built in 2016, the internal version has gone through nearly three years of development and the test of the "Singles' Day" shopping carnival. It has become the de facto standard for microservice monitoring metrics within Alibaba Group, covering all layers of metrics from the system, JVM, and middleware to the application, and has formed a unified set of specifications for naming rules, data formats, tracking methods, and computation rules.

The Dubbo Metrics code is derived from Dropwizard Metrics 3.1.0. At the time, two main reasons led to the decision to fork it internally for customized development.

One reason was that the community was developing very slowly: the next version did not arrive until the third year after 3.1.0, and we were worried that the community could not respond to business needs in time. Another, more important reason was that 3.1.0 did not support multi-dimensional tags at the time; it could only use traditional metric naming such as a.b.c, which means Dropwizard Metrics could only perform measurements in a single dimension. In actual business, many dimensions cannot be determined in advance. Moreover, in a large-scale distributed system, after data statistics are completed the data needs to be aggregated along various business dimensions, such as by data center or group, which Dropwizard could not satisfy at the time either. For these reasons, we decided to fork a branch for internal development.

Improvements in Dubbo Metrics

Compared with Dropwizard Metrics, Dubbo Metrics has the following improvements:

1. Tag-Based Naming Rules Introduced

As described above, multi-dimensional tags have a natural advantage over traditional metric naming in dynamic tracking, data aggregation, and the like. For example, to compute the number of calls and RT of a Dubbo service named DemoService, the traditional naming method would produce dubbo.provider.DemoService.qps and dubbo.provider.DemoService.rt. For a single service this poses no big problem, but if a microservice application provides multiple Dubbo services at the same time, it becomes difficult to aggregate the QPS and RT across all services. Metric data inherently has a time-series attribute, so it is naturally stored in a time series database. To compute the QPS of all Dubbo services, you would need to find all metrics named dubbo.provider.*.qps and then sum them up; since fuzzy matching is involved, this is time-consuming for most databases to implement. If QPS and RT need to be computed at the finer Dubbo method level, the implementation becomes even more complicated.

For this reason, Dubbo Metrics introduces tag-based naming, in which a metric consists of the following parts:

  • Metric key: a string separated by single-byte dots, indicating the meaning of this metric;
  • Metric tag: defines the sharding dimensions of this metric; a single tag or multiple tags are acceptable;
  • Tag key: the name used to describe the dimension;
  • Tag value: the value used to describe the dimension.

Considering that the metrics produced by all of a company's microservices will be uniformly collected on the same platform for processing, Metric Keys should follow a single set of naming rules to avoid conflicts. The format is appname.category[.subcategory]*.suffix:

  • Appname: the name of the application;
  • Category: the category of this metric within the application; multiple words are connected by '_' and letters are lowercase;
  • Subcategory: the subcategory of this metric under a given category; multiple words are connected by '_' and letters are lowercase;
  • Suffix: the key suffix describing what this metric measures, such as counts, rates, or distributions.

In the preceding example, the same metric can be named dubbo.provider.service.qps{service="DemoService"}, where the first part of the name is fixed and unchanging, while the tags in braces can change dynamically and more dimensions can be added. For example, after adding a method dimension, the name becomes dubbo.provider.service.qps{service="DemoService",method="sayHello"}, and the machine's IP and data center information can also be added. Data stored this way fits naturally with time series databases, and aggregation, filtering, and other operations along any dimension can be easily realized on top of it.
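To make this concrete, here is a minimal sketch of tracking a tagged metric with the Dubbo Metrics API. MetricManager and MetricName.build appear in the usage section later in this article; the tagged(...) call follows the Dropwizard-style tag API mentioned in the note below, so treat its exact signature as an assumption.

import com.alibaba.metrics.Counter;
import com.alibaba.metrics.MetricManager;
import com.alibaba.metrics.MetricName;

public class TaggedMetricExample {
    public static void main(String[] args) {
        // Keep the metric key fixed and move the variable parts into tags,
        // instead of encoding the service name into the dotted key.
        MetricName qpsName = MetricName.build("dubbo.provider.service.qps")
                .tagged("service", "DemoService", "method", "sayHello");

        Counter qps = MetricManager.getCounter("dubbo", qpsName);
        qps.inc(); // one call recorded; aggregation by any tag happens downstream
    }
}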

Note: At the end of December 2017, Dropwizard Metrics 4.0 began to support tags, and the tag implementation in Dubbo Metrics is based on Dropwizard's. Micrometer, provided in Spring Boot 2.0, and Prometheus have also introduced the concept of tags.

2. Precise Statistics Function Added

The precise statistics function of Dubbo Metrics differs from that of Dropwizard and other open-source tracking and statistics libraries. The difference can be explained along two dimensions: the choice of time window, and the method used to compute the throughput rate.

For throughput statistics (such as QPS), Dropwizard's implementation is a sliding window plus an exponentially weighted moving average (EWMA), offering only 1-minute, 5-minute, and 15-minute time windows.

Fixed Window vs. Sliding Window

For data statistics, the statistical time window needs to be defined in advance. Generally, a time window can be established in one of two ways: a fixed window or a sliding window.

A fixed time window takes absolute time as its reference frame and computes statistics within an absolute time window; the window remains unchanged no matter when the statistics are accessed. A sliding window, by contrast, uses the current time as its reference frame and looks back over a window of a specified size from the current time; the window changes as time passes and data arrives.

The fixed window has the following advantages. First, the window does not need to be moved, so the implementation is relatively simple. Second, all machines use the same time window, so cluster aggregation only needs to combine data from identical windows, which is relatively easy. The disadvantage is that a large window hurts real-time visibility: the statistics for the current window cannot be obtained immediately. For example, if the window is 1 minute, you must wait until the current minute ends to obtain its statistics.

The advantage of the sliding window is better real-time visibility: the statistics for a window looking back from the current time can be seen at any moment. On the other hand, the collection times on different machines differ, so data from different machines must be aggregated through so-called down-sampling: based on a fixed time window, an aggregate function is applied to the data collected within that window. For example, suppose the cluster has 5 machines and down-sampling is carried out by averaging at 15-second intervals. If within the window 00:00~00:15 a metric data point is collected at 00:01, 00:03, 00:06, 00:09, and 00:11 respectively, then the average of these 5 points is taken as the down-sampled value at 00:00.

In practice, however, sliding windows still run into the following problems:

  • Many business scenarios require data over a precise time window. For example, during the "Singles' Day" shopping carnival, if you want to know how many orders were created in the first second after 0:00 of that day, the Dropwizard sliding window is clearly not applicable.
  • The windows provided by Dropwizard are only at the minute level, but this scenario requires second-level accuracy.
  • For cluster aggregation, the sliding windows on different machines may differ, and gaps between data-collection times may also exist, leading to inaccurate aggregation results.

To solve these problems, Dubbo Metrics provides the BucketCounter metric, which can accurately compute data at the minute and second levels, with a time window as fine as 1 second. As long as the clocks on the machines are synchronized, the results after cluster aggregation are guaranteed to be accurate. Statistics based on sliding windows are also supported.
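To illustrate the fixed-window idea, here is a minimal sketch (not the actual BucketCounter implementation): buckets are keyed by absolute seconds, so per-second counts from different machines line up for aggregation as long as their clocks are synchronized.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class SimpleBucketCounter {
    // One bucket per absolute second; a real implementation would cap the
    // number of buckets and evict those older than the retention period.
    private final ConcurrentHashMap<Long, LongAdder> buckets = new ConcurrentHashMap<>();

    public void update() {
        long second = System.currentTimeMillis() / 1000; // absolute 1-second window
        buckets.computeIfAbsent(second, s -> new LongAdder()).increment();
    }

    // Returns the count recorded in a given absolute second, or 0 if absent.
    public long getCount(long second) {
        LongAdder adder = buckets.get(second);
        return adder == null ? 0 : adder.sum();
    }
}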

Instantaneous Rate vs. Exponentially Weighted Moving Average (EWMA)

After years of practice, we gradually discovered that when observing and monitoring, users focus first on cluster data and only then on single-machine data; moreover, the throughput rate of a single machine does not really mean much on its own. What does this mean?

For example, suppose a microservice runs on two machines, and a method is called 5 times within a certain period. The RTs of these calls are [5, 17] on Machine 1 and [6, 8, 8] on Machine 2 (in milliseconds). To compute the cluster's average RT, one might first compute the average RT on each machine and then average those. By this method, the average RT of Machine 1 is 11 ms and that of Machine 2 is 7.33 ms; averaging the two again gives a cluster average RT of 9.17 ms. But is this the actual result?

If the data from Machine 1 and Machine 2 is pooled and computed together, the actual average RT is (5+17+6+8+8)/5 = 8.8 ms, which is clearly different from the result above. Considering the precision lost in floating-point computation and the growth of cluster sizes, this error only becomes more pronounced. The conclusion is that a single machine's throughput rate is meaningless for computing cluster throughput; it is only meaningful in the single-machine dimension.
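The following snippet reproduces the arithmetic above, contrasting the average of per-machine averages with the pooled cluster average:

public class ClusterRtExample {
    public static void main(String[] args) {
        double[] machine1 = {5, 17};     // per-call RTs in ms
        double[] machine2 = {6, 8, 8};

        // Averaging the per-machine averages: (11 + 7.33) / 2 = 9.17 ms
        double avgOfAvgs = (avg(machine1) + avg(machine2)) / 2;

        // Pooling total RT and total count: 44 / 5 = 8.8 ms
        double pooledAvg = (sum(machine1) + sum(machine2))
                / (machine1.length + machine2.length);

        System.out.printf("avg of averages = %.2f ms, pooled avg = %.2f ms%n",
                avgOfAvgs, pooledAvg);
    }

    static double sum(double[] xs) { double s = 0; for (double x : xs) s += x; return s; }
    static double avg(double[] xs) { return sum(xs) / xs.length; }
}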

The EWMA provided by Dropwizard is also an average, but one that takes the time factor into account: the closer data is to the present, the higher its weight. Over a long enough window, say 15 minutes, this approach makes sense. However, we observed that weighting by time adds little value over short windows such as 1 s or 5 s. Therefore, during the internal overhaul, Dubbo Metrics was improved as follows:

  • It can compute the instantaneous rate to reflect the single-machine dimension, and it replaces the weighted average with a simple average;
  • For cluster aggregation, it provides the total number of calls and the total RT within the time window, so that the cluster-dimension throughput rate can be computed accurately.

3. Ultimate Performance Optimization

In large sales-promotion scenarios, statistical performance is an important topic for Dubbo Metrics. In Alibaba's business scenarios, the QPS of an interface being measured may reach tens of thousands, for example when accessing a cache, so the statistics logic of the metrics library itself may become a hot spot. We have made several targeted optimizations.

In high-concurrency scenarios, java.util.concurrent.atomic.LongAdder performs best for data accumulation, so almost all operations ultimately boil down to operations on this class.

Avoid Calling LongAdder#reset

When data expires, it needs to be cleaned up. In the previous implementation, LongAdder#reset was used to clear the object for reuse; however, testing showed that LongAdder#reset is actually a CPU-consuming operation. Therefore, we chose to trade memory for CPU: when cleaning up, the old LongAdder is simply replaced with a new object instead of calling LongAdder#reset.
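A minimal sketch of this trade-off (class and method names are illustrative, not the actual Dubbo Metrics code):

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.LongAdder;

public class SwappableCounter {
    private final AtomicReference<LongAdder> current =
            new AtomicReference<>(new LongAdder());

    public void inc() {
        current.get().increment();
    }

    // Called when the statistics window expires: returns the finished total
    // and installs a fresh adder, avoiding the CPU cost of LongAdder#reset.
    public long rollover() {
        LongAdder old = current.getAndSet(new LongAdder());
        return old.sum();
    }
}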

Remove Redundant Accumulation Operations

Some metric implementations have many statistical dimensions and need to update multiple LongAdder objects at the same time. For example, the Dropwizard Metrics implementation computes 1-minute, 5-minute, and 15-minute moving averages and updates 3 LongAdder objects on every call; in fact, these 3 updates are redundant and only one is needed.

Avoid Calling the Add Method when RT is 0

In most scenarios RT is measured in milliseconds, so when the measured RT is less than 1 ms, the RT passed to metrics is 0. It turns out that the JDK's LongAdder does not optimize the add(0) operation: even when the input is 0, the full update logic still runs, essentially calling sun.misc.Unsafe.UNSAFE.compareAndSwapLong. If the metrics library simply skips the add on the RT counter when RT is 0, one add operation can be saved per call. This is very helpful for high-concurrency middleware such as distributed caches: in one internal application test, we found that the RT for accessing the distributed cache was 0 ms in 30% of cases, so this optimization eliminates a large number of meaningless update operations.
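A sketch of the guard (field and method names are illustrative):

import java.util.concurrent.atomic.LongAdder;

public class RtRecorder {
    private final LongAdder callCount = new LongAdder();
    private final LongAdder totalRt = new LongAdder();

    public void record(long rtMillis) {
        callCount.increment();
        // LongAdder.add(0) still runs the full update path internally,
        // so skip the RT counter when the measured RT rounds down to 0 ms.
        if (rtMillis > 0) {
            totalRt.add(rtMillis);
        }
    }
}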

Perform Combined Statistics on QPS and RT

Ideally, only a single Long object would need to be updated to track both the number of calls and the elapsed time at the same time, which approaches the theoretical limit.

We observed that for some middleware metrics the success rate is usually very high, 100% under normal circumstances. To compute the success rate, both the number of successes and the total number of calls must be counted, and in this case the two are almost identical, so one of the two additions is wasted. If, instead, only the number of failures is counted, the counter is updated only when a failure occurs, greatly reducing the number of additions.

In fact, if we split each case orthogonally, such as into success and failure, the total can be computed by summing the counts of the individual cases. This further ensures that only one count is updated per call, as the sketch below illustrates.
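A sketch of this orthogonal split (illustrative, not the actual implementation): each call updates exactly one counter, and the total and success rate are derived on read.

import java.util.concurrent.atomic.LongAdder;

public class OrthogonalCounter {
    private final LongAdder successes = new LongAdder();
    private final LongAdder failures = new LongAdder();

    public void record(boolean success) {
        // Exactly one addition per call, whichever case occurred.
        (success ? successes : failures).increment();
    }

    public long total() {
        return successes.sum() + failures.sum();
    }

    public double successRate() {
        long t = total();
        return t == 0 ? 1.0 : (double) successes.sum() / t;
    }
}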

But do not forget that, in addition to the number of calls, the RT of the method execution also needs to be computed. Can it be further reduced?

The answer is yes! Assuming RT is measured in milliseconds: a Long has 64 bits (actually, Java's long is signed, so only 63 bits are usable), while a metrics statistical cycle covers at most 60 s of data, so these 63 bits can never be used up. Can we split the 63 bits and track the count and RT at the same time? It turns out this is feasible.

We use the high 25 of the 63 usable bits of a Long to represent the total count within a statistical cycle, and the low 38 bits to represent the total RT.

------------------------------------------
|   1 bit    |     25 bit   |     38 bit |
| signed bit |  total count |   total rt |
------------------------------------------

When a call comes in, assuming the RT passed in is n, the number accumulated each time is neither 1 nor n, but

1 * 2^38 + n

This design mainly has the following considerations:

  • Count is incremented by 1 per call and RT is incremented by n per call. With count in the high bits, each call adds the fixed constant 2^38 (plus n in the low bits); if RT were in the high bits instead, the value to add would change with every call and would have to be recomputed each time;
  • 25 bits can represent at most 2^25 = 33,554,432 numbers, so the count of method calls within 1 minute will definitely not overflow;
  • RT may be relatively large, so the low 38 bits are reserved for it; they can represent up to 2^38 = 274,877,906,944, which makes overflow essentially impossible.
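A minimal sketch of this packing scheme (illustrative, not the actual FastCompass code):

import java.util.concurrent.atomic.LongAdder;

public class PackedQpsRt {
    private static final int RT_BITS = 38;
    private static final long RT_MASK = (1L << RT_BITS) - 1;

    // High 25 of the 63 usable bits hold the count, low 38 bits the total RT.
    private final LongAdder packed = new LongAdder();

    public void record(long rtMillis) {
        packed.add((1L << RT_BITS) + rtMillis); // 1 * 2^38 + n, one add per call
    }

    public long totalCount() {
        return packed.sum() >>> RT_BITS;
    }

    public long totalRt() {
        return packed.sum() & RT_MASK;
    }
}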

What if it does overflow?

Based on the preceding analysis, overflow is almost impossible, so this problem has been set aside for now and will be addressed later.

Lockless BucketCounter

In the previous code, BucketCounter used an object lock to ensure that only one thread updated a Bucket under concurrent access. In the latest version, BucketCounter has been reimplemented, removing the lock and operating only with AtomicReference and CAS. Local tests showed a performance improvement of about 15%.

4. Comprehensive Metric Statistics

Dubbo Metrics fully supports all layers of metrics from the operating system, JVM, and middleware to the application, unifies the naming of these metrics so that they can be used out of the box, and supports turning the collection of particular metrics on and off at any time through configuration. Currently supported metrics mainly include:

Operating Systems

It supports Linux, Windows, and macOS, covering CPU, load, disk, network traffic, and TCP.

JVM

It supports class loading, GC counts and times, file handles, Young/Old generation occupancy, thread status, off-heap memory, and compilation time; some metrics support automatic difference computation.

Middleware

Tomcat: number of requests, number of failures, processing time, bytes sent and received, and active threads in the thread pool;

Druid: number of SQL executions, number of errors, execution time, and number of rows affected;

Nginx: accepted and active connections, numbers of read and write requests, queue length, request QPS, and average RT.

For more detailed metrics, please refer here. Support for Dubbo, NACOS, Sentinel, and Fescar will be added later.

5. REST Support

Dubbo Metrics provides JAX-RS-based REST interfaces that make it easy to query various internal metrics. It can start an HTTP server to provide the service independently (by default, a simple implementation based on Jersey plus the Sun HTTP server is provided), or it can be embedded into an existing HTTP server to expose the metrics. For the specific interfaces, please see:

https://github.com/dubbo/metrics/wiki/query-from-http

6. Single-Machine Data Flushed to Disk

If all the data is kept in memory, it may be lost due to pull failures or application jitter. To solve this problem, metrics has introduced a module that flushes data to disk in two modes: a log mode and a binary mode.

  • By default, the log mode outputs JSON, which can be pulled and aggregated by a log component. The files are quite readable, but historical data cannot be queried.
  • The binary mode provides more compact storage and supports querying historical data. This mode is currently used internally.

7. Optimization for Usability and Stability

  • The tracking API and its implementation are split, making it easy to integrate different implementations without users having to pay attention;
  • Tracking through annotations is supported;
  • The design borrows from logging frameworks, making it more convenient to obtain metrics;
  • Compass and FastCompass have been added to make common business tracking easy, such as statistics for QPS, RT, success rate, and error counts;
  • A Spring Boot Starter will be open-sourced soon. Please stay tuned;
  • Automatic metric cleanup is supported, to prevent long-unused metrics from occupying memory;
  • URL metrics are converged and maximum protection is implemented, to prevent memory problems and statistical errors caused by dimension explosion.

Usage Instructions

Using it is simple, similar to obtaining a Logger from a logging framework.

Counter hello = MetricManager.getCounter("test", MetricName.build("test.my.counter"));
hello.inc();

Supported metrics include:

  • Counter
  • Meter (a throughput metric)
  • Histogram (a distribution metric)
  • Gauge (an instantaneous-value metric)
  • Timer (a metric for throughput rate and response time distribution)
  • Compass (a metric for throughput rate, response time distribution, success rate, and error codes)
  • FastCompass (a metric for fast, efficient statistics of throughput rate, average response time, success rate, and error codes)
  • ClusterHistogram (a metric for cluster quantiles)
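For example, a sketch of recording one invocation with FastCompass; the record(duration, subCategory) call follows the project's public examples, so treat the exact signature as an assumption:

import com.alibaba.metrics.FastCompass;
import com.alibaba.metrics.MetricManager;
import com.alibaba.metrics.MetricName;

public class FastCompassExample {
    public static void main(String[] args) {
        FastCompass compass = MetricManager.getFastCompass(
                "test", MetricName.build("test.my.fastCompass"));

        long start = System.currentTimeMillis();
        // ... invoke the service being measured ...
        long duration = System.currentTimeMillis() - start;

        // One invocation recorded under a result category; splitting by
        // category is what lets the success rate fall out of single updates.
        compass.record(duration, "success");
    }
}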

Future Work

  • To provide the Spring Boot Starter.
  • To support Prometheus and Spring's Micrometer.
  • To integrate Dubbo and replace the data statistics implementation in Dubbo with Dubbo Metrics.
  • To display various metrics data on Dubbo Admin.
  • To integrate other components in Dubbo ecosystem, such as NACOS, Sentinel, and Fescar.

References

Dubbo Metrics on GitHub:

https://github.com/dubbo/metrics

Wiki:

https://github.com/dubbo/metrics/wiki (continuously updated)

About the Authors

Wang Tao (GitHub ID @ralf0131) is an Apache Dubbo PPMC member, an Apache Tomcat PMC member, and a technical expert at Alibaba.

Zi Guan (GitHub ID @min) is an Apache Dubbo committer and a senior development engineer at Alibaba, responsible for the development of the Dubbo Admin and Dubbo Metrics projects and for community maintenance.
