Sampling Strategy and Source Code Analysis of SOFATracer, Ant Financial's Distributed Tracing Component

Abstract: This article analyzes the sampling model and strategy described in the Dapper paper and the implementation of sampling in the SOFATracer source code, and describes in detail how to define sampling rules for instrumentation data. SOFATracer offers a sampling mode based on a fixed sampling rate and a custom sampling mode based on user extensions. With them, you can choose a sampling strategy that suits your business scenario and plug custom sampling rules into SOFATracer's sampling step.


SOFA (Scalable Open Financial Architecture) is a financial-grade distributed middleware independently developed by Ant Financial. It includes the components required to build a financial-grade cloud-native architecture and embodies best practices refined in financial scenarios.

SOFATracer is a component for distributed system call tracing. It uses a unified TraceId to record the network calls along a call chain in logs, which makes those calls visible end to end. The trace data can be used for rapid fault discovery, service governance, and so on.

This article is the fourth chapter of "Analysis | SOFATracer Framework". The author of this article is Mi Qilin from Lufax. The "Analysis | SOFATracer Framework" series is produced by the SOFA team and source code enthusiasts.


Distributed tracing touches every hop of a call, and every hop generates a large amount of data. Storing all of it can be costly, and in production not all of the data deserves attention. For these reasons, SOFATracer provides trace data sampling. Sampling saves disk space and I/O, and it filters out irrelevant data directly. SOFATracer currently has two built-in sampling strategies: sampling based on a fixed ratio, and custom sampling based on user extensions. The custom sampling mode passes the SofaTracerSpan instance to the sampling calculation, so users can implement their own sampling rules on top of it.

This article introduces the principles of SOFATracer's data sampling strategy and walks through the sampling rule algorithm by analyzing the source code.

Sampling models and strategies in the Dapper paper
Trace sampling model

High-throughput online services in which each request fans out to a large number of servers are among the hardest to trace efficiently. They generate the largest volume of trace data, and they are also the most sensitive to performance impact. Once the sampling rate is reduced below 1/16, the latency and throughput penalties are all within experimental error.

In practice, we found that even with a sampling rate of 1/1024 there is still enough trace data for high-volume services. Keeping the baseline overhead of the tracing system very low is important, because it leaves applications room to use the full Annotation API without fear of a performance penalty. A lower sampling rate has the added benefit that trace data persisted on local disk can be kept longer before it is garbage-collected, which gives the collection component of the tracing system more flexibility.

The overhead that the tracing system imposes on any given process is proportional to the number of traces that process samples per unit time. With a low sampling rate, however, low-traffic workloads may miss important events, even though they could tolerate a higher sampling rate at an acceptable performance cost. The adaptive sampling scheme being deployed is therefore parameterized not by a uniform sampling probability but by a desired number of sampled traces per unit time. Low-traffic, low-load workloads automatically raise their sampling rate, and high-traffic, high-load workloads lower it, which keeps the overhead under control. The sampling rate actually used is recorded along with the trace itself, which makes accurate analysis and troubleshooting from the trace data possible.
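The idea of steering toward a desired number of sampled traces per unit time can be illustrated with a minimal Java sketch. This is not Dapper's or SOFATracer's implementation: the class `AdaptiveRateSampler`, its window-based adjustment, and the convention of returning -1 for dropped requests are all assumptions made for illustration, and the class is not thread-safe.

```java
import java.util.Random;

/** Hypothetical adaptive sampler that aims at a target number of sampled traces per window. */
final class AdaptiveRateSampler {
    private final double targetPerWindow; // desired sampled traces per time window
    private final Random random;
    private double probability = 1.0;     // start by sampling everything
    private long requestsInWindow = 0;

    AdaptiveRateSampler(double targetPerWindow, Random random) {
        this.targetPerWindow = targetPerWindow;
        this.random = random;
    }

    /**
     * Decides one request. Returns the probability that was in effect if the
     * request is sampled (so it can be stored with the trace), or -1 if not.
     */
    double sample() {
        requestsInWindow++;
        return random.nextDouble() < probability ? probability : -1.0;
    }

    /** Call once per time window: derive the next probability from observed traffic. */
    void endWindow() {
        probability = requestsInWindow == 0
                ? 1.0
                : Math.min(1.0, targetPerWindow / requestsInWindow);
        requestsInWindow = 0;
    }

    double currentProbability() {
        return probability;
    }

    public static void main(String[] args) {
        AdaptiveRateSampler sampler = new AdaptiveRateSampler(10, new Random(42));
        for (int i = 0; i < 10_000; i++) sampler.sample(); // busy window
        sampler.endWindow();
        System.out.println("after busy window:  " + sampler.currentProbability());
        for (int i = 0; i < 5; i++) sampler.sample();      // quiet window
        sampler.endWindow();
        System.out.println("after quiet window: " + sampler.currentProbability());
    }
}
```

A busy window pushes the probability down to roughly target divided by traffic, and a quiet window lets it climb back to 1.0. Returning the probability with each sampled request mirrors the point that the rate in effect is recorded with the trace.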

Trace sampling strategy

To be truly transparent at the application level, the core tracing code must be lightweight and embedded in ubiquitous common components such as threading, control flow, and RPC libraries. An adaptive sampling rate makes the tracing system scalable and reduces its performance overhead. A tracing system must have low overhead, especially in production, where it must not affect the performance of the core business, so tracing every request is not feasible. Sampling is therefore required, and each application and service can set its own sampling rate. The sampling rate should live in each application's own configuration so that it can be adjusted dynamically per application. For example, it can be raised when an application has just been launched. When peak traffic is high, only a small fraction of requests needs to be sampled. With a sampling rate of 1/1000, for instance, the tracing system samples only about one in every 1000 requests.

The Dapper paper emphasizes the importance of data sampling. Flushing every piece of instrumentation data to disk increases the impact of the tracing framework on the performance of the business it instruments, while a sampling rate that is too low may lose important data. The paper notes that under high concurrency a sampling rate of 1/1024 is sufficient and there is no need to worry about losing important events, because an anomaly that appears once in such an environment will appear thousands of times. In systems with little concurrency that are extremely sensitive to data loss, however, business developers need to set the sampling rate manually.

For high-throughput services, aggressive sampling does not hinder the most important analyses: if a notable event occurs once in such a system, it will occur thousands of times. Low-throughput services can afford to trace every request. This is what motivated the decision to use adaptive sampling rates. To stay flexible with respect to both material resource demands and growing throughput requirements, we also added support for additional sampling in the collection system itself.

It would indeed be simpler to use a single sampling rate parameter for the entire tracing and collection system, but that would not meet the requirement of quickly adjusting the runtime sampling configuration on all deployed nodes. We chose a runtime sampling rate that yields slightly more data than we can write to the repository, and we throttle the write rate with the secondary sampling coefficient in the collection system. Maintaining Dapper's pipeline becomes easier because we can directly increase or decrease global coverage and write speed by changing the secondary sampling configuration.
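One way to apply a secondary sampling coefficient in a collector is to hash each trace id into a scalar z between 0 and 1 and keep a span only when z is below the coefficient, which is the approach the Dapper paper describes. The Java sketch below illustrates that idea under stated assumptions: the class `CollectorSampling` and the specific hash mixing are hypothetical and are not taken from Dapper or SOFATracer.

```java
/** Hypothetical collector-side filter; the hash function is an assumption. */
final class CollectorSampling {

    /** Maps a trace id to a stable value z in [0, 1). */
    static double toUnitInterval(String traceId) {
        int h = traceId.hashCode();
        h ^= (h >>> 16);                        // mix high bits into low bits
        return (h & 0x7fffffff) / 2147483648.0; // 31 non-negative bits divided by 2^31
    }

    /** Keeps a span when its trace's z is below the secondary sampling coefficient. */
    static boolean keep(String traceId, double coefficient) {
        return toUnitInterval(traceId) < coefficient;
    }

    public static void main(String[] args) {
        String traceId = "0a0fe8ec154173568757510019269";
        System.out.println(keep(traceId, 1.0)); // true: coefficient 1.0 keeps every trace
        System.out.println(keep(traceId, 0.0)); // false: coefficient 0.0 drops every trace
    }
}
```

Because the decision depends only on the trace id, all spans of one trace are kept or dropped together, and changing the single coefficient raises or lowers the write rate without touching the deployed applications.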

Analysis of the SOFATracer sampling source code
SOFATracer provides trace data sampling and supports two sampling strategies: a mode based on a fixed sampling rate and a custom mode based on user extensions.

Sampling Interface Model

SOFATracer provides the Sampler interface to define how trace data is sampled. Its sample method takes a SofaTracerSpan instance as the basis for the sampling calculation and decides whether the trace is sampled, which makes it possible to implement rich sampling rules.
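The shape of this contract can be sketched in a few lines of Java. The types below are simplified stand-ins, not the SOFATracer classes: `Span` replaces SofaTracerSpan and carries a single illustrative field, `SamplingStatus` carries only the decision, and the prefix-based rule is a hypothetical example.

```java
/** Simplified stand-ins, not the SOFATracer classes themselves. */
final class SamplerModelSketch {

    /** Stand-in for SofaTracerSpan, reduced to one illustrative field. */
    static final class Span {
        final String operationName;

        Span(String operationName) {
            this.operationName = operationName;
        }
    }

    /** Stand-in for SamplingStatus: carries the sampling decision. */
    static final class SamplingStatus {
        private final boolean sampled;

        SamplingStatus(boolean sampled) {
            this.sampled = sampled;
        }

        boolean isSampled() {
            return sampled;
        }
    }

    /** The sampling contract: inspect a span, return a decision. */
    interface Sampler {
        SamplingStatus sample(Span span);
    }

    /** Example rule (hypothetical): sample only operations with a given prefix. */
    static Sampler byOperationPrefix(String prefix) {
        return span -> new SamplingStatus(span.operationName.startsWith(prefix));
    }

    public static void main(String[] args) {
        Sampler sampler = byOperationPrefix("/pay");
        System.out.println(sampler.sample(new Span("/pay/confirm")).isSampled()); // true
        System.out.println(sampler.sample(new Span("/health")).isSampled());      // false
    }
}
```

Passing the whole span to the sampler, rather than only a trace id, is what allows rules that depend on span attributes.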

The basic flow of trace data sampling with the sampler that SOFATracer creates is as follows:

When the tracer is built, the sampler factory SamplerFactory creates the Sampler for the chosen strategy. If the fully qualified class name of a custom sampling rule is configured, the user-extension sampler takes priority. Otherwise the default strategy, based on a fixed sampling rate, is used.
When the Reporter reports a span (reportSpan), or when a SofaTracerSpan starts, the sampler's sample method is called to check whether the trace should be sampled. The isSampled flag of the returned SamplingStatus holds the result.

Sampler initialization
As analyzed above, the sampling strategy instance is created by the SamplerFactory, which provides a getSampler method to obtain the sampler.

getSampler loads the user-defined sampling strategy first. If no custom ruleClassName is found in the configuration file, it builds the default sampler based on a fixed sampling rate. SamplerProperties holds the sampling-related configuration properties. The default fixed sampling rate is 100%, which means that by default all Span data is written to the log file. The specific configuration is described in the usage example later in this article.
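The selection logic described above can be approximated with the following Java sketch. It is a simplified illustration, not the SOFATracer source: `Props` stands in for SamplerProperties, `PercentageSampler` is a placeholder for the real fixed-rate sampler, and `DropAll` is a hypothetical custom rule used only to exercise the reflective loading path.

```java
import java.util.concurrent.ThreadLocalRandom;

final class SamplerFactorySketch {

    interface Sampler {
        boolean sample(String traceId);
    }

    /** Stand-in for the sampling configuration properties. */
    static final class Props {
        String ruleClassName;    // custom rule class; null or empty when not configured
        float percentage = 100f; // fixed sampling rate, 100% by default
    }

    /** Placeholder fixed-rate sampler that decides per call at random. */
    static final class PercentageSampler implements Sampler {
        final float percentage;

        PercentageSampler(float percentage) {
            this.percentage = percentage;
        }

        @Override
        public boolean sample(String traceId) {
            return ThreadLocalRandom.current().nextFloat() * 100f < percentage;
        }
    }

    /** Hypothetical custom rule that drops everything. */
    static final class DropAll implements Sampler {
        @Override
        public boolean sample(String traceId) {
            return false;
        }
    }

    /** A configured custom rule class wins; otherwise fall back to the fixed rate. */
    static Sampler getSampler(Props props) {
        if (props.ruleClassName != null && !props.ruleClassName.trim().isEmpty()) {
            try {
                return (Sampler) Class.forName(props.ruleClassName)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot load " + props.ruleClassName, e);
            }
        }
        return new PercentageSampler(props.percentage);
    }

    public static void main(String[] args) {
        Props props = new Props();
        System.out.println(getSampler(props).getClass().getSimpleName()); // PercentageSampler
        props.ruleClassName = DropAll.class.getName();
        System.out.println(getSampler(props).getClass().getSimpleName()); // DropAll
    }
}
```

Loading the rule by class name keeps the tracer free of compile-time dependencies on user code, at the cost of requiring a no-argument constructor on the custom class.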

Sampling calculation

Sampling applies to the entire trace: once the RootSpan has been created, it is already determined whether the data of the current trace will be recorded. In the SofaTracer class, the Sampler instance is a final member variable, so the sampling strategy cannot change after the SofaTracer instance has been constructed. Once a Sampler is bound to a SofaTracer instance, whether that tracer writes the Span data of a given trace to disk depends on the sampler's result.

SOFATracer does not build Spans only in the way the OpenTracing specification defines, which is to create a new Span with SpanBuilder#start. It builds a Span in one of two ways:

Through SofaTracerSpanBuilder#start, the implementation based on the OpenTracing specification
Directly from a SofaTracerSpanContext

In the first case, the sampling decision is computed in the start method and then stored in the SofaTracerSpanContext, from where it is passed transparently to downstream spans.
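The first case can be sketched as follows. This is a simplified illustration rather than the SofaTracerSpanBuilder code: `Context` stands in for SofaTracerSpanContext, the random UUID trace id is a placeholder, and treating a missing sampler as "not sampled" is an assumption of this sketch.

```java
import java.util.UUID;

final class StartSamplingSketch {

    interface Sampler {
        boolean sample(String operationName);
    }

    /** Stand-in for the span context: trace id plus the sampling decision. */
    static final class Context {
        final String traceId;
        final boolean sampled;

        Context(String traceId, boolean sampled) {
            this.traceId = traceId;
            this.sampled = sampled;
        }
    }

    /**
     * Creates the context for a new span. A root span (parent == null) asks the
     * sampler; a child span copies the decision already made for its trace.
     */
    static Context start(String operationName, Context parent, Sampler sampler) {
        if (parent != null) {
            return new Context(parent.traceId, parent.sampled);
        }
        // Assumption of this sketch: without a sampler the span is not sampled
        boolean sampled = sampler != null && sampler.sample(operationName);
        return new Context(UUID.randomUUID().toString(), sampled);
    }

    public static void main(String[] args) {
        Context root = start("GET /springmvc", null, name -> false);
        Context child = start("db.query", root, name -> true);
        System.out.println(root.sampled);  // false: decided by the sampler
        System.out.println(child.sampled); // false: inherited, the sampler is not consulted
    }
}
```

Copying the parent's decision is what keeps sampling consistent for the whole trace: a child never overrides what was decided at the root.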

In the second case, the Span is built from a SofaTracerSpanContext, whose constructor marks the context as not sampled by default. SOFATracer then defers the sampling calculation until the Span is reported. The calculation runs only if the SofaTracer has a sampler and the current Span is the rootSpan.
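The deferred decision in the second case can be sketched like this. The classes are hypothetical stand-ins, not the SofaTracer implementation: `Span` carries its own sampled flag instead of a separate context, and the `written` list stands in for writing the digest log.

```java
import java.util.ArrayList;
import java.util.List;

final class ReportSamplingSketch {

    interface Sampler {
        boolean sample(Span span);
    }

    /** Stand-in span whose sampling flag starts as "not sampled". */
    static final class Span {
        final String name;
        final Span parent;       // null for the root span
        boolean sampled = false; // default set at construction time

        Span(String name, Span parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    static final class Tracer {
        private final Sampler sampler; // fixed for the lifetime of the tracer
        final List<String> written = new ArrayList<>();

        Tracer(Sampler sampler) {
            this.sampler = sampler;
        }

        void reportSpan(Span span) {
            // Deferred decision: only when a sampler exists and the span is the root
            if (sampler != null && span.parent == null) {
                span.sampled = sampler.sample(span);
            }
            if (span.sampled) {
                written.add(span.name); // stands in for writing the log entry
            }
        }
    }

    public static void main(String[] args) {
        Tracer tracer = new Tracer(span -> true);
        tracer.reportSpan(new Span("root", null));
        System.out.println(tracer.written); // [root]
    }
}
```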

Sampling mark transparent transmission

When SOFATracer propagates data across processes, it places the sampling flag in the propagated data, so the flag travels downstream along with the trace. The key of the sampling flag is X-B3-Sampled. When a downstream service finds the flag under this key, it uses that value directly instead of recalculating the decision.
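The propagation of the flag can be sketched with a plain map as the carrier. Only the key X-B3-Sampled comes from the text above. Encoding the value as "1" or "0" follows the usual B3 convention and is an assumption about SOFATracer's wire format, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class SampledFlagPropagationSketch {

    static final String SAMPLED_KEY = "X-B3-Sampled";

    /** Upstream side: write the decision into the outgoing headers. */
    static void inject(boolean sampled, Map<String, String> carrier) {
        carrier.put(SAMPLED_KEY, sampled ? "1" : "0"); // value encoding is an assumption
    }

    /** Downstream side: returns the upstream decision, or null if the header is absent. */
    static Boolean extract(Map<String, String> carrier) {
        String value = carrier.get(SAMPLED_KEY);
        if (value == null) {
            return null;
        }
        return "1".equals(value) || "true".equalsIgnoreCase(value);
    }

    /** Reuses the upstream decision when present; otherwise asks the local sampler. */
    static boolean decide(Map<String, String> carrier, BooleanSupplier localSampler) {
        Boolean upstream = extract(carrier);
        return upstream != null ? upstream : localSampler.getAsBoolean();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject(false, headers);
        // The local sampler would say "true", but the upstream decision wins
        System.out.println(decide(headers, () -> true)); // false
    }
}
```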

Sampling strategy implementation

SOFATracer's default sampling strategy is SofaTracerPercentageBasedSampler, a fixed-sampling-rate mode implemented on top of a BitSet. Its sample method is the entry point of the sampling calculation.

SofaTracerPercentageBasedSampler builds a random BitSet from the fixed sampling ratio and consults it to decide whether to sample. The BitSet is generated with the reservoir sampling algorithm (Reservoir Sampling), which runs in O(N) time. Reservoir sampling selects k samples from a set S of n items, where n is large or unknown. The steps are:

Take the first k items of S and put them into the reservoir
For each item S[j] with j ≥ k:
Generate a random integer r in the range 0 to j (inclusive)
If r < k, replace item r of the reservoir with S[j]
SofaTracerPercentageBasedSampler creates its random BitSet with this reservoir sampling approach, using code that originated on Stack Overflow.
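The steps above can be sketched in Java as a BitSet with exactly k of its bits set at random positions, plus a counter that walks through the bits. This is an illustration of the technique, not the SofaTracerPercentageBasedSampler source: the class name, the choice of 100 slots, and the use of a request counter to pick a slot are assumptions.

```java
import java.util.BitSet;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

final class PercentageSamplerSketch {

    private final AtomicLong counter = new AtomicLong();
    private final BitSet sampleDecisions;

    /** percentage must be between 0 and 100. */
    PercentageSamplerSketch(int percentage, Random random) {
        // 100 slots, exactly `percentage` of them set at random positions
        this.sampleDecisions = randomBitSet(100, percentage, random);
    }

    /** Each call consumes the next slot; the slot's bit is the decision. */
    boolean isSampled() {
        int slot = (int) Math.floorMod(counter.getAndIncrement(), 100L);
        return sampleDecisions.get(slot);
    }

    /** Reservoir sampling: picks `cardinality` of `size` bit positions uniformly at random. */
    static BitSet randomBitSet(int size, int cardinality, Random rnd) {
        BitSet result = new BitSet(size);
        int[] chosen = new int[cardinality];
        int i;
        for (i = 0; i < cardinality; i++) { // fill the reservoir with the first k positions
            chosen[i] = i;
            result.set(i);
        }
        for (; i < size; i++) {
            int r = rnd.nextInt(i + 1);     // random integer in 0..i inclusive
            if (r < cardinality) {          // replace reservoir item r with position i
                result.clear(chosen[r]);
                result.set(i);
                chosen[r] = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PercentageSamplerSketch sampler = new PercentageSamplerSketch(20, new Random());
        int sampled = 0;
        for (int n = 0; n < 100; n++) {
            if (sampler.isSampled()) sampled++;
        }
        System.out.println(sampled); // 20
    }
}
```

With this design the rate is exact over every block of 100 consecutive requests rather than only approximate, and each decision costs one counter increment and one bit lookup.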

Sample usage example
The example of using SOFATracer's sampling capability is based on the tracer-sample-with-springmvc project. Everything is the same as in that project except for the sampling-related configuration.

Fixed sample rate mode

SOFATracer provides a sampling implementation based on a fixed sampling rate. To use it, set the sampling mode to PercentageBasedSampler. In this mode, users also need to configure the sampling rate.

Add the sampling-related configuration items to the project configuration to enable the sampling mode based on a fixed sampling rate.

Fixed sampling rate verification method:

When the sampling rate is set to 100, the digest log is printed for every request.
When the sampling rate is set to 0, nothing is printed.
When the sampling rate is set between 0 and 100, the log is printed with the corresponding probability.
Each case is verified with 10 requests.

1. When the sampling rate is set to 100, the digest log is printed every time

Start the project, open http://localhost:8080/springmvc in the browser, refresh the page 10 times, and check the log:

{"time":"2018-11-09 11:54:47.643","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173568757510019269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":68,"":"http-nio-8080-exec-1","baggage":""}
{"time":"2018-11-09 11:54:50.980","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569097710029269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":3,"":"http-nio-8080-exec-2","baggage":""}
{"time":"2018-11-09 11:54:51.542","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569153910049269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":3,"":"http-nio-8080-exec-4","baggage":""}
{"time":"2018-11-09 11:54:52.061","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569205910069269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-6","baggage":""}
{"time":"2018-11-09 11:54:52.560","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569255810089269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-8","baggage":""}
{"time":"2018-11-09 11:54:52.977","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569297610109269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":1,"":"http-nio-8080-exec-10","baggage":""}
{"time":"2018-11-09 11:54:53.389","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569338710129269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-2","baggage":""}
{"time":"2018-11-09 11:54:53.742","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569374110149269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":1,"":"http-nio-8080-exec-4","baggage":""}
{"time":"2018-11-09 11:54:54.142","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569414010169269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-6","baggage":""}
{"time":"2018-11-09 11:54:54.565","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173569456310189269","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-8","baggage":""}
2. When the sampling rate is set to 0, nothing is printed

Start the project, open http://localhost:8080/springmvc in the browser, and refresh the page 10 times. The ./logs/tracerlog/ directory then contains no spring-mvc-digest.log file.

3. When the sampling rate is set between 0 and 100, the log is printed with the corresponding probability

Here the sampling rate is set to 20.

After 10 requests:
{"time":"2018-11-09 12:14:29.466","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173686946410159846","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-5","baggage":""}
{"time":"2018-11-09 12:15:21.776","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173692177410319846","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-2","baggage":""}
After 20 requests:
{"time":"2018-11-09 12:14:29.466","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173686946410159846","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-5","baggage":""}
{"time":"2018-11-09 12:15:21.776","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173692177410319846","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-2","baggage":""}
{"time":"2018-11-09 12:15:22.439","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173692243810359846","spanId":"0.1","request.url":"http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes" :-1,"resp.size.bytes":0,"time.cost.milliseconds":1,"":"http-nio-8080-exec-6","baggage":"" }
{"time":"2018-11-09 12:15:22.817","":"SOFATracerSpringMVC","traceId":"0a0fe8ec154173692281510379846","spanId":"0.1","request.url": "http://localhost:8080/springmvc","method":"GET","result.code":"200","req.size.bytes":-1,"resp.size.bytes":0 ,"time.cost.milliseconds":2,"":"http-nio-8080-exec-8","baggage":""}
These results were obtained with a 20% sampling rate and are for reference only.

Custom sampling mode

SOFATracer provides a sampling interface for user-defined extensions. A custom sampling mode is a class that implements this interface. When the custom rule class is set to CustomOpenRulesSamplerRuler, the user implements CustomOpenRulesSamplerRuler.sample to define the sampling rule, using the current SofaTracerSpan parameter as the sampling condition.

Add the sampling-related configuration items to the project configuration to enable the custom sampling mode.

In the sample method of the user-defined sampling rule class, the user decides whether to print the log based on the information in the current SofaTracerSpan. In this example the decision depends on isServer: a span with isServer=true is not sampled, and any other span is sampled. You can verify the results yourself.
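A rule of this kind could look roughly like the sketch below. To keep the example self-contained, `Span`, `SamplingStatus`, and `Sampler` are simplified stand-ins defined here rather than the SOFATracer types, so a real rule class would implement the library's own interface instead.

```java
final class CustomRuleSketch {

    /** Stand-in for SofaTracerSpan, reduced to the isServer check. */
    static final class Span {
        private final boolean server;

        Span(boolean server) {
            this.server = server;
        }

        boolean isServer() {
            return server;
        }
    }

    /** Stand-in for SamplingStatus. */
    static final class SamplingStatus {
        private boolean sampled;

        void setSampled(boolean sampled) {
            this.sampled = sampled;
        }

        boolean isSampled() {
            return sampled;
        }
    }

    /** Stand-in for the sampling interface. */
    interface Sampler {
        SamplingStatus sample(Span span);
    }

    /** Custom rule: server spans are not sampled, all other spans are. */
    static final class CustomOpenRulesSamplerRuler implements Sampler {
        @Override
        public SamplingStatus sample(Span span) {
            SamplingStatus status = new SamplingStatus();
            status.setSampled(!span.isServer());
            return status;
        }
    }

    public static void main(String[] args) {
        Sampler rule = new CustomOpenRulesSamplerRuler();
        System.out.println(rule.sample(new Span(true)).isSampled());  // false
        System.out.println(rule.sample(new Span(false)).isSampled()); // true
    }
}
```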

This article analyzed the sampling model and strategy of the Dapper paper and the implementation of sampling in the SOFATracer source code, and described in detail how to define sampling rules for instrumentation data. With SOFATracer's fixed-sampling-rate mode and its user-extension custom mode, you can choose a sampling strategy that suits your business scenario and plug custom sampling rules into SOFATracer's sampling step. We hope this source code analysis helps you better understand the core principles and concrete implementation of SOFATracer's trace sampling module.

Related links appearing in the article:

The original address of the Dapper paper:
O(N) reservoir sampling algorithm (Reservoir Sampling)
Random BitSet source StackOverflow:
Author: s Pan Pan
