In distributed systems with complex business logic, identifying failed, timed-out, and abnormal traces among a large volume of business data can be challenging and costly. Application Real-Time Monitoring Service (ARMS) provides a selection of trace sampling policies to help you collect the specific traces that meet your business needs. This allows you to efficiently monitor your applications while minimizing resource usage.
Common sampling policies in distributed tracing
In distributed tracing, sampling policies fall into three categories based on when the sampling decision is made.
Head-based sampling: The sampling decision is made at the root span of an ingress service, such as a gateway service, proxy service, or core upstream service. If the root span of the ingress service is sampled, the trace data of all downstream services is also sampled, which keeps the entire trace coherent (see the sketch after this list).
Tail-based sampling: The sampling decision is made on the server side after all spans of a trace have been collected. If a trace needs to be sampled, its data is stored; otherwise, the data is discarded. This ensures data consistency. Tail-based sampling accurately captures failed, slow, or abnormal traces for troubleshooting and diagnostics at any time. However, it also increases transmission overhead and data costs.
Unitary sampling: Unitary sampling is a non-coherent sampling policy. For a given trace, the service to which each span belongs independently decides whether to sample and report its data. Because every service makes its own sampling decision, the reported trace data may be incomplete.
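To make head-based sampling concrete, the following sketch uses the OpenTelemetry Java SDK, which many instrumented applications integrate with. It illustrates the coherence property only and is not ARMS's internal implementation: the root span decides, and every child span inherits that decision.

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public final class HeadSamplingConfig {
    public static SdkTracerProvider build() {
        // The root span of the ingress service decides (here, 10% of trace IDs).
        // parentBased() makes every child span follow the parent's decision,
        // so a sampled trace is complete from the gateway to the last service.
        return SdkTracerProvider.builder()
            .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)))
            .build();
    }
}
```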
Sampling policies provided by ARMS
ARMS typically employs head-based sampling to minimize costs associated with reported observable data. For information about the billing of observable data, see Billing.
Fixed-rate sampling: Traces are sampled at a specified rate at the ingress service. For example, at a fixed rate of 10%, one out of every ten traces is recorded on average.
Adaptive sampling: Based on the Least Frequently Used (LFU) algorithm, 10 traces are sampled per minute for each of the 1,000 API operations with the most requests, while all remaining API operations share a quota of 10 traces per minute. This is a cost-effective head-based sampling policy developed by ARMS.
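The quota mechanics can be sketched as a custom OpenTelemetry sampler. This is a simplified approximation for illustration only: it tracks the first 1,000 distinct operations in each one-minute window and pools the rest, whereas ARMS's actual LFU bookkeeping is internal to the product.

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Conceptual sketch: per-operation quotas plus one shared long-tail quota. */
final class AdaptiveSampler implements Sampler {
    private static final int MAX_TRACKED = 1_000; // operations tracked individually
    private static final int QUOTA = 10;          // traces per operation per minute

    private final Map<String, AtomicInteger> sampledThisMinute = new ConcurrentHashMap<>();
    private final AtomicInteger sharedQuotaUsed = new AtomicInteger();
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis());

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
            SpanKind spanKind, Attributes attributes, List<LinkData> parentLinks) {
        rollWindowIfNeeded();
        // Track up to MAX_TRACKED operations individually; the rest share a quota.
        AtomicInteger counter = sampledThisMinute.size() < MAX_TRACKED
            ? sampledThisMinute.computeIfAbsent(name, k -> new AtomicInteger())
            : sampledThisMinute.get(name); // map full: only already-tracked names
        if (counter != null) {
            return counter.incrementAndGet() <= QUOTA
                ? SamplingResult.recordAndSample() : SamplingResult.drop();
        }
        // Long-tail operations share a single per-minute quota.
        return sharedQuotaUsed.incrementAndGet() <= QUOTA
            ? SamplingResult.recordAndSample() : SamplingResult.drop();
    }

    private void rollWindowIfNeeded() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start >= 60_000 && windowStart.compareAndSet(start, now)) {
            sampledThisMinute.clear(); // start a fresh one-minute window
            sharedQuotaUsed.set(0);
        }
    }

    @Override
    public String getDescription() { return "AdaptiveSampler(sketch)"; }
}
```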
Select a sampling policy by scenario
Each sampling policy has its own advantages and disadvantages. The following sections will help you balance performance and costs in different scenarios.
In ARMS, you can copy the trace sampling settings of one application to other applications in batches to meet your business requirements in varied scenarios. For more information, see Synchronize application settings to other applications.
Cost control: fixed-rate sampling
In general, you can set a low fixed sampling rate for your applications, such as changing the default rate of 10% to 5%.
Halving the sampling rate halves the sampling cost, but it does not reduce the diagnostic value of the sampled data by the same amount. Most traces are normal and largely duplicates of one another, so exceptions can still be captured at a low fixed rate, especially for applications with many requests: an exception that occurs in one trace of an application is likely to recur in other traces, unless it is an occasional spike. A low fixed sampling rate therefore preserves the basic effectiveness of trace monitoring while relieving your cost pressure.
The following figure shows the settings for fixed-rate sampling. For information about how to set the sampling rate, see Fixed-rate sampling.
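In OpenTelemetry terms (a sketch; in ARMS the rate is changed through the console setting referenced above), lowering the rate is a one-line sampler change:

```java
import io.opentelemetry.sdk.trace.samplers.Sampler;

public final class LowCostSampling {
    // Drop the head-based sampling rate from the default 10% to 5%.
    static final Sampler SAMPLER =
        Sampler.parentBased(Sampler.traceIdRatioBased(0.05));
}
```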
Core business: fixed-rate sampling and operation-specific full collection
To prioritize certain business logic in a workflow, you can set a high fixed sampling rate for your core applications. Meanwhile, you can enable full collection for specific core operations so that every trace of the requests sent to those operations is sampled.
For example, in an e-commerce system, it is more important to focus on operations such as product querying and purchasing than on querying and editing user information. Doing so helps you quickly identify and resolve trace exceptions and keep services uninterrupted. In this case, we recommend that you raise the sampling rate above the default 10% for these applications and enable full collection for the specific core operations that require it.
Full collection may cause a surge in the volume of collected data. Make sure that full collection is enabled only for key operations, or only when necessary.
The following figure shows the settings for collecting all trace data of specific operations, or operations with certain prefixes or suffixes. For information about how to set the full collection, see Fixed-rate sampling.
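As a sketch of how such a rule behaves (the operation names and prefix below are hypothetical, and ARMS configures this through the console rather than code), a sampler can check an allowlist and a prefix before falling back to the fixed rate:

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;
import java.util.List;
import java.util.Set;

/** Collects 100% of traces for core operations; delegates the rest. */
final class FullCollectionSampler implements Sampler {
    private final Set<String> coreOperations; // exact names, e.g. "/product/query"
    private final String corePrefix;          // hypothetical prefix rule, e.g. "/order/"
    private final Sampler fallback;           // e.g. a fixed-rate sampler

    FullCollectionSampler(Set<String> coreOperations, String corePrefix, Sampler fallback) {
        this.coreOperations = coreOperations;
        this.corePrefix = corePrefix;
        this.fallback = fallback;
    }

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
            SpanKind spanKind, Attributes attributes, List<LinkData> parentLinks) {
        if (coreOperations.contains(name) || name.startsWith(corePrefix)) {
            return SamplingResult.recordAndSample(); // full collection for core operations
        }
        return fallback.shouldSample(parentContext, traceId, name, spanKind, attributes, parentLinks);
    }

    @Override
    public String getDescription() {
        return "FullCollectionSampler";
    }
}
```

For example, new FullCollectionSampler(Set.of("/product/purchase"), "/order/", Sampler.traceIdRatioBased(0.30)) would collect every purchase trace while sampling everything else at 30%.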
Major O&M events: fixed-rate sampling
During major O&M events, such as large-scale promotions or performance testing on new releases, you can set a fixed sampling rate of 100% for applications with certain tags, or even for all applications. For information about how to filter applications by tag, see Manage tags. This facilitates troubleshooting, auditing, and accountability.
After the major events, we recommend that you lower the sampling rate to prevent performance loss and unnecessary costs.
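Because samplers are typically fixed at startup, switching to 100% for an event and back afterwards takes a small indirection. The sketch below is an illustration, not an ARMS API (in ARMS you adjust the rate in the console): it wraps the active sampler in an atomic reference so it can be swapped at runtime.

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Lets operators switch to 100% sampling for an event and revert afterwards. */
final class SwitchableSampler implements Sampler {
    private final AtomicReference<Sampler> delegate =
        new AtomicReference<>(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)));

    void enableFullSampling() { delegate.set(Sampler.alwaysOn()); } // during the event

    void restoreNormalRate() { // after the event, to avoid unnecessary costs
        delegate.set(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)));
    }

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
            SpanKind spanKind, Attributes attributes, List<LinkData> parentLinks) {
        return delegate.get().shouldSample(
            parentContext, traceId, name, spanKind, attributes, parentLinks);
    }

    @Override
    public String getDescription() { return "SwitchableSampler"; }
}
```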
Traffic fluctuation: adaptive sampling
In scenarios where traffic fluctuates as complex business logic changes, you can apply the adaptive sampling policy to your applications.
Applications with complex business logic often involve many operations. The adaptive sampling policy ensures that the volume of collected data does not grow linearly with operation traffic, because only certain entries are sampled based on the LFU algorithm. In addition, the built-in minimum sampling policy for all operations acts as a supplement, ensuring that valuable trace data is recorded for each operation regardless of traffic levels.
Minimum sampling for all API operations: A minimum number of traces is automatically sampled for each API operation every minute. ARMS enables this sampling policy by default.
By contrast:
If you set a fixed sampling rate for the applications, operations with many requests are over-sampled, whereas operations with low traffic or zero calls, such as scheduled tasks, may rarely or never be sampled, so exceptions that occur in them can be missed.
If you manually enable full collection for specific operations one by one, you face a large O&M workload and risk oversights.
Nevertheless, you can still configure full collection for specific operations that are critical to the application performance.
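Reusing the two sketches above, combining adaptive sampling with full collection for critical operations is a matter of composing the samplers (the operation names here are hypothetical):

```java
import io.opentelemetry.sdk.trace.samplers.Sampler;
import java.util.Set;

public final class CombinedSampling {
    // Core operations are always collected; everything else is governed
    // by the adaptive per-operation quota.
    static final Sampler SAMPLER = new FullCollectionSampler(
        Set.of("/product/purchase"), // hypothetical critical operation
        "/order/",                   // hypothetical prefix rule
        new AdaptiveSampler());
}
```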
The following figure shows the settings for adaptive sampling. For information about how to configure the settings, see Adaptive sampling.
Related steps
After traces are sampled, you can configure the filter conditions and aggregation dimensions to analyze trace data in real time. For more information, see Trace analysis.