Using Log Service Trace to Implement a Reliable Deployment Solution for Jaeger

This article introduces Jaeger and describes how to implement a highly reliable deployment solution for Jaeger using SLS Trace.

1. Background

With the rise of microservices, complex monolithic applications are being split into many services. As systems continue to be decomposed, the number of services grows and the call relationships between them become more complex. Given this complexity and scale, troubleshooting and performance optimization have become a challenge for both R&D and O&M staff.

In 2010, Google published the paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", which introduced the concept of distributed tracing. Each request is assigned a unique trace ID, the ID is passed to the callee on every cross-instance call, and the start and end time of each method is recorded. The records are then joined by trace ID to form a directed acyclic graph, in which each node represents a method and each edge represents the execution order and calling relationship between methods.


As shown in the figure, the left side shows how a request is executed across microservices, and the right side shows the resulting trace diagram. With this diagram, developers and O&M staff can clearly see the lifecycle and path of a request, including the applications and methods it passed through and the call relationships between them, which helps teams locate business faults and performance bottlenecks quickly. Because such monitoring is crucial for finding and analyzing problems rapidly, many open-source and commercial solutions have emerged, such as Jaeger, OpenTelemetry, Apache SkyWalking, and Zipkin. Application Performance Monitoring (APM) vendors such as LightStep, AppDynamics, and New Relic also provide monitoring services.
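The Dapper model described above can be sketched in a few lines. This is a minimal in-memory illustration, not how Jaeger or any SDK is implemented: the `Span` and `Tracer` classes, the `call_tree` helper, and the service names are all hypothetical.

```python
import itertools
import time
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: int              # unique within this tracer
    parent_id: Optional[int]  # None for the root span
    name: str
    start: float
    end: float = 0.0


class Tracer:
    """Collects spans in memory; a real tracer would ship them to a backend."""

    def __init__(self) -> None:
        self.spans: List[Span] = []
        self._ids = itertools.count(1)

    def start_span(self, name: str, trace_id: Optional[str] = None,
                   parent_id: Optional[int] = None) -> Span:
        # A new trace ID is generated only when the caller did not pass one.
        span = Span(trace_id or uuid.uuid4().hex, next(self._ids),
                    parent_id, name, time.monotonic())
        self.spans.append(span)
        return span

    def finish(self, span: Span) -> None:
        span.end = time.monotonic()


def call_tree(spans: List[Span], trace_id: str) -> Dict[Optional[int], List[str]]:
    """Group the span names of one trace by parent span ID."""
    tree: Dict[Optional[int], List[str]] = {}
    for span in spans:
        if span.trace_id == trace_id:
            tree.setdefault(span.parent_id, []).append(span.name)
    return tree


tracer = Tracer()
root = tracer.start_span("frontend:/checkout")
# The trace ID and the caller's span ID travel with each cross-service call.
child = tracer.start_span("order-service:create", root.trace_id, root.span_id)
grandchild = tracer.start_span("db:insert", child.trace_id, child.span_id)
for span in (grandchild, child, root):
    tracer.finish(span)

tree = call_tree(tracer.spans, root.trace_id)
```

Here `tree` maps each parent span ID to the names of its children, which is the information a trace diagram is drawn from.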

2. Introduction to Jaeger

Jaeger is a distributed tracing system developed by Uber. It was open sourced in April 2017 and accepted as a CNCF incubating project in September of the same year. In October 2019, it graduated and became a top-level CNCF project. The following figure shows the Jaeger architecture. The two architecture diagrams are largely the same, except that the second adds Kafka between the Collector and the DB as a buffer to absorb traffic peaks. Jaeger supports several storage backends: memory, Badger, Cassandra, Elasticsearch, and gRPC plug-ins.


3. Jaeger's High-reliability Solution

As a component of the observability and monitoring stack, Jaeger is a critical data source that development and operations staff use to discover and locate problems in business systems. As Site Reliability Engineers (SREs), we must ensure that the monitoring system outlives the business systems it observes: monitoring is worthless if it goes down before the business system does. Monitoring is the last line of defense for analyzing business exceptions, so it has stricter requirements for availability and performance than other systems.

As an open-source project, Jaeger does not provide guidance on sizing a deployment or on keeping the service highly available. O&M staff must design a concrete deployment based on their own experience and on the scale of the business system. Two questions need answers: How do we provide a highly available, high-performance backend in this situation? And who provides the last layer of protection for the monitoring system itself?

First, let's look at where the high availability of a Jaeger deployment needs work from an architecture perspective, taking the latest community solution (the right side of the figure above) as an example:

  • Jaeger Client and Agent are deployed on the same server as the application and are relatively stable. We only need to ensure the network quality between the Agent and the backend.
  • The new deployment mode adds a Kafka queue as a buffer for burst traffic. However, Kafka alone is not enough: we also need to consider whether Kafka has sufficient resources to scale out dynamically. Likewise, the forwarding capacity of the Collector, the stream computing capacity of Flink, and the import capacity of the Ingester must scale out when traffic spikes.
  • Kafka and the DB (such as ES and Cassandra) are stateful services in this deployment, so we need to deploy multiple replicas to ensure reliability.
  • To keep queries efficient (especially when errors spike suddenly), ES and Cassandra need tuning for large data volumes, which involves considerable work.
  • Jaeger's backend involves many components. We need to deploy an additional monitoring system to watch the stability and performance of the Jaeger backend and configure alerts so that emergencies are handled quickly.
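The scaling concern in the list above can be illustrated with a small queue model. The sketch below is a deliberately simplified simulation, not a model of Kafka itself: the `backlog_over_time` helper and the traffic numbers are hypothetical, and each list element stands for the number of spans arriving in one time interval.

```python
from typing import List


def backlog_over_time(arrivals: List[int], capacity_per_tick: int) -> List[int]:
    """Return the queue depth after each tick for a fixed consumer capacity."""
    backlog = 0
    history = []
    for arrived in arrivals:
        # Whatever the consumers cannot process in this tick stays buffered.
        backlog = max(0, backlog + arrived - capacity_per_tick)
        history.append(backlog)
    return history


# A short burst (300 spans per tick) in otherwise quiet traffic (50 per tick).
traffic = [50, 50, 300, 300, 50, 50, 50]

fixed_consumers = backlog_over_time(traffic, capacity_per_tick=100)
scaled_consumers = backlog_over_time(traffic, capacity_per_tick=300)

print(fixed_consumers)   # [0, 0, 200, 400, 350, 300, 250]
print(scaled_consumers)  # [0, 0, 0, 0, 0, 0, 0]
```

With fixed consumers the buffer absorbs the burst, but the backlog drains slowly and trace data arrives late. This is why the components behind the queue need to scale as well.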

Compared with simply deploying a Jaeger setup, the work above takes considerable effort and a continuous investment of manpower in O&M and management. The simplest way to avoid this is to use a managed service directly.

4. Log Service as the Jaeger Backend

The core part of Jaeger's high availability is the Jaeger backend, including Collector, Kafka, Flink, DB, Query, and UI. The best practice is to find a backend system compatible with Jaeger while providing high reliability and performance.

The Trace service recently released by Alibaba Cloud Log Service (SLS) meets this requirement. SLS's most prominent features are high performance, elasticity, and O&M-free operation, which help users cope with traffic surges and inaccurate capacity estimates. SLS provides 99.9% availability and 99.999999999% data reliability.

In general, there are two ways to replace Jaeger's backend:

  1. The Jaeger SDK generates native data, and queries continue to use the Jaeger UI for the convenience of application developers.
  2. The Jaeger SDK generates native data, and the Trace UI provided by SLS is used for querying. Compared with the Jaeger UI, the Trace capabilities provided by SLS are much more powerful, including trace metric calculation, dependency analysis, and custom analysis.

5. Access Methods

The Trace service of SLS provides an easy-to-access backend for a range of open-source tracing software, with a unified data model and performance analysis capabilities. SLS is fully compatible with the Jaeger deployment mode and currently provides two access methods that correspond to the two approaches above:

  1. The native Jaeger access method uses SLS only as Jaeger's storage backend and keeps the original Jaeger UI (you can also log in to SLS to use its Trace features).
  2. The simplified Jaeger access method uses Jaeger only for data ingestion; all subsequent functions use the Trace capabilities provided by SLS.

5.1 Native Jaeger Access Method

The native Jaeger access method is a hybrid mode, with the Jaeger UI as the frontend and SLS as the storage backend. It gives users who are accustomed to the Jaeger UI a way to keep using it.


Here are the steps for accessing native Jaeger. For more detailed information about parameter configurations and container deployment methods, see GitHub.

  • Log in to the SLS console and create a project to store spans.
  • Log in to the SLS instance list and create a Trace instance. Note: Select the project created in the previous step.
  • Go to the Jaeger download page. Download and decompress the Jaeger package.
  • Start the Agent. The following commands are for macOS.
./agent-darwin --collector.host-port=localhost:14267
  • Start the Jaeger Collector
export SPAN_STORAGE_TYPE=aliyun-log-otel && \
./collector-darwin \
--aliyun-log.project=<PROJECT> \
--aliyun-log.endpoint=<ENDPOINT> \
--aliyun-log.access-key-id=<ACCESS_KEY_ID> \
--aliyun-log.access-key-secret=<ACCESS_KEY_SECRET> \
--aliyun-log.span-logstore=<SPAN_LOGSTORE> \
--aliyun-log.init-resource-flag=false
  • Start the UI
export SPAN_STORAGE_TYPE=aliyun-log-otel && \
./query-darwin \
--aliyun-log.project=<PROJECT> \
--aliyun-log.endpoint=<ENDPOINT> \
--aliyun-log.access-key-id=<ACCESS_KEY_ID> \
--aliyun-log.access-key-secret=<ACCESS_KEY_SECRET> \
--aliyun-log.span-logstore=<SPAN_LOGSTORE> \
--aliyun-log.span-dep-logstore=<SPAN_DEP_LOGSTORE> \
--aliyun-log.init-resource-flag=false

The following table is a detailed description of each parameter:

| Parameter Name | Description |
| --- | --- |
| PROJECT | Specify the project used to store spans. |
| ENDPOINT | Specify the endpoint where the project used to store spans exists. Its format is ${project}.${region-endpoint}. ${project} is the name of the Log Service project. ${region-endpoint} is the endpoint of the project. You can access Log Service by using an endpoint of the Internet, the classic network, or a VPC. |
| ACCESS_KEY_ID | Your AccessKey ID. |
| ACCESS_KEY_SECRET | Your AccessKey secret. |
| SPAN_LOGSTORE | Specify the Logstore used to store spans. The name is {instance-id}-traces. |
| SPAN_DEP_LOGSTORE | Specify the Logstore used to store service call relationships. The name is {instance-id}-traces-dep. Default value: jaeger-traces-dep. |
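The naming rules in the table can be captured in a small helper that derives the endpoint and Logstore names and assembles the storage flags used by the Collector and Query commands. This is only a sketch: the function names, the `my-project` and `my-instance` values, and the `cn-hangzhou.log.aliyuncs.com` region endpoint are example assumptions, and only the flags shown in the commands above are used.

```python
from typing import List, Tuple


def sls_endpoint(project: str, region_endpoint: str) -> str:
    """Return the endpoint in the ${project}.${region-endpoint} format."""
    return f"{project}.{region_endpoint}"


def trace_logstores(instance_id: str) -> Tuple[str, str]:
    """Return the (span, dependency) Logstore names for a trace instance."""
    return f"{instance_id}-traces", f"{instance_id}-traces-dep"


def storage_flags(project: str, region_endpoint: str, instance_id: str,
                  key_id: str, key_secret: str,
                  with_dependencies: bool = False) -> List[str]:
    """Build the --aliyun-log.* flags; Query also needs the dependency Logstore."""
    span_store, dep_store = trace_logstores(instance_id)
    flags = [
        f"--aliyun-log.project={project}",
        f"--aliyun-log.endpoint={sls_endpoint(project, region_endpoint)}",
        f"--aliyun-log.access-key-id={key_id}",
        f"--aliyun-log.access-key-secret={key_secret}",
        f"--aliyun-log.span-logstore={span_store}",
    ]
    if with_dependencies:
        flags.append(f"--aliyun-log.span-dep-logstore={dep_store}")
    flags.append("--aliyun-log.init-resource-flag=false")
    return flags


# Example values only; replace them with your own project, region, and keys.
collector_flags = storage_flags(
    "my-project", "cn-hangzhou.log.aliyuncs.com", "my-instance",
    "<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
query_flags = storage_flags(
    "my-project", "cn-hangzhou.log.aliyuncs.com", "my-instance",
    "<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>", with_dependencies=True)
```

Deriving both Logstore names from the instance ID in one place keeps the Collector and Query configurations consistent.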

5.2 Simplified Jaeger Access Method

The simplified version provides two data access methods: direct transmission and forwarding. Direct transmission is simple to deploy but requires each Agent to connect to SLS. Forwarding supports more advanced features such as flow control. This section covers both.


5.2.1 Direct Transmission Method

Direct transmission sends traces straight to the SLS backend by configuring the SLS address on the Jaeger Agent. The biggest advantage of this method is that you don't have to deploy a Jaeger Collector instance. The following command starts the Agent in direct transmission mode:

./jaeger-agent --reporter.grpc.host-port=${ENDPOINT} --reporter.grpc.tls.enabled=true --agent.tags=sls.otel.project=${PROJECT},sls.otel.instanceid=${INSTANCE},sls.otel.akid=${ACCESS_KEY_ID},sls.otel.aksecret=${ACCESS_SECRET}

The following table is a detailed description of each parameter:

| Parameter | Description |
| --- | --- |
| ACCESS_KEY_ID | The AccessKey ID of your Alibaba Cloud account. We recommend that you use the AccessKey pair of a RAM user that has only the write permissions on the Log Service project. An AccessKey pair consists of an AccessKey ID and an AccessKey secret. |
| ACCESS_SECRET | The AccessKey secret of your Alibaba Cloud account. We recommend that you use the AccessKey pair of a RAM user that has only the write permissions on the Log Service project. |
| PROJECT | The name of the Log Service project. |
| INSTANCE | The name of the trace instance. |
| ENDPOINT | The access address, in the format of ${project}.${region-endpoint}:10010. ${project} is the name of the Log Service project. ${region-endpoint} is the endpoint of the Log Service project. You can access Log Service by using the public or internal endpoint of the project. An internal endpoint is used for access over the classic network or a VPC. |
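Because the direct transmission command packs several values into a single comma-separated `--agent.tags` flag, it can help to assemble it programmatically. The sketch below builds the argument list from the parameters in the table. The `agent_command` helper and the example values are hypothetical, and the comma check is a precaution based on the assumption that a comma inside a value would be read as a tag separator.

```python
import shlex
from typing import List


def agent_command(project: str, region_endpoint: str, instance: str,
                  key_id: str, key_secret: str,
                  binary: str = "./jaeger-agent") -> List[str]:
    """Build the jaeger-agent argument list for direct transmission to SLS."""
    values = {
        "sls.otel.project": project,
        "sls.otel.instanceid": instance,
        "sls.otel.akid": key_id,
        "sls.otel.aksecret": key_secret,
    }
    for key, value in values.items():
        if not value or "," in value:
            raise ValueError(f"{key} must be non-empty and contain no comma")
    tags = ",".join(f"{key}={value}" for key, value in values.items())
    # The endpoint format is ${project}.${region-endpoint}:10010.
    endpoint = f"{project}.{region_endpoint}:10010"
    return [
        binary,
        f"--reporter.grpc.host-port={endpoint}",
        "--reporter.grpc.tls.enabled=true",
        f"--agent.tags={tags}",
    ]


# Example values only; in practice, read the AccessKey pair from a secret store.
cmd = agent_command("my-project", "cn-hangzhou.log.aliyuncs.com",
                    "my-instance", "my-key-id", "my-key-secret")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting problems with the AccessKey secret.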

5.2.2 Forwarding Method

The forwarding method uses the OpenTelemetry Collector to collect the incoming span data from the jaeger-agent and sends the trace data to the backend. Here are the deployment steps for the forwarding method:

  • Download Collector

OpenTelemetry Collector download address: https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.30.0

  • Add a configuration file

Add the configuration file config.yaml and modify the configuration content based on the actual situation. For detailed parameter descriptions in the configuration, see the parameter explanation in the direct transmission method above.

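receivers:
  jaeger:
    protocols:
      # assumption: list the protocols your jaeger-agent reports over (grpc shown here)
      grpc:
exporters:
  logging/detail:
    loglevel: debug
  alibabacloud_logservice/sls-trace:
    endpoint: "{ENDPOINT}"
    project: "{PROJECT}"
    logstore: "{LOGSTORE}"
    access_key_id: "{ACCESS_KEY_ID}"
    access_key_secret: "{ACCESS_KEY_SECRET}"
service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [alibabacloud_logservice/sls-trace]
      # for debug
      #exporters: [logging/detail,alibabacloud_logservice/sls-trace]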
  • Run Collector

Run the following command to start Collector:

./otelcontribcol_linux_amd64 --config="PATH/TO/config.yaml"

5.2.3 Overview of SLS Trace Features

Trace dependency analysis automatically calculates and generates the dependency topology of traces. Compared with Jaeger, it adds many metrics, including QPS, error rate, average latency, and PXX latency.
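To make these metrics concrete, the following sketch computes them per service from a batch of spans. It illustrates the definitions only and says nothing about how SLS computes them internally: the `SpanRecord` type, the sample data, and the nearest-rank percentile method are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SpanRecord:
    service: str
    duration_ms: float
    error: bool


def percentile(sorted_values: List[float], p: float) -> float:
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def service_metrics(spans: List[SpanRecord],
                    window_seconds: float) -> Dict[str, Dict[str, float]]:
    """Compute QPS, error rate, average latency, and P99 latency per service."""
    by_service: Dict[str, List[SpanRecord]] = {}
    for span in spans:
        by_service.setdefault(span.service, []).append(span)

    result = {}
    for service, items in by_service.items():
        durations = sorted(item.duration_ms for item in items)
        result[service] = {
            "qps": len(items) / window_seconds,
            "error_rate": sum(item.error for item in items) / len(items),
            "avg_ms": sum(durations) / len(durations),
            "p99_ms": percentile(durations, 99),
        }
    return result


# Four spans of one hypothetical service observed over a 2-second window.
sample = [
    SpanRecord("order", 10.0, False),
    SpanRecord("order", 20.0, False),
    SpanRecord("order", 30.0, True),
    SpanRecord("order", 40.0, False),
]
metrics = service_metrics(sample, window_seconds=2.0)
```

For this sample, the `order` service has a QPS of 2.0, an error rate of 0.25, an average latency of 25 ms, and a P99 latency of 40 ms.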


The Trace list displays overview information for the uploaded spans. In addition, the search box supports combined searches on span attributes, tags, latency, and other conditions.


The Trace detail page displays the execution duration, call relationships, and span information of each method.


5.3 Comparison between the Two Access Methods

Both methods can ingest and use Jaeger trace data. The following table summarizes their similarities and differences.

| | Native Jaeger Access Method | Simplified Jaeger Access Method |
| --- | --- | --- |
| Reliability | Strong, need to ensure the stability of Query UI services | Strong |
| Data Processing Capability | Strong | Strong |
| Deployment Complexity | Relatively low, need to deploy an additional set of Query UI service | Low, no need to deploy any components except access |
| Capability to Locate and Detect Faults | Average (can also use the Trace UI provided by SLS) • Support simple query capabilities • Topology diagram discovery | • Support combination query of multiple conditions such as tags, attributes, and delay time based on Span • Support service and span-level metrics • Automatic topology diagram discovery |
| User Habits | Retain Jaeger's pages, no need to adjust user habits | Need to adjust user habits |

Overall, the simplified Jaeger access method works better. As a monitoring system, it helps users locate faults quickly: by combining service-level and span-level metrics with multi-condition queries on span attributes, we can quickly filter and locate abnormal services, spans, and traces.

6. Summary

As a representative implementation of the OpenTracing protocol, Jaeger is a popular top-level CNCF project. However, if your company is building a new tracing system, we do not recommend starting with Jaeger: OpenTracing has merged with OpenCensus to form OpenTelemetry, which is the unified standard for tracing going forward. We therefore recommend using OpenTelemetry's native tracing.

More Articles about OpenTelemetry:


  1. https://www.jaegertracing.io/docs/1.24/architecture/
  2. https://research.google/pubs/pub36356/
  3. https://zhuanlan.zhihu.com/p/393861201
  4. https://developer.aliyun.com/article/783621