Managed Service for OpenTelemetry: Best practices for end-to-end tracing

Last Updated: Dec 19, 2024

The core value of tracing lies in "connection." The user terminal, gateway, backend applications, and dependent services (such as databases, messaging systems, and large models) collectively form the topology map of tracing. The broader the coverage of this topology, the greater the value that tracing can deliver. End-to-end tracing is the best practice that covers all associated IT systems, providing a complete record of user behavior across system calls and states.

Workflow

Application Real-Time Monitoring Service (ARMS) and Managed Service for OpenTelemetry support end-to-end tracing among user terminals (such as browser, Android, and iOS), cloud gateways (such as Application Load Balancer, Microservice Engine, NGINX Ingress Controller, and Service Mesh), backend applications (such as Java, Go, Python, and .NET applications), and dependent services (such as databases, message queues, and large models), as shown in the following figure.

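[Figure: end-to-end tracing across user terminals, cloud gateways, backend applications, and dependent services]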

Tracing instrumentation: provides ARMS agent for Java, Go, and Python, enhancing multi-language coverage with open-source compatibility

For mainstream languages such as Java, Go, and Python, we recommend that you use the self-developed ARMS agent, which improves the quality, performance, stability, and usability of tracing instrumentation. Managed Service for OpenTelemetry is compatible with four mainstream tracing tools: OpenTelemetry, SkyWalking, Zipkin, and Jaeger. It also supports tracing instrumentation and data reporting in more than 10 languages, as shown in the following table.

ARMS is fully interoperable with Managed Service for OpenTelemetry. We recommend that you use them together in multi-language scenarios.

| Language | ARMS Application Monitoring (self-developed agent with guaranteed SLA) | Managed Service for OpenTelemetry (open source client and self-management) | Recommended option |
| --- | --- | --- | --- |
| Java | Automatic instrumentation | Automatic instrumentation | ARMS |
| Go | Automatic instrumentation | Automatic instrumentation | ARMS |
| Python | Automatic instrumentation | Automatic instrumentation | ARMS |
| Node.js | Unsupported | Automatic instrumentation | OpenTelemetry |
| .NET | Unsupported | Automatic instrumentation | OpenTelemetry |
| PHP | Unsupported | Automatic instrumentation | OpenTelemetry |
| Erlang | Unsupported | Automatic instrumentation | OpenTelemetry |
| C++ | Unsupported | Manual instrumentation | OpenTelemetry |
| Swift | Unsupported | Manual instrumentation | OpenTelemetry |
| Ruby | Unsupported | Manual instrumentation | OpenTelemetry |
| Rust | Unsupported | Manual instrumentation | SkyWalking |

The ARMS agent for Java v4.0, released in 2024, fully embraces the OpenTelemetry ecosystem. The agent foundation has been completely upgraded based on the OpenTelemetry framework and provides additional monitoring of various resources, performance diagnostics, and application security data. In addition to richer data, the ARMS agent for Java v4.0 supports advanced features, such as more flexible trace sampling policies, user-friendly agent management, comprehensive self-monitoring, and dynamic feature degradation, making it more suitable for enterprise-level production environments.

Trace collection and processing: integrates deeply with Alibaba Cloud, enabling easy trace configuration for cloud services

A major challenge for enterprises moving to the cloud is their heavy reliance on cloud service availability. End-to-end tracing can quickly pinpoint the node where a request is slow or fails, which speeds up fault recovery and reduces business losses.

Managed Service for OpenTelemetry collaborates with nearly 10 Alibaba Cloud services, implementing internal tracing and data reporting. Enterprise users can simply enable the tracing option in the cloud service console to view traces, greatly reducing collection costs. The tracing integration for Application Load Balancer (ALB), Microservice Engine (MSE), and ARMS User Experience Monitoring (RUM) is illustrated below.

Due to service characteristics, different cloud services use distinct tracing instrumentation schemes. Trace data collection is generally divided into two types:

  • Direct or forwarded trace reporting: As seen in RUM, internal tracing instruments report directly through an Exporter, providing more detailed and flexible instrumentation.

  • Log data conversion to trace: In ALB, backend systems convert access logs into trace data, offering less intrusive instrumentation.

Each scheme has its own advantages and disadvantages. Direct or forwarded trace reporting is more standardized and is usually recommended. However, if performance requirements are strict or tracing is difficult to enable for the system, you can convert logs to traces, provided that trace context such as the TraceId has been added to the logs.
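The log-to-trace scheme can be illustrated with a minimal sketch. The log format below (space-separated key=value pairs with fields named trace_id, span_id, parent_span_id, method, path, status, start_ms, and duration_ms) is a hypothetical one chosen for illustration; it is not the actual ALB access log format, and the Span class is not a type from any SDK.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical key=value access log format; real gateway logs use their own fields.
LOG_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')


@dataclass
class Span:
    """Illustrative span record, not an SDK type."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: int
    duration_ms: int
    status_code: int


def log_line_to_span(line: str) -> Optional[Span]:
    """Convert one access log line into a span, or None if it lacks trace context."""
    fields = {key: value.strip('"') for key, value in LOG_PATTERN.findall(line)}
    # Without a TraceId and SpanId in the log, the entry cannot be linked to a trace.
    if "trace_id" not in fields or "span_id" not in fields:
        return None
    return Span(
        trace_id=fields["trace_id"],
        span_id=fields["span_id"],
        parent_span_id=fields.get("parent_span_id"),
        name=f'{fields.get("method", "UNKNOWN")} {fields.get("path", "/")}',
        start_ms=int(fields.get("start_ms", 0)),
        duration_ms=int(fields.get("duration_ms", 0)),
        status_code=int(fields.get("status", 0)),
    )
```

The sketch shows why the trace context must already be present in the logs: a line without a TraceId is skipped because it cannot be attached to any trace.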

The following table lists the supported cloud services and tracing protocols, and provides the relevant references.

| Category | Source | References | Tracing protocol |
| --- | --- | --- | --- |
| User terminal | Web application, HTML5 application, and mini programs | Enable end-to-end tracing for a web application or mini program | W3C, B3, Jaeger, and SkyWalking |
| User terminal | Android and iOS apps | Enable end-to-end tracing for an app | W3C and SkyWalking |
| Gateway | MSE | Enable tracing analysis for a cloud-native gateway | W3C, B3, and SkyWalking |
| Gateway | NGINX Ingress Controller | Enable tracing for the NGINX Ingress controller | W3C, B3, and Jaeger |
| Gateway | ALB | Enable Managed Service for OpenTelemetry for ALB | B3 |
| Gateway | Service Mesh | Enable distributed tracing in ASM | B3 |
| Gateway | API Gateway | Configure tracing analysis | B3 |
| Backend application | Java, Go, and Python applications monitored by the ARMS agent | Application Monitoring overview | W3C, B3, Jaeger, SkyWalking, and EagleEye |
| Backend application | Applications in other languages such as .NET and Node.js | Integration guide | W3C, B3, Jaeger, and SkyWalking |

Dependent service: More than 100 components are supported for monitoring, covering types such as remote procedure call (RPC), message queues, databases, and task scheduling.

Trace context propagation: standardizes Alibaba Cloud end-to-end tracing protocol, supporting multiple protocol conversions with ARMS agent

For a single application, instrumentation and data collection can be considered successful once the corresponding trace data appears in the console. End-to-end tracing, however, requires upstream and downstream services to propagate trace context with a unified protocol so that traces stay continuous, which poses both technical and coordination challenges.

Managed Service for OpenTelemetry has already achieved end-to-end trace connectivity based on the W3C Trace Context protocol adopted by OpenTelemetry, and will progressively cover more protocols and services to build a comprehensive and flexible tracing ecosystem, as shown in the following diagram.

[Figure: end-to-end trace connectivity based on the W3C protocol]
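To make W3C-based context propagation concrete, the following minimal sketch parses an incoming `traceparent` header and builds the header for an outgoing call. It handles only the version `00` layout of the W3C Trace Context format (`version-traceid-parentid-flags`). The function and class names are illustrative, and starting a new trace as sampled is an assumption of this sketch, not documented agent behavior.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version 00 traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace IDs and parent IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: Optional[TraceContext]) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    # Assumption: a request without valid context starts a new, sampled trace.
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

Because every hop keeps the trace ID and replaces only the span ID, spans reported by the terminal, gateway, and backend applications can be joined into one trace.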

Compared to new applications, existing applications face greater challenges in unifying end-to-end protocols, especially during technology stack transitions (for example, migrating from SkyWalking to OpenTelemetry). Ensuring continuous operation of the existing monitoring system while validating the new one and enabling coexistence of two different tracing systems is a major hurdle for upgrading or connecting existing applications.

To address this, the self-developed ARMS agent has undergone extensive compatibility optimizations, achieving dual-agent coexistence to ensure both systems operate correctly and stably until migration is complete, as illustrated below.

[Figure: dual-agent coexistence during migration]

The ARMS agent supports multi-protocol recognition and transmission. In scenarios where upstream and downstream systems cannot easily change, the agent can act as a protocol mediator. For example, if upstream application A uses the Jaeger protocol and downstream application B uses the Zipkin B3 protocol, the ARMS agent receives the Jaeger trace context and forwards it in both Jaeger and Zipkin B3 formats. This keeps traces continuous and connected between systems that use different protocols.
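The mediation idea can be sketched as a header conversion. This is not the ARMS agent's implementation; it is a simplified illustration that reads the Jaeger `uber-trace-id` header (`trace-id:span-id:parent-span-id:flags`) and adds the equivalent B3 multi-header fields while keeping the Jaeger header. The function name is hypothetical. The sketch assumes lower-case header keys and an unencoded header value, and it forwards the received span ID as-is rather than creating a new span as a real agent would.

```python
import re
from typing import Dict

# Jaeger format: {trace-id}:{span-id}:{parent-span-id}:{flags}, all hex.
UBER_TRACE_ID_RE = re.compile(
    r"([0-9a-fA-F]{1,32}):([0-9a-fA-F]{1,16}):([0-9a-fA-F]{1,16}):([0-9a-fA-F]{1,2})"
)


def add_b3_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return outgoing headers carrying the context in both Jaeger and B3 form."""
    outgoing = dict(headers)  # keep the original Jaeger header untouched
    match = UBER_TRACE_ID_RE.fullmatch(headers.get("uber-trace-id", ""))
    if match is None:
        return outgoing  # no valid Jaeger context, nothing to convert
    trace_id, span_id, parent_id, flags = match.groups()
    # Jaeger may omit leading zeros; B3 expects 16 or 32 lower-hex characters.
    width = 32 if len(trace_id) > 16 else 16
    outgoing["X-B3-TraceId"] = trace_id.lower().zfill(width)
    outgoing["X-B3-SpanId"] = span_id.lower().zfill(16)
    # A parent span ID of 0 marks a root span in Jaeger, so no B3 parent is set.
    if int(parent_id, 16) != 0:
        outgoing["X-B3-ParentSpanId"] = parent_id.lower().zfill(16)
    # Bit 0x01 of the Jaeger flags is the sampled bit.
    outgoing["X-B3-Sampled"] = "1" if int(flags, 16) & 0x01 else "0"
    return outgoing
```

With both header sets present, a downstream application that understands only B3 can continue the same trace that the Jaeger-instrumented upstream started.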