The core value of tracing lies in "connection." The user terminal, gateway, backend applications, and dependent services (such as databases, messaging systems, and large models) collectively form the topology map of tracing. The broader the coverage of this topology, the greater the value that tracing can deliver. End-to-end tracing is the best practice that covers all associated IT systems, providing a complete record of user behavior across system calls and states.
Workflow
Application Real-Time Monitoring Service (ARMS) and Managed Service for OpenTelemetry support end-to-end tracing among user terminals (such as browser, Android, and iOS), cloud gateways (such as Application Load Balancer, Microservice Engine, NGINX Ingress Controller, and Service Mesh), backend applications (such as Java, Go, Python, and .NET applications), and dependent services (such as databases, message queues, and large models), as shown in the following figure.
Tracing instrumentation: provides ARMS agent for Java, Go, and Python, enhancing multi-language coverage with open-source compatibility
For mainstream languages such as Java, Go, and Python, we recommend the self-developed ARMS agent, which improves the quality, performance, stability, and usability of tracing instrumentation. Managed Service for OpenTelemetry is compatible with four mainstream tracing tools: OpenTelemetry, SkyWalking, Zipkin, and Jaeger. It also supports trace instrumentation and data reporting in more than 10 languages, as shown in the following table.
ARMS is fully interoperable with Managed Service for OpenTelemetry. We recommend that you use them together in multi-language scenarios.
Language | ARMS Application Monitoring (self-developed agent with guaranteed SLA) | Managed Service for OpenTelemetry (open source client and self-management) | Recommended option |
Java | Automatic instrumentation | Automatic instrumentation | ARMS |
Go | Automatic instrumentation | Automatic instrumentation | ARMS |
Python | Automatic instrumentation | Automatic instrumentation | ARMS |
Node.js | Unsupported | Automatic instrumentation | OpenTelemetry |
.NET | Unsupported | Automatic instrumentation | OpenTelemetry |
PHP | Unsupported | Automatic instrumentation | OpenTelemetry |
Erlang | Unsupported | Automatic instrumentation | OpenTelemetry |
C++ | Unsupported | Manual instrumentation | OpenTelemetry |
Swift | Unsupported | Manual instrumentation | OpenTelemetry |
Ruby | Unsupported | Manual instrumentation | OpenTelemetry |
Rust | Unsupported | Manual instrumentation | SkyWalking |
The ARMS agent for Java v4.0, released in 2024, fully embraces the OpenTelemetry ecosystem. The agent foundation has been completely upgraded based on the OpenTelemetry framework and provides additional monitoring of various resources, performance diagnostics, and application security data. In addition to richer data, the ARMS agent for Java v4.0 supports advanced features, such as more flexible trace sampling policies, user-friendly agent management, comprehensive self-monitoring, and dynamic feature degradation, making it more suitable for enterprise-level production environments.
Trace collection and processing: integrates deeply with Alibaba Cloud, enabling easy trace configuration for cloud services
A major challenge for enterprises moving to the cloud is their heavy reliance on cloud service availability. End-to-end tracing can quickly pinpoint the node where a request slows down or fails, which shortens fault recovery and reduces business losses.
Managed Service for OpenTelemetry collaborates with nearly 10 Alibaba Cloud services, implementing internal tracing and data reporting. Enterprise users can simply enable the tracing option in the cloud service console to view traces, greatly reducing collection costs. The tracing integration for Application Load Balancer (ALB), Microservice Engine (MSE), and ARMS User Experience Monitoring (RUM) is illustrated below.
Due to service characteristics, different cloud services use distinct tracing instrumentation schemes. Trace data collection is generally divided into two types:
Direct or forwarded trace reporting: As seen in RUM, internal tracing instruments report directly through an Exporter, providing more detailed and flexible instrumentation.
Log data conversion to trace: In ALB, backend systems convert access logs into trace data, offering less intrusive instrumentation.
Each scheme has its own advantages and disadvantages. Direct or forwarded trace reporting is usually recommended because it is more standardized. However, if performance requirements are high or tracing is difficult to enable for the system, you can convert logs to traces, provided that the trace context (such as the TraceId) has been added to the logs.
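The log-to-trace scheme can be illustrated with a minimal sketch. The log layout, the field names, and the `Span` record below are hypothetical and chosen only for illustration; they are not the actual ALB access log format or the conversion logic used by the service. The sketch only shows the idea that a log line can become a span once it carries a trace ID and a span ID.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical access-log layout (an assumption, not a real ALB format):
# <timestamp> <method> <path> <status> <duration_ms> trace_id=<32 hex> span_id=<16 hex>
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) "
    r"(?P<duration_ms>\d+) trace_id=(?P<trace_id>[0-9a-f]{32}) "
    r"span_id=(?P<span_id>[0-9a-f]{16})"
)


@dataclass
class Span:
    """Illustrative span record built from one access-log line."""
    trace_id: str
    span_id: str
    name: str
    start_time: str
    duration_ms: int
    is_error: bool


def log_line_to_span(line: str) -> Optional[Span]:
    """Convert one log line to a span, or return None if it has no trace context."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        # Lines without a trace ID and span ID cannot be linked into a trace.
        return None
    return Span(
        trace_id=match["trace_id"],
        span_id=match["span_id"],
        name=f'{match["method"]} {match["path"]}',
        start_time=match["ts"],
        duration_ms=int(match["duration_ms"]),
        is_error=int(match["status"]) >= 500,  # treat 5xx responses as errors
    )
```

A line that lacks the trace context returns `None`, which mirrors the precondition stated above: logs can be converted to traces only if the trace context has been written to them.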
The following table lists the supported cloud services and tracing protocols, and provides the relevant references.
Category | Source | References | Tracing protocol |
User terminal | Web application, HTML5 application, and mini programs | Enable end-to-end tracing for a web application or mini program | W3C, B3, Jaeger, and SkyWalking |
User terminal | Android and iOS apps | | W3C and SkyWalking |
Gateway | MSE | | W3C, B3, and SkyWalking |
Gateway | NGINX Ingress Controller | | W3C, B3, and Jaeger |
Gateway | ALB | | B3 |
Gateway | Service Mesh | | B3 |
Gateway | API Gateway | | B3 |
Backend application | Java, Go, and Python applications monitored by the ARMS agent | | W3C, B3, Jaeger, SkyWalking, and EagleEye |
Backend application | Applications in other languages such as .NET and Node.js | | W3C, B3, Jaeger, and SkyWalking |
Dependent service | More than 100 components are supported for monitoring, covering types such as remote procedure call (RPC), message queues, databases, and task scheduling. |
Trace context propagation: standardizes Alibaba Cloud end-to-end tracing protocol, supporting multiple protocol conversions with ARMS agent
Instrumentation and data collection for a single application can be considered successful once the corresponding trace data appears in the console. End-to-end tracing, however, requires that upstream and downstream systems propagate the trace context with a unified protocol so that the trace stays continuous, which poses both technical and coordination challenges.
Managed Service for OpenTelemetry has already achieved end-to-end trace connectivity based on the OpenTelemetry W3C protocol and will progressively cover more protocols and services for a comprehensive and flexible tracing ecosystem, as shown in the following diagram.
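To make the W3C propagation concrete, the following minimal sketch parses and regenerates the `traceparent` HTTP header defined by the W3C Trace Context specification (`version-traceid-parentid-flags`). The function and class names are hypothetical and are not part of ARMS or any OpenTelemetry SDK. For simplicity, the sketch accepts only the exact four-field layout, although the specification allows future versions to append fields.

```python
import re
import secrets
from typing import NamedTuple, Optional

# traceparent layout: 2-hex version, 32-hex trace ID, 16-hex parent ID, 2-hex flags
TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Extract the trace context from an incoming traceparent header value."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"
```

Because every hop keeps the trace ID and replaces only the span ID, spans reported by the terminal, gateway, and backend can be joined into one trace.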
Compared to new applications, existing applications face greater challenges in unifying end-to-end protocols, especially during technology stack transitions (for example, migrating from SkyWalking to OpenTelemetry). Ensuring continuous operation of the existing monitoring system while validating the new one and enabling coexistence of two different tracing systems is a major hurdle for upgrading or connecting existing applications.
To address this, the self-developed ARMS agent has undergone extensive compatibility optimizations, achieving dual-agent coexistence to ensure both systems operate correctly and stably until migration is complete, as illustrated below.
The ARMS agent supports multi-protocol recognition and transmission. In scenarios where upstream and downstream systems cannot easily change, the agent can act as a protocol mediator. For example, suppose that upstream application A uses the Jaeger protocol and downstream application B uses the Zipkin B3 protocol. The ARMS agent receives the Jaeger trace context and propagates it downstream in both the Jaeger and Zipkin B3 formats. This keeps the trace continuous and connected between systems that use different protocols.
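The mediation step can be sketched as follows. This is not the ARMS agent's implementation; the function name is hypothetical, and the sketch only relies on the publicly documented header formats: Jaeger's `uber-trace-id` (`trace-id:span-id:parent-span-id:flags`) and the Zipkin B3 multi-header format (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`). It assumes the header value is already URL-decoded and keeps only the sampled bit of the Jaeger flags.

```python
from typing import Dict, Optional


def jaeger_to_dual_headers(uber_trace_id: str, new_span_id: str) -> Optional[Dict[str, str]]:
    """Translate an incoming Jaeger context into outgoing Jaeger and B3 headers.

    new_span_id is the 16-hex ID of the span created for the outgoing call;
    the incoming span becomes its parent.
    """
    parts = uber_trace_id.split(":")
    if len(parts) != 4:
        return None
    trace_id, span_id, _incoming_parent, flags = parts
    if len(trace_id) > 32 or len(span_id) > 16:
        return None
    try:
        int(trace_id, 16)
        int(span_id, 16)
        sampled = int(flags, 16) & 0x01  # bit 0 of the Jaeger flags is "sampled"
    except ValueError:
        return None
    # Jaeger allows shortened IDs; B3 requires 16 or 32 lower-hex characters.
    trace_id = trace_id.lower().rjust(32 if len(trace_id) > 16 else 16, "0")
    parent_id = span_id.lower().rjust(16, "0")
    return {
        # Jaeger format for downstream services that still read Jaeger headers
        "uber-trace-id": f"{trace_id}:{new_span_id}:{parent_id}:{sampled:x}",
        # Zipkin B3 multi-header format for downstream application B
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": new_span_id,
        "X-B3-ParentSpanId": parent_id,
        "X-B3-Sampled": str(sampled),
    }
```

Sending both header sets lets application B read the B3 headers while any downstream service that still reads Jaeger headers continues the same trace.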