In distributed systems, a single user request often passes through dozens of microservices before completing. When latency spikes or errors occur, pinpointing the root cause across these services requires end-to-end visibility into the request path.
Managed Service for OpenTelemetry, a component of Application Real-Time Monitoring Service (ARMS), provides distributed tracing for microservice architectures. It collects trace data from your applications, aggregates it in real time, and generates trace details, performance metrics, and service topology maps so you can quickly identify and resolve performance bottlenecks.
Core concepts
The following concepts are central to distributed tracing:
Trace: A record of a single request as it travels through multiple services. Each trace has a unique ID that ties together all the operations involved in fulfilling that request.
Span: A single operation within a trace. Each span captures the operation name, start time, duration, and the parent span that triggered it, if any (the root span has no parent). A trace consists of one or more spans arranged in a parent-child hierarchy.
Topology: A visual map of how your services call each other, generated automatically from trace data.
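To make these three concepts concrete, the following minimal sketch models spans as plain records, rebuilds the parent-child hierarchy of one trace, and derives the service-to-service edges that a topology map would draw. The `Span` fields, the helper names, and the sample data are assumptions made for illustration. They are not the data model or API of Managed Service for OpenTelemetry.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Illustrative span record; field names are assumptions, not a real schema."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    duration_ms: int


def build_children(spans):
    """Index spans by parent ID so the hierarchy can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return children


def render_trace(spans):
    """Return the trace as indented lines, one line per span."""
    children = build_children(spans)
    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(f"{indent}{span.service}:{span.operation} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


def service_edges(spans):
    """Derive caller -> callee service pairs from parent-child span links."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.parent_id)
        if parent is not None and parent.service != span.service:
            edges.add((parent.service, span.service))
    return sorted(edges)


# Hypothetical spans that all belong to one trace ("t1").
SPANS = [
    Span("t1", "a", None, "gateway", "GET /checkout", 0, 120),
    Span("t1", "b", "a", "orders", "create_order", 10, 80),
    Span("t1", "c", "b", "inventory", "reserve", 20, 30),
    Span("t1", "d", "b", "payments", "charge", 55, 30),
]

if __name__ == "__main__":
    print("\n".join(render_trace(SPANS)))
    print(service_edges(SPANS))
```

The shared `trace_id` is what ties the four spans together, the `parent_id` links give the hierarchy, and the cross-service links are the raw material for a topology map.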
Architecture
The following diagram shows how Managed Service for OpenTelemetry collects and processes trace data.
Data flow
Instrument your application
Integrate the client SDK into your application to capture service call data. Managed Service for OpenTelemetry provides client SDKs for multiple programming languages and is compatible with open source tracing libraries such as Jaeger and Zipkin. The SDKs support the OpenTracing standard.
Process and visualize
After the SDK reports data, the service aggregates and persists it in real time. Three types of monitoring data are generated:
| Data type | Description |
|---|---|
| Trace details | The full span-by-span breakdown of each request, used for root-cause analysis. |
| Performance overview | Latency, throughput, and error rate metrics across your services. |
| Real-time topology | A live map of service dependencies and call relationships. |
Use this data to troubleshoot slow requests, identify failing services, and understand call patterns.
Forward to downstream services
Send trace data to other Alibaba Cloud services for further analysis:
| Service | Use case |
|---|---|
| Simple Log Service | Correlate traces with application logs and set up alerting rules. |
| MaxCompute | Run large-scale offline analysis on historical trace data. |
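The instrumentation step can be pictured with a small, standard-library-only sketch of what a tracing SDK does conceptually: it wraps each operation, times it, and records which span was active when the operation started. Everything here (`start_span`, `REPORTED`, `handle_checkout`) is hypothetical and kept in memory. It is not the Managed Service for OpenTelemetry SDK, which would report spans to the service instead of appending them to a list.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Hypothetical stand-ins: the active span ID and an in-memory "report" buffer.
_current_span_id = contextvars.ContextVar("current_span_id", default=None)
REPORTED = []


@contextmanager
def start_span(operation):
    """Record one operation and link it to whichever span is currently active."""
    span_id = uuid.uuid4().hex[:16]
    parent_id = _current_span_id.get()
    token = _current_span_id.set(span_id)
    started = time.perf_counter()
    error = False
    try:
        yield span_id
    except Exception:
        error = True
        raise
    finally:
        # Restore the previous active span, then record the finished span.
        _current_span_id.reset(token)
        REPORTED.append({
            "span_id": span_id,
            "parent_id": parent_id,
            "operation": operation,
            "duration_ms": (time.perf_counter() - started) * 1000,
            "error": error,
        })


def handle_checkout():
    """Hypothetical request handler with two nested operations."""
    with start_span("GET /checkout"):
        with start_span("create_order"):
            pass
        with start_span("charge"):
            pass


if __name__ == "__main__":
    handle_checkout()
    for span in REPORTED:
        print(span["operation"], "parent:", span["parent_id"])
```

Spans are recorded when they finish, so the two child operations appear before the root span. A backend reassembles the hierarchy from the `parent_id` links rather than from arrival order.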
Capabilities
| Goal | How it helps |
|---|---|
| Trace requests across services | Collects all spans from distributed microservices and assembles them into end-to-end traces for query and root-cause analysis. |
| Monitor application performance | Captures request-level data and analyzes service and resource performance in real time, surfacing latency, error rates, and throughput. |
| Map service dependencies | Automatically discovers how your microservices and related PaaS products call each other, and renders a real-time topology. |
| Integrate with open source libraries | Works with Jaeger, Zipkin, and other open source tracing libraries built on the OpenTracing standard. |
| Stream data to analysis platforms | Sends trace data to Simple Log Service for log correlation and alerting, and to MaxCompute for offline analysis. |
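As a rough illustration of the performance metrics named in the table, the sketch below aggregates span records into per-service latency, throughput, and error rate over a fixed time window. The record fields, the `summarize` function, and the sample values are assumptions chosen for the example. The service's actual aggregation logic is not described in this document.

```python
from collections import defaultdict


def summarize(spans, window_seconds):
    """Aggregate span records into per-service metrics for one time window."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["service"]].append(span)

    summary = {}
    for service, items in grouped.items():
        durations = [item["duration_ms"] for item in items]
        errors = sum(1 for item in items if item["error"])
        summary[service] = {
            "avg_latency_ms": sum(durations) / len(durations),
            "max_latency_ms": max(durations),
            "throughput_rps": len(items) / window_seconds,
            "error_rate": errors / len(items),
        }
    return summary


# Hypothetical spans observed during a 2-second window.
SAMPLE = [
    {"service": "orders", "duration_ms": 80, "error": False},
    {"service": "orders", "duration_ms": 120, "error": True},
    {"service": "orders", "duration_ms": 100, "error": False},
    {"service": "orders", "duration_ms": 100, "error": False},
    {"service": "payments", "duration_ms": 30, "error": False},
    {"service": "payments", "duration_ms": 50, "error": False},
]

if __name__ == "__main__":
    for service, metrics in sorted(summarize(SAMPLE, window_seconds=2).items()):
        print(service, metrics)
```

Averages hide tail latency, so a production system would usually also track percentiles. The sketch keeps to an average and a maximum for brevity.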
Next steps
Get started by instrumenting your first application with the Managed Service for OpenTelemetry SDK.
Explore the trace query interface to search, filter, and analyze distributed traces.
Set up alerting rules in Simple Log Service to get notified when trace metrics exceed thresholds.