Alibaba Cloud Service Mesh (ASM) captures logs, metrics, and distributed traces from every proxy in your mesh, giving you real-time visibility into service health, performance, and request flows. Use this data to detect faults before they affect users, debug request-level issues across services, and reduce mean time to recovery (MTTR).
How ASM collects observability data
In Ambient mode, Waypoint workloads sit in the request path between services. Every Envoy proxy (Waypoint, sidecar, or gateway) and every Ztunnel in your mesh generates telemetry about the traffic it handles. Capturing this data gives you insight into the runtime status of both the application network and the mesh itself. ASM provides a unified model for generating and collecting this data.
The collection workflow has two layers:
Generation rules control what data each proxy produces.
Collection rules send that data to cloud-hosted or self-managed backends.
Separate configurations for mesh proxies and gateway pods support different requirements.
Logs
Access logs record details for every request that passes through a proxy, including response code, request URI, request host, and virtual service route name.
A sample access log entry:
{
"authority_for": "httpbin:8000",
"bytes_received": "0",
"bytes_sent": "0",
"downstream_local_address": "192.168.73.109:8000",
"downstream_remote_address": "10.22.115.101:35004",
"duration": "0",
"istio_policy_status": "-",
"method": "GET",
"path": "/delay/60",
"protocol": "HTTP/1.1",
"request_id": "fa4a0646-f873-9561-9b1f-eb58e439866c",
"requested_server_name": "-",
"response_code": "429",
"response_flags": "UAEX",
"route_name": "default",
"start_time": "2025-06-17T01:36:06.687Z",
"trace_id": "-",
"upstream_cluster": "outbound|8000||httpbin.default.svc.cluster.local;",
"upstream_host": "-",
"upstream_local_address": "-",
"upstream_response_time": "-",
"upstream_service_time": "-",
"upstream_transport_failure_reason": "-",
"user_agent": "curl/8.1.2",
"x_forwarded_for": "-"
}
Customize log formats
Different services often need different log fields. Customize the access log content output by the data plane to match your requirements. For more information, see Customize data plane access logs.
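As an illustration only (the exact procedure is in the linked topic), an Istio-style mesh configuration can select a JSON access log built from Envoy command operators. The field selection below is an assumption, not ASM's default format:

```yaml
# Illustrative Istio-style mesh configuration (assumption; configure
# through ASM as described in "Customize data plane access logs").
meshConfig:
  accessLogEncoding: JSON        # emit one JSON object per request
  accessLogFormat: |
    {
      "method": "%REQ(:METHOD)%",
      "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
      "response_code": "%RESPONSE_CODE%",
      "response_flags": "%RESPONSE_FLAGS%",
      "route_name": "%ROUTE_NAME%",
      "upstream_cluster": "%UPSTREAM_CLUSTER%"
    }
```

Each `%...%` token is an Envoy access log command operator that the proxy substitutes per request.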
Log format rules apply only to Envoy proxies (sidecars, Waypoints, and gateways). They do not affect Ztunnel.
Collect data plane logs
Container Service for Kubernetes (ACK) integrates with Simple Log Service (SLS) to collect access logs from the data plane cluster. Configure collection rules to control the collection method and retention period. For more information, see Use Simple Log Service to collect access logs from the data plane cluster.

Collect control plane logs and set alerts
The ASM control plane pushes mesh rule configurations to Envoy proxies on the data plane. If rule conflicts cause a push failure, proxies continue running with their last known valid configuration. However, if a pod restarts before a successful configuration push, its proxy fails to start.
Misconfigurations are a common cause of unavailable gateways or proxies. Enable control plane log-based alerting to detect and resolve these issues promptly. For more information, see Enable control plane log collection and log-based alerting (for earlier versions) or Enable control plane log collection and log-based alerting (for later versions).
Metrics
Metrics describe request processing, the communication status between services, and other operational details. Use metrics for real-time monitoring, anomaly detection, and auto scaling.
Each Envoy proxy (Waypoint, sidecar, or gateway) and Ztunnel generates metrics. Istio uses Prometheus to collect and store these metrics.
Cost and topology impact
Before enabling metrics, consider the following:
Managed Service for Prometheus is a paid service. Scope metric generation carefully to avoid unexpected costs. To monitor a gateway, enable CLIENT-side metrics.
Mesh Topology depends on specific sidecar-reported metrics. Disabling certain metrics affects topology availability:
Disabling SERVER-side REQUEST_COUNT makes the HTTP/gRPC topology graph unavailable.
Disabling SERVER-side TCP_SENT_BYTES makes the TCP topology graph unavailable.
Disabling SERVER-side REQUEST_SIZE, REQUEST_DURATION, or CLIENT-side REQUEST_SIZE causes some topology node monitoring data to be missing.
Re-enablement: if you re-enable metrics after disabling them, your previous configuration settings are retained.
Enable metric collection
Send metrics to Prometheus for storage and analysis. ASM integrates with Application Real-Time Monitoring Service (ARMS) Prometheus, and also supports self-managed Prometheus instances.
Control which metrics are generated
Enable or disable specific metrics (such as TCP connection time or request size) to retain only what you need. This reduces performance overhead and billing costs. For more information, see Customize metrics in ASM.
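For reference, the open-source Istio Telemetry API expresses this kind of per-metric control as overrides. The sketch below disables CLIENT-side request size metrics; the resource name is hypothetical, and in ASM you would follow the linked topic rather than apply this directly:

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tune-metrics          # hypothetical name
  namespace: istio-system     # applies mesh-wide from the root namespace
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_SIZE  # the metric to turn off
        mode: CLIENT          # CLIENT- or SERVER-side reporting
      disabled: true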
Set the collection interval
The Prometheus scrape interval directly affects collection overhead. A longer interval produces fewer data points, reducing processing, storage, and computation costs. A 30-second interval balances overhead with data completeness.
When using ARMS for collection, the scrape interval is fixed at 30 seconds and cannot be modified. ASM dashboards depend on this interval to function correctly.
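For a self-managed Prometheus instance, where the interval is configurable unlike ARMS, the interval is set in prometheus.yml. A minimal sketch:

```yaml
# prometheus.yml for a self-managed instance (sketch)
global:
  scrape_interval: 30s   # default is 15s; 30s halves the sample volume
```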
Drop high-cardinality metrics
Histogram metrics (istio_request_duration_milliseconds_bucket, istio_request_bytes_bucket, and istio_response_bytes_bucket) are typically dense and generate high overhead. Drop these metrics to reduce ongoing costs.
If you use ARMS Prometheus, these metrics are converged by default and collected only when a bucket changes. Confirm this by querying istio_request_duration_milliseconds_bucket_delta in the ARMS console. If convergence is not active, upgrade the ARMS collection configuration. For more information, see Upgrade ASM metrics and dashboards.
To drop metrics manually, see Configure metrics to drop.
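With a self-managed Prometheus, one common way to drop these histogram series is a metric_relabel_configs rule on the scrape job; the job name below is hypothetical:

```yaml
scrape_configs:
- job_name: istio-mesh        # hypothetical job name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: istio_request_duration_milliseconds_bucket|istio_request_bytes_bucket|istio_response_bytes_bucket
    action: drop              # discard matching series at scrape time
```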
Merge Istio and application metrics
If your services already expose a Prometheus endpoint, enable metrics merging to export both Istio and application metrics through a single proxy endpoint. When enabled, ASM:
Merges application metrics into Istio metrics.
Adds prometheus.io annotations to all data plane pods for Prometheus scraping. Existing annotations are overwritten.
Exposes the merged metrics at the :15020/stats/prometheus endpoint.
For more information, see Merge Istio and application metrics.
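The annotations involved follow the standard prometheus.io convention. After merging is enabled, each data plane pod would carry annotations along these lines (a sketch, not verbatim ASM output):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"              # opt the pod in to scraping
    prometheus.io/port: "15020"               # proxy's merged-metrics port
    prometheus.io/path: "/stats/prometheus"   # merged-metrics endpoint
```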
Mesh topology
Mesh Topology is a service mesh observability tool that provides a visual interface for viewing related services and configurations in the mesh.

For more information, see Enable mesh topology to improve observability.
Distributed tracing
Distributed tracing tracks the path of a request as it flows through multiple microservices, making it a key method for profiling and monitoring applications built with a microservices model. Your services must propagate specific request headers for the mesh to construct complete call chains.
Tracing enables you to:
Locate faults faster: Traces show the exact request path across microservices, including latency and error messages at each hop.
Identify performance bottlenecks: Compare latencies between services to find optimization opportunities.
Visualize dependencies: View service call chains to understand complex inter-service relationships.
In Ambient mode, Waypoints are responsible for reporting tracing data. If a service does not have a Waypoint proxy enabled for its traffic, its tracing data is not reported, even if the service is part of the mesh.
For tracing providers other than OpenTelemetry, the tracing data cannot fully display the application topology, because multiple services might report through the same Waypoint. To view the application topology, use OpenTelemetry for collection or enable Mesh Topology.
Enable tracing
Report tracing data to a self-managed tracing system or to Alibaba Cloud Tracing Analysis. For more information, see Configure ASM to report tracing data.
Propagate tracing headers
ASM automatically adds tracing-related request headers to proxied traffic, but applications must propagate the following headers to maintain complete call chains:
Required for all applications:
x-request-id
traceparent
tracestate
Additional headers for Zipkin:
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
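The propagation itself must happen in application code: copy these headers from each incoming request onto any outbound calls made while handling it. A minimal, framework-agnostic Python sketch; the helper name is our own, and the sample values are illustrative:

```python
# Copy tracing headers from an incoming request to outgoing calls so the
# mesh can stitch the hops into one call chain. Header names are the
# ones listed above; anything else (helper name, sample values) is
# illustrative.

TRACE_HEADERS = [
    "x-request-id",
    "traceparent",
    "tracestate",
    # B3 headers, needed when the mesh reports to Zipkin
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
]

def extract_trace_headers(incoming_headers: dict) -> dict:
    """Return only the tracing headers (case-insensitive match) for
    reuse on outbound requests made while handling this request."""
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

# Example: headers as a web framework might expose them
incoming = {
    "Host": "httpbin:8000",
    "X-Request-Id": "fa4a0646-f873-9561-9b1f-eb58e439866c",
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "User-Agent": "curl/8.1.2",
}

outbound = extract_trace_headers(incoming)
# Pass `outbound` as extra headers to the HTTP client used for
# downstream calls; non-tracing headers such as Host are not copied.
```

The same pattern applies in any language: filter the incoming header map against the required list and attach the result to every outbound request in the same request context.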
Customize observability settings
Customize observability configurations at the global, namespace, or workload level through the ASM console. Configurable settings include log output format, metric dimensions, metric enablement, and trace sampling rate. For more information, see Observability configurations.