
Alibaba Cloud Service Mesh:Observability overview

Last Updated: Mar 10, 2026

Service Mesh (ASM) collects three types of telemetry data from sidecar proxies and gateway pods on the data plane: logs, metrics, and traces. These map to the four golden signals of monitoring -- latency, traffic, errors, and saturation -- giving you real-time visibility into how services communicate, where bottlenecks form, and when failures occur.

Feature overview diagram

ASM uses Istio Telemetry CustomResourceDefinitions (CRDs) to provide a centralized way to configure how telemetry data is generated and collected. Define collection rules separately for sidecar proxies and gateway pods, and route the data to managed cloud services or self-managed backends.

Observability architecture

Telemetry CRD configuration model

Telemetry CRDs follow a three-tier inheritance model. Each tier overrides the settings from the tier above it:

  1. Mesh-wide -- A single Telemetry CRD named default in the istio-system root namespace. This sets the baseline for all workloads.

  2. Namespace-level -- One Telemetry CRD named default (with an empty workload selector) per namespace. Fields specified here fully override the mesh-wide configuration for that namespace.

  3. Workload-level -- An additional Telemetry CRD with a workload selector in the target namespace. This overrides namespace-level settings for the selected workloads only.

Configuration rules:

  • ASM allows exactly one Telemetry CRD named default in the istio-system namespace.

  • Each namespace can have only one Telemetry CRD with an empty workload selector, also named default.

  • If two Telemetry CRDs select the same workload, the behavior is undefined. Avoid duplicate selectors.

  • If no metrics are configured in the istio-system Telemetry CRD, no metrics are generated.
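The tiers above can be sketched as two Telemetry resources. This is a minimal illustration; the workload-level resource name, namespace, and label are hypothetical:

```yaml
# Tier 1 -- mesh-wide baseline: the single "default" Telemetry in istio-system.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy
---
# Tier 3 -- workload-level override: a selector narrows the scope to matching
# pods in one namespace. Name, namespace, and label below are hypothetical.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: ratings-logging
  namespace: demo
spec:
  selector:
    matchLabels:
      app: ratings
  accessLogging:
  - disabled: true    # turn access logging off for the selected workloads only
```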

Logs

Access logs record every request that passes through sidecar proxies and gateways. Print logs to stdout or stderr, then use a logging agent to aggregate them into a central system.

ASM provides two log capabilities: log filtering (control which requests generate log entries) and log formatting (control which fields appear in each entry).

Customize the log format

Customize the fields in Envoy access logs through the ASM console or the Telemetry CRD. For step-by-step instructions, see Customize access logs on the data plane.

Log format configuration: fields and preview

The following YAML is equivalent to the GUI configuration shown above:

envoyFileAccessLog:
    logFormat:
      text: '{"bytes_received":"%BYTES_RECEIVED%","bytes_sent":"%BYTES_SENT%","downstream_local_address":"%DOWNSTREAM_LOCAL_ADDRESS%","downstream_remote_address":"%DOWNSTREAM_REMOTE_ADDRESS%","duration":"%DURATION%","istio_policy_status":"%DYNAMIC_METADATA(istio.mixer:status)%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","protocol":"%PROTOCOL%","request_id":"%REQ(X-REQUEST-ID)%","requested_server_name":"%REQUESTED_SERVER_NAME%","response_code":"%RESPONSE_CODE%","response_flags":"%RESPONSE_FLAGS%","route_name":"%ROUTE_NAME%","start_time":"%START_TIME%","trace_id":"%REQ(X-B3-TRACEID)%","upstream_cluster":"%UPSTREAM_CLUSTER%","upstream_host":"%UPSTREAM_HOST%","upstream_local_address":"%UPSTREAM_LOCAL_ADDRESS%","upstream_service_time":"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%","upstream_transport_failure_reason":"%UPSTREAM_TRANSPORT_FAILURE_REASON%","user_agent":"%REQ(USER-AGENT)%","x_forwarded_for":"%REQ(X-FORWARDED-FOR)%","authority_for":"%REQ(:AUTHORITY)%","upstream_response_time":"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%","xff":"%REQ(X-FORWARDED-FOR)%","app_service_name":"%UPSTREAM_CLUSTER%"}'
    path: /dev/stdout

To filter logs -- for example, to log only responses with status codes 400 and above:

accessLogging:
- disabled: false
  filter:
    expression: response.code >= 400
  providers:
  - name: envoy
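For context, the accessLogging fragment above sits under spec in a Telemetry resource. A minimal mesh-wide sketch combining the provider and the filter:

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  accessLogging:
  - disabled: false
    filter:
      expression: response.code >= 400   # log only 4xx/5xx responses
    providers:
    - name: envoy
```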

Collect data plane logs

Container Service for Kubernetes (ACK) integrates with Simple Log Service (SLS) to collect access logs from sidecar proxies on the data plane. Configure collection rules to control how logs are stored and how long they are retained. For details, see Use Simple Log Service to collect access logs on the data plane.

Data plane log collection settings

Collect control plane logs and set up alerting

The ASM control plane pushes configurations to sidecar proxies and ingress gateways on the data plane. If a configuration conflict occurs, the affected proxy or gateway cannot receive the update. It continues running on its last known good configuration, but will fail if the pod restarts.

Enable control plane log collection and log-based alerting to detect configuration push failures early.

Metrics

Istio exposes metrics from Envoy proxies in Prometheus format and uses Prometheus to collect and store them. These metrics cover the four golden signals of monitoring -- latency, traffic, errors, and saturation -- and support use cases such as real-time dashboards, anomaly detection, and auto scaling.

Configure metric generation rules

Enable data plane metrics to generate operational data from gateways and sidecar proxies. Send these metrics to Managed Service for Prometheus to power monitoring dashboards, or to a self-managed Prometheus instance.

Configure custom metrics through the ASM console or the Telemetry CRD. For details, see Create custom metrics in ASM.

Metric generation rule configuration

The following YAML is equivalent to the GUI configuration shown above:

metrics:
- overrides:
  - disabled: true
    match:
      metric: ALL_METRICS
      mode: CLIENT
  - disabled: false
    match:
      metric: ALL_METRICS
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: REQUEST_COUNT
      mode: CLIENT
  - disabled: false
    match:
      metric: REQUEST_COUNT
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: REQUEST_DURATION
      mode: CLIENT
  - disabled: false
    match:
      metric: REQUEST_DURATION
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: REQUEST_SIZE
      mode: CLIENT
  - disabled: false
    match:
      metric: REQUEST_SIZE
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: RESPONSE_SIZE
      mode: CLIENT
  - disabled: false
    match:
      metric: RESPONSE_SIZE
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: GRPC_REQUEST_MESSAGES
      mode: CLIENT
  - disabled: false
    match:
      metric: GRPC_REQUEST_MESSAGES
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: GRPC_RESPONSE_MESSAGES
      mode: CLIENT
  - disabled: false
    match:
      metric: GRPC_RESPONSE_MESSAGES
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: TCP_SENT_BYTES
      mode: CLIENT
  - disabled: false
    match:
      metric: TCP_SENT_BYTES
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: TCP_RECEIVED_BYTES
      mode: CLIENT
  - disabled: false
    match:
      metric: TCP_RECEIVED_BYTES
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: TCP_OPENED_CONNECTIONS
      mode: CLIENT
  - disabled: false
    match:
      metric: TCP_OPENED_CONNECTIONS
      mode: SERVER
    tagOverrides: {}
  - disabled: true
    match:
      metric: TCP_CLOSED_CONNECTIONS
      mode: CLIENT
  - disabled: false
    match:
      metric: TCP_CLOSED_CONNECTIONS
      mode: SERVER
    tagOverrides: {}
  providers:
  - name: prometheus
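The empty tagOverrides maps above are where custom dimensions go. As a sketch (the tag name request_host is hypothetical; request.host is a standard Istio attribute), a host label could be added to server-side request counts like this:

```yaml
metrics:
- overrides:
  - match:
      metric: REQUEST_COUNT
      mode: SERVER
    tagOverrides:
      request_host:            # hypothetical custom dimension name
        value: request.host    # CEL expression evaluated per request
  providers:
  - name: prometheus
```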

Metric considerations

  • Managed Service for Prometheus is a paid service. When enabling it for the first time, scope the metrics to what your business actually needs. Monitoring too many metrics incurs unnecessary cost. If you previously enabled metrics, your earlier settings are preserved. To monitor a gateway, enable client-side metrics.

  • Mesh Topology depends on specific metrics. Disabling certain metrics affects Mesh Topology:

    • Server-side REQUEST_COUNT disabled -- HTTP and gRPC service topology cannot be generated.
    • Server-side TCP_SENT_BYTES disabled -- TCP service topology cannot be generated.
    • Server-side REQUEST_SIZE or REQUEST_DURATION, or client-side REQUEST_SIZE disabled -- some node monitoring data may not display.

Configure metric collection

After enabling Managed Service for Prometheus, collect metrics for storage and analysis. For integration steps, see Integrate Managed Service for Prometheus to monitor ASM instances.

Collection interval. The default interval is 15 seconds. This may be too frequent for production workloads. Adjust it in the Application Real-Time Monitoring Service (ARMS) console based on your requirements. See Configure data collection rules.

Histogram metrics. Metrics such as istio_request_duration_milliseconds_bucket, istio_request_bytes_bucket, and istio_response_bytes_bucket generate large volumes of data and incur ongoing custom metric costs. To reduce costs, discard these metrics in the ARMS console. See Configure metrics.

Self-managed Prometheus. Deploy your own Prometheus instance to monitor ASM. See Monitor ASM instances by using a self-managed Prometheus instance.
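For a self-managed Prometheus instance, the histogram metrics mentioned above can be dropped at scrape time with metric_relabel_configs. The job name below is a placeholder:

```yaml
scrape_configs:
- job_name: istio-mesh           # placeholder job name
  metric_relabel_configs:
  # Drop the high-cardinality Istio histogram bucket series.
  - source_labels: [__name__]
    regex: istio_request_duration_milliseconds_bucket|istio_request_bytes_bucket|istio_response_bytes_bucket
    action: drop
```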

View collected metrics on a Grafana dashboard:

Grafana dashboard showing ASM metrics

Merge Istio metrics with application metrics

Annotation-based Prometheus scraping targets only one metrics endpoint per pod. If your application already exposes its own Prometheus metrics, enable metric merging so that sidecar proxies serve both Istio and application metrics from a single endpoint (:15020/stats/prometheus).

When enabled, ASM adds prometheus.io annotations to all pods on the data plane. If these annotations already exist, they are overwritten. Prometheus then scrapes the merged metrics from the unified endpoint.
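With merging enabled, the annotations added to each pod typically look like the following (standard prometheus.io annotation keys pointing at the unified sidecar endpoint):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "15020"
    prometheus.io/path: /stats/prometheus
```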

For configuration steps, see Merge Istio metrics with application metrics.

Mesh Topology

Mesh Topology provides a visual map of service-to-service communication in your mesh. Use it to identify dependencies, trace traffic flows, and spot anomalies across your application.

Mesh Topology view

For setup instructions, see Enable Mesh Topology to improve observability.

Service level objectives (SLOs)

A service level indicator (SLI) measures a specific dimension of service health. A service level objective (SLO) sets a target or target range for one or more SLIs. SLOs give application developers, platform operators, and operations teams a shared benchmark for measuring and continuously improving service quality.

SLO examples:

  • Average queries per second (QPS) > 100,000/s

  • 99th percentile latency < 500 ms

  • 99th percentile bandwidth, evaluated per minute, > 200 MB/s

Supported SLI types:

  • Availability (plugin type: availability) -- the proportion of requests that receive a successful response. Failure criteria: HTTP status code 429 or 5XX.

  • Latency (plugin type: latency) -- the time to return a response. Failure criteria: response time exceeds the configured threshold.

SLO configuration

After you configure SLOs in ASM, a Prometheus rule is automatically generated. Import this rule into your Prometheus instance for the SLOs to take effect. The Alertmanager component collects and routes alerts to your specified contacts.
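As an illustration only -- not the rule ASM actually generates -- an availability-style alert over the standard istio_requests_total metric might look like the following (the group and alert names are hypothetical):

```yaml
groups:
- name: asm-slo-example          # hypothetical group name
  rules:
  - alert: AvailabilitySLOBreached   # hypothetical alert name
    expr: |
      # Ratio of failed requests (429 or 5xx) to all requests over 5 minutes.
      sum(rate(istio_requests_total{response_code=~"429|5.."}[5m]))
        / sum(rate(istio_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: warning
```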

Alertmanager page showing SLO alerts

For full SLO configuration, see SLO management.

Distributed tracing

Distributed tracing tracks requests as they flow through multiple services. Each unit of work is a span. Spans can be nested, and a collection of related spans forms a trace -- representing the full lifecycle of a single request.

ASM supports distributed tracing through tools such as Jaeger and Zipkin.

Header propagation requirement

Istio proxies automatically generate spans, but applications must propagate the following HTTP headers from inbound to outbound requests to correlate spans into a complete trace:

  • x-request-id

  • x-b3-traceid

  • x-b3-spanid

  • x-b3-parentspanid

  • x-b3-sampled

  • x-b3-flags

  • x-ot-span-context
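A minimal, framework-agnostic sketch of the propagation step in application code, assuming inbound headers arrive as a plain dict (the function name is illustrative; the header names come from the list above):

```python
# Istio trace headers to copy from each inbound request to outbound requests.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
]

def propagate_trace_headers(inbound_headers: dict) -> dict:
    """Return the subset of inbound headers to attach to outbound requests.

    HTTP header names are case-insensitive, so match on lowercase keys.
    """
    lowered = {k.lower(): v for k, v in inbound_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}
```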

Configure tracing data generation rules

Configure tracing parameters -- such as sampling rate and custom tags -- through the ASM console or the Telemetry CRD.

Tracing data generation rule configuration

The following YAML is equivalent to the GUI configuration shown above:

tracing:
- customTags:
    mytag1:
      literal:
        value: fixedvalue
    mytag2:
      header:
        defaultValue: value1
        name: myheader1
    mytag3:
      environment:
        defaultValue: value1
        name: myenv1
  providers:
  - name: zipkin
  randomSamplingPercentage: 90

Configure tracing data collection

Send tracing data to a managed cloud service or to a self-managed backend such as Jaeger or Zipkin.

See also