
Alibaba Cloud Service Mesh: Overview of observability

Last Updated: Apr 03, 2024

In Service Mesh (ASM), different services may require different observability data. Therefore, you need to define and standardize separate collection rules for the observability data of sidecar proxies and gateway pods. Observability lets you monitor the operation and performance of services in real time and detect and resolve service issues and bottlenecks, which helps improve the reliability and performance of your applications. ASM provides a centralized and standardized way to generate and collect the observability data of cloud-native applications. This topic describes the concepts and features related to observability in ASM.

Introduction

Applications built on a microservices architecture are complex, which makes it difficult to ensure that all services run in a stable state, and the performance of some services may degrade because of various issues. Applications must therefore be reliable and resilient, and observability tools are needed to monitor the status of applications and infrastructure at runtime. With runtime information, you can detect failures and perform in-depth debugging when unexpected situations occur. This helps reduce the mean time to recovery and mitigate the impact on your business.

Observability relies on metrics at various levels, such as the application, network, and infrastructure (for example, databases and storage). With these metrics, you can gain a thorough understanding of an unexpected issue when it occurs. ASM effectively facilitates the collection of application metrics. In practice, you must pay close attention to the stability of your application and know its real-time running status so that you can quickly detect issues and take appropriate measures to maintain its availability.

The sidecar proxies on the data plane are deployed on service request paths. You can monitor the running status of the corresponding services and service mesh during runtime by analyzing the observability data of the sidecar proxies.

(Figure: overview of the observability feature)

The observability feature of ASM involves configuring rules for generating and collecting observability data such as logs, metrics, and traces. In addition, the feature must provide methods for delivering the observability data to cloud-based services or self-managed services. To meet different requirements, the feature must support separate custom collection configurations for sidecar proxies and gateway pods. ASM provides a centralized and standardized way to generate and collect the observability data of cloud-native applications, which helps improve their observability.

(Figure: architecture of the observability feature)

Best practices

Telemetry CustomResourceDefinitions (CRDs) can be used to create multiple resources in multiple namespaces. However, if you define these resources arbitrarily, conflicts and unexpected results may occur. The following list describes best practices based on real-world experience, and a minimal example follows the list:

  • Only one Telemetry CRD can exist in the istio-system root namespace; you cannot define multiple Telemetry CRDs there. ASM enforces this best practice by allowing you to define only one Telemetry CRD, named default, in the istio-system namespace.

  • In each namespace, you can define only one Telemetry CRD whose workload selector is left empty and whose name is default.

  • To override the configuration for specific workloads, define a workload selector in a new Telemetry CRD in the desired namespace to select those workloads.

  • If two Telemetry CRDs have the same workload selector, it is uncertain which of the two CRDs takes effect.

  • If metrics are not configured in the Telemetry CRD in the istio-system namespace, no metrics are generated.
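
The following sample is a minimal sketch of how these practices can be combined: a mesh-wide default Telemetry CRD named default in the istio-system namespace, plus a workload-specific override that uses a workload selector. The demo namespace, the app: mygateway label, and the resource name mygateway-logging are hypothetical examples.

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default            # mesh-wide default: only one such CRD in istio-system
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy
---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mygateway-logging  # hypothetical name
  namespace: demo          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: mygateway       # overrides the default only for the selected workloads
  accessLogging:
  - disabled: true
    providers:
    - name: envoy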

Logs

In ASM, log collection is an important means of observing services. To manage and retrieve logs in a unified way, you must aggregate the logs of all services. To do this, print the logs of each service to stdout or stderr and use a logging agent to collect them into a central logging system. ASM provides log filtering and log formatting features. You can filter logs and configure log formats as required to better retrieve and analyze logs.

Configure log format rules

The log formats of different services may be different. Therefore, you need to configure generation rules to control how logs are generated. After you add a Kubernetes cluster to an ASM instance, Envoy proxies that are deployed on the data plane of the ASM instance can print all access logs of the cluster. ASM allows you to customize the fields of access logs printed by Envoy proxies.

Based on the Telemetry CRD, ASM provides a graphical interface, as shown in the following figures, to simplify the configuration of log formats. For more information, see Customize access logs on the data plane.

(Figure: log format rule configuration 1)

(Figure: log format rule configuration 2)

The following sample code is equivalent to the log generation configuration on the preceding graphical interface.

envoyFileAccessLog:
    logFormat:
      text: '{"bytes_received":"%BYTES_RECEIVED%","bytes_sent":"%BYTES_SENT%","downstream_local_address":"%DOWNSTREAM_LOCAL_ADDRESS%","downstream_remote_address":"%DOWNSTREAM_REMOTE_ADDRESS%","duration":"%DURATION%","istio_policy_status":"%DYNAMIC_METADATA(istio.mixer:status)%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","protocol":"%PROTOCOL%","request_id":"%REQ(X-REQUEST-ID)%","requested_server_name":"%REQUESTED_SERVER_NAME%","response_code":"%RESPONSE_CODE%","response_flags":"%RESPONSE_FLAGS%","route_name":"%ROUTE_NAME%","start_time":"%START_TIME%","trace_id":"%REQ(X-B3-TRACEID)%","upstream_cluster":"%UPSTREAM_CLUSTER%","upstream_host":"%UPSTREAM_HOST%","upstream_local_address":"%UPSTREAM_LOCAL_ADDRESS%","upstream_service_time":"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%","upstream_transport_failure_reason":"%UPSTREAM_TRANSPORT_FAILURE_REASON%","user_agent":"%REQ(USER-AGENT)%","x_forwarded_for":"%REQ(X-FORWARDED-FOR)%","authority_for":"%REQ(:AUTHORITY)%","upstream_response_time":"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%","xff":"%REQ(X-FORWARDED-FOR)%","app_service_name":"%UPSTREAM_CLUSTER%"}'
    path: /dev/stdout

The following sample code is equivalent to the corresponding filter condition configured on the preceding graphical interface.

  accessLogging:
  - disabled: false
    filter:
      expression: response.code >= 400
    providers:
    - name: envoy

Configure data plane log collection

When you collect logs from the data plane to Simple Log Service, you must configure log collection rules to control the collection method and the retention period of logs. Container Service for Kubernetes (ACK) integrates with Simple Log Service, so you can collect the access logs of clusters on the data plane of an ASM instance. For more information, see Use Simple Log Service to collect access logs on the data plane.

(Figure: data plane log collection settings)

Collect control plane logs and configure alerts

ASM allows you to collect control-plane logs and send alert notifications based on the log data. For example, you can collect logs related to configuration pushes from the control plane of an ASM instance to the sidecar proxies on the data plane. One of the main tasks of the components on the ASM control plane is to push configurations to the sidecar proxies and ingress gateways on the data plane. If configuration conflicts occur, the sidecar proxies or ingress gateways cannot receive the configurations. They may continue to run based on the configurations that they previously received, but they are likely to fail if the pods where they reside are restarted. In many real-world cases, sidecar proxies or ingress gateways become unavailable because of improper configurations. Therefore, we recommend that you enable log-based alerting to detect and resolve such issues in a timely manner. For more information, see Enable control-plane log collection and log-based alerting in an ASM instance of a version earlier than 1.17.2.35 or Enable control-plane log collection and log-based alerting in an ASM instance of version 1.17.2.35 or later.

Metrics

Metrics are another important means of observing services in ASM. Metrics describe how requests are processed and how services communicate with each other. Istio uses Prometheus to collect and store metrics. The Envoy proxy of each service generates a large number of metrics, which can be used to monitor the operation and performance of services in real time and in scenarios such as anomaly detection and auto scaling.
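
As an illustration only, and not part of the ASM configuration, the standard Istio metric istio_requests_total can be queried in Prometheus to derive service-level indicators. The following sketch assumes a Prometheus recording rule file; the rule name and grouping labels are examples.

groups:
- name: istio-request-metrics
  rules:
  # Per-service error ratio over the last 5 minutes, derived from the
  # standard Istio metric istio_requests_total.
  - record: service:request_error_ratio:rate5m
    expr: |
      sum by (destination_service) (rate(istio_requests_total{response_code=~"5.."}[5m]))
      /
      sum by (destination_service) (rate(istio_requests_total[5m]))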

Configure metric data generation rules

If you enable data plane metrics, the data plane generates metric data related to the operation status of the gateways and sidecar proxies. You can collect metrics to Managed Service for Prometheus to view monitoring reports. You may be charged for collecting metrics. Alternatively, you can create a self-managed Prometheus instance to collect metrics from the data plane of an ASM instance.

Based on the Telemetry CRD, ASM provides a graphical interface, as shown in the following figure, to simplify the configuration of custom metrics. For more information, see Create custom metrics in ASM.

(Figure: metric generation rule configuration)

The following sample code is equivalent to the configuration of custom metrics on the preceding graphical interface.


  metrics:
  - overrides:
    - disabled: true
      match:
        metric: ALL_METRICS
        mode: CLIENT
    - disabled: false
      match:
        metric: ALL_METRICS
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: REQUEST_COUNT
        mode: CLIENT
    - disabled: false
      match:
        metric: REQUEST_COUNT
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: REQUEST_DURATION
        mode: CLIENT
    - disabled: false
      match:
        metric: REQUEST_DURATION
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: REQUEST_SIZE
        mode: CLIENT
    - disabled: false
      match:
        metric: REQUEST_SIZE
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: RESPONSE_SIZE
        mode: CLIENT
    - disabled: false
      match:
        metric: RESPONSE_SIZE
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: GRPC_REQUEST_MESSAGES
        mode: CLIENT
    - disabled: false
      match:
        metric: GRPC_REQUEST_MESSAGES
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: GRPC_RESPONSE_MESSAGES
        mode: CLIENT
    - disabled: false
      match:
        metric: GRPC_RESPONSE_MESSAGES
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: TCP_SENT_BYTES
        mode: CLIENT
    - disabled: false
      match:
        metric: TCP_SENT_BYTES
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: TCP_RECEIVED_BYTES
        mode: CLIENT
    - disabled: false
      match:
        metric: TCP_RECEIVED_BYTES
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: TCP_OPENED_CONNECTIONS
        mode: CLIENT
    - disabled: false
      match:
        metric: TCP_OPENED_CONNECTIONS
        mode: SERVER
      tagOverrides: {}
    - disabled: true
      match:
        metric: TCP_CLOSED_CONNECTIONS
        mode: CLIENT
    - disabled: false
      match:
        metric: TCP_CLOSED_CONNECTIONS
        mode: SERVER
      tagOverrides: {}
    providers:
    - name: prometheus

Considerations for metrics

  • Managed Service for Prometheus is a paid service. When you enable it for the first time, determine the scope of metrics that you want to observe based on your business requirements, because observing a large number of metrics incurs excessive fees. For example, if you want to monitor a gateway, you must enable the client-side metrics. If you have enabled metrics before, the previous settings are retained when you enable metrics again.

  • The Mesh Topology feature depends on the metrics reported by sidecar proxies. If you enable Mesh Topology, disabling some metrics may affect its normal operation, as described in the following items; a minimal sketch of the required overrides follows this list.

    • If you do not enable the server-side REQUEST_COUNT metric, the topology of HTTP or gRPC services cannot be generated.

    • If you do not enable the server-side TCP_SENT_BYTES metric, the topology of TCP services cannot be generated.

    • If you disable the server-side REQUEST_SIZE and REQUEST_DURATION metrics and the client-side REQUEST_SIZE metric, the monitoring information of some nodes in the topology may not be displayed.
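
As a hedged sketch that reuses the override format shown earlier, a metrics configuration that keeps the server-side metrics required by Mesh Topology enabled might look like the following. It is an illustration, not a complete configuration.

  metrics:
  - overrides:
    # Keep the server-side metrics that Mesh Topology depends on.
    - disabled: false
      match:
        metric: REQUEST_COUNT
        mode: SERVER
    - disabled: false
      match:
        metric: TCP_SENT_BYTES
        mode: SERVER
    providers:
    - name: prometheus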

Configure metric collection

After you enable Managed Service for Prometheus, you can collect metrics to it for storage and analysis. ASM integrates with Managed Service for Prometheus to monitor service meshes. For more information, see Integrate Managed Service for Prometheus to monitor ASM instances.

The metric collection interval has a significant impact on metric collection overheads. A longer interval means a lower data capture frequency, which reduces the overheads incurred by metric processing, storage, and computation. The metric collection interval is set to 15 seconds by default, which may be too small for production scenarios. Set an appropriate metric collection interval based on your business requirements. If you collect metrics by using Managed Service for Prometheus, configure the required parameters in the Application Real-Time Monitoring Service (ARMS) console. For more information, see Configure data collection rules.

Histogram metrics such as istio_request_duration_milliseconds_bucket, istio_request_bytes_bucket, and istio_response_bytes_bucket involve a large amount of data and generate high overheads. To avoid continuous fees incurred by these custom metrics, you can discard them. If you are using Managed Service for Prometheus, go to the ARMS console to configure the metrics. For more information, see Configure metrics.

You can deploy a self-managed Prometheus instance to monitor ASM instances. For more information, see Monitor ASM instances by using a self-managed Prometheus instance.
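
For a self-managed Prometheus instance, the two adjustments described above (a longer collection interval and dropping the heavy histogram bucket series) might be expressed as follows. This is a hedged sketch; the job name and service discovery settings are placeholders that you would adapt to your environment.

scrape_configs:
- job_name: istio-mesh-envoy      # placeholder job name
  scrape_interval: 30s            # longer interval to reduce collection overhead
  kubernetes_sd_configs:
  - role: pod
  metric_relabel_configs:
  # Drop the high-cardinality histogram bucket series mentioned above.
  - source_labels: [__name__]
    regex: 'istio_request_duration_milliseconds_bucket|istio_request_bytes_bucket|istio_response_bytes_bucket'
    action: drop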

As shown in the following figure, you can view metrics on a Grafana dashboard.

(Figure: metric collection configuration)

Merge Istio metrics with application metrics

For an application that is integrated with Prometheus, you can use sidecar proxies to expose application metrics together with Istio metrics. After you enable metric merging, the prometheus.io annotations are added to all pods on the data plane to enable Prometheus scraping; if these annotations already exist, they are overwritten. The sidecar proxies merge Istio metrics with application metrics, and Prometheus can obtain the merged metrics from the :15020/stats/prometheus endpoint. For more information, see Merge Istio metrics with application metrics.
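
The following sketch shows what the prometheus.io annotations described above can look like on a pod template. The Deployment name myapp and its image are hypothetical; in practice, the annotations are written for you when metric merging is enabled.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                                   # hypothetical workload
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "15020"             # sidecar port that serves merged metrics
        prometheus.io/path: "/stats/prometheus" # endpoint with Istio + application metrics
    spec:
      containers:
      - name: myapp
        image: myapp:latest                     # hypothetical image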

Mesh Topology

Mesh Topology is a tool that is used to observe ASM instances. This tool provides a GUI that allows you to view related services and configurations. The following figure shows the service topology of an application. For more information, see Enable Mesh Topology to improve observability.

(Figure: Mesh Topology display)

SLO

A service level indicator (SLI) is a metric that measures service health. A service level objective (SLO) is an objective or a range of objectives that a service needs to achieve. An SLO consists of one or more SLIs.

SLOs provide a formal way to describe, measure, and monitor the performance, quality, and reliability of microservice-oriented applications. SLOs serve as a shared quality benchmark for application developers, platform operators, and O&M personnel, who can use them as a reference to measure and continuously improve service quality. An SLO helps describe service health more accurately.

Examples of SLOs:

  • Average queries per second (QPS) > 100,000/s

  • Latency of 99% access requests < 500 ms

  • Bandwidth per minute for 99% access requests > 200 MB/s

ASM provides out-of-the-box monitoring and alerting capabilities based on SLOs. You can monitor the performance metrics of calls between application services, such as the latency and error rate.

ASM supports the following SLI types:

  • Service availability: indicates the proportion of access requests that are successfully responded to. The plug-in type for this SLI type is availability. If the HTTP status code returned to an access request is 429 or 5XX, the access request is not successfully responded to. 5XX means that the status code starts with 5.

  • Latency: indicates the time required for the service to return a response to a request. The plug-in type for this SLI type is latency. You can specify the maximum latency. Responses that are returned later than the specified period of time are considered unqualified.

The following figure shows the GUI that ASM provides for you to define SLO configurations.

(Figure: SLO configuration)

After you configure SLOs for an application in ASM, a Prometheus rule is automatically generated. You can import the generated Prometheus rule to the Prometheus system for the SLOs to take effect. The Alertmanager component collects alerts generated by the Prometheus server and sends the alerts to the specified contacts. The following figure shows the Alertmanager page on which you can see that custom alert information is collected. For more information about SLOs, see SLO management.

(Figure: Alertmanager page)
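
As a hedged illustration of how an SLO can map to a Prometheus rule, the following sketch expresses the example objective "latency of 99% access requests < 500 ms" as an alerting rule. It is not the rule that ASM generates; the alert, label, and annotation names are examples.

groups:
- name: slo-latency
  rules:
  - alert: LatencySLOViolation            # example alert name
    expr: |
      histogram_quantile(0.99,
        sum by (destination_service, le) (rate(istio_request_duration_milliseconds_bucket[5m]))
      ) > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "P99 latency above 500 ms for {{ $labels.destination_service }}"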

Distributed tracing

Distributed tracing can be used to profile and monitor applications, especially those built using a microservices model. It is a key feature for observability in ASM. In the microservices model, a mesh controls service-to-service communication. Therefore, it is necessary to use distributed tracing technology to track and monitor the calls among services. In Istio, you can use distributed tracing tools such as Jaeger and Zipkin to achieve this goal. In distributed tracing, the following two concepts are important: trace and span.

  • Span: the fundamental building block of distributed tracing. A span represents a unit of work or operation. Spans can be nested. Multiple spans form a trace.

  • Trace: represents a complete process for a request - from its initiation to its completion. A trace consists of multiple spans.

Although Istio proxies can automatically send spans, they need some hints to tie together the entire trace. Applications need to propagate appropriate HTTP headers so that the spans sent by Istio proxies can be correctly correlated into a single trace. To do this, applications need to collect the following headers and propagate them from inbound requests to all outbound requests:

  • x-request-id

  • x-b3-traceid

  • x-b3-spanid

  • x-b3-parentspanid

  • x-b3-sampled

  • x-b3-flags

  • x-ot-span-context

Configure tracing data generation rules

Based on the Telemetry CRD, ASM provides a graphical interface, as shown in the following figure, to simplify the configuration of generation rules for distributed tracing data.

(Figure: tracing data generation rule configuration)

The following sample code is equivalent to the configuration of generation rules for distributed tracing data on the preceding graphical interface.

  tracing:
  - customTags:
      mytag1:
        literal:
          value: fixedvalue
      mytag2:
        header:
          defaultValue: value1
          name: myheader1
      mytag3:
        environment:
          defaultValue: value1
          name: myenv1
    providers:
    - name: zipkin
    randomSamplingPercentage: 90

Configure tracing data collection

If you need to send the collected tracing data to a managed cloud service or a self-managed service, you can use the following methods:

References