
Service Mesh: Application-aware Cloud Native Infrastructure

This article gives insight into the author's journey with service mesh as a cloud-native application-aware infrastructure across five dimensions: the cloud-native application network, cloud-native application governance, cloud-native zero trust security, observability and application elasticity of cloud-native applications, and the cloud-native application ecosystem engine.

By Xining Wang

As a core technology for managing communication between application services, service mesh brings safe, reliable, fast, and application-transparent traffic routing, security, and observability to calls between application services.

What Is Service Mesh Technology?

The service mesh technology represented by Istio has been around for four to five years, and Alibaba Cloud was one of the first vendors to offer a managed service mesh cloud service. Service mesh technology decouples service governance capabilities from the application by moving them into sidecar proxies. These sidecar proxies form a mesh data plane through which all traffic between application services can be processed and observed.

The management component that governs how the data plane behaves is called the control plane; it is the brain of the service mesh and provides open APIs so that users can manipulate network behavior more easily. With service mesh, Dev, Ops, and SRE teams can address application service management issues in a unified, declarative manner.

Service Mesh: Application-Aware Cloud-Native Infrastructure

We think of service mesh as a cloud-native application-aware infrastructure. It goes beyond service governance to include cloud-native application-aware features and capabilities in several other dimensions. We divide it into five parts:

  • Cloud-Native Application Network
  • Cloud-Native Application Governance
  • Cloud-Native Zero Trust Security
  • Observability and Application Elasticity of Cloud-Native Applications
  • Cloud-Native Application Ecosystem Engine


Cloud-Native Application Network

Let's look at the cloud-native application network. We divide it into three parts:

  1. Cloud-Native Application Network in Sidecar Mode
  2. Network in L4 and L7 Proxy Decoupling Mode
  3. Priority Routing Based on Regional Availability Zone Information


Networking is a core part of Kubernetes, involving communication between pods, between pods and services, and between services and external systems. Kubernetes clusters use CNI plug-ins to manage container network functions and use kube-proxy to maintain network rules on nodes, for example, load-balancing traffic destined for a Service and sending it to the correct backend pods through the ClusterIP and port.

The container network has become the new interface through which users consume IaaS networks. Take the ACK Terway network as an example: built on Alibaba Cloud VPC Elastic Network Interfaces (ENI), it offers high performance and connects seamlessly with the Alibaba Cloud IaaS network. However, kube-proxy settings are global, with no fine-grained control for each service, and kube-proxy operates exclusively at the network packet level. It cannot meet the needs of modern applications, such as traffic management, tracing, and authentication at the application layer.

Let's look at how the cloud-native application network in the service mesh sidecar proxy mode solves these problems. Service mesh decouples traffic control from the service layer of Kubernetes through sidecar proxies: it injects a proxy into each pod and manages these distributed proxies through the control plane, allowing more fine-grained traffic control between services.


What is the process of transmitting network packets under the sidecar proxy?

The current Istio implementation incurs the overhead of the TCP/IP stack. It uses the Linux kernel's netfilter to intercept traffic by configuring iptables rules and routes the traffic to and from the sidecar proxy according to those rules. A typical path from a client pod to a server pod (even within the same host) must traverse the TCP/IP stack at least three times (outbound, client sidecar proxy to server sidecar proxy, and inbound).
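To make the interception step concrete, the snippet below is a simplified, illustrative excerpt of the istio-init container that Istio injects alongside each application container; the image tag and exact flag values vary across Istio releases, so treat them as assumptions rather than a canonical manifest:

```yaml
# Illustrative sketch: the init container programs iptables so that all pod
# traffic flows through the sidecar before touching the network.
initContainers:
- name: istio-init
  image: istio/proxyv2:1.17.0   # assumed version, for illustration only
  args:
  - istio-iptables
  - "-p"
  - "15001"      # redirect outbound TCP traffic to the sidecar's outbound port
  - "-z"
  - "15006"      # redirect inbound TCP traffic to the sidecar's inbound port
  - "-u"
  - "1337"       # skip traffic from the sidecar's own UID to avoid loops
  - "-m"
  - "REDIRECT"   # use iptables REDIRECT rules in the nat table
  - "-i"
  - "*"          # outbound IP ranges to redirect (all)
  - "-b"
  - "*"          # inbound ports to redirect (all)
  securityContext:
    capabilities:
      add: ["NET_ADMIN", "NET_RAW"]  # required to program iptables
```

Every packet matched by these rules is what pays the extra TCP/IP stack traversals described above.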


To solve this network data path problem, the industry has introduced eBPF to bypass the TCP/IP network stack in the Linux kernel, which accelerates the network, reduces latency, and improves throughput. Of course, eBPF does not replace Envoy's layer-7 proxy capability. In layer-7 traffic management scenarios, the integrated mode of layer-4 eBPF plus a layer-7 Envoy proxy is still needed; only when pure layer-4 traffic is involved can packets bypass the sidecar proxy and go directly to the network interface.


In addition to the traditional sidecar proxy mode, the industry has begun to explore a new data plane model. Its design concept is a layered data plane: layer 4 handles basic processing and is characterized by low resource usage and high efficiency, while layer 7 handles advanced traffic processing and is characterized by rich features but higher resource demands. This way, users can adopt service mesh capabilities in progressive increments based on the range of features they need.

Specifically, the processing on layer 4 mainly includes:

  • Traffic Management: TCP routing
  • Security: Simple authorization policies for layer 4 and mutual TLS
  • Observability: TCP monitoring metrics and logs

The processing on layer 7 mainly includes:

  • Traffic Management: HTTP routing, load balancing, circuit breaking, throttling, fault tolerance, retry, and timeout
  • Security: Fine-grained authorization policies for layer 7
  • Observability: HTTP monitoring metrics, access logs, and Tracing Analysis

What does the network topology of a service mesh look like in the decoupled L4 and L7 proxy mode on the data plane?

On the one hand, the L4 proxy capability moves down into the CNI component. The L4 proxy runs as a DaemonSet, with one instance on each node. This means it is a shared base component that serves all pods running on that node.


On the other hand, the L7 proxy no longer exists in sidecar mode but is decoupled from the application. We call it the Waypoint Proxy: an L7 proxy pod created for each service account.

The configurations of layer 4 and layer 7 proxy are still managed by the service mesh control plane component.

In short, this decoupling pattern enables further decoupling and separation between the service mesh data plane and the application.

The specific implementation in Istio can be divided into three parts:

  1. Waypoint Proxy: The L7 component runs independently of the application and provides higher security. Each identity (service account in Kubernetes) has a dedicated L7 proxy (waypoint proxy) to avoid the complexity and instability introduced by a multi-tenant L7 proxy. In addition, the Kubernetes Gateway CRD is used to trigger and enable the L7 proxy (see the sketch after this list).
  2. ztunnel: Sinks L4 processing down to the CNI level. Traffic from a workload is redirected to the ztunnel, which identifies the workload and selects the correct certificate for processing.
  3. Compatible with Sidecar Mode: Sidecar mode remains a first-class citizen of the mesh. A waypoint proxy can communicate natively with workloads that have a sidecar deployed.
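As a concrete illustration, in early ambient-mesh examples a waypoint proxy is requested by declaring a Kubernetes Gateway resource bound to a service account. The following is a minimal sketch; the gateway class name, annotation, and port follow early Istio ambient-mode examples and may differ across versions:

```yaml
# Minimal sketch: request a dedicated waypoint (L7) proxy for one service account.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: bookinfo-reviews                            # hypothetical name
  annotations:
    istio.io/for-service-account: bookinfo-reviews  # identity this waypoint serves
spec:
  gatewayClassName: istio-waypoint  # tells Istio to provision a waypoint proxy
  listeners:
  - name: mesh
    port: 15008       # HBONE tunnel port used by the ambient data plane
    protocol: HBONE
```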


An Example of Service Mesh as a Cloud-Native Application Network: Priority Routing Based on Regional Availability Zone Information

In this example, we deploy two identical sets of application services to two different availability zones, namely zone A and zone B in this deployment architecture diagram. In normal cases, these application services follow a mechanism that prioritizes traffic routing within the same availability zone. In other words, the productpage service in zone A calls the reviews service in its own availability zone.


When services in the current availability zone become unavailable (for example, the reviews service fails), disaster recovery mode is enabled automatically. In this case, the productpage service in zone A calls the reviews service in zone B instead.
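With Istio-compatible APIs, this behavior can be expressed declaratively. The DestinationRule below is a minimal sketch, assuming a reviews service in the default namespace; note that locality-aware load balancing only takes effect when outlier detection is configured, because failover depends on unhealthy endpoints being ejected:

```yaml
# Minimal sketch: prefer same-zone endpoints, fail over when they become unhealthy.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-locality                       # hypothetical name
spec:
  host: reviews.default.svc.cluster.local      # assumed service host
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true          # prioritize endpoints in the caller's zone
    outlierDetection:          # required so failing endpoints are ejected,
      consecutive5xxErrors: 5  # which triggers failover to the next zone
      interval: 30s
      baseEjectionTime: 60s
```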

This traffic topology shows the call links, annotated with availability zone information, in the mesh topology provided by Alibaba Cloud Service Mesh (ASM). Same-zone traffic routing can be implemented without modifying the application code.


Similarly, service mesh technology automatically implements failover to ensure high availability without modifying the application code. In this topology, you can see that the application services in zone A automatically switch to calling the services in zone B.


Cloud-Native Application Governance

Let's look at cloud-native application governance, which contains many dimensions. Here, we focus on three parts:

  1. Warm-Up Feature
  2. End-to-End Traffic Management
  3. Integration with the Service Framework

Let's look at the difference before and after the warm-up mode is enabled.

Warm-up mode is not enabled:

  • Progressive traffic increase for new Pods is not supported. Whenever a new target Pod joins, the requester sends it a proportional share of traffic. However, this may be undesirable for services that require some warm-up time before they can process all assigned requests; in this case, request timeouts and data loss may occur, degrading the user experience.
  • As a practical example, these problems manifest in JVM-based web applications that use horizontal pod autoscaling. When a new instance starts, it is flooded with a large number of requests and produces constant timeouts while the application warms up. As a result, some requests are lost every time the service scales out.

Warm-up mode is enabled:

  • The warm-up/slow-start mode is also known as progressive traffic increase. Users can configure a time period for the service. When a service instance starts, the requester sends it a portion of the request load and ramps up the volume over the configured period. When the warm-up window expires, the system exits warm-up mode.
  • In warm-up mode, newly added target Pods are protected from being overwhelmed by a flood of requests. New targets warm up over the specified ramp-up period before taking their full share of requests under the load balancing policy.


A Customer Case: Support for the Warm-Up Feature for Graceful Service Launch in Scale-out and Rolling Update Scenarios

You can also see from these two figures that after warm-up mode is enabled, a newly created Pod takes longer to reach a balanced request load, which reflects that the number of requests entering the Pod increases gradually during the warm-up period.

Background:

  • A large amount of traffic during the cold start of a Java application saturates the CPU and causes call exceptions.
  • Application rolling updates cannot achieve a graceful launch through readiness probes alone.

Support for the warm-up feature for graceful service launch:

  • ASM detects the start or shutdown of an instance. When it finds that a machine has started, ASM automatically reduces its weight, and over a certain period of time (the warm-up period) adjusts the weight back to 100% according to the weight adjustment ratio.
  • Completely non-invasive to the application


How is this feature implemented in ASM? Simply put, a single line of configuration realizes the warm-up capability, as shown in the sketch after the following list. This topology is a simplified, de-identified version of the actual case, in which you can see the warm-up feature in use.

  • It enables a smooth start for the business and avoids business jitter.
  • After the warm-up feature is enabled, traffic becomes evenly distributed between v1 and v2 about 1 minute and 10 seconds later than without warm-up, which reflects the traffic ramp-up buffer.
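A minimal sketch of that single-field configuration, using the Istio-compatible DestinationRule API; the host and the warm-up duration here are assumptions for illustration:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-warmup                        # hypothetical name
spec:
  host: my-service.default.svc.cluster.local     # assumed service host
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
      warmupDurationSecs: 120s  # new endpoints ramp up to full load over 2 minutes
```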


End-to-End Traffic Management: Two Scenarios


End-to-End Canary Release:

While the production environment is running normally, perform a canary release on some application services, for example, Application B and Application D in the figure. Without modifying the application logic, the service mesh technology can dynamically route requests to different service versions according to the request source or request header information. For example, if the request header contains tag 1, Application A calls the canary version of Application B; if Application C has no canary version, the system automatically falls back to the original version.
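A minimal sketch of this header-based routing with Istio-compatible APIs, assuming a service-b host whose v1 (baseline) and gray (canary) subsets are defined in a DestinationRule, and a hypothetical x-canary-tag header:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-b-canary                         # hypothetical name
spec:
  hosts:
  - service-b.default.svc.cluster.local          # assumed service host
  http:
  - match:
    - headers:
        x-canary-tag:        # hypothetical routing header
          exact: gray
    route:
    - destination:
        host: service-b.default.svc.cluster.local
        subset: gray         # canary version
  - route:                   # everything else falls back to the baseline
    - destination:
        host: service-b.default.svc.cluster.local
        subset: v1
```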

Multi-Version Development Environment:

Similarly, there is a baseline environment, and other development environments can deploy their own versions of selected application services as needed. According to the request source or request header information, the system dynamically routes requests to services of different versions.

In the entire process, you do not need to modify the application code logic; you only need to tag applications. ASM implements an abstract Traffic Label CRD to simplify tagging traffic and application instances.

Application Governance and Service Framework Integration

Legacy application systems are often still built on service frameworks such as Spring Cloud or Dubbo. ASM supports multi-model service governance and implements interoperability in calls, monitoring, and governance.

Service information from the original registry is registered to the service mesh through MCP over the xDS protocol, and the service mesh control plane uniformly manages rule configuration and distributes governance rules through the xDS protocol.


Zero Trust Security for Cloud-Native Applications

Next, look at Zero Trust security for cloud-native applications, including:

  • Zero Trust Foundation – Workload Identity
  • Zero Trust Carrier – Security Certificates
  • Zero Trust Engine – Policy Enforcement
  • Zero Trust Insight – Visualization and Analysis

Specifically, building an ASM-based Zero Trust security capability system includes the following aspects.

  1. Zero Trust Foundation – Workload Identity: It provides a unified identity for cloud-native workloads. ASM provides a simple, easy-to-use identity definition for each workload under the service mesh and a customizable mechanism to extend the identity system for specific scenarios. The customizable mechanism is compatible with the SPIFFE standard.
  2. Zero Trust Carrier – Security Certificates: ASM provides mechanisms for issuing certificates and managing their lifecycle and rotation. X.509 TLS certificates are used to establish identities, and each proxy uses such a certificate. ASM also rotates certificates and private keys.
  3. Zero Trust Engine – Policy Enforcement: A policy-based trust engine is the key to zero trust. In addition to Istio RBAC authorization policies, ASM provides fine-grained authorization policies based on the Open Policy Agent (OPA) and supports a dry-run mode (see the sketch after this list).
  4. Zero Trust Insight – Visualization and Analysis: ASM provides an observable mechanism to monitor policy execution logs and metrics to verify how each policy is enforced.
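To make items 2 and 3 concrete, the sketch below enforces strict mutual TLS in a namespace and then allows only one workload identity to call the reviews workload; the names and namespace are assumptions for illustration:

```yaml
# Require mTLS for all workloads in the demo namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: demo                         # hypothetical namespace
spec:
  mtls:
    mode: STRICT
---
# Least-privilege authorization: only productpage's identity may call reviews.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-allow-productpage         # hypothetical name
  namespace: demo
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/demo/sa/productpage"]  # SPIFFE-style identity
```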

Using ASM to implement zero trust has the following advantages:

  • Independent Lifecycle of Sidecar Proxy
  • Dynamic Configuration without Restart
  • The centralized control of service mesh reduces the mental burden on developers.
  • The security of the authentication authorization system is strengthened.
  • Centralized Management and Issuance of OPA Authorization Policies
  • Simplified Access to Third-Party Authorization Services and OIDC Identity Authentication

This is a scenario where a platform customer adopts the ASM-based Zero Trust security technology. As you can see, the customer follows the basic principles of the Zero Trust security architecture:

  • Workload Identity
  • Service Certification
  • Service Authorization
  • TLS Encryption
  • OPA Policy
  • Least Privilege Access Policy


Observability and Application Elasticity of Cloud-Native Applications

Next, look at the observability and application elasticity of cloud-native applications, including:

  • Mesh Observability Center
  • Mesh Diagnostics
  • Mesh Topology

Mesh Observability Center: Unified Observability System and Linkage Analysis (3 Dimensions)


  • The first observability capability is log analysis. By collecting and analyzing the access logs of the data plane, especially the ingress gateway logs, you can obtain the traffic profile of service requests, the ratio of status codes, and more, which helps optimize the calls between these services (see the configuration sketch after this list).
  • The second observability capability is distributed tracing. Tracing Analysis provides a wide range of tools that help developers identify performance bottlenecks of distributed applications and improve the efficiency of developing and troubleshooting microservices-based applications. The provided tools can map traces, offer trace topologies, analyze application dependencies, and count requests.
  • The third observability capability is monitoring. It generates a set of service metrics based on the four golden signals (latency, traffic, errors, and saturation) to understand and monitor the behavior of services in the mesh.
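As an illustration of wiring up these signals with Istio-compatible APIs, the sketch below uses the Telemetry resource to enable access logging and set a trace sampling rate mesh-wide; the provider names are assumptions that depend on how the mesh is configured:

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system     # root namespace makes this mesh-wide
spec:
  accessLogging:
  - providers:
    - name: envoy             # built-in Envoy access log provider
  tracing:
  - providers:
    - name: zipkin            # assumed tracing backend registered in mesh config
    randomSamplingPercentage: 10.0   # sample 10% of requests
```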

Mesh Diagnostics: It Detects Potential Problems with ASM Configurations and Provides Improvements

Based on our customer support and hands-on experience with the service mesh technology:

  • There are more than 30 built-in diagnostic rules and best practices.
  • ASM provides a static check of syntax for each Istio resource object.
  • ASM supports dynamic validation of execution semantics for multiple Istio resource objects.
  • ASM provides corresponding improvement measures for diagnosed problems.

Mesh Topology: It Provides Instant Insight into ASM Behavior

In addition to powerful mesh traffic topology visualization, a playback feature is provided to inspect traffic over a selected past period.


Cloud-Native Application Ecosystem Engine

Next, look at the cloud-native application ecosystem engine, including three parts:

  1. Plug-In Market
  2. Capability Center
  3. New Scenario Definition

Out-of-the-Box EnvoyFilter Plug-In Market, WebAssembly Plug-In Full Lifecycle Management

Based on the built-in templates, you only need to perform simple configuration of the corresponding parameters to deploy the corresponding EnvoyFilter plug-in. Through such a mechanism, the data plane becomes a more extensible collection of plug-in capabilities.
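For WebAssembly extensions specifically, Istio-compatible meshes expose a WasmPlugin resource that manages the plug-in lifecycle. The sketch below is a minimal example; the OCI image URL, gateway labels, and plug-in configuration are assumptions for illustration:

```yaml
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: basic-auth                       # hypothetical plug-in name
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway              # apply to the ingress gateway pods
  url: oci://registry.example.com/wasm/basic-auth:v1   # assumed OCI image
  phase: AUTHN                           # run during the authentication phase
  pluginConfig:                          # passed to the Wasm module at startup
    basic_auth_rules:
    - prefix: /productpage
      credentials:
      - "admin:admin"                    # illustrative only
```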

Capability Center: Integration of Service Mesh and Ecosystem

There is a series of application-centric ecosystems in the industry built around the service mesh technology. The fully managed ASM supports the following ecosystems:

  • Blue-Green Release, Progressive Delivery, GitOps, and CI/CD: It supports integration with systems such as Argo Rollouts, Flagger, and Alibaba Cloud DevOps to implement blue-green or canary releases of applications.
  • Compatible with Microservice Frameworks: It supports seamless migration of Spring Boot/Cloud applications to service mesh for unified management and governance.
  • Serverless: Knative runs on the managed ASM to deploy and run Serverless workloads. Knative routes traffic between different versions based on service mesh.
  • AI Serving: Kubeflow Serving (KFServing) is a Google-led community project that supports machine learning model serving on Kubernetes; its next generation was renamed KServe. The project aims to support different machine learning frameworks in cloud-native ways and to implement traffic control, model version updates, and rollbacks based on service mesh.
  • Policy as Code: OPA and the Declarative Policy Expression Language Rego: In addition to using Kubernetes Network Policy for layer-3 network security control, ASM provides workload identity, peer and request authentication capabilities, Istio authorization policies, and finer-grained OPA-based policy control. Generally speaking, building a service mesh-based Zero Trust security capability system covers the Zero Trust foundation (workload identity), the Zero Trust carrier (security certificates), the Zero Trust engine (policy enforcement), and Zero Trust insight (visualization and analysis).

In addition, we are exploring new scenarios driven by service mesh. I will give an AI Serving example here.


This demand comes from actual customers who want to run KServe based on the service mesh technology to implement AI services. KServe runs smoothly on service mesh and implements blue-green and canary deployments of model services, traffic distribution between revisions, and other capabilities. It supports serverless inference workloads with auto scaling, high scalability, and concurrency-based intelligent load routing.
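As a hedged sketch of what this looks like in practice, the KServe InferenceService below rolls out a new model revision while shifting only a slice of traffic to it; the model format, storage URI, and canary percentage are assumptions for illustration:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                     # hypothetical model service
spec:
  predictor:
    canaryTrafficPercent: 10             # send 10% of traffic to the newest revision
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/iris/v2   # assumed model location
```

In KServe's serverless mode, the remaining traffic continues to flow to the previously ready revision through Knative's mesh-based routing.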

The Key Layer of the Cloud-Native Infrastructure: An Important Bridge for the Digital Transformation of Enterprises

As the first fully managed Istio-compatible service mesh product in the industry, ASM has maintained consistency with the community and industry trends in its architecture from the very beginning. The control plane components are hosted on the Alibaba Cloud side and are independent of the user clusters on the data plane side. ASM is customized and implemented based on the community's open-source Istio and provides component capabilities supporting refined traffic management and security management on the managed control plane side. The managed mode decouples the lifecycle management of Istio components from the managed Kubernetes clusters, making the architecture more flexible and improving system scalability.


The managed ASM has become the infrastructure for the unified management of various heterogeneous computing services. It provides unified traffic management capabilities, unified service security capabilities, unified service observability capabilities, and WebAssembly-based unified proxy scalability capabilities to build enterprise-level capabilities.

As a core technology for managing communication between application services, service mesh brings safe, reliable, fast, and application-transparent traffic routing, security, and observability to calls between application services.

The cloud-native application infrastructure supported by service mesh brings important advantages, summarized in the following six aspects:


Advantage 1: Unified Governance of Heterogeneous Services

  • Interoperability and governance of multi-language and multi-framework, dual-mode architecture integrated with traditional microservice systems
  • Refined multi-protocol traffic control and unified management of east-west and north-south traffic
  • Automated service discovery for unified heterogeneous computing infrastructure

Advantage 2: End-to-End Observability

  • Integrated intelligent O&M for logging, monitoring, and tracking
  • Intuitive and easy-to-use visual mesh topology and health identification system based on color identification
  • Built-in best practices and self-service mesh diagnostics

Advantage 3: Zero Trust Security

  • End-to-end mTLS encryption and attribute-based access control (ABAC)
  • OPA declarative policy engine and globally unique workload identity
  • Complete audit history and insight analysis with dashboards

Advantage 4: Soft and Hard Combined Performance Optimization

  • It is the first service mesh platform to accelerate TLS encryption and decryption with Intel Multi-Buffer technology.
  • NFD automatically detects hardware features and adaptively enables support for them (such as the AVX instruction set and QAT hardware acceleration).
  • It is among the first batch to pass the Trusted Cloud certification and performance evaluation for service mesh platforms.

Advantage 5: SLO-Driven Application Elasticity

  • Service level objectives (SLO) policy
  • Automatic scaling of application services based on observability data
  • Automatic switchover and disaster recovery under traffic bursts of multiple clusters

Advantage 6: Out-of-the-Box Extensions & Ecosystem Compatibility

  • Out-of-the-box EnvoyFilter plug-in market and WebAssembly plug-in full lifecycle management
  • Unified integration with Proxyless mode and support for SDK and kernel eBPF
  • Compatible with the Istio ecosystem and support for Serverless/Knative, AI Serving/KServe