By Zehuan Shi
In Kubernetes, a Service groups workloads of the same type and provides load balancing by routing requests randomly to the endpoints behind it. In Istio Ambient Mesh, however, iptables rules intercept traffic into the Ztunnel component in order to provide Layer 4 security: Layer 4 traffic is encrypted before being sent to the peer Ztunnel, which then forwards it to the destination workload. As a result, the Kubernetes iptables rules can no longer perform the load balancing. This article explains how Istio Ambient Mesh load balances Layer 4 traffic, based on analysis of iptables rules, network packet captures, and the Ztunnel code.
First, let's briefly discuss the load balancing mechanism in Kubernetes. Each Service in Kubernetes is assigned a unique domain name and a corresponding Cluster IP. When a pod in the cluster accesses a Service, the packet arriving at the host hits the following iptables rule in the PREROUTING chain of the nat table and is sent into the KUBE-SERVICES chain:
*nat
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
The KUBE-SERVICES chain contains multiple rules that match on the destination address of the packet. The first three are for the cluster DNS service (kube-dns), and the fourth matches the Cluster IP of the Service named httpbin and its service port 8000. On a match, the packet jumps to the KUBE-SVC-FREKB6WNWYJLKTHC chain.
*nat
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 10.96.130.105/32 -p tcp -m comment --comment "default/httpbin:http cluster IP" -m tcp --dport 8000 -j KUBE-SVC-FREKB6WNWYJLKTHC
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
KUBE-SVC-FREKB6WNWYJLKTHC is a chain created specifically for the httpbin Service. Examining the rules in this chain, we can see several rules that use -m statistic --mode random with increasing --probability values. These rules implement Kubernetes Service routing: the rules are evaluated in order, each request randomly matches one of them, and the packet then jumps to the per-endpoint chain named by the matching rule. From the comments in the rules it is clear that each rule corresponds to the address of one pod backing the httpbin Service (see the sketch after the listing for why the probabilities increase).
*nat
-A KUBE-SVC-FREKB6WNWYJLKTHC ! -s 10.244.0.0/16 -d 10.96.130.105/32 -p tcp -m comment --comment "default/httpbin:http cluster IP" -m tcp --dport 8000 -j KUBE-MARK-MASQ
-A KUBE-SVC-FREKB6WNWYJLKTHC -m comment --comment "default/httpbin:http -> 10.244.1.5:80" -m statistic --mode random --probability 0.16666666651 -j KUBE-SEP-ILXXWWPSKGBZA46S
-A KUBE-SVC-FREKB6WNWYJLKTHC -m comment --comment "default/httpbin:http -> 10.244.1.6:80" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-AOCDBVEKKM5YHLBK
-A KUBE-SVC-FREKB6WNWYJLKTHC -m comment --comment "default/httpbin:http -> 10.244.1.7:80" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-DAUDCGLE4GJUMXCA
-A KUBE-SVC-FREKB6WNWYJLKTHC -m comment --comment "default/httpbin:http -> 10.244.2.10:80" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-QIYMDL2I6J6HKXCT
-A KUBE-SVC-FREKB6WNWYJLKTHC -m comment --comment "default/httpbin:http -> 10.244.2.7:80" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-IOJMKVPIYFDGJ3TX
-A KUBE-SVC-FREKB6WNWYJLKTHC -m comment --comment "default/httpbin:http -> 10.244.2.9:80" -j KUBE-SEP-5RMYTATHPXR6OQCH
After one of the random rules above is hit, the packet jumps to the corresponding per-endpoint chain, whose main job is the DNAT operation: -m tcp -j DNAT --to-destination 10.244.1.5:80
*nat
-A KUBE-SEP-ILXXWWPSKGBZA46S -s 10.244.1.5/32 -m comment --comment "default/httpbin:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-ILXXWWPSKGBZA46S -p tcp -m comment --comment "default/httpbin:http" -m tcp -j DNAT --to-destination 10.244.1.5:80
At this point, we have briefly analyzed the load balancing rules in Kubernetes. Now, let's take a look at how Istio Ambient Mesh can bypass these load balancing rules.
Istio Ambient Mesh provides Layer 4 traffic security. Therefore, after traffic leaves the source pod it must first enter the Ztunnel pod on the same node, where it is load balanced and, if enabled, encrypted. It is then sent to the node where the peer pod is located, where it enters that node's Ztunnel pod; this Ztunnel decrypts the traffic and transparently forwards it to the destination pod.
First, let's examine the iptables rules on the host to understand how Istio skips the Kubernetes load balancing.
*mangle
-A PREROUTING -j ztunnel-PREROUTING # The ztunnel-PREROUTING chain added by Istio.
As you can see, Istio has inserted a rule into the PREROUTING chain. Traffic that hits this rule jumps to the ztunnel-PREROUTING chain, where packets whose source IP is in the ztunnel-pods-ips set are marked 0x100:
*mangle
......
-A ztunnel-PREROUTING -p tcp -m set --match-set ztunnel-pods-ips src -j MARK --set-xmark 0x100/0x100
After the PREROUTING chain of the mangle table has been processed, the packet enters the PREROUTING chain of the nat table, where Istio has inserted a rule ahead of the Kubernetes Service rule that jumps directly to ztunnel-PREROUTING:
*nat
-A PREROUTING -j ztunnel-PREROUTING
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -d 172.18.0.1/32 -j DOCKER_OUTPUT
In the nat table, the ztunnel-PREROUTING chain directly accepts packets marked 0x100, skipping the subsequent Kubernetes Service rules:
*nat
-A ztunnel-PREROUTING -m mark --mark 0x100/0x100 -j ACCEPT
Besides skipping the Kubernetes Service rules in iptables, the 0x100 mark is also used by policy-based routing to redirect traffic to Ztunnel. Since this article focuses on load balancing, that part is not covered in detail.
To verify that the rules above work as described, we capture packets in the Ztunnel pod on the node where the sleep application pod runs, while accessing the httpbin service with curl from inside the sleep pod. Here is the deployment information for the application and the service:
| Type | Name | Address:Port |
| --- | --- | --- |
| Pod | sleep-bc9998558-8sf7z | 10.244.2.8 (random port) |
| Service | httpbin.default.svc.cluster.local | 10.96.130.105:8000 |
Run the tcpdump -i any 'src host 10.244.2.8' command to capture packets:
03:26:11.610298 pistioout In IP 10.244.2.8.48920 > 10.96.130.105.8000: Flags [S], seq 294867080, win 64240, options [mss 1460,sackOK,TS val 1553210180 ecr 0,nop,wscale 7], length 0
03:26:11.610336 eth0 In IP 10.244.2.8.48920 > 10.96.130.105.8000: Flags [.], ack 286067347, win 502, options [nop,nop,TS val 1553210181 ecr 61738018], length 0
03:26:11.610377 eth0 In IP 10.244.2.8.48920 > 10.96.130.105.8000: Flags [P.], seq 0:75, ack 1, win 502, options [nop,nop,TS val 1553210181 ecr 61738018], length 75
The capture in the Ztunnel pod shows that the packets entering the Ztunnel still carry the service address (Cluster IP) 10.96.130.105:8000 as their destination, which means the Kubernetes DNAT rules were indeed skipped.
You may also have noticed that the first packet arrives on the pistioout interface while the following packets arrive on eth0. I will explain why in a later article.
After the packet enters the Ztunnel pod, it hits the following iptables rule inside the Ztunnel pod and is redirected by TPROXY to Ztunnel's port 15001:
-A PREROUTING -i pistioout -p tcp -j TPROXY --on-port 15001 --on-ip 127.0.0.1 --tproxy-mark 0x400/0xfff
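As a side note, a TPROXY redirect does not rewrite the destination address of the packet, so a transparent proxy can read the original destination (here the Cluster IP 10.96.130.105:8000) back from the local address of the accepted connection. The following is only a simplified Rust sketch of that pattern, not Ztunnel's actual code; the IP_TRANSPARENT socket setup required for TPROXY is omitted:
use tokio::net::TcpListener;

// Illustrative sketch only: with a TPROXY redirect, the packet's destination is
// not rewritten, so the original destination can be recovered from the local
// address of the accepted connection.
//
// Note: the listening socket needs IP_TRANSPARENT set (e.g. via the socket2
// crate) and suitable privileges; that setup is omitted here.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:15001").await?;
    loop {
        let (stream, peer) = listener.accept().await?;
        // Under TPROXY, local_addr() is the original destination, not 127.0.0.1:15001.
        let orig_dst = stream.local_addr()?;
        println!("connection from {peer}, original destination {orig_dst}");
        drop(stream);
    }
}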
Next, let's turn to the Ztunnel code and analyze how its load balancing logic is implemented.
Ztunnel's outbound proxy implementation lives mainly in src/proxy/outbound.rs. The run method accepts an outbound connection, creates an OutboundConnection, and spawns an asynchronous task that calls the OutboundConnection's proxy method with the connection and keeps forwarding the data received on it:
pub(super) async fn run(self) {
let accept = async move {
loop {
// Asynchronously wait for an inbound socket.
let socket = self.listener.accept().await;
let start_outbound_instant = Instant::now();
match socket {
Ok((stream, _remote)) => {
let mut oc = OutboundConnection {
pi: self.pi.clone(),
id: TraceParent::new(),
};
let span = info_span!("outbound", id=%oc.id);
tokio::spawn(
(async move {
let res = oc.proxy(stream).await;
...
})
.instrument(span),
);
}
...
}
}
}.in_current_span();
...
}
Now let's analyze the proxy method of OutboundConnection. proxy is a thin wrapper around proxy_to, so let's look at the general structure of the proxy_to method:
pub async fn proxy_to(
&mut self,
mut stream: TcpStream,
remote_addr: IpAddr,
orig_dst_addr: SocketAddr,
block_passthrough: bool,
) -> Result<(), Error> {
if self.pi.cfg.proxy_mode == ProxyMode::Shared
&& Some(orig_dst_addr.ip()) == self.pi.cfg.local_ip
{
return Err(Error::SelfCall);
}
let req = self.build_request(remote_addr, orig_dst_addr).await?;
...
if req.request_type == RequestType::DirectLocal && can_fastpath {
// Process the local forwarding logic.
}
match req.protocol {
Protocol::HBONE => {
info!(
"proxy to {} using HBONE via {} type {:#?}",
req.destination, req.gateway, req.request_type
);
// Process HBONE forwarding.
let connect = async {
...
let tcp_stream = super::freebind_connect(local, req.gateway).await?;
...
}
}
Protocol::TCP => {
// Process TCP pass-through.
...
let mut outbound = super::freebind_connect(local, req.gateway).await?;
}
}
}
The code shows that the proxy_to function is the core logic of Ztunnel outbound forwarding. The general process is as follows:
• Call self.build_request to construct a request.
• Forward HBONE or TCP based on the protocol of the request.
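Both branches eventually establish the upstream connection with super::freebind_connect(local, req.gateway). The sketch below shows the general shape of such a "freebind" connect on Linux, i.e. binding the local side to an address that may not be assigned to the host before connecting; it is built on the socket2 and tokio crates purely as an illustration and is not Ztunnel's actual implementation:
use std::net::SocketAddr;

use socket2::{Domain, Socket, Type};
use tokio::net::{TcpSocket, TcpStream};

// Illustrative only, not Ztunnel's code: connect to `gateway` with the local side
// bound to `local`. IP_FREEBIND (socket2::Socket::set_freebind, Linux-only and
// behind socket2's "all" feature) lets the bind succeed even if `local` is not an
// address assigned to this host.
async fn freebind_connect_sketch(local: SocketAddr, gateway: SocketAddr) -> std::io::Result<TcpStream> {
    let socket = Socket::new(Domain::for_address(gateway), Type::STREAM, None)?;
    socket.set_nonblocking(true)?; // tokio requires a non-blocking socket
    socket.set_freebind(true)?;
    socket.bind(&local.into())?;
    // Hand the prepared socket to tokio and finish the connect asynchronously.
    let socket = TcpSocket::from_std_stream(socket.into());
    socket.connect(gateway).await
}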
Whether the traffic is forwarded over HBONE or passed through as plain TCP, the real destination of the connection is req.gateway. Let's analyze the code of build_request and see how it assigns a value to req.gateway:
async fn build_request(
&self,
downstream: IpAddr,
target: SocketAddr,
) -> Result<Request, Error> {
// The naming here is a bit confusing: `downstream` is actually the address of the source pod, while `target` is the original destination (the Cluster IP in the current scenario).
let downstream_network_addr = NetworkAddress {
network: self.pi.cfg.network.clone(),
address: downstream,
};
let source_workload = match self
.pi
.workloads
.fetch_workload(&downstream_network_addr)
.await
{
Some(wl) => wl,
None => return Err(Error::UnknownSource(downstream)),
};
// Search for upstream
let us = self
.pi
.workloads
.find_upstream(&source_workload.network, target, self.pi.hbone_port)
.await;
// If the upstream cannot be found
if us.is_none() {
...
}
// If the upstream service has a waypoint
match self.pi.workloads.find_waypoint(us.workload.clone()).await {
...
}
// Forward to Ztunnel
if !us.workload.node.is_empty()
&& self.pi.cfg.local_node.as_ref() == Some(&us.workload.node) // looks weird but in Rust borrows can be compared and will behave the same as owned (https://doc.rust-lang.org/std/primitive.reference.html)
&& us.workload.protocol == Protocol::HBONE
{
return Ok(Request {
...
gateway: SocketAddr::from((
us.workload
.gateway_address
.expect("gateway address confirmed")
.ip(),
15008,
)),
...
});
}
}
Since waypoints are not discussed in this article, let's focus on the scenario where the destination is reached through Ztunnel. We can see that in this branch the gateway is set to us.workload.gateway_address.expect("gateway address confirmed").ip(). Where does this IP come from? We need to look further at the implementation of the find_upstream call above:
pub async fn find_upstream(
&self,
network: &str,
addr: SocketAddr,
hbone_port: u16,
) -> Option<Upstream> {
self.fetch_address(&network_addr(network, addr.ip())).await;
let wi = self.info.lock().unwrap();
wi.find_upstream(network, addr, hbone_port)
}
The outer find_upstream method first calls fetch_address to make sure the information for the target address is available locally, then takes the lock on the workload information and calls the inner find_upstream with the same parameters, returning its result. Next, let's look at the implementation of the inner find_upstream:
fn find_upstream(&self, network: &str, addr: SocketAddr, hbone_port: u16) -> Option<Upstream> {
if let Some(svc) = self.services_by_ip.get(&network_addr(network, addr.ip())) {
let svc = svc.read().unwrap().clone();
let Some(target_port) = svc.ports.get(&addr.port()) else {
debug!("found VIP {}, but port {} was unknown", addr.ip(), addr.port());
return None
};
// Randomly pick an upstream
// TODO: do this more efficiently, and not just randomly
let Some((_, ep)) = svc.endpoints.iter().choose(&mut rand::thread_rng()) else {
debug!("VIP {} has no healthy endpoints", addr);
return None
};
let Some(wl) = self.workloads.get(&network_addr(&ep.address.network, ep.address.address)) else {
debug!("failed to fetch workload for {}", ep.address);
return None
};
// If endpoint overrides the target port, use that instead
let target_port = ep.port.get(&addr.port()).unwrap_or(target_port);
let mut us = Upstream {
workload: wl.to_owned(),
port: *target_port,
};
Self::set_gateway_address(&mut us, hbone_port);
debug!("found upstream {} from VIP {}", us, addr.ip());
return Some(us);
}
if let Some(wl) = self.workloads.get(&network_addr(network, addr.ip())) {
let mut us = Upstream {
workload: wl.to_owned(),
port: addr.port(),
};
Self::set_gateway_address(&mut us, hbone_port);
debug!("found upstream: {}", us);
return Some(us);
}
None
}
As we can see, the logic is: if the destination address matches a service in services_by_ip, the target port is looked up from the service's port mapping, and svc.endpoints.iter().choose(&mut rand::thread_rng()) randomly selects one endpoint as the destination. The workload information for that endpoint is then fetched and returned in an Upstream structure. This is the core of Ztunnel's Layer 4 load balancing.
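As a standalone illustration of that selection pattern, here is a minimal example using the rand crate's IteratorRandom::choose; the Endpoint type and the endpoint map are made up for the example and are not Ztunnel's types:
use std::collections::HashMap;

use rand::seq::IteratorRandom;

// Hypothetical endpoint type, only for this example.
#[derive(Debug)]
struct Endpoint {
    address: String,
    port: u16,
}

fn main() {
    // A stand-in for svc.endpoints: a map keyed by some endpoint identifier.
    let mut endpoints = HashMap::new();
    endpoints.insert("ep-1", Endpoint { address: "10.244.1.5".into(), port: 80 });
    endpoints.insert("ep-2", Endpoint { address: "10.244.1.6".into(), port: 80 });
    endpoints.insert("ep-3", Endpoint { address: "10.244.2.10".into(), port: 80 });

    // Same pattern as svc.endpoints.iter().choose(&mut rand::thread_rng()):
    // uniformly pick one (key, value) pair from the iterator.
    if let Some((_, ep)) = endpoints.iter().choose(&mut rand::thread_rng()) {
        println!("selected endpoint {}:{}", ep.address, ep.port);
    }
}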
Istio Ambient Mesh skips the Kubernetes load balancing rules via iptables and implements its own Layer 4 load balancing logic in Ztunnel. I will continue to analyze the Istio Ambient Mesh implementation in detail in future articles. Please stay tuned.