
Improving Kubernetes Service Network Performance with Socket eBPF

How can we use socket eBPF to improve the performance of the Kubernetes Service?


Background Information

The network features in Kubernetes mainly include the POD network, the Service network, and network policies. The POD network and network policies specify a model but do not provide a built-in implementation. Service, by contrast, is a built-in feature of Kubernetes, and the official version has multiple implementations:

  • userspace proxy mode: kube-proxy is responsible for list/watch, rule settings, and userspace forwarding.
  • iptables proxy mode: kube-proxy is responsible for list/watch and rule settings; iptables-related kernel modules are responsible for forwarding.
  • IPVS proxy mode: kube-proxy is responsible for watching Kubernetes resources and setting rules; IPVS in the Linux kernel is responsible for forwarding.

The several Service implementations that have appeared in Kubernetes all share the same overall goal: higher performance and extensibility.

The Service network is a distributed server load balancer. kube-proxy, deployed in daemonset mode, watches endpoint (or endpointslice) and service objects and generates forwarding entries locally on each node. Currently, the iptables and IPVS modes are used in production environments. The following figure shows the principle:


This article introduces the logic of using socket eBPF to implement server load balancing at the socket level. This eliminates packet-by-packet NAT processing and improves the Service network's performance.

Socket eBPF-based Data Plane

Introduction to the Socket eBPF

Whether kube-proxy uses IPVS or the tc eBPF service network acceleration mode, every network packet from a pod must pass through IPVS or tc eBPF (POD <--> Service <--> POD). As traffic increases, this incurs performance overhead. Can we instead rewrite the cluster IP address of the service to the corresponding pod IP when the connection is established? The kube-proxy and IPVS based service networks are implemented with per-packet processing plus session tracking.

With socket eBPF, we can implement the SLB logic without per-packet processing and NAT conversion. The Service network path is optimized to POD <--> POD, so the service network performance becomes equivalent to that of the POD network. The following is the software structure:


In the Linux kernel, the BPF_PROG_TYPE_CGROUP_SOCK family of eBPF hooks can be used to attach the necessary eBPF programs to socket system calls.

  • By attaching to the file descriptor of a specific cgroup, we can control the scope of the hooks.
  • With the socket eBPF hooks, we can hijack specific socket interfaces at the socket level to complete the SLB logic.
  • The POD-SVC-POD forwarding path is thereby converted into POD-POD forwarding.



TCP Workflow

TCP is connection-oriented, so the implementation is simple: only the connect system call needs to be hooked, as shown in the following figure:


Connect system call hijacking logic:

  1. Take the dip+dport from the connect call context and look up the svc table. If no entry is found, return without processing.
  2. Look up the affinity session. If one is found, get the backend_id and go to step 4; otherwise, go to step 3.
  3. Randomly schedule a backend.
  4. Using the backend_id, look up the be table to obtain the backend's IP+port.
  5. Update the affinity information.
  6. Rewrite the dip+dport in the connect call context to the backend's ip+port.
  7. Done.

Address translation is thus completed once at the socket level. Afterwards, clusterip access over TCP is equivalent to east-west communication between PODs, minimizing the clusterip overhead:

  • No packet-by-packet DNAT is required.
  • No packet-by-packet svc lookup is required.
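As a sanity check of the control flow, the connect hijacking steps above can be modeled in plain Python. This is only a sketch: svc_table, be_table, and affinity are illustrative stand-ins for the real eBPF maps, not their actual definitions.

```python
import random

# be table: backend_id -> (ip, port) of the real server (illustrative)
be_table = {0: ("10.0.1.1", 8080), 1: ("10.0.1.2", 8080)}
# svc table: (vip, vport) -> candidate backend ids (illustrative)
svc_table = {("10.0.0.1", 80): [0, 1]}
# affinity table: client -> backend_id
affinity = {}

def hijack_connect(client, dst):
    """Model of the connect() hook: rewrite dst if it is a service VIP."""
    backends = svc_table.get(dst)
    if backends is None:                      # step 1: not a service, pass through
        return dst
    backend_id = affinity.get(client)         # step 2: reuse session if present
    if backend_id is None:
        backend_id = random.choice(backends)  # step 3: random scheduling
    affinity[client] = backend_id             # step 5: update affinity
    return be_table[backend_id]               # steps 4+6: rewrite to be's ip+port

first = hijack_connect("pod-a", ("10.0.0.1", 80))
second = hijack_connect("pod-a", ("10.0.0.1", 80))
print(first == second)                            # affinity keeps the same backend
print(hijack_connect("pod-a", ("8.8.8.8", 53)))   # non-service traffic untouched
```

Because the rewrite happens once at connect time, every later packet on the socket already carries the backend address, which is exactly why the per-packet DNAT and svc lookups disappear.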

UDP Workflow

UDP is connectionless, so the implementation is more complex, as shown in the following figure:

For the definition of the nat_sk table, see: LB4_REVERSE_NAT_SK_MAP


Hijack connect and sendmsg system calls:

  1. Take the dip+dport from the system call context and look up the svc table. If no entry is found, return without processing.
  2. Look up the affinity session. If one is found, get the backend_id and go to step 4; otherwise, go to step 3.
  3. Randomly schedule a backend.
  4. Using the backend_id, look up the be table to obtain the backend's IP+port.
  5. Update the affinity tables.
  6. Update the nat_sk table: the key is the backend's ip+port, and the value is the svc's vip+vport.
  7. Rewrite the dip+dport in the system call context to the backend's ip+port.
  8. Done.

Hijack the recvmsg system call:

  1. Use the remote IP+port from the system call context to look up the nat_sk table. If no entry is found, return without processing.
  2. If an entry is found, use its IP+port to check the svc table. If the svc no longer exists, delete the corresponding nat_sk entry and return.
  3. Use the vip+vport found in nat_sk to rewrite the remote IP+port in the system call context.
  4. Done.
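The sendmsg and recvmsg paths above cooperate through the reverse-NAT table. A minimal Python model of the pair follows; nat_sk here is an illustrative dict standing in for the kernel map (see LB4_REVERSE_NAT_SK_MAP for the real definition), and scheduling/affinity are elided for brevity.

```python
# nat_sk: key = backend (ip, port), value = service (vip, vport) -- illustrative
nat_sk = {}
# svc table: (vip, vport) -> backends (illustrative)
svc_table = {("10.0.0.1", 53): [("10.0.1.1", 5353)]}

def hijack_sendmsg(dst):
    """Model of the connect/sendmsg hook on the UDP send path."""
    backends = svc_table.get(dst)
    if backends is None:
        return dst                 # step 1: not a service address, pass through
    be = backends[0]               # steps 2-4: scheduling elided, pick a backend
    nat_sk[be] = dst               # step 6: record reverse mapping be -> vip
    return be                      # step 7: rewrite dst to the backend

def hijack_recvmsg(remote):
    """Model of the recvmsg hook: restore the VIP the app expects."""
    vip = nat_sk.get(remote)
    if vip is None:
        return remote              # step 1: no entry, pass through
    if vip not in svc_table:       # step 2: svc gone, drop the stale entry
        del nat_sk[remote]
        return remote
    return vip                     # step 3: rewrite remote back to vip+vport

be = hijack_sendmsg(("10.0.0.1", 53))
print(be)                          # the datagram actually goes to the backend
print(hijack_recvmsg(be))          # the reply appears to come from the VIP
```

The reverse rewrite in recvmsg is what keeps the translation transparent: the application sends to and receives from the service address, while the wire traffic is POD to POD.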

About Address Correction

The clusterIP implementation based on socket eBPF must handle some special details beyond the basic forwarding principles above. One of them is the peer address. Unlike implementations such as IPVS, with socket eBPF clusterIP the client communicates directly with the backend, and the intermediate service address is bypassed. The following is the forwarding path:


If the app on the client queries the peer address through an interface such as getpeername, the address obtained is inconsistent with the address passed to connect. If the app makes decisions based on the peer address, or uses it for some special purpose, unexpected behavior may occur.

In view of this situation, we can correct the address at the socket level with eBPF:

  1. Add a bpf_attach_type to the guest kernel, which can be used to add hook processing to getpeername and getsockname.
  2. At connect time, in the corresponding socket hook, define a map that records the corresponding VIP:VPort and RSIP:RSPort pair.
  3. When the app calls the getpeername/getsockname interface, use the eBPF program to modify the returned data: rewrite the remote IP+port in the context back to vip+vport.
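The correction can be modeled as follows; peer_map is an illustrative per-socket record of the VIP:VPort and RSIP:RSPort pair captured at connect time, not the real map definition.

```python
# peer_map: socket id -> (vip_addr, real_server_addr), recorded at connect time
peer_map = {}

def on_connect(sock, vip, real_server):
    """At connect time the socket hook records both addresses."""
    peer_map[sock] = (vip, real_server)
    return real_server             # the connection itself goes to the backend

def hooked_getpeername(sock, kernel_peer):
    """Model of the getpeername hook: report the VIP, not the backend."""
    entry = peer_map.get(sock)
    if entry is None:
        return kernel_peer         # socket not handled by the SLB: pass through
    vip, real_server = entry
    return vip if kernel_peer == real_server else kernel_peer

on_connect("sk1", ("10.0.0.1", 80), ("10.0.1.1", 8080))
print(hooked_getpeername("sk1", ("10.0.1.1", 8080)))  # app sees the VIP
```

With this in place, an app that compares getpeername against the address it passed to connect sees a consistent answer, even though the kernel socket is actually connected to the backend.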


Performance Comparison: Socket eBPF, tc eBPF, and IPVS

Test environment: secure container instances with 4 vCPUs and 8 GB memory; a single client, a single clusterip, and 12 backends.

  • socket BPF: the socket eBPF-based service implementation described in this article.
  • tc eBPF: a cls-bpf-based service implementation, which has been applied in the ACK service.
  • IPVS-raw: all security group rules and overhead such as veth removed, keeping only the IPVS service forwarding logic.

Socket BPF improves all performance metrics to varying degrees. For a large number of concurrent short connections, throughput improves by 15% and latency is reduced by 20%.

Comparison of forwarding performance (QPS)


Comparison of 90% Forwarding Latency (ms)


Continue to Evolve

The service implementation based on socket eBPF simplifies the load balancer logic and reflects the flexible and compact nature of eBPF, which fits cloud-native scenarios well. Currently, this technology has been adopted in Alibaba Cloud to accelerate the Kubernetes service network.
