Virtual network devices and their application in cloud native

The questions

Have you heard of VLAN? Its full name is Virtual Local Area Network, and it is used to isolate different broadcast domains within an Ethernet network. It was born very early: in 1995 the IEEE published the 802.1Q standard [1], which defines the format of the VLAN tag in Ethernet frames and is still in use today. If you know VLAN, have you heard of MACVlan and IPVlan? With the continuous rise of container technology, IPVlan and MACVlan, as Linux virtual network devices, have slowly moved to the front of the stage. Docker Engine introduced IPVlan and MACVlan as container network solutions in version 1.13.1 in 2017 [2].

Do you have the following questions?

1. What is the relationship between VLAN, IPVlan and MACVlan? Why is "vlan" in their names?

2. Why do IPVlan and MACVlan have various modes and flags, such as VEPA, private and passthrough? What are the differences between them?

3. What are the advantages of IPVlan and MACVlan? In what situations should you reach for them?

I once had the same questions. In this article, we will explore these three questions in depth.

Background knowledge

Here is some background knowledge. If you already know Linux well, you can skip this part.

• The abstraction of network devices in Linux

In Linux, we operate a network device with the ip command or the ifconfig command. The ip command comes from iproute2, and what it really relies on is the netlink messaging mechanism provided by Linux. For each type of network device (real or virtual), the kernel abstracts a structure that responds to netlink messages; they all implement the rtnl_link_ops structure, which is used to respond to the creation, destruction and modification of network devices. For example, take the fairly intuitive Veth device:
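For reference, this is roughly what such a structure looks like for Veth. The sketch below is condensed from drivers/net/veth.c in the 5.16 kernel, with some fields trimmed; treat it as illustrative rather than a line-for-line copy:

    /* Condensed sketch of the veth driver's netlink response structure. */
    static struct rtnl_link_ops veth_link_ops = {
        .kind       = DRV_NAME,                 /* "veth" */
        .priv_size  = sizeof(struct veth_priv),
        .setup      = veth_setup,               /* initialize the net_device */
        .validate   = veth_validate,            /* validate netlink attributes */
        .newlink    = veth_newlink,             /* respond to device creation */
        .dellink    = veth_dellink,             /* respond to device destruction */
        .policy     = veth_policy,
        .maxtype    = VETH_INFO_MAX,
    };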

For a network device, the way Linux operates it and the way the hardware responds also need a set of conventions. Linux abstracts this as the net_device_ops structure; if you are interested in device drivers, this is the structure you will mostly be dealing with. Again, take the Veth device as an example:
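Again a condensed sketch for Veth, this time of its net_device_ops; only the most recognizable callbacks are kept here, and the full list lives in drivers/net/veth.c:

    /* Condensed sketch of veth's net_device_ops (fields trimmed). */
    static const struct net_device_ops veth_netdev_ops = {
        .ndo_init            = veth_dev_init,
        .ndo_open            = veth_open,
        .ndo_stop            = veth_close,
        .ndo_start_xmit      = veth_xmit,        /* called to transmit a packet */
        .ndo_get_stats64     = veth_get_stats64,
        .ndo_set_mac_address = eth_mac_addr,
    };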

From the definitions above we can see several semantically intuitive methods: ndo_start_xmit is used to send packets, and newlink is used to create a new device.

As for receiving packets, packet reception in Linux is not carried out by each process itself. Instead, the ksoftirqd kernel thread takes packets from the driver, passes them through the network layer (ip, iptables) and the transport layer (tcp, udp), and finally places them in the receive buffer of the socket held by the user process, after which the kernel notifies the user process to handle them. For virtual devices, all the differences are concentrated before the network layer, where there is a unified entry point: __netif_receive_skb_core.

• Definition of VLAN in the 802.1Q protocol

In the 802.1Q protocol, the VLAN field added to the Ethernet frame header is a 32-bit field with the following structure:
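The same layout can be written down as a C structure plus the usual masks. The names below mirror include/linux/if_vlan.h, but treat this as an illustrative sketch rather than the exact kernel definition:

    /* An 802.1Q-tagged Ethernet header. The 32-bit VLAN tag is a 16-bit
     * TPID (ETH_P_8021Q, 0x8100) followed by a 16-bit TCI, where
     * TCI = 3-bit priority (PCP) + 1-bit DEI/CFI + 12-bit VLAN ID. */
    struct vlan_ethhdr {
        unsigned char h_dest[6];                 /* destination MAC */
        unsigned char h_source[6];               /* source MAC */
        __be16        h_vlan_proto;              /* TPID: 0x8100 */
        __be16        h_vlan_TCI;                /* PCP | DEI | VID */
        __be16        h_vlan_encapsulated_proto; /* the real upper protocol */
    };

    #define VLAN_PRIO_MASK 0xe000 /* top 3 bits: priority */
    #define VLAN_CFI_MASK  0x1000 /* 1 bit: format (DEI/CFI) */
    #define VLAN_VID_MASK  0x0fff /* low 12 bits: VLAN ID */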

As shown above, 16 bits are used for the tag protocol identifier, 3 bits for the priority, 1 bit for the format indicator, and 12 bits to store the VLAN ID. You can easily work out how many broadcast domains we can carve out with VLANs: 2^12 = 4096, and after subtracting the reserved all-0 and all-1 values, 4094 usable broadcast domains remain. (Before the rise of OpenFlow, the VPC implementations in the earliest prototypes of cloud computing relied on VLANs to separate networks; this limitation soon got them eliminated, and it also gave birth to another term you may be familiar with, VxLAN. The two are very different, but there is still a clear lineage between them.)

VLAN was originally a switch-side concept, like bridging, but Linux implemented it in software. Linux implements the 802.1Q protocol with a 16-bit vlan_proto field and a 16-bit vlan_tci field in each Ethernet frame. At the same time, for each VLAN, a sub-device is virtualized to process the packets after the VLAN tag has been removed. Yes, VLANs have their own sub-devices, namely VLAN sub-interfaces: different VLAN sub-devices send and receive physical packets through one master device. Does this concept sound familiar? Yes, this is exactly the principle behind ENI-Trunking.

Deep dive into the kernel implementation of VLAN/MACVlan/IPVlan

With the background knowledge in place, let's start with the VLAN sub-device and see how the Linux kernel does things. All the kernel code discussed here is based on version 5.16.2, the latest at the time of writing.

VLAN sub-device

• Device creation

The VLAN sub-device was not treated as a separate virtual device at first. After all, it appeared very early and its code is scattered, but the core logic lives under the net/8021q/ path. From the background knowledge, we know that the netlink mechanism implements the entry point for creating network devices. For VLAN sub-devices, the netlink response structure is vlan_link_ops, and the method responsible for creating a VLAN sub-device is vlan_newlink. The kernel flow for initialization is as follows:

1. First, a generic Linux net_device structure is created to hold the device's configuration. After entering vlan_newlink, vlan_check_real_dev checks whether the requested VLAN ID is available; it calls the vlan_find_dev method, which finds the matching sub-device of a master device for a given VLAN ID and will be used again later in the receive path. A condensed sketch of it is given after this list.

2. Next, vlan_changelink sets the properties of the device; if you provide special configuration, it overrides the defaults.

3. Finally, register_vlan_dev loads the completed information into the net_device structure and registers it into the kernel through Linux's unified device management interface.
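As mentioned in step 1, vlan_find_dev takes the master device, the VLAN protocol and the VLAN ID and returns the matching sub-device. A condensed sketch of this helper under net/8021q/ (simplified, not copied verbatim):

    /* Given the master device, VLAN protocol and VLAN ID, return the
     * corresponding VLAN sub-device, or NULL if none exists. */
    struct net_device *vlan_find_dev(struct net_device *real_dev,
                                     __be16 vlan_proto, u16 vlan_id)
    {
        struct vlan_info *vlan_info = rcu_dereference_rtnl(real_dev->vlan_info);

        if (vlan_info)
            return vlan_group_get_device(&vlan_info->grp,
                                         vlan_proto, vlan_id);

        return NULL;
    }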

• Packet reception

Looking at the creation flow, the difference between a VLAN sub-device and an ordinary device is that a VLAN sub-device can be looked up from the master device and a VLAN ID through vlan_find_dev. This point is very important.

Next, let's look at the packet receive flow. As covered in the background knowledge, after a physical device receives a packet and before the packet enters the protocol stack, the usual entry point is __netif_receive_skb_core. We will analyze step by step from this entry. The kernel flow is as follows:

Following the diagram above, let's walk through the relevant part of __netif_receive_skb_core (a condensed sketch is given after the list below):

1. At the beginning of packet processing, skb_vlan_untag is performed. For a VLAN packet, the protocol field is always ETH_P_8021Q; skb_vlan_untag extracts the VLAN information from the packet's vlan_tci field and then calls vlan_set_encap_proto to update the protocol to the normal upper-layer protocol. At this point the VLAN packet has been partially converted into an ordinary packet.

2. A packet carrying a VLAN tag will, via the skb_vlan_tag_present check, enter the vlan_do_receive processing flow. The core of vlan_do_receive is to find the sub-device through vlan_find_dev, set the packet's dev to that sub-device, and then clean up VLAN-related information such as the priority. At this point, the VLAN packet has been turned into an ordinary packet destined for the VLAN sub-device.

3. After vlan_do_receive, the packet goes through another_round: __netif_receive_skb_core is executed again following the normal packet flow. The packet enters the rx_handler stage just like an ordinary packet, this time on the sub-device, and goes through the same rx_handler path as on the master device into the network layer.
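The VLAN-related fragment of __netif_receive_skb_core (net/core/dev.c) can be sketched as follows; everything unrelated to VLAN is omitted and the structure is simplified:

    /* Fragment sketch of __netif_receive_skb_core(), VLAN branches only. */
    another_round:
        /* ... (taps, XDP and other processing omitted) ... */

        /* Step 1: strip the 802.1Q tag into skb metadata and restore
         * skb->protocol to the real upper-layer protocol. */
        if (eth_type_vlan(skb->protocol)) {
            skb = skb_vlan_untag(skb);
            if (unlikely(!skb))
                goto out;
        }

        /* ... */

        /* Step 2: vlan_do_receive() finds the sub-device via vlan_find_dev()
         * and points skb->dev at it. */
        if (skb_vlan_tag_present(skb)) {
            if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL;
            }
            /* Step 3: go around again as a normal packet whose dev is
             * now the VLAN sub-device. */
            if (vlan_do_receive(&skb))
                goto another_round;
            else if (unlikely(!skb))
                goto out;
        }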

• Packet transmission

The transmit entry point of a VLAN sub-device is vlan_dev_hard_start_xmit. Compared with the receive path, the transmit path is much simpler. The kernel flow for sending is as follows:

When transmission happens, the VLAN sub-device enters the vlan_dev_hard_start_xmit method, which implements the ndo_start_xmit interface. It fills the VLAN-related Ethernet information into the packet through the __vlan_hwaccel_put_tag method, then changes the packet's device to the master device, and calls the master device's dev_queue_xmit method so the packet re-enters the master device's transmit queue. We extract the key part for analysis:
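Below is a condensed sketch of vlan_dev_hard_start_xmit (net/8021q/vlan_dev.c); statistics, error handling and the header-reorder check are left out:

    /* Fill in the VLAN tag, retarget the skb at the master device and
     * hand it to the master device's transmit queue. */
    static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
                                                struct net_device *dev)
    {
        struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
        u16 vlan_tci;

        /* Put the VLAN ID plus egress priority into the skb so the tag
         * can be inserted (by hardware offload or the software path). */
        vlan_tci  = vlan->vlan_id;
        vlan_tci |= vlan_dev_get_egress_qos_mask(dev, skb->priority);
        __vlan_hwaccel_put_tag(skb, vlan->vlan_proto, vlan_tci);

        /* The actual transmission is done by the master device. */
        skb->dev = vlan->real_dev;
        dev_queue_xmit(skb);

        return NETDEV_TX_OK;
    }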

MACVlan devices

Having looked at the VLAN sub-device, let's move on to MACVlan. The difference between MACVlan and the VLAN sub-device is that MACVlan is no longer a capability of Ethernet itself but a virtual network device with its own driver. This is reflected first of all in the independence of the driver code: MACVlan-related code basically lives in drivers/net/macvlan.c.

MACVlan devices have five modes. Except for the source mode, the other four appeared fairly early. They are defined as follows:
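The mode constants live in the UAPI header include/uapi/linux/if_link.h; the comments paraphrase their behavior:

    enum macvlan_mode {
        MACVLAN_MODE_PRIVATE  = 1,  /* don't talk to other macvlans */
        MACVLAN_MODE_VEPA     = 2,  /* talk to other ports through an external bridge */
        MACVLAN_MODE_BRIDGE   = 4,  /* talk to bridge ports directly */
        MACVLAN_MODE_PASSTHRU = 8,  /* take over the underlying device */
        MACVLAN_MODE_SOURCE   = 16, /* use a source MAC address list to assign traffic */
    };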

Keep the behavior of these modes in mind; the reason they exist is one of the questions we will answer later.

• Device creation

For MACVlan devices, the netlink response structure is macvlan_link_ops, and the method responding to device creation is macvlan_newlink. Starting from that entry, the overall flow for creating a MACVlan device is as follows:

1. macvlan_newlink calls macvlan_common_newlink to perform the actual sub-device creation. macvlan_common_newlink first performs validity checks; what deserves attention is the netif_is_macvlan check: if a MACVlan sub-device is passed in as the master device, the master device of that sub-device is automatically used as the master device of the new interface.

2. Next, eth_hw_addr_random creates a random MAC address for the MACVlan sub-device. Yes, the MAC address of a MACVlan sub-device is random; this is important and will come up again later.

3. With a MAC address in place, the MACVlan logic is initialized on the master device. There is a check here: if the master device has never had a MACVlan device created on it, macvlan_port_create initializes MACVlan support. The most important part of this initialization is calling netdev_rx_handler_register to substitute MACVlan's rx_handler method, macvlan_handle_frame, for the rx_handler originally registered on the device.

4. After initialization, the port is obtained, that is, the MACVlan bookkeeping attached to the master device, and the information of the new sub-device is set up.

5. Finally, register_netdevice completes the device creation. We extract some of the core logic for analysis:
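The following is a condensed sketch of macvlan_common_newlink covering steps 1 to 5; error handling and most attribute parsing are omitted, and the real code in 5.16.2 differs in detail:

    /* Condensed sketch: random MAC, one-time port initialization on the
     * master device, then registration of the new sub-device. */
    int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
                               struct nlattr *tb[], struct nlattr *data[],
                               struct netlink_ext_ack *extack)
    {
        struct macvlan_dev *vlan = netdev_priv(dev);
        struct net_device *lowerdev;
        int err;

        lowerdev = __dev_get_by_index(src_net, nla_get_u32(tb[IFLA_LINK]));

        /* If the chosen "master" is itself a MACVlan, use its master. */
        if (netif_is_macvlan(lowerdev))
            lowerdev = macvlan_dev_real_dev(lowerdev);

        /* The sub-device gets a random MAC address of its own. */
        if (!tb[IFLA_ADDRESS])
            eth_hw_addr_random(dev);

        /* First MACVlan on this master: create the port and register
         * macvlan_handle_frame() as the master's rx_handler. */
        if (!netif_is_macvlan_port(lowerdev)) {
            err = macvlan_port_create(lowerdev);
            if (err < 0)
                return err;
        }

        vlan->lowerdev = lowerdev;
        vlan->dev      = dev;
        vlan->port     = macvlan_port_get_rtnl(lowerdev);
        vlan->mode     = MACVLAN_MODE_VEPA;          /* default mode */
        if (data && data[IFLA_MACVLAN_MODE])
            vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);

        return register_netdevice(dev);
    }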

• Packet reception

A MACVlan device still receives packets starting from __netif_receive_skb_core. The specific code flow is as follows:

1. When __netif_receive_skb_core is reached on the master device, it enters macvlan_handle_frame, the handler registered by the MACVlan driver, which processes multicast packets first and then unicast packets.

2. For multicast packets, after the is_multicast_ether_addr check, macvlan_hash_lookup is first used to find the sub-device from the packet's source information, and processing then depends on that sub-device's mode: if it is private or passthru, the packet is handed back to that sub-device alone via macvlan_broadcast_one; if it is bridge or VEPA, all sub-devices receive the broadcast via macvlan_broadcast_enqueue.

3. For unicast packets, the source mode and passthru mode are handled first, directly triggering delivery to the upper layer. For the other modes, macvlan_hash_lookup is performed on the destination MAC address; if a sub-device is found, the packet's dev is set to that sub-device.

4. Finally, the packet's pkt_type is updated and RX_HANDLER_ANOTHER is returned, so __netif_receive_skb_core is executed once more. In this pass, because the dev is already the sub-device, the result is RX_HANDLER_PASS and the packet enters the upper layer.

5. For the MACVlan receive path, the most critical piece is the logic that selects the sub-device after the master device receives the packet. This part of the code is as follows:
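A condensed sketch of the unicast branch of macvlan_handle_frame, which is exactly this sub-device selection; multicast handling and reference counting are omitted:

    /* Select the target sub-device for a unicast frame received on the
     * master device, then let the packet go around again. */
    static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
    {
        struct sk_buff *skb = *pskb;
        const struct ethhdr *eth = eth_hdr(skb);
        struct macvlan_port *port = macvlan_port_get_rcu(skb->dev);
        const struct macvlan_dev *vlan;

        /* (multicast and source-mode handling omitted in this sketch) */

        if (macvlan_passthru(port))
            /* passthru: the single sub-device takes everything */
            vlan = list_first_or_null_rcu(&port->vlans,
                                          struct macvlan_dev, list);
        else
            /* otherwise select the sub-device by destination MAC */
            vlan = macvlan_hash_lookup(port, eth->h_dest);

        if (!vlan || vlan->mode == MACVLAN_MODE_SOURCE)
            return RX_HANDLER_PASS;   /* not ours: let the master handle it */

        /* Retarget the packet at the sub-device and go around again. */
        skb->dev = vlan->dev;
        skb->pkt_type = PACKET_HOST;
        return RX_HANDLER_ANOTHER;
    }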

• Packet transmission

MACVlan's transmit path also starts from the sub-device's ndo_start_xmit callback. Its entry point is macvlan_start_xmit, and the overall kernel code flow is as follows (a condensed sketch is given after the list):

1. After a packet enters macvlan_start_xmit, the main transmission work is done by the macvlan_queue_xmit method.

2. macvlan_queue_xmit first handles bridge mode. From the mode definitions we know that only in bridge mode can different sub-devices communicate directly inside the master device, so this special case is handled here: multicast packets and unicast packets destined for other sub-devices are delivered to the sub-devices directly.

3. Other packets are sent via dev_queue_xmit_accel, which directly invokes the netdev_start_xmit path of the master device to actually transmit the packet.
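Putting steps 1 to 3 together, macvlan_queue_xmit can be sketched as follows (condensed; the subordinate-channel argument of dev_queue_xmit_accel is simplified here):

    /* Bridge-mode shortcut first, everything else goes out via the master. */
    static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        const struct macvlan_dev *vlan = netdev_priv(dev);
        const struct macvlan_port *port = vlan->port;
        const struct macvlan_dev *dest;

        if (vlan->mode == MACVLAN_MODE_BRIDGE) {
            const struct ethhdr *eth = skb_eth_hdr(skb);

            /* Multicast: hand a copy to the other bridge-mode
             * sub-devices, then still send it out to the world. */
            if (is_multicast_ether_addr(eth->h_dest)) {
                skb_reset_mac_header(skb);
                macvlan_broadcast(skb, port, dev, MACVLAN_MODE_BRIDGE);
                goto xmit_world;
            }

            /* Unicast to a sibling bridge-mode sub-device: loop it back
             * through the master device's receive path. */
            dest = macvlan_hash_lookup(port, eth->h_dest);
            if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
                dev_forward_skb(vlan->lowerdev, skb);
                return NET_XMIT_SUCCESS;
            }
        }
    xmit_world:
        /* All other traffic: transmit through the master device. */
        skb->dev = vlan->lowerdev;
        return dev_queue_xmit_accel(skb, NULL);
    }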

IPVlan devices

Compared with MACVlan and VLAN sub-devices, the IPVlan sub-device model is more complex. Unlike MACVlan, IPVlan additionally defines the behavior between sub-devices through flags, and it provides three modes, defined as follows:
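The mode and flag constants come from include/uapi/linux/if_link.h; the comments paraphrase their behavior as described in this article:

    enum ipvlan_mode {
        IPVLAN_MODE_L2 = 0, /* switching done below the network layer, much like MACVlan */
        IPVLAN_MODE_L3,     /* sub-device selected by network-layer (IP) address */
        IPVLAN_MODE_L3S,    /* like l3, but diverted via a netfilter hook so
                             * conntrack/iptables see the traffic */
        IPVLAN_MODE_MAX
    };

    /* Flags that further restrict how sub-devices may talk to each other. */
    #define IPVLAN_F_PRIVATE 0x01  /* sub-devices cannot reach each other */
    #define IPVLAN_F_VEPA    0x02  /* hairpin through the external switch */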

• Device creation

With the analysis of the previous two sub-devices behind us, we can analyze IPVlan along the same lines. The netlink message processing structure of IPVlan devices is ipvlan_link_ops, and the entry method for creating a device is ipvlan_link_new. The flow for creating an IPVlan sub-device is as follows:

1. Enter ipvlan_link_new and perform validity checks. Similar to MACVlan, if an IPVlan device is passed in as the master device, the master device of that IPVlan device is automatically used as the master device of the new device.

2. Through eth_hw_addr_set, the MAC address of the IPVlan device is set to the MAC address of the master device. This is the most obvious feature that distinguishes IPVlan from MACVlan.

3. Then comes the unified register_netdevice flow. During this flow, if no IPVlan sub-device exists on the master yet, an initialization process like MACVlan's (ipvlan_init) is entered: it creates an ipvl_port on the master device, replaces the master device's original rx_handler with IPVlan's rx_handler, and also starts a dedicated kernel worker to process multicast packets. In other words, IPVlan processes all multicast packets in one place.

4. Next, the newly added sub-device continues to be processed: ipvlan_set_port_mode saves the current sub-device into the master device's information and, for sub-devices in l3s mode, registers its l3mdev processing method into nf_hook. Yes, this is the biggest difference from the devices above: the master device and the sub-devices of l3s actually exchange packets at the network layer.

For IPVlan network devices, we extract part of ipvlan_port_create for analysis:
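A condensed sketch of ipvlan_port_create (drivers/net/ipvlan/ipvlan_main.c); error handling and reference counting are omitted:

    /* Attach IPVlan bookkeeping to the master device, register the
     * rx_handler and start the worker that drains multicast packets. */
    static int ipvlan_port_create(struct net_device *dev)
    {
        struct ipvl_port *port;
        int err, idx;

        port = kzalloc(sizeof(struct ipvl_port), GFP_KERNEL);
        if (!port)
            return -ENOMEM;

        port->dev  = dev;
        port->mode = IPVLAN_MODE_L3;            /* default mode */
        INIT_LIST_HEAD(&port->ipvlans);
        for (idx = 0; idx < IPVLAN_HASH_SIZE; idx++)
            INIT_HLIST_HEAD(&port->hlhead[idx]);

        /* All multicast packets are funneled into one work queue. */
        skb_queue_head_init(&port->backlog);
        INIT_WORK(&port->wq, ipvlan_process_multicast);

        /* Divert the master device's receive path to IPVlan. */
        err = netdev_rx_handler_register(dev, ipvlan_handle_frame, port);
        if (err) {
            kfree(port);
            return err;
        }
        return 0;
    }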

• Packet reception

The three modes of the IPVlan sub-device have different receive flows. The kernel flow is as follows (a condensed sketch of the entry dispatch is given after the list):

1. Similar to MACVlan, the packet first passes through __netif_receive_skb_core and enters ipvlan_handle_frame, the handler registered at creation time. At this point, the packet still belongs to the master device.

2. For packets in mode l2, only multicast packets are processed and placed on the multicast processing queue that was initialized when the sub-device was created; unicast packets are handed straight to ipvlan_handle_mode_l3!

3. For mode l3 packets, and for l2 unicast packets, processing enters ipvlan_handle_mode_l3: it first obtains the network-layer header via ipvlan_get_L3_hdr, finds the corresponding sub-device by IP address, and finally calls ipvlan_rcv_frame to set the packet's dev to the IPVlan sub-device and return RX_HANDLER_ANOTHER, triggering the next round of packet reception.

4. For mode l3s, ipvlan_handle_frame directly returns RX_HANDLER_PASS, that is, an l3s packet enters the network-layer processing stage on the master device. The pre-registered nf_hook is then triggered at NF_INET_LOCAL_IN and executes ipvlan_l3_rcv, which finds the sub-device by address, redirects the packet to it, and then goes straight into ip_local_deliver to carry out the rest of the network-layer processing.
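The dispatch at the receive entry point, ipvlan_handle_frame, can be sketched as follows (condensed; in the real code the l3s branch is only compiled in when CONFIG_IPVLAN_L3S is enabled):

    /* Dispatch on the port mode that was decided at creation time. */
    rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb)
    {
        struct sk_buff *skb = *pskb;
        struct ipvl_port *port = ipvlan_port_get_rcu(skb->dev);

        if (!port)
            return RX_HANDLER_PASS;

        switch (port->mode) {
        case IPVLAN_MODE_L2:
            return ipvlan_handle_mode_l2(pskb, port);
        case IPVLAN_MODE_L3:
            return ipvlan_handle_mode_l3(pskb, port);
        case IPVLAN_MODE_L3S:
            /* l3s: let the packet go up on the master device; the
             * netfilter hook registered earlier (ipvlan_l3_rcv) redirects
             * it to the sub-device at NF_INET_LOCAL_IN. */
            return RX_HANDLER_PASS;
        }

        /* Unknown mode: drop. */
        kfree_skb(skb);
        return RX_HANDLER_CONSUMED;
    }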

• Packet transmission

Although the implementation of IPVlan packet transmission is relatively complex, at its root each sub-device is trying to use the master device to send packets. When an IPVlan sub-device transmits, it first enters ipvlan_start_xmit, whose core sending work is done in ipvlan_queue_xmit. The kernel code flow is as follows (a condensed sketch of the dispatch is given after the list):

1. ipvlan_queue_xmit selects different transmit routines according to the mode of the sub-device: mode l2 is sent via ipvlan_xmit_mode_l2, while mode l3 and mode l3s are sent via ipvlan_xmit_mode_l3.

2. ipvlan_xmit_mode_l2 first checks whether the packet is destined for a local address and whether VEPA mode is set. If it is a local packet and VEPA is not enabled, ipvlan_addr_lookup checks whether the destination is an IPVlan sub-device under the same master device; if so, ipvlan_rcv_frame lets the other sub-device receive the packet; if not, dev_forward_skb lets the master device process it.

3. Next, ipvlan_xmit_mode_l2 handles multicast packets. Before processing, ipvlan_skb_crossing_ns cleans up the packet's netns-related information, including the priority, and finally the packet is queued via ipvlan_multicast_enqueue, triggering the multicast processing flow described above.

4. Non-local packets are sent through the master device's dev_queue_xmit.

5. ipvlan_xmit_mode_l3 also checks VEPA first. For packets not in VEPA mode, ipvlan_addr_lookup checks whether the destination is another sub-device; if so, ipvlan_rcv_frame is called to let the other device receive the packet.

6. For the remaining packets, ipvlan_skb_crossing_ns is performed first, then ipvlan_process_outbound, which selects ipvlan_process_v4_outbound or ipvlan_process_v6_outbound according to the packet's network-layer protocol.

7. Taking ipvlan_process_v4_outbound as an example, it first looks up the route via ip_route_output_flow and then sends the packet directly through the network layer's ip_local_out, continuing the transmit operation on the master device's network layer.
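Symmetrically, the transmit-side dispatch in ipvlan_queue_xmit can be sketched as follows (condensed):

    /* Pick the transmit routine according to the sub-device's mode. */
    int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        struct ipvl_dev *ipvlan = netdev_priv(dev);
        struct ipvl_port *port = ipvlan_port_get_rcu_bh(ipvlan->phy_dev);

        if (!port)
            goto out;

        switch (port->mode) {
        case IPVLAN_MODE_L2:
            return ipvlan_xmit_mode_l2(skb, dev);
        case IPVLAN_MODE_L3:
        case IPVLAN_MODE_L3S:
            return ipvlan_xmit_mode_l3(skb, dev);
        }

    out:
        kfree_skb(skb);
        return NET_XMIT_DROP;
    }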

Answering the questions

With the above analysis under our belt, I think at least the first question can now be answered easily:

Relationship between VLAN and MACVlan/IPVlan

What is the relationship between VLAN, IPVlan and MACVlan? Why is "vlan" in their names?

Since MACVlan and IPVlan chose these names, it suggests similarities in some respects. From the overall analysis we find that the core logic of the VLAN sub-device is very similar to that of MACVlan and IPVlan:

1. The master device is responsible for physically sending and receiving packets.

2. The master device manages its sub-devices as a set of ports and finds the right port according to certain rules, such as VLAN information, MAC address or IP address (macvlan_hash_lookup, vlan_find_dev, ipvlan_addr_lookup).

3. After the master device receives a packet, the packet has to take a "turn back" inside __netif_receive_skb_core.

4. When a sub-device transmits, in the end it simply modifies the packet's dev and lets the master device do the work.

It is therefore not hard to infer that the internal logic of MACVlan/IPVlan borrows heavily from the implementation of Linux VLAN. Linux first added MACVlan in version 2.6.23 [3], released in 2007. Its description reads:

The new "MACVlan" driver allows the system administrator to create virtual interfaces mapped to and from specific MAC addresses.

IPVlan was first introduced in version 3.19 [4], released in early 2015. Its description reads:

The new "IPVlan" driver enable the creation of virtual network devices for container interconnection. It is designed to work well with network namespaces. IPVlan is much like the existing MACVlan driver, but it does its multiplexing at a higher level in the stack.

As for VLAN itself, it appeared much earlier than Linux 2.4; the first versions of many device drivers already supported VLANs. However, Linux's hardware-acceleration (hwaccel) support for VLAN arrived in 2.6.10 [5] in 2004. Among the large number of features updated at that time, this entry appeared:

I was poking about in the National Semi 83820 driver, and I happened to notice that the chip supports VLAN tag add/strip assist in hardware, but the driver wasn't making use of it. This patch adds in the driver support to use the VLAN tag add/remove hardware, and enables the drivers use of the kernel VLAN hwaccel interface.

In other words, Linux had been treating each VLAN as an interface long before MACVlan and IPVlan appeared as virtual interfaces. To speed up the processing of VLAN packets, Linux virtualized different VLANs into devices, and later MACVlan and IPVlan took this idea further and put virtual devices to even greater use.

In this way, their relationship is more like a homage.

About VEPA/passthrough/bridge/private

Why do IPVlan and MACVlan have various modes and flags, such as VEPA, private and passthrough? What are the differences between them?

In fact, from the kernel analysis we already have a rough sense of how these modes behave. If the master device is a DingTalk group chat, and every member can send messages to people outside the group, then the modes become quite intuitive:

1. In private mode, group members cannot talk to each other at all, neither inside the group nor privately outside it.

2. In bridge mode, group members can chat happily inside the group.

3. In VEPA mode, group members are muted inside the group, but they can still talk to each other privately outside it. It is like the group-wide mute during the red-packet-grabbing session at an annual party.

4. In passthrough mode, you are the group owner, and nobody can speak except you.

So why do these modes exist? Looking at how the kernel behaves, whether it is a port or a bridge, these are really concepts from physical networking. That is to say, from the beginning Linux has been trying to present itself as a qualified network device: for the master device, Linux tries to make it behave like a switch, and each sub-device then acts like the device at the other end of a network cable. This seems quite reasonable.

And that is in fact the case: both VEPA and private started out as concepts in physical networking. It is not just Linux, either; many projects committed to presenting themselves as a physical network follow these same behavior patterns, such as Open vSwitch [6].

Application of MACVlan and IPVlan

What are the advantages of IPVlan and MACVlan? In what situations should you reach for them?

This finally brings us to the original motivation of this article. From the second question we can see that both IPVlan and MACVlan are doing the same thing: virtualizing the network. Why do we need to virtualize the network? There are many answers to this question, but just like the value of cloud computing itself, network virtualization, as one of the foundational technologies of cloud computing, ultimately exists to improve the efficiency of resource utilization.

MACVlan and IPVlan serve this ultimate goal. The extravagant days of running a single hello-world on a whole physical machine are behind us. From virtualization to containerization, the times have placed ever higher demands on network density. With the birth of container technology, Veth stepped onto the stage first, but density alone is not enough: performance also has to be efficient. MACVlan and IPVlan, which raise density while guaranteeing efficiency through sub-devices (and, of course, our ENI-Trunking), came into being.

Speaking of which, let me recommend the new high-performance, high-density network solution brought by Alibaba Cloud Container Service ACK: the IPVlan solution [7]!

Based on the Terway plug-in, ACK implements a Kubernetes network solution built on IPVlan. The Terway network plug-in is a network plug-in developed by ACK; it assigns native elastic network card (ENI) resources to Pods to implement the Pod network. It supports Kubernetes-standard NetworkPolicy to define access policies between containers and is compatible with Calico network policies.

In the Terway network plug-in, each Pod has its own network stack and IP address. Pods on the same ECS instance communicate by direct forwarding inside the machine; Pods on different ECS instances communicate directly through the VPC's elastic network card. Since there is no need to encapsulate packets with tunneling technologies such as VxLAN, the Terway mode network offers high communication performance. The Terway network mode is shown below:

When customers create a cluster with ACK and select the Terway network plug-in, they can configure it to use the Terway IPvlan mode. The Terway IPvlan mode uses IPvlan virtualization and eBPF kernel technology to implement high-performance Pod and Service networking.

Unlike the default Terway network mode, IPvlan mode mainly optimizes the performance of the Pod network, Service, and NetworkPolicy:

• The Pod network is implemented directly through an IPvlan L2 sub-interface of the ENI network card, which greatly simplifies the forwarding path on the host. The Pod's network performance is almost identical to the host's, and latency is reduced by 30% compared with the traditional mode.

• The Service network uses eBPF to replace the original kube-proxy mode, so traffic does not need to be forwarded through iptables or IPVS on the host. Performance stays almost flat in large clusters, and scalability is better. In scenarios with large numbers of new connections and port reuse, request latency is significantly lower than in the IPVS and iptables modes.

• Pod network policies also use eBPF to replace the original iptables implementation, so there is no need to generate a large number of iptables rules on the host, minimizing the impact of network policies on network performance.

So, by using IPVlan to give each business Pod its own IPVlan sub-interface, we not only achieve the required network density but also gain a large performance improvement over the traditional Veth scheme (see reference [7] for details). At the same time, the Terway IPvlan mode provides a high-performance Service solution: based on eBPF, it avoids the long-criticized conntrack performance problems.
