From VLAN to IPVLAN: Virtual Network Device and Its Application in Cloud-Native

By Zhang Wei (Xieshi)

This long article discusses the implementation of the kernel. If you are not interested in this part, I suggest jumping directly to our solutions (after giving some thought to the three questions below).

Questions

Note: All the code in this article is Linux kernel source code, version 5.16.2.

Have you ever heard of a Virtual Local Area Network (VLAN)? It is used to isolate different broadcast domains in ethernet. It was born early. In 1995, IEEE published the 802.1Q standard [1], defining the format of VLAN in Ethernet data frames, which is still in use today. If you know VLAN, have you heard of MACVlan and IPVlan? With the emergence of container technology, IPVlan and MACVlan, as Linux virtual network devices, slowly came to the stage. IPVlan and MACVlan were introduced as network solutions for containers in the Docker Engine version 1.13.1 [2] in 2017.

Have you ever had the following questions?

What is the relationship between VLAN, IPVlan, and MACVlan? Why is there a VLAN in their names?

Why do IPVlan and MACVlan have various modes and flags, such as VEPA, Private, Passthrough? What are the differences?

What are the advantages of IPVlan and MACVlan? Under what circumstances will users come across and use them?

These are the same problems I used to have, but today's article will explore the answers.

Background

The following is background information. If you know Linux well, you can skip it.

The Abstraction of Network Device by the Kernel

In Linux, when we operate a network device, we will use nothing more than IP commands or ifconfig commands. In terms of the implementation of the IP command iproute2, it depends on the netlink message mechanism provided by Linux. The kernel will abstract a structure for each type of network device (real or virtual) that specifically responds to netlink messages. They are all implemented according to the rtnl_link_ops structure to respond to the creation, destruction, and modification of network devices (such as the intuitive Veth device):

static struct rtnl_link_ops veth_link_ops = {
    .kind   = DRV_NAME,
    .priv_size  = sizeof(struct veth_priv),
    .setup    = veth_setup,
    .validate = veth_validate,
    .newlink  = veth_newlink,
    .dellink  = veth_dellink,
    .policy   = veth_policy,
    .maxtype  = VETH_INFO_MAX,
    .get_link_net = veth_get_link_net,
    .get_num_tx_queues  = veth_get_num_queues,
    .get_num_rx_queues  = veth_get_num_queues,
};

The operation of Linux and the response of hardware devices require a set of specifications for a network device. Linux abstracts it as a net_device_ops structure. If you are interested in device drivers, you mainly deal with them. Let's take the Veth device as an example:

static const struct net_device_ops veth_netdev_ops = {
  .ndo_init            = veth_dev_init,
  .ndo_open            = veth_open,
  .ndo_stop            = veth_close,
  .ndo_start_xmit      = veth_xmit,
  .ndo_get_stats64     = veth_get_stats64,
  .ndo_set_rx_mode     = veth_set_multicast_list,
  .ndo_set_mac_address = eth_mac_addr,
#ifdef CONFIG_NET_POLL_CONTROLLER
  .ndo_poll_controller  = veth_poll_controller,
#endif
  .ndo_get_iflink   = veth_get_iflink,
  .ndo_fix_features = veth_fix_features,
  .ndo_set_features = veth_set_features,
  .ndo_features_check = passthru_features_check,
  .ndo_set_rx_headroom  = veth_set_rx_headroom,
  .ndo_bpf    = veth_xdp,
  .ndo_xdp_xmit   = veth_ndo_xdp_xmit,
  .ndo_get_peer_dev = veth_peer_dev,
};

From the definition above, we can see several semantically intuitive methods: ndo_start_xmit is used to send packets. Newlink is used to create a new device.

The data packet receiving of Linux is not completed by each process, in the process in which the ksoftirqd kernel thread first processes the receiving driver, Network Layer (ip, iptables), and transport layer (tcp, udp). Then, they're put into the recv buffer of the Socket held by the user process and processed by the kernel inotify user process. For virtual devices, all the differences are concentrated before the Network Layer, where there is a unified entrance, the __netif_receive_skb_core.

The Definition of VLAN by the 801.2q Protocol

In the 802.1q protocol, a 32-bit domain is used to mark the VLAN field in the header of the Ethernet data frame. The structure is listed below:

16bits	3bits	1bit	12bits
TPID is Protocol.	PCP is a Priority.	CFI is Format Identifier	VID is the VLAN ID.

As shown above, there are 16 bits for marking Protocol, 3 bits for marking priority, 1 bit for marking format, and 12 bits for storing VLAN ID. You can easily calculate how many broadcast domains we can divide by VLAN. Yes, it is 212, which is 4,096. We minus the reserved all 0 and 1, and the customer divides 4094 available broadcast domains. (Before the rise of OpenFlow, the implementation of VPC in the earliest form of cloud computing relied on VLAN to distinguish networks. Due to this limitation, it was soon eliminated, which gave birth to another term you may be familiar with, VxLAN. Although the two are different, they still have something in common.)*

VLAN was originally a switch concept like bridge, and Linux implemented them in software. Linux uses a 16-bit vlan_proto field and a 16-bit vlan_tci field to implement 802.1q protocol in each Ethernet data frame. At the same time, for each VLAN, a sub-device will be virtualized to process the message after VLAN is removed. VLAN has its sub-device, namely a VLAN sub-interface. Different VLAN sub-devices send and receive physical messages through one master device. Is this concept a bit familiar? Yes, this is the principle of ENI-Trunking.

Go Deep into the Kernel Implementation of VLAN/MACVlan/IPVlan

With the background knowledge, let's go deeper into the Linux kernel, starting with the VLAN sub-device. All the kernel code here is based on the new version 5.16.2.

VLAN Sub-Device

Device Creation

VLAN sub-devices were not initially treated as separate virtual devices. After all, they appeared early, and the code distribution was messy. The core logic was under the /net/8021q/ path. From the background, we can understand that the netlink mechanism implements the entry for network interface controller device creation. For VLAN sub-devices, the structure of their netlink message implementation is vlan_link_ops, while the vlan_newlink method is responsible for creating VLAN sub-devices. The following is the kernel initialization code process:

1. Create a Linux general net_device structure to save the configuration information of the device. After entering vlan_newlink, a vlan_check_real_dev will be carried out to check whether the incoming VLAN id is available. The Vlan_find_dev method will be called during the process. This method is used to find qualified sub-devices for a main device, which will be used later. Let's take a look at some codes intercepted in the process:

static int vlan_newlink(struct net *src_net, struct net_device *dev,
      struct nlattr *tb[], struct nlattr *data[],
      struct netlink_ext_ack *extack)
{
  struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
  struct net_device *real_dev;
  unsigned int max_mtu;
  __be16 proto;
  int err;

  /*The part used for parameter verification is omitted here.*/

    // The vlan information of the vlan sub-device is set here, which is the default value of the protocol, vlanid, priority, and flag information related to the vlan in the background knowledge
  vlan->vlan_proto = proto;
  vlan->vlan_id  = nla_get_u16(data[IFLA_VLAN_ID]);
  vlan->real_dev   = real_dev;
  dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
  vlan->flags  = VLAN_FLAG_REORDER_HDR;

  err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id,
          extack);
  if (err < 0)
    return err;

  /*There will be mtu configuration here*/

  err = vlan_changelink(dev, tb, data, extack);
  if (!err)
    err = register_vlan_dev(dev, extack);
  if (err)
    vlan_dev_uninit(dev);
  return err;
}

2. The next step is to set the properties of the device through the vlan_changelink method. If you have special configurations, the default value will be overwritten.

3. Enter the register_vlan_dev method, which is to load the previously completed information into the net_device structure and register it into the kernel according to the Linux device management unified interface.

Receive Messages

From the perspective of the creation process, the difference between a VLAN sub-device and a general device is that the former can be found by the main device and the VLAN id in a vlan_find_dev way, which is important.

Next, let's look at the message-receiving process. According to the background knowledge, after the physical device receives the message and before entering the protocol stack processing, the conventional entry is __netif_receive_skb_core. We will analyze it from this entry. The following is the kernel operation process.

According to the diagram, we intercept __netif_receive_skb_core for analysis.

At the beginning of the packet processing, the skb_vlan_untag operation will be carried out. For VLAN packets, the packet Protocol field has been ETH_P_8021Q of the VLAN. After the VLAN information is extracted from the vlan_tci field of the packet, skb_vlan_untag calls the vlan_set_encap_proto to update the Protocol to the normal Network Layer protocol. At this time, part of the VLAN has been converted into the normal packet.
The data packet with the VLAN tag will enter the vlan_do_recieve processing in the skb_vlan_tag_present. The core of the vlan_do_receive processing is to find the sub-device through the vlan_find_dev, set dev in the data packet as the sub-device, and clean up the VLAN-related information (such as Priority). Here, the VLAN data packet has been transformed into an ordinary data packet sent to the VLAN sub-device.
After the vlan_do_receive is completed, it will enter another_round, and the __netif_receive_skb_core will be executed again according to the process of normal data packets. According to the processing logic of normal data packets, it will enter the processing of rx_handler. Just like a normal data packet, it will enter the Network Layer through the same rx_handler as the master device on the sub-device.

static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
            struct packet_type **ppt_prev)
{
  rx_handler_func_t *rx_handler;
  struct sk_buff *skb = *pskb;
  struct net_device *orig_dev;

another_round:
  skb->skb_iif = skb->dev->ifindex;
  /* This is an attempt to decapsulate the vlan of the data frame message itself and fill the two fields related to the vlan in the background.*/
  if (eth_type_vlan(skb->protocol)) {
    skb = skb_vlan_untag(skb);
    if (unlikely(!skb))
      goto out;
  }

    /* This is the packet capture point of tcpdump that you know well. Pt_prev records the handler of the previous packet processing. As you can see, a skb may be processed in many places, including pcap. */
  list_for_each_entry_rcu(ptype, &ptype_all, list) {
    if (pt_prev)
      ret = deliver_skb(skb, pt_prev, orig_dev);
    pt_prev = ptype;
  }

  /* In the case of vlan tag, if pt_prev exists, deliver_skb will be performed once. As such, other handlers will copy one, and the original message will not be modified. */
  if (skb_vlan_tag_present(skb)) {
    if (pt_prev) {
      ret = deliver_skb(skb, pt_prev, orig_dev);
      pt_prev = NULL;
    }
        /* This is the core part. We can see that after the vlan_do_receive processing, it will become a normal message again */
    if (vlan_do_receive(&skb))
      goto another_round;
    else if (unlikely(!skb))
      goto out;
  }

    /* This is where normal messages should arrive. Pt_prev indicates that the normal handler has been found. Then, call rx_handler to enter the upper-layer processing */
  rx_handler = rcu_dereference(skb->dev->rx_handler);
  if (rx_handler) {
    if (pt_prev) {
      ret = deliver_skb(skb, pt_prev, orig_dev);
      pt_prev = NULL;
    }
    switch (rx_handler(&skb)) {
    case RX_HANDLER_CONSUMED:
      ret = NET_RX_SUCCESS;
      goto out;
    case RX_HANDLER_ANOTHER:
      goto another_round;
    case RX_HANDLER_EXACT:
      deliver_exact = true;
      break;
    case RX_HANDLER_PASS:
      break;
    }
  }

  if (unlikely(skb_vlan_tag_present(skb)) && !netdev_uses_dsa(skb->dev)) {
check_vlan_id:
    if (skb_vlan_tag_get_id(skb)) {
    /* This is the case where the vlan id is not removed correctly, usually because the vlan id is invalid or does not exist locally
  }
}

Data Sending

The entry point for data transmission of VLAN sub-devices is vlan_dev_hard_start_xmit. Compared with the packet-receiving process, the sending process is simpler. The kernel sends the following process.

When the hardware is sent, the VLAN sub-device will enter the vlan_dev_hard_start_xmit method, which implements the ndo_start_xmit interface. It fills the VLAN-related Ethernet information into the message through the __vlan_hwaccel_put_tag method, modifies the device of the message as the main device, and calls the dev_queue_xmit method of the main device to re-enter the sending queue of the main device for sending. We intercept the key part to analyze.

static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
              struct net_device *dev)
{
  /* This is the filling of vlan_tci mentioned above. This information belongs to the sub-device itself. */
  if (veth->h_vlan_proto != vlan->vlan_proto ||
      vlan->flags & VLAN_FLAG_REORDER_HDR) {
    u16 vlan_tci;
    vlan_tci = vlan->vlan_id;
    vlan_tci |= vlan_dev_get_egress_qos_mask(dev, skb->priority);
    __vlan_hwaccel_put_tag(skb, vlan->vlan_proto, vlan_tci);
  }

    /* The device is directly changed from a sub-device to a master device. */
  skb->dev = vlan->real_dev;
  len = skb->len;
  if (unlikely(netpoll_tx_running(dev)))
    return vlan_netpoll_send_skb(vlan, skb);

    /* Here, you can directly call the master device to send messages */
  ret = dev_queue_xmit(skb);

    ...

  return ret;
}

MACVlan Devices

After the VLAN sub-devices, analyze the MACVlan immediately. The difference between MACVlan and VLAN sub-devices is that the former is no longer the capability of Ethernet itself but a virtual network device with its own driver. It is first reflected in the independence of the driver code. The relevant codes of MACVlan are located in the /drivers/net/macvlan.c.

There are five modes for MACVlan devices. Besides the source mode, the other four modes appear earlier. The following are the definitions:

enum macvlan_mode {
  MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
  MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */
  MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
  MACVLAN_MODE_PASSTHRU = 8,/* take over the underlying device */
  MACVLAN_MODE_SOURCE  = 16,/* use source MAC address list to assign */
};

Let's remember the behavior of these modes first. We will elaborate on the reasons later.

Device Creation

For a MACVlan device, its netlink response structure is macvlan_link_ops. We can find that the response method for creating a device is macvlan_newlink. Starting from the entry, the following is the overall process of creating a MACVlan device.

The macvlan_newlink will call the macvlan_common_newlink to create the actual sub-device. The macvlan_common_newlink will carry out a legality check first. What needs to be paid attention to is the netif_is_MACVlan check. If a MACVlan sub-device is created as the main device, the main device of this sub-device will be used as the main device of the new network interface controller.
Next, a random mac address will be created for the MACVlan device by eth_hw_addr_random. Note: The mac address of MACVlan device is random. It will be mentioned later.
After having the mac address, initialize the MACVlan logic on the main device. There will be a check here. If the main device has never created a MACVlan device, it will support the initialization of the MACVlan through macvlan_port_create. The core of this initialization is calling the netdev_rx_handler_register to perform the rx_handler method macvlan_handle_frame of the MACVlan to replace the rx_handler originally registered by the device.
After the initialization is completed, we obtain a port, which is a sub-device, and set the information of the sub-device.
Finally, the creation of the device is completed through register_netdevice. We intercept some of the core logic for analysis.

int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
         struct nlattr *tb[], struct nlattr *data[],
         struct netlink_ext_ack *extack)
{
  ...

    /* Here, it checks whether the main device is a macvlan device. If so, use the main device directly.*/
  if (netif_is_macvlan(lowerdev))
    lowerdev = macvlan_dev_real_dev(lowerdev);

    /* A random mac address is generated here.*/
  if (!tb[IFLA_ADDRESS])
    eth_hw_addr_random(dev);

  /* The initialization operation is performed here, which means the rx_handler is replaced.*/
  if (!netif_is_macvlan_port(lowerdev)) {
    err = macvlan_port_create(lowerdev);
    if (err < 0)
      return err;
    create = true;
  }
  port = macvlan_port_get_rtnl(lowerdev);

    /* The next large paragraph is omitted about the mode setting.*/
  vlan->lowerdev = lowerdev;
  vlan->dev      = dev;
  vlan->port     = port;
  vlan->set_features = MACVLAN_FEATURES;
  vlan->mode     = MACVLAN_MODE_VEPA;

    /* Last, register the device */
  err = register_netdevice(dev);
  if (err < 0)
    goto destroy_macvlan_port;
}

Receive Messages

The message receiving of the MACVlan device still starts from the __netif_receive_skb_core entry. The following is the specific code process.

When the __netif_receive_skb_core is received by the master device, it will enter the macvlan_handle_frame method of MACVlan driver registration. This method first processes multi-broadcast messages and uni-broadcast messages.
For multicast messages, after the is_multicast_ether_addr operation, the sub-devices are found through macvlan_hash_lookup and relevant information on the sub-devices and processed according to the mode of network interface controller. If it is private or passthrough, the sub-devices are found and sent to it through the macvlan_broadcast_one alone. If it is the bridge or there is no VEPA, all sub-devices will receive broadcast messages through macvlan_broadcast_enqueue.
For unicast messages, the source mode and passthrough mode will be processed first to trigger the operation of the upper layer. For other modes, the macvlan_hash_lookup operation will be performed according to the source mac. If VLAN information is found, the dev of the message will be set as the found sub-device.
Set the pkt_type of the message, return it through the RX_HANDLER_ANOTHER, and perform another __netif_receive_skb_core operation. In this operation, when it comes to the macvlan_hash_lookup, it will return to the RX_HANDLER_PASS and enter the upper processing since it is a sub-device.
For the data-receiving process of MACVlan, the most important thing is the logic of selecting sub-devices after the main device receives the message. The code of the part is listed below:

static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
                 const unsigned char *addr)
{
  struct macvlan_dev *vlan;
  u32 idx = macvlan_eth_hash(addr);

  hlist_for_each_entry_rcu(vlan, &port->vlan_hash[idx], hlist,
         lockdep_rtnl_is_held()) {
    /* This logic of this part is that the macvlan searches the sub-device and compares the mac addresses.*/
    if (ether_addr_equal_64bits(vlan->dev->dev_addr, addr))
      return vlan;
  }
  return NULL;
}

Message Sending

The sending process of MACVlan starts from the sub-device receiving the ndo_start_xmit callback function. Its entry is macvlan_start_xmit. The following is the overall kernel code process.

When the data packet enters the macvlan_start_xmit, it is the macvlan_queue_xmit method that mainly performs the data packet sending operation.
Macvlan_queue_xmit first deals with bridge mode. From the definition of mode, we can see that only in bridge mode can direct communication between different sub-devices occur in the main device. So, we did something to deal with the special situation here where multiple broadcast messages and single broadcast messages destined for other sub-devices are directly sent to the sub-devices.
Other messages, will be sent through the dev_queue_xmit_accel, and the dev_queue_xmit_accel will directly call the netdev_start_xmit method of the main device, thus realizing the real sending of the message.

static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
{
  ...

  /* This is the logic in bridge mode. You need to consider the communication between different sub-devices.*/
  if (vlan->mode == MACVLAN_MODE_BRIDGE) {
    const struct ethhdr *eth = skb_eth_hdr(skb);

    /* send to other bridge ports directly */
    if (is_multicast_ether_addr(eth->h_dest)) {
      skb_reset_mac_header(skb);
      macvlan_broadcast(skb, port, dev, MACVLAN_MODE_BRIDGE);
      goto xmit_world;
    }
    /* This process directly forwards other sub-devices of the same master device.*/
    dest = macvlan_hash_lookup(port, eth->h_dest);
    if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
      /* send to lowerdev first for its network taps */
      dev_forward_skb(vlan->lowerdev, skb);

      return NET_XMIT_SUCCESS;
    }
  }
xmit_world:
  skb->dev = vlan->lowerdev;
    /* Here, the device of the message has been set as the master device and then sub-devices are sent through the master device. */
  return dev_queue_xmit_accel(skb,
            netdev_get_sb_channel(dev) ? dev : NULL);
}

IPVlan Devices

Compared with MACVlan and VLAN sub-devices, the model of IPVlan sub-devices is more complex. Unlike MACVlan, IPVlan defines the interworking behavior between sub-devices through flags. At the same time, three modes are provided. The following are the definitions:

/* At first, there was only I2 and I3. Later there was Linux and 13mdev, so 13s appeared accordingly. The main difference between them lies in the rx. */
enum ipvlan_mode {
  IPVLAN_MODE_L2 = 0,
  IPVLAN_MODE_L3,
  IPVLAN_MODE_L3S,
  IPVLAN_MODE_MAX
};

/* There is actually a bridge here. Since the default is bridge, all are omitted, and their semantics are the same as those of macvlan. */
#define IPVLAN_F_PRIVATE  0x01
#define IPVLAN_F_VEPA   0x02

Device Creation

With the analysis of the previous two sub-devices, we can continue the analysis of IPVlan according to this idea. The netlink message processing structure of IPVlan devices is ipvlan_link_ops, and the entry method for creating devices is ipvlan_link_new. The following is the process of creating IPVlan sub-devices.

Enter the ipvlan_link_new and judge the validity. Similar to MACVlan, if an IPVlan device is added as the master device, the master device of the IPVlan device will be automatically used as the master device of the new device.
Set the mac address of the IPVlan device by eth_hw_addr_set the mac address of the master device, which is the biggest difference between IPVlan and MACVlan.
Enter the register_netdevice process of unified network interface controller registration. If there is no IPVlan sub-device currently in this process, it will enter the initialization of ipvlan_init just like MACVlan. It will create ipvl_port on the master device, replace the original rx_handler of the master device with rx_handler of IPVlan, and start a special kernel worker to process multi-broadcast messages, which means all multi-broadcast messages for IPVlan are handled uniformly.
Next, continue to process the current newly added sub-device and save the current sub-device to the information of the master device through ipvlan_set_port_mode. At the same time, for the sub-device of l3s, its l3mdev processing method will be registered in nf_hook. This is the biggest difference from the preceding devices. The exchange of data packets between the master device and the sub-device of l3s is completed at Network Layer.

For the IPVlan network device, we intercept part of the ipvlan_port_create code for analysis:

static int ipvlan_port_create(struct net_device *dev)
{
    /* As you can see from here, port is the core of master device's management to sub-device.*/
  struct ipvl_port *port;
  int err, idx;

    /* Various attributes of the sub-device are reflected in the port. You can see that the default mode is l3.*/
  write_pnet(&port->pnet, dev_net(dev));
  port->dev = dev;
  port->mode = IPVLAN_MODE_L3;

    /* As can be seen here, for ipvlan, multicast messages are processed separately.*/
  skb_queue_head_init(&port->backlog);
  INIT_WORK(&port->wq, ipvlan_process_multicast);

    /* This is the regular operation. It relies on this to make the master device receive the package smoothly to cooperate with the ipvlan action. */
  err = netdev_rx_handler_register(dev, ipvlan_handle_frame, port);
  if (err)
    goto err;
}

Messages Receiving

The three modes of IPVlan sub-devices have different packet-receiving and processing processes. The following are the processes in the kernel.

Similar to MACVlan, it first passes through the __netif_receive_skb_core to enter the processing of the ipvlan_handle_frame registered at the time of creation. At this time, the data packet is still owned by the master device.
For message processing in mode l2 mode, only multicast messages are processed, and the messages are put into the multicast processing queue initialized when the sub-device is created in front. For uni-broadcast messages, they will be directly handed over to the ipvlan_handle_mode_l3 for processing.
For mode l3 or unicast mode l2 message, enter the ipvlan_handle_mode_l3 processing. First, obtain the header information of the Network Layer through the ipvlan_get_L3_hdr. Then, find the corresponding sub-device according to the IP address. Finally, call the ipvlan_rcv_frame, set dev of the message as the IPVlan sub-device, and return the RX_HANDLER_ANOTHER to receive the packet next time.
For mode l3s, the RX_HANDLER_PASS will be directly returned during the ipvlan_handle_frame, which means the message of mode l3s will enter the Network Layer processing stage on the master device. For mode l3s, the pre-registered nf_hook will be triggered when the NF_INET_LOCAL_IN operation is performed, performing ipvlan_l3_rcv operations, finding the sub-device through addr, replacing the Network Layer destination address of the message, and then directly entering the ip_local_deliver to perform the remaining operations at Network Layer.

Messages Sending

Although the message sending of IPVlan is complicated in implementation, the fundamental reason is that each sub-device is trying to use the master device to send the message. When the IPVlan sub-device sends the packet, it enters the ipvlan_start_xmit, and its core sending operation is ipvlan_queue_xmit. The following is the kernel code process:

Ipvlan_queue_xmit selects different sending methods according to the mode of the sub-device. Mode l2 is sent through ipvlan_xmit_mode_l2, and mode l3 and mode l3s are sent through ipvlan_xmit_mode_l3.
For ipvlan_xmit_mode_l2, first, judge whether it is a local address or VEPA mode. If it is not a local message of VEPA mode, find out whether it is an IPVlan sub-device under the same master device through ipvlan_addr_lookup. If it is, let other sub-devices carry out packet-receiving processing through ipvlan_rcv_frame. If not, let the master device process through dev_forward_skb.
Next, the ipvlan_xmit_mode_l2 will process the multicast message. Before processing, the netns-related information of the packet (including priority) is cleared by ipvlan_skb_crossing_ns. Finally, the packet is put to ipvlan_multicast_enqueue to trigger the preceding multicast processing.
For non-local data packets, they are sent through the dev_queue_xmit of the master device.
For the processing of ipvlan_xmit_mode_l3, judge the VEPA first. For packets in non-VEPA mode, find out whether they are other sub-devices through ipvlan_addr_lookup. If so, call ipvlan_rcv_frame to trigger other devices to receive packets.
For packets in non-VEPA mode, ipvlan_skb_crossing_ns processing is performed first, and then ipvlan_process_outbound operations are performed. At this time, according to the Network Layer protocol of the packet, ipvlan_process_v4_outbound or ipvlan_process_v6_outbound are selected for processing.
Let's take the ipvlan_process_v6_outbound as an example. First, the route is searched through the ip_route_output_flow, and then the Network Layer ip_local_out is directly used to continue the packet-sending operation on the network layer of the master device.

Solution

After the preceding analysis, the first question can be answered easily.

The Relationship between VLAN and MACVlan/IPVlan

What is the relationship between VLAN, IPVlan, and MACVlan? Why is there a VLAN in their names?

It shows similarities in some aspects since they are called MACVlan and IPVlan. The overall analysis shows that the core logic of VLAN sub-devices is similar to MACVlan and IPVlan.

The master device is responsible for physical packet receiving and sending.
The master device manages sub-devices into multiple ports and then finds ports according to certain rules, such as VLAN information, mac address, and IP address (macvlan_hash_lookup, vlan_find_dev, and ipvlan_addr_lookup).
After receiving the packet, the master device needs to go through a turning back in the __netif_receive_skb_core.
The sub-device ultimately sends packets directly through the dev of the modified message, and then the master device is allowed to operate.

Therefore, it is not difficult to infer that the internal logic of MACVlan/IPVlan refers to the implementation of Linux VLAN to a large extent. Linux joined MACVlan on June 18, 2007, with the release of version 2.6.63 [3]. The description is listed below:

The new MACVlan driver allows the system administrator to create virtual interfaces mapped to and from specific MAC addresses.

IPVlan was introduced in version 3.19 [4] released on December 7, 2014. This is its description:

The new IPVlan driver enables the creation of virtual network devices for container interconnection. It is designed to work well with network namespaces. IPVlan is like the existing MACVlan driver, but it does its multiplexing at a higher level in the stack.

As for VLAN, it appeared much earlier than Linux version 2.4. The first version of drivers for many devices already supported VLAN. However, Linux's hwaccel implementation of VLAN was version 2.6.10 [5] in 2004. Among a large number of updated features at that time, this one appeared.

I was poking around in the National Semi 83820 driver noticed that the chip supports VLAN tag add/strip assist in hardware, but the driver wasn't making use of it. This patch adds the driver support to use the VLAN tag add/remove hardware and enables the drivers' use of the kernel VLAN hwaccel interface.

After Linux began to treat VLAN as an interface, there were two virtual interfaces, MACVlan and IPVlan. Linux virtualized different VLANs into devices to accelerate the processing of VLAN packets. In the later period, MACVlan and IPVlan made virtual devices useful under this idea.

As such, their relationship is more like a tribute.

About VEPA/passthrough/bridge/private

Why do IPVlan and MACVlan have various modes and flags, such as VEPA, private, passthrough? What are the differences?

In fact, in the analysis of the kernel, we have roughly understood the performance of these modes. If the main device is a DingTalk group and all group friends can send messages to the outside world, several modes are intuitive.

In private mode, friends in the group are forbidden to speak to each other, including within the group and outside the group.
Bridge mode allows group friends to talk happily in the group.
In VEPA mode, friends in the group are muted, but you can chat directly outside the group.
In passthrough mode, you are the group leader, and no one can speak except you.

Why are there these models? From the performance of the kernel, whether port or bridge is the concept of network, which means from the beginning, Linux is trying to show itself as a qualified network device. For the master device, Linux is trying to make it into a switch. For the sub-device, it is the device behind the network cable, which looks reasonable.

This is exactly the case. Whether it is VEPA or private, they were originally network concepts. It is not only Linux. We have seen many projects dedicated to disguising themselves as physical networks, all following these behavior patterns, such as OpenvSwitch [6].

Application of MACVlan and IPVlan

What are the advantages of IPVlan and MACVlan? Under what circumstances will users come across and use them?

Here comes the original intention of this article. We found from the second question that both IPVlan and MACVlan are doing one thing: virtual networks. Why do we want a virtual network? There are many answers to this question. Like the value of cloud computing, virtual networks, as a basic technology of cloud computing, are ultimately aimed at improving resource utilization efficiency.

MACVlan and IPVlan serve this ultimate goal. The era when the rich ran a helloworld script with one physical machine has passed. From virtualization to containerization, the era puts forward higher requirements for network density. With the birth of container technology, Veth stepped onto the stage first, but with high enough density, efficient performance is still needed. MACVlan and IPVlan have emerged in a way to increase density and ensure efficiency through sub-devices (and our ENI-Trunking).

At this point, I would like to recommend a new high-performance and high-density network solution brought by Container Service for Kubernetes (ACK) - IPVlan solution [7].

ACK implements an IPVlan-based Kubernetes network solution based on the Terway addon. Terway is a network addon developed by ACK. Terway allows you to configure networks for pods by associating elastic network interfaces with the pods. Terway allows you to use standard Kubernetes network policies to regulate how containers communicate with each other. In addition, Terway is compatible with Calico network policies.

In the Terway network addon, each pod has its network stack and IP address. Pods on the same Elastic Compute Service instance communicate with each other by forwarding packets inside the ECS instance. Pods on different ECS instances communicate with each other through ENIs in the VPC in which the ECS instances are deployed. This improves communication efficiency since no tunneling technologies, such as Virtual Extensible Local Area Network (VxLAN), are required to encapsulate messages. The following figure shows the network type of Terway.

If users select the Terway network addon when they create a cluster, they can enable the Terway IPvlan mode. The Terway IPvlan mode provides high-performance networks for Pods and Services based on IPvlan virtualization and eBPF technologies.

Compared with the default Terway mode, the Terway IPvlan mode optimizes the performance of Pod networks, Service, and NetworkPolicy.

Pod networks are directly implemented based on the sub-interfaces of ENIs in IPvlan L2 mode. This significantly simplifies network forwarding on the host and reduces the latency by 30% compared with the traditional mode. The performance of pod networks is almost the same as the host network.
Service networks are implemented based on the eBPF technology instead of the kube-proxy mode. Traffic forwarding no longer depends on the host iptables or IPVS. This maintains almost the same performance in large-scale clusters and offers better scalability. The request latency in a large number of new connections, and port reuse scenarios is lower than that in IPVS and iptables modes.
The NetworkPolicy of pods is implemented based on the eBPF technology instead of iptables. Large numbers of iptables rules are no longer generated on the host, and the impact of network policies on network performance is minimized.

Therefore, allocating the IPvlan network interface controller with IPvlan to each service pod ensures the density of the network but also enables the Veth scheme of the traditional network to achieve a huge performance improvement (please see link 7 for details). At the same time, the Terway IPvlan mode provides a high-performance Service solution. Based on the eBPF technology, we circumvent the long-criticized Conntrack performance problem.

No matter what kind of business, ACK with IPVlan is a better choice.

Thank you for reading. Lastly, there is another question behind this question. Do you know why we chose IPVlan instead of MACVlan? If you know about virtual network technology, you will get the answer soon after reading what's mentioned above. You are welcome to leave a message below.

References

[1] About IEEE 802.1Q
https://zh.wikipedia.org/wiki/IEEE_802.1Q

[2] Docker Engine release notes
https://docs.docker.com/engine/release-notes/prior-releases/

[3] Merged for MACVlan 2.6.23
https://lwn.net/Articles/241915/

[4] MACVlan 3.19 Merge window part 2
https://lwn.net/Articles/626150/

[5] VLan 2.6.10-rc2 long-format changelog
https://lwn.net/Articles/111033/

[6] [ovs-dev] VEPA support in OVS
https://mail.openvswitch.org/pipermail/ovs-dev/2013-December/277994.html

[7] Use IPVlan to Accelerate Pod Networks in Alibaba Cloud Kubernetes Clusters
https://developer.aliyun.com/article/743689

Community

From VLAN to IPVLAN: Virtual Network Device and Its Application in Cloud-Native

Questions

Background

Go Deep into the Kernel Implementation of VLAN/MACVlan/IPVlan

VLAN Sub-Device

MACVlan Devices

IPVlan Devices

Solution

The Relationship between VLAN and MACVlan/IPVlan

About VEPA/passthrough/bridge/private

Application of MACVlan and IPVlan

References

Read previous post:

Read next post:

Alibaba Cloud Native Community

You may also like

Comments

Alibaba Cloud Native Community

Related Products

Alibaba Cloud Linux

Cloud-Native Applications Management Solution

Container Service for Kubernetes

Networking Overview