PouchContainer CNM Network Initialization with Libnetwork

Networking is a perennial topic in the container world. The container community has developed two network models: Container Network Interface (CNI) and Container Network Model (CNM). In my opinion, CNI is mainly used in container scheduling, where it works with the scheduling layer (such as Kubernetes) to initialize the container network. CNM is primarily promoted by Docker and is mainly used to build container networks on standalone hosts.

PouchContainer supports both models. CNI works with the Container Runtime Interface (CRI) to offer computing and network services to Kubernetes, while CNM provides network services to containers when PouchContainer is used on its own. Currently, PouchContainer implements the CNM network using Libnetwork. This article introduces CNM, describes how PouchContainer uses Libnetwork (this part includes a lot of source code), and summarizes how PouchContainer builds the container network.

CNM Overview

CNM is mainly promoted by Docker. The CNM model consists of the following types of resources:

Sandbox: contains the entire network protocol stack used by the container, such as the NICs, the route table, and the DNS configuration. A sandbox may contain multiple endpoints from different networks. A sandbox can be implemented in different ways: with a network namespace on Linux, or with a jail on FreeBSD.
Endpoint: connects a sandbox to a network. An endpoint may be a veth device pair or an Open vSwitch internal port.
Network: consists of a group of endpoints that can communicate with each other. A network may be a Linux bridge or a VLAN.

Libnetwork implements CNM and provides drivers for various network types, such as bridge, host, IPVLAN, MACVLAN, null, and overlay.
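The relationship between the three resource types can be sketched in a few lines of Go. This is a toy model for illustration only: the type and method names below (Network, Endpoint, Sandbox, CreateEndpoint, Join, Networks) are assumptions that loosely echo Libnetwork's vocabulary, and they are not Libnetwork's real types or signatures.

```go
package main

import "fmt"

// Network is a group of endpoints that can communicate with each other.
type Network struct {
	Name      string
	Driver    string
	Endpoints []*Endpoint
}

// Endpoint connects exactly one sandbox to exactly one network.
type Endpoint struct {
	Name    string
	Network *Network
	Sandbox *Sandbox // nil until Join is called
}

// Sandbox stands for the network stack of one container.
// It may hold endpoints that belong to different networks.
type Sandbox struct {
	ContainerID string
	Endpoints   []*Endpoint
}

// CreateEndpoint registers a new, not yet joined endpoint on the network.
func (n *Network) CreateEndpoint(name string) *Endpoint {
	ep := &Endpoint{Name: name, Network: n}
	n.Endpoints = append(n.Endpoints, ep)
	return ep
}

// Join attaches the endpoint to a sandbox, which connects the sandbox
// to the endpoint's network.
func (ep *Endpoint) Join(sb *Sandbox) error {
	if ep.Sandbox != nil {
		return fmt.Errorf("endpoint %s is already joined to a sandbox", ep.Name)
	}
	ep.Sandbox = sb
	sb.Endpoints = append(sb.Endpoints, ep)
	return nil
}

// Networks lists the names of the networks the sandbox is connected to.
func (sb *Sandbox) Networks() []string {
	names := make([]string, 0, len(sb.Endpoints))
	for _, ep := range sb.Endpoints {
		names = append(names, ep.Network.Name)
	}
	return names
}

func main() {
	bridge := &Network{Name: "bridge", Driver: "bridge"}
	backend := &Network{Name: "backend", Driver: "overlay"}
	sb := &Sandbox{ContainerID: "c1"}

	// One sandbox, two endpoints, two networks.
	for _, n := range []*Network{bridge, backend} {
		ep := n.CreateEndpoint(n.Name + "-ep")
		if err := ep.Join(sb); err != nil {
			fmt.Println("join failed:", err)
			return
		}
	}
	fmt.Println(sb.Networks()) // prints [bridge backend]
}
```

The point of the sketch is the cardinality: an endpoint belongs to one network and one sandbox, while a sandbox can collect endpoints from several networks.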

How the PouchContainer Invokes Libnetwork

The PouchContainer invokes Libnetwork by the following steps:

Initialize the Libnetwork Controller.
Initialize the network types of bridge, null, and host.
When a container is created, create the corresponding sandbox and endpoints, and join the endpoints so that the sandbox is connected to the network.

The first step is to initialize the Libnetwork Controller. This step is part of NetworkManager initialization. NetworkManager is the network module of PouchContainer and the entry point for all network operations.

func NewNetworkManager(cfg *config.Config, store *meta.Store, ctrMgr ContainerMgr) (*NetworkManager, error) {
    ...
    ctlOptions, err := controllerOptions(cfg.NetworkConfig)
    if err != nil {
        return nil, errors.Wrap(err, "failed to build network options")
    }

    controller, err := libnetwork.New(ctlOptions...)
    if err != nil {
        return nil, errors.Wrap(err, "failed to create network controller")
    }
    ...
}

The second step is network initialization, that is, building networks such as bridge and null. This step also takes place while Pouchd boots.

func NetworkModeInit(ctx context.Context, config network.Config, manager mgr.NetworkMgr) error {
    // if there are old containers, don't initialize the network.
    if len(config.ActiveSandboxes) > 0 {
        logrus.Warnf("There are old containers, don't to initialize network")
        return nil
    }

    // init none network
    if n, _ := manager.Get(ctx, "none"); n == nil {
        ...
    }

    // init host network
    if n, _ := manager.Get(ctx, "host"); n == nil {
        ...
    }

    // init bridge network
    return bridge.New(ctx, config.BridgeConfig, manager)
}

The last step is to create the sandbox and endpoints. This step takes place in the container booting stage: the start function calls prepareContainerNetwork to build the network.

func (mgr *ContainerManager) start(ctx context.Context, c *Container, detachKeys string) error {
    ...
    if err = mgr.prepareContainerNetwork(ctx, c); err != nil {
        return err
    }

    if err = mgr.createContainerdContainer(ctx, c); err != nil {
        return errors.Wrapf(err, "failed to create container(%s) on containerd", c.ID)
    }

    return nil
}

The prepareContainerNetwork function will call the EndpointCreate function of NetworkManager to create and join sandbox and endpoints.

func (nm *NetworkManager) EndpointCreate(ctx context.Context, endpoint *types.Endpoint) (string, error) {
    containerID := endpoint.Owner
    network := endpoint.Name
    ...

    n, err := nm.controller.NetworkByName(network)
    ...

    // create endpoint
    ep, err := n.CreateEndpoint(endpointName, epOptions...)
    if err != nil {
        return "", err
    }

    // create sandbox
    sb := nm.getNetworkSandbox(containerID)
    if sb == nil {
        ...
        sb, err = nm.controller.NewSandbox(containerID, sandboxOptions...)
    }
    ...

    // endpoint joins into sandbox
    if err := ep.Join(sb, joinOptions...); err != nil {
        return "", fmt.Errorf("failed to join sandbox(%v)", err)
    }
    ...
}

Conflict between Container Creation and Sandbox Initialization

The procedure above looks seamless. However, if you review the start function of ContainerManager, you will find that the sandbox and endpoints are created first, and only then is createContainerdContainer called to create the real container. Yet the carrier of the sandbox, namely the network namespace, is created only when the container boots. Is there a conflict?

Creation of Network Namespace

Containers are built on the cgroup and namespace technologies: cgroups restrict and account for the container's resource usage (such as CPU, memory, and I/O), and namespaces isolate these resources.

By default, PouchContainer uses runC as its low-level container runtime, so cgroups and namespaces are set up by runC. The namespaces are created by a piece of C code in runC.

nsenter.c

nsexec creates the namespace with two forks and one unshare, and then executes the entrypoint or command specified by the user.

Relationship between Sandbox and Network Namespace

To learn about the relationship between sandbox and network namespace, you must know the complete structure of sandbox.

type sandbox struct {
    id                 string
    containerID        string
    config             containerConfig
    extDNS             []string
    osSbox             osl.Sandbox
    controller         *controller
    resolver           Resolver
    resolverOnce       sync.Once
    refCnt             int  
    endpoints          epHeap
    epPriority         map[string]int
    populatedEndpoints map[string]struct{}
    joinLeaveDone      chan struct{}
    dbIndex            uint64
    dbExists           bool 
    isStub             bool 
    inDelete           bool 
    ingress            bool 
    sync.Mutex
}

Sandbox includes an osSbox member. The implementations of osSbox vary according to the operating systems. In Linux, osSbox is implemented by network namespace.

The following is the creation procedure of sandbox.

func (c *controller) NewSandbox(containerID string, options ...SandboxOption) (Sandbox, error) {
    if sb.config.useDefaultSandBox {
        c.sboxOnce.Do(func() {
            c.defOsSbox, err = osl.NewSandbox(sb.Key(), false, false)
        })
        sb.osSbox = c.defOsSbox
    }

    if sb.osSbox == nil && !sb.config.useExternalKey {
        if sb.osSbox, err = osl.NewSandbox(sb.Key(), !sb.config.useDefaultSandBox, false); err != nil {
            return nil, fmt.Errorf("failed to create new osl sandbox: %v", err)
        }
    }
    ...
    err = sb.storeUpdate()

    return sb, nil
}

First, the sandbox metadata is constructed (this part is elided in the excerpt above). Then the value of useDefaultSandBox is checked. If it is true, osSbox is set to defOsSbox.

When is useDefaultSandBox true? When the network type of the container is host, that is, when the container uses the host network. In that case osSbox is set to defOsSbox, which corresponds to the host network namespace at the path /var/run/pouch/netns/default.

If osSbox is still nil and useExternalKey is false, a new osSbox is created. If useExternalKey is true, osSbox is not created. In PouchContainer, useExternalKey is true when the network type is null (the none network) or bridge; therefore, osSbox is not initialized at this point.
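The branching described above can be condensed into a small decision function. This is a simplified sketch, not Libnetwork code: sandboxConfig, osSboxAction, and the three result constants are hypothetical names introduced here, and only the two flags discussed in the text are modeled.

```go
package main

import "fmt"

// sandboxConfig models only the two flags discussed in the text.
// It is a hypothetical stand-in for Libnetwork's container config.
type sandboxConfig struct {
	useDefaultSandBox bool
	useExternalKey    bool
}

// Possible outcomes for osSbox when the sandbox is created.
const (
	reuseHost   = "reuse the host namespace (defOsSbox)"
	createNew   = "create a new namespace now"
	deferSetKey = "leave osSbox nil until the external key arrives"
)

// osSboxAction mirrors the order of the checks in NewSandbox:
// the default sandbox wins, then a namespace is created unless
// an external key is expected.
func osSboxAction(cfg sandboxConfig) string {
	if cfg.useDefaultSandBox {
		return reuseHost
	}
	if !cfg.useExternalKey {
		return createNew
	}
	return deferSetKey
}

func main() {
	// Flag combinations as described in the text for PouchContainer.
	modes := []struct {
		name string
		cfg  sandboxConfig
	}{
		{"host", sandboxConfig{useDefaultSandBox: true}},
		{"bridge", sandboxConfig{useExternalKey: true}},
		{"none", sandboxConfig{useExternalKey: true}},
	}
	for _, m := range modes {
		fmt.Printf("%-6s -> %s\n", m.name, osSboxAction(m.cfg))
	}
}
```

For bridge and none, the function lands in the third branch, which is exactly the situation the following sections have to resolve.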

Join Operation of Endpoints

The endpoint join operation connects a sandbox to a network. The following code shows the tasks involved in the join operation. The final entry point of the join operation is the sbJoin function.

func (ep *endpoint) sbJoin(sb *sandbox, options ...EndpointOption) error {
    ...
    d, err := n.driver(true)
    err = d.Join(nid, epid, sb.Key(), ep, sb.Labels())
    if err != nil {
        return err
    }
    ...
    if err = sb.updateHostsFile(address); err != nil {
        return err
    }
    if err = sb.updateDNS(n.enableIPv6); err != nil {
        return err
    }
    ...
    
    if err = sb.populateNetworkResources(ep); err != nil {
        return err
    }
    ...
}

First, the Join method of the driver is called. What the join operation does depends on the driver type. If the driver is bridge, a veth device pair is created. One end of the pair is connected to the bridge, and the other end is left unattached for now. After the hosts file and the DNS configuration are updated, populateNetworkResources is called to initialize the network resources in the container.

func (sb *sandbox) populateNetworkResources(ep *endpoint) error {
    sb.Lock()
    if sb.osSbox == nil {
        sb.Unlock()
        return nil
    }    
    if i != nil && i.srcName != "" { 
        ifaceOptions = append(ifaceOptions, sb.osSbox.InterfaceOptions().Address(i.addr), sb.osSbox.InterfaceOptions().Routes(i.routes))
        if i.mac != nil {
            ifaceOptions = append(ifaceOptions, sb.osSbox.InterfaceOptions().MacAddress(i.mac))
        }    

        if err := sb.osSbox.AddInterface(i.srcName, i.dstPrefix, ifaceOptions...); err != nil {
            return fmt.Errorf("failed to add interface %s to sandbox: %v", i.srcName, err) 
        }    
    }
    
    if joinInfo != nil {
        for _, r := range joinInfo.StaticRoutes {
            if err := sb.osSbox.AddStaticRoute(r); err != nil {
                return fmt.Errorf("failed to add static route %s: %v", r.Destination.String(), err)
            }
        }
    }
    ...
}

This function checks whether osSbox is nil. If it is, the function returns without initializing the network in the container. Otherwise, it moves the container NIC into the container's network namespace and sets up the IP address, MAC address, and static routes.

At the time of the join operation, the network namespace of the container has not been created yet, so osSbox is nil and the container's network is not initialized. How, then, is the network initialization completed?

Magic ExternalKey

As mentioned in the Relationship between Sandbox and Network Namespace section, if useExternalKey is true, the osSbox of the sandbox is nil and the container's network is not initialized. To initialize the network, osSbox must not be nil. However, osSbox depends on the container's namespace, and the namespace is created only after the container is created. Therefore, the network initialization must be completed after container creation.

Prestart Hook of runC

The hooks of runC let users define actions that run at specific points in the container's lifecycle. runC has three types of hooks:

Prestart: The pre-start hooks are executed before the user-specified program (namely, the entrypoint or command in the container) starts. At this point, the container's namespaces have already been created, so initialization operations can be performed on them.
Poststart: The post-start hooks are executed after the user-specified program starts and before the container's start operation returns.
Poststop: The post-stop hooks are executed after the container's program stops and before the container's deletion operation returns. They are used for cleanup.

Because the namespaces already exist and the user program has not started yet, the pre-start hooks are the right place for network initialization.

PouchContainer uses the following pre-start hooks for container initialization:

"hooks": {
    "prestart": [
      {
        "path": "/usr/bin/pouchd",
        "args": [
          "libnetwork-setkey",
          "76cf8065568ac429d5aec9908dc149decfadafa6091f991d01bb44f39d51312a",
          "edbfbf851eaee68102f15c50ae93739d8bd92d70c66a9be7b37c4d17ce124023"
        ]
      }
    ]
}
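A hook entry of this shape could be generated with a few lines of Go. The sketch below is illustrative only: the hook and hooks structs and the setKeyHook helper are hypothetical names, they model just the path and args fields shown above, and the IDs passed in main are placeholders rather than real values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hook models the two fields used by the pre-start hook shown above.
type hook struct {
	Path string   `json:"path"`
	Args []string `json:"args"`
}

// hooks models only the prestart list of the runtime configuration.
type hooks struct {
	Prestart []hook `json:"prestart"`
}

// setKeyHook builds a pre-start hook that re-executes the daemon binary
// with the libnetwork-setkey subcommand, the container ID, and the
// controller ID, in that order.
func setKeyHook(daemonPath, containerID, controllerID string) hooks {
	return hooks{Prestart: []hook{{
		Path: daemonPath,
		Args: []string{"libnetwork-setkey", containerID, controllerID},
	}}}
}

func main() {
	// "container-id" and "controller-id" are placeholders.
	h := setKeyHook("/usr/bin/pouchd", "container-id", "controller-id")
	out, err := json.MarshalIndent(map[string]hooks{"hooks": h}, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

The order of the arguments matters: the receiving side reads the container ID first and the controller ID second.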

Container Network Initialization

Functions of Libnetwork-setkey

Libnetwork-setkey is provided by Libnetwork to implement network initialization. It takes two parameters:

libnetwork-setkey <container-id> <controller-id>

container-id is the ID of the container. Through container-id, you can obtain the sandbox information. controller-id is the ID of the Libnetwork Controller. Libnetwork-setkey is handled by the processSetKeyReexec function.

func processSetKeyReexec() {
    containerID := os.Args[1]

    stateBuf, err := ioutil.ReadAll(os.Stdin)
    if err != nil {
        return
    }
    var state configs.HookState
    if err = json.Unmarshal(stateBuf, &state); err != nil {
        return
    }

    controllerID := os.Args[2]

    err = SetExternalKey(controllerID, containerID, fmt.Sprintf("/proc/%d/ns/net", state.Pid))
}

func SetExternalKey(controllerID string, containerID string, key string) error {
    keyData := setKeyData{    
        ContainerID: containerID, 
        Key:         key}

    c, err := net.Dial("unix", udsBase+controllerID+".sock")
    if err != nil {      
        return err       
    }
    defer c.Close()      

    if err = sendKey(c, keyData); err != nil {
        return fmt.Errorf("sendKey failed with : %v", err)
    }
    return processReturn(c)   
}

The processSetKeyReexec logic is simple. It reads the hook state that runC passes on standard input, which contains the Pid of the container process. With this Pid, it derives the path of the container's network namespace, that is, /proc/{pid}/ns/net. The container ID and the namespace path are then written to a unix socket that the Libnetwork Controller listens on.
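The first half of that logic, turning the hook state into a namespace path, can be tried in isolation. The following is a minimal sketch under stated assumptions: hookState keeps only the pid field of the state document, netnsPathFromState and netnsPathFromJSON are hypothetical helper names, and the sample state is fed from a string instead of the real standard input that a hook would read.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// hookState keeps only the field this sketch needs from the state
// document that the runtime writes to the hook's standard input.
type hookState struct {
	Pid int `json:"pid"`
}

// netnsPathFromJSON decodes the state and derives the path of the
// container's network namespace from the container process ID.
func netnsPathFromJSON(data []byte) (string, error) {
	var st hookState
	if err := json.Unmarshal(data, &st); err != nil {
		return "", fmt.Errorf("decode hook state: %v", err)
	}
	if st.Pid <= 0 {
		return "", fmt.Errorf("hook state has no valid pid: %d", st.Pid)
	}
	return fmt.Sprintf("/proc/%d/ns/net", st.Pid), nil
}

// netnsPathFromState reads the whole state from r; in a real hook,
// r would be os.Stdin.
func netnsPathFromState(r io.Reader) (string, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return netnsPathFromJSON(data)
}

func main() {
	// Sample state with a made-up pid, standing in for standard input.
	state := `{"id":"demo","pid":4242}`
	path, err := netnsPathFromState(strings.NewReader(state))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(path) // prints /proc/4242/ns/net
}
```

Unlike the excerpt above, the sketch returns its errors to the caller, which makes a malformed or incomplete state visible instead of silently ending the hook.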

ExternalKey Processing by Libnetwork Controller

As mentioned previously, a Libnetwork Controller is created during Pouchd initialization. During its own initialization, the controller creates a unix socket to listen for ExternalKey requests.

func (c *controller) startExternalKeyListener() error {
    if err := os.MkdirAll(udsBase, 0600); err != nil {
        return err 
    }   
    uds := udsBase + c.id + ".sock"
    l, err := net.Listen("unix", uds)
    ...
    go c.acceptClientConnections(uds, l)
    return nil 
}

func (c *controller) processExternalKey(conn net.Conn) error {
    buf := make([]byte, 1280)
    nr, err := conn.Read(buf)
    if err != nil {
        return err
    }
    var s setKeyData
    if err = json.Unmarshal(buf[0:nr], &s); err != nil {
        return err
    }
    ...
    return sandbox.SetKey(s.Key)
}

The requests are accepted by acceptClientConnections and processed by processExternalKey. processExternalKey reads the container-id and the namespace path from the request, traverses all sandboxes to find the one that corresponds to the container-id, and then calls its SetKey function.
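The request/response shape of this exchange can be reproduced without a real daemon. The sketch below is a simplified model, not Libnetwork code: registry and deliver are hypothetical names, sendKey and processExternalKey are reduced re-implementations that only borrow the names from the text, and net.Pipe stands in for the unix socket so that the example runs without touching the filesystem.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// setKeyData carries the two fields named in the text.
type setKeyData struct {
	ContainerID string
	Key         string
}

// registry is a hypothetical stand-in for the controller's sandbox list.
// It maps a container ID to its namespace path ("" until a key is set).
type registry struct {
	keys map[string]string
}

// sendKey writes the request as one JSON message.
func sendKey(c net.Conn, d setKeyData) error {
	b, err := json.Marshal(d)
	if err != nil {
		return err
	}
	_, err = c.Write(b)
	return err
}

// processExternalKey reads one request, finds the matching sandbox
// entry, and records the namespace path for it.
func (r *registry) processExternalKey(conn net.Conn) error {
	buf := make([]byte, 1280)
	n, err := conn.Read(buf)
	if err != nil {
		return err
	}
	var s setKeyData
	if err := json.Unmarshal(buf[:n], &s); err != nil {
		return err
	}
	if _, ok := r.keys[s.ContainerID]; !ok {
		return fmt.Errorf("no sandbox for container %s", s.ContainerID)
	}
	r.keys[s.ContainerID] = s.Key
	return nil
}

// deliver performs one exchange over an in-memory connection pair,
// which plays the role of the unix socket in this sketch.
func deliver(r *registry, d setKeyData) error {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	done := make(chan error, 1)
	go func() { done <- r.processExternalKey(server) }()

	if err := sendKey(client, d); err != nil {
		return err
	}
	return <-done
}

func main() {
	r := &registry{keys: map[string]string{"c1": ""}}
	err := deliver(r, setKeyData{ContainerID: "c1", Key: "/proc/4242/ns/net"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.keys["c1"])
}
```

As in the excerpt, the receiver performs a single read into a fixed 1280-byte buffer, so the sketch relies on the whole request fitting into one message.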

func (sb *sandbox) SetKey(basePath string) error {
    osSbox, err := osl.GetSandboxForExternalKey(basePath, sb.Key())
    ...

    for _, ep := range sb.getConnectedEndpoints() {
        if err = sb.populateNetworkResources(ep); err != nil {
            return err
        }
    }
    return nil
}

SetKey creates the osSbox by using the received network namespace key, traverses all endpoints connected to the sandbox, and calls populateNetworkResources. This function is used to initialize container's network resources. Now, the container's network initialization is complete.
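The deferred initialization can be summarized in a toy model: joining records the endpoint but cannot populate anything while the namespace is unknown, and setting the key later replays the population for every recorded endpoint. The names below (sandbox, endpoint, join, setKey, populate) are simplified stand-ins chosen for this sketch, with an empty nsPath playing the role of a nil osSbox.

```go
package main

import "fmt"

// endpoint is reduced to the name of the interface it provides.
type endpoint struct {
	iface string
}

// sandbox is a toy model: an empty nsPath plays the role of a nil osSbox.
type sandbox struct {
	nsPath    string
	endpoints []endpoint
	populated []string // interfaces already set up inside the namespace
}

// populate sets up the endpoint's interface, but only once the
// namespace is known; before that it is a no-op.
func (sb *sandbox) populate(ep endpoint) {
	if sb.nsPath == "" {
		return
	}
	sb.populated = append(sb.populated, ep.iface)
}

// join records the endpoint and attempts to populate it right away.
func (sb *sandbox) join(ep endpoint) {
	sb.endpoints = append(sb.endpoints, ep)
	sb.populate(ep)
}

// setKey supplies the namespace path and replays the population step
// for every endpoint that joined earlier. The sketch assumes it is
// called once, after the joins.
func (sb *sandbox) setKey(nsPath string) {
	sb.nsPath = nsPath
	for _, ep := range sb.endpoints {
		sb.populate(ep)
	}
}

func main() {
	sb := &sandbox{}

	// Container start: the endpoint joins before the namespace exists.
	sb.join(endpoint{iface: "eth0"})
	fmt.Println(len(sb.populated)) // prints 0

	// Pre-start hook: the namespace path arrives and population runs.
	sb.setKey("/proc/4242/ns/net")
	fmt.Println(sb.populated) // prints [eth0]
}
```

The same populate step is called from both places, which matches the way populateNetworkResources is reached from sbJoin and from SetKey in the excerpts.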

Conclusion

This article describes how the PouchContainer initializes the container's network with Libnetwork. The entire process can be summarized as follows:

When Pouchd starts, it creates a Libnetwork Controller, which listens on a unix socket and processes ExternalKey requests. Pouchd also initializes the bridge, null, and host networks.
When a container is created, Pouchd calls Libnetwork to create the sandbox and endpoints and performs the join operation for the endpoints. At this point the container has not been started and its network namespace does not exist, so the container's network resources are not initialized.
When the container starts, the pre-start hook of runC sends an ExternalKey request to the Libnetwork Controller. Libnetwork then completes the container's network initialization.
