Alibaba Open Source Container Image Acceleration Technology

Alibaba has recently open-sourced its cloud-native container image acceleration technology and launched the overlaybd image format. Compared with the traditional layered tar-package format, overlaybd enables network-based on-demand reading, which allows containers to start quickly.

This technical solution was originally part of Alibaba Cloud's internal DADI project. DADI stands for Data Accelerator for Disaggregated Infrastructure; it aims to provide a range of data access acceleration technologies for compute-storage-separated architectures. Image acceleration is a breakthrough application of the DADI architecture in the container and cloud-native field. Since its launch in 2019, the technology has been deployed on a large number of machines in production, with a cumulative total of over one billion container starts. It supports multiple business lines of Alibaba Group and Alibaba Cloud and greatly improves the efficiency of application release and scale-out. In 2020, the team published the paper "DADI: Block-Level Image Service for Agile and Elastic Application Deployment" at USENIX ATC '20 [1], and then launched the open source project, planning to contribute the technology to the community and, by establishing standards and setting an example, to attract more developers to the field of container and cloud-native performance optimization.

Background

With the explosive growth of Kubernetes and cloud native, large-scale enterprise adoption of containers has become increasingly widespread. Fast deployment and startup is one of the core advantages of containers. Fast startup means that the local image instantiation time is very short, that is, the "hot start" time is short. For a "cold start", however, when there is no local image, the image must be downloaded from the Registry before the container can be created. After long-term maintenance and updates, a business image can grow quite large, both in the number of layers and in overall size, often reaching hundreds of megabytes or several gigabytes. In a production environment, therefore, a container cold start often takes several minutes, and as the cluster scales up, network congestion can prevent the Registry from delivering images quickly.

For example, during an earlier Double 11 event, an application at Alibaba triggered an emergency scale-out due to insufficient capacity, but the overall scale-out took a long time because of excessive concurrency, which affected the experience of some users. By 2019, with DADI deployed in production, the total "image pull + container start" time of containers using the new image format was five times shorter than that of ordinary containers, and the p99 long-tail time was 17 times faster.

How to handle image data stored remotely is the key to solving the problem of slow container cold starts. Historically, the industry has tried two approaches: saving the container image on block storage or NAS to achieve on-demand reading, and using network-based distribution technologies (such as P2P) to download images from multiple sources or prefetch them onto hosts in advance, avoiding single-point network bottlenecks. In recent years, discussion of new image formats has gradually come onto the agenda. According to the research of Harter et al. [2], pulling images accounts for 76% of container startup time, while only 6.4% of that time is spent reading data. Image formats supporting on-demand reading have therefore become the clear trend. The stargz format [3] proposed by Google, whose full name is Seekable tar.gz, can, as the name implies, selectively seek to and extract specific files from the archive without scanning or decompressing the entire image. stargz is designed to improve the performance of image pulling: its lazy-pull technique avoids pulling the entire image file and enables on-demand reading. To further improve runtime efficiency, the stargz project also provides a containerd snapshotter plugin that further optimizes I/O at the storage layer.

In the container life cycle, the image must be mounted after it is ready. The core technology for mounting layered images is overlayfs, which merges multiple lower-level layer files in a stacked fashion and exposes a unified read-only file system upward. The block storage and NAS approaches mentioned above can generally be layered and stacked in the form of snapshots, and CRFS, which works with stargz, can also be seen as another implementation of overlayfs.

New image format

DADI does not use overlayfs directly; it only draws on the ideas of overlayfs and the early union file systems. Instead, it proposes a new block-based layered stacking technology called overlaybd, which provides a series of block-based merged data views for container images. The implementation of overlaybd is very simple, so many things that were previously desirable but impractical become possible. By contrast, implementing a fully POSIX-compatible file system interface is full of challenges and prone to bugs, as the development history of mainstream file systems shows.

In addition to simplicity, overlaybd has other advantages over overlayfs:

• It avoids the performance degradation caused by multi-layer images. For example, updating a large file in overlayfs triggers cross-layer copy-on-write: the system must first copy the file to the writable layer. Creating hard links is also slow.

• It makes it easy to collect block-level I/O patterns for recording and replay, so that data can be prefetched to further accelerate startup.

• The user's file system and host OS can be chosen flexibly, for example supporting Windows NTFS.

• It can use efficient codecs for online decompression.

• It can be offloaded to distributed storage in the cloud (such as EBS), so the image system disk can use the same storage scheme as data disks.

• Overlaybd has natural support for a writable layer (RW); read-only mounts may even become a thing of the past.

Overlaybd principle

To understand the principle of overlaybd, we first need to understand the layering mechanism of container images. A container image consists of multiple incremental layer files, which are superimposed when used, so that only layer files need to be distributed when the image is distributed. Each layer is essentially a compressed archive of the differences from the previous layer (file additions, modifications, or deletions). The container engine stacks these differences in an agreed way through its storage driver and mounts the result read-only at a specified directory, called the lower_dir; the directory of the writable layer, mounted in read/write mode, is generally called the upper_dir.

Note that overlaybd itself has no concept of files. It simply abstracts the image as a virtual block device, on which a regular file system is mounted. When an application reads data, the read request is first handled by the regular file system, which converts it into one or more reads of the virtual block device. These reads are forwarded to a user-space receiver process, the runtime carrier of overlaybd, and finally converted into random reads of one or more layers.

Like a traditional image, an overlaybd image still retains the layer structure, but the content of each layer is a series of data blocks corresponding to the file-system change differences. Overlaybd provides a merged view upward. The stacking rule for layers is very simple: for any data block, the last change always wins, and blocks never changed in any layer are treated as all-zero blocks. Overlaybd also provides the ability to export a series of data blocks as a layer file, which is high-density, non-sparse, and indexable. Consequently, a read over a contiguous LBA range of the block device may contain small data segments that originally belong to multiple layers; we call these small data segments "segments". From a segment's attributes we can find its layer number and then map the read to that layer's layer file. Just as a traditional container image can store its layer files in a Registry or object storage, so can an overlaybd image.
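The stacking rule ("last change wins, untouched blocks read as zeros") can be sketched as follows. The block size and layer contents here are invented for illustration; real block sizes are much larger.

```python
BLOCK = 4  # toy block size in bytes; real formats use far larger blocks

def read_block(layers, idx):
    """Merged view of stacked block layers: the last layer that wrote a
    given block wins; blocks never written by any layer read as all zeros."""
    for layer in reversed(layers):
        if idx in layer:
            return layer[idx]
    return b"\x00" * BLOCK

# Hypothetical layers: each maps block index -> block data.
layer1 = {0: b"AAAA", 1: b"BBBB"}   # base layer
layer2 = {1: b"bbbb"}               # upper layer overwrites block 1

print(read_block([layer1, layer2], 1))  # upper layer's data wins
print(read_block([layer1, layer2], 2))  # never written: all-zero block
```

A read spanning several block indices would simply concatenate the per-block results, each possibly resolved from a different layer, which is exactly what the "segments" above capture.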

For better compatibility, overlaybd wraps a tar-file header and trailer around the outermost part of each layer file, so that the layer masquerades as a tar file. Since the tar contains only one file, on-demand reading is not affected. At present, docker, containerd, and buildkit all run untar and tar steps by default when downloading or uploading images, and avoiding those steps without intrusive code changes would be difficult. The tar camouflage therefore helps unify compatibility and workflow: image conversion, building, and full downloads require no code modifications, only plugins.
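The idea of masquerading an opaque blob as a single-member tar archive can be demonstrated with Python's standard `tarfile` module. The file name `layer.obd` and the payload are made up for the example; this only illustrates the wrapping concept, not overlaybd's actual on-disk layout.

```python
import io
import tarfile

def wrap_as_tar(name, payload):
    """Wrap an opaque blob as a tar archive with a single member, mimicking
    how an overlaybd layer masquerades as an ordinary tar layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

blob = b"\x00\x01 block-device image bytes \x02\x03"  # stand-in layer data
archive = wrap_as_tar("layer.obd", blob)

# Standard tooling still sees a valid tar with exactly one member.
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    print(tar.getnames())
```

Because the payload sits at a fixed offset after the tar header, a reader that knows the format can still address it directly for on-demand reads.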

Overall architecture

The overall architecture of DADI is shown in the figure below. Each component is described in turn.

containerd snapshotter

Containerd has provided preliminary support for starting containers from remote images since version 1.4, and Kubernetes has explicitly deprecated Docker as a runtime. The open source version of DADI therefore supports the containerd ecosystem first, with Docker support to follow.

The core function of a snapshotter is to implement an abstract service interface used for mounting and unmounting container rootfs. Its design replaces the module called graphdriver in earlier versions of Docker, simplifying storage drivers and accommodating both block-device snapshots and overlayfs.

The overlaybd snapshotter provided by DADI enables the container engine to support the new overlaybd image format, that is, to mount the virtual block device at the corresponding directory. It is also compatible with traditional OCI tar-format images, allowing users to continue running ordinary containers with overlayfs.

iSCSI target

iSCSI is a widely supported remote block device protocol. It is stable, mature, high-performance, and can recover from failures. The overlaybd module serves as the backend storage of the iSCSI protocol: even if the process crashes unexpectedly, it can be recovered simply by restarting it. File-system-based image acceleration schemes such as stargz, by contrast, cannot be recovered this way.

The iSCSI target is the runtime carrier of overlaybd. In this project we have implemented two target modules: the first is based on the open source project tgt [4]; thanks to its backing-store mechanism, the code can be compiled into a dynamic link library loaded at runtime. The second is based on the Linux kernel's LIO SCSI target (also known as TCMU) [5]; the entire target runs in kernel space and can easily expose virtual block devices.

ZFile

ZFile is a data compression format that supports online decompression. It splits the source file into fixed-size blocks, compresses each block separately, and maintains a jump table that records the physical offset of each compressed block within the ZFile. To read data from a ZFile, one simply looks up the index to find the corresponding location and decompresses the relevant blocks.
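A minimal sketch of this layout, using zlib instead of lz4/zstd and an invented block size, might look like the following. It is a toy model of the described design, not the real ZFile format.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # illustrative fixed block size

def zfile_build(data):
    """Compress data block by block and record a jump table of physical
    offsets, loosely modeled on the ZFile layout described above."""
    blocks, offsets, pos = [], [], 0
    for i in range(0, len(data), BLOCK_SIZE):
        comp = zlib.compress(data[i:i + BLOCK_SIZE])
        offsets.append(pos)
        blocks.append(comp)
        pos += len(comp)
    offsets.append(pos)  # sentinel marking the end of the last block
    return b"".join(blocks), offsets

def zfile_read(blob, offsets, offset, length):
    """Random read: locate the covering blocks via the jump table and
    decompress only those blocks, not the whole file."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    out = b"".join(
        zlib.decompress(blob[offsets[i]:offsets[i + 1]])
        for i in range(first, last + 1)
    )
    start = offset - first * BLOCK_SIZE
    return out[start:start + length]

data = bytes(range(256)) * 1024          # 256 KiB of sample data
blob, table = zfile_build(data)
assert zfile_read(blob, table, 100_000, 16) == data[100_000:100_016]
```

The jump table is what makes the format indexable: a read of any range touches only the compressed blocks that cover it.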

ZFile supports various efficient compression algorithms, including lz4 and zstd. It offers extremely fast decompression at low cost, effectively saving storage space and data transmission. Experimental data show that decompressing remote ZFile data on demand outperforms loading uncompressed data, because the time saved in transmission exceeds the extra cost of decompression.

Overlaybd supports exporting layer files to ZFile format.

cache

As mentioned above, layer files are stored in the Registry, so a container's read I/O on the block device is mapped to requests to the Registry (making use of the Registry's support for HTTP Partial Content). Thanks to the cache mechanism, however, this does not last forever: some time after the container starts, the cache automatically downloads the layer files and persists them to the local file system. On a cache hit, a read is served locally instead of being sent to the Registry.
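The read path can be sketched as a cache-or-fetch decision. Here `fetch_range` is a hypothetical stand-in for a ranged HTTP request to the Registry, and the in-memory dict stands in for the persistent local cache.

```python
class CachedLayer:
    """Sketch of the read path: serve reads from a local cache when present,
    otherwise fall back to a ranged fetch from the Registry."""

    def __init__(self, fetch_range):
        self.fetch_range = fetch_range  # e.g. an HTTP GET with a Range header
        self.cache = {}                 # (offset, length) -> bytes

    def read(self, offset, length):
        key = (offset, length)
        if key not in self.cache:       # cache miss: go to the Registry
            self.cache[key] = self.fetch_range(offset, length)
        return self.cache[key]          # cache hit: served locally

remote = bytes(range(256)) * 16         # stand-in for a remote layer file
calls = []

def fetch_range(off, n):                # simulated Registry range request
    calls.append((off, n))
    return remote[off:off + n]

layer = CachedLayer(fetch_range)
layer.read(128, 32)
layer.read(128, 32)                     # second read never hits the network
print(len(calls))
```

In the real system the cache additionally downloads whole layer files in the background and persists them, so that eventually no read needs the network at all.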

Industry leading

On March 25, the research firm Forrester released its evaluation of Function-as-a-Service platforms for the first quarter of 2021. Alibaba Cloud stood out with the strongest product capability worldwide, receiving the highest score in eight evaluation dimensions and becoming a global FaaS leader alongside Amazon AWS. This is also the first time a Chinese technology company has entered the FaaS leaders quadrant.

Containers are the foundation of a FaaS platform, and container startup speed determines the performance and response latency of the entire platform. DADI helps Alibaba Cloud's Function Compute product significantly reduce container startup time by 50% to 80% [6], bringing a new serverless experience.

Summary and outlook

Alibaba's open source DADI container acceleration project and its overlaybd image format help meet containers' need for rapid startup in the new era. Going forward, the project team will work with the community to integrate with mainstream tool chains and actively participate in developing new image format standards, with the goal of making overlaybd one of the OCI remote image format standards.
