The Code of iLogtail Is Completely Open-Source

By Xunfei and Yemo, from Alibaba Cloud Storage Team

On June 29, 2022, Alibaba Cloud launched the first major update of iLogtail after making it open-source. The full-featured iLogtail Community Edition was released. All the C++ core code was open-sourced. This version is aligned with the Enterprise Edition for the first time in terms of core capabilities. Therefore, developers can build an iLogtail cloud-native observability data collector with the same performance as the Enterprise Edition. This release adds many important features, such as log file collection, container file collection, lock-free event processing, multi-tenant isolation, and the new pipeline-based configuration method, to comprehensively enhance the usability and performance of the Community Edition. Developers are welcome to build it together.

Challenges to Observability Data Collection

Observability is the ability to measure the internal state of the system by checking its output. This term originated from the control theory years ago, first proposed by a Hungarian engineer named Rudolf Kalman. In distributed IT systems, observability typically uses various types of telemetry data (logs, metrics, and traces) to measure infrastructure, platforms, and applications to understand their operating status and processes. The collection of the data is usually completed by a collection agent that runs together with the observed objects. Under the modern system architecture based on cloud-native and microservices, these observation objects are more distributed, more numerous, and change faster than before. Thus, collection agents are faced with the following challenges:

High Performance and Low Overhead: Modern applications usually run in thousands of servers, virtual machines, and containers in data centers, public clouds, and edges. Therefore, each performance and overhead of collection agents will be magnified by thousands. Currently, the design of many open-source agents focuses more on function than performance. The speed of single-core processing is generally about 2-10 M/s, while it is expected to reach 100 M/s. With the increases in collection targets and the amount of data, the collection delay, and server exceptions, the open-source agent memory will explode, and we hope that the memory can remain at a low level in various environments.
Stable and Accurate Collection, Multi-Level Isolation of Faults: The observability needs to be at least an order of magnitude more reliable than the observed systems. In addition to ensuring the accuracy of data collection, the stability of data collection agents also needs to ensure that business applications cannot be affected. Otherwise, it will bring disastrous consequences. On the other hand, no matter how the problem occurs, it is necessary to isolate the problem as much as possible. For example, if one of the configurations on an agent encounters problems, the problems cannot affect other configurations.
Large-Scale Configuration of Enterprise-Level Management and Control: The application range of observability data is wide. An enterprise often has a large number of configurations. The agent needs to provide capabilities for centralized and automated configuration management instead of manually logging on to the machine to modify the configurations. In addition, it is required to ensure that the data is not lost or duplicated during the configuration of reloading. When the agent has multiple collection configurations, a reasonable arrangement of resources is required. You give priority to the memory bandwidth and other resources to the high-priority configuration and ensure the low-priority configuration is not starved to death. It needs to have enough bursting ability to quickly catch up with the data after the peak.
More Native and Friendly Support of Kubernetes: Kubernetes provides powerful capabilities for O&M deployment, auto scaling, and fault recovery, which facilitates the development and management of distributed systems. However, problems with log collection also arise. The diverse data output of Kubernetes makes it necessary for the same agent to collect multiple data sources (such as host logs, container logs, and container stdout). The auto scaling capability of Kubernetes for business deployment requires agents to have the ability to dynamically discover and tag containers. At the same time, it also puts forward higher requirements on how to ensure the integrity of data collection.

We choose to develop iLogtail because there is no open-source agent that can solve the challenges above perfectly. The core position of iLogtail is the collector of observability data, which helps developers build a unified business data collection layer and helps the observable platform create upper-layer application scenarios.

An Introduction to iLogtail

iLogtail is a collection agent of observable data developed by the Alibaba Cloud Log Service (SLS) Team. iLogtail has many production-level features (such as lightweight, high performance, and automated configuration). It can be deployed in multiple environments (such as physical machines, virtual machines, and Kubernetes) to collect telemetry data. iLogtail serves tens of thousands of customers on Alibaba Cloud for observability collection of hosts and containers. It is also used in the core product lines of Alibaba Group (such as Taobao, Tmall, Alipay, Cainiao, and Amap) as the default collection tool for logs, metrics, traces, and other observability data. iLogtail has been installed on tens of millions of servers and collects tens of petabytes of observability data per day. It is widely used in various scenarios (such as online monitoring, problem analysis and location, operation analysis, and security analysis). Its powerful performance and stability are verified in practice.

Development of iLogtail

iLogtail originated from the Shennong project of Alibaba Cloud. It was created in 2013 and has been evolving since. In the early days, facing the requirements of Alibaba Cloud and early customers for O&M and observability, iLogtail mainly dealt with the challenges of single-machine, small-scale clusters, and large-scale O&M and monitoring. At that time, iLogtail already had capabilities for basic file discovery and rotation processing. Therefore, it can realize real-time the collection of logs and monitoring with a latency of milliseconds. The speed of single-core processing was about 10 M/s. The web frontend supported the automatic distribution of centralized configuration files. It supported the deployment scale of over 30,000, collects thousands of configuration items, and achieves efficient collection of 10TB of daily data.

In 2015, Alibaba began to migrate Alibaba Group and Ant Financial services to the cloud. Faced with the challenges of super-large data volume from nearly 1,000 teams, millions of terminals, and Double 11 and Double 12 Shopping Festivals, iLogtail needed to make huge improvements in terms of features, performance, stability, and multi-tenant support. Around 2017, iLogtail could parse logs in multiple formats, such as regex, delimiter, and JSON. It supported multiple log encoding methods and advanced processing capabilities such as data filtering and desensitization. The speed of single-core processing in the simple mode increased to 100 M/s, and the performance for processing regex, delimiter, and JSON reached 20 M/s or more. In terms of collection reliability, file discovery polling mode was added, rotation queue order was guaranteed, log cleaning and loss protection was provided, and CheckPoint was enhanced. In terms of process reliability, exception automatic recovery, crash automatic reporting, and daemon process were added. It supported the deployment scale of over 1 million, thousands of tenants, and over 100,000 collection configuration items through the full-process multi-tenant isolation, multi-level high- and low-level queues, configuration-level, and process-level traffic control, and temporary degradation mechanisms, realizing the stable collection of several petabytes of data per day.

As Alibaba migrates its core business to the cloud and promotes the commercialization of SLS (to which iLogtail belongs on Alibaba Cloud), iLogtail begins to embrace cloud-native in an all-around way. Faced with a diverse cloud environment, a rapidly developing open-source ecosystem, and a large number of industrial customers, iLogtail has shifted its focus of development to solving problems, such as how to adapt to a cloud-native environment, how to be compatible with open-source protocols, and how to deal with fragmentation requirements. In 2018, iLogtail supported Docker container collection. In 2019, it supported containerd container collection. In 2020, it fully upgraded metric collection. In 2021, the support for trace was added. With comprehensive support for containerization, Kubernetes Operator control, and extensible plug-in systems, iLogtail supports the deployment scale of over ten million, tens of thousands of internal and external customers, and millions of collection configuration items, achieving a stable collection of tens of petabytes of data per day.

In November 2021, iLogtail had the code of the Golang plug-in open-source, taking the first step towards open-source. Since then, it has attracted the attention of hundreds of developers, and many of them have contributed processor and flusher plug-ins. Today, the core code of C++ is also officially open-source. As a result, developers can build a complete cloud-native observability data collection scheme based on this version.

Benefits of iLogtail

There are many open-source collectors for gathering observability data, such as Logstash, Fluentd, and Filebeats. These collectors have a wide range of features, but iLogtail has advantages due to its unique design in terms of key features, such as performance, stability, and control capabilities.

table1

Important Features of the C++ Kernel

This time, the open-source iLogtail C++ kernel is fully aligned with the features of its Enterprise Edition, and a pipeline-based simple collection configuration item is added to improve the usability of the Community Edition.

Collection of Log Files

The core source code of C++ contains a fully functional file discovery mechanism. It can support dynamic log file monitoring of wildcards and multi-level directories and support log rotation, log count, and rotation settings. iLogtail uses iNotify as the main method of file monitoring in Linux to provide data discovery with millisecond latency. It also uses polling as the data discovery method to adapt to different operating systems and support various special collection scenarios. iLogtail creates a file discovery mechanism that has both performance advantages and robustness using a hybrid mode of polling and event.

Collection of Container Logs

The iLogtail C++ kernel works with the plug-in system to support container data collection in all scenarios. iLogtail uses plug-ins to discover the container list of nodes and maintain the mapping between containers and log collection paths. It combines the efficient capability for file collection of the C++ kernel to provide the ultimate container data collection experience. iLogtail allows you to filter containers using container tags, environment variables, Kubernetes tags, pod names, and namespaces. In addition, iLogtail provides the capability for conveniently configuring collection sources. It supports multiple deployment methods, such as DaemonSet, Sidecar, and CRD, providing flexible deployment capabilities for different scenarios. iLogtail uses a global container list and the Kubernetes CRI protocol to obtain container information. It is lighter in terms of permissions and component dependencies and has higher collection efficiency than other open-source agents.

Processing of Lock-Free Events

One of the secrets of iLogtail to achieve such high throughput is a lock-free event processing model. Unlike other open-source agents in the industry that allocate an independent thread or Goroutine for each configuration to read data, iLogtail only configures one thread to read data. Since the problem of data reading is not computing but disk, a single thread is sufficient to complete event processing and data reading of all configurations. Using a single thread enables iLogtail to run event processing and data reading in a lock-free environment, and the data structure is lighter, thus achieving better performance than multi-threaded processing.

Multi-Tenant Isolation

It is normal for a server to have hundreds of collection configurations in the production environment. The priority, log generation speed, processing method, and upload destination of each configuration may be different. Therefore, it is necessary to solve the problem of how to isolate various custom configurations and ensure that the collection configuration QoS is not affected by some configuration exceptions. iLogtail adopts several key technologies (such as time slice-based collection and scheduling, multi-level high- and low-level feedback queues, event non-blocking processing, throttling and shutdown strategies, and dynamic configuration updates) and achieves a multi-tenant isolation scheme with isolation, fairness, reliability, controllability, and cost-effectiveness. The tests of peak traffic in Double 11 over the years prove this solution has greater stability and cost-effective advantages than other open-source agents.

New Pipeline-Based Configuration Mode

Simple and intuitive configuration files are crucial to the use of agents. In the early days, iLogtail almost relied on graphical configuration. The default JSON configuration files are redundant and difficult to understand. This upgrade uses the YAML format. In addition to the advantages of strong readability, less string escape, support for multiple lines of text, and the ability to add comments, the configuration file is re-divided into four parts: inputs, processors, aggregators, and flushers, based on the latest data pipeline architecture of iLogtail. The feature configuration is focused on rather than implementation details, and the naming specification of configuration items is unified to reduce requirements for the configuration of iLogtail. This is an example of the simplest configuration:

enable: true inputs:   - Type: file_log     LogPath: /log     FilePattern: simple.log flushers:   - Type: flusher_stdout

Future Plans

In November 2021, we made the Golang plug-in open source, which is the most versatile and scalable part of iLogtail. We received a lot of attention and suggestions from developers. It has been added to favorites over 600 times and received more than 60 suggestions and over 120 times of PR. The C++ kernel module is the main factor that iLogtail has advantages over other open-source collection software in terms of performance and resource usage. It is expected that the open-source C++ kernel can help improve the resource efficiency for more enterprises, enrich the product ecology of iLogtail, and attract more developers to participate in community construction.

We believe that being open source is the best development strategy for iLogtail to release its greatest value in the current cloud-native era. As the most basic software in the observable field, iLogtail still has many scenarios to be discovered in various industries. We hope to construct iLogtail together with the open-source community, continue to optimize it, and strive to make it a world-class observable data collector.

Documentation

GitHub: https://github.com/alibaba/ilogtail
User Manual of iLogtail Community Edition: https://ilogtail.gitbook.io/ilogtail-docs
Official Website of iLogtail Enterprise Edition: https://www.alibabacloud.com/help/en/log-service/latest/logtail-collection

Community

The Code of iLogtail Is Completely Open-Source

Challenges to Observability Data Collection

An Introduction to iLogtail

Development of iLogtail

Benefits of iLogtail

Important Features of the C++ Kernel

Collection of Log Files

Collection of Container Logs

Processing of Lock-Free Events

Multi-Tenant Isolation

New Pipeline-Based Configuration Mode

Future Plans

Documentation

Read previous post:

Read next post:

Alibaba Cloud Community

You may also like

Comments

Alibaba Cloud Community

Related Products

Managed Service for Prometheus

Cloud-Native Applications Management Solution

Log Service

Log Management for AIOps Solution