Best Practices of Kubernetes Log Collection

Zhang Cheng, a storage service technical expert, talks about log collection and the purpose of log output.

By Zhang Cheng (Yuanyi), Alibaba Cloud Storage Service Technical Expert

Building on the considerations for Kubernetes log output described in the previous article, this article focuses on the ultimate purpose of log output: collecting and analyzing logs in a unified manner. Log collection in Kubernetes differs greatly from collection on conventional virtual machines. It is harder to implement and more expensive to deploy, but when used properly it delivers a higher degree of automation and lower O&M costs than conventional methods. This is the fourth article in the Kubernetes-related series.

Challenges in Kubernetes Log Collection

Log collection in Kubernetes is far more complicated than in conventional virtual machines or physical machines. The fundamental reason is that Kubernetes hides the underlying infrastructure and its failures and schedules resources at a finer granularity in order to deliver an environment that is both stable and dynamic. As a result, log collection in Kubernetes must cope with more diverse and more dynamic environments and has more factors to consider.

For example:

  • For a Job-type application whose containers may run for only a few seconds, how can logs be collected in real time without losing any data?
  • Kubernetes nodes are generally large, and a single node may run anywhere from 10 to well over 100 containers. How can logs be collected from all of these containers with the lowest possible resource consumption?
  • In Kubernetes, applications are deployed declaratively with YAML files, whereas most log collectors still rely on manually maintained configuration files. How can log collectors be deployed and configured in Kubernetes in the same declarative way?

Collection Method: Proactive or Passive

Log collection methods are divided into passive collection and proactive push. In Kubernetes, passive collection includes Sidecar and DaemonSet, and proactive push includes DockerEngine and business direct writing.

  • DockerEngine provides the LogDriver capability. You can configure different LogDrivers to write container stdout to remote storage through DockerEngine. Because this method offers poor customizability, flexibility, and resource isolation, it is generally not recommended for production environments.
  • Direct writing integrates a log collection software development kit (SDK) into the application, which sends logs straight to a log server. This method requires no additional agent and no logic for flushing logs to disk, so it consumes the fewest system resources. However, it tightly couples the business code to the log SDK, which greatly reduces overall flexibility, so it is used only in scenarios with a very large volume of logs.
  • In DaemonSet mode, only one log agent runs on each node and collects all logs from that node (a minimal sketch follows these bullets). DaemonSet consumes far fewer resources but offers limited scalability and tenant isolation. It suits clusters with a single purpose or a small number of businesses.
  • In Sidecar mode, a separate log collection agent is deployed in each pod and collects logs for only that business application. Sidecar consumes more resources but offers strong scalability and tenant isolation. It suits large Kubernetes clusters or clusters that serve multiple business parties as Platform-as-a-Service (PaaS) platforms.
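As a reference for the DaemonSet mode, the following is a minimal sketch of a node-level log agent deployed as a DaemonSet. The agent name, image, resource figures, and host paths are placeholders for illustration; the concrete solution described later in this article installs Logtail through a Helm package rather than a hand-written manifest.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent                # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
        - name: log-agent
          image: example.com/log-agent:latest    # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
          volumeMounts:
            - name: varlog                       # log files on the host
              mountPath: /var/log
              readOnly: true
            - name: containers                   # stdout JSON files written by DockerEngine
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
```

Because a single agent per node serves every pod on that node, the resource cost stays low, but all applications share this one agent and its configuration.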

In summary:

  • DockerEngine direct writing is not recommended in general scenarios.
  • Business direct writing is recommended for scenarios with a large number of logs.
  • DaemonSet applies to small- and medium-sized clusters.
  • Sidecar is recommended for ultra-large clusters.

The following table summarizes the comparison between these collection methods:

[Table: comparison of the DockerEngine, direct writing, DaemonSet, and Sidecar collection methods]

Log Output: Stdout or Files

Unlike virtual machines or physical machines, containers in Kubernetes offer two log output modes: standard output and files. In standard output mode, logs go directly to stdout or stderr; DockerEngine takes over the stdout and stderr file descriptors and processes the received logs according to the LogDriver rules configured for DockerEngine. In file mode, logs are written to files much as on physical machines or virtual machines, and the files can be stored in different ways, such as the container's default storage, EmptyDir, HostVolume, or NFS.

Docker officially recommends using stdout to output logs. However, this recommendation is based on scenarios where containers are used as simple applications. In business scenarios, we recommend that you use the file mode whenever possible. The main reasons are:

  • Stdout affects system performance. In stdout mode, logs pass through several stages (for example, with the commonly used JSON LogDriver) on the way from the application to the log server: application stdout > DockerEngine > LogDriver > serialization to JSON > saving to a file > collection by the agent > JSON parsing > uploading to the server. Compared with the file mode, this pipeline causes much greater overhead. In our stress tests, every additional 100,000 log entries per second written to stdout consumed roughly one extra CPU core in DockerEngine.
  • Stdout does not support classification. That is, all output is mixed in a stream, instead of being classified into different files. An application usually generates access logs, error logs, interface logs (logs of external interface calls), and trace logs. These logs are different from each other in their formats and uses. It is difficult to collect and analyze these logs when they are mixed into one stream.
  • Stdout only supports the output of the main process in a container. It is not available to programs that run as daemons or forked child processes.
  • File output can be controlled with various policies, such as synchronous or asynchronous writing, cache size, file rotation, compression, and cleanup, making the output far more flexible.

We recommend using file mode for outputting logs of online applications and using stdout only for single-function applications, Kubernetes systems, or O&M components.
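As a minimal sketch of the file mode, the pod below writes its log files to an emptyDir volume instead of stdout. The image, environment variable, and paths are hypothetical and only illustrate the pattern.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                           # hypothetical application
spec:
  containers:
    - name: app
      image: example.com/demo-app:latest   # placeholder image
      env:
        - name: LOG_DIR                    # hypothetical switch telling the app where to write its files
          value: /var/log/app
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app          # access, error, and trace logs go to separate files here
  volumes:
    - name: app-logs
      emptyDir: {}                         # could also be a hostPath or NFS volume, as noted above
```

A DaemonSet or Sidecar log agent can then collect the files from this volume, while the application keeps full control over rotation, compression, and cleanup.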

CICD Integration: Logging Operator

Kubernetes provides a standard, declarative deployment process for businesses. You can use YAML manifests (the Kubernetes API) to declare routing rules, expose services, mount storage, run workloads, and define scaling rules, so Kubernetes is easy to integrate with CICD systems. Log collection is another important part of O&M and monitoring: after a business goes online, all of its logs must be collected in real time.

With the traditional approach, you have to manually deploy the log collection logic after a service is published, which defeats the purpose of CICD automation. To automate this step, some engineers wrap the log collection API or SDK in a deployment service of their own and trigger it through a CICD webhook after the service is published, but this approach is costly to build and maintain.

Kubernetes provides a standard way to integrate log collection: log configuration is registered as a new resource type in Kubernetes and managed through an operator built on Custom Resource Definitions (CRDs). With this approach, the CICD system needs no additional development; you only need to add the log-related configuration when integrating the CICD system with Kubernetes.

Kubernetes Log Collection Solutions

Log collection solutions for container environments predate Kubernetes. As the performance of Kubernetes became more stable over time, we began to migrate many businesses to the Kubernetes platform and developed a log collection solution dedicated to Kubernetes. This solution provides the following benefits:

  • Supports real-time collection of various data including container files, container stdout, host files, journals, and events.
  • Supports multiple deployment methods including DaemonSet, Sidecar, and DockerEngine LogDriver.
  • Supports the enrichment of logs, including adding namespace, pod, container, image, and node information.
  • Stable and highly reliable: the solution is built on Logtail, the collection agent developed by Alibaba Cloud, which has been deployed on millions of instances.
  • Supports CRD-based extension, so log collection rules can be deployed and published through standard Kubernetes deployments, enabling seamless integration with CICD.

Install Log Collection Components

This collection solution is now available to the public. We provide a Helm installation package that contains the Logtail DaemonSet, the AliyunLogConfig CRD declaration, and the CRD controller. After installation, you can immediately use the DaemonSet collection mode and CRD-based configuration. You can install the components in either of the following ways:

  1. Select the Install check box when creating an Alibaba Cloud Kubernetes cluster to automatically install the components. If you have created a Kubernetes cluster but did not install the components, see manually install Log Service components.
  2. For a user-created Kubernetes cluster, whether it is on Alibaba Cloud, other clouds, or an on-premises IDC, you can use this collection solution. For more information about the installation method, see Install the Logtail component.

Once the components are installed, Logtail and the corresponding controller run in the cluster, but they do not collect any logs by default. You must configure collection rules to collect logs from the specified pods.

Configure Collection Rules: Environment Variables or CRDs

In addition to manual configuration in the Log Service console, Kubernetes supports two other ways to configure collection rules: environment variables and CRDs.

  • The environment variable method has been used since the era of Swarm clusters. You only need to declare the destination of the data to be collected in the environment variables of the container, and Logtail then automatically collects the data and sends it to the server.

This method is easy to deploy and has a low learning cost. However, it supports few configuration options and none of the advanced ones, such as parsing methods, filtering, blacklists, and whitelists. Moreover, a collection target declared this way cannot be modified or deleted: to change the destination, you have to create a new collection configuration and manually clean up the old one to avoid wasting resources.
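For reference, here is a minimal sketch of the environment variable method. The variable names follow the aliyun_logs_{key} convention described in the Log Service documentation; the logstore names, image, and file path are made-up values for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: example.com/demo-app:latest       # placeholder image
      env:
        # Collect this container's stdout into a logstore named "app-stdout".
        - name: aliyun_logs_app-stdout
          value: stdout
        # Collect files matching this path into a logstore named "app-file".
        - name: aliyun_logs_app-file
          value: /var/log/app/*.log
```

Logtail detects these variables and creates the corresponding collection configurations and logstores automatically; beyond that, little else can be tuned.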

  • The CRD-based configuration method is a standard extension mechanism that closely follows the official Kubernetes recommendations. It lets you manage collection configurations as Kubernetes resources: you declare the data to be collected by deploying AliyunLogConfig resources to Kubernetes.

The following is a sample configuration for collecting container standard output. In this example, both stdout and stderr are collected, excluding containers whose environment variables contain COLLECT_STDOUT_FLAG: false.
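A sketch of such a configuration is shown below. It follows the AliyunLogConfig format in the Log Service documentation, but field names may vary slightly between versions, and the resource and logstore names here are illustrative.

```yaml
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: simple-stdout-example
spec:
  # Logstore to which the collected logs are written.
  logstore: k8s-stdout
  logtailConfig:
    inputType: plugin
    configName: simple-stdout-example
    inputDetail:
      plugin:
        inputs:
          - type: service_docker_stdout
            detail:
              Stdout: true
              Stderr: true
              # Skip containers that carry the environment variable COLLECT_STDOUT_FLAG=false.
              ExcludeEnv:
                COLLECT_STDOUT_FLAG: "false"
```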

The CRD-based method lets you manage your configurations as standard Kubernetes extension resources. It supports full semantics for adding, deleting, modifying, and querying configurations, as well as a variety of advanced settings. In short, the CRD-based configuration method is highly recommended for data collection.

Recommended Methods of Configuring Collection Rules

[Figure: recommended combination of DaemonSet and Sidecar collection configurations]

In practice, DaemonSet is used either alone or in combination with Sidecar. DaemonSet gives you the highest resource utilization, but all Logtail instances deployed by the DaemonSet share one global set of configurations, and a single Logtail instance can only handle a limited number of them, so DaemonSet alone cannot support clusters with a very large number of applications. The recommended configuration is shown in the preceding figure. The core ideas are:

  • Collect as much similar data as possible with a single configuration to reduce the number of configurations and relieve pressure on the DaemonSet (see the file collection sketch after this list).
  • Allocate sufficient resources for collecting data from core applications. You can use the Sidecar mode for this purpose.
  • Use the CRD-based configuration method whenever possible.
  • In Sidecar mode, each Logtail component is separately configured, so there is no limit on the number of configurations. This mode applies to ultra-large clusters.
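To illustrate the first point, the following sketch uses one file collection configuration to cover every container that opts in, rather than one configuration per application. It is based on the AliyunLogConfig file collection format in the Log Service documentation; field names may differ between versions, and the path, logstore, and environment variable are hypothetical.

```yaml
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  name: all-app-files            # hypothetical name
spec:
  logstore: k8s-app-files        # hypothetical shared logstore
  logtailConfig:
    inputType: file
    configName: all-app-files
    inputDetail:
      logType: common_reg_log
      logPath: /var/log/app      # hypothetical directory convention shared by many applications
      filePattern: "*.log"
      dockerFile: true           # collect from inside containers rather than from the host
      dockerIncludeEnv:
        COLLECT_APP_FILES: "true"   # only containers that set this (hypothetical) variable are collected
```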

Practice 1: Small- and Medium-sized Clusters

The vast majority of Kubernetes clusters are small or medium-sized. Although there is no strict definition, such a cluster generally contains fewer than 500 applications and fewer than 1,000 nodes, and usually has no dedicated Kubernetes platform O&M team. Since there are not many applications, DaemonSet is sufficient to support all collection configurations:

  • Use the DaemonSet mode to collect data from most business applications.
  • Use the Sidecar mode to separately collect data from core applications that require highly reliable log collection, such as order systems and transaction systems.

Practice 2: Large Clusters

Some large and ultra-large clusters are used as PaaS platforms, hosting more than 1,000 businesses on more than 1,000 nodes, and have dedicated Kubernetes platform O&M teams. Because the number of applications in this scenario is effectively unbounded, DaemonSet alone cannot carry all the configurations, so the Sidecar mode must be used as well. The overall plan is:

  • Use the DaemonSet mode to collect the Kubernetes platform's system component logs and kernel logs, since these types of logs are relatively fixed and are mainly consumed by platform O&M personnel.
  • Use the Sidecar mode to collect business logs. You can set a Sidecar collection destination address for each business to provide high flexibility for business DevOps personnel.
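A minimal sketch of such a Sidecar pod is shown below: the business container and the log agent container share an emptyDir volume, and the agent uploads the files the business writes there. The images, volume name, and environment variable are placeholders; refer to the Sidecar documentation for the actual Logtail image and the variables it requires.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: order-service                            # hypothetical core business application
spec:
  containers:
    - name: app
      image: example.com/order-service:latest    # placeholder business image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app                # the business writes its log files here
    - name: logtail-sidecar
      image: example.com/logtail:latest          # placeholder; use the documented Logtail Sidecar image
      env:
        - name: LOG_PROJECT                      # hypothetical variable; the real Sidecar setup defines
          value: business-a-project              # its own project, logstore, and machine-group variables
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app                # the agent reads the same volume and uploads the files
          readOnly: true
  volumes:
    - name: app-logs
      emptyDir: {}
```

Because each business pod carries its own agent and configuration, every business party can point its logs at a different project or logstore without affecting the others.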

The first article of this blog series is available here.
