By Yitao
In the field of log analysis, Logtail, a widely used log collection tool, can significantly enhance overall efficiency with any performance improvements. Recently, during performance testing of Logtail, the author noticed that when the regular expression at the beginning of the line is enabled to process multi-line logs, the collection performance deteriorates. What exactly causes this phenomenon? Next, let's explore the secrets behind the performance improvement of Logtail's multi-line log collection.
To understand this phenomenon, you must first understand how Logtail works when processing multi-line logs. Logtail's multi-line log merging feature aggregates scattered multi-line data into complete events based on a specific log format. The following process shows how this feature works:
For example, assume that we have the following log format. Generally, we specify cnt.* as the regular expression at the beginning of the line. Logtail uses this regular expression to match each line and merges these single-line logs into a complete multi-line log.
cnt:13472391,thread:2,log:Exception in thread "main" java.lang.NullPointerExceptionat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitle
at com.example.myproject.Book.getTitle
at com.example.myproject.Book.getTitle
at com.example.myproject.Book.getTitle
Logtail uses the boost::regex_match
function to perform the regular expression match. This regular function performs a full match on the input log buffer based on the input regular expression reg. For example, in the preceding log, cnt.* matches all 1072 characters in the first line.
bool BoostRegexMatch(const char* buffer, size_t size, const boost::regex& reg, string& exception) {
// ...
if (boost::regex_match(buffer, buffer + size, reg))
return true;
// ...
}
The author wrote the following test code and found that as the length of the log that is not related to the regular expression at the beginning of the line increases (the part matched by .*
), the length of boost::regex_match
also grows linearly.
static void BM_Regex_Match(int batchSize) {
std::string buffer = "cnt:";
std::string regStr = "cnt.*";
boost::regex reg(regStr);
std::ofstream outFile("BM_Regex_Match.txt", std::ios::trunc);
outFile.close();
for (int i = 0; i < 1000; i++) {
std::ofstream outFile("BM_Regex_Match.txt", std::ios::app);
buffer += "a";
int count = 0;
uint64_t durationTime = 0;
for (int i = 0; i < batchSize; i++) {
count++;
uint64_t startTime = GetCurrentTimeInMicroSeconds();
if (!boost::regex_match(buffer, reg)) {
std::cout << "error" << std::endl;
}
durationTime += GetCurrentTimeInMicroSeconds() - startTime;
}
outFile << i << '\t' << "durationTime: " << durationTime << std::endl;
outFile << i << '\t' << "process: " << formatSize(buffer.size() * (uint64_t)count * 1000000 / durationTime)
<< std::endl;
outFile.close();
}
}
int main(int argc, char** argv) {
logtail::Logger::Instance().InitGlobalLoggers();
std::cout << "BM_Regex_Match" << std::endl;
BM_Regex_Match(10000);
return 0;
}
At this time, we need to note that when we use the regular expression at the beginning of the line, we usually only need to match the beginning part of a single-line log. For example, in the cnt log, we don't need the entire .*
part, as matching this part consumes unnecessary performance. This effect is particularly noticeable when the log is very long.
In fact, the Boost library provides the boost::regex_search
function. Setting the appropriate flags (such as boost::match_continuous
) can achieve the requirement of matching only prefixes. This is exactly what is needed for matching the regular expression at the beginning of the line. Let's take a look at how to use boost::regex_search
:
bool BoostRegexSearch(const char* buffer, size_t size, const boost::regex& reg, string& exception) {
// ...
if (boost::regex_search(buffer, buffer + size, what, reg, boost::match_continuous)) {
return true;
}
// ...
}
In Logtail, due to the existing implementations of the regular expression at the beginning of the line, the user's regular expressions have .*
suffixes. We can automatically remove the .*
and add ^
at the beginning of the regular expression to improve matching efficiency.
Like boost::regex_match
, the author also tested boost::regex_search
based on the length of the log. It can be found that the time consumed by boost::regex_search
remains relatively stable and does not increase with the length of the log.
static void BM_Regex_Search(int batchSize) {
std::string buffer = "cnt:";
std::string regStr = "^cnt";
boost::regex reg(regStr);
std::ofstream outFile("BM_Regex_Search.txt", std::ios::trunc);
outFile.close();
for (int i = 0; i < 1000; i++) {
std::ofstream outFile("BM_Regex_Search.txt", std::ios::app);
buffer += "a";
int count = 0;
uint64_t durationTime = 0;
for (int i = 0; i < batchSize; i++) {
count++;
uint64_t startTime = GetCurrentTimeInMicroSeconds();
if (!boost::regex_search(buffer, reg)) {
std::cout << "error" << std::endl;
}
durationTime += GetCurrentTimeInMicroSeconds() - startTime;
}
outFile << i << '\t' << "durationTime: " << durationTime << std::endl;
outFile << i << '\t' << "process: " << formatSize(buffer.size() * (uint64_t)count * 1000000 / durationTime)
<< std::endl;
outFile.close();
}
}
int main(int argc, char** argv) {
std::cout << "BM_Regex_Search" << std::endl;
BM_Regex_Search(10000);
return 0;
}
After making these adjustments, the author conducted a performance comparison test between the original and improved versions of Logtail. The test results showed a significant performance improvement. The test environment is as follows:
• The same ACK cluster with identical machine specifications (ecs.c7.4xlarge, 16 vCPU, 32 GiB, Compute-optimized type c7).
• PL3 ESSD of 2048GB.
• Logtail startup parameters are kept at default.
• The program used for printing logs is consistent.
• The same collection configuration, with only the regular expression cnt.* configured at the beginning of the line.
• Generate the same log in real time. The log feature is that the length of the first line is relatively long.
cnt:13472391,thread:2,log:Exception in thread "main" java.lang.NullPointerExceptionat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitleat com.example.myproject.Book.getTitle
at com.example.myproject.Book.getTitle
at com.example.myproject.Book.getTitle
at com.example.myproject.Book.getTitle
• The size of the first collection is 10485760, and the startup parameter is modified to "max_fix_pos_bytes" : 10485760 to prevent the log from being written a lot during the first read.
• Run eight pods at the same time to output logs to a local file.
The test found that the log collection speed of the optimized Logtail 2.1.1 version reaches 633 MB/s. However, the Logtail 1.8.7 version, which is before the optimization of the regular expression at the beginning of the line, is less than 90 MB/s, which is increased by 7 times.
Metric | Original | After Multi-line Optimization |
---|---|---|
iLogtail Collection Rate (MB/s) | 90MB/s | 633MB/s |
We have observed that when the log content is very long, the collection performance of Logtail can be significantly improved several times by a simple one-line code modification. Theoretically, the longer the parts of the log that are irrelevant to the regular expression at the beginning of the line are, the more significant the performance improvement brought by this optimization will be.
Development Trends and Open Source Technology Practices of AI Agents
547 posts | 52 followers
FollowAlibaba Container Service - July 16, 2019
Alibaba Cloud Serverless - February 28, 2023
Alibaba Clouder - November 11, 2019
Alibaba Clouder - May 27, 2019
Alibaba Cloud Native Community - March 29, 2024
Alibaba Cloud Community - July 5, 2022
547 posts | 52 followers
FollowMulti-source metrics are aggregated to monitor the status of your business and services in real time.
Learn MorePlan and optimize your storage budget with flexible storage services
Learn MoreA cost-effective, efficient and easy-to-manage hybrid cloud storage solution.
Learn MoreAccelerate and secure the development, deployment, and management of containerized applications cost-effectively.
Learn MoreMore Posts by Alibaba Cloud Native Community