Algorithms for clustering based on log similarity, clustering based on word frequency, and pattern matching - Simple Log Service

The Intelligent Anomaly Analysis application of Simple Log Service provides a text analysis feature that automatically analyzes the text content of logs and produces global statistical analysis results. With this feature, you can create log pattern discovery jobs and log pattern matching jobs to monitor and analyze logs. Select a job type and an algorithm based on the characteristics of the logs that you want to analyze.

Overview of text analysis algorithms

In a log pattern discovery job, you can set up a log pattern library offline by using the log clustering algorithm or pattern discovery algorithm. In a log pattern matching job, you can monitor logs online by using the similarity clustering algorithm, hash clustering algorithm, or similarity matching algorithm.

The text analysis algorithms combine LogParser and anomaly detection technologies. They generate log analysis reports that summarize global log information and highlight potential anomalies.

You can use the reports to identify the log categories that may contain anomalies, which narrows the scope of logs that require manual troubleshooting. These categories include new log categories and the five log categories that have the highest anomaly scores.
You can review the reports on a regular basis to track changes in global log information, which helps you assess system stability.

Log pattern discovery

The log clustering algorithm is used in scenarios in which the volume of logs is large and log formats are consistent. The pattern discovery algorithm is used in scenarios in which the volume of logs is moderate and log formats are complex.

Log clustering algorithm

The log clustering algorithm is based on the log clustering feature. The log clustering feature performs coarse-grained clustering on logs. Then, the log clustering algorithm performs fine-grained clustering based on the coarse-grained clustering results. For more information about how to enable the log clustering feature and view clustering results, see LogReduce.

Pattern discovery algorithm

The pattern discovery algorithm uses word frequency analysis to cluster logs that share similar high-frequency words into one category. The high-frequency words form the log pattern of the category. For more information about the algorithm, see Efficient and Robust Syslog Parsing for Network Devices in Datacenter Networks.
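The idea of grouping logs by shared high-frequency words can be illustrated with a minimal Python sketch. It is not the implementation used by Simple Log Service or the method of the cited paper. The function name discover_patterns, the min_ratio threshold, and the <*> wildcard are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

WILDCARD = "<*>"  # hypothetical placeholder for variable (low-frequency) words


def discover_patterns(lines, min_ratio=0.5):
    """Group log lines by their high-frequency words.

    A word is treated as high-frequency if it occurs in at least
    min_ratio of all lines (an assumed, illustrative threshold).
    """
    # Count the number of lines in which each word occurs.
    line_freq = Counter()
    for line in lines:
        line_freq.update(set(line.split()))

    threshold = min_ratio * len(lines)
    clusters = defaultdict(list)
    for line in lines:
        # Keep frequent words and mask rare ones, which are likely variables.
        pattern = tuple(
            word if line_freq[word] >= threshold else WILDCARD
            for word in line.split()
        )
        clusters[pattern].append(line)
    return clusters


logs = [
    "Interface eth0 changed state to up",
    "Interface eth1 changed state to down",
    "Interface eth2 changed state to up",
    "Interface eth3 changed state to down",
]
clusters = discover_patterns(logs)
for pattern, members in clusters.items():
    print(" ".join(pattern), len(members))
```

In this sketch the interface names are masked because each one occurs only once, while "up" and "down" each occur in half of the lines and therefore remain part of two separate patterns.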

Log pattern matching

The similarity clustering algorithm and hash clustering algorithm are used in scenarios in which the volume of logs is large and log formats are consistent. The similarity matching algorithm is used in scenarios in which the volume of logs is large.

Similarity clustering algorithm

The similarity clustering algorithm uses a text similarity-based LogParser to parse text logs, clusters the logs based on their content and structure, and classifies similar logs into one category. Text similarity measures include edit distance, Jaccard similarity, and cosine similarity. The algorithm then analyzes how each log category changes across consecutive time windows to detect potential anomalies. For more information about the algorithm, see Drain: An Online Log Parsing Approach with Fixed Depth Tree.
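As a rough illustration of similarity-based clustering, the following Python sketch assigns each log to the most similar existing category by Jaccard similarity over whitespace-separated tokens. It is a simplified stand-in and does not implement Drain's fixed-depth tree or the service's actual parser. The function names and the 0.4 threshold are assumptions, and the time window analysis is not covered.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_similarity(lines, threshold=0.4):
    """Assign each line to the most similar category, or open a new one.

    threshold is an assumed, illustrative value.
    """
    categories = []  # each entry: {"template": list of tokens, "count": int}
    for line in lines:
        tokens = line.split()
        best, best_sim = None, 0.0
        for category in categories:
            sim = jaccard(tokens, category["template"])
            if sim > best_sim:
                best, best_sim = category, sim
        if best is not None and best_sim >= threshold:
            best["count"] += 1
        else:
            # No category is similar enough, so this line starts a new one.
            categories.append({"template": tokens, "count": 1})
    return categories


logs = [
    "user alice logged in from 10.0.0.1",
    "user bob logged in from 10.0.0.2",
    "disk /dev/sda1 usage at 91%",
]
categories = cluster_by_similarity(logs)
for category in categories:
    print(category["count"], " ".join(category["template"]))
```

The two login lines share four of eight distinct tokens (similarity 0.5), so they fall into one category, while the disk line shares none and opens a second category.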

Hash clustering algorithm

The hash clustering algorithm is based on the log clustering feature. The log clustering feature clusters logs online, and the hash clustering algorithm then performs further clustering on those results to continuously analyze and monitor logs. The hash clustering algorithm does not use external log pattern libraries. For more information about the log clustering feature, see LogReduce.
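This document does not describe the internals of the hash clustering algorithm or of LogReduce. As a generic illustration of hash-based grouping only, the following Python sketch masks tokens that contain digits and hashes the remaining text, so that logs with the same signature land in the same bucket. The masking rule, the choice of SHA-1, and the function names are all assumptions.

```python
import hashlib
import re
from collections import Counter

HAS_DIGIT = re.compile(r"\d")


def signature(line):
    """Hash a log line after masking tokens assumed to be variables."""
    # Assumption: any token that contains a digit is a variable value.
    tokens = ["<*>" if HAS_DIGIT.search(tok) else tok for tok in line.split()]
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()


def hash_cluster(lines):
    """Count log lines per signature; no pattern library is required."""
    return Counter(signature(line) for line in lines)


logs = [
    "request 101 took 35 ms",
    "request 102 took 7 ms",
    "cache miss for key user:42",
]
buckets = hash_cluster(logs)
print(sorted(buckets.values()))
```

Because each lookup is a single dictionary access, this kind of grouping scales well to large volumes of consistently formatted logs, which matches the scenario described for this algorithm.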

Similarity matching algorithm

The similarity matching algorithm matches and analyzes logs against log pattern libraries. The libraries can be external or can be set up in a log pattern discovery job. The algorithm counts the occurrences of each log pattern in a library and identifies new log patterns as soon as they appear. To accelerate matching, it uses methods such as vector matching and hash matching.
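The matching workflow described above (count occurrences per pattern, surface logs that match no pattern) can be sketched in Python as follows. This is an illustration under stated assumptions, not the service's implementation: patterns are assumed to be token tuples with a <*> wildcard, similarity is the fraction of agreeing positions, the 0.8 threshold is arbitrary, and a dictionary keyed by token count stands in for hash-based acceleration.

```python
from collections import Counter

WILDCARD = "<*>"  # hypothetical wildcard used in library patterns


def similarity(pattern, tokens):
    """Fraction of positions where the pattern agrees with the log tokens."""
    if len(pattern) != len(tokens) or not pattern:
        return 0.0
    hits = sum(1 for p, t in zip(pattern, tokens) if p == WILDCARD or p == t)
    return hits / len(pattern)


def match_logs(lines, library, threshold=0.8):
    """Return per-pattern counts and the lines that match no pattern."""
    # Index patterns by token count so that each log is compared only
    # with patterns of the same length.
    by_length = {}
    for pattern in library:
        by_length.setdefault(len(pattern), []).append(pattern)

    counts = Counter()
    new_candidates = []
    for line in lines:
        tokens = tuple(line.split())
        candidates = by_length.get(len(tokens), [])
        best = max(candidates, key=lambda p: similarity(p, tokens), default=None)
        if best is not None and similarity(best, tokens) >= threshold:
            counts[best] += 1
        else:
            # No library pattern is similar enough: a possible new pattern.
            new_candidates.append(line)
    return counts, new_candidates


library = [("Interface", "<*>", "changed", "state", "to", "<*>")]
logs = [
    "Interface eth0 changed state to up",
    "Interface eth9 changed state to down",
    "kernel panic detected",
]
counts, new_candidates = match_logs(logs, library)
print(counts[library[0]], new_candidates)
```

Lines collected in new_candidates are the ones a monitoring job would report as potential new log patterns.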