This topic compares Log Service with the ELK stack to help you better understand the main features and benefits of Log Service.
The Elasticsearch, Logstash, and Kibana (ELK) stack is a popular solution to real-time log analysis. You can find many case studies and resources in the open-source ELK community.
Log Service is a solution dedicated to log search and analytics scenarios. The service was developed from the monitoring and diagnosis tool used for the research and development of the Apsara system. As the number of users grew and the business evolved, Log Service was gradually adapted for log analysis in Ops scenarios, such as DevOps, Market Ops, and SecOps. The service has withstood the challenges of scenarios such as Double 11, Ant Financial Double 12, Spring Festival red envelopes, and international businesses, and now serves global users.
Apache Lucene is an open-source search engine library supported by the Apache Software Foundation. It provides full-text indexing, search, and text analysis capabilities. Elastic developed Elasticsearch based on the Lucene library in 2012 and launched the ELK stack in 2015 as an integrated solution for log collection, storage, and query. Lucene was designed to retrieve information from documents, so its log processing capabilities are limited in many aspects, such as data volume, query capability, intelligent grouping, and other custom features.
Log Service uses a self-developed log storage engine. In the past three years, Log Service has been applied to tens of thousands of applications. Log Service supports indexing for petabytes of data per day and serves tens of thousands of developers to query and analyze data hundreds of millions of times per day. Log Service serves as the log analysis engine for various Alibaba Cloud services, such as SQL audit, EagleEye, Cloud Map, sTrace, and Ditecting.
Log query is the most basic requirement of DevOps. According to the industry research report 50 Most Frequently Used UNIX/Linux Commands, the tar and grep commands are the top two commands used by programmers to query logs.
The following sections compare the ELK stack with Log Service in log query and analysis scenarios from five aspects:
- Ease of use: the convenience to get started and use the service.
- Features: search and analytics features.
- Performance: query and analysis capabilities and latency.
- Capacity: the data volume that can be processed and the scalability.
- Cost: the cost for using features.
Ease of use
- Collection: writes data in a stable manner.
- Configuration: configures data sources.
- Capacity expansion: expands storage space and scales servers.
- Usage: provides query and analysis features, which are described in the Features section of this topic.
- Export: exports data to other systems for further processing, such as for stream computing and for data backup in Object Storage Service (OSS).
- Multi-tenancy: shares data with other users and uses data securely.
|Item|Sub-item|Self-built ELK stack|Log Service|
|---|---|---|---|
|Data collection|API|RESTful API| |
| |Client|Various clients in the ecosystem, including Logstash, Beats, and Fluentd| |
|Configuration|Resource object|Provides indexes to classify logs|Provides projects under which Logstores can be created to store logs|
| |Method|API and Kibana| |
|Capacity expansion|Storage| |Requires no more servers or disks|
| |Computing|Requires more servers|Requires no more servers|
| |Collection point|Applies configurations and installs Logstash on server groups through a configuration management system|Provides the console and API for configurations, without the need for a configuration management system|
| |Capacity|Flexible capacity expansion not supported|Supports flexible and elastic capacity expansion|
|Multi-tenancy|Security|Commercial versions (high security)| |
| |Traffic shaping|No traffic shaping| |
| |Multi-tenancy|Supported by Kibana|Supported by providing accounts and granting related permissions|
- The ELK stack has an extensive ecosystem of tools for writing data, installation, and configuration.
- Log Service is a managed service that is easy to integrate with other services, convenient to configure, and easy to use. You can integrate Log Service with your services and start using it within five minutes.
- Log Service is a software-as-a-service (SaaS) offering. You do not need to worry about capacity or concurrency. It supports elastic scaling and requires no O&M.
Search and analytics features
The search feature finds log entries that meet search conditions. The analytics feature analyzes data.
- Search for the specified results and analyze the results.
- Analyze all log entries in a Logstore.
1. Status in (200,500] and Method:Get* | select count(1) as c, sum(inflow) as sum_inflow, ip group by ip
2. * | select count(1) as c, sum(inflow) as sum_inflow, ip group by ip
- Basic search capabilities
The following table lists the comparison results based on the index capabilities of Elasticsearch 6.5.
|Data type|Feature|ELK stack|Log Service|
|---|---|---|---|
|Text|Search by index|Supported|Supported|
| |Delimiter|Supported|Supported|
| |Prefix|Supported|Supported|
| |Suffix|Supported|Unsupported|
| |Fuzzy search|Supported|Supported by using SQL statements|
| |Wildcard search|Supported|Supported by using SQL statements|
|Numeric value|LONG|Supported|Supported|
| |DOUBLE|Supported|Supported|
|Nested|JSON|Supported|Unsupported|
|Geo|Geo|Supported|Supported by using SQL statements|
|IP|Search by IP addresses|Supported|Supported by using SQL statements|

Conclusions:
- The ELK stack supports more data types and provides stronger native search capabilities than Log Service.
- Log Service allows you to use SQL statements instead of using fuzzy match or Geohash functions to search for data. However, the search performance is slightly compromised. The following examples show how to use SQL statements to query data:
- Search for data that matches a specified substring:
  * | select content where content like '%substring%' limit 100
- Search for data that matches a specified regular expression:
  * | select content where regexp_like(content, '\d+m') limit 100
- Search parsed JSON-formatted data that matches specified conditions:
  * | select content where json_extract(content, '$.store.book')='mybook' limit 100
- Create an index for JSON-formatted data:
  field.store.book='mybook'
- Extended search capabilities
In log search scenarios, you may want to perform follow-up operations on searched data. For example:
- After finding an error log entry, check the context to find out the parameters that cause the error.
- After finding an error, check similar errors. You can run the tail -f command to display raw log entries and run the grep command to search for similar errors.
- After obtaining millions of log entries from a query by keyword, filter out 90% of known issues that distract you.
Log Service provides the following extended search features:
- Contextual query: queries the context of a log entry in the raw log file and displays the results in multiple pages. You do not need to log on to the server to query the context.
- LiveTail: works like the tail -f command to display raw log entries in real time.
- LogReduce: dynamically groups logs based on different patterns to detect anomalies.
- Use LiveTail to monitor and analyze logs
To monitor logs in real time, the traditional O&M model requires you to run the tail -f command on the server to display the logs. If the displayed logs contain distracting information, you also need to run the grep -v command to filter them by keyword. LiveTail in the Log Service console allows you to monitor and analyze online log data in real time, which reduces your O&M workload.

LiveTail has the following features:
- Supports data collected from Docker and Kubernetes containers, servers, Log4j Appenders, and other data sources.
- Monitors log data in real time, and allows you to filter data by keyword.
- Delimits log fields to facilitate searching for log entries that contain specific delimiters.
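Conceptually, LiveTail replaces the manual tail -f and grep -v workflow described above. The following minimal sketch shows only that filtering step; the sample log lines are made up for illustration:

```python
def filter_lines(lines, exclude_keyword):
    """Yield only the log lines that do not contain the keyword,
    similar to piping `tail -f` output through `grep -v`."""
    for line in lines:
        if exclude_keyword not in line:
            yield line

# Example: drop health-check noise from a stream of access log lines.
stream = [
    "GET /health 200",
    "GET /orders 500",
    "GET /health 200",
]
print(list(filter_lines(stream, "/health")))  # -> ['GET /orders 500']
```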
With the rapid development of businesses, massive volumes of log data are generated every day. This introduces the following concerns:
- Potential system anomalies are difficult to find.
- Unusual logons by intruders are not detected in an efficient way.
- System behavior changes caused by version updates are missed because of too much distracting information.

In addition, logs are recorded in various formats and are not marked by topics, and therefore cannot be well grouped. LogReduce in Log Service groups logs based on different patterns and delivers a full view of the logs. LogReduce has the following features:
- Various formats of logs such as Log4j logs, JSON-formatted logs, and syslog logs can be grouped.
- Logs can be filtered based on conditions that you specify before being grouped.
- Raw log entries can be retrieved based on the signature of log entries grouped in a pattern.
- The number of log entries grouped in a log pattern in different time ranges can be compared.
- The precision of log grouping can be adjusted based on your needs.
- Hundreds of millions of log entries can be grouped in seconds.
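LogReduce's grouping algorithm is built into Log Service and is not published. As a rough illustration of the underlying idea, the sketch below collapses variable tokens into placeholders so that log lines with the same shape fall into one group; the sample lines are made up:

```python
import re
from collections import Counter

def pattern_signature(line):
    """Collapse variable tokens into placeholders so that lines with
    the same shape share one signature. Hex IDs are replaced first so
    that the digit rule does not split them."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

logs = [
    "conn 42 closed after 10 ms",
    "conn 7 closed after 3 ms",
    "request 0xdeadbeef rejected",
]
groups = Counter(pattern_signature(line) for line in logs)
print(groups.most_common())
```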
Elasticsearch supports data aggregation based on doc values. Elasticsearch 6.x supports data grouping and aggregation by using SQL syntax. Log Service provides a RESTful API and a JDBC interface and is compatible with the SQL-92 standard. Log Service supports complete SQL statements, including basic aggregate functions. In addition, Log Service allows you to perform JOIN operations on internal and external data sources, and implement machine learning and pattern analysis on data.
- Interval-valued comparison and periodicity-valued comparison functions
You can nest the interval-valued comparison and periodicity-valued comparison functions in SQL statements to calculate the changes of a single field value, multiple field values, and a curve in different time windows.
* | select compare(pv, 86400) from (select count(1) as pv from log)

* | select t, diff[1] as current, diff[2] as yesterday, diff[3] as percentage from (select t, compare(pv, 86400) as diff from (select count(1) as pv, date_format(from_unixtime(__time__), '%H:%i') as t from log group by t) group by t order by t) s
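Assuming that compare(pv, 86400) returns an array holding the current value, the value 86,400 seconds (one day) earlier, and their ratio, the arithmetic can be sketched as follows. The output shape is an assumption inferred from how the outer query consumes the result:

```python
def compare(current, previous):
    """Sketch of the output shape assumed for compare(pv, 86400):
    [current value, value one window earlier, current/previous ratio]."""
    return [current, previous, current / previous]

# Page views today versus the same window yesterday.
print(compare(1320, 1200))  # -> [1320, 1200, 1.1]
```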
- Join internal and external data sources for data query
You can join Log Service data with external data sources for data search and analytics. The supported data sources and JOIN operations are as follows:
- You can perform JOIN operations on data in Logstores, MySQL databases, and OSS buckets (CSV files).
- You can perform left outer join, right outer join, full outer join, and inner join operations on the data.
- You can use SQL statements to query data in external tables and join Log Service data with external tables.

The following example shows how to join Log Service data with an external table.
SQL statements:

Create an external table:
* | create table user_meta (userid bigint, nick varchar, gender varchar, province varchar, age bigint) with (endpoint='oss-cn-hangzhou.aliyuncs.com', accessid='LTA288', accesskey='EjsowA', bucket='testossconnector', objects=ARRAY['user.csv'], type='oss')

Join Log Service data with the external table:
* | select u.gender, count(1) from chiji_accesslog l join user_meta u on l.userid = u.userid group by u.gender
- Geolocation functions
You can use the built-in geolocation functions to identify users based on IP addresses and mobile phone numbers. The following lists the available geolocation functions:
- IP functions: identify the country, province, city, city longitude and latitude, and ISP of an IP address.
- Phone number functions: identify the ISP, province, and city where a mobile phone number is registered.
- Geohash functions: encode the longitude and latitude of a city.
SQL statements:

* | SELECT count(1) as pv, ip_to_province(ip) as province WHERE ip_to_domain(ip) != 'intranet' GROUP BY province ORDER BY pv desc limit 10

* | SELECT mobile_city(try_cast("mobile" as bigint)) as "city", mobile_province(try_cast("mobile" as bigint)) as "province", count(1) as "number of requests" group by "province", "city" order by "number of requests" desc limit 100
- Security detection functions
Security detection functions in Log Service are designed based on the globally shared White Hat Security asset library. You can use security detection functions to check whether an IP address, domain name, or URL in logs is secure.
- Machine learning and time series detection functions
Log Service provides machine learning and intelligent diagnostic functions that offer the following features:
- Automatically learn regularities from historical data and predict future trends.
- Detect imperceptible anomalies in real time, and combine analytics functions to analyze the causes of the anomalies.
- Intelligently detect exceptions and inspect the system based on the interval-valued comparison and alert features. You can use this feature to analyze data for scenarios such as intelligent O&M, security, and operations in a fast and efficient manner.
- Prediction: fits a baseline based on the historical data.
- Anomaly detection, change point detection, and inflection point detection: detect anomalies.
- Multi-period detection: detects the periodicity of time-series data.
- Time series clustering: finds time series curves whose shapes differ from those of other curves.
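The exact detection algorithms are built into Log Service and are not described here. As a deliberately simplistic illustration of the idea behind anomaly detection, the sketch below flags points that deviate from the series mean by more than a chosen number of standard deviations:

```python
from statistics import mean, stdev

def zscore_anomalies(series, threshold=2.0):
    """Return indices of points more than `threshold` standard
    deviations away from the series mean."""
    m, s = mean(series), stdev(series)
    return [i for i, x in enumerate(series) if abs(x - m) > threshold * s]

# A latency series with one obvious spike at index 7.
latencies = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
print(zscore_anomalies(latencies))  # -> [7]
```

Real detection in Log Service additionally handles trends and periodicity, which a plain z-score check cannot.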
- Pattern analysis functions
Pattern analysis functions can help you detect data patterns and thus identify anomalies in a fast and efficient manner. You can use pattern analysis functions to:
- Identify patterns that frequently occur. For example, you can use pattern analysis functions to identify the ID of the user who sent 90% of invalid requests.
- Identify the factors that most differentiate two patterns. For example:
- In requests with a latency greater than 10 seconds, the ratio of combined dimensions that contain an ID is much higher than that of other combined dimensions.
- The ratio of this ID in Pattern B is lower than that in Pattern A.
- Patterns A and B are significantly different.
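The ratio comparison described above can be illustrated with a small sketch. The field names, the 10-second threshold, and the data are made up for the example:

```python
def share(rows, dimension, value):
    """Fraction of rows whose `dimension` field equals `value`."""
    return sum(1 for r in rows if r[dimension] == value) / len(rows)

requests = [
    {"user_id": "u1", "latency": 12.0},
    {"user_id": "u1", "latency": 15.0},
    {"user_id": "u2", "latency": 0.3},
    {"user_id": "u3", "latency": 0.4},
]
slow = [r for r in requests if r["latency"] > 10]

# u1 accounts for all slow requests but only half of all requests,
# which singles it out as the influencing factor.
print(share(slow, "user_id", "u1"), share(requests, "user_id", "u1"))
```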
The following compares the data write, query, and aggregation performance of Log Service and the ELK stack by using the same dataset.
- Test environment
- Test configurations
|Item|Self-built ELK stack|Log Service|
|---|---|---|
|Runtime environment|Four Elastic Compute Service (ECS) instances, each with 4 CPU cores and 16 GB memory, and ultra disks or standard SSDs|N/A|
|Shards|10|10|
|Copies|2|3 (configured by default and invisible to users)|
- Test data
The sample test log entry is as follows:
- Five columns of the double data type, five columns of the long data type, and five columns of the text data type, where the text columns are drawn from dictionaries of 256, 512, 768, 1,024, and 1,280 values, respectively.
- Fields in the test data are randomly sampled, as shown in the following example.
- Raw data size: 50 GB.
- Number of raw log entries: 162,640,232 (about 160 million).
timestamp:August 27th 2017, 21:50:19.000
long_1:756,444 double_1:0 text_1:value_136
long_2:-3,839,872,295 double_2:-11.13 text_2:value_475
long_3:-73,775,372,011,896 double_3:-70,220.163 text_3:value_3
long_4:173,468,492,344,196 double_4:35,123.978 text_4:value_124
long_5:389,467,512,234,496 double_5:-20,10.312 text_5:value_1125
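The dataset layout described above can be approximated as follows. The dictionary sizes follow the description; the numeric value ranges are assumptions, because the exact ranges used in the benchmark are not published:

```python
import random

def make_log_entry(rng):
    """One synthetic entry with 5 long, 5 double, and 5 text columns;
    text_i is drawn from a dictionary of 256 * i distinct values."""
    entry = {}
    for i in range(1, 6):
        entry[f"long_{i}"] = rng.randint(-10**15, 10**15)        # assumed range
        entry[f"double_{i}"] = round(rng.uniform(-1e5, 1e5), 3)  # assumed range
        entry[f"text_{i}"] = f"value_{rng.randrange(256 * i)}"
    return entry

sample = make_log_entry(random.Random(42))
print(len(sample))  # -> 15 columns
```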
- Test data writes
The Bulk API is used to write batch data to the ELK stack and the PostLogstoreLogs API is used to write batch data to Log Service. The following table lists the test results.
|Item|Sub-item|Self-built ELK stack|Log Service|
|---|---|---|---|
|Latency|Average write latency|40 ms|14 ms|
|Storage|Data volume of a copy|86 GB|58 GB|
| |Expansion rate (data volume/raw data size)|172%|116%|

Note: The storage fee for 50 GB of data incurred in Log Service includes the fee incurred for writing 23 GB of compressed data and the fee incurred for 27 GB of indexes.

Conclusions:
- Log Service has a lower data write latency than the ELK stack.
- The raw data size is 50 GB. The stored data volume expands because the test data is random; in most scenarios, the stored volume after compression is smaller than the raw data size. The data stored in the self-built ELK stack expands to 86 GB, an expansion rate of 172%, which is 56 percentage points higher than the 116% of Log Service. This rate is close to the 220% that is commonly planned for when writing data to the ELK stack.
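The expansion rates quoted above follow directly from the stored and raw sizes:

```python
def expansion_rate(stored_gb, raw_gb):
    """Stored data volume as a percentage of the raw data size."""
    return round(stored_gb / raw_gb * 100)

raw_gb = 50  # GB of raw test data
# Self-built ELK stack stores 86 GB; Log Service stores 58 GB.
print(expansion_rate(86, raw_gb), expansion_rate(58, raw_gb))  # -> 172 116
```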
- Test data reads (search and analytics)
- Test scenario
Two common scenarios, log search and log aggregation, are tested. The average latency is calculated in each scenario when the number of concurrent read requests is 1, 5, and 10, respectively. The two scenarios are as follows:
- Log analytics: performs aggregate functions (AVG, MIN, MAX, SUM, and COUNT) on the five columns of the long data type, groups the results by a text column, and then sorts the groups by the COUNT value to obtain the first 1,000 results:
  select count(long_1) as pv, sum(long_2), min(long_3), max(long_4), sum(long_5) group by text_1 order by pv desc limit 1000
- Log search: uses a keyword, for example, value_126, to query the number of log entries that contain the keyword and returns the top 100 rows.
- Test results
|Scenario|Number of concurrent read requests|Latency of the self-built ELK stack (seconds)|Latency of Log Service (seconds)|
|---|---|---|---|
|Log analytics|1|3.76|3.4|
| |5|3.9|4.7|
| |10|6.6|7.2|
|Log search|1|0.097|0.086|
| |5|0.171|0.083|
| |10|0.2|0.082|
- Both Log Service and the ELK stack can query and analyze 160 million log entries within seconds.
- In the log analytics scenario, the latency is similar between Log Service and the ELK stack. The ELK stack uses SSDs and delivers better I/O performance than Log Service when a large amount of data is read.
- In the log search scenario, Log Service has a much shorter latency than the ELK stack. As the number of concurrent requests increases, the latency of the ELK stack increases, while that of Log Service remains stable and even decreases.
Capacity
- Log Service allows you to index petabytes of data per day and query dozens of terabytes of data within seconds at a time. It supports elastic scaling and scale-out to extend its processing capacity.
- The ELK stack is suitable for scenarios where data is written in units of GB to TB per day and stored in units of TB. Its processing capacity is constrained by the following factors:
- Cluster size: A cluster of about 20 nodes delivers optimal performance. In the industry, large deployments of about 100 nodes are often split into multiple clusters for data processing.
- Write capacity: The number of shards cannot be modified after an index is created. Therefore, write throughput cannot be scaled out when the increasing data volume requires more capacity.
- Storage capacity: When the data stored on the primary shard reaches the maximum disk capacity, you must either migrate the shard to another disk with a larger capacity, or allocate more shards to the disk. The typical solution is to create an index, specify more shards, and rebuild existing data.
- The cluster becomes unavailable to other users when a user queries a large amount of data.
- The cluster is fully occupied during peak hours, busy collecting and processing data. This compromises data integrity and the accuracy of query results.
- The cluster becomes unavailable or inaccurate in some cases. As the business grows, out-of-memory (OOM) errors often occur due to memory settings and heartbeat synchronization failures, which makes the cluster unusable for developers.
- The team uses Logtail to collect online logs, and uses the API to integrate log collection and server management configurations into the O&M system.
- The team embeds the Log Service query page into the unified logon and O&M platform to separate business permissions from account permissions.
- The team embeds the Log Service console page into the customer's own platform so that the development team can query logs in an efficient way. The team also configures Grafana plug-ins to monitor business and configures DataV to create dashboards in Log Service.
- The number of queries per day has increased significantly. Developers increasingly use the O&M platform to search and analyze logs, which improves their efficiency. The O&M team has also revoked online logon permissions.
- In addition to NGINX logs, the O&M platform also imports application logs, mobile device logs, and container logs into Log Service. The amount of data processed has increased ninefold.
- More applications are developed. For example, Jaeger plug-ins are integrated with Log Service to build a tracing system for logs. Alerts and charts are configured to detect online errors on a daily basis.
- Various platforms are interconnected with the unified O&M platform to collect data in a uniform manner and avoid repeated data collection. In addition, the Spark and Flink platforms of the big data department can consume log data in real time.
Elasticsearch supports more common scenarios such as data updates, queries, and deletions. It is widely used in fields such as data search and analysis, and application development. The ELK stack maximizes the flexibility and performance of Elasticsearch in log analytics scenarios. Log Service is designed for log data search and analytics scenarios, and many of its features are unique in the industry. The ELK stack covers a wider range of scenarios, while Log Service provides deeper analytics features for specific scenarios. You can choose between the two services based on your business needs.