This topic compares Log Service with the ELK stack to help you better understand the main features and benefits of Log Service.
The Elasticsearch, Logstash, and Kibana (ELK) stack is a popular solution to real-time log analysis. You can find many case studies and resources in the open-source ELK community.
Log Service is a solution dedicated to log search and analytics scenarios. The service was developed from the monitoring and diagnosis tool used for the research and development of the Apsara system. As the number of users grew and the business evolved, Log Service was gradually adapted for log analysis in Ops scenarios, such as DevOps, Market Ops, and SecOps. The service has withstood the challenges of scenarios such as Double 11, Ant Financial Double 12, Spring Festival red envelopes, and international businesses, and now serves global users.
Apache Lucene is an open-source search engine library supported by the Apache Software Foundation. It provides full-text indexing, search, and text analysis capabilities. Elastic developed Elasticsearch based on the Lucene library in 2012 and launched the ELK stack in 2015 as an integrated solution for log collection, storage, and query. Lucene was designed to retrieve information from documents, so its log processing capabilities are limited in many aspects, such as data volume, query capability, intelligent grouping, and other custom features.
Log Service uses a self-developed log storage engine. In the past three years, Log Service has been applied to tens of thousands of applications. Log Service supports indexing for petabytes of data per day and serves tens of thousands of developers to query and analyze data hundreds of millions of times per day. Log Service serves as the log analysis engine for various Alibaba Cloud services, such as SQL audit, EagleEye, Cloud Map, sTrace, and Ditecting.
Log query is the most basic requirement of DevOps. According to the industry research report 50 Most Frequently Used UNIX/Linux Commands, the tar and grep commands are the top two commands used by programmers to query logs.
The following sections compare the ELK stack with Log Service in log query and analysis scenarios from five aspects:
- Ease of use: the convenience to get started and use the service.
- Features: search and analytics features.
- Performance: query and analysis capabilities and latency.
- Capacity: the data volume that can be processed and the scalability.
- Cost: the cost for using features.
Ease of use
- Collection: writes data in a stable manner.
- Configuration: configures data sources.
- Capacity expansion: expands storage space and scales servers.
- Usage: provides query and analysis features, which are described in the Features section of this topic.
- Export: exports data to other systems for further processing, such as for stream computing and for data backup in Object Storage Service (OSS).
- Multi-tenancy: shares data with other users and uses data securely.
|Item|Sub-item|Self-built ELK stack|Log Service|
|---|---|---|---|
|Data collection|API|RESTful API| |
| |Client|Various clients in the ecosystem, including Logstash, Beats, and Fluentd| |
|Configuration|Resource object|Provides indexes to classify logs|Provides projects under which Logstores can be created to store logs|
| |Method|API and Kibana| |
|Capacity expansion|Storage| |Requires no more servers or disks|
| |Computing|Requires more servers|Requires no more servers|
| |Collection point|Applies configurations and installs Logstash on server groups through a configuration management system|Provides the console and API for configurations, without the need for a configuration management system|
| |Capacity|Flexible capacity expansion not supported|Supports flexible and elastic capacity expansion|
|Multi-tenancy|Security|Commercial versions (high security)| |
| |Traffic shaping|No traffic shaping| |
| |Multi-tenancy|Supported by Kibana|Supported by providing accounts and granting related permissions|
- The ELK stack has an extensive ecosystem of tools for writing data, installation, and configuration.
- Log Service is a managed service that is easy to integrate with other services, convenient to configure, and easy to use. You can integrate Log Service with your services and start using it within five minutes.
- Log Service is a software-as-a-service (SaaS) offering. You do not need to worry about capacity or concurrency. It supports elastic scaling and requires no O&M.
Search and analytics features
The search feature finds log entries that meet search conditions. The analytics feature analyzes data.
- Search for the specified results and analyze the results.
- Analyze all log entries in a Logstore.
1. Status in (200,500] and Method:Get* | select count(1) as c, sum(inflow) as sum_inflow, ip group by ip
2. * | select count(1) as c, sum(inflow) as sum_inflow, ip group by ip
- Basic search capabilities
The following table lists the comparison results based on the index capabilities of Elasticsearch 6.5.
|Data type|Feature|ELK stack|Log Service|
|---|---|---|---|
|Text|Search by index|Supported|Supported|
| |Delimiter|Supported|Supported|
| |Prefix|Supported|Supported|
| |Suffix|Supported|Unsupported|
| |Fuzzy search|Supported|Supported by using SQL statements|
| |Wildcard search|Supported|Supported by using SQL statements|
|Numeric value|LONG|Supported|Supported|
| |DOUBLE|Supported|Supported|
|Nested|JSON|Supported|Unsupported|
|Geo|Geo|Supported|Supported by using SQL statements|
|IP|Search by IP addresses|Supported|Supported by using SQL statements|

Conclusions:
- The ELK stack supports more data types and provides stronger native search capabilities than Log Service.
- Log Service allows you to use SQL statements instead of using fuzzy match or Geohash functions to search for data. However, the search performance is slightly compromised. The following examples show how to use SQL statements to query data:
- Search for data that matches a specified substring:
  * | select content where content like '%substring%' limit 100
- Search for data that matches a specified regular expression:
  * | select content where regexp_like(content, '\d+m') limit 100
- Search parsed JSON-formatted data that matches specified conditions:
  * | select content where json_extract(content, '$.store.book')='mybook' limit 100
- Create an index for JSON-formatted data:
  field.store.book='mybook'
- Extended search capabilities
In log search scenarios, you may want to perform follow-up operations on searched data. For example:
- After finding an error log entry, check the context to find out the parameters that cause the error.
- After finding an error, check similar errors. You can run the tail -f command to display raw log entries and run the grep command to search for similar errors.
- After obtaining millions of log entries from a query by keyword, filter out 90% of known issues that distract you.
Log Service provides the following extended search features:
- Contextual query: queries the context of a log entry in the raw log file and displays the results in multiple pages. You do not need to log on to the server to query the context.
- LiveTail: works like the tail -f command to display raw log entries in real time.
- LogReduce: dynamically groups logs based on different patterns to detect anomalies.
- Use LiveTail to monitor and analyze logs
To monitor logs in real time, the traditional O&M model requires you to run the tail -f command on the server to display the logs. If the displayed logs contain distracting information, you also need to run the grep -v command to filter them by keyword. LiveTail in the Log Service console allows you to monitor and analyze online log data in real time, which reduces your O&M workload.

LiveTail has the following features:
- Supports data collected from Docker and Kubernetes containers, servers, Log4j Appenders, and other data sources.
- Monitors log data in real time, and allows you to filter data by keyword.
- Delimits log fields to facilitate searching for log entries that contain specific delimiters.
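Conceptually, LiveTail replaces the manual tail -f and grep -v workflow described above. The following minimal sketch shows only that filtering step; the sample log lines are made up for illustration:

```python
def filter_lines(lines, exclude_keyword):
    """Yield only the log lines that do not contain the keyword,
    similar to piping `tail -f` output through `grep -v`."""
    for line in lines:
        if exclude_keyword not in line:
            yield line

# Example: drop health-check noise from a stream of access log lines.
stream = [
    "GET /health 200",
    "GET /orders 500",
    "GET /health 200",
]
print(list(filter_lines(stream, "/health")))  # -> ['GET /orders 500']
```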
With the rapid development of businesses, massive volumes of log data are generated every day. This introduces the following concerns:
- Potential system anomalies are difficult to find.
- Unusual logons by intruders are not detected in an efficient way.
- System behavior changes caused by version updates are missed because of too much distracting information.

In addition, logs are recorded in various formats and are not marked by topics, and therefore cannot be well grouped. LogReduce in Log Service groups logs based on different patterns and delivers a full view of the logs. LogReduce has the following features:
- Various formats of logs such as Log4j logs, JSON-formatted logs, and syslog logs can be grouped.
- Logs can be filtered based on conditions that you specify before being grouped.
- Raw log entries can be retrieved based on the signature of log entries grouped in a pattern.
- The number of log entries grouped in a log pattern in different time ranges can be compared.
- The precision of log grouping can be adjusted based on your needs.
- Hundreds of millions of log entries can be grouped in seconds.
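LogReduce's grouping algorithm is built into Log Service and is not published. As a rough illustration of the underlying idea, the sketch below collapses variable tokens into placeholders so that log lines with the same shape fall into one group; the sample lines are made up:

```python
import re
from collections import Counter

def pattern_signature(line):
    """Collapse variable tokens into placeholders so that lines with
    the same shape share one signature. Hex IDs are replaced first so
    that the digit rule does not split them."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

logs = [
    "conn 42 closed after 10 ms",
    "conn 7 closed after 3 ms",
    "request 0xdeadbeef rejected",
]
groups = Counter(pattern_signature(line) for line in logs)
print(groups.most_common())
```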
Elasticsearch supports data aggregation based on doc values. Elasticsearch 6.x supports data grouping and aggregation by using SQL syntax. Log Service provides a RESTful API and a JDBC interface and is compatible with the SQL-92 standard. Log Service supports complete SQL statements, including basic aggregate functions. In addition, Log Service allows you to perform JOIN operations on internal and external data sources, and implement machine learning and pattern analysis on data.
- Interval-valued comparison and periodicity-valued comparison functions
You can nest the interval-valued comparison and periodicity-valued comparison functions in SQL statements to calculate the changes of a single field value, multiple field values, and a curve in different time windows.
* | select compare(pv, 86400) from (select count(1) as pv from log)

* | select t, diff[1] as current, diff[2] as yesterday, diff[3] as percentage from (select t, compare(pv, 86400) as diff from (select count(1) as pv, date_format(from_unixtime(__time__), '%H:%i') as t from log group by t) group by t order by t) s
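Assuming that compare(pv, 86400) returns an array holding the current value, the value 86,400 seconds (one day) earlier, and their ratio, the arithmetic can be sketched as follows. The output shape is an assumption inferred from how the outer query consumes the result:

```python
def compare(current, previous):
    """Sketch of the output shape assumed for compare(pv, 86400):
    [current value, value one window earlier, current/previous ratio]."""
    return [current, previous, current / previous]

# Page views today versus the same window yesterday.
print(compare(1320, 1200))  # -> [1320, 1200, 1.1]
```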
- Join internal and external data sources for data query
You can join Log Service data with external data sources for data search and analytics. The supported data sources and JOIN operations are as follows:
- You can perform JOIN operations on data in Logstores, MySQL databases, and OSS buckets (CSV files).
- You can perform left outer join, right outer join, full outer join, and inner join operations on the data.
- You can use SQL statements to query data in external tables and join Log Service data with external tables.

The following example shows how to join Log Service data with an external table.
SQL statements:

Create an external table:
* | create table user_meta (userid bigint, nick varchar, gender varchar, province varchar, age bigint) with (endpoint='oss-cn-hangzhou.aliyuncs.com', accessid='LTA288', accesskey='EjsowA', bucket='testossconnector', objects=ARRAY['user.csv'], type='oss')

Join Log Service data with the external table:
* | select u.gender, count(1) from chiji_accesslog l join user_meta u on l.userid = u.userid group by u.gender
- Geolocation functions
You can use the built-in geolocation functions to identify users based on IP addresses and mobile phone numbers. The following lists the available geolocation functions:
- IP functions: identify the country, province, city, city longitude and latitude, and ISP of an IP address.
- Phone number functions: identify the ISP, province, and city where a mobile phone number is registered.
- Geohash functions: encode the longitude and latitude of a city.
SQL statements:

* | SELECT count(1) as pv, ip_to_province(ip) as province WHERE ip_to_domain(ip) != 'intranet' GROUP BY province ORDER BY pv desc limit 10

* | SELECT mobile_city(try_cast("mobile" as bigint)) as "city", mobile_province(try_cast("mobile" as bigint)) as "province", count(1) as "number of requests" group by "province", "city" order by "number of requests" desc limit 100
- Security detection functions
Security detection functions in Log Service are designed based on the globally shared White Hat Security asset library. You can use security detection functions to check whether an IP address, domain name, or URL in logs is secure.
- Machine learning and time series detection functions
Log Service provides machine learning and intelligent diagnostic functions that offer the following features:
- Automatically learn regularities from historical data and predict future trends.
- Detect imperceptible anomalies in real time, and combine analytics functions to analyze the causes of the anomalies.
- Intelligently detect exceptions and inspect the system based on the interval-valued comparison and alert features. You can use this feature to analyze data for scenarios such as intelligent O&M, security, and operations in a fast and efficient manner.
- Prediction: fits a baseline based on the historical data.
- Anomaly detection, change point detection, and inflection point detection: detect anomalies.
- Multi-period detection: detects the periodicity of time-series data.
- Time series clustering: finds time series curves whose shapes differ from those of other curves.
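The exact detection algorithms are built into Log Service and are not described here. As a deliberately simplistic illustration of the idea behind anomaly detection, the sketch below flags points that deviate from the series mean by more than a chosen number of standard deviations:

```python
from statistics import mean, stdev

def zscore_anomalies(series, threshold=2.0):
    """Return indices of points more than `threshold` standard
    deviations away from the series mean."""
    m, s = mean(series), stdev(series)
    return [i for i, x in enumerate(series) if abs(x - m) > threshold * s]

# A latency series with one obvious spike at index 7.
latencies = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
print(zscore_anomalies(latencies))  # -> [7]
```

Real detection in Log Service additionally handles trends and periodicity, which a plain z-score check cannot.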
- Pattern analysis functions
Pattern analysis functions can help you detect data patterns and thus identify anomalies in a fast and efficient manner. You can use pattern analysis functions to:
- Identify patterns that frequently occur. For example, you can use pattern analysis functions to identify the ID of the user who sent 90% of invalid requests.
- Identify the factors that most differentiate two patterns. For example:
- In requests with a latency greater than 10 seconds, the ratio of combined dimensions that contain an ID is much higher than that of other combined dimensions.
- The ratio of this ID in Pattern B is lower than that in Pattern A.
- Patterns A and B are significantly different.
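The ratio comparison described above can be illustrated with a small sketch. The field names, the 10-second threshold, and the data are made up for the example:

```python
def share(rows, dimension, value):
    """Fraction of rows whose `dimension` field equals `value`."""
    return sum(1 for r in rows if r[dimension] == value) / len(rows)

requests = [
    {"user_id": "u1", "latency": 12.0},
    {"user_id": "u1", "latency": 15.0},
    {"user_id": "u2", "latency": 0.3},
    {"user_id": "u3", "latency": 0.4},
]
slow = [r for r in requests if r["latency"] > 10]

# u1 accounts for all slow requests but only half of all requests,
# which singles it out as the influencing factor.
print(share(slow, "user_id", "u1"), share(requests, "user_id", "u1"))
```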
The following compares the data write, query, and aggregation performance of Log Service and the ELK stack by using the same dataset.
- Test environment
- Test configurations
|Item|Self-built ELK stack|Log Service|
|---|---|---|
|Runtime environment|Four Elastic Compute Service (ECS) instances, each with 4 CPU cores and 16 GB memory, and ultra disks or standard SSDs|N/A|
|Shards|10|10|
|Copies|2|3 (configured by default and invisible to users)|
- Test data
The sample test log entry is as follows:
- Five columns of the double data type, five columns of the long data type, and five columns of the text data type, where the text columns are drawn from dictionaries of 256, 512, 768, 1,024, and 1,280 values, respectively.
- Fields in the test data are randomly sampled, as shown in the following example.
- Raw data size: 50 GB.
- Number of raw log entries: 162,640,232 (about 160 million).
timestamp:August 27th 2017, 21:50:19.000
long_1:756,444 double_1:0 text_1:value_136
long_2:-3,839,872,295 double_2:-11.13 text_2:value_475
long_3:-73,775,372,011,896 double_3:-70,220.163 text_3:value_3
long_4:173,468,492,344,196 double_4:35,123.978 text_4:value_124
long_5:389,467,512,234,496 double_5:-20,10.312 text_5:value_1125
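The dataset layout described above can be approximated as follows. The dictionary sizes follow the description; the numeric value ranges are assumptions, because the exact ranges used in the benchmark are not published:

```python
import random

def make_log_entry(rng):
    """One synthetic entry with 5 long, 5 double, and 5 text columns;
    text_i is drawn from a dictionary of 256 * i distinct values."""
    entry = {}
    for i in range(1, 6):
        entry[f"long_{i}"] = rng.randint(-10**15, 10**15)        # assumed range
        entry[f"double_{i}"] = round(rng.uniform(-1e5, 1e5), 3)  # assumed range
        entry[f"text_{i}"] = f"value_{rng.randrange(256 * i)}"
    return entry

sample = make_log_entry(random.Random(42))
print(len(sample))  # -> 15 columns
```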
- Test data writes
The Bulk API is used to write batch data to the ELK stack and the PostLogstoreLogs API is used to write batch data to Log Service. The following table lists the test results.
|Item|Sub-item|Self-built ELK stack|Log Service|
|---|---|---|---|
|Latency|Average write latency|40 ms|14 ms|
|Storage|Data volume of a copy|86 GB|58 GB|
| |Expansion rate (data volume/raw data size)|172%|116%|

Note: The storage fee for 50 GB of data incurred in Log Service includes the fee incurred for writing 23 GB of compressed data and the fee incurred for 27 GB of indexes.

Conclusions:
- Log Service has a lower data write latency than the ELK stack.
- The raw data size is 50 GB. The stored data volume expands because the test data is random; in most scenarios, the stored volume after compression is smaller than the raw data size. The data stored in the self-built ELK stack expands to 86 GB, an expansion rate of 172%, which is 56 percentage points higher than the 116% of Log Service. This rate is close to the 220% that is commonly planned for when writing data to the ELK stack.
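The expansion rates quoted above follow directly from the stored and raw sizes:

```python
def expansion_rate(stored_gb, raw_gb):
    """Stored data volume as a percentage of the raw data size."""
    return round(stored_gb / raw_gb * 100)

raw_gb = 50  # GB of raw test data
# Self-built ELK stack stores 86 GB; Log Service stores 58 GB.
print(expansion_rate(86, raw_gb), expansion_rate(58, raw_gb))  # -> 172 116
```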
- Test data reads (search and analytics)
- Test scenario
Two common scenarios, log search and log aggregation, are tested. The average latency is calculated in each scenario when the number of concurrent read requests is 1, 5, and 10, respectively. The two scenarios are as follows:
- Log analytics: performs aggregate functions (AVG, MIN, MAX, SUM, and COUNT) on the five columns of the long data type, groups the results by a text column, and then sorts the groups by the COUNT value to obtain the first 1,000 results:
  select count(long_1) as pv, sum(long_2), min(long_3), max(long_4), sum(long_5) group by text_1 order by pv desc limit 1000
- Log search: uses a keyword, for example, value_126, to query the number of log entries that contain the keyword and returns the top 100 rows.
- Test results
|Scenario|Number of concurrent read requests|Latency of the self-built ELK stack (seconds)|Latency of Log Service (seconds)|
|---|---|---|---|
|Log analytics|1|3.76|3.4|
| |5|3.9|4.7|
| |10|6.6|7.2|
|Log search|1|0.097|0.086|
| |5|0.171|0.083|
| |10|0.2|0.082|
- Both Log Service and the ELK stack can query and analyze 160 million log entries within seconds.
- In the log analytics scenario, the latency is similar between Log Service and the ELK stack. The ELK stack uses SSDs and delivers better I/O performance than Log Service when a large amount of data is read.
- In the log search scenario, Log Service has a much shorter latency than the ELK stack. As the number of concurrent requests increases, the latency of the ELK stack increases, while that of Log Service remains stable and even decreases.
Capacity
- Log Service allows you to index petabytes of data per day and query dozens of terabytes of data within seconds at a time. It supports elastic scaling and scale-out to extend its processing capacity.
- The ELK stack is suitable for scenarios where data is written in units of GB to TB per day and stored in units of TB. Its processing capacity is constrained by the following factors:
- Cluster size: A cluster of about 20 nodes delivers optimal performance. In the industry, large deployments of about 100 nodes are often split into multiple clusters for data processing.
- Write capacity: The number of shards cannot be modified after an index is created. Therefore, write throughput cannot be scaled out when the increasing data volume requires more capacity.
- Storage capacity: When the data stored on the primary shard reaches the maximum disk capacity, you must either migrate the shard to another disk with a larger capacity, or allocate more shards to the disk. The typical solution is to create an index, specify more shards, and rebuild existing data.
- The cluster becomes unavailable to other users when a user queries a large amount of data.
- The cluster is fully occupied during peak hours, busy collecting and processing data. This compromises data integrity and the accuracy of query results.
- The cluster becomes unavailable or inaccurate in some cases. As the business grows, out-of-memory (OOM) errors often occur due to memory settings and heartbeat synchronization failures, which makes the cluster unusable for developers.
- The team uses Logtail to collect online logs, and uses the API to integrate log collection and server management configurations into the O&M system.
- The team embeds the Log Service query page into the unified logon and O&M platform to separate business permissions from account permissions.
- The team embeds the Log Service console page into the customer's own platform so that the development team can query logs in an efficient way. The team also configures Grafana plug-ins to monitor business and configures DataV to create dashboards in Log Service.
- The number of queries per day has increased significantly. Developers increasingly use the O&M platform to search and analyze logs, which improves their efficiency. The O&M team has also revoked online logon permissions.
- In addition to NGINX logs, the O&M platform also imports application logs, mobile device logs, and container logs into Log Service. The amount of data processed has increased ninefold.
- More applications are developed. For example, Jaeger plug-ins are integrated with Log Service to build a tracing system for logs. Alerts and charts are configured to detect online errors on a daily basis.
- Various platforms are interconnected with the unified O&M platform to collect data in a uniform manner and avoid repeated data collection. In addition, the Spark and Flink platforms of the big data department can consume log data in real time.
Elasticsearch supports more common scenarios such as data updates, queries, and deletions. It is widely used in fields such as data search and analysis, and application development. The ELK stack maximizes the flexibility and performance of Elasticsearch in log analytics scenarios. Log Service is designed for log data search and analytics scenarios, and many of its features are unique in the industry. The ELK stack covers a wider range of scenarios, while Log Service provides deeper analytics features for specific scenarios. You can choose between the two services based on your business needs.