This topic compares Log Service with the ELK stack to help you better understand the main features and benefits of Log Service.

Background information

The Elasticsearch, Logstash, and Kibana (ELK) stack is a popular solution to real-time log analysis. You can find many case studies and resources in the open-source ELK community.

Log Service is a solution dedicated to scenarios that involve log search and analytics. The service was developed from the monitoring and diagnosis tool used for the research and development of the Apsara system. As the number of users grew and the business evolved, Log Service was gradually extended to log analysis in Ops scenarios, such as DevOps, Market Ops, and SecOps. The service has withstood the challenges of scenarios such as Double 11, Ant Financial Double 12, Spring Festival red envelopes, and international businesses, and now serves global users.

Overview

Apache Lucene is an open-source search engine software library supported by the Apache Software Foundation. Apache Lucene provides full-text searching and indexing and text analysis capabilities. Elastic developed Elasticsearch in 2012 based on the Lucene library and launched the ELK stack in 2015 as an integrated solution to log collection, storage, and query. Lucene was designed to retrieve information based on documents. Its log processing capabilities are limited in many aspects, such as the data volume, query capability, intelligent grouping, and other custom features.

Log Service uses a self-developed log storage engine. In the past three years, Log Service has been applied to tens of thousands of applications. Log Service supports indexing for petabytes of data per day and serves tens of thousands of developers to query and analyze data hundreds of millions of times per day. Log Service serves as the log analysis engine for various Alibaba Cloud services, such as SQL audit, EagleEye, Cloud Map, sTrace, and Ditecting.

Log query is the most basic requirement of DevOps. According to the industry research report 50 Most Frequently Used UNIX/Linux Commands, the tar and grep commands are the top two commands used by programmers to query logs.

The following compares the ELK stack with Log Service in log query and analysis scenarios from five aspects:

  • Ease of use: how easy it is to get started with and use the service.
  • Features: search and analytics features.
  • Performance: the capabilities and latency of data queries and analysis.
  • Capacity: the data volume that can be processed and the scalability.
  • Cost: the cost of using the features.

Ease of use

Log analysis involves the following considerations:
  • Collection: writes data in a stable manner.
  • Configuration: configures data sources.
  • Capacity expansion: expands storage space and scales servers.
  • Usage: provides query and analysis features, which are described in the Features section of this topic.
  • Export: exports data to other systems for further processing, such as for stream computing and for data backup in Object Storage Service (OSS).
  • Multi-tenancy: shares data with other users and uses data securely.
The following table lists the comparison results between Log Service and the ELK stack in terms of ease of use.
Item | Sub-item | Self-built ELK stack | Log Service
Data collection | API | RESTful API | RESTful API; Java Database Connectivity (JDBC) API
Data collection | Client | Various clients in the ecosystem, including Logstash, Beats, and Fluentd | Logtail; other clients such as Logstash
Configuration | Resource object | Provides indexes to classify logs | Project and Logstore: provides projects under which Logstores can be created to store logs
Configuration | Method | API and Kibana | API and SDK; console
Capacity expansion | Storage | Requires more servers; requires more disks | Requires no more servers or disks
Capacity expansion | Computing | Requires more servers | Requires no more servers
Capacity expansion | Configurations | Configures servers through the configuration management system; the beta release of Logstash supports centralized configurations | Provides the console and API for configurations, without the need for a configuration management system
Capacity expansion | Collection point | Applies configurations and installs Logstash on server groups through the configuration management system | Provides the console and API for configurations, without the need for a configuration management system
Capacity expansion | Capacity | Flexible capacity expansion not supported | Supports flexible and elastic capacity expansion
Export | Method | API; SDK | API; SDK; Kafka-like consumer API; consumer API for stream computing engines, such as Spark, Storm, and Flink; consumer API for stream computing class libraries, such as Python and Java class libraries
Multi-tenancy | Security | Commercial versions (high security) | HTTPS; encrypted through signatures; the data of each tenant is isolated from that of other tenants; access control
Multi-tenancy | Traffic shaping | No traffic shaping | Traffic shaping based on projects; traffic shaping based on shards
Multi-tenancy | Multi-tenancy | Supported by Kibana | Supported by providing accounts and granting related permissions

Conclusions:

  • The ELK stack ecosystem provides a wide range of tools for data writes, installation, and configuration.
  • Log Service is a managed service that enables easy integration with other services, convenient configurations, and ease of use. You can integrate Log Service with your services and use Log Service within 5 minutes.
  • Log Service is delivered as software as a service (SaaS). You do not need to worry about capacity or concurrency. It supports elastic scaling and does not require O&M.

Search and analytics features

The search feature finds log entries that meet search conditions. The analytics feature analyzes data.

For example, you need to calculate the number of read requests with a status code greater than 200, and the related traffic, for each IP address. You can use either of the following methods for analysis:
  • Search for the specified results and then analyze the results:
    Status in (200,500] and Method:Get* | select count(1) as c, sum(inflow) as sum_inflow, ip group by ip
  • Analyze all log entries in a Logstore:
    * | select count(1) as c, sum(inflow) as sum_inflow, ip group by ip
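The difference between the two methods can be sketched in plain Python. This is only an illustration of the "search, then analyze" idea, not how Log Service executes queries. The sample entries are hypothetical, and the case-insensitive prefix match on the method is an assumption.

```python
from collections import defaultdict

# Hypothetical sample entries; field names mirror the queries above.
LOGS = [
    {"ip": "10.0.0.1", "status": 200, "method": "GET", "inflow": 120},
    {"ip": "10.0.0.1", "status": 404, "method": "GET", "inflow": 80},
    {"ip": "10.0.0.2", "status": 500, "method": "GET", "inflow": 40},
    {"ip": "10.0.0.2", "status": 302, "method": "POST", "inflow": 10},
    {"ip": "10.0.0.3", "status": 503, "method": "GET", "inflow": 70},
]


def search(entries):
    """Search step: keep entries with 200 < status <= 500 whose method starts with 'get'."""
    return [
        e for e in entries
        if 200 < e["status"] <= 500 and e["method"].lower().startswith("get")
    ]


def analyze(entries):
    """Analytics step: count(1) and sum(inflow), grouped by ip."""
    result = defaultdict(lambda: {"c": 0, "sum_inflow": 0})
    for e in entries:
        group = result[e["ip"]]
        group["c"] += 1
        group["sum_inflow"] += e["inflow"]
    return dict(result)


method_1 = analyze(search(LOGS))  # search first, then analyze the matches
method_2 = analyze(LOGS)          # analyze every entry
```

With this sample data, the first method reports one matching request each for 10.0.0.1 and 10.0.0.2, whereas the second method aggregates all five entries.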
  • Basic search capabilities

    The following table lists the comparison results based on Elasticsearch 6.5 Indices.

    Data type | Feature | ELK stack | Log Service
    Text | Search by index | Supported | Supported
    Text | Delimiter | Supported | Supported
    Text | Prefix | Supported | Supported
    Text | Suffix | Supported | Unsupported
    Text | Fuzzy search | Supported | Supported by using SQL statements
    Text | Wildcard search | Supported | Supported by using SQL statements
    Numeric value | LONG | Supported | Supported
    Numeric value | DOUBLE | Supported | Supported
    Nested | JSON | Supported | Unsupported
    Geo | Geo | Supported | Supported by using SQL statements
    IP | Search by IP addresses | Supported | Supported by using SQL statements
    Conclusions:
    • The ELK stack supports more data types and provides stronger native search capabilities than Log Service.
    • Log Service covers capabilities such as fuzzy match and geo search through SQL statements instead of native search operators, at a slight cost in search performance. The following examples show how to use SQL statements to query data:
    Searches for data that matches the specified substring:
    * | select content where content like '%substring%' limit 100
    
    Searches for data that matches the specified regular expression:
    * | select content where regexp_like(content, '\d+m') limit 100
    
    Searches for parsed JSON-formatted data that matches the specified conditions:
    * | select content where json_extract(content, '$.store.book')='mybook' limit 100
    
    Searches JSON-formatted data for which an index is created:
    field.store.book='mybook'
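To make the semantics of the three SQL predicates above concrete, the following Python sketch applies equivalent checks to a single content string. It only mirrors the matching logic (substring, regular expression that may match anywhere in the string, and a dotted JSON path) and is not Log Service code. The helper names are hypothetical.

```python
import json
import re


def contains_substring(content, substring):
    """Equivalent of: content like '%substring%'."""
    return substring in content


def matches_regex(content, pattern):
    """Equivalent of regexp_like: true if the pattern matches anywhere in content."""
    return re.search(pattern, content) is not None


def json_field_equals(content, path, expected):
    """Walk a dotted path such as 'store.book' in JSON content and compare the value."""
    try:
        node = json.loads(content)
    except ValueError:
        return False  # content is not valid JSON
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return node == expected
```

For example, `matches_regex("request took 35ms", r"\d+m")` is true because `35m` appears inside the string, and `json_field_equals('{"store": {"book": "mybook"}}', "store.book", "mybook")` is true.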
  • Extended search capabilities
    In log search scenarios, you may want to perform follow-up operations on searched data. For example:
    • After finding an error log entry, check the context to find out the parameters that cause the error.
    • After finding an error, check similar errors. You can run the tail -f command to display raw log entries and run the grep command to search for similar errors.
    • After obtaining millions of log entries from a query by keyword, filter out 90% of known issues that distract you.
    To resolve the preceding issues, Log Service provides the following closed-loop solutions:
    • Contextual query: queries the context of a log entry in the raw log file and displays the results in multiple pages. You do not need to log on to the server to query the context.
    • LiveTail: displays raw log entries in real time, in the same way as the tail -f command.
    • LogReduce: dynamically groups logs based on different patterns to detect anomalies.
    • Use LiveTail to monitor and analyze logs

      To monitor logs in real time, the traditional O&M model requires you to run the tail -f command on the server to display the logs. If the displayed logs contain distracting information, you need to run the grep or grep -v command to filter data by keyword. LiveTail provided in the Log Service console allows you to monitor and analyze online log data in real time, thus reducing your O&M workloads.

      LiveTail has the following features:
      • Supports data collected from Docker and Kubernetes containers, servers, Log4j Appenders, and other data sources.
      • Monitors log data in real time, and allows you to filter data by keyword.
      • Delimits log fields to facilitate searching for log entries that contain specific delimiters.
    • LogReduce
      With the rapid development of businesses, massive volumes of log data are generated every day. This introduces the following concerns:
      • Potential system anomalies are difficult to find.
      • Unusual logons by intruders are not detected in an efficient way.
      • System behavior changes caused by version updates are not detected due to too much distracting information.
      In addition, logs are recorded in various formats and are not marked by topics, and therefore cannot be well grouped. LogReduce provided in Log Service groups logs based on different patterns and delivers a full view of the logs. LogReduce has the following features:
        • Various formats of logs such as Log4j logs, JSON-formatted logs, and syslog logs can be grouped.
        • Logs can be filtered based on conditions that you specify before being grouped. Raw log entries can be retrieved based on the signature of log entries grouped in a pattern.
        • The number of log entries grouped in a log pattern in different time ranges can be compared.
        • The precision of log grouping can be adjusted based on your needs.
        • Hundreds of millions of log entries can be grouped in seconds.
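The idea of grouping logs by pattern can be illustrated with a deliberately simple Python sketch. It replaces numeric tokens with a wildcard and counts the resulting signatures. This is an assumed stand-in technique for illustration only; the source does not describe the algorithm that LogReduce uses.

```python
import re
from collections import Counter

# Numeric tokens (integers, decimals, dotted values such as IP addresses) are
# treated as the variable part of a log line. This rule is an assumption.
_VARIABLE = re.compile(r"\b\d+(?:\.\d+)*\b")


def signature(line):
    """Replace variable tokens with '*' so that similar lines share one signature."""
    return _VARIABLE.sub("*", line)


def group_by_pattern(lines):
    """Count how many log lines fall into each signature."""
    return Counter(signature(line) for line in lines)


sample = [
    "user 1001 logged in from 10.0.0.5",
    "user 1002 logged in from 10.0.0.9",
    "disk usage 91% on /dev/sda1",
]
patterns = group_by_pattern(sample)
# patterns["user * logged in from *"] is 2
# patterns["disk usage *% on /dev/sda1"] is 1
```

Comparing the counts of each signature across two time ranges gives a rough version of the pattern comparison described above.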

Analytics capabilities

Elasticsearch supports data aggregation based on the doc values data. Elasticsearch 6.x supports data grouping and aggregation by using the SQL syntax. Log Service supports the RESTful API and JDBC API and is compatible with the SQL-92 standard. Log Service supports complete SQL statements, including basic aggregate functions. In addition, Log Service allows you to perform JOIN operations on internal and external data sources, and implement machine learning and pattern analysis on data.

Note: The analytics capabilities of the ELK stack and Log Service are compared based on Elasticsearch 6.5 Aggregations and the Log Service analytics syntax.
In addition to the SQL-92 standard syntax, Log Service provides the following features that are specific to log analytics scenarios:
  • Interval-valued comparison and periodicity-valued comparison functions
    You can nest the interval-valued comparison and periodicity-valued comparison functions in SQL statements to calculate the changes of a single field value, multiple field values, and a curve in different time windows.
    * | select compare(pv, 86400) from (select count(1) as pv from log)
    * | select t, diff[1] as current, diff[2] as yesterday, diff[3] as percentage from (select t, compare(pv, 86400) as diff from (select count(1) as pv, date_format(from_unixtime(__time__), '%H:%i') as t from log group by t) group by t order by t) s
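The following Python sketch shows the kind of result that the compare queries above produce: the value in the current window, the value in the window 86,400 seconds earlier, and a third element that relates the two. Treating the third element as the ratio of the current value to the earlier value is an assumption based on how the query labels diff[1], diff[2], and diff[3]. The data layout is hypothetical.

```python
def compare(series, timestamp, window_seconds):
    """Return [current, earlier, current / earlier] for a metric keyed by Unix time.

    `series` maps a window start time (in seconds) to a value such as a page view count.
    """
    current = series[timestamp]
    earlier = series[timestamp - window_seconds]
    return [current, earlier, current / earlier]


# Hypothetical page view counts for two consecutive days.
pv_by_day = {0: 200, 86400: 250}
diff = compare(pv_by_day, timestamp=86400, window_seconds=86400)
# diff is [250, 200, 1.25]
```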
  • Join internal and external data sources for data query
    You can join Log Service data with external data sources for data search and analytics. The supported data sources and JOIN operations are as follows:
    • You can perform JOIN operations on data in Logstores, MySQL databases, and OSS buckets (CSV files).
    • You can perform left outer join, right outer join, full outer join, and inner join operations on data.
    • You can use SQL statements to query data in external tables and join Log Service data with external tables.
    The following example shows how to join Log Service data with external tables.
    SQL statements
    Create an external table:
    * | create table user_meta ( userid bigint, nick varchar, gender varchar, province varchar, age bigint) with ( endpoint='oss-cn-hangzhou.aliyuncs.com', accessid='LTA288', accesskey='EjsowA', bucket='testossconnector', objects=ARRAY['user.csv'], type='oss')

    Join Log Service data with the external table:
    * | select u.gender, count(1) from chiji_accesslog l join user_meta u on l.userid = u.userid group by u.gender
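What the join above computes can be sketched in Python with an in-memory CSV file standing in for the OSS object. The CSV content, its header row, and the sample log entries are hypothetical. The sketch performs an inner join on userid and counts requests by gender.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for user.csv; the real object may not include a header row.
USER_CSV = (
    "userid,nick,gender,province,age\n"
    "1,ann,female,Zhejiang,30\n"
    "2,bob,male,Jiangsu,25\n"
)

# Hypothetical access log entries; only the join key is shown.
ACCESS_LOG = [{"userid": 1}, {"userid": 2}, {"userid": 1}, {"userid": 3}]


def load_user_meta(csv_text):
    """Index the external table by userid."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["userid"]): row for row in reader}


def count_by_gender(log_entries, user_meta):
    """Inner join on userid, then count(1) grouped by gender."""
    counts = Counter()
    for entry in log_entries:
        user = user_meta.get(entry["userid"])
        if user is not None:  # inner join: entries without a matching user are dropped
            counts[user["gender"]] += 1
    return dict(counts)


result = count_by_gender(ACCESS_LOG, load_user_meta(USER_CSV))
# result is {"female": 2, "male": 1}; the entry for userid 3 has no match
```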
  • Geolocation functions
    You can use the built-in geolocation functions to locate users based on IP addresses and mobile phone numbers. The following geolocation functions are available:
    • IP functions: identify the country, province, city, city longitude and latitude, and ISP of an IP address.
    • Phone number functions: identify the ISP, province, and city where a mobile phone number is registered.
    • Geohash functions: encode the longitude and latitude of a city.
    The following examples show how to use the geolocation functions to analyze logs:
    SQL statements
    * | SELECT count(1) as pv, ip_to_province(ip) as province WHERE ip_to_domain(ip) != 'intranet' GROUP BY province ORDER BY pv desc limit 10
    
    * | SELECT mobile_city(try_cast("mobile" as bigint)) as "city", mobile_province(try_cast("mobile" as bigint)) as "province", count(1) as "number of requests" group by "province", "city" order by "number of requests" desc limit 100
  • Security detection functions
    Security detection functions in Log Service are designed based on the globally shared White Hat Security asset library. You can use security detection functions to check whether an IP address, domain name, or URL in logs is secure.
    • security_check_ip
    • security_check_domain
    • security_check_url
  • Machine learning and time series detection functions
    Log Service provides machine learning and intelligent diagnostic functions that you can use to:
    • Automatically learn historical data regularities and predict the future trend.
    • Detect imperceptible anomalies in real time, and combine analytics functions to analyze the causes of the anomalies.
    • Intelligently detect exceptions and inspect the system based on the interval-valued comparison and alert features. You can use this feature to analyze data for scenarios such as intelligent O&M, security, and operations in a fast and efficient manner.
    Machine learning and intelligent diagnostic functions provide the following features:
    • Prediction: fits a baseline based on the historical data.
    • Anomaly detection, change point detection, and inflection point detection: detect anomalies.
    • Multi-period detection: detects the periodicity of time-series data.
    • Time series clustering: finds time series curves whose shapes differ from those of other curves.
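As a rough illustration of what anomaly detection on a time series means, the following Python sketch flags points that deviate from the mean of a trailing window by more than a multiple of the standard deviation. This k-sigma rule is a simple stand-in chosen for the example. The source does not state which algorithms the Log Service functions use, and the window size and threshold are assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return the indexes of points that deviate strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


latency_ms = [10, 11, 9, 10, 11, 10, 50, 10, 11]
spikes = detect_anomalies(latency_ms)
# spikes is [6], the index of the value 50
```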
  • Pattern analysis functions
    Pattern analysis functions can help you detect data patterns and thus identify anomalies in a fast and efficient manner. You can use pattern analysis functions to:
    • Identify patterns that frequently occur. For example, you can use pattern analysis functions to identify the ID of the user who sent 90% of invalid requests.
    • Identify the factors that most influence the difference between two patterns. For example:
      • In requests with a latency greater than 10 seconds, the ratio of combined dimensions that contain an ID is much higher than that of other combined dimensions.
      • The ratio of this ID in Pattern B is lower than that in Pattern A.
      • Patterns A and B are significantly different.
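The comparison between two patterns can be sketched in Python as follows. The sketch splits requests into Pattern A (for example, latency greater than 10 seconds) and Pattern B (all other requests), computes the share of each ID in both, and returns the ID whose share is most over-represented in Pattern A. It is a simplified stand-in for illustration, with hypothetical field names, and is not the algorithm used by the pattern analysis functions.

```python
from collections import Counter


def share_by_key(entries, key):
    """Return the fraction of entries that carry each value of `key`."""
    counts = Counter(e[key] for e in entries)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def differential_pattern(entries, key, is_pattern_a):
    """Return the value of `key` whose share in Pattern A most exceeds its share in Pattern B."""
    pattern_a = [e for e in entries if is_pattern_a(e)]
    pattern_b = [e for e in entries if not is_pattern_a(e)]
    share_a = share_by_key(pattern_a, key)
    share_b = share_by_key(pattern_b, key)
    return max(share_a, key=lambda k: share_a[k] - share_b.get(k, 0.0), default=None)


# Hypothetical requests: latency is in seconds.
requests = [
    {"id": "u1", "latency": 12}, {"id": "u1", "latency": 15}, {"id": "u2", "latency": 11},
    {"id": "u1", "latency": 1}, {"id": "u2", "latency": 2},
    {"id": "u3", "latency": 1}, {"id": "u3", "latency": 3},
]
suspect = differential_pattern(requests, "id", lambda e: e["latency"] > 10)
# suspect is "u1": it accounts for 2 of 3 slow requests but 1 of 4 fast requests
```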

Performance

The following compares the data write, query, and aggregation performance of Log Service and the ELK stack by using the same dataset.

  • Test environment
    • Test configurations
      Item | Self-built ELK stack | Log Service
      Runtime environment | Four Elastic Compute Service (ECS) instances, each with 4 CPU cores and 16 GB memory, and ultra disks or standard SSDs | N/A
      Shards | 10 | 10
      Copies | 2 | 3 (configured by default and invisible to users)
    • Test data
      • Five columns of the double data type, five columns of the long data type, and five columns of the text data type. The dictionary sizes of the text columns are 256, 512, 768, 1,024, and 1,280, respectively.
      • Fields in test data are randomly sampled, as shown in the following figure.
      • Raw data size: 50 GB.
      • Number of raw log entries: 162,640,232 (about 160 million).
      The sample test log entry is as follows:
      timestamp:August 27th 2017, 21:50:19.000 
      long_1:756,444 double_1:0 text_1:value_136 
      long_2:-3,839,872,295 double_2:-11.13 text_2:value_475 
      long_3:-73,775,372,011,896 double_3:-70,220.163 text_3:value_3 
      long_4:173,468,492,344,196 double_4:35,123.978 text_4:value_124 
      long_5:389,467,512,234,496 double_5:-20,10.312 text_5:value_1125
  • Test data writes

    The Bulk API is used to write batch data to the ELK stack and the PostLogstoreLogs API is used to write batch data to Log Service. The following table lists the test results.

    Item | Sub-item | Self-built ELK stack | Log Service
    Latency | Average write latency | 40 ms | 14 ms
    Storage | Data volume of a copy | 86 GB | 58 GB
    Storage | Expansion rate (data volume/raw data size) | 172% | 116%
    Note: The storage fee for 50 GB of data incurred in Log Service includes the fee incurred for writing 23 GB of compressed data and the fee incurred for 27 GB of indexes.
    Conclusions:
    • Log Service has a lower data write latency than the ELK stack.
    • The raw data size is 50 GB. The stored data volume expands because the test data is random. In most scenarios, the stored data volume after compression is smaller than the raw data size. The data stored in the self-built ELK stack expands to 86 GB, which is an expansion rate of 172%. This rate is 56 percentage points higher than the 116% expansion rate of Log Service. The expansion rate of the ELK stack is close to the 220% that is recommended when you write new data to the ELK stack.
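The expansion rates in the table follow directly from the stored data volume and the raw data size. The short Python calculation below reproduces them from the figures in the table and also shows the gap between the two systems, both in percentage points and in relative terms.

```python
RAW_GB = 50  # raw data size from the test description


def expansion_rate(stored_gb, raw_gb=RAW_GB):
    """Expansion rate in percent: stored data volume divided by raw data size."""
    return round(stored_gb / raw_gb * 100)


elk_rate = expansion_rate(86)  # 172
sls_rate = expansion_rate(58)  # 116

gap_points = elk_rate - sls_rate                 # 56 percentage points
relative_extra = round((86 / 58 - 1) * 100)      # ELK stores about 48% more data
```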
  • Test data reads (search and analytics)
    • Test scenario
      The two common scenarios of log search and aggregation are used as an example. The average latency is calculated in the two scenarios when the number of concurrent read requests is 1, 5, and 10, respectively. The two scenarios are as follows:
      • Applies aggregate functions such as COUNT, SUM, MIN, and MAX to the five columns of the long data type, groups the results by the text_1 column, and then sorts the results by the COUNT value to obtain the top 1,000 results:
        select count(long_1) as pv,sum(long_2),min(long_3),max(long_4),sum(long_5) 
          group by text_1 order by pv desc limit 1000
      • Uses a keyword, for example, value_126, to query the number of log entries that contain the keyword and the top 100 rows.
        value_126
    • Test results
      Scenario | Number of concurrent read requests | Latency of the self-built ELK stack (seconds) | Latency of Log Service (seconds)
      Log analytics | 1 | 3.76 | 3.4
      Log analytics | 5 | 3.9 | 4.7
      Log analytics | 10 | 6.6 | 7.2
      Log search | 1 | 0.097 | 0.086
      Log search | 5 | 0.171 | 0.083
      Log search | 10 | 0.2 | 0.082
    • Conclusions:
      • Both Log Service and the ELK stack can query and analyze 160 million log entries within seconds.
      • In the log analytics scenario, the latency is similar between Log Service and the ELK stack. The ELK stack uses SSDs and delivers better I/O performance than Log Service when a large amount of data is read.
      • In the log search scenario, Log Service has a shorter latency than the ELK stack, and the gap widens with concurrency. As the number of concurrent requests increases from 1 to 10, the latency of the ELK stack rises from 0.097 seconds to 0.2 seconds, while that of Log Service stays between 0.082 and 0.086 seconds.
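A measurement like the one described in the test scenario, average latency at 1, 5, and 10 concurrent read requests, could be taken with a small harness such as the following Python sketch. The source does not describe the tool that was used for the test, so this is only an assumed approach. `fake_query` is a hypothetical placeholder for a call to Elasticsearch or Log Service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def average_latency(run_query, concurrency, rounds=3):
    """Run `run_query` from `concurrency` threads and return the mean latency in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        run_query()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(concurrency * rounds)))
    return sum(latencies) / len(latencies)


def fake_query():
    """Placeholder for a real search or analytics request."""
    time.sleep(0.01)


results = {n: average_latency(fake_query, n) for n in (1, 5, 10)}
```

Replacing `fake_query` with a function that sends the search or analytics statement to the system under test yields one latency figure per concurrency level, which corresponds to one row of the results table.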

Capacity

  • Log Service allows you to index petabytes of data per day and query dozens of terabytes of data within seconds in a single query. It supports elastic scaling and can scale out to increase the processing capacity.
  • The ELK stack is suitable for scenarios where data is written in units of GB to TB per day and stored in units of TB. The processing capacity is constrained by the following factors:
    • The size of a cluster: A cluster that consists of about 20 nodes has optimal performance. A large cluster in the industry can contain 100 nodes and is often split into multiple clusters for data processing.
    • Write capacity: The number of shards cannot be modified after they are created. Therefore, the maximum number of available nodes cannot be increased when more write capacity is required along with the increasing throughput.
    • Storage capacity: When the data stored on the primary shard reaches the maximum disk capacity, you must either migrate the shard to another disk with a larger capacity, or allocate more shards to the disk. The typical solution is to create an index, specify more shards, and rebuild existing data.
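The reason why the number of shards is fixed, and why existing data has to be rebuilt into a new index, can be illustrated with the routing rule that Elasticsearch documents: a document is assigned to the shard given by the hash of its routing value modulo the number of primary shards. The Python sketch below uses zlib.crc32 as a stand-in hash (Elasticsearch uses a Murmur3-based hash), and the document IDs are hypothetical.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard as hash(routing value) % number of primary shards."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def moved_fraction(doc_ids, old_shards, new_shards):
    """Fraction of documents whose shard changes if the shard count changes."""
    moved = sum(
        1 for d in doc_ids
        if shard_for(d, old_shards) != shard_for(d, new_shards)
    )
    return moved / len(doc_ids)


doc_ids = [f"doc-{i}" for i in range(1000)]
fraction = moved_fraction(doc_ids, old_shards=10, new_shards=11)
```

With modulo-based routing, most documents would map to a different shard after the shard count changes, so lookups by ID would no longer find them. This is why the data is reindexed into a new index with more shards instead of being resized in place.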

Use case

Customer A is one of the major consulting websites in China. It has thousands of servers and hundreds of developers. The O&M team used an ELK cluster to process NGINX logs. However, the cluster often failed when it processed a large volume of data. For example:
  • The cluster became unavailable to other users when a user queried a large amount of data.
  • The cluster was fully occupied with collecting and processing data during peak hours. This compromised data integrity and the accuracy of query results.
  • As the business grew, out of memory (OOM) errors often occurred due to memory settings and heartbeat synchronization failures. The cluster became unavailable or returned inaccurate results in some cases, which made it of little use to developers.
In June 2018, the O&M team began to use Log Service to solve the problems.
  1. The team uses Logtail to collect online logs, and uses the API to integrate log collection and server management configurations into the O&M system.
  2. The team embeds the Log Service query page into the unified logon and O&M platform to separate business permissions from account permissions.
  3. The team embeds the Log Service console page into the customer's own platform so that the development team can query logs in an efficient way. The team also configures Grafana plug-ins to monitor business and configures DataV to create dashboards in Log Service.
After two months, the customer's O&M has improved as follows:
  • The number of queries per day has increased significantly. Developers increasingly use the O&M platform to search and analyze logs, which improves their efficiency. The O&M team has also revoked online logon permissions.
  • In addition to NGINX logs, the O&M platform also imports application logs, mobile device logs, and container logs into Log Service. The amount of data that is processed has increased by 9 times.
  • More applications are developed. For example, Jaeger plug-ins are integrated with Log Service to build a tracing system for logs. Alerts and charts are configured to detect online errors on a daily basis.
  • Various platforms are interconnected with the unified O&M platform to collect data in a uniform manner and avoid repeated data collection. In addition, the Spark and Flink platforms of the big data department can consume log data in real time.

Summary

Elasticsearch supports more common scenarios such as data updates, queries, and deletions. It is widely used in fields such as data search and analysis, and application development. The ELK stack maximizes the flexibility and performance of Elasticsearch in log analytics scenarios. Log Service is designed for log data search and analytics scenarios. Many of its features are unique in the industry. The ELK stack can cover a wider range of scenarios while Log Service provides deeper analytics features for specific scenarios. You can choose either one of the two services based on your business needs.