Data Lake Analytics (DLA) can read and analyze log files that are stored in OSS, so you can identify service failures without moving the log files.

Log files record detailed runtime information about a service, so you need to query and analyze them for troubleshooting, status monitoring, and early warning. Object Storage Service (OSS) is a secure, cost-effective, and reliable object storage service provided by Alibaba Cloud. It allows you to store large volumes of data in the cloud, which makes it a common choice for storing log files. DLA can read and analyze these files in place.

This topic describes how to use DLA to read and analyze log files stored in OSS. Apache web server logs, NGINX access logs, and Apache Log4j logs are used as examples.

Prerequisites

Before you begin, perform the following operations to prepare test data in OSS:

  1. Activate the OSS service. For more information, see Get started with Object Storage Service.

  2. Create an OSS bucket. For more information, see Create buckets.

  3. Upload log files to OSS. For more information, see Upload objects.

    Upload the log files webserver.log, ngnix_log, and log4j_sample.log to the log directory of your OSS bucket.

    • Data in the Apache web server log file webserver.log:
      127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
      127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"
      Regular expression:
      ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?
    • Data in the NGINX access log file ngnix_log:
      127.0.0.1 - - [14/May/2018:21:58:04 +0800] "GET /?stat HTTP/1.1" 200 182 "-" "aliyun-sdk-java/2.6.0(Linux/2.6.32-220.23.2.ali927.el5.x86_64/amd64;1.6.0_24)" "-"
      127.0.0.1 - - [14/May/2018:21:58:04 +0800] "GET /?prefix=&delimiter=%2F&max-keys=100&encoding-type=url HTTP/1.1" 200 7202 "https://help.aliyun.com/product/70174.html" "aliyun-sdk-java/2.6.0(Linux/2.6.32-220.23.2.ali927.el5.x86_64/amd64;1.6.0_24)" "-"
      Regular expression:
      ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\".*?\") (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\".*?\") (-|[0-9]*) (-|[0-9]*)
    • Data in the Apache Log4j log file log4j_sample.log. This sample uses the default Log4j log format that Hadoop generates:
        2018-11-27 17:45:23,128 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Minimum allocation = <memory:1024, vCores:1>
        2018-11-27 17:45:23,128 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Maximum allocation = <memory:8192, vCores:4>
        2018-11-27 17:45:23,154 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc mb per queue for root is undefined
        2018-11-27 17:45:23,154 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc vcore per queue for root is undefined
      Regular expression:
      ^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}.\\d{2}.\\d{2}.\\d{3})\\s+(\\S+)\\s+(\\S+)\\s+(.*)$
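
Before you use an expression in DLA, it can help to try it locally against a sample row. The following Python sketch applies the Apache web server and Log4j expressions from this topic to one sample row each and prints the captured fields. It rests on two assumptions that are not stated in this topic: the doubled backslashes and \" sequences are string-literal escaping (so the expressions can be pasted into an ordinary Python string), and an entire row must match the expression (hence fullmatch). Python's re module is only a stand-in for the engine that DLA uses, and extract_fields is a hypothetical helper name.

```python
import re

# Expressions copied from this topic. In a normal (non-raw) Python string,
# "\\[" becomes \[ and "\"" becomes ", which is the assumed intent.
APACHE_REGEX = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
LOG4J_REGEX = "^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}.\\d{2}.\\d{2}.\\d{3})\\s+(\\S+)\\s+(\\S+)\\s+(.*)$"

# One sample row from webserver.log and one from log4j_sample.log.
apache_line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
log4j_line = (
    "2018-11-27 17:45:23,128 INFO "
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: "
    "Minimum allocation = <memory:1024, vCores:1>"
)


def extract_fields(pattern, line):
    """Return the captured fields as a tuple, or None if the row does not match."""
    match = re.fullmatch(pattern, line)
    return match.groups() if match else None


apache_fields = extract_fields(APACHE_REGEX, apache_line)
log4j_fields = extract_fields(LOG4J_REGEX, log4j_line)
print(apache_fields)
print(log4j_fields)
```

The first sample row has no referer or user agent, so the last two fields of apache_fields are None. This is the effect of the optional non-capturing group at the end of the Apache expression.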

Usage notes

When you use DLA to read data of a log file, make sure that the log file meets the following conditions:

  • The log file is in plain text, and each row in the file can be mapped to one data record in a DLA table.

  • All rows use the same format and can be matched by a single regular expression.
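
One way to check the second condition before you upload a file is to scan it for rows that the expression does not match. The following Python sketch does this for the Log4j expression from this topic. The helper name find_unmatched_rows is hypothetical, the third sample row is a made-up stack-trace continuation line that is included only to show a failing row, and the use of fullmatch assumes that an entire row must match the expression.

```python
import re

# Log4j expression from this topic, pasted into a normal Python string.
LOG4J_REGEX = "^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}.\\d{2}.\\d{2}.\\d{3})\\s+(\\S+)\\s+(\\S+)\\s+(.*)$"


def find_unmatched_rows(pattern, lines):
    """Return (line_number, text) for each non-empty row that the pattern does not match."""
    compiled = re.compile(pattern)
    unmatched = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.strip() and not compiled.fullmatch(text):
            unmatched.append((number, text))
    return unmatched


sample_rows = [
    "2018-11-27 17:45:23,128 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Minimum allocation = <memory:1024, vCores:1>",
    "2018-11-27 17:45:23,154 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc mb per queue for root is undefined",
    # Made-up row for illustration: a multi-line stack trace breaks the one-row-per-record rule.
    "\tat com.example.Demo.main(Demo.java:10)",
]

unmatched = find_unmatched_rows(LOG4J_REGEX, sample_rows)
print(unmatched)
```

To scan a local copy of a log file instead of the in-memory list, pass an open file object as the lines argument.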

When you create an external table for the log file in the DLA console, writing a regular expression is the most complex step. We recommend that you take note of the following items when you write a regular expression:

  • Each field in a regular expression must be enclosed in parentheses (). Fields in the log file are separated with spaces.

  • The number of columns specified in the CREATE TABLE statement must be the same as the number of fields in the regular expression.

  • In most cases, digits are matched by using ([0-9]*) or (-|[0-9]*), and strings are matched by using ([^ ]*) or (\".*?\").
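
The column-count rule can be checked mechanically by counting the capturing groups in the expression. The following Python sketch does this for the Apache web server expression from this topic. The column names are hypothetical placeholders rather than names taken from this topic, check_column_count is a hypothetical helper, and Python's re module is used only as a stand-in for the engine that DLA uses.

```python
import re

# Apache web server expression from this topic, pasted into a normal Python string.
APACHE_REGEX = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"

# Hypothetical column names for illustration; use the columns of your own CREATE TABLE statement.
apache_columns = [
    "host", "identity", "user_name", "time", "request",
    "status", "size", "referer", "agent",
]


def check_column_count(pattern, columns):
    """Return True if the pattern has exactly one capturing group per column."""
    # Pattern.groups counts capturing groups only; (?: ... ) is not counted.
    return re.compile(pattern).groups == len(columns)


group_count = re.compile(APACHE_REGEX).groups
columns_ok = check_column_count(APACHE_REGEX, apache_columns)
print(group_count, columns_ok)

# The common field patterns from the list above, applied to single values.
dash_or_digits = re.fullmatch("(-|[0-9]*)", "-")
quoted_string = re.fullmatch("(\".*?\")", '"GET /apache_pb.gif HTTP/1.0"')
print(dash_or_digits.group(1), quoted_string.group(1))
```

The pattern (-|[0-9]*) is useful for numeric fields such as the response size, which a web server may log as a hyphen when no value is available.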