
Analyze a large number of historical logs inexpensively

Last Updated: Oct 18, 2019

Background

Logs are a special type of data that plays an important role in processing historical data, diagnosing problems, and tracing system activities. Logs are essential data sources for the work of data analysts, developers, and O&M personnel.

A common way to reduce costs is to retain logs only for a specified period and analyze just the logs generated during that period. These logs are called hot logs. This method meets short-term service requirements, but it leaves a large volume of historical logs unused.

Many enterprises need to analyze logs only occasionally and do not require highly efficient log analysis. Building a complete log analysis system for such occasional use incurs high capital expenditure and O&M costs. The challenge these enterprises face is how to keep storage costs low while still being able to analyze a large number of logs.

Solution

To address these needs, Alibaba Cloud provides a simplified historical log analysis solution. The solution uses Alibaba Cloud Log Service (LOG for short, formerly known as SLS) to ship logs, Alibaba Cloud Object Storage Service (OSS) to store logs, and Data Lake Analytics (DLA) to analyze logs. The solution has three advantages:

  • Log Service is a one-stop real-time data service developed by Alibaba Group that is applicable to a wide range of big data scenarios. It provides log collection, intelligent query and analysis, consumption, and shipping functions to help you process and analyze large numbers of logs. Its log shipping feature supports a variety of log formats on the data source side and reliably ships logs to specified destinations.

  • OSS is a cost-effective storage service that lets you retain your logs for as long as you want.

  • DLA provides powerful log analysis capabilities, adopts a serverless architecture, and charges fees based on the volume of data scanned. You can partition logs shipped to OSS by year, month, or day to improve the partition hit rate and reduce the amount of data scanned. In this way, a large amount of historical log data can be analyzed quickly at low cost.

Architecture

For example, assume that a service is deployed in an Elastic Compute Service (ECS) cluster and that each host in the cluster records access information in a log file named access.log. You want to extract the information from the access.log files and store it in OSS. This topic provides step-by-step guidance for this scenario.

Prerequisites

Before starting the procedure, ensure that you have completed the following operations:

Procedure

Step 1: Use Logtail to collect ECS logs

For more information, see Use Logtail to collect ECS logs.

Based on the characteristics of log files in this example, configure Logtail as follows.


If you select Full Regex Mode, you must enter a complete regular expression.
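
For example, a minimal sketch: assume each access.log line uses a hypothetical pipe-separated layout (this format is an assumption for illustration only; adapt the expression to your actual log layout). The regular expression must match the entire line, with one capture group per extracted key (content, client, process_id, start_time, total_time, status, original_sql, rewritten_sql), matching the columns of the DLA tables created later:

  Sample log line (hypothetical):
  GET /api/query|10.0.0.1|12345|2019-10-18T10:30:00|120|200|SELECT * FROM t|SELECT * FROM t_v2

  Full-mode regular expression (one capture group per key):
  ([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|(.*)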

Step 2: Ship logs to OSS

For more information, see Ship logs to OSS and Parquet configuration for log shipping to OSS.

On the OSS Shipper page, configure related parameters.


Parameter description:

  • Set OSS Bucket and OSS Prefix to specify the OSS directory to which logs are shipped.

  • Modify Partition Format to add the partition column names in the format of Partition column name=Partition column value.

    As shown in the figure, modify the default partition format. Change the level-1 partition column name to year and column value to %Y, the level-2 partition column name to month and column value to %m, and the level-3 partition column name to day and column value to %d.

  • Set Storage Format to parquet.

  • Set Compression to Compress (snappy). Compressing data with the snappy algorithm reduces OSS bucket storage usage.
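
With the partition columns set as above, the partition format takes a form such as year=%Y/month=%m/day=%d, so shipped objects are written into date-partitioned directories. A sketch of the resulting layout (the object file name is a placeholder; the shipper generates its own file names):

  oss://myappbucket/sls_parquet/year=2019/month=10/day=18/<file>.parquet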

After the logs are shipped to OSS, you can use DLA to read and analyze the logs in OSS.

Step 3: Create a connection to OSS in DLA

Log on to the DLA console. In the endpoint list, locate the row that contains the target endpoint and click Log On to DMS Console in the Actions column to create a connection to OSS in DLA. Use the following syntax to create the connection:

  CREATE SCHEMA oss_log_schema WITH DBPROPERTIES(
    catalog = 'oss',
    location = 'oss://myappbucket/sls_parquet/'
  );

location: the directory in the OSS bucket where log files are stored. The value must end with a slash (/). In this example, myappbucket is the OSS bucket name.
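
After the schema is created, you can switch to it before creating tables. A minimal sketch, assuming you are connected through DLA's MySQL-compatible endpoint (for example, in the DMS console):

  -- Assumes a MySQL-compatible DLA connection; standard statements are used here.
  USE oss_log_schema;   -- switch to the schema created above
  SHOW TABLES;          -- list tables in the schema (empty until Step 4)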

Step 4: Create an external table (partitioned table) in DLA that points to the OSS files

  CREATE EXTERNAL TABLE sls_parquet (
    content STRING,
    client STRING,
    process_id STRING,
    start_time STRING,
    total_time STRING,
    status STRING,
    original_sql STRING,
    rewritten_sql STRING
  ) PARTITIONED BY (year STRING, month STRING, day STRING)
  STORED AS PARQUET
  LOCATION 'oss://myappbucket/sls_parquet/';

Note:

  • The column names in the new table must be consistent with the column names in the generated Parquet files.

  • The names and sequence of partition columns must be consistent with those of the partition columns set in Step 2: Ship logs to OSS. For more information about how to create partition tables, see Use DLA to create an OSS partition table.

Step 5: Run the MSCK command to update partition information

After the external table is created, run the MSCK REPAIR TABLE command to synchronize partition information to DLA. The MSCK command can only identify directories that follow the partition naming rule Partition column name=Partition column value.

  MSCK REPAIR TABLE sls_parquet;
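
If you prefer to register a single partition manually, or if a directory does not follow the naming rule, you can add it explicitly. A sketch, assuming DLA supports Hive-style ALTER TABLE ... ADD PARTITION DDL (the date values are placeholders):

  -- Register one day's directory as a partition explicitly.
  -- Assumes Hive-style ADD PARTITION DDL is accepted; date values are placeholders.
  ALTER TABLE sls_parquet
  ADD IF NOT EXISTS PARTITION (year = '2019', month = '10', day = '18')
  LOCATION 'oss://myappbucket/sls_parquet/year=2019/month=10/day=18/';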

Step 6: Query data in the partition table

After partition information is synchronized, use SELECT statements to query and analyze the logs. For example, the following statement returns the five slowest statements, that is, those with the longest total_time:

  SELECT original_sql, total_time
  FROM sls_parquet
  WHERE client != ''
  ORDER BY total_time DESC
  LIMIT 5;
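
To restrict the analysis to a specific day, add predicates on the partition columns. Because the table is partitioned by year, month, and day, DLA only scans that day's data, which reduces both query time and cost. A sketch with placeholder date values:

  -- Partition predicates prune the scan to a single day's data; date values are placeholders.
  SELECT original_sql, total_time
  FROM sls_parquet
  WHERE year = '2019' AND month = '10' AND day = '18'
    AND client != ''
  ORDER BY total_time DESC
  LIMIT 5;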

Subsequent operations

In the preceding example, logs shipped to OSS are stored in Parquet format. Log Service also supports shipping files in JSON and CSV formats. For more information about the configuration, see JSON storage and CSV storage.

JSON format

  • If the shipping format is set to JSON and the files are not compressed, create the external table using the following statement:

    CREATE EXTERNAL TABLE sls_json (
      content STRING,
      client STRING,
      process_id STRING,
      start_time STRING,
      total_time STRING,
      status STRING,
      original_sql STRING,
      rewritten_sql STRING
    ) PARTITIONED BY (year STRING, month STRING, day STRING)
    STORED AS JSON
    LOCATION 'oss://myappbucket/sls_json/';
  • If the shipping format is set to JSON and the files are compressed with the standard snappy algorithm, create the external table using the following statement:

    CREATE EXTERNAL TABLE sls_json_snappy (
      content STRING,
      client STRING,
      process_id STRING,
      start_time STRING,
      total_time STRING,
      status STRING,
      original_sql STRING,
      rewritten_sql STRING
    ) PARTITIONED BY (year STRING, month STRING, day STRING)
    STORED AS JSON
    LOCATION 'oss://myappbucket/sls_json_snappy/'
    TBLPROPERTIES(
      'text.compression' = 'snappy',
      'io.compression.snappy.native' = 'true'
    );

CSV format

  • If the shipping format is set to CSV without headers and the files are compressed with the standard snappy algorithm, create the external table using the following statement:

    CREATE EXTERNAL TABLE sls_csv_snappy (
      content STRING,
      client STRING,
      process_id STRING,
      start_time STRING,
      total_time STRING,
      status STRING,
      original_sql STRING,
      rewritten_sql STRING
    ) PARTITIONED BY (year STRING, month STRING, day STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES(
      'separatorChar' = ',',
      'quoteChar' = '"',
      'escapeChar' = '\\'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://myappbucket/sls_csv_snappy/'
    TBLPROPERTIES(
      'text.compression' = 'snappy',
      'io.compression.snappy.native' = 'true',
      'skip.header.line.count' = '0'
    );


  • If the shipping format is set to CSV with headers and the files are not compressed, create the external table using the following statement:

    CREATE EXTERNAL TABLE sls_csv (
      content STRING,
      client STRING,
      process_id STRING,
      start_time STRING,
      total_time STRING,
      status STRING,
      original_sql STRING,
      rewritten_sql STRING
    ) PARTITIONED BY (year STRING, month STRING, day STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES(
      'separatorChar' = ',',
      'quoteChar' = '"',
      'escapeChar' = '\\'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://myappbucket/sls_csv/'
    TBLPROPERTIES(
      'skip.header.line.count' = '1'
    );
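
Regardless of the storage format, the workflow after creating the external table is the same as in steps 5 and 6: synchronize partition information and then query the data. A brief sketch using the sls_csv table as an example (the partition values and the aggregation are for illustration only):

  -- Synchronize partitions for the CSV table, then run an example query; partition values are placeholders.
  MSCK REPAIR TABLE sls_csv;

  SELECT status, COUNT(*) AS requests
  FROM sls_csv
  WHERE year = '2019' AND month = '10'
  GROUP BY status;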