Alibaba Cloud Content Delivery Network (CDN) is a distributed network built over the bearer network and contains edge nodes deployed in different regions. CDN offloads network traffic workloads from origin servers to prevent network congestion. You can use CDN to accelerate content delivery in different regions and scenarios. This improves the user experience of website visitors. Before you analyze log data, you can transfer CDN logs to Object Storage Service (OSS) or transfer these logs to Log Service in real time. This way, you can use DLA to directly analyze data of CDN logs that are stored in OSS. This helps you identify issues in a timely manner and resolve these issues to improve the service quality of CDN.

Prerequisites

The following operations are performed:

Step 1: Transfer CDN logs to OSS

For more information, see Use Function Compute to deliver logs.

Step 2: Create an OSS schema

In this topic, the cdn_log.log file is stored in the OSS bucket.

To create an OSS schema and a table for log files and read data from log files, you can edit SQL statements in the DMS console or use a MySQL client or MySQL command-line tool to connect to DLA and edit SQL statements.

​CREATE SCHEMA cdn_log_schema with DBPROPERTIES(
   catalog='oss',
  location = 'oss://bucket-name/cdn/'
  );​

Step 3: Create a table for the cdn_log.log file

The cdn_log.log file contains the following information:

[9/Jun/2015:01:58:09 +0800] 188.165.XX.XX - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "text/html"

The following table describes the fields in the preceding information.

Field Description
[9/Jun/2015:01:58:09 +0800] The time at which the log was generated.
188.165.XX.XX The IP address.
- The proxy IP address.
1542 The response time. Unit: milliseconds.
"-" The referer.
"GET http://www.aliyun.com/index.html" The request.
200 The HTTP code.
191 The request size. Unit: byte.
2830 The response size. Unit: byte.
MISS The cache hit status.
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" The user-agent header in the request.
"text/html" The file type.

When you create a table in DLA, you must use regular expressions to parse CDN log files. The following table describes the fields and the related regular expressions.

Field Regular expression
[9/Jun/2015:01:58:09 +0800] (-|\\[[^\\]]*\\])
188.165.XX.XX ([^ ]*)
- ([^ ]*)
1542 ([^ ]*)
"-" ([^ \"]*|\"[^\"]*\")
"GET http://www.aliyun.com/index.html" ([^ \"]*|\"[^\"]*\")
200 (-|[0-9]*)
191 (-|[0-9]*)
2830 (-|[0-9]*)
MISS ([^ ]*)
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" ([^ \"]*|\"[^\"]*\")
"text/html" ([^ \"]*|\"[^\"]*\")

A complete regular expression is (-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\").

You can execute the following SQL statement to create a table for the cdn_log.log file in DLA:

​CREATE EXTERNAL TABLE cdn_log (
  log_timestamp           VARCHAR,
  access_ip               VARCHAR,
  proxy_ip                VARCHAR,
  response_time           VARCHAR,
  referer                 VARCHAR,
  request                 VARCHAR,
  httpcode                SMALLINT,
  request_size            BIGINT,
  response_size           BIGINT,
  cache_hit_status        VARCHAR,
  ua_header               VARCHAR,
  file_type               VARCHAR
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-name/cdn/'
TBLPROPERTIES ('recursive.directories' = 'true');

Step 4: Query and analyze data of the cdn_log.log file

​SELECT * FROM cdn_log;
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| log_timestamp                | access_ip     | proxy_ip | response_time | referer | request                          | httpcode | request_size | response_size | cache_hit_status | ua_header                                                   | file_type                 |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| [18/Jun/2019:05:08:33 +0800] | 47.92.XX.XX   | -        | 777           | "-"     |                                  |      200 |          201 |          7159 | MISS             | "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))" | "text/html;charset=UTF-8" |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+​

Use the following functions of DLA to analyze data in the cdn_log.log file:

​SELECT date_parse(substring(log_timestamp, 2, length(log_timestamp) - 8), '%d/%b/%Y:%H:%i:%s') as log_timestamp,
       access_ip,
       proxy_ip,
       response_time,
       substring(referer, 2, length(referer) - 2) as referer,
       substring(request, 2, length(request) - 2) as request,
       httpcode,
       request_size,
       response_size,
       cache_hit_status,
       substring(ua_header, 2, length(ua_header) - 2) as ua_header,
       substring(file_type, 2, length(file_type) - 2) as file_type 
FROM cdn_log;
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| log_timestamp           | access_ip     | proxy_ip | response_time | referer | request                        | httpcode | request_size | response_size | cache_hit_status | ua_header                                                 | file_type               |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| 2019-06-18 05:08:33.000 | 47.92.XX.XX   | -        | 777           | -       |                                |      200 |          201 |          7159 | MISS             | Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D)) | text/html;charset=UTF-8 |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+​