Log Service - Data Transformation Practices for Resolving Nginx Logs

This article describes the data transformation feature of Alibaba Cloud Log Service by using Nginx logs as an example.

Overview of Data Transformation

Data transformation is a service provided by Alibaba Cloud Log Service to extract, transform, and load (ETL) log data. It supports data transformation, filtering, distribution, and enrichment.

The data transformation service is integrated into Log Service.

The following figure shows the common scenarios supported by the data transformation service.

[Figure: common scenarios supported by the data transformation service]

Data Distribution

1.  Data standardization (one-to-one)

2.  Data distribution (one-to-many)

In the following sections, we use the parsing of Nginx logs as an example to help you quickly get started with data transformation in Alibaba Cloud Log Service.

Parsing an Nginx Log

Assume that we have collected the default Nginx log in simple mode. The default Nginx log is in the following format:

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

The following figure shows the log on a server.

[Figure: raw Nginx access log on the server]

The following figure shows the log collected by Alibaba Cloud Log Service in simple mode. In this mode, each raw log line is stored in the content field.

[Figure: the log as collected by Log Service in simple mode]

Enable Data Transformation in the Console

Log on to the console and enable Data Transformation. Enter domain-specific language (DSL) statements in the text box and click Preview Data to preview the data transformation result.

Extract Fields by Using Regex

Extract fields from Nginx logs by using regex. The name of each named capture group in the regex becomes the name of the extracted field.

e_regex("Source field name", "Regex or named capture regex", "Target field name or array (optional)", mode="fill-auto")

We recommend that you test the regex with an online tool such as https://regex101.com/

DSL statement used:

e_regex("content",'(?<remote_addr>[0-9:\.]*) - (?<remote_user>[a-zA-Z0-9\-_]*) \[(?<local_time>[a-zA-Z0-9\/ :\-\+]*)\] "(?<request>[^"]*)" (?<status>[0-9]*) (?<body_bytes_sent>[0-9\-]*) "(?<refer>[^"]*)" "(?<http_user_agent>[^"]*)"')
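To see what the named capture groups produce, the same extraction can be approximated in plain Python. This is only a sketch and not Log Service DSL: the sample log line and the parse_nginx_line helper are made up for illustration, Python's re module requires the (?P<name>...) spelling for named groups, and the time character class is assumed to include "+" so that offsets such as +0800 match.

```python
import re

# Same field layout as the DSL regex, written with Python's (?P<name>...) syntax.
NGINX_PATTERN = re.compile(
    r'(?P<remote_addr>[0-9:.]*) - (?P<remote_user>[a-zA-Z0-9\-_]*) '
    r'\[(?P<local_time>[a-zA-Z0-9/ :+\-]*)\] "(?P<request>[^"]*)" '
    r'(?P<status>[0-9]*) (?P<body_bytes_sent>[0-9\-]*) '
    r'"(?P<refer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

def parse_nginx_line(line):
    """Return a dict of named fields, or an empty dict if the line does not match."""
    match = NGINX_PATTERN.search(line)
    return match.groupdict() if match else {}

# Hypothetical sample line in the default "main" format.
sample = ('203.0.113.7 - - [10/Dec/2019:10:00:00 +0800] '
          '"GET /index.html?a=1 HTTP/1.1" 404 153 "-" "curl/7.64.1" "-"')
fields = parse_nginx_line(sample)
print(fields["remote_addr"], fields["status"], fields["request"])
```

Each key of the returned dict corresponds to one field that the DSL statement would add to the log event.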

Processing the Time Field

The default local time format is not easy to read, so we convert it to a more readable format.

Functions used:

e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")

dt_strftime("Date and time expression", "Format string")

dt_strptime('Value such as v("Field name")', "Format string")

DSL statement used:

e_set("local_time", dt_strftime(dt_strptime(v("local_time"),"%d/%b/%Y:%H:%M:%S %z"),"%Y-%m-%d %H:%M:%S"))
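The format directives in this statement follow the familiar strptime/strftime conventions, so the conversion can be tried out locally with Python's datetime module. The sketch below is an approximation for illustration only; reformat_local_time is a hypothetical helper and the timestamps are invented.

```python
from datetime import datetime

def reformat_local_time(value):
    """Parse an Nginx $time_local value and render it in a more readable form."""
    parsed = datetime.strptime(value, "%d/%b/%Y:%H:%M:%S %z")
    # The UTC offset is parsed but not printed, because the output format omits %z.
    return parsed.strftime("%Y-%m-%d %H:%M:%S")

print(reformat_local_time("10/Dec/2019:10:00:00 +0800"))  # 2019-12-10 10:00:00
```

Note that the output keeps the wall-clock time of the original log and drops the time zone offset. If the offset matters downstream, append %z to the output format.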

Parsing the Request URI

Next, we want to parse the request field. We can see that the request field consists of the HTTP method, the URI, and the HTTP version.

We can use the following functions for implementation:

e_regex("Source field name", "Regex or named capture regex", "Target field name or array (optional)", mode="fill-auto")

# Decode the URI
url_decoding('Value such as v("Field name")')

# Set the field value
e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")

e_kv extracts the key-value pairs from the request URI.

DSL statement used:

e_regex("request", "(?<request_method>[^\s]*) (?<request_uri>[^\s]*) (?<http_version>[^\s]*)")
e_set("request_uri", url_decoding(v("request_uri")))
e_kv("request_uri")
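The three steps (split the request, decode the URI, extract key-value pairs) can be mimicked with Python's urllib.parse. This is a rough equivalent rather than the DSL's actual implementation: parse_request is a hypothetical helper, the request line is invented, and the sketch assumes a well-formed request with exactly three space-separated parts.

```python
from urllib.parse import parse_qsl, unquote, urlsplit

def parse_request(request):
    """Split a request line into method, decoded URI, HTTP version, and query pairs."""
    method, raw_uri, http_version = request.split(" ", 2)
    fields = {
        "request_method": method,
        "request_uri": unquote(raw_uri),
        "http_version": http_version,
    }
    # parse_qsl decodes each key and value itself, so it is applied to the raw query string.
    fields.update(parse_qsl(urlsplit(raw_uri).query))
    return fields

print(parse_request("GET /search?q=nginx%20logs&page=2 HTTP/1.1"))
```

Unlike the DSL sequence, this sketch extracts the key-value pairs before decoding, which avoids splitting on an encoded "&" or "=" inside a value.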

Result:

[Figure: preview of the transformation result]

Mappings of HTTP Status Codes

If we want to map an HTTP status code to a description, such as mapping "404" to "Not Found", we can use the e_dict_map function.

e_dict_map("Dictionary such as {'k1':'v1', 'k2':'v2'}", "Source field regex or list", "Target field name")

If the field value matches no key in the dictionary, the value of the wildcard key (*) is used.

e_dict_map({'200':'OK',
            '304':'304 Not Modified',
            '400':'Bad Request',
            '401':'Unauthorized',
            '403':'Forbidden',
            '404':'Not Found',
            '500':'Internal Server Error',
            '*':'unknown'}, "status", "status_desc")
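The wildcard fallback behaves much like a dictionary lookup with a default value. A minimal Python sketch of that idea follows; describe_status is a hypothetical helper, and only a few of the codes are listed to keep it short.

```python
# A subset of the mapping used in the DSL statement; "*" is the fallback entry.
STATUS_DESCRIPTIONS = {
    "200": "OK",
    "404": "Not Found",
    "500": "Internal Server Error",
    "*": "unknown",
}

def describe_status(status):
    """Return the description for a status code, or the "*" entry if none matches."""
    return STATUS_DESCRIPTIONS.get(status, STATUS_DESCRIPTIONS["*"])

print(describe_status("404"))  # Not Found
print(describe_status("418"))  # unknown
```

The keys are strings because the status field extracted from the log is text, not a number.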

Result:

[Figure: preview of the transformation result]

Identify the Operating System of a Client by Using User Agent

If we want to know the operating system of a client, we can look for operating system keywords from the User-Agent string by using regex matching. The following functions are used:

e_switch("Condition 1 such as e_match(...)", "Operation 1 such as e_regex(...)", "Condition 2", "Operation 2", ..., default="Optional operation upon no match")

regex_match('Value such as v("Field name")', r"Regex", full=False)

e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")

DSL statement used:

e_switch(regex_match(v("content"), "Mac"), e_set("os", "osx"),
         regex_match(v("content"), "Linux"), e_set("os", "linux"),
         regex_match(v("content"), "Windows"), e_set("os", "windows"),
         default=e_set("os", "unknown")
)
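The switch evaluates its conditions in order and applies the operation of the first one that matches. The Python sketch below illustrates that first-match-wins logic; detect_os and the sample User-Agent strings are hypothetical, and it assumes that a partial match (full=False) behaves like re.search.

```python
import re

# Ordered (pattern, label) pairs; the first matching pattern decides the result.
OS_RULES = [
    ("Mac", "osx"),
    ("Linux", "linux"),
    ("Windows", "windows"),
]

def detect_os(text):
    """Return the label of the first rule whose pattern occurs in the text."""
    for pattern, os_name in OS_RULES:
        if re.search(pattern, text):
            return os_name
    return "unknown"

print(detect_os("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"))  # osx
print(detect_os("curl/7.64.1"))                                      # unknown
```

Because the rules are checked in order, a more specific pattern should be placed before a more general one. For example, Android User-Agent strings usually contain "Linux" and would be labeled linux here.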

Result:

[Figure: preview of the transformation result]

Ship 4xx Logs to a Specified Logstore

We can use the e_output function to ship logs and use the regex_match function to match fields.

regex_match('Value such as v("Field name")', r"Regex", full=False)

e_output(name=None, project=None, logstore=None, topic=None, source=None, tags=None)

e_if("Condition 1 such as e_match(...)", "Operation 1 such as e_regex(...)", "Condition 2", "Operation 2", ...)

DSL statement used:

e_if(regex_match(v("status"), "^4.*"),
     e_output(name="logstore_4xx",
              project="dashboard-demo",
              logstore="dsl-nginx-out-4xx"))
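Conceptually, this statement routes each event by its status code. The following Python sketch simulates that routing in memory. It is illustrative only: route_events and the "default" target name are hypothetical, and the sketch assumes that an event shipped by e_output goes only to the named target and is not written to the default one.

```python
import re
from collections import defaultdict

def route_events(events):
    """Group events by output target: 4xx statuses go to "logstore_4xx"."""
    outputs = defaultdict(list)
    for event in events:
        if re.search(r"^4", event.get("status", "")):
            outputs["logstore_4xx"].append(event)
        else:
            outputs["default"].append(event)
    return outputs

events = [{"status": "200"}, {"status": "404"}, {"status": "403"}, {"status": "500"}]
routed = route_events(events)
print(len(routed["logstore_4xx"]), len(routed["default"]))  # 2 2
```

In the real service, the target name is resolved to a project and Logstore through the storage target configuration that is set when the transformation is saved.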

We can see the result in the preview. When we save the transformation result, we need to set the AccessKey information of the corresponding project and Logstore.

Complete DSL Code and Publishing Process

Complete DSL code

# Extract general fields
e_regex("content",'(?<remote_addr>[0-9:\.]*) - (?<remote_user>[a-zA-Z0-9\-_]*) \[(?<local_time>[a-zA-Z0-9\/ :\-\+]*)\] "(?<request>[^"]*)" (?<status>[0-9]*) (?<body_bytes_sent>[0-9\-]*) "(?<refer>[^"]*)" "(?<http_user_agent>[^"]*)"')

# Set the local time
e_set("local_time", dt_strftime(dt_strptime(v("local_time"),"%d/%b/%Y:%H:%M:%S %z"),"%Y-%m-%d %H:%M:%S"))

# Extract the URI field
e_regex("request", "(?<request_method>[^\s]*) (?<request_uri>[^\s]*) (?<http_version>[^\s]*)")
e_set("request_uri", url_decoding(v("request_uri")))
e_kv("request_uri")

# Map the HTTP code
e_dict_map({'200':'OK',
            '304':'304 Not Modified',
            '400':'Bad Request',
            '401':'Unauthorized',
            '403':'Forbidden',
            '404':'Not Found',
            '500':'Internal Server Error',
            '*':'unknown'}, "status", "status_desc")

# Identify the User Agent field
e_switch(regex_match(v("content"), "Mac"), e_set("os", "osx"),
         regex_match(v("content"), "Linux"), e_set("os", "linux"),
         regex_match(v("content"), "Windows"), e_set("os", "windows"),
         default=e_set("os", "unknown")
)

# Ship the log to a specified Logstore
e_if(regex_match(v("status"), "^4.*"),
     e_output(name="logstore_4xx", project="dashboard-demo", logstore="dsl-nginx-out-4xx"))

After we submit the code on the page, save the transformation result.

Configure the destination Logstore. If the e_output function is used, we need to specify the destination storage name, project, and Logstore, which must be the same as those in the code.

After we save the transformation result, the data is published. We can find the task under Data Transformation > Data Transformation. After we click the task name, we can find information such as the transformation delay.

If we need to modify the task, we can also click the task name and modify it on the page that appears.

Teddy.Sun

2 posts | 0 followers

You may also like

Comments

Teddy.Sun

2 posts | 0 followers

Related Products