This practice case compares multiple solutions for parsing NGINX logs to help you choose the most efficient one. It focuses on parsing logs by using regular expressions.

Parse NGINX success logs

The following section uses an NGINX log as an example to describe multiple solutions for parsing NGINX logs.
203.208.xx.xx - - [04/Jan/2019:16:06:38 +0800] "GET /atom.xml HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Parsing requirements
  1. Extract the clientip, ident, auth, timestamp, verb, request, httpversion, response, bytes, referrer, and agent fields from the NGINX log.
  2. Extract the uri_proto, uri_domain, and uri_param fields from the request field.
  3. Extract the uri_path and uri_query fields from the uri_param field.
The raw log collected in the console is in String format, as shown below:
__source__:  30.43.xx.xx
__tag__:__client_ip__:  12.120.xx.xx
__tag__:__receive_time__:  1563443076
content: 203.208.xx.xx - - [04/Jan/2019:16:06:38 +0800] "GET http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
LOG DSL orchestration
  • Solution 1: use a regular expression for log parsing
    1. Transform the NGINX log as follows to meet requirement 1:
      e_regex("content",r'(?P<ip>\d+\.\d+\.\d+\.\d+)( - - \[)(?P<datetime>[\s\S]+)\] \"(?P<verb>[A-Z]+) (?P<request>[\S]*) (?P<protocol>[\S]+)["] (?P<code>\d+) (?P<sendbytes>\d+) ["](?P<refere>[\S]*)["] ["](?P<useragent>[\S\s]+)["]')
      The log after processing is as follows:
      ip: 203.208.xx.xx
      datetime: 04/Jan/2019:16:06:38 +0800
      verb: GET
      request: http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0
      protocol: HTTP/1.1
      code: 200
      sendbytes: 273932
      refere: -
      useragent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    2. Transform the request field obtained in step 1 as follows to meet requirement 2:
      e_regex('request',r'(?P<uri_proto>(\w+)):\/\/(?P<uri_domain>[a-z0-9.]*[^\/])(?P<uri_param>(.+)$)')
      The log after processing is as follows:
      uri_proto: http
      uri_domain: cdn1cdedge0001.coxlab.net
      uri_param: /_astats?application=&inf.name=eth0
    3. Transform the uri_param field obtained in step 2 as follows to meet requirement 3:
      e_regex('uri_param',r'(?P<uri_path>\/\_[a-z]+[^?])\?(?P<uri_query>(.+)$)')
      The log after processing is as follows:
      uri_path: /_astats
      uri_query: application=&inf.name=eth0
    4. To sum up, use the following LOG domain specific language (DSL) rules:
      """Step 1: Parse the NGINX log."""
      e_regex("content",r'(?P<ip>\d+\.\d+\.\d+\.\d+)( - - \[)(?P<datetime>[\s\S]+)\] \"(?P<verb>[A-Z]+) (?P<request>[\S]*) (?P<protocol>[\S]+)["] (?P<code>\d+) (?P<sendbytes>\d+) ["](?P<refere>[\S]*)["] ["](?P<useragent>[\S\s]+)["]')
      """Step 2: Parse the request field obtained in step 1."""
      e_regex('request',r'(?P<uri_proto>(\w+)):\/\/(?P<uri_domain>[a-z0-9.]*[^\/])(?P<uri_param>(.+)$)')
      """Step 3: Parse the uri_param field obtained in step 2."""
      e_regex('uri_param',r'(?P<uri_path>\/\_[a-z]+[^?])\?(?P<uri_query>(.+)$)')
      The log after processing is as follows:
      __source__:  30.43.xx.xx
      __tag__:__client_ip__:  12.120.xx.xx
      __tag__:__receive_time__:  1563443076
      content: 203.208.xx.xx - - [04/Jan/2019:16:06:38 +0800] "GET http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
      ip: 203.208.xx.xx
      datetime: 04/Jan/2019:16:06:38 +0800
      verb: GET
      request: http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0
      protocol: HTTP/1.1
      code: 200
      sendbytes: 273932
      refere: -
      useragent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      uri_proto: http
      uri_domain: cdn1cdedge0001.coxlab.net
      uri_param: /_astats?application=&inf.name=eth0
      uri_path: /_astats
      uri_query: application=&inf.name=eth0
  • Solution 2: use the Grok method for log parsing
    To use the Grok method to parse the NGINX log, you only need to use the COMBINEDAPACHELOG pattern.
    Pattern: COMMONAPACHELOG
    Rule: %{IPORHOST:clientip} %{HTTPDUSER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
    Description: Parses the clientip, ident, auth, timestamp, verb, request, httpversion, response, and bytes fields.

    Pattern: COMBINEDAPACHELOG
    Rule: %{COMMONAPACHELOG} %{QS:referrer} %{QS:agent}
    Description: Parses the referrer and agent fields, in addition to all the fields parsed by the COMMONAPACHELOG pattern.

    1. Transform the NGINX log as follows to meet requirement 1:
      e_regex('content',grok('%{COMBINEDAPACHELOG}'))
      The log after processing is as follows:
      clientip: 203.208.xx.xx
      ident: -
      auth: -
      timestamp: 04/Jan/2019:16:06:38 +0800
      verb: GET
      request: http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0
      httpversion: 1.1
      response: 200
      bytes: 273932
      referrer: "-"
      agent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    2. To parse the request field by using the Grok method, combine the patterns described in the following table as needed.
      Pattern: URIPROTO
      Rule: [A-Za-z]+(\+[A-Za-z+]+)?
      Description: Matches the scheme of a URI. For example, this pattern matches http in http://hostname.domain.tld/_astats?application=&inf.name=eth0.

      Pattern: USER
      Rule: [a-zA-Z0-9._-]+
      Description: Matches content that consists of letters, digits, periods (.), underscores (_), and hyphens (-).

      Pattern: URIHOST
      Rule: %{IPORHOST}(?::%
      Description: Matches IP addresses, hostnames, or positive integers.

      Pattern: URIPATHPARAM
      Rule: %{URIPATH}(?:%{URIPARAM})?
      Description: Matches the uri_param field.

      Transform the request field obtained in step 1 as follows to meet requirement 2:
      e_regex('request',grok("%{URIPROTO:uri_proto}://(?:%{USER:user}(?::[^@]*)?@)?(?:%{URIHOST:uri_domain})?(?:%{URIPATHPARAM:uri_param})?"))
      The log after processing is as follows:
      uri_proto: http
      uri_domain: cdn1cdedge0001.coxlab.net
      uri_param: /_astats?application=&inf.name=eth0
    3. Use the Grok pattern described in the following table to parse the uri_param field.
      Pattern: GREEDYDATA
      Rule: .*
      Description: Matches content without line breaks.
      Transform the uri_param field obtained in step 2 as follows to meet requirement 3:
      e_regex('uri_param',grok("%{GREEDYDATA:uri_path}\?%{GREEDYDATA:uri_query}"))
      The log after processing is as follows:
      uri_path: /_astats
      uri_query: application=&inf.name=eth0
    4. To sum up, use the following LOG DSL rules:
      """Step 1: Parse the NGINX log."""
      e_regex('content',grok('%{COMBINEDAPACHELOG}'))
      """Step 2: Parse the request field obtained in step 1."""
      e_regex('request',grok("%{URIPROTO:uri_proto}://(?:%{USER:user}(?::[^@]*)?@)?(?:%{URIHOST:uri_domain})?(?:%{URIPATHPARAM:uri_param})?"))
      """Step 3: Parse the uri_param field obtained in step 2."""
      e_regex('uri_param',grok("%{GREEDYDATA:uri_path}\?%{GREEDYDATA:uri_query}"))
      The log after processing is as follows:
      __source__:  30.43.xx.xx
      __tag__:__client_ip__:  12.120.xx.xx
      __tag__:__receive_time__:  1563443076
      content: 203.208.xx.xx - - [04/Jan/2019:16:06:38 +0800] "GET http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
      clientip: 203.208.xx.xx
      ident: -
      auth: -
      timestamp: 04/Jan/2019:16:06:38 +0800
      verb: GET
      request: http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0
      httpversion: 1.1
      response: 200
      bytes: 273932
      referrer: "-"
      agent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
      uri_proto: http
      uri_domain: cdn1cdedge0001.coxlab.net
      uri_param: /_astats?application=&inf.name=eth0
      uri_path: /_astats
      uri_query: application=&inf.name=eth0
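
The three steps of Solution 1 can be tried outside Log Service with Python's standard `re` module. The following is a minimal sketch under stated assumptions: `parse` and `STEPS` are hypothetical helper names, `re.search` only approximates how `e_regex` applies a pattern, and the sample line replaces the masked IP address with a placeholder so that `\d+` can match it.

```python
import re

# Hypothetical sample line: the log above with the masked IP (203.208.xx.xx)
# replaced by a placeholder address so that \d+ can match it.
LINE = (
    '203.208.0.1 - - [04/Jan/2019:16:06:38 +0800] '
    '"GET http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0 HTTP/1.1" '
    '200 273932 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

# (source field, pattern) pairs that mirror steps 1 to 3 of Solution 1.
STEPS = [
    ('content', r'(?P<ip>\d+\.\d+\.\d+\.\d+)( - - \[)(?P<datetime>[\s\S]+)\] \"(?P<verb>[A-Z]+) (?P<request>[\S]*) (?P<protocol>[\S]+)["] (?P<code>\d+) (?P<sendbytes>\d+) ["](?P<refere>[\S]*)["] ["](?P<useragent>[\S\s]+)["]'),
    ('request', r'(?P<uri_proto>(\w+)):\/\/(?P<uri_domain>[a-z0-9.]*[^\/])(?P<uri_param>(.+)$)'),
    ('uri_param', r'(?P<uri_path>\/\_[a-z]+[^?])\?(?P<uri_query>(.+)$)'),
]


def parse(content):
    """Apply each step to its source field and merge the named groups."""
    event = {'content': content}
    for source, pattern in STEPS:
        match = re.search(pattern, event.get(source, ''))
        if match:
            event.update(match.groupdict())
    return event


event = parse(LINE)
for name in ('ip', 'verb', 'code', 'uri_proto', 'uri_domain', 'uri_path', 'uri_query'):
    print(f'{name}: {event[name]}')
```

Each step reads a field produced by the previous one, which is why step 2 must name the request field that step 1 actually emits.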

Solution comparison

The following section compares the advantages and disadvantages of the two NGINX log parsing methods.
  • Use a regular expression
    For developers who are not familiar with regular expressions, writing a regular expression to parse logs is slow and has a high learning cost. This method is also inflexible. For example, assume that the user twiss is added to the request field as follows:
    http://twiss@cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0
    Use the following regular expression:
    (?P<uri_proto>(\w+)):\/\/(?P<uri_domain>[a-z0-9.]*[^\/])(?P<uri_param>(.+)$)
    The request field is parsed as follows:
    uri_proto: http
    uri_domain: twiss@
    uri_param: cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0

    The original regular expression assigns the user information to uri_domain and the hostname to uri_param, so the result does not meet the requirements. You must rewrite the regular expression to parse such logs correctly, and adapting a regular expression to each new variation is not easy.

  • Use the Grok method

    The Grok method has a low learning cost. You can easily parse logs as long as you understand the field types that each Grok pattern matches. For more information, see Grok patterns.

    The Grok method is flexible. Let us go back to the previous example in the regular expression method. The request field is as follows:
    http://twiss@cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0
    The Grok pattern remains unchanged.
    e_regex('request',grok("%{URIPROTO:uri_proto}://(?:%{USER:user}(?::[^@]*)?@)?(?:%{URIHOST:uri_domain})?(?:%{URIPATHPARAM:uri_param})?"))
    The request field is parsed as follows:
    uri_proto: http
    user: twiss
    uri_domain: cdn1cdedge0001.coxlab.net
    uri_param: /_astats?application=&inf.name=eth0

    You can correctly parse the log content by using the same Grok pattern when the user information is added to the request field.
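
To make the comparison concrete, the following sketch reproduces the regular expression result for the twiss URL with Python's `re` module and shows one possible rewrite. `REVISED` is a hypothetical pattern written for this illustration, not a rule from Log Service. It adds an optional user group modelled on the one in the Grok rule.

```python
import re

URL = 'http://twiss@cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0'
URL_WITHOUT_USER = 'http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0'

# Regular expression from step 2 of Solution 1.
ORIGINAL = r'(?P<uri_proto>(\w+)):\/\/(?P<uri_domain>[a-z0-9.]*[^\/])(?P<uri_param>(.+)$)'

# Hypothetical rewrite: an optional "user[:password]@" group before the host.
REVISED = (
    r'(?P<uri_proto>\w+)://'
    r'(?:(?P<user>[a-zA-Z0-9._-]+)(?::[^@]*)?@)?'
    r'(?P<uri_domain>[^/]+)'
    r'(?P<uri_param>.+)$'
)

before = re.search(ORIGINAL, URL).groupdict()
after = re.search(REVISED, URL).groupdict()
plain = re.search(REVISED, URL_WITHOUT_USER).groupdict()

print(before['uri_domain'])                 # twiss@
print(after['user'], after['uri_domain'])   # twiss cdn1cdedge0001.coxlab.net
print(plain['user'], plain['uri_domain'])   # None cdn1cdedge0001.coxlab.net
```

The rewrite handles both URLs, but it had to be designed and tested by hand, which is the maintenance cost that the Grok rule avoids.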

Conclusion:

The Grok method is superior to the regular expression method in terms of flexibility, efficiency, cost, and learning curve. Currently, 400 Grok patterns are available for data transformation. We recommend that you try the Grok method first. Grok patterns are themselves regular expressions. If necessary, you can use Grok patterns and regular expressions together, or even write your own regular expressions.
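
The statement that Grok patterns are regular expressions can be illustrated with a small expander that rewrites `%{NAME:field}` tokens into named groups. This is a simplified sketch, not the implementation behind the `grok` function: `expand`, `PATTERNS`, and `TOKEN` are hypothetical names, and the pattern bodies are reduced stand-ins rather than the full definitions.

```python
import re

# Simplified stand-ins for a few Grok patterns. These are assumptions for
# illustration, not the full definitions that Log Service provides.
PATTERNS = {
    'URIPROTO': r'[A-Za-z]+(?:\+[A-Za-z+]+)?',
    'USER': r'[a-zA-Z0-9._-]+',
    'HOSTNAME': r'[0-9A-Za-z][0-9A-Za-z.-]*',
    'GREEDYDATA': r'.*',
}

# Matches %{NAME} and %{NAME:field}.
TOKEN = re.compile(r'%\{(\w+)(?::(\w+))?\}')


def expand(expression):
    """Turn %{NAME} and %{NAME:field} tokens into plain regex groups."""
    def replace(match):
        name, field = match.groups()
        body = expand(PATTERNS[name])  # a pattern may reference other patterns
        return f'(?P<{field}>{body})' if field else f'(?:{body})'
    return TOKEN.sub(replace, expression)


rule = r'%{URIPROTO:uri_proto}://(?:%{USER:user}@)?%{HOSTNAME:uri_domain}%{GREEDYDATA:uri_param}'
regex = expand(rule)
fields = re.match(
    regex,
    'http://twiss@cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0',
).groupdict()
print(regex)
print(fields)
```

Because the expanded result is an ordinary regular expression, Grok tokens and hand-written fragments such as `(?:...@)?` can be mixed freely in one rule.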

Parse NGINX error logs

The following section uses an NGINX error log as an example to describe how to parse NGINX error logs by using the Grok method.

Requirement: Parse the NGINX error log into separate fields.

The raw log collected in the console is in String format, as shown below:
__source__:  30.43.xx.xx
__tag__:__client_ip__:  12.120.xx.xx
__tag__:__receive_time__:  1563443076
content: 2019/08/07 16:05:17 [error] 1234#1234: *1234567 attempt to send data on a closed socket: u:111111ddd, c:0000000000000000, ft:0 eof:0, client: 1.2.3.4, server: sls.aliyun.com, request: "GET /favicon.ico HTTP/1.1", host: "sls.aliyun.com", referrer: "https://sls.aliyun.com/question/answer/123.html?from=singlemessage"
LOG DSL orchestration:
e_regex('content',grok('%{DATESTAMP:request_time} \[%{LOGLEVEL:log_level}\] %{POSINT:pid}#%{NUMBER}: %{GREEDYDATA:errormessage}(?:, client: (?<client>%{IP}|%{HOSTNAME}))(?:, server: %{IPORHOST:server})(?:, request: "%{WORD:verb} %{NOTSPACE:request}( HTTP/%{NUMBER:http_version})")(?:, host: "%{HOSTNAME:host}")?(?:, referrer: "%{NOTSPACE:referrer}")?'))
Output log:
__source__:  30.43.xx.xx
__tag__:__client_ip__:  12.120.xx.xx
__tag__:__receive_time__:  1563443076
request_time: 19/08/07 16:05:17
log_level: error
pid: 1234
errormessage: *1234567 attempt to send data on a closed socket: u:111111ddd, c:0000000000000000, ft:0 eof:0
client: 1.2.3.4
server: sls.aliyun.com
verb: GET
request: /favicon.ico
http_version: 1.1
host: sls.aliyun.com
referrer: https://sls.aliyun.com/question/answer/123.html?from=singlemessage
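
For comparison, the same error log can be parsed with a hand-written regular expression in Python. This is a rough approximation of the Grok rule above, not an exact translation: `ERROR_LOG` is a hypothetical name, the character classes are simplified, and the `request_time` group here captures the four-digit year, so its value differs from the Grok output shown above.

```python
import re

LINE = (
    '2019/08/07 16:05:17 [error] 1234#1234: *1234567 attempt to send data on a '
    'closed socket: u:111111ddd, c:0000000000000000, ft:0 eof:0, client: 1.2.3.4, '
    'server: sls.aliyun.com, request: "GET /favicon.ico HTTP/1.1", '
    'host: "sls.aliyun.com", '
    'referrer: "https://sls.aliyun.com/question/answer/123.html?from=singlemessage"'
)

# Hand-written approximation of the Grok rule; host and referrer are optional.
ERROR_LOG = re.compile(
    r'(?P<request_time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<log_level>\w+)\] '
    r'(?P<pid>\d+)#\d+: '
    r'(?P<errormessage>.*)'
    r', client: (?P<client>[^,\s]+)'
    r', server: (?P<server>[^,\s]+)'
    r', request: "(?P<verb>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)"'
    r'(?:, host: "(?P<host>[^"]+)")?'
    r'(?:, referrer: "(?P<referrer>[^"]+)")?'
)

fields = ERROR_LOG.match(LINE).groupdict()
for name, value in fields.items():
    print(f'{name}: {value}')
```

The greedy `errormessage` group backtracks to the last `, client: ` marker, which keeps the commas inside the error message intact.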