You can use ingest processors to process logs before they are written to a Logstore, for example, to modify fields, parse fields, filter data, and mask data. This topic describes how to configure ingest processors and the scenarios in which they are used.
Prerequisites
A project and a Standard Logstore are created, and log collection settings are configured. For more information, see Create a project, Create a Logstore, and Data collection overview.
Scenarios
To extract the request_method, request_uri, and status fields from a raw log, perform the following steps.
Raw log
body_bytes_sent: 22646
host: www.example.com
http_protocol: HTTP/1.1
remote_addr: 192.168.31.1
remote_user: Elisa
request_length: 42450
request_method: GET
request_time: 675
request_uri: /index.html
status: 200
time_local: 2024-12-04T13:47:54+08:00
Procedure
Create an ingest processor.
Log on to the Simple Log Service console.
In the Projects section, click the project that you want to manage.
In the left-side navigation pane, go to the page that contains the Ingest Processor tab. On the Ingest Processor tab, click Create. In the Create Processor panel, configure the Processor Name, SPL, and Error Handling parameters and click OK. The following table describes the parameters.
Parameter
Description
Processor Name
The name of the ingest processor. Example: nginx-logs-text.
Description
The description of the ingest processor.
SPL
The Simple Log Service Processing Language (SPL) statement. Example:
* | project request_method, request_uri, status
For more information, see SPL instructions.
Error Handling
The action that is performed when an SPL-based data processing failure occurs. Valid values:
Retain Raw Data
Discard Raw Data
Note: In this topic, SPL-based data processing failures refer to execution failures of SPL statements, for example, failures caused by invalid input data. Failures caused by invalid SPL syntax are not included.
If data fails to be processed because of invalid SPL syntax, the raw data is retained by default.
Associate the ingest processor with a Logstore.
In the left-side navigation pane of the project that you want to manage, click Log Storage, move the pointer over the Logstore that you want to manage, and then open the Logstore Attributes page. In the upper-right corner of the Logstore Attributes page, click Modify. In edit mode, select the ingest processor that you want to associate with the Logstore from the Ingest Processor drop-down list and click Save.
Note: An associated ingest processor takes effect only for incremental logs. Approximately 1 minute is required for the ingest processor to take effect.
On the query and analysis page of the Logstore, click Search & Analyze to query the collected logs.
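As a sanity check of what the `* | project request_method, request_uri, status` statement keeps, the following Python sketch models a log entry as a dictionary and retains only the three listed fields. It is an illustration only, not part of Simple Log Service; the field names and values come from the raw log above.

```python
# Illustration only: model the effect of the SPL statement
# "* | project request_method, request_uri, status" on one log entry.
raw_log = {
    "body_bytes_sent": "22646",
    "host": "www.example.com",
    "http_protocol": "HTTP/1.1",
    "remote_addr": "192.168.31.1",
    "remote_user": "Elisa",
    "request_length": "42450",
    "request_method": "GET",
    "request_time": "675",
    "request_uri": "/index.html",
    "status": "200",
    "time_local": "2024-12-04T13:47:54+08:00",
}

# project keeps only the listed fields and drops everything else.
kept = ("request_method", "request_uri", "status")
processed = {k: v for k, v in raw_log.items() if k in kept}
print(processed)
# {'request_method': 'GET', 'request_uri': '/index.html', 'status': '200'}
```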
Other scenarios
Modify fields
You can use the following SPL instructions to manage fields: project, project-away, project-rename, and extend.
Raw log:
body_bytes_sent: 22646
host: www.example.com
http_protocol: HTTP/1.1
referer: www.example.com
remote_addr: 192.168.31.1
remote_user: Elisa
request_length: 42450
request_method: PUT
request_time: 675
request_uri: /request/path-1/file-6?query=123456
status: 200
time_local: 2024-12-04T13:47:54+08:00
Scenario
Requirement description
SPL statement
Result
Retain specific fields
Retain only the request_method, request_uri, and status fields.
* | project request_method, request_uri, status
request_method: PUT request_uri: /request/path-1/file-6?query=123456 status: 200
Retain only the following fields and rename some of them:
Rename the request_method field to method.
Rename the request_uri field to uri.
Retain the status field without renaming it.
* | project method=request_method, uri=request_uri, status
method: PUT uri: /request/path-1/file-6?query=123456 status: 200
Retain all fields whose names start with request_.
* | project -wildcard "request_*"
request_length: 42450 request_method: PUT request_time: 675 request_uri: /request/path-1/file-6?query=123456
Delete specific fields
Delete the http_protocol, referer, remote_addr, and remote_user fields.
* | project-away http_protocol, referer, remote_addr, remote_user
body_bytes_sent: 22646 host: www.example.com request_length: 42450 request_method: PUT request_time: 675 request_uri: /request/path-1/file-6?query=123456 status: 200 time_local: 2024-12-04T13:47:54+08:00
Delete all fields whose names start with request_.
* | project-away -wildcard "request_*"
body_bytes_sent: 22646 host: www.example.com http_protocol: HTTP/1.1 referer: www.example.com remote_addr: 192.168.31.1 remote_user: Elisa status: 200 time_local: 2024-12-04T13:47:54+08:00
Create fields
Create a field named app and set its value to test-app.
* | extend app='test-app'
app: test-app body_bytes_sent: 22646 host: www.example.com http_protocol: HTTP/1.1 referer: www.example.com remote_addr: 192.168.31.1 remote_user: Elisa request_length: 42450 request_method: PUT request_time: 675 request_uri: /request/path-1/file-6?query=123456 status: 200 time_local: 2024-12-04T13:47:54+08:00
Create a field named request_query and extract its value from the query string of the request_uri field.
* | extend request_query=url_extract_query(request_uri)
body_bytes_sent: 22646 host: www.example.com http_protocol: HTTP/1.1 referer: www.example.com remote_addr: 192.168.31.1 remote_user: Elisa request_length: 42450 request_method: PUT request_query: query=123456 request_time: 675 request_uri: /request/path-1/file-6?query=123456 status: 200 time_local: 2024-12-04T13:47:54+08:00
Modify field names
Rename the time_local field to time.
* | project-rename time=time_local
body_bytes_sent: 22646 host: www.example.com http_protocol: HTTP/1.1 referer: www.example.com remote_addr: 192.168.31.1 remote_user: Elisa request_length: 42450 request_method: PUT request_time: 675 request_uri: /request/path-1/file-6?query=123456 status: 200 time: 2024-12-04T13:47:54+08:00
Modify field values
Retain the path in the request_uri field and delete the query string.
* | extend request_uri=url_extract_path(request_uri)
Or
* | extend request_uri=regexp_replace(request_uri, '\?.*', '')
body_bytes_sent: 22646 host: www.example.com http_protocol: HTTP/1.1 referer: www.example.com remote_addr: 192.168.31.1 remote_user: Elisa request_length: 42450 request_method: PUT request_time: 675 request_uri: /request/path-1/file-6 status: 200 time_local: 2024-12-04T13:47:54+08:00
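The two statements above produce the same result for this input. The following Python sketch, an illustration using the standard library rather than Simple Log Service code, models both approaches:

```python
import re
from urllib.parse import urlsplit

uri = "/request/path-1/file-6?query=123456"

# Analogue of url_extract_path(request_uri): keep only the path component.
path_only = urlsplit(uri).path

# Analogue of regexp_replace(request_uri, '\?.*', ''): strip everything
# from the first "?" onward.
stripped = re.sub(r"\?.*", "", uri)

assert path_only == stripped == "/request/path-1/file-6"
```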
Parse fields
You can use the following SPL instructions, together with SQL functions such as regular expression functions and JSON functions, to parse and extract fields: parse-regexp, parse-json, and parse-csv.
Scenario
Raw data
Requirement description
SPL statement
Result
Data parsing in regex mode
content: 192.168.1.75 - David [2024-07-31T14:27:24+08:00] "PUT /request/path-0/file-8 HTTP/1.1" 819 21577 403 73895 www.example.com www.example.com "Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1"
Extract fields from an NGINX access log by using a regular expression, and discard the content field in the raw data.
* | parse-regexp content, '(\S+)\s-\s(\S+)\s\[(\S+)\]\s"(\S+)\s(\S+)\s(\S+)"\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\S+)\s(\S+)\s"(.*)"' as remote_addr, remote_user, time_local, request_method, request_uri, http_protocol, request_time, request_length, status, body_bytes_sent, host, referer, user_agent | project-away content
body_bytes_sent: 73895 host: www.example.com http_protocol: HTTP/1.1 referer: www.example.com remote_addr: 192.168.1.75 remote_user: David request_length: 21577 request_method: PUT request_time: 819 request_uri: /request/path-0/file-8 status: 403 time_local: 2024-07-31T14:27:24+08:00 user_agent: Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1
request_method: PUT request_uri: /request/path-0/file-8 status: 200
Extract the string file-8 from the request_uri field into a new field named file.
* | extend file=regexp_extract(request_uri, 'file-.*')
file: file-8 request_method: PUT request_uri: /request/path-0/file-8 status: 200
Data parsing in JSON mode
headers: {"Authorization": "bearer xxxxx", "X-Request-ID": "29bbe977-9a62-4e4a-b2f4-5cf7b65d508f"}
Parse the headers field in JSON mode and discard the headers field in the raw data.
* | parse-json headers | project-away headers
Authorization: bearer xxxxx X-Request-ID: 29bbe977-9a62-4e4a-b2f4-5cf7b65d508f
Extract specific fields from the headers field. For example, extract the Authorization field from the headers field and rename it token.
* | extend token=json_extract_scalar(headers, 'Authorization')
headers: {"Authorization": "bearer xxxxx", "X-Request-ID": "29bbe977-9a62-4e4a-b2f4-5cf7b65d508f"} token: bearer xxxxx
request: {"body": {"user_id": 12345, "user_name": "Alice"}}
Parse the body field inside the request field in JSON mode.
* | parse-json -path='$.body' request
request: {"body": {"user_id": 12345, "user_name": "Alice"}} user_id: 12345 user_name: Alice
Data parsing in delimiter mode
content: 192.168.0.100,"10/Jun/2019:11:32:16,127 +0800",www.example.com
Split fields by using commas (,) and discard the content field in the raw data.
* | parse-csv -quote='"' content as ip, time, host | project-away content
host: www.example.com ip: 192.168.0.100 time: 10/Jun/2019:11:32:16,127 +0800
content: 192.168.0.100||10/Jun/2019:11:32:16,127 +0800||www.example.com
Split fields by using the delimiter || and discard the content field in the raw data.
* | parse-csv -delim='||' content as ip, time, host | project-away content
host: www.example.com ip: 192.168.0.100 time: 10/Jun/2019:11:32:16,127 +0800
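Long regular expressions such as the NGINX pattern above are easy to get wrong, so it helps to verify them offline before saving the SPL statement. The following Python sketch applies the same pattern to the sample log line. This is an illustration only; Python's re engine is close to, but not identical to, the regex dialect used in SPL, so edge cases may differ.

```python
import re

# Sample NGINX access log line from the table above.
log_line = ('192.168.1.75 - David [2024-07-31T14:27:24+08:00] '
            '"PUT /request/path-0/file-8 HTTP/1.1" 819 21577 403 73895 '
            'www.example.com www.example.com '
            '"Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/535.1 '
            '(KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1"')

# The same pattern that the parse-regexp instruction uses.
pattern = (r'(\S+)\s-\s(\S+)\s\[(\S+)\]\s"(\S+)\s(\S+)\s(\S+)"\s'
           r'(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\S+)\s(\S+)\s"(.*)"')

# Field names in the same order as the capturing groups.
names = ["remote_addr", "remote_user", "time_local", "request_method",
         "request_uri", "http_protocol", "request_time", "request_length",
         "status", "body_bytes_sent", "host", "referer", "user_agent"]

match = re.match(pattern, log_line)
assert match is not None, "pattern does not match the sample log line"
fields = dict(zip(names, match.groups()))
print(fields["request_method"], fields["request_uri"], fields["status"])
```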
Filter data
You can use the where instruction to filter data.
Note: During SPL-based data processing, all field values in the raw data are treated as strings by default. Before you compare numeric values, use data type conversion functions to convert the values of the required fields. For more information, see Data type conversion functions.
Raw data
Requirement description
SPL statement
Result
request_id: ddbde824-7c3e-4ff1-a6d1-c3a53fd4a919 status: 200 --- request_id: 7f9dad20-bc57-4aa7-af0e-436621f1f51d status: 500
Retain only logs whose status field value is 200.
* | where status='200'
Or
* | where cast(status as bigint)=200
request_id: ddbde824-7c3e-4ff1-a6d1-c3a53fd4a919 status: 200
request_id: ddbde824-7c3e-4ff1-a6d1-c3a53fd4a919 status: 200 --- request_id: 7f9dad20-bc57-4aa7-af0e-436621f1f51d status: 500 error: something wrong
Retain only logs that do not contain the error field.
* | where error is null
request_id: ddbde824-7c3e-4ff1-a6d1-c3a53fd4a919 status: 200
Retain only logs that contain the error field.
* | where error is not null
request_id: 7f9dad20-bc57-4aa7-af0e-436621f1f51d status: 500 error: something wrong
method: POST request_uri: /app/login --- method: GET request_uri: /user/1/profile status: 404 --- method: GET request_uri: /user/2/profile status: 200
Retain only logs whose request_uri field value starts with /user/.
* | where regexp_like(request_uri, '^\/user\/')
method: GET request_uri: /user/1/profile status: 404 --- method: GET request_uri: /user/2/profile status: 200
Retain only logs whose request_uri field value starts with /user/ and whose status field value is 200.
* | where regexp_like(request_uri, '^\/user\/') and status='200'
method: GET request_uri: /user/2/profile status: 200
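The where filters above can also be modeled offline. The following Python sketch, an illustration rather than Simple Log Service code, reproduces the last filter on the sample logs:

```python
import re

# The three sample logs from the table above, modeled as dicts.
logs = [
    {"method": "POST", "request_uri": "/app/login"},
    {"method": "GET", "request_uri": "/user/1/profile", "status": "404"},
    {"method": "GET", "request_uri": "/user/2/profile", "status": "200"},
]

# Analogue of:
# * | where regexp_like(request_uri, '^\/user\/') and status='200'
# status is compared as a string, matching SPL's default string handling.
matched = [
    log for log in logs
    if re.search(r"^/user/", log["request_uri"]) and log.get("status") == "200"
]
print(matched)
# [{'method': 'GET', 'request_uri': '/user/2/profile', 'status': '200'}]
```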
Mask data
You can use the extend instruction and SQL functions such as regular expression functions, string functions, and URL functions to mask data.
When you use the regexp_replace function to replace field values, you can use capturing groups: \1, \2, and \N represent the values of the first, second, and Nth capturing groups. For example, regexp_replace('192.168.1.1', '(\d+)\.(\d+)\.\d+\.\d+', '\1.\2.*.*') returns 192.168.*.*.
Raw data
Requirement description
SPL statement
Result
request_uri: /api/v1/resources?user=123&ticket=abc status: 200
Remove sensitive information from the query string of the request_uri field.
* | extend request_uri=url_extract_path(request_uri)
Or
* | extend request_uri=regexp_replace(request_uri, '\?.*', '')
request_uri: /api/v1/resources status: 200
client_ip: 192.168.1.123 latency: 100
Mask the middle two octets of an IP address with asterisks (*).
* | extend client_ip=regexp_replace(client_ip, '(\d+)\.\d+\.\d+\.(\d+)', '\1.*.*.\2')
client_ip: 192.*.*.123 latency: 100
sql: SELECT id, name, config FROM app_info WHERE name="test-app" result_size: 1024
The sql field contains sensitive information. Retain only the SQL operation type and the table name.
* | extend table=regexp_extract(sql, '\bFROM\s+([^\s;]+)|\bINTO\s+([^\s;]+)|\bUPDATE\s+([^\s;]+)', 1) | extend action=regexp_extract(sql,'\b(SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|ALTER)\b', 1) | project-away sql
action: SELECT table: app_info result_size: 1024
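The capturing-group replacement syntax shown above carries over to most regex engines, so the masking rules can be verified offline. A Python sketch, as an illustration only, of the IP-masking rule:

```python
import re

client_ip = "192.168.1.123"

# Analogue of regexp_replace(client_ip, '(\d+)\.\d+\.\d+\.(\d+)', '\1.*.*.\2'):
# \1 and \2 keep the first and last octets; the middle two are masked.
masked = re.sub(r"(\d+)\.\d+\.\d+\.(\d+)", r"\1.*.*.\2", client_ip)
assert masked == "192.*.*.123"
```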