This topic describes how to use instructions to extract semi-structured data and provides relevant examples.
parse-regexp
Extracts information from a specified field that matches regular expression groups.
The data type of the extracted data is VARCHAR. If an extracted field has the same name as a field in the input data, see Value retention and overwriting for the value retention policy.
You cannot perform operations on the __time__ and __time_ns_part__ time fields. For more information, see Time fields.
Syntax
| parse-regexp <field>, <pattern> as <output>, ...
Parameters
Parameter | Type | Required | Description |
field | Field | Yes | The name of the source field to extract. The input data must contain this field. The field must be of the |
pattern | Regexp | Yes | The regular expression. The RE2 syntax is supported. |
output | Field | No | The name of the field used to store the extraction result. |
Examples
Example 1: Perform exploratory matching sequentially.
SPL statement
* | parse-regexp content, '(\S+)' as ip -- Generates the ip field: 10.0.0.0.
| parse-regexp content, '\S+\s+(\w+)' as method -- Generates the method field: GET.
Input data
content: '10.0.0.0 GET /index.html 15824 0.043'
Output
content: '10.0.0.0 GET /index.html 15824 0.043'
ip: '10.0.0.0'
method: 'GET'
Example 2: Perform a full pattern match using non-named regular expression capturing.
SPL statement
* | parse-regexp content, '(\S+)\s+(\w+)' as ip, method
Input data
content: '10.0.0.0 GET /index.html 15824 0.043'
Output
content: '10.0.0.0 GET /index.html 15824 0.043'
ip: '10.0.0.0'
method: 'GET'
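To make the group-to-field mapping concrete, the following Python sketch imitates both examples above. It is an illustration only, not the SLS implementation: the helper name parse_regexp is hypothetical, Python's re module stands in for RE2 (the two agree on these simple patterns), and the sketch assumes that a record without a match is returned unchanged and that an existing field with the same name is overwritten.

```python
import re

def parse_regexp(record, field, pattern, outputs):
    """Return a copy of `record` with one string field per capturing group."""
    result = dict(record)
    match = re.search(pattern, record[field])  # unanchored, as in Example 1
    if match is None:
        return result  # assumption: no match leaves the record unchanged
    # Capturing groups are assigned to output names in order.
    for name, value in zip(outputs, match.groups()):
        result[name] = value  # assumption: same-name fields are overwritten
    return result

record = {"content": "10.0.0.0 GET /index.html 15824 0.043"}

# Example 1: two sequential extractions, each with one capturing group.
step1 = parse_regexp(record, "content", r"(\S+)", ["ip"])
step2 = parse_regexp(step1, "content", r"\S+\s+(\w+)", ["method"])

# Example 2: a single pattern with two capturing groups.
full = parse_regexp(record, "content", r"(\S+)\s+(\w+)", ["ip", "method"])

print(step2)
print(full)
```

Both approaches produce the same fields. The sequential form is convenient while you are still exploring the log format, and the single pattern is more compact once the format is known.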
parse-csv
Extracts data in CSV format from a specified field.
The data type of the extracted data is VARCHAR. If an extracted field has the same name as a field in the input data, see Value retention and overwriting for the value retention policy.
You cannot perform operations on the __time__ and __time_ns_part__ time fields. For more information, see Time fields.
Syntax
| parse-csv -delim=<delim> -quote=<quote> -strict <field> as <output>, ...
Parameters
Parameter | Type | Required | Description |
delim | String | No | The separator for the data content. It can be one to three valid ASCII characters, so multi-character separators are supported. You can use escape sequences to represent special characters. For example, \t represents a tab character, \11 represents the ASCII character with the octal code 11, and \x09 represents the ASCII character with the hexadecimal code 09. The default value is a comma (,). |
quote | Char | No | The quote character for the data content. It is a single valid ASCII character that is used when the data content contains the separator. Examples include a double quotation mark ("), a single quotation mark ('), and an invisible character (0x01). By default, no quote character is used. Important: This parameter takes effect only when the delim parameter is a single character. The value of this parameter cannot be the same as the value of the delim parameter. |
strict | Bool | No | Specifies whether to enable strict matching when the number of values in the data content does not match the number of fields specified in
This feature is disabled by default. To enable it, add this parameter. |
field | Field | Yes | The name of the source field to parse. The data content must include this field. The field must be of the |
output | Field | Yes | The name of the field used to store the parsed data content. |
Examples
Example 1: Simple data matching.
SPL statement
* | parse-csv content as x, y, z
Input data
content: 'a,b,c'
Output
content: 'a,b,c'
x: 'a'
y: 'b'
z: 'c'
Example 2: Use the default double quotation mark (") as the quote character to match content that contains special characters.
SPL statement
* | parse-csv content as ip, time, host
Input data
content: '192.168.0.100,"10/Jun/2019:11:32:16,127 +0800",example.aliyundoc.com'
Output
content: '192.168.0.100,"10/Jun/2019:11:32:16,127 +0800",example.aliyundoc.com'
ip: '192.168.0.100'
time: '10/Jun/2019:11:32:16,127 +0800'
host: 'example.aliyundoc.com'
Example 3: Use a multi-character separator.
SPL statement
* | parse-csv -delim='||' content as time, ip, req
Input data
content: '05/May/2022:13:30:28||127.0.0.1||POST /put?a=1&b=2'
Output
content: '05/May/2022:13:30:28||127.0.0.1||POST /put?a=1&b=2'
time: '05/May/2022:13:30:28'
ip: '127.0.0.1'
req: 'POST /put?a=1&b=2'
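The interaction between delim and quote can be sketched in Python as follows. This is a rough imitation under stated assumptions, not the SLS implementation: the helper parse_csv is hypothetical, the quote character is always passed explicitly, a multi-character separator is handled with a plain split because the quote character is documented to apply only to single-character separators, and the -strict behavior is not modeled.

```python
import csv

def parse_csv(record, field, outputs, delim=",", quote=None):
    """Return a copy of `record` with the split values stored under `outputs`."""
    text = record[field]
    if len(delim) > 1:
        # Multi-character separator: quoting does not apply, so split directly.
        values = text.split(delim)
    elif quote is None:
        # Single-character separator without a quote character.
        values = next(csv.reader([text], delimiter=delim, quoting=csv.QUOTE_NONE))
    else:
        # Single-character separator; separators inside quotes are kept.
        values = next(csv.reader([text], delimiter=delim, quotechar=quote))
    result = dict(record)
    # zip() ignores surplus names or values; what the real instruction does
    # on a count mismatch (the -strict option) is outside this sketch.
    result.update(zip(outputs, values))
    return result

ex1 = parse_csv({"content": "a,b,c"}, "content", ["x", "y", "z"])

ex2 = parse_csv(
    {"content": '192.168.0.100,"10/Jun/2019:11:32:16,127 +0800",example.aliyundoc.com'},
    "content", ["ip", "time", "host"], quote='"',
)

ex3 = parse_csv(
    {"content": "05/May/2022:13:30:28||127.0.0.1||POST /put?a=1&b=2"},
    "content", ["time", "ip", "req"], delim="||",
)

print(ex1, ex2, ex3, sep="\n")
```

In the second call, the comma inside the quoted timestamp is not treated as a separator, which is why the time value survives as one field.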
parse-json
Extracts the first-layer key-value pairs from a specified field in JSON format.
The data type of the extracted data is VARCHAR. If an extracted field has the same name as a field in the input data, see Value retention and overwriting for the value retention policy.
You cannot perform operations on the __time__ and __time_ns_part__ time fields. For more information, see Time fields.
Syntax
| parse-json -mode=<mode> -path=<path> -prefix=<prefix> <field>
Parameters
Parameter | Type | Required | Description |
mode | String | No | Specifies which value to keep when an extracted field has the same name as a field in the input data. The default value is overwrite. |
path | JSONPath | No | Specifies the JSON path in the field content to locate the content to be extracted. The default value is an empty string, which indicates that the full content of the specified field is extracted directly. |
prefix | String | No | The prefix for the result fields after the JSON structure is expanded. The default value is an empty string. |
field | Field | Yes | The name of the source field to parse. The input data must contain this field, its value cannot be null, and one of the following conditions must be met. Otherwise, the extraction operation is not performed.
|
Examples
Example 1: Extract all key-value pairs from the y field.
SPL statement
* | parse-json y
Input data
x: '0'
y: '{"a": 1, "b": 2}'
Output
x: '0'
y: '{"a": 1, "b": 2}'
a: '1'
b: '2'
Example 2: Extract the value of the body key from the content field, and then extract all of its key-value pairs.
SPL statement
* | parse-json -path='$.body' content
Input data
content: '{"body": {"a": 1, "b": 2}}'
Output
content: '{"body": {"a": 1, "b": 2}}'
a: '1'
b: '2'
Example 3: Set the field value output mode to preserve to retain the original values of existing fields.
SPL statement
* | parse-json -mode='preserve' y
Input data
a: 'xyz'
x: '0'
y: '{"a": 1, "b": 2}'
Output
x: '0'
y: '{"a": 1, "b": 2}'
a: 'xyz'
b: '2'
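The three examples above can be imitated with a short Python sketch that shows how path, prefix, and mode combine. It is illustrative only: the helper parse_json is hypothetical, the path handling covers only simple paths of the form $.key1.key2 rather than full JSONPath, and rendering non-string values as JSON text is an assumption made to match the VARCHAR output type.

```python
import json

def parse_json(record, field, mode="overwrite", path="", prefix=""):
    """Return a copy of `record` with first-layer JSON keys added as string fields."""
    result = dict(record)
    try:
        node = json.loads(record[field])
    except (KeyError, TypeError, ValueError):
        return result  # missing, null, or invalid JSON: nothing is extracted
    # Simplified path handling: only "$.key1.key2" style paths are supported.
    for key in [k for k in path.lstrip("$").split(".") if k]:
        if not isinstance(node, dict) or key not in node:
            return result
        node = node[key]
    if not isinstance(node, dict):
        return result
    for key, value in node.items():
        name = prefix + key
        if mode == "preserve" and name in record:
            continue  # keep the value that was already in the input data
        # Assumption: non-string values are rendered as JSON text.
        result[name] = value if isinstance(value, str) else json.dumps(value)
    return result

ex1 = parse_json({"x": "0", "y": '{"a": 1, "b": 2}'}, "y")
ex2 = parse_json({"content": '{"body": {"a": 1, "b": 2}}'}, "content", path="$.body")
ex3 = parse_json({"a": "xyz", "x": "0", "y": '{"a": 1, "b": 2}'}, "y", mode="preserve")

print(ex1, ex2, ex3, sep="\n")
```

With mode set to preserve, the existing value of a stays 'xyz'. With the default overwrite mode, the same input would yield a: '1'.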
parse-kv
Extracts key-value pairs from a specified field.
The data type of the extracted data is VARCHAR. If an extracted field has the same name as a field in the input data, see Value retention and overwriting for the value retention policy.
You cannot perform operations on the __time__ and __time_ns_part__ time fields. For more information, see Time fields.
Syntax
Extraction by separator
Extracts key-value pairs based on specified separators.
| parse-kv -mode=<mode> -prefix=<prefix> -greedy <field>, <delim>, <kv-sep>
Extraction by regular expression
Extracts key-value pairs based on a specified regular expression.
| parse-kv -regexp -mode=<mode> -prefix=<prefix> <field>, <pattern>
Parameters
Extraction by separator
Parameter | Type | Required | Description |
mode | String | No | If the corresponding destination field already exists in the input data, you can select a data overwriting mode. The default value is overwrite. For more information, see Field extraction check and overwrite modes. |
prefix | String | No | The prefix for the names of the output fields that contain the extraction results. The default value is an empty string. |
greedy | Bool | No | Enables greedy matching for field values.
|
field | Field | Yes | The name of the source field to parse.
|
delim | Char | Yes | The separator character between different key-value pairs. It can be one to five valid ASCII characters, such as You cannot specify a substring of |
kv-sep | Char | Yes | The character that connects the key and value within a key-value pair. It can be one to five valid ASCII characters, such as You cannot specify a substring of |
Extraction by regular expression
Parameter | Type | Required | Description |
regexp | Bool | Yes | Enables the regular expression extraction mode. |
mode | String | No | If the corresponding destination field already exists in the input data, you can select a data overwriting mode. The default value is overwrite. For more information, see Field extraction check and overwrite modes. |
prefix | String | No | The prefix for the names of the output fields that contain the extraction results. The default value is an empty string. |
field | Field | Yes | The name of the source field to extract. The input data must contain this field. The field must be of the |
pattern | RegExpr | Yes | A regular expression that contains two capturing groups. The first capturing group extracts the field name, and the second capturing group extracts the field value. The RE2 syntax is supported. |
Examples
Example 1: Use multi-character separators to extract labels from SLS metric data as data fields.
SPL statement
* | parse-kv -prefix='__labels__.' __labels__, '|', '#$#'
Input data
__name__: 'net_in'
__value__: '231461.57374215033'
__time_nano__: '1717378679274117026'
__labels__: 'cluster#$#sls-etl|hostname#$#iZbp17raa25u0xi4wifopeZ|interface#$#veth02cc91d2|ip#$#192.168.22.238'
Output data
__name__: 'net_in'
__value__: '231461.57374215033'
__time_nano__: '1717378679274117026'
__labels__: 'cluster#$#sls-etl|hostname#$#iZbp17raa25u0xi4wifopeZ|interface#$#veth02cc91d2|ip#$#192.168.22.238'
__labels__.cluster: 'sls-etl'
__labels__.hostname: 'iZbp17raa25u0xi4wifopeZ'
__labels__.interface: 'veth02cc91d2'
__labels__.ip: '192.168.22.238'
Example 2: Enable greedy matching mode to extract key-value pairs from access logs.
SPL statement
* | parse-kv -greedy content, ' ', '='
Input data
content: 'src=127.0.0.1 dst=192.168.0.0 bytes=125 msg=connection refused body=this is test time=2024-05-21T00:00:00'
Output data
content: 'src=127.0.0.1 dst=192.168.0.0 bytes=125 msg=connection refused body=this is test time=2024-05-21T00:00:00'
src: '127.0.0.1'
dst: '192.168.0.0'
bytes: '125'
msg: 'connection refused'
body: 'this is test'
time: '2024-05-21T00:00:00'
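Examples 1 and 2 can be imitated with the Python sketch below. It is a guess at the behavior based only on the two examples, not the SLS implementation: the helper parse_kv is hypothetical, and the greedy rule used here (a piece without a key-value separator continues the previous value) is an assumption that happens to reproduce the output of Example 2. The -mode option is left out, so same-name fields are simply overwritten.

```python
def parse_kv(record, field, delim, kv_sep, prefix="", greedy=False):
    """Return a copy of `record` with the key-value pairs found in record[field]."""
    pairs = []  # [key, value] lists, in order of appearance
    for token in record[field].split(delim):
        key, sep, value = token.partition(kv_sep)
        if sep:
            # The token contains kv_sep, so it starts a new key-value pair.
            pairs.append([key, value])
        elif greedy and pairs:
            # Assumption inferred from Example 2: a token without kv_sep
            # belongs to the value of the previous pair.
            pairs[-1][1] += delim + token
    result = dict(record)
    for key, value in pairs:
        result[prefix + key] = value
    return result

labels = {"__labels__": "cluster#$#sls-etl|hostname#$#iZbp17raa25u0xi4wifopeZ"
                        "|interface#$#veth02cc91d2|ip#$#192.168.22.238"}
ex1 = parse_kv(labels, "__labels__", "|", "#$#", prefix="__labels__.")

log = {"content": "src=127.0.0.1 dst=192.168.0.0 bytes=125 msg=connection refused "
                  "body=this is test time=2024-05-21T00:00:00"}
ex2 = parse_kv(log, "content", " ", "=", greedy=True)

print(ex1)
print(ex2)
```

Greedy matching matters here because the msg and body values contain the pair separator (a space). Without it, a value would end at the first space.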
Example 3: Use the regular expression extraction mode to handle complex key-value pair delimiters and key-value separators.
SPL statement
* | parse-kv -regexp content, '([^&?]+)(?:=|:)([^&?]+)'
Input data
content: 'k1=v1&k2=v2?k3:v3'
k1: 'xyz'
Output data
content: 'k1=v1&k2=v2?k3:v3'
k1: 'v1'
k2: 'v2'
k3: 'v3'
Example 4: Set the field value output mode to preserve to retain the original values of existing fields.
SPL statement
* | parse-kv -regexp -mode='preserve' content, '([^&?]+)(?:=|:)([^&?]+)'
Input data
content: 'k1=v1&k2=v2?k3:v3'
k1: 'xyz'
Output
content: 'k1=v1&k2=v2?k3:v3'
k1: 'xyz'
k2: 'v2'
k3: 'v3'
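The difference between Examples 3 and 4 comes down to how a name collision on k1 is resolved. The Python sketch below imitates the regular expression mode to show this. It is an approximation rather than the SLS implementation: the helper parse_kv_regexp is hypothetical, and Python's re module stands in for RE2, which behaves the same way for this pattern.

```python
import re

def parse_kv_regexp(record, field, pattern, mode="overwrite", prefix=""):
    """Return a copy of `record` with one field per (key, value) match."""
    result = dict(record)
    # With two capturing groups, findall returns (key, value) tuples.
    # The non-capturing group (?:=|:) is matched but not returned.
    for key, value in re.findall(pattern, record[field]):
        name = prefix + key
        if mode == "preserve" and name in record:
            continue  # keep the value from the input data
        result[name] = value
    return result

record = {"content": "k1=v1&k2=v2?k3:v3", "k1": "xyz"}
pattern = r"([^&?]+)(?:=|:)([^&?]+)"

overwritten = parse_kv_regexp(record, "content", pattern)
preserved = parse_kv_regexp(record, "content", pattern, mode="preserve")

print(overwritten)
print(preserved)
```

The first capturing group backtracks until it is followed by = or :, which is why k1=v1 is split into k1 and v1 even though the character class [^&?] also accepts the = character.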