Doris allows you to import JSON data. This topic describes the parameters that are used when you import JSON data into Doris, along with usage notes.
Supported import methods
- Import a local JSON file by using Stream Load.
- Subscribe to and consume JSON-formatted messages in Kafka by using Routine Load.
Supported JSON formats
- Multiple rows of data represented by an array
  In this JSON format, an array is used as the root node. Each element in the array represents a row of data to be imported and is usually an object. Example:
  ```
  [
      { "id": 123, "city" : "beijing"},
      { "id": 456, "city" : "shanghai"},
      ...
  ]
  ```
  This format is typically used with the stream load method to represent multiple rows of data in a batch.
  Important: To use this format, you must set the strip_outer_array parameter to true. During parsing, Doris expands the array and then parses each object in turn as a row of data.
- A single row of data represented by an object
  In this JSON format, an object is used as the root node. An entire object represents a row of data to be imported. Example:
{ "id": 123, "city" : "beijing"}
{ "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}
This format is typically used for the routine load method. For example, this format can represent a message in Kafka, which is a row of data.
- Multiple rows of object data separated by a fixed delimiter
  In this JSON format, each object represents a row of data to be imported, and the objects are separated by a fixed delimiter. Example:
  ```
  { "id": 123, "city" : "beijing"}
  { "id": 456, "city" : "shanghai"}
  ...
  ```
  This format is typically used with the stream load method to represent multiple rows of data in a batch, as shown in the sketches after this list.
  Important: To use this format, you must set the read_json_by_line parameter to true. You must also configure the line_delimiter parameter to specify the delimiter (default value: \n). During parsing, Doris splits the input on the specified delimiter and parses each resulting line as a row of data.
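For reference, the following curl sketches show how the array format and the line-delimited format are typically declared in a stream load. The file names, database, and table are placeholders; the endpoint and credentials follow the examples later in this topic.

```
# Array format: the file contains one JSON array of objects.
# strip_outer_array tells Doris to expand the array into rows.
curl -v --location-trusted -u root: \
    -H "format: json" \
    -H "strip_outer_array: true" \
    -T array_data.json \
    http://127.0.0.1:8030/api/db1/tbl1/_stream_load

# Line-delimited format: the file contains one JSON object per line.
# read_json_by_line splits the input on line_delimiter (default \n)
# and parses each line as one row.
curl -v --location-trusted -u root: \
    -H "format: json" \
    -H "read_json_by_line: true" \
    -T line_data.json \
    http://127.0.0.1:8030/api/db1/tbl1/_stream_load
```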
JSON parameters
- streaming_load_json_max_mb
- Data in some formats, such as JSON, cannot be split. Doris must read all data in these formats into memory before the data can be parsed. Therefore, this parameter specifies the maximum amount of data in these formats that can be imported in a single stream load.
- Default value: 100. Unit: MB. For more information about how to modify this parameter, see Configuration items of backend nodes.
- fuzzy_parse
- In a stream load, you can add the fuzzy_parse parameter to improve the import efficiency of JSON data.
- This parameter is usually used to import multiple rows of data represented by an array. Therefore, if you configure this parameter, you must set the strip_outer_array parameter to true.
- If you use the fuzzy_parse parameter, make sure that each row of data in the array has the same field order. Doris parses the data based on the field order of the first row only, and then accesses the subsequent fields by positional index. This improves import efficiency by three to five times.
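The following sketch combines these parameters. The 200 MB value and the data file name are illustrative; streaming_load_json_max_mb is a backend configuration item, not a load request header.

```
# In be.conf on each backend node (illustrative value):
# streaming_load_json_max_mb = 200

# fuzzy_parse requires strip_outer_array: true, and every object in the
# array must list its fields in the same order as the first object.
curl -v --location-trusted -u root: \
    -H "format: json" \
    -H "strip_outer_array: true" \
    -H "fuzzy_parse: true" \
    -T array_data.json \
    http://127.0.0.1:8030/api/db1/tbl1/_stream_load
```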
JSON path
Doris allows you to specify JSON paths to extract specific fields from JSON data.
- Do not specify JSON paths
  If you do not specify JSON paths, Doris searches for the elements in the object based on the column names in the table by default. Example:
  The table contains two columns named id and city. The following sample JSON data is available:
  ```
  { "id": 123, "city" : "beijing"}
  ```
  Doris uses the id and city names to match elements in the object. Then, 123 and beijing are obtained.
  The following sample JSON data is available:
  ```
  { "id": 123, "name" : "beijing"}
  ```
  Doris uses the id and city names to match elements in the object. Then, 123 and null are obtained because the object contains no city element.
- Specify JSON paths
  Specify a set of JSON paths in the form of a JSON array. Each element in the array represents a column to be extracted. Examples:
  ```
  ["$.id", "$.name"]
  ["$.id.sub_id", "$.name[0]", "$.city[0]"]
  ```
  Doris uses the specified JSON paths to match and extract data.
- Match data of non-primitive data types
  The data matched in the preceding examples is of primitive data types, such as Integer and String. Doris does not support composite data types, such as Array and Map. Therefore, if data of a non-primitive data type is matched, Doris converts the data to a JSON-formatted string and imports it as a string (see the sketch after this list). Example:
  The following sample JSON data is available:
  ```
  { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}
  ```
  Specify the JSON path as ["$.city"]. Then, the following element is matched:
  ```
  { "name" : "beijing", "region" : "haidian" }
  ```
  The element is converted to a string for subsequent import operations:
  ```
  "{'name':'beijing','region':'haidian'}"
  ```
- Match failed
  If the match fails, null is returned. Example:
  The following sample JSON data is available:
  ```
  { "id": 123, "name" : "beijing"}
  ```
  Specify the JSON paths as ["$.id", "$.info"]. Then, the matched elements are 123 and null.
  Doris does not distinguish between null values in JSON data and null values that are returned when a match fails. The following sample JSON data is available:
  ```
  { "id": 123, "name" : null }
  ```
  The same results, 123 and null, are returned if you use either of the following JSON paths to match elements:
  ```
  ["$.id", "$.name"]
  ["$.id", "$.info"]
  ```
- Exact match failed
  To prevent accidental imports that are caused by invalid parameter settings, Doris considers a row of data an error row if none of its columns can be matched. The following sample JSON data is available:
  ```
  { "id": 123, "city" : "beijing" }
  ```
  If the JSON paths are specified as the following invalid value, or if JSON paths are not specified and the id and city columns do not exist in the table:
  ```
  ["$.ad", "$.infa"]
  ```
  Then, the exact match fails, and Doris marks this row as an error row instead of yielding null, null.
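As referenced above, the following sketch shows how a nested object can be extracted into a string column by using JSON paths. The table layout (an id column and a string-typed city column) and the file name are assumptions for illustration.

```
# Assumed schema: tbl1(id int, city string).
# $.city matches { "name" : "beijing", "region" : "haidian" }, which Doris
# converts to a JSON-formatted string before importing it into the city column.
curl -v --location-trusted -u root: \
    -H "format: json" \
    -H "jsonpaths: [\"$.id\", \"$.city\"]" \
    -T nested_data.json \
    http://127.0.0.1:8030/api/db1/tbl1/_stream_load
```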
JSON path and columns
JSON paths are used to specify how to extract JSON data, and the columns field is used to specify the mapping and conversion relationship of columns. The two can be used together. Example:
- Data content:
  ```
  {"k1" : 1, "k2": 2}
  ```
- Table schema:
  ```
  k2 int, k1 int
  ```
- Import statement 1, in which the stream load method is used:
  ```
  curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
  ```
  In import statement 1, only the JSON paths are specified, and the columns field is not specified. The JSON paths extract the JSON data in the order of the fields in the jsonpaths array, and the values are then written in the order of the table schema. The following data import results are returned:
  ```
  +------+------+
  | k1   | k2   |
  +------+------+
  |    2 |    1 |
  +------+------+
  ```
  You can see that the value of the k2 field in the JSON data is imported into the k1 column. This is because the names of the fields in the JSON data are not the same as the names of the corresponding columns in the table schema. Therefore, you must explicitly specify the mapping between the fields in the JSON data and the columns in the table schema.
- Import statement 2:
  ```
  curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, k1" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
  ```
  Compared with import statement 1, the columns field is added to describe the column mapping in the order of k2, k1. After the fields are extracted in the order of the JSON paths, the first value is mapped to the k2 column in the table, and the second value is mapped to the k1 column. The following data import results are returned:
  ```
  +------+------+
  | k1   | k2   |
  +------+------+
  |    1 |    2 |
  +------+------+
  ```
As with other import methods, you can perform column transformations in the columns field. Example:
```
curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
```
In this example, the value extracted for the k1 column is multiplied by 100 and then imported. The following data import results are returned:
```
+------+------+
| k1   | k2   |
+------+------+
|  100 |    2 |
+------+------+
```
JSON root
Doris allows you to specify a JSON root node from which to extract data.
- Do not specify the JSON root
  If you do not specify the JSON root, Doris searches for the elements in the object based on the column names in the table by default. Example:
  The table contains two columns named id and city. The following sample JSON data is available:
  ```
  { "id": 123, "name" : { "id" : "321", "city" : "shanghai" }}
  ```
  Doris uses the id and city names to match elements in the object. Then, 123 and null are obtained, because the nested id and city fields inside the name object cannot be matched at the top level.
- Specify the JSON root
  Specify the root node of the JSON data by using the json_root parameter. Doris extracts the elements of the root node and parses them. By default, this parameter is empty.
  Specify the JSON root as -H "json_root: $.name". Then, the following element is matched:
  ```
  { "id" : "321", "city" : "shanghai" }
  ```
  The element is treated as new JSON data to be imported, and 321 and shanghai are obtained.
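A minimal curl sketch of the json_root parameter in a stream load, using the sample data above (the file name is a placeholder):

```
# Only the object under $.name is parsed; its id and city fields are then
# matched against the table columns, yielding 321 and shanghai.
curl -v --location-trusted -u root: \
    -H "format: json" \
    -H "json_root: $.name" \
    -T nested_data.json \
    http://127.0.0.1:8030/api/db1/tbl1/_stream_load
```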
NULL and default values
- The following sample JSON data is available:
  ```
  [
      {"k1": 1, "k2": "a"},
      {"k1": 2},
      {"k1": 3, "k2": "c"}
  ]
  ```
  The table schema is k1 int null, k2 varchar(32) null default "x".
- Import statement:
  ```
  curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
  ```
  The following import results may be expected, in which the default value is entered into the missing column:
  ```
  +------+------+
  | k1   | k2   |
  +------+------+
  |    1 | a    |
  +------+------+
  |    2 | x    |
  +------+------+
  |    3 | c    |
  +------+------+
  ```
  However, the following actual import results are returned, in which NULL is entered into the missing column:
  ```
  +------+------+
  | k1   | k2   |
  +------+------+
  |    1 | a    |
  +------+------+
  |    2 | NULL |
  +------+------+
  |    3 | c    |
  +------+------+
  ```
  This is because Doris cannot tell from the import statement that the missing field corresponds to the k2 column in the table. To obtain the expected import results, execute the following import statement:
  ```
  curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -H "jsonpaths: [\"$.k1\", \"$.k2\"]" -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
  ```
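Assuming the statement behaves as described above, the ifnull transformation fills the missing k2 value with the default x instead of NULL, which matches the expected results:

```
+------+------+
| k1   | k2   |
+------+------+
|    1 | a    |
+------+------+
|    2 | x    |
+------+------+
|    3 | c    |
+------+------+
```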