All Products
Search
Document Center

E-MapReduce:Import JSON data into Doris

Last Updated:May 12, 2023

Doris allows you to import JSON data. This topic describes the parameters that are used when you import JSON data into Doris and the usage notes.

Supported import methods

You can use only the following import methods to import JSON data:

  • Import a local JSON file by using Stream Load.

  • Subscribe to and consume JSON-formatted messages in Kafka by using Routine Load.

Note

You cannot import JSON data by using other import methods. Some information in this topic is from Apache Doris. For more information, see Load Json Format Data.

Supported JSON formats

Only the following two JSON formats are supported:

  • Multiple rows of data represented by an array

    The JSON format in which an array is used as the root node. Each element in the array represents a row of data to be imported, usually an object. Example:

    [
        { "id": 123, "city" : "beijing"},
        { "id": 456, "city" : "shanghai"},
        ...
    ]

    This format is typically used for the stream load method to represent multiple rows of data in a batch of imported data.

    Important

    To use this format, you must set the strip_outer_array parameter to true. During parsing, Doris expands the array, and then parses each object in turn as a row of data.

  • A single row of data represented by an object

    The JSON format in which an object is used as the root node. An entire object represents a row of data to be imported. Example:

    { "id": 123, "city" : "beijing"}
    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}

    This format is typically used for the routine load method. For example, this format can represent a message in Kafka, which is a row of data.

  • Multiple rows of object data separated by a fixed delimiter

    A row of data represented by an object represents a row of data to be imported. Example:

    { "id": 123, "city" : "beijing"}
    { "id": 456, "city" : "shanghai"}
    ...

    This format is typically used for the stream load method to represent multiple rows of data in a batch of imported data.

    Important

    To use this format, you must set the read_json_by_line parameter to true. You must also configure the line_delimiter parameter to specify a delimiter. Default value: \n. When Doris parses objects, the objects are separated by the specified delimiter. Each row of objects is parsed as a row of data.

JSON parameters

  • streaming_load_json_max_mb

    • Data in some formats, such as JSON, cannot be split. Doris must read and import all data in these formats into the memory before the data can be parsed. Therefore, this parameter is used to specify the maximum amount of data in these formats that can be imported in a single stream load.

    • Default value: 100. Unit: MB. For more information about how to modify this parameter, see Configuration items of backend nodes.

  • fuzzy_parse

    • In a stream load, you can add the fuzzy_parse parameter to improve the import efficiency of JSON data.Stream Load

    • This parameter is usually used to import multiple rows of data represented by an array. Therefore, if you configure this parameter, you must set the strip_outer_array parameter to true.

    • If you use the fuzzy_parse parameter, make sure that each row of data in the array has the same order of fields. Doris parses the data based only on the field order of the first row, and then accesses the subsequent data in the form of subscripts. This improves the import efficiency by three to five times.

JSON path

Doris allows you to specify JSON paths to extract the specified JSON data.

Note

To parse data in an array, Doris expands the array first, and then processes each row of data in the object format. Therefore, JSON data in a single object format is used in the following examples.

  • Do not specify JSON paths

    If you do not specify JSON paths, Doris searches for the elements in the object based on the column names in the table by default. Example:

    The table contains two columns named id and city. The following sample JSON data is available:

    { "id": 123, "city" : "beijing"}

    Doris uses the id and city names to match elements in the object. Then, 123 and beijing are obtained.

    The following sample JSON data is available:

    { "id": 123, "name" : "beijing"}

    Doris uses the id and city names to match elements in the object. Then, 123 and null are obtained.

  • Specify JSON paths

    Specify a set of JSON paths in the format of JSON data. Each element in the array represents a column to be extracted. Example:

    ["$.id", "$.name"]
    ["$.id.sub_id", "$.name[0]", "$.city[0]"]

    Doris uses the specified JSON path to match and extract data.

  • Match data of non-primitive data types

    The data that is matched in the preceding examples are of primitive data types, such as Integer and String. Doris does not support composite data types, such as Array and Map. Therefore, if data of a non-primitive data type is matched, Doris converts the type to a JSON-formatted string and imports the data as a string. Example:

    The following sample JSON data is available:

    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}

    Specify the JSON path as ["$.city"]. Then, the following elements are matched:

    { "name" : "beijing", "region" : "haidian" }

    The elements are converted to a string for subsequent import operations:

    "{'name':'beijing','region':'haidian'}"
  • Match failed

    If the match fails, null is returned. Example:

    The following sample JSON data is available:

    { "id": 123, "name" : "beijing"}

    Specify JSON path as ["$.id", "$.info"]. Then, the matched elements are 123 and null.

    Doris does not distinguish between null values in JSON data and null values that are returned when a match fails. The following sample JSON data is available:

    { "id": 123, "name" : null }

    Then, the same results, 123 and null, are returned if you use the following two JSON paths to match elements:

    • ["$.id", "$.name"]
    • ["$.id", "$.info"]
  • Exact match failed

    To prevent accidental operations that are caused by invalid parameter settings, Doris considers a row of data as an error row if all columns fail to be matched when Doris tries to match the row of data. The following sample JSON data is available:

    { "id": 123, "city" : "beijing" }

    If the JSON path is specified to the following invalid value, or if the JSON path is not specified and the id and city columns do not exist in the table:

    ["$.ad", "$.infa"]

    Then, the exact match fails, and Doris marks this row as an error row instead of yielding null, null.

JSON path and columns

JSON paths are used to specify how to extract JSON data, and columns are used to specify the mapping and conversion relationship of columns. JSON paths can be used together with Columns.

This method is equivalent to using a JSON path to rearrange the columns of JSON data based on the column order that is specified in the JSON path. Then, you can use columns to map the rearranged source data to the columns of the table. Example:

  • Data content:

    {"k1" : 1, "k2": 2}
  • Table schema:

    k2 int, k1 int
  • Import statement 1 in which the stream load method is used:

    curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

    In import statement 1, only the JSON path is specified, and the columns field is not specified. The JSON path is used to extract JSON data based on the order of fields in the JSON path. Then, data is written based on the order of the table schema. The following data import results are returned:

    +------+------+
    | k1   | k2   |
    +------+------+
    |    2 |    1 |
    +------+------+

    You can see that the value of the k2 column in the JSON data is imported into the k1 column. This is because the name of a field in the JSON data is not the same as the name of the corresponding field in the table schema. Therefore, you must explicitly specify the mapping between the fields in the JSON data and table schema.

  • Import statement 2:

    curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, k1" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

    Compared with import statement 1, the columns field is added to describe the mapping between the k1 and k2 columns in the order of k2, k1. After the fields in the JSON path are extracted in sequence, the value of the first column is specified to the value of the k2 column in the table, and the value of the second column is specified to the value of the k1 column in the table. The following data import results are returned:

    +------+------+
    | k1   | k2   |
    +------+------+
    |    1 |    2 |
    +------+------+

    Same as other import methods, you can perform column transformations in the columns field. Example:

    curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

    In the preceding example, the value of the k1 column is multiplied by 100 and then imported. The following data import results are returned:

    +------+------+
    | k1   | k2   |
    +------+------+
    |  100 |    2 |
    +------+------+

JSON root

Doris allows you to specify the JSON root to extract the specified JSON data.

Note

To parse data in an array, Doris expands the array first, and then processes each row of data in the object format. Therefore, JSON data in a single object format is used in the following examples.

  • Do not specify the JSON root

    If you do not specify the JSON root, Doris searches for the elements in the object based on the column names in the table by default. Example:

    The table contains two columns named id and city. The following sample JSON data is available:

    { "id": 123, "name" : { "id" : "321", "city" : "shanghai" }}

    Doris uses the id and city names to match elements in the object. Then, 123 and null are obtained.

  • Specify the JSON root

    Specify the root node of the JSON data by using the json_root parameter. Doris extracts the elements of the root node for parsing by using the JSON root. By default, this parameter is empty.

    Specify the JSON root as -H "json_root: $.name". Then, the following elements are matched:

    { "id" : "321", "city" : "shanghai" }

    The elements are treated as new JSON data to be imported. 321 and shanghai are obtained.

NULL and default values

  • The following sample JSON data is available:

    [
        {"k1": 1, "k2": "a"},
        {"k1": 2},
        {"k1": 3, "k2": "c"},
    ]

    The table schema is k1 int null, k2 varchar(32) null default "x".

  • Import statement:

    curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

    The following import results may be expected, in which the default value is entered into the missing column:

    +------+------+
    | k1   | k2   |
    +------+------+
    |    1 |    a |
    +------+------+
    |    2 |    x |
    +------+------+
    |    3 |    c |
    +------+------+

    However, the following actual import results are returned, in which NULL is entered into the missing column:

    +------+------+
    | k1   | k2   |
    +------+------+
    |    1 |    a |
    +------+------+
    |    2 | NULL |
    +------+------+
    |    3 |    c |
    +------+------+

    This is because Doris does not obtain the information that the missing column is the k2 column in the table from the information in the import statement. To obtain the expected import results, execute the following import statement:

    curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -H "jsonpaths: [\"$.k1\", \"$.k2\"]" -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load