This topic describes the working principles, features, and parameters of Elasticsearch Reader.

How it works

  • Elasticsearch Reader reads data from Elasticsearch by using sliced scroll queries. The slices are processed in parallel by multiple threads of a sync node.
  • Data types are converted based on the mapping configuration of Elasticsearch.

For more information, see the documentation on the official Elasticsearch website.
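The slicing described above can be sketched as follows. The `slice` object with `id` and `max` is part of the Elasticsearch scroll API. How Elasticsearch Reader actually assigns slices to threads is not documented here, so the thread pool, the function names, and `fake_fetch` (a stand-in for the real HTTP request) are assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def build_slice_bodies(search: dict, num_slices: int, page_size: int = 100) -> list:
    """Return one _search request body per slice of a sliced scroll."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    bodies = []
    for slice_id in range(num_slices):
        body = dict(search)  # shallow copy so the caller's query is left untouched
        body["size"] = page_size
        if num_slices > 1:  # Elasticsearch only accepts a slice when max is 2 or more
            body["slice"] = {"id": slice_id, "max": num_slices}
        bodies.append(body)
    return bodies


def fake_fetch(body: dict) -> str:
    """Hypothetical stand-in for an HTTP call that opens a scroll for one slice."""
    return json.dumps(body, sort_keys=True)


bodies = build_slice_bodies({"query": {"match_all": {}}}, num_slices=3)

# One worker per slice, mirroring "multiple threads of a sync node".
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fake_fetch, bodies))

for line in results:
    print(line)
```

Each slice returns a disjoint subset of the matching documents, so the workers do not need to coordinate beyond collecting their results.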

Basic settings

{
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    },
    "setting":{
        "errorLimit":{
            "record":"0" // The maximum number of dirty data records allowed.
        },
        "jvmOption":"",
        "speed":{
            "concurrent":3,
            "throttle":false
        }
    },
    "steps":[
        {
            "category":"reader",
            "name":"Reader",
            "parameter":{
                "column":[ // The columns to be synchronized.
                    "id",
                    "name"
                ],
                "endpoint":"", // The endpoint.
                "index":"", // The index name.
                "password":"", // The password.
                "scroll":"", // The scroll parameter, which sets how long the scroll snapshot is kept.
                "search":"", // The search criteria. The value is the same as the Elasticsearch query that uses the _search API.
                "type":"default", // The type name in the index.
                "username":"" // The username.
            },
            "stepType":"elasticsearch"
        },
        {
            "category":"writer",
            "name":"Writer",
            "parameter":{ },
            "stepType":"stream"
        }
    ],
    "type":"job",
    "version":"2.0" // The version number.
}
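The sample above contains `//` comments, which strict JSON parsers reject. One way to avoid hand-editing is to generate the configuration from a script. The following is a minimal sketch: the endpoint, index name, `scroll` value, and `search` body are hypothetical values chosen for illustration, and the assumption that `search` accepts a `_search` request body as an object is not confirmed by this topic.

```python
import json

# Hypothetical values for illustration only. Replace them with your own.
reader_parameter = {
    "endpoint": "http://es.example.internal:9200",  # hypothetical endpoint
    "index": "orders",                              # hypothetical index name
    "type": "default",
    "username": "",
    "password": "",
    "scroll": "1m",                                 # assumed keep-alive format of the scroll API
    "search": {"query": {"match_all": {}}},         # assumed to be a _search request body
    "column": ["id", "name"],
}

job = {
    "type": "job",
    "version": "2.0",
    "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    "setting": {
        "errorLimit": {"record": "0"},
        "jvmOption": "",
        "speed": {"concurrent": 3, "throttle": False},
    },
    "steps": [
        {
            "category": "reader",
            "name": "Reader",
            "stepType": "elasticsearch",
            "parameter": reader_parameter,
        },
        {
            "category": "writer",
            "name": "Writer",
            "stepType": "stream",
            "parameter": {},
        },
    ],
}

# Serialize to comment-free JSON.
job_json = json.dumps(job, indent=4)
print(job_json)
```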

Advanced features

  • Supports storing all data of an Elasticsearch document in one column.

    To create a column that stores all data of an Elasticsearch document, set the full parameter to true.

  • Supports converting semi-structured data to structured data.
    Background: Data in Elasticsearch is often deeply nested. Fields can vary in type and length, and field names can be in Chinese. To simplify data computing and storage for downstream businesses, Elasticsearch Reader supports converting semi-structured data to structured data.

    How it works: Elasticsearch Reader flattens the nested JSON data obtained from Elasticsearch into one-dimensional data based on the paths of the properties in the JSON data. It then maps the flattened data to structured tables. In this way, Elasticsearch data with a complex structure is converted into multiple structured tables.

    Solution:
    • Elasticsearch Reader flattens nested JSON data by using the following path formats:
      • Property
      • Property.Child property
      • Property[0].Child property
    • If a property has multiple child properties, Elasticsearch Reader traverses all data of the property and splits the data into multiple tables or multiple rows in the following format:

      Property[*].Child property

    • Elasticsearch Reader merges the data in a string array into one property in the following format and removes duplicates:

      Property[] where duplicates are removed

    • Elasticsearch Reader merges multiple properties into one property in the following format:

      Property 1,Property 2

    • Elasticsearch Reader presents optional properties in the following format:

      Property 1|Property 2
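The path formats above can be illustrated with a small resolver. This is a sketch of the idea, not Elasticsearch Reader's implementation. The function names, the sample document, and the choice to return results as Python lists are assumptions. In particular, `Property 1|Property 2` is interpreted here as "use the first property that exists".

```python
import re

# A path segment is a name, optionally followed by [N], [*], or [].
_SEGMENT = re.compile(r"^([^\[\]]+)(?:\[(\d+|\*|)\])?$")


def resolve(doc, path):
    """Return the list of values that a single path selects (empty if missing)."""
    current = [doc]
    for segment in path.split("."):
        match = _SEGMENT.match(segment.strip())
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        name, index = match.group(1), match.group(2)
        next_values = []
        for value in current:
            if not isinstance(value, dict) or name not in value:
                continue
            child = value[name]
            if index is None:                # Property
                next_values.append(child)
            elif not isinstance(child, list):
                continue
            elif index == "*":               # Property[*]: one value per array element
                next_values.extend(child)
            elif index == "":                # Property[]: merge the array, drop duplicates
                next_values.append(list(dict.fromkeys(child)))
            elif int(index) < len(child):    # Property[0]
                next_values.append(child[int(index)])
        current = next_values
    return current


def extract(doc, column):
    """Evaluate a column expression: 'a|b' takes the first present path, 'a,b' merges paths."""
    if "|" in column:
        for option in column.split("|"):
            values = resolve(doc, option)
            if values:
                return values
        return []
    merged = []
    for part in column.split(","):
        merged.extend(resolve(doc, part))
    return merged


doc = {
    "id": 7,
    "user": {"name": "ann"},
    "items": [{"sku": "A1"}, {"sku": "B2"}],
    "tags": ["x", "y", "x"],
}

print(extract(doc, "id"))                   # [7]
print(extract(doc, "user.name"))            # ['ann']
print(extract(doc, "items[0].sku"))         # ['A1']
print(extract(doc, "items[*].sku"))         # ['A1', 'B2']
print(extract(doc, "tags[]"))               # [['x', 'y']]
print(extract(doc, "id,user.name"))         # [7, 'ann']
print(extract(doc, "nickname|user.name"))   # ['ann']
```

In this sketch, a `[*]` path yields one value per array element, which corresponds to splitting one document into multiple rows.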

Parameters

| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| endpoint | The endpoint of Elasticsearch. | Yes | None |
| username | The username for HTTP authentication. | No | NULL |
| password | The password for HTTP authentication. | No | NULL |
| index | The index name in Elasticsearch. | Yes | None |
| type | The type name in the index of Elasticsearch. | No | Index name |
| pageSize | The number of data records to read at a time. | No | 100 |
| search | The query parameter of Elasticsearch. | Yes | None |
| scroll | The scroll parameter of Elasticsearch, which sets how long the snapshot taken for a scroll is kept. | Yes | None |
| sort | The field based on which the returned results are sorted. | No | None |
| retryCount | The number of retries after a failure. | No | 300 |
| connTimeOut | The connection timeout period of the client. | No | 600,000 |
| readTimeOut | The data reading timeout period of the client. | No | 600,000 |
| multiThread | Specifies whether to use multiple threads for an HTTP request. | No | true |
| column | The fields to be synchronized from Elasticsearch. | Yes | None |
| full | Specifies whether to create a column to record all data of an Elasticsearch document. | No | false |
| multi | Specifies whether to split an array into multiple rows. If you enable this feature, you must specify child properties. | No | false |
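The required flags and default values in the table can be expressed as a small validation step. This is a sketch under stated assumptions: `normalize` is a hypothetical helper, not part of Elasticsearch Reader, the NULL and None defaults are represented as Python `None`, and the sample values are made up.

```python
REQUIRED = ("endpoint", "index", "search", "scroll", "column")

# Defaults taken from the parameter table; NULL and None are represented as None.
DEFAULTS = {
    "username": None,
    "password": None,
    "pageSize": 100,
    "sort": None,
    "retryCount": 300,
    "connTimeOut": 600000,
    "readTimeOut": 600000,
    "multiThread": True,
    "full": False,
    "multi": False,
}


def normalize(parameter: dict) -> dict:
    """Check required parameters and fill in the documented defaults."""
    missing = [key for key in REQUIRED if parameter.get(key) in (None, "", [])]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    result = {**DEFAULTS, **parameter}
    result.setdefault("type", result["index"])  # type defaults to the index name
    return result


# Hypothetical input values for illustration only.
config = normalize({
    "endpoint": "http://es.example.internal:9200",
    "index": "orders",
    "search": {"query": {"match_all": {}}},
    "scroll": "1m",
    "column": ["id", "name"],
})
print(config["pageSize"], config["type"], config["multiThread"])
```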