This topic describes the data types and parameters supported by Elasticsearch Writer and how to configure it by using the code editor.

Elasticsearch is an open-source product released under the Apache License and is a mainstream search engine for enterprise data. It is a Lucene-based data search and analysis tool that provides distributed services. The core concepts of Elasticsearch map to the core concepts of a relational database as follows:
Relational database (instance) -> database -> table -> row -> column
Elasticsearch -> index -> type -> document -> field

Elasticsearch can contain multiple indexes (databases). Each index can contain multiple types (tables). Each type can contain multiple documents (rows). Each document can contain multiple fields (columns). Elasticsearch Writer uses the RESTful API of Elasticsearch to write multiple data records retrieved by a reader to Elasticsearch at a time.
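
For reference, writing multiple records in one call over the RESTful API corresponds to a bulk request. The following is a minimal, illustrative sketch of such a request body; the index name, type name, document IDs, and field values are placeholder assumptions and are not taken from this topic:

POST /_bulk
{ "index": { "_index": "my_index", "_type": "my_type", "_id": "1" } }
{ "col_name": "a", "col_value": 1 }
{ "index": { "_index": "my_index", "_type": "my_type", "_id": "2" } }
{ "col_name": "b", "col_value": 2 }

Each document line is preceded by an action line that specifies the destination index, type, and document ID.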

Parameters

The following parameters are supported. Whether each parameter is required and its default value are listed at the end of the description.
endpoint: The endpoint that is used to access Elasticsearch, in the format of http://xxxx.com:9999. Required: No. Default value: None.
accessId: The AccessKey ID that is used to access Elasticsearch. It is used for authentication when a connection to Elasticsearch is established.
Note The accessId and accessKey parameters are required. If you do not set them, an error is returned. If you use on-premises Elasticsearch for which basic authentication is not configured, the AccessKey ID and AccessKey secret are not required. In this case, you can set the accessId and accessKey parameters to random values.
Required: No. Default value: None.
accessKey: The AccessKey secret that is used to access Elasticsearch. Required: No. Default value: None.
index: The name of the index in Elasticsearch. Required: No. Default value: None.
indexType: The name of the type in the destination index of Elasticsearch. Required: No. Default value: Elasticsearch.
cleanup: Specifies whether to clear the existing data in the index. The data is cleared by deleting and rebuilding the index. The default value false indicates that the existing data in the index is retained. Required: No. Default value: false.
batchSize: The number of data records to write at a time. Required: No. Default value: 1000.
trySize: The maximum number of retries after a failure. Required: No. Default value: 30.
timeout: The connection timeout of the client, in milliseconds. Required: No. Default value: 600000.
discovery: Specifies whether to enable node discovery. If node discovery is enabled, the server list in the client is polled and regularly updated. Required: No. Default value: false.
compression: Specifies whether to enable compression for HTTP requests. Required: No. Default value: true.
multiThread: Specifies whether to use multiple threads for HTTP requests. Required: No. Default value: true.
ignoreWriteError: Specifies whether to ignore write errors and continue writing without retries. Required: No. Default value: false.
ignoreParseError: Specifies whether to ignore format parsing errors and continue writing. Required: No. Default value: true.
alias: The alias of the index. The alias feature of Elasticsearch is similar to the view feature of a traditional database. For example, if you create an alias named my_index_alias for the index my_index, operations on my_index_alias also take effect on my_index.
If you configure the alias parameter, an alias is created for the specified index after the data import is completed.
Required: No. Default value: None.
aliasMode: The mode in which the alias is added after the data is imported. Valid values: append and exclusive.
  • append: adds an alias to the current index. One alias can map to multiple indexes.
  • exclusive: deletes the existing alias of the current index and then adds the new alias. One alias maps to one index.
Elasticsearch Writer converts aliases to actual index names. You can use aliases to migrate data from one index to another, search across multiple indexes in a unified manner, and create a view on a subset of data in an index. A configuration sketch that uses alias and aliasMode is provided after this parameter list.
Required: No. Default value: append.
splitter: The delimiter that is used to split the source data if you want to write an array to Elasticsearch.
For example, a source column stores the string a-,-b-,-c-,-d. Elasticsearch Writer uses the delimiter (-,-) to split the source data into the array ["a", "b", "c", "d"] and then writes the array to the corresponding field in Elasticsearch.
Required: No. Default value: -,-.
settings: The settings of the index. The settings must be in accordance with official Elasticsearch specifications. Required: No. Default value: None.
column: The fields of the document. The configuration of each field includes basic parameters such as name and type, and advanced parameters such as analyzer, format, and array.
Elasticsearch Writer supports the following field types:
- id  // The id type corresponds to the _id field in Elasticsearch and can be considered the unique primary key. Records that have the same ID overwrite each other instead of being indexed as new documents.
- string
- text
- keyword
- long
- integer
- short
- byte
- double
- float
- date
- boolean
- binary
- integer_range
- float_range
- long_range
- double_range
- date_range
- geo_point
- geo_shape
- ip
- token_count
- array
- object
- nested
  • When the field type is text, you can specify the analyzer, norms, and index_options parameters. Example:
    {
        "name": "col_text",
        "type": "text",
        "analyzer": "ik_max_word"
    }
  • When the field type is date, you can specify the format and timezone parameters, which indicate the date serialization format and the time zone. Alternatively, you can specify the origin parameter instead of the timezone parameter. Example:
    {
        "name": "col_date",
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "origin": true
    }
    Note You must specify either the timezone or origin parameter.
    • If you specify the origin parameter, Elasticsearch Writer writes the data to Elasticsearch in its original format. We recommend that you specify the origin parameter.
    • If you want Data Integration to convert the time zone for you, delete the origin parameter and specify the timezone parameter.
  • When the field type is geo_shape, you can specify the tree (geohash or quadtree) and precision parameters. Example:
    {
        "name": "col_geo_shape",
        "type": "geo_shape",
        "tree": "quadtree",
        "precision": "10m"
    }
If you set the array parameter to true for a field, the field is an array column. Elasticsearch Writer uses the delimiter specified by the splitter parameter to split the source data, converts the data to an array of strings, and writes the array to the destination. Only one delimiter is supported for one node. Example:
{
    "name": "col_integer_array",
    "type": "integer",
    "array": true
}
Required: Yes. Default value: None.
dynamic: Specifies whether to use the mapping configuration of Elasticsearch. A value of true indicates that the mapping configuration of Elasticsearch is used instead of the mapping configuration of Data Integration. Required: No. Default value: false.
actionType: The type of the action that is used to write data to Elasticsearch. Data Integration supports only the following action types: index and update.
  • index: Data Integration uses Index.Builder of the Elasticsearch SDK to construct a request for writing multiple data records at a time. In index mode, Elasticsearch first checks whether an ID is specified for the document to be inserted.
    • If no ID is specified, Elasticsearch generates a unique ID, and the document is directly inserted.
    • If an ID is specified, the existing document that has the same ID is replaced with the new document.
      Note In this case, you cannot modify only specific fields of the document.
  • update: Data Integration uses Update.Builder of the Elasticsearch SDK to construct a request for writing multiple data records at a time. In update mode, Elasticsearch calls the get method of InternalEngine to obtain the information of the original document for each update. This allows you to modify specific fields, but retrieving the original document for each update significantly affects performance. If the original document does not exist, the new document is directly inserted.
Required: No. Default value: index.
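
The configuration example in the next section does not use the alias, aliasMode, dynamic, actionType, or timezone parameters. The following fragment is a minimal, hypothetical sketch of how these parameters could be combined in the parameter block of Elasticsearch Writer. The endpoint, index name, alias name, column names, and timezone value are placeholder assumptions, not values taken from this topic.

{
    "endpoint": "http://xxxx.com:9999",
    "accessId": "xxxx",
    "accessKey": "yyyy",
    "index": "my_index",
    "alias": "my_index_alias",
    "aliasMode": "append",
    "dynamic": false,
    "actionType": "update",
    "column": [
        {
            "name": "pk",
            "type": "id"
        },
        {
            "name": "col_date",
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss",
            "timezone": "+08:00"
        }
    ]
}

In this sketch, the id column is assumed to identify the document to update in update mode, and the timezone parameter is specified instead of origin so that Data Integration converts the time zone.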

Configure Elasticsearch Writer by using the code editor

In the following code, a node is configured to write data to Elasticsearch. For more information about the parameters, see the preceding parameter description.
{
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "speed": {
            "concurrent": 1,
            "throttle": false
        }
    },
    "steps": [
        {
            "category": "reader",
            "name": "Reader",
            "parameter": {

            },
            "stepType": "stream"
        },
        {
            "category": "writer",
            "name": "Writer",
            "parameter": {
                "endpoint": "http://xxxx.com:9999",
                "accessId": "xxxx",
                "accessKey": "yyyy",
                "index": "test-1",
                "type": "default",
                "cleanup": true,
                "settings": {
                    "index": {
                        "number_of_shards": 1,
                        "number_of_replicas": 0
                    }
                },
                "discovery": false,
                "batchSize": 1000,
                "splitter": ",",
                "column": [
                    {
                        "name": "pk",
                        "type": "id"
                    },
                    {
                        "name": "col_ip",
                        "type": "ip"
                    },
                    {
                        "name": "col_double",
                        "type": "double"
                    },
                    {
                        "name": "col_long",
                        "type": "long"
                    },
                    {
                        "name": "col_integer",
                        "type": "integer"
                    },
                    {
                        "name": "col_keyword",
                        "type": "keyword"
                    },
                    {
                        "name": "col_text",
                        "type": "text",
                        "analyzer": "ik_max_word"
                    },
                    {
                        "name": "col_geo_point",
                        "type": "geo_point"
                    },
                    {
                        "name": "col_date",
                        "type": "date",
                        "format": "yyyy-MM-dd HH:mm:ss"
                    },
                    {
                        "name": "col_nested1",
                        "type": "nested"
                    },
                    {
                        "name": "col_nested2",
                        "type": "nested"
                    },
                    {
                        "name": "col_object1",
                        "type": "object"
                    },
                    {
                        "name": "col_object2",
                        "type": "object"
                    },
                    {
                        "name": "col_integer_array",
                        "type": "integer",
                        "array": true
                    },
                    {
                        "name": "col_geo_shape",
                        "type": "geo_shape",
                        "tree": "quadtree",
                        "precision": "10m"
                    }
                ]
            },
            "stepType": "elasticsearch"
        }
    ],
    "type": "job",
    "version": "2.0"
}
Note Currently, Elasticsearch that is deployed in a virtual private cloud (VPC) supports only custom resource groups. A sync node that is run on the default resource group may fail to connect to Elasticsearch. To write data to an Elasticsearch cluster that is deployed in a VPC, use an exclusive resource group for Data Integration or a custom resource group.