This topic describes the parameters that are supported by Elasticsearch Writer and how to configure Elasticsearch Writer by using the code editor.

Limits

You can add Elasticsearch V5.X, V6.X, and V7.X data sources to DataWorks. Self-managed Elasticsearch data sources are not supported.

Background information

Elasticsearch Writer can write data to Elasticsearch V5.X clusters by using the shared resource group for Data Integration and to Elasticsearch V5.X, V6.X, and V7.X clusters by using exclusive resource groups for Data Integration. For more information about exclusive resource groups for Data Integration, see Create and use an exclusive resource group for Data Integration.

Elasticsearch is an open source, distributed search and analytics engine that is built on top of Apache Lucene and released under the Apache License. It is widely used as an enterprise search engine. The following comparison maps the core concepts of Elasticsearch to those of a relational database:
Relational database instance  -> Database  -> Table -> Row        -> Column
Elasticsearch                 -> Index     -> Type  -> Document   -> Field

An Elasticsearch cluster can contain multiple indexes (databases). Each index can contain multiple types (tables). Each type can contain multiple documents (rows). Each document can contain multiple fields (columns). Elasticsearch Writer obtains data records from a reader and uses the RESTful API of Elasticsearch to write the data records to Elasticsearch in batches.
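
The batched write can be pictured with the following minimal Python sketch. It is not the implementation of Elasticsearch Writer. It only shows how a stream of records could be grouped into newline-delimited request bodies in the format that the Elasticsearch _bulk endpoint accepts. The function name bulk_bodies and the use of a pk key as the document ID are assumptions made for illustration.

```python
import json


def bulk_bodies(records, index, batch_size=1000):
    """Yield newline-delimited JSON bodies, one per batch of records.

    Each record contributes two lines: an action line and a source line.
    The action line omits _type, which matches Elasticsearch V7.X;
    V5.X and V6.X clusters would also need a type.
    """
    lines = []
    count = 0
    for record in records:
        doc = dict(record)  # copy so the caller's record is not modified
        action = {"index": {"_index": index}}
        # Assumption for illustration: a "pk" key becomes the document _id.
        if "pk" in doc:
            action["index"]["_id"] = str(doc.pop("pk"))
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # Flush the final, possibly smaller, batch.
        yield "\n".join(lines) + "\n"


if __name__ == "__main__":
    rows = [{"pk": i, "col_keyword": f"v{i}"} for i in range(5)]
    for body in bulk_bodies(rows, index="test-1", batch_size=2):
        print(body)
```

In this sketch, batch_size plays the role of the batchSize parameter: a larger value means fewer, larger requests.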

Parameters

Parameter Description Required Default value
endpoint The endpoint of Elasticsearch. Specify the endpoint in the http://example.com:9999 format. No No default value
accessId The AccessKey ID that is used to connect to the destination Elasticsearch cluster. The AccessKey ID is used for authentication before a connection to the Elasticsearch cluster can be established.
Note You must specify both the accessId and accessKey parameters. Otherwise, an error is returned. If you use a self-managed Elasticsearch cluster for which basic access authentication is not configured, no real AccessKey ID or AccessKey secret is needed. In this case, you can set the accessId and accessKey parameters to arbitrary values.
No No default value
accessKey The AccessKey secret that is used to connect to the destination Elasticsearch cluster. No No default value
index The name of the index in the destination Elasticsearch cluster. No No default value
indexType The name of the index type in the destination Elasticsearch cluster. No Elasticsearch
cleanup Specifies whether to delete the existing data from the index. If you set this parameter to true, the existing data is removed by deleting and recreating the index. The default value false indicates that the existing data in the index is retained. No false
batchSize The number of data records to write at a time. No 1,000
trySize The maximum number of retries allowed after a failure. No 30
timeout The timeout period of the connection to the client. No 600,000
discovery Specifies whether to enable node discovery. If node discovery is enabled, the server list in the client is polled and regularly updated. No false
compression Specifies whether to enable compression for an HTTP request. No true
multiThread Specifies whether to use multiple threads for an HTTP request. No true
ignoreWriteError Specifies whether to ignore write errors and continue writing data without retrying. No false
ignoreParseError Specifies whether to ignore format parsing errors and proceed with data write. No true
alias The alias feature of Elasticsearch is similar to the view feature of a database. For example, if you create an alias named my_index_alias for the index my_index, the operations on my_index_alias also take effect on my_index.

If you configure the alias parameter, the alias that you specify in this parameter is created for the index after data is written to the index.

No No default value
aliasMode The mode in which an alias is added after data is written to the index. Valid values: append and exclusive.
  • If you set the aliasMode parameter to append, the alias is added to the index and its existing mappings are kept. One alias can map to multiple indexes.
  • If you set the aliasMode parameter to exclusive, the existing alias is deleted and then added again for the index. One alias maps to only one index.

Elasticsearch Writer can convert aliases to actual index names. You can use aliases to migrate data from one index to another index, search for data across multiple indexes in a unified manner, and create a view on a subset of data in an index.

No append
splitter The delimiter based on which Elasticsearch Writer splits the source data when the data is written to an array field. The default delimiter is -,-.

For example, the source column stores the string a-,-b-,-c-,-d. In this case, Elasticsearch Writer splits the data based on the delimiter (-,-), obtains the array ["a", "b", "c", "d"], and then writes the array to the related field in the destination Elasticsearch cluster.

No -,-
settings The settings of the index. The settings must follow official Elasticsearch specifications. No No default value
column The fields of the document. The parameters for each field include basic parameters such as name and type, and advanced parameters such as analyzer, format, and array.
Elasticsearch supports the following field types:
- id  // The id type corresponds to the _id field in Elasticsearch and serves as the unique primary key. If a record has the same ID as an existing document, the existing document is overwritten. The id value itself is not indexed.
- string
- text
- keyword
- long
- integer
- short
- byte
- double
- float
- date
- boolean
- binary
- integer_range
- float_range
- long_range
- double_range
- date_range
- geo_point
- geo_shape
- ip
- token_count
- array
- object
- nested
The following information describes the field types:
  • If the field type is text, you can configure the analyzer, norms, and index_options parameters. Example:
    {
        "name": "col_text",
        "type": "text",
        "analyzer": "ik_max_word"
    }
  • If the field type is date, you can configure the format and timezone parameters to indicate the date serialization format and the time zone. You can also configure the origin parameter instead of the timezone parameter.
    • If you configure the origin parameter, Elasticsearch Writer updates the mappings of the index and writes data to the index in the original format. We recommend that you configure the origin parameter.
    • If you want to use Data Integration to convert the time zone, delete the origin parameter and configure the timezone parameter.
    Example:
    {
        "name": "col_date",
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss",
        "origin": true
    }
  • If the field type is geo_shape, you can configure the tree (geohash or quadtree) and precision parameters. Example:
    {
        "name": "col_geo_shape",
        "type": "geo_shape",
        "tree": "quadtree",
        "precision": "10m"
    }
If you set the array parameter to true for a field, the field is an array column. In this case, Elasticsearch Writer splits the source data based on the delimiter that is specified by the splitter parameter, converts the data to an array of strings, and writes the array to the index. Each node supports only one delimiter. Example:
{
    "name": "col_integer_array",
    "type": "integer",
    "array": true
}
Yes No default value
dynamic If you set this parameter to true, Elasticsearch Writer uses the mapping configuration of the destination Elasticsearch cluster instead of the mapping configuration of Data Integration.

In Elasticsearch V7.X, the default value of the type parameter is _doc. If you use the mapping configuration of the destination Elasticsearch cluster, set the type parameter to _doc and the esVersion parameter to 7.

You must add the following parameter configuration that specifies the version information to the code: "esVersion": "7".

No false
actionType The type of action for writing data to the destination Elasticsearch cluster. Data Integration supports only the following action types: index and update. Default value: index.
  • index: Elasticsearch Writer uses Index.Builder of an Elasticsearch SDK to construct a request for writing multiple data records at a time. In index mode, Elasticsearch Writer first checks whether an ID is specified for the document that you want to insert.
    • If no ID is specified, Elasticsearch Writer generates a unique ID. In this case, the document is directly inserted into the destination Elasticsearch cluster.
    • If an ID is specified, the existing document is replaced with the document that you want to insert. You cannot modify specific fields in the document.
      Note This operation replaces the entire document. It differs from an update in Elasticsearch, which can modify specific fields.
  • update: Elasticsearch Writer uses Update.Builder of an Elasticsearch SDK to construct a request for writing multiple data records at a time. In update mode, Elasticsearch Writer calls the get method of InternalEngine to obtain the original document for each update. This allows you to modify specific fields, but the extra lookup greatly affects performance. If the original document does not exist, the new document is directly inserted.
No index
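
The way the splitter and array settings interact with the column list can be sketched in Python as follows. This is an illustration of the behavior described in the table, not code from Elasticsearch Writer. The function names split_array_column and build_document are hypothetical, and treating an empty source value as an empty array is an assumption.

```python
def split_array_column(raw, splitter="-,-"):
    """Split one source value into a list of strings by the configured delimiter."""
    if raw is None or raw == "":
        # Assumption: an empty source value becomes an empty array.
        return []
    return raw.split(splitter)


def build_document(columns, values, splitter="-,-"):
    """Pair column definitions with source values.

    Returns (doc_id, source), where doc_id comes from the column of type "id"
    and source holds the remaining fields.
    """
    doc_id = None
    source = {}
    for col, value in zip(columns, values):
        if col["type"] == "id":
            doc_id = value
        elif col.get("array"):
            source[col["name"]] = split_array_column(value, splitter)
        else:
            source[col["name"]] = value
    return doc_id, source


if __name__ == "__main__":
    columns = [
        {"name": "pk", "type": "id"},
        {"name": "col_integer_array", "type": "integer", "array": True},
        {"name": "col_keyword", "type": "keyword"},
    ]
    print(build_document(columns, ["1", "1,2,3", "x"], splitter=","))
```

The array elements stay strings in this sketch, in line with the description that the data is converted to an array of strings.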

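The difference between the index and update action types can be made concrete with a small simulation. The following Python sketch uses an in-memory dictionary in place of an Elasticsearch index and does not call any Elasticsearch SDK. The function name apply_action is hypothetical, and rejecting an update that has no ID is an assumption of the sketch.

```python
import uuid


def apply_action(store, action_type, source, doc_id=None):
    """Apply one write to `store`, a dict that stands in for an index.

    index:  replace the whole document, generating an ID if none is given.
    update: merge the given fields into the existing document, or insert
            the document if it does not exist yet.
    """
    if action_type == "index":
        if doc_id is None:
            doc_id = uuid.uuid4().hex  # generated unique ID
        store[doc_id] = dict(source)   # the whole document is replaced
    elif action_type == "update":
        if doc_id is None:
            # Assumption of this sketch: an update must name its target.
            raise ValueError("update requires a document ID")
        existing = store.get(doc_id, {})       # read the original document
        store[doc_id] = {**existing, **source}  # change only the given fields
    else:
        raise ValueError(f"unsupported action type: {action_type}")
    return doc_id


if __name__ == "__main__":
    store = {}
    apply_action(store, "index", {"a": 1, "b": 2}, doc_id="1")
    apply_action(store, "index", {"a": 9}, doc_id="1")   # field b is lost
    apply_action(store, "update", {"b": 5}, doc_id="1")  # field a is kept
    print(store)
```

The read of the existing document in the update branch corresponds to the extra lookup that makes update mode slower than index mode.
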
Configure Elasticsearch Writer by using the code editor

For more information about how to configure a synchronization node by using the code editor, see Create a sync node by using the code editor.

In the following code, a synchronization node is configured to write data to Elasticsearch. For more information about the parameters, see the preceding parameter description.
{
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "speed": {
            "throttle": true, // Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
            "concurrent": 1, // The maximum number of parallel threads.
            "mbps": "12" // The maximum transmission rate.
        }
    },
    "steps": [
        {
            "category": "reader",
            "name": "Reader",
            "parameter": {

            },
            "stepType": "stream"
        },
        {
            "category": "writer",
            "name": "Writer",
            "parameter": {
                "endpoint": "http://example.com:9999",
                "accessId": "xxxx",
                "accessKey": "yyyy",
                "index": "test-1",
                "type": "default",
                "cleanup": true,
                "settings": {
                    "index": {
                        "number_of_shards": 1,
                        "number_of_replicas": 0
                    }
                },
                "discovery": false,
                "batchSize": 1000,
                "splitter": ",",
                "column": [
                    {
                        "name": "pk",
                        "type": "id"
                    },
                    {
                        "name": "col_ip",
                        "type": "ip"
                    },
                    {
                        "name": "col_double",
                        "type": "double"
                    },
                    {
                        "name": "col_long",
                        "type": "long"
                    },
                    {
                        "name": "col_integer",
                        "type": "integer"
                    },
                    {
                        "name": "col_keyword",
                        "type": "keyword"
                    },
                    {
                        "name": "col_text",
                        "type": "text",
                        "analyzer": "ik_max_word"
                    },
                    {
                        "name": "col_geo_point",
                        "type": "geo_point"
                    },
                    {
                        "name": "col_date",
                        "type": "date",
                        "format": "yyyy-MM-dd HH:mm:ss"
                    },
                    {
                        "name": "col_nested1",
                        "type": "nested"
                    },
                    {
                        "name": "col_nested2",
                        "type": "nested"
                    },
                    {
                        "name": "col_object1",
                        "type": "object"
                    },
                    {
                        "name": "col_object2",
                        "type": "object"
                    },
                    {
                        "name": "col_integer_array",
                        "type": "integer",
                        "array": true
                    },
                    {
                        "name": "col_geo_shape",
                        "type": "geo_shape",
                        "tree": "quadtree",
                        "precision": "10m"
                    }
                ]
            },
            "stepType": "elasticsearch"
        }
    ],
    "type": "job",
    "version": "2.0"
}
Note A connection failure may occur if you use the shared resource group for Data Integration to connect to an Elasticsearch cluster that is deployed in a virtual private cloud (VPC). To write data to an Elasticsearch cluster that is deployed in a VPC, use exclusive or custom resource groups for Data Integration. For more information about how to create an exclusive or custom resource group for Data Integration, see Exclusive resource groups for Data Integration or Create a custom resource group for Data Integration.
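
The sample script contains // comments, which a strict JSON parser rejects. If you want to inspect such a script locally, the following Python sketch shows one possible approach: it removes // comments that appear outside string literals, parses the result, and returns the parameter block of the writer step. The function names load_job and writer_parameters are hypothetical, and the sketch is not part of DataWorks.

```python
import json
import re

# Matches either a JSON string literal (captured in group 1) or a // comment
# that runs to the end of the line. Matching strings first keeps values such
# as "http://example.com:9999" intact.
_TOKEN = re.compile(r'("(?:\\.|[^"\\])*")|//[^\n]*')


def load_job(text):
    """Parse a job script after removing // comments outside string literals."""
    cleaned = _TOKEN.sub(lambda m: m.group(1) or "", text)
    return json.loads(cleaned)


def writer_parameters(job):
    """Return the parameter block of the first step whose category is writer."""
    for step in job["steps"]:
        if step.get("category") == "writer":
            return step["parameter"]
    raise ValueError("job has no writer step")


if __name__ == "__main__":
    sample = """
    {
        "setting": {"speed": {"throttle": true, // enable throttling
                              "mbps": "12"// maximum rate
        }},
        "steps": [
            {"category": "writer", "name": "Writer",
             "parameter": {"endpoint": "http://example.com:9999",
                           "column": [{"name": "pk", "type": "id"}]},
             "stepType": "elasticsearch"}
        ]
    }
    """
    params = writer_parameters(load_job(sample))
    # column is the parameter marked as required in the parameter table.
    print("column" in params, params["endpoint"])
```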