This topic describes the parameters that are supported by Elasticsearch Writer and how to configure Elasticsearch Writer by using the codeless user interface (UI) and code editor.
Supported Elasticsearch versions
DataWorks allows you to add only Alibaba Cloud Elasticsearch V5.X, V6.X, and V7.X clusters as data sources. Self-managed Elasticsearch clusters are not supported.
Supported data types
Data type | Elasticsearch Reader for batch data read | Elasticsearch Writer for batch data write | Elasticsearch Writer for real-time data write |
---|---|---|---|
binary | Supported | Supported | Supported |
boolean | Supported | Supported | Supported |
keyword | Supported | Supported | Supported |
constant_keyword | Not supported | Not supported | Not supported |
wildcard | Not supported | Not supported | Not supported |
long | Supported | Supported | Supported |
integer | Supported | Supported | Supported |
short | Supported | Supported | Supported |
byte | Supported | Supported | Supported |
double | Supported | Supported | Supported |
float | Supported | Supported | Supported |
half_float | Not supported | Not supported | Not supported |
scaled_float | Not supported | Not supported | Not supported |
unsigned_long | Not supported | Not supported | Not supported |
date | Supported | Supported | Supported |
date_nanos | Not supported | Not supported | Not supported |
alias | Not supported | Not supported | Not supported |
object | Supported | Supported | Supported |
flattened | Not supported | Not supported | Not supported |
nested | Supported | Supported | Supported |
join | Not supported | Not supported | Not supported |
integer_range | Supported | Supported | Supported |
float_range | Supported | Supported | Supported |
long_range | Supported | Supported | Supported |
double_range | Supported | Supported | Supported |
date_range | Supported | Supported | Supported |
ip_range | Not supported | Supported | Supported |
ip | Supported | Supported | Supported |
version | Supported | Supported | Supported |
murmur3 | Not supported | Not supported | Not supported |
aggregate_metric_double | Not supported | Not supported | Not supported |
histogram | Not supported | Not supported | Not supported |
text | Supported | Supported | Supported |
annotated-text | Not supported | Not supported | Not supported |
completion | Supported | Not supported | Not supported |
search_as_you_type | Not supported | Not supported | Not supported |
token_count | Supported | Not supported | Not supported |
dense_vector | Not supported | Not supported | Not supported |
rank_feature | Not supported | Not supported | Not supported |
rank_features | Not supported | Not supported | Not supported |
geo_point | Supported | Supported | Supported |
geo_shape | Supported | Supported | Supported |
point | Not supported | Not supported | Not supported |
shape | Not supported | Not supported | Not supported |
percolator | Not supported | Not supported | Not supported |
string | Supported | Supported | Supported |
Background information
Elasticsearch Writer can write data to Elasticsearch V5.X data sources by using the shared resource group for Data Integration and to Elasticsearch V5.X, V6.X, and V7.X data sources by using exclusive resource groups for Data Integration. For information about exclusive resource groups for Data Integration, see Create and use an exclusive resource group for Data Integration.
Relational database instance -> Database -> Table -> Row -> Column
Elasticsearch cluster -> Index -> Type -> Document -> Field
An Elasticsearch cluster can contain multiple indexes (databases). Each index can contain multiple types (tables). Each type can contain multiple documents (rows). Each document can contain multiple fields (columns). Elasticsearch Writer obtains data records from a reader and uses the RESTful API of Elasticsearch to write the data records to Elasticsearch in batches.
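The batched write described above maps naturally onto the NDJSON request body of the Elasticsearch `_bulk` endpoint. The following sketch shows how records can be grouped into batches and rendered as bulk request bodies. It is illustrative only, not the actual Elasticsearch Writer implementation; the index name, field names, and batch size are assumptions.

```python
import json

def build_bulk_body(index, records, id_field=None, batch_size=1000):
    """Group records into batches and render each batch as an
    NDJSON body for the Elasticsearch _bulk endpoint."""
    bodies = []
    for start in range(0, len(records), batch_size):
        lines = []
        for record in records[start:start + batch_size]:
            # Each document is preceded by an action line; _id is optional.
            action = {"index": {"_index": index}}
            if id_field is not None:
                action["index"]["_id"] = str(record[id_field])
            lines.append(json.dumps(action))
            lines.append(json.dumps(record))
        # The _bulk API requires a trailing newline after the last line.
        bodies.append("\n".join(lines) + "\n")
    return bodies

records = [{"pk": 1, "col_keyword": "a"}, {"pk": 2, "col_keyword": "b"}]
bodies = build_bulk_body("test-1", records, id_field="pk")
```

Each body returned by this sketch could then be sent to `POST /_bulk` with the `application/x-ndjson` content type.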
Parameters
Parameter | Description | Required | Default value |
---|---|---|---|
datasource | The name of the data source. If no data sources are available, add an Elasticsearch cluster to DataWorks as a data source. For more information, see Add an Elasticsearch data source. | Yes | No default value |
index | The name of the index in the destination Elasticsearch cluster. | Yes | No default value |
indexType | The name of the index type in the destination Elasticsearch cluster. | No | Elasticsearch |
cleanup | Specifies whether to delete the existing data from the index before data is written. Valid values: true (the index is rebuilt and the existing data is deleted) and false (the existing data is retained). | No | false |
batchSize | The number of data records to write at a time. | No | 1,000 |
trySize | The maximum number of retries that can be performed after a failure occurs. | No | 30 |
timeout | The connection timeout period of the client. Unit: milliseconds. | No | 600,000 |
discovery | Specifies whether to enable node discovery. If you set this parameter to true, the server list in the client is polled and updated at regular intervals. Valid values: true and false. | No | false |
compression | Specifies whether to enable compression for an HTTP request. | No | true |
multiThread | Specifies whether to use multiple threads for an HTTP request. | No | true |
ignoreWriteError | Specifies whether to ignore write errors and proceed with data write operations without retries. | No | false |
ignoreParseError | Specifies whether to ignore format parsing errors and proceed with data write operations. | No | true |
alias | The alias feature of Elasticsearch is similar to the view feature of a database. For example, if you create an alias named my_index_alias for the index my_index, the operations that are performed on my_index_alias also take effect on my_index. If you configure the alias parameter, the alias that you specify is created for the index after data is written to the index. | No | No default value |
aliasMode | The mode in which an alias is added after data is written to the index. Valid values: append (the alias is added to the index and existing aliases are retained) and exclusive (only the specified alias is retained for the index). Elasticsearch Writer can convert aliases to actual index names. You can use aliases to migrate data from one index to another index, search for data across multiple indexes in a unified manner, and create a view on a subset of data in an index. | No | append |
settings | The settings of the index. The settings must follow the specifications of open source Elasticsearch. | No | No default value |
column | The fields of the document. The parameters for each field include basic parameters, such as name and type, and advanced parameters, such as analyzer, format, and array. If you want to define attributes related to the Elasticsearch cluster in addition to the field type, you can configure the other_params parameter in column to define the attributes. The other_params parameter is used when Elasticsearch Writer updates the mapping configurations of the Elasticsearch cluster. If you want to write source data to Elasticsearch as arrays, you can enable Elasticsearch Writer to parse the source data in the JSON format or based on a specified delimiter. For more information, see Appendix: Write data to Elasticsearch as arrays. | Yes | No default value |
dynamic | Specifies whether to use the dynamic mapping mechanism of Elasticsearch, instead of the mappings that are configured in the data synchronization node, to establish mappings for fields that are written to the index. Valid values: true and false. In Elasticsearch V7.X, the default value of the type parameter is _doc. If you set this parameter to true to use the dynamic mapping mechanism, you must set the type parameter to _doc and add the esVersion parameter that is set to 7 to the code. | No | false |
actionType | The type of the action that is performed when data is written to the destination Elasticsearch cluster. Data Integration supports only the following action types: index (Elasticsearch Writer uses the index operation to write documents and overwrites existing documents that have the same _id) and update (Elasticsearch Writer uses the update operation to update existing documents). | No | index |
primaryKeyInfo | The value assignment method of the _id column that is used as the primary key when data is written to Elasticsearch. You can use the value of a single specified source column as the primary key (business key), join the values of multiple specified source columns by using the specified fieldDelimiter to form a composite primary key, or specify no primary key and let Elasticsearch automatically generate an _id for each document. | Yes | specific |
esPartitionColumn | Specifies whether to enable partitioning for the index when data is written to Elasticsearch. This parameter is used to change the routing setting of the Elasticsearch cluster so that documents that have the same partition key value are routed to the same shard. If you enable partitioning, you must specify the partition key columns, as shown in the code editor example in this topic. | No | false |
enableWriteNull | Specifies whether source fields whose values are NULL can be synchronized to Elasticsearch. Valid values: true (source fields whose values are NULL are written to Elasticsearch) and false (source fields whose values are NULL are not written to Elasticsearch). | No | false |
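As noted for the dynamic parameter, Elasticsearch V7.X requires the version to be declared explicitly when dynamic mapping is used. The following fragment is a sketch of the relevant writer parameters only; the surrounding parameters are omitted and the index name is illustrative:

```json
"parameter": {
  "index": "test-1",
  "type": "_doc",
  "dynamic": true,
  "esVersion": "7"
}
```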
Configure Elasticsearch Writer by using the code editor
For information about how to configure a data synchronization node by using the code editor, see Configure a batch synchronization node by using the code editor.
{
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":1, // The maximum number of parallel threads.
"mbps":"12" // The maximum transmission rate.
}
},
"steps": [
{
"category": "reader",
"name": "Reader",
"parameter": {
},
"stepType": "stream"
},
{
"category": "writer",
"name": "Writer",
"parameter": {
"datasource": "xxx",
"index": "test-1",
"type": "default",
"cleanup": true,
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"discovery": false,
"primaryKeyInfo": {
"type": "pk",
"fieldDelimiter": ",",
"column": []
},
"batchSize": 1000,
"dynamic": false,
"esPartitionColumn": [
{
"name": "col1",
"comment": "xx",
"type": "STRING"
}
],
"column": [
{
"name": "pk",
"type": "id"
},
{
"name": "col_ip",
"type": "ip"
},
{
"name": "col_array",
"type": "long",
"array": true,
},
{
"name": "col_double",
"type": "double"
},
{
"name": "col_long",
"type": "long"
},
{
"name": "col_integer",
"type": "integer"
{
"name": "col_keyword",
"type": "keyword"
},
{
"name": "col_text",
"type": "text",
"analyzer": "ik_max_word",
"other_params":
{
"doc_values": false
}
},
{
"name": "col_geo_point",
"type": "geo_point"
},
{
"name": "col_date",
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
{
"name": "col_nested1",
"type": "nested"
},
{
"name": "col_nested2",
"type": "nested"
},
{
"name": "col_object1",
"type": "object"
},
{
"name": "col_object2",
"type": "object"
},
{
"name": "col_integer_array",
"type": "integer",
"array": true
},
{
"name": "col_geo_shape",
"type": "geo_shape",
"tree": "quadtree",
"precision": "10m"
}
]
},
"stepType": "elasticsearch"
}
],
"type": "job",
"version": "2.0"
}
Configure Elasticsearch Writer by using the codeless UI
Open the created data synchronization node and configure the node. For information about how to configure a data synchronization node by using the codeless UI, see Configure a batch synchronization node by using the codeless UI. In this example, a data synchronization node that is used to synchronize data to Elasticsearch is configured by using the codeless UI.
- Configure data sources. Configure Source and Target for the data synchronization node.
Parameter | Description |
---|---|
Connection | The name of the data source to which you want to write data. This parameter is equivalent to the datasource parameter that is described in the Parameters section. |
Index | The name of the index to which you want to write data. This parameter is equivalent to the index parameter that is described in the Parameters section. |
Delete the original index | Specifies whether to delete the existing data in the index. This parameter is equivalent to the cleanup parameter that is described in the Parameters section. |
Write Mode | The write mode. Valid values: index and update. This parameter is equivalent to the actionType parameter that is described in the Parameters section. |
ElasticSearch auto mapping | Specifies whether to use the dynamic mapping mechanism of Elasticsearch to establish mappings. This parameter is equivalent to the dynamic parameter that is described in the Parameters section. |
Primary key value method | The value assignment method of the _id column that is used as the primary key to write data to Elasticsearch. This parameter is equivalent to the primaryKeyInfo parameter that is described in the Parameters section. |
Write batch size | The number of data records to write at a time. This parameter is equivalent to the batchSize parameter that is described in the Parameters section. |
Enable partition | Specifies whether to enable partitioning for the index when data is written to Elasticsearch. This parameter is equivalent to the esPartitionColumn parameter that is described in the Parameters section. |
Enable node discovery | Specifies whether to enable node discovery. This parameter is equivalent to the discovery parameter that is described in the Parameters section. |
Settings | The settings of the index. This parameter is equivalent to the settings parameter that is described in the Parameters section. |
- Configure field mappings. This operation is equivalent to setting the column parameter that is described in the Parameters section.
Fields in the source on the left have a one-to-one mapping with fields in the destination on the right.
- Configure channel control policies.
Parameter | Description |
---|---|
Expected Maximum Concurrency | The maximum number of parallel threads that the data synchronization node uses to read data from the source or write data to the destination. You can configure the parallelism for the data synchronization node on the codeless UI. |
Bandwidth Throttling | Specifies whether to enable throttling. You can enable throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source. |
Dirty Data Records Allowed | The maximum number of dirty data records allowed. |
Distributed Execution | The distributed execution mode splits the node into slices and distributes them to multiple Elastic Compute Service (ECS) instances for parallel running, which speeds up synchronization. If you use a large number of parallel threads to run your data synchronization node in distributed execution mode, excessive access requests are sent to the data sources. Therefore, before you use the distributed execution mode, you must evaluate the access loads on the data sources. You can enable the distributed execution mode only if you use an exclusive resource group for Data Integration to run your data synchronization node. For more information about exclusive resource groups for Data Integration, see Exclusive resource groups for Data Integration and Create and use an exclusive resource group for Data Integration. |
Appendix: Write data to Elasticsearch as arrays
You can use one of the following methods to configure Elasticsearch Writer to write data to Elasticsearch as arrays.
- Enable Elasticsearch Writer to parse source data in the JSON format
For example, the value of a source field is "[1,2,3,4,5]". If you want Elasticsearch Writer to write the value to Elasticsearch as an array, set the json_array parameter to true in the code of the node. This way, Elasticsearch Writer parses the source data as JSON and writes it to Elasticsearch as an array.
"parameter": {
"column": [
{
"name": "docs_1",
"type": "keyword",
"json_array": true
}
]
}
- Enable Elasticsearch Writer to parse source data based on a specified delimiter
For example, the value of a source field is "1,2,3,4,5". If you want Elasticsearch Writer to write the value to Elasticsearch as an array, set the splitter parameter to a comma (,) in the code of the node. This way, Elasticsearch Writer splits the value based on the delimiter and writes the resulting values to Elasticsearch as an array.
Note: A data synchronization node supports only one delimiter. You cannot specify different delimiters for different fields that you want to write to Elasticsearch as arrays. For example, you cannot specify a comma (,) as the delimiter for the source field col1="1,2,3,4,5" and a hyphen (-) as the delimiter for the source field col2="6-7-8-9-10".
"parameter": {
"column": [
{
"name": "docs_2",
"array": true,
"type": "long"
}
],
"splitter": "," // You must configure the splitter parameter at the same level as the column parameter.
}
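Conceptually, the two parsing modes behave as follows. This is a simplified sketch for illustration, not the Writer's actual code; the function and field names are assumptions:

```python
import json

def parse_array_field(value, json_array=False, splitter=","):
    """Mimic the two ways Elasticsearch Writer can turn a source
    string into an array: JSON parsing or delimiter splitting."""
    if json_array:
        # json_array mode: the source string is parsed as a JSON array.
        return json.loads(value)      # "[1,2,3,4,5]" -> [1, 2, 3, 4, 5]
    # splitter mode: the source string is split on the single delimiter.
    return value.split(splitter)      # "1,2,3,4,5" -> ["1", "2", "3", "4", "5"]
```

Note that in splitter mode the elements remain strings; the configured field type (for example, long) determines how Elasticsearch interprets them on write.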