The Elasticsearch input component is designed to read data from Elasticsearch data sources. When synchronizing data from Elasticsearch to other data sources, it's essential to configure the Elasticsearch input component first, followed by the target data source for synchronization. This guide describes the configuration process for the Elasticsearch input component.
Prerequisites
An Elasticsearch data source has been created. For more information, see Create Elasticsearch Data Source.
The account configuring the Elasticsearch input component must possess read-through permission for the data source. If you lack this permission, you need to request access to the data source. For more information, see Request Data Source Permission.
Procedure
On the Dataphin home page, select Development > Data Integration from the top menu bar.
In the integration page's top menu bar, select Project (Dev-Prod mode requires selecting an environment).
Click Batch Pipeline in the left-side navigation pane. Then, in the Batch Pipeline list, click the offline pipeline to open its configuration page.
Click Component Library in the upper-right corner to open the Component Library panel.
In the Component Library panel's left-side navigation pane, select Input. Find the Elasticsearch component in the list on the right and drag it to the canvas.
Click the icon on the Elasticsearch input component card to open the Elasticsearch Input Configuration dialog box.
Configure parameters in the Elasticsearch Input Configuration dialog box.
Parameter
Description
Basic Configuration
Step Name
This is the name of the Elasticsearch input component. Dataphin automatically generates the step name, and you can also modify it based on the business scenario. The naming convention is as follows:
Can only contain Chinese characters, letters, underscores (_), and numbers.
Cannot exceed 64 characters.
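The naming convention above can be checked before you type a name into the dialog box. The following is a minimal sketch, not Dataphin's own validation: the function name is hypothetical, and "Chinese characters" is assumed to mean the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re

# Assumption: "Chinese characters" is approximated by the basic
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
_STEP_NAME_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_]{1,64}")


def is_valid_step_name(name: str) -> bool:
    """Return True if name uses only allowed characters and is 1 to 64 characters long."""
    return _STEP_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("es_input_1", "bad-name", "a" * 65):
        print(repr(candidate[:20]), is_valid_step_name(candidate))
```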
Datasource
The data source drop-down list displays all Elasticsearch data sources and project levels in the current Dataphin instance, and shows whether you have read-through permission for each data source. Click the icon to copy the current data source name.
For data sources without read-through permission, click Request next to the data source to request read-through permission. For more information, see Request Data Source Permission.
If you do not have an Elasticsearch type data source, click Create to create a data source. For more information, see Create Elasticsearch Data Source.
Query Type
You can select the index document to read based on the index or index alias. Different query types require different parameters.
Index.
Index Document: The index name in Elasticsearch. Click the icon to copy the name of the currently selected index document.
Index Document Type: The type name of the index in Elasticsearch.
Note: Index Document and Index Document Type are required in Elasticsearch 6.x and 7.x, and optional in Elasticsearch 8.x.
Index Alias.
Index Alias: The alias of the index in Elasticsearch.
Index Document Type: The type name of the index in Elasticsearch.
Query Conditions
The query parameter of Elasticsearch, used for full or incremental queries. For example, {"match_all": {}} indicates a full query.
Cursor Time
The cursor retention time, which is the paging (scroll) parameter of Elasticsearch. For example, 5m represents a cursor time of 5 minutes.
If the value is too small and the idle time between retrieving two pages of data exceeds the cursor time, the cursor expires and data is lost.
If the value is too large and too many queries are initiated at the same time, the number of open cursors can exceed the server-side max_open_scroll_context configuration, which results in a data query error.
Unit: days (d), hours (h), minutes (m), seconds (s), milliseconds (ms), microseconds (micros), nanoseconds (nanos).
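To make the Query Conditions and Cursor Time values more concrete, the sketch below builds a full query and an incremental query, and converts a cursor time such as 5m to seconds. It is illustrative only: the field name update_time and both helper functions are hypothetical, and the incremental query assumes the index has a timestamp field that a standard Elasticsearch range query can filter on.

```python
import json
import re

# Full query, as in the example for Query Conditions.
FULL_QUERY = {"match_all": {}}

# Elasticsearch time-unit suffixes mapped to seconds.
_UNIT_SECONDS = {
    "d": 86400.0, "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "micros": 1e-6, "nanos": 1e-9,
}


def incremental_query(field: str, start: str, end: str) -> dict:
    """Build a range query that selects documents with start <= field < end."""
    return {"range": {field: {"gte": start, "lt": end}}}


def cursor_time_seconds(value: str) -> float:
    """Convert a cursor time such as '5m' to seconds; raise ValueError if malformed."""
    match = re.fullmatch(r"(\d+)([a-z]+)", value)
    if match is None or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(f"invalid cursor time: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


if __name__ == "__main__":
    # 'update_time' is a hypothetical field name used only for illustration.
    query = incremental_query("update_time", "2024-01-01T00:00:00", "2024-01-02T00:00:00")
    print(json.dumps(query))
    print(cursor_time_seconds("5m"))
```

The JSON printed for the incremental query is the kind of value that would go into Query Conditions. Comparing the converted cursor time with the longest expected pause between pages is one way to judge whether the value is too small.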
Advanced Configuration
Batch Read Count
The number of records read per batch. The default is 1024. Reading from the source in batches rather than one record at a time reduces the number of interactions with the data source, improves I/O efficiency, and lowers the impact of network latency.
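The effect of the batch read count on the number of interactions is simple arithmetic. The sketch below assumes one request per batch, which is a simplification of how the component actually reads; the function name is hypothetical.

```python
def round_trips(total_records: int, batch_size: int = 1024) -> int:
    """Number of read requests needed, assuming one request per batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (total_records + batch_size - 1) // batch_size


if __name__ == "__main__":
    print(round_trips(1_000_000))      # default batch size of 1024
    print(round_trips(1_000_000, 1))   # reading one record at a time
```

Under this assumption, one million records take 977 requests at the default batch size, compared with one million requests when reading record by record.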
Connection Timeout
The client connection timeout, default is 6000 seconds.
Management Timeout
The client read timeout, default is 6000 seconds.
Date Format
When the synchronized field has a date type and the mapping of the field does not have a format configuration, you need to configure the dateFormat parameter. The default format in ES is: yyyy-MM-dd'T'HH:mm:ssZ.
Output Fields
Displays the output fields for you.
Retrieve Field Information.
When the query type is Index, you can click Retrieve Field Information to obtain the field information of the selected Index.
Batch Add Fields.
Click Batch Add.
Configure in JSON format in batches. The following sample code provides an example:
[{"name":"col_integer","type":"integer"}, {"name":"col_long","type":"long"}, {"name":"col_double","type":"double"}]
Note: name indicates the name of the field to be introduced, and type indicates the type of the field after introduction. For example, "name":"user_id","type":"String" indicates that the field named user_id is introduced and the field type is set to String.
Configure in TEXT format in batches. The following sample code provides an example:
col_long,long
col_double,double
The row delimiter separates the information for each field. The default is a line feed (\n). Supported values are line feed (\n), semicolon (;), and period (.).
The column delimiter separates the field name from the field type. The default is a comma (,).
Click Confirm.
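The JSON and TEXT batch formats describe the same list of fields. The sketch below shows how they relate by parsing each into a list of name and type pairs. It is an illustration, not Dataphin's parser, and the function names are hypothetical.

```python
import json


def parse_fields_json(text: str) -> list[dict[str, str]]:
    """Parse the JSON batch format: a list of objects with 'name' and 'type' keys."""
    fields = json.loads(text)
    for field in fields:
        if set(field) != {"name", "type"}:
            raise ValueError(f"unexpected keys in {field!r}")
    return fields


def parse_fields_text(text: str, row_delimiter: str = "\n",
                      column_delimiter: str = ",") -> list[dict[str, str]]:
    """Parse the TEXT batch format: one 'name<column_delimiter>type' entry per row."""
    fields = []
    for row in text.split(row_delimiter):
        row = row.strip()
        if not row:
            continue  # skip blank rows, such as a trailing line feed
        # Raises ValueError if the row does not contain the column delimiter.
        name, type_ = (part.strip() for part in row.split(column_delimiter, 1))
        fields.append({"name": name, "type": type_})
    return fields


if __name__ == "__main__":
    as_json = '[{"name":"col_long","type":"long"}, {"name":"col_double","type":"double"}]'
    as_text = "col_long,long\ncol_double,double"
    print(parse_fields_json(as_json) == parse_fields_text(as_text))
```

Passing row_delimiter=";" handles the same fields written on one line as col_long,long;col_double,double.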
Create New Output Field.
Click Create New Output Field, and fill in Column and select Type according to the page prompts.
Manage Output Fields.
You can perform the following operations on the added fields:
Click and drag the shift icon next to the Column to change the position of the field.
Click the edit icon in the Operation column to edit an existing field.
Click the delete icon in the Operation column to delete an existing field.
Click Confirm to finalize the Elasticsearch input component's property configuration.