How to configure the Elasticsearch input component to read data from a data source - Dataphin

The Elasticsearch input component reads data from an Elasticsearch data source. To synchronize data from Elasticsearch to another data source, first configure the Elasticsearch input component, and then configure the target data source. This guide describes how to configure the Elasticsearch input component.

Prerequisites

An Elasticsearch data source has been created. For more information, see Create Elasticsearch Data Source.
The account configuring the Elasticsearch input component must possess read-through permission for the data source. If you lack this permission, you need to request access to the data source. For more information, see Request Data Source Permission.

Procedure

On the Dataphin home page, select Development > Data Integration from the top menu bar.
In the top menu bar of the integration page, select a project. In Dev-Prod mode, you must also select an environment.
Click Batch Pipeline in the left-side navigation pane. Then, in the Batch Pipeline list, click the batch (offline) pipeline that you want to configure to open its configuration page.
Click Component Library in the upper-right corner to open the Component Library panel.
In the Component Library panel's left-side navigation pane, select Input. Find the Elasticsearch component in the list on the right and drag it to the canvas.
Click the icon on the Elasticsearch input component card to open the Elasticsearch Input Configuration dialog box.

Configure parameters in the Elasticsearch Input Configuration dialog box.

Parameter		Description
Basic Configuration	Step Name	The name of the Elasticsearch input component. Dataphin generates the step name automatically, and you can change it to suit your business scenario. The name can contain only Chinese characters, letters, underscores (_), and digits, and cannot exceed 64 characters.
	Datasource	The drop-down list displays all Elasticsearch data sources and project levels in the current Dataphin instance, and indicates whether you have read-through permission for each data source. Click the icon to copy the name of the current data source. For a data source without read-through permission, click Request next to the data source to request the permission. For more information, see Request Data Source Permission. If no Elasticsearch data source exists, click Create to create one. For more information, see Create Elasticsearch Data Source.
	Query Type	Select whether to read index documents by index or by index alias. Each query type requires different parameters. Index: Index Document is the index name in Elasticsearch (click the icon to copy the name of the currently selected index document); Index Document Type is the type name of the index in Elasticsearch. Note: Index Document and Index Document Type are required in Elasticsearch 6.x and 7.x, and optional in Elasticsearch 8.x. Index Alias: Index Alias is the alias of the index in Elasticsearch; Index Document Type is the type name of the index in Elasticsearch.
	Query Conditions	The query parameter of Elasticsearch, used for full or incremental queries. For example, `{ "match_all": {}}` indicates a full query.
	Cursor Time	The time that Elasticsearch keeps the cursor (scroll context) alive between pages. This is the paging parameter of Elasticsearch. If the value is too small and the idle time between retrieving two pages of data exceeds it, the cursor expires and data is lost. If the value is too large and too many queries are initiated at the same time, the number of open cursors can exceed the server-side `max_open_scroll_context` configuration, which causes a data query error. For example, 5m represents a cursor time of 5 minutes. Supported units: days (d), hours (h), minutes (m), seconds (s), milliseconds (ms), microseconds (micros), and nanoseconds (nanos).
Advanced Configuration	Batch Read Count	The number of records read per batch. Default: 1024. Reading from the source in batches instead of one record at a time reduces the number of interactions with the data source, which improves I/O efficiency and reduces network latency.
	Connection Timeout	The client connection timeout, default is 6000 seconds.
	Management Timeout	The client read timeout, default is 6000 seconds.
	Date Format	When the synchronized field has a date type and the `mapping` of the field does not have a `format` configuration, you need to configure the `dateFormat` parameter. The default format in ES is: `yyyy-MM-dd'T'HH:mm:ssZ`.
Output Fields		Displays the output fields. You can add and manage fields in the following ways. Retrieve Field Information: When the query type is Index, click Retrieve Field Information to obtain the field information of the selected index. Batch Add Fields: Click Batch Add, and then configure the fields in JSON or TEXT format. JSON format example: `[{"name":"col_integer","type":"integer"}, {"name":"col_long","type":"long"}, {"name":"col_double","type":"double"}]`. Note: name is the name of the field to import, and type is the type of the field after import. For example, `"name":"user_id","type":"String"` imports the field named user_id and sets its type to String. TEXT format example, with one field per row: `col_long,long` followed by `col_double,double` on the next row. The row delimiter separates the fields. The default is a line feed (\n); line feed (\n), semicolon (;), and period (.) are supported. The column delimiter separates the field name from the field type. The default is a comma (,). After you configure the fields, click Confirm. Create New Output Field: Click Create New Output Field, then fill in Column and select Type as prompted. Manage Output Fields: For added fields, you can drag the shift icon next to Column to change the position of a field, click the edit icon in the Operation column to edit a field, or click the delete icon in the Operation column to delete a field.
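
The two batch formats described in the Output Fields row can be generated from a single field list, which helps when an index has many columns. The following Python sketch is illustrative only: the helper names (`to_json_batch`, `to_text_batch`, `parse_text_batch`) are hypothetical and not part of Dataphin, and the field list simply reuses the sample columns from the table above.

```python
import json

# Hypothetical field list: (name, type) pairs reusing the sample columns above.
FIELDS = [("col_integer", "integer"), ("col_long", "long"), ("col_double", "double")]


def to_json_batch(fields):
    """Render fields in the JSON batch format: a list of name/type objects."""
    return json.dumps([{"name": name, "type": type_} for name, type_ in fields])


def to_text_batch(fields, row_delim="\n", col_delim=","):
    """Render fields in the TEXT batch format with the given delimiters."""
    return row_delim.join(f"{name}{col_delim}{type_}" for name, type_ in fields)


def parse_text_batch(text, row_delim="\n", col_delim=","):
    """Parse TEXT batch input back into (name, type) pairs, skipping empty rows."""
    pairs = []
    for row in text.split(row_delim):
        row = row.strip()
        if not row:
            continue
        name, _, type_ = row.partition(col_delim)
        pairs.append((name.strip(), type_.strip()))
    return pairs


if __name__ == "__main__":
    print(to_json_batch(FIELDS))  # paste into the JSON batch editor
    print(to_text_batch(FIELDS))  # paste into the TEXT batch editor
```

Passing `row_delim=";"` produces the semicolon-delimited variant; the row delimiter chosen in the dialog box must match the one used to generate the text.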

Click Confirm to finalize the Elasticsearch input component's property configuration.
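
For the Query Conditions and Cursor Time parameters in the table above, it can help to prepare and check the values before pasting them into the dialog box. The Python sketch below builds a full query and an incremental query, and converts a cursor time such as 5m to seconds. It rests on assumptions: the field name `update_time` and the helper names are hypothetical, and the incremental query uses a standard Elasticsearch `range` clause, which is one common way to express an incremental read rather than a form that Dataphin prescribes.

```python
import json
import re

# Elasticsearch time units and their length in seconds.
_UNIT_SECONDS = {
    "d": 86400.0, "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "micros": 1e-6, "nanos": 1e-9,
}
_TIME_RE = re.compile(r"(\d+)(micros|nanos|ms|d|h|m|s)")


def full_query():
    """Query condition that matches every document (full read)."""
    return {"match_all": {}}


def incremental_query(field, since):
    """Query condition that matches documents whose `field` is >= `since`."""
    return {"range": {field: {"gte": since}}}


def cursor_time_seconds(value):
    """Convert a cursor time such as '5m' to seconds; reject unknown units."""
    match = _TIME_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid cursor time: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


if __name__ == "__main__":
    # `update_time` is a hypothetical field; replace it with a real date field.
    print(json.dumps(full_query()))
    print(json.dumps(incremental_query("update_time", "2024-01-01T00:00:00+0800")))
    print(cursor_time_seconds("5m"))
```

Comparing the cursor time in seconds against the expected gap between page fetches is a quick way to judge whether the value is large enough to keep the cursor from expiring.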