The Hive input component enables reading data from Hive data sources. To synchronize data from Hive to other data sources, configure the Hive input component to read the data source, then set up the target data source for synchronization. This topic describes the configuration process for the Hive input component.
Limits
The Hive input component supports the orc, parquet, text, rc, seq, and iceberg data formats (the iceberg format is supported only for Hive compute sources or E-MapReduce 5.x data sources). It does not support transactional ORC tables or reading Kudu tables.
To integrate data from a Kudu table, use the Impala input component. For more information, see Configure Impala Input Component.
Prerequisites
A Hive data source has been established. For more information, see Create Hive Data Source.
To configure the Hive input component properties, your account must have read-through permissions on the data source. If you lack these permissions, request them first. For more information, see Request Data Source Permissions.
Procedure
Select Development > Data Integration from the top menu bar on the Dataphin home page.
In the integration page's top menu bar, select Project (Dev-Prod mode requires selecting an environment).
In the navigation pane on the left, click on Batch Pipeline. From the Batch Pipeline list, select the offline pipeline you want to develop to access its configuration page.
Click Component Library in the upper-right corner of the page to open the Component Library panel.
In the Component Library panel's left-side navigation pane, select Input. Then, from the right-hand list of input components, locate the Hive component and drag it onto the canvas.
Click the icon on the Hive input component card to open the Hive Input Configuration dialog box, then configure the following parameters.
Parameter
Description
Step Name
The name of the Hive input component. Dataphin generates the step name automatically, but you can change it to suit your business scenario. The naming conventions are as follows:
It can only contain Chinese characters, letters, underscores (_), and numbers.
It cannot exceed 64 characters.
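The naming rules above can be expressed as a short validator. This is an illustrative sketch, not part of Dataphin; the Unicode range used for "Chinese characters" is an assumption (CJK Unified Ideographs):

```python
import re

# Allowed: Chinese characters, letters, underscores, and digits; at most 64 chars.
# \u4e00-\u9fff (CJK Unified Ideographs) is an assumed range for "Chinese characters".
STEP_NAME_PATTERN = re.compile(r"^[\u4e00-\u9fffA-Za-z0-9_]{1,64}$")

def is_valid_step_name(name: str) -> bool:
    """Return True if the name satisfies the step-name rules above."""
    return bool(STEP_NAME_PATTERN.match(name))

print(is_valid_step_name("hive_input_01"))  # True
print(is_valid_step_name("hive input"))     # False: contains a space
print(is_valid_step_name("x" * 65))         # False: longer than 64 characters
```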
Datasource
The drop-down list displays all Hive-type data sources, including those for which you have read-through permissions and those for which you do not. Click the icon to copy the current data source name.
For data sources without read-through permissions, click Request next to the data source to request read-through permissions. For more information, see Request Data Source Permissions.
If you do not have a Hive-type data source, click Create Data Source to create a data source. For more information, see Create Hive Data Source.
Table
Select the source table for data synchronization. Click the icon to copy the name of the currently selected table.
Note: When the selected table is a Hudi table or a Paimon table, only partition configuration is supported.
Partition
Supports reading static partitions or range partitions. Examples of static partitions: ds=20230101 and ds1=2023,ds2=01. An example of a range partition: /*query*/ds >= 20230101 and ds <= 20230107.
Note: When the selected table is a Hudi table or a Paimon table, range partitions are not supported.
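The difference between the two partition styles can be illustrated with a plain-Python sketch. This is only an illustration of which partitions each style selects; Dataphin's actual partition parsing is internal. Note that lexicographic string comparison works here because yyyymmdd values are fixed-width:

```python
# Sample ds partition values for a hypothetical table.
partitions = ["20230101", "20230105", "20230107", "20230201"]

# Static partition: ds=20230101 selects exactly one partition.
static = [p for p in partitions if p == "20230101"]

# Range partition: /*query*/ds >= 20230101 and ds <= 20230107 selects a span.
# Fixed-width yyyymmdd strings compare correctly as strings.
ranged = [p for p in partitions if "20230101" <= p <= "20230107"]

print(static)  # ['20230101']
print(ranged)  # ['20230101', '20230105', '20230107']
```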
When Partition Does Not Exist
You can choose the following policies to handle scenarios where the specified partition does not exist:
Fail The Task: Terminate the task and mark it as failed.
Succeed The Task Without Writing Data: The task runs successfully without writing data to the target table.
File Encoding
Select the encoding used to read files stored in Hive. Supported file encodings are UTF-8 and GBK.
NULL Value Replacement
This option applies only to source tables that use the textfile storage format. Enter the string that you want to replace with NULL. For example, if you enter \N, the system replaces the string \N with NULL.
Compression Format
Optional. If the file is compressed, select the corresponding compression format so that Dataphin can decompress it. The default format for orc tables is zlib; if you need a different decompression format, specify it explicitly. Tables in other formats have no default. Supported compression formats: zlib, hadoop-snappy, lz4, and none.
Field Separator
The field separator is usually specified when the table is created, for example with a ROW FORMAT DELIMITED FIELDS TERMINATED BY clause. Enter the table's field separator. If you leave this blank, Dataphin uses \u0001 as the default separator.
Output Fields
The output fields area displays all fields matched by the selected table and filter criteria. If you do not need to output certain fields to downstream components, delete them:
Note: When the compute engine is Hadoop, the output fields of the Hive input component support viewing field classifications. Non-Hadoop compute engines do not support this.
Single field deletion: To delete a small number of fields, click the delete icon in the Operation column for each field.
Batch field deletion: To delete many fields, click Field Management. In the Field Management dialog box, select multiple fields, click the shift-left icon to move the selected input fields to the unselected input fields, and then click Confirm to complete the batch deletion.
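The Field Separator and NULL Value Replacement parameters described above determine how a textfile row is split into output fields. A minimal Python sketch of that behavior, assuming the default \u0001 separator and \N as the NULL marker (illustrative only, not Dataphin's implementation):

```python
def parse_textfile_row(line, separator="\u0001", null_marker="\\N"):
    """Split a Hive textfile row on the field separator and map the
    NULL marker (here the literal string \\N) to None, i.e. SQL NULL."""
    return [None if field == null_marker else field
            for field in line.rstrip("\n").split(separator)]

# A sample row with three fields; the middle field is the NULL marker.
row = "alice\u0001\\N\u000142"
print(parse_textfile_row(row))  # ['alice', None, '42']
```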
Click Confirm to complete the property configuration of the Hive input component.