The Hive input component enables reading data from Hive data sources. To synchronize data from Hive to other data sources, configure the Hive input component to read the data source, then set up the target data source for synchronization. This topic describes the configuration process for the Hive input component.
Limits
The Hive input component supports the orc, parquet, text, rc, seq, and iceberg data formats (the iceberg format is supported only for Hive compute sources or E-MapReduce 5.x data sources). It does not support transactional ORC tables or reading Kudu tables.
To integrate data from a Kudu table, use the Impala input component. For more information, see Configure Impala Input Component.
Prerequisites
A Hive data source has been established. For more information, see Create Hive Data Source.
To configure the Hive input component properties, your account must have read-through permissions on the data source. If you lack these permissions, request them first. For more information, see Request Data Source Permissions.
Procedure
Select Development > Data Integration from the top menu bar on the Dataphin home page.
In the integration page's top menu bar, select Project (Dev-Prod mode requires selecting an environment).
In the navigation pane on the left, click on Batch Pipeline. From the Batch Pipeline list, select the offline pipeline you want to develop to access its configuration page.
Click Component Library in the upper-right corner of the page to open the Component Library panel.
In the Component Library panel's left-side navigation pane, select Input. Then, from the right-hand list of input components, locate the Hive component and drag it onto the canvas.
Click the icon on the Hive input component card to open the Hive Input Configuration dialog box, then configure the following parameters.
Parameter
Description
Step Name
The name of the Hive input component. Dataphin generates the step name automatically, but you can change it to suit your business scenario. The naming conventions are as follows:
It can only contain Chinese characters, letters, underscores (_), and numbers.
It cannot exceed 64 characters.
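The naming rules above can be expressed as a short validator. This is an illustrative sketch, not part of Dataphin; the Unicode range used for "Chinese characters" is an assumption (CJK Unified Ideographs):

```python
import re

# Allowed: Chinese characters, letters, underscores, and digits; at most 64 chars.
# \u4e00-\u9fff (CJK Unified Ideographs) is an assumed range for "Chinese characters".
STEP_NAME_PATTERN = re.compile(r"^[\u4e00-\u9fffA-Za-z0-9_]{1,64}$")

def is_valid_step_name(name: str) -> bool:
    """Return True if the name satisfies the step-name rules above."""
    return bool(STEP_NAME_PATTERN.match(name))

print(is_valid_step_name("hive_input_01"))  # True
print(is_valid_step_name("hive input"))     # False: contains a space
print(is_valid_step_name("x" * 65))         # False: longer than 64 characters
```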
Datasource
The drop-down list displays all Hive-type data sources, including those for which you have read-through permissions and those for which you do not. Click the icon to copy the current data source name.
For data sources without read-through permissions, click Request next to the data source to request read-through permissions. For more information, see Request Data Source Permissions.
If you do not have a Hive-type data source, click Create Data Source to create a data source. For more information, see Create Hive Data Source.
Table
Select the source table for data synchronization. Click the icon to copy the name of the currently selected table.
Note: When the selected table is a Hudi table or a Paimon table, only partition configuration is supported.
Partition
Supports reading static partitions or range partitions. Examples of static partitions: ds=20230101 and ds1=2023,ds2=01. An example of a range partition: /*query*/ds >= 20230101 and ds <= 20230107.
Note: When the selected table is a Hudi table or a Paimon table, range partitions are not supported.
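The difference between the two partition styles can be illustrated with a plain-Python sketch. This is only an illustration of which partitions each style selects; Dataphin's actual partition parsing is internal. Note that lexicographic string comparison works here because yyyymmdd values are fixed-width:

```python
# Sample ds partition values for a hypothetical table.
partitions = ["20230101", "20230105", "20230107", "20230201"]

# Static partition: ds=20230101 selects exactly one partition.
static = [p for p in partitions if p == "20230101"]

# Range partition: /*query*/ds >= 20230101 and ds <= 20230107 selects a span.
# Fixed-width yyyymmdd strings compare correctly as strings.
ranged = [p for p in partitions if "20230101" <= p <= "20230107"]

print(static)  # ['20230101']
print(ranged)  # ['20230101', '20230105', '20230107']
```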
When Partition Does Not Exist
You can choose the following policies to handle scenarios where the specified partition does not exist:
Fail The Task: Terminate the task and mark it as failed.
Succeed The Task Without Writing Data: The task runs successfully without writing data to the target table.
File Encoding
Select the encoding used to read files stored in Hive. Supported file encodings are UTF-8 and GBK.
NULL Value Replacement
This option applies only to source tables that use the textfile storage format. Enter the string that you want to replace with NULL. For example, if you enter \N, the system replaces the string \N with NULL.
Compression Format
Optional. If the file is compressed, select the corresponding compression format so that Dataphin can decompress it. The default format for orc tables is zlib; if you need a different decompression format, specify it explicitly. Tables in other formats have no default. Supported compression formats: zlib, hadoop-snappy, lz4, and none.
Field Separator
The field separator is usually specified when the table is created, for example with a ROW FORMAT DELIMITED FIELDS TERMINATED BY clause. Enter the table's field separator. If you leave this blank, Dataphin uses \u0001 as the default separator.
Output Fields
The output fields area displays all fields matched by the selected table and filter criteria. If you do not need to output certain fields to downstream components, delete them:
Note: When the compute engine is Hadoop, the output fields of the Hive input component support viewing field classifications. Non-Hadoop compute engines do not support this.
Single field deletion: To delete a small number of fields, click the delete icon in the Operation column for each field.
Batch field deletion: To delete many fields, click Field Management. In the Field Management dialog box, select multiple fields, click the shift-left icon to move the selected input fields to the unselected input fields, and then click Confirm to complete the batch deletion.
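The Field Separator and NULL Value Replacement parameters described above determine how a textfile row is split into output fields. A minimal Python sketch of that behavior, assuming the default \u0001 separator and \N as the NULL marker (illustrative only, not Dataphin's implementation):

```python
def parse_textfile_row(line, separator="\u0001", null_marker="\\N"):
    """Split a Hive textfile row on the field separator and map the
    NULL marker (here the literal string \\N) to None, i.e. SQL NULL."""
    return [None if field == null_marker else field
            for field in line.rstrip("\n").split(separator)]

# A sample row with three fields; the middle field is the NULL marker.
row = "alice\u0001\\N\u000142"
print(parse_textfile_row(row))  # ['alice', None, '42']
```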
Click Confirm to complete the property configuration of the Hive input component.