The Hive output component writes data to a Hive data source. When you synchronize data from other data sources to Hive, configure the Hive output component after configuring the source data source. This topic describes how to configure the Hive output component.
Limits
The Hive output component supports writing data to Hive tables in the following file formats: ORC, Parquet, text, Hudi, Iceberg, and Paimon. The Hudi format is supported only for Hive compute sources or data sources on Cloudera Data Platform 7.x. The Iceberg and Paimon formats are supported only for Hive compute sources or data sources on E-MapReduce 5.x. The component does not support data integration for ORC transactional tables or Kudu tables.
For Kudu table data integration, use the Impala output component. For more information, see Configure the Impala output component.
Prerequisites
You have created a Hive data source. For more information, see Create a Hive data source.
The account used to configure the Hive output component must have write permission on the data source. If you do not have this permission, you can request it. For more information, see Request data source permissions.
Procedure
On the Dataphin home page, in the top menu bar, choose Developer > Data Integration.
On the integration page, in the top menu bar, select a Project. If you are in Dev-Prod mode, also select an Environment.
In the left navigation pane, click Batch Pipeline. In the Batch Pipeline list, click the batch pipeline that you want to develop to open its configuration page.
In the upper-right corner of the page, click Component Library to open the Component Library panel.
In the left navigation pane of the Component Library panel, select Outputs. Find the Hive component in the list of output components and drag it to the canvas.
Click and drag the icon of the target input, transform, or flow component to connect it to the Hive output component.
Click the icon on the Hive output component card to open the Hive Output Configuration dialog box.
In the Hive Output Configuration dialog box, set the parameters.
The parameters differ for Hive tables and Hudi tables.
Output target table is a Hive table
Parameter
Description
Basic Settings
Step Name
The name of the Hive output component. Dataphin automatically generates a step name. You can also change the name as needed. The naming convention is as follows:
The name can contain Chinese characters, letters, underscores (_), and digits.
The name can be up to 64 characters long.
Datasource
The drop-down list contains all Hive data sources, including those for which you have write permission and those for which you do not. Click the icon to copy the current data source name.
For data sources without write permission, you can click Apply next to the data source to request the permission. For detailed steps, see Request data source permissions.
If you do not have a Hive data source, click New Data Source to create one. For more information, see Create a Hive data source.
Table
Select a destination table (Hive table) for the output data. You can search by entering a table name keyword, or enter the exact table name and then click Exact Search. After you select a table, the system automatically checks the table status. Click the icon to copy the name of the currently selected table.
If the target table for data synchronization does not exist in the Hive data source, you can use the one-click table creation feature to quickly create it. Follow these steps:
Click One-click table creation. Dataphin automatically generates the code to create the target table. This includes the target table name, which defaults to the source table name, and field types, which are based on a preliminary conversion of Dataphin fields.
Set Data lake table format to None or Iceberg.
Note: You can select Iceberg only if the data lake table format is enabled for the selected data source or the compute source used by the current project, and the format is set to Iceberg.
Set Execution engine to Hive or Spark.
Note: You can select an execution engine only when Data lake table format is set to Iceberg. If Spark is configured for the selected data source, Spark is displayed and selected by default. Otherwise, only Hive is displayed and selected.
The DDL statement is automatically generated based on the selected data lake table format and execution engine. You can modify the statement. When you are finished, click Create. After the target table is created, Dataphin automatically uses it as the output target.
Note: If a table with the same name exists in the development environment, Dataphin reports an error that the table already exists.
If there are no matching items, you can still perform data integration by manually entering a table name.
Policy for missing production table
The policy to apply if the production table does not exist. You can select Do nothing or Automatic creation. The default is Automatic creation. If you select Do nothing, the table is not created when the node is published. If you select Automatic creation, a table with the same name is created in the target environment when the node is published.
Do nothing: If the target table does not exist, a message prompts you that the table does not exist when you submit the node, but you can still publish it. In this case, you must create the target table in the production environment before you can run the node.
Automatic creation: Click Edit DDL Statement. The DDL statement for the selected table is automatically filled in, and you can adjust it. Use the ${table_name} placeholder for the table name in the DDL statement. This is the only supported placeholder. It is replaced with the actual table name at runtime. If the target table does not exist, the system first tries to create it by using the DDL statement. If table creation fails, the pre-publish check fails. You can modify the DDL statement based on the error message and publish the node again. If the target table already exists, the DDL statement is not executed.
Note: This parameter is supported only in projects that use Dev-Prod mode.
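The placeholder substitution described above can be sketched in a few lines. The DDL template and table name here are hypothetical examples, not values generated by Dataphin:

```python
# Minimal sketch of the ${table_name} substitution described above.
# The DDL template and table name are hypothetical examples.
ddl_template = (
    "CREATE TABLE IF NOT EXISTS ${table_name} (\n"
    "  id BIGINT,\n"
    "  name STRING\n"
    ") STORED AS ORC"
)

def render_ddl(template: str, table_name: str) -> str:
    # ${table_name} is the only supported placeholder; nothing else is expanded.
    return template.replace("${table_name}", table_name)

print(render_ddl(ddl_template, "ods_orders"))
```

At publish time the actual table name takes the place of the placeholder, so the same DDL template works in both the development and production environments.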
File encoding
The encoding of the files stored in Hive. The options are UTF-8 and GBK.
Loading Policy
The policy for writing data to tables in the target Hive data source. The options are Overwrite all data, Append data, and Overwrite only data written by the integration node. The applicable scenarios are as follows:
Overwrite all data: Deletes all data from the target table or partition, and then adds new data files prefixed with the table name.
Append data: Appends data directly to the target table.
Overwrite only data written by the integration node: Deletes data files that are prefixed with the table name from the target table or partition. Data written by other means, such as SQL statements, is not deleted.
NULL value replacement (optional)
This parameter is supported only for source tables in the textfile data storage format. Enter the string to replace with NULL. For example, if you enter \N, the system replaces the string \N with NULL.
Field separator (optional)
This parameter is supported only for source tables in the textfile data storage format. Enter the separator between fields. If you leave this blank, the system uses \u0001 as the separator by default.
Compression Format (optional)
The compression format for the files. The available formats depend on the data storage format in Hive:
If the data storage format is orc, you can select zlib or snappy.
If the data storage format is parquet, you can select snappy or gzip.
If the data storage format is textfile, you can select gzip, bzip2, lzo, lzo_deflate, hadoop-snappy, or zlib.
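The interaction between the NULL value replacement and field separator parameters above can be illustrated with a small sketch; the row values below are made up for illustration:

```python
# Sketch of how a textfile row is serialized: fields joined by the
# default \u0001 separator, with NULL values written as \N.
# The row values are hypothetical.
row = ["1001", None, "Hangzhou"]
line = "\u0001".join("\\N" if value is None else value for value in row)
print(repr(line))
```

When the file is read back, any field equal to the configured replacement string (here \N) is interpreted as NULL.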
Field separator handling (optional)
This parameter is supported only for output tables in the textfile data storage format. If your data contains the default or a custom field separator, you can configure a field separator handling policy to prevent data writing errors. You can select Keep, Remove, or Replace with.
Row delimiter handling (optional)
This parameter is supported only for output tables in the textfile data storage format. If your data contains the default or a custom row delimiter, you can configure a row delimiter handling policy to prevent data writing errors. The default row delimiter is \n. If your data contains line feed characters such as \r or \n, select a handling policy to prevent errors. You can select Keep, Remove, or Replace with.
Hadoop parameter settings (optional)
Used to adjust write parameters. You can enter different parameters for different table types. Separate multiple parameters with commas (,) in the format {"key1":"value1", "key2":"value2"}. For example, for an ORC output table with many fields, you can adjust the hive.exec.orc.default.buffer.size parameter based on the available memory. If you have enough memory, try increasing this value to improve write performance. If you run out of memory, try decreasing this value to reduce garbage collection (GC) time. The default value is 16384 bytes (16 KB). We recommend that you do not exceed 262144 bytes (256 KB).
Partition
If the selected target table is a partitioned table, you must enter partition information. For example, state_date=20190101. You can also use parameters to write data incrementally every day. For example, state_date=${bizdate}.
Preparation statements (optional)
An SQL script to run on the database before data import.
For example, to ensure continuous service availability, you can create a target table Target_A before this step runs. The step writes data to Target_A. After the step finishes, you can rename the existing service table Service_B to Temp_C, rename Target_A to Service_B, and then delete Temp_C.
Completion statements (optional)
An SQL script to run on the database after data import.
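The table-swap pattern described above can be expressed as preparation and completion statements. The table names follow the example in the text (Target_A, Service_B, Temp_C); the exact DDL is a hypothetical sketch, not statements generated by Dataphin:

```python
# Hypothetical preparation and completion statements for the
# table-swap pattern described above.
preparation_statements = [
    # Stage writes into target_a, shaped like the live table.
    "CREATE TABLE IF NOT EXISTS target_a LIKE service_b",
]

# ... the integration step writes its output into target_a ...

completion_statements = [
    "ALTER TABLE service_b RENAME TO temp_c",    # park the old live table
    "ALTER TABLE target_a RENAME TO service_b",  # promote the staged table
    "DROP TABLE temp_c",                         # discard the old data
]
print(";\n".join(preparation_statements + completion_statements))
```

Because readers query Service_B the whole time and the renames are fast metadata operations, the service table stays available while the data is rebuilt.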
Field mapping
Input fields
Displays the input fields based on the output of the upstream component.
Output fields
The output fields area displays all fields of the selected table.
Important: To ensure that data is written to Hive without errors, all output fields must be mapped to input fields.
Mapping
You can manually select the field mapping based on the upstream input and the fields of the target table. The options are Map by name and Map by row.
Map by name: Maps fields that have the same name.
Map by row: Maps data between fields in the same row, even if the field names in the source and target tables are different. Only fields in the same row are mapped.
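The two mapping modes above can be sketched as follows; the field lists are hypothetical examples:

```python
# Sketch of the two mapping modes. Input fields come from the upstream
# component; output fields come from the target table. Names are made up.
input_fields = ["id", "user_name", "city"]
output_fields = ["id", "name", "city"]

def map_by_name(inputs, outputs):
    # Pair fields that share a name, regardless of position.
    return [(f, f) for f in inputs if f in outputs]

def map_by_row(inputs, outputs):
    # Pair fields by position (row), regardless of name; unmatched
    # trailing fields on either side stay unmapped.
    return list(zip(inputs, outputs))

print(map_by_name(input_fields, output_fields))  # [('id', 'id'), ('city', 'city')]
print(map_by_row(input_fields, output_fields))
```

In this sketch, map by name skips user_name/name because the names differ, while map by row pairs them because they sit in the same row.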
Output target table is a Hudi table
Parameter
Description
Basic Settings
Step Name
The name of the Hive output component. Dataphin automatically generates a step name. You can also change the name as needed. The naming convention is as follows:
The name can contain Chinese characters, letters, underscores (_), and digits.
The name can be up to 64 characters long.
Datasource
The drop-down list contains all Hive data sources, including those for which you have write permission and those for which you do not. Click the icon to copy the current data source name.
For data sources without write permission, you can click Apply next to the data source to request the permission. For detailed instructions, see Request data source permissions.
If you do not have a Hive data source, click Create Data Source to create one. For detailed instructions, see Create a Hive Data Source.
Table
Select the target table (Hudi table) for output data. You can enter a keyword to search for tables, or enter the exact table name and click Exact Search. After you select a table, the system automatically checks the table status. Click the icon to copy the name of the currently selected table.
If the target table for data synchronization does not exist in the Hive data source, you can use the one-click table creation feature to quickly create it. Follow these steps:
Click One-click table creation. Dataphin automatically generates the code to create the target table. This includes the target table name, which defaults to the source table name, and field types, which are based on a preliminary conversion of Dataphin fields.
Set Data lake table format to Hudi.
Hudi Table Type: You can select MOR (Merge on Read) or COW (Copy on Write). The default is MOR.
Primary Key Field (Optional): Enter the primary key fields. Separate multiple fields with a comma (,).
Extended Properties (optional): Enter configuration properties supported by Hudi in the format k=v.
Note: If a table with the same name exists in the development environment, Dataphin reports an error that the table already exists when you click Create.
If there are no matching items, you can still perform data integration by manually entering a table name.
Set Execution engine to Hive or Spark.
Note: You can select an execution engine only when Data lake table format is set to Hudi. The default execution engine is Hive. If Spark is enabled for the selected data source, you can select Spark.
The DDL statement is automatically generated based on the selected data lake table format and execution engine. You can modify the statement. When you are finished, click Create.
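The extended properties entered in the table-creation step above use a simple k=v format, one property per entry. A minimal parsing sketch; the property name below is a hypothetical example, not a value required by Dataphin:

```python
# Sketch of parsing Hudi extended properties entered as k=v pairs.
# The property name in the usage example is hypothetical.
def parse_extended_properties(lines):
    props = {}
    for line in lines:
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

print(parse_extended_properties(["hoodie.parquet.max.file.size = 125829120"]))
```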
Policy for missing production table
The policy to apply if the production table does not exist. You can select Do nothing or Automatic creation. The default is Automatic creation. If you select Do nothing, the table is not created when the node is published. If you select Automatic creation, a table with the same name is created in the target environment when the node is published.
Do nothing: If the target table does not exist, a message prompts you that the table does not exist when you submit the node, but you can still publish it. In this case, you must create the target table in the production environment before you can run the node.
Automatic creation: Click Edit DDL Statement. The DDL statement for the selected table is automatically filled in, and you can adjust it. Use the ${table_name} placeholder for the table name in the DDL statement. This is the only supported placeholder. It is replaced with the actual table name at runtime. If the target table does not exist, the system first tries to create it by using the DDL statement. If table creation fails, the pre-publish check fails. You can modify the DDL statement based on the error message and publish the node again. If the target table already exists, the DDL statement is not executed.
Note: This parameter is supported only in projects that use Dev-Prod mode.
Partition
If the selected target table is a partitioned table, you must enter partition information. For example, state_date=20190101. You can also use parameters to write data incrementally every day. For example, state_date=${bizdate}.
Loading Policy
The policy for writing data to the target data source (Hive). Options include Overwrite data, Append data, and Update data.
Overwrite data: Replaces existing data with new data.
Append data: Appends data directly to the target table.
Update data: Updates records based on the primary key. If a record does not exist, it is inserted.
Note: Data written by other means, such as SQL statements, is not deleted.
BulkInsert
Suitable for fast, bulk synchronization of large data volumes. This is typically used for initial data import.
Note: This parameter is supported and enabled by default only when the loading policy is set to Append data or Overwrite data.
Batch write
You can write data to the target table in batches. After you enable this feature, you must also configure the Batch Ratio.
Batch ratio
The proportion of the total Java Virtual Machine (JVM) memory. The default is 0.2. You can enter a decimal value between 0.01 and 0.50.
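The batch ratio can be read as the fraction of total JVM memory reserved for buffered writes. A minimal sketch of the bounds check described above; the heap size in the usage example is hypothetical:

```python
# Sketch of the batch ratio setting: a decimal between 0.01 and 0.50,
# interpreted as a proportion of total JVM memory (default 0.2).
def batch_buffer_bytes(jvm_heap_bytes: int, batch_ratio: float = 0.2) -> int:
    if not 0.01 <= batch_ratio <= 0.50:
        raise ValueError("batch ratio must be between 0.01 and 0.50")
    return int(jvm_heap_bytes * batch_ratio)

# With a hypothetical 4 GiB heap and the default ratio:
print(batch_buffer_bytes(4 * 1024**3))
```

A larger ratio lets each batch buffer more rows before flushing, at the cost of leaving less heap for the rest of the task.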
Hadoop parameter settings (optional)
Used to adjust write parameters. You can enter different parameters for different table types. Separate multiple parameters with commas (,) in the format {"key1":"value1", "key2":"value2"}. For example, you can set {"hoodie.parquet.compression.codec":"snappy"} to change the compression format to snappy.
Field mapping
Input fields
Displays the input fields based on the output of the upstream component.
Output fields
The output fields area displays all fields of the selected table.
Note: You do not need to map all fields for Hudi tables.
Mapping
You can manually select the field mapping based on the upstream input and the fields of the target table. The options are Map by name and Map by row.
Map by name: Maps fields that have the same name.
Map by row: Maps data between fields in the same row, even if the field names in the source and target tables are different. Only fields in the same row are mapped.
Output target table is a Paimon table
Parameter
Description
Basic Settings
Step Name
The name of the Hive output component. Dataphin automatically generates a step name. You can also change the name as needed. The naming convention is as follows:
The name can contain Chinese characters, letters, underscores (_), and digits.
The name can be up to 64 characters long.
Datasource
The drop-down list contains all Hive data sources, including those for which you have write permission and those for which you do not. Click the icon to copy the current data source name.
For a data source without write permission, you can click Request next to it to apply for the permission. For more information, see Request data source permissions.
If you do not have a Hive data source, click Create Data Source to create a data source. For more information, see Create a Hive Data Source.
Table
Select the target table (Paimon table) for output data. Enter a keyword for the table name to search, or enter the exact table name and click Exact Search. After you select a table, the system automatically checks the table status. Click the icon to copy the name of the currently selected table.
If the target table for data synchronization does not exist in the Hive data source, you can use the one-click table creation feature to quickly create it. Follow these steps:
Click One-click table creation. Dataphin automatically generates the code to create the target table. This includes the target table name, which defaults to the source table name, and field types, which are based on a preliminary conversion of Dataphin fields.
Set Data lake table format to None, Iceberg, or Paimon.
Note: You can select Iceberg only if the data lake table format is enabled for the selected data source or the compute source used by the current project, and the format is set to Iceberg.
Set Execution engine to Hive or Spark.
Note: You can select an execution engine only when Data lake table format is set to Iceberg or Paimon. If Spark is configured for the selected data source, Spark is displayed and selected by default. Otherwise, only Hive is displayed and selected.
For Paimon table type, you can select MOR (Merge on Read), COW (Copy on Write), or MOW (Merge on Write). The default is MOR.
Note: You can configure the Paimon table type only when Data lake table format is set to Paimon.
The DDL statement is automatically generated based on the selected data lake table format and execution engine. You can modify the statement. When you are finished, click Create. After the target table is created, Dataphin automatically uses it as the output target.
Note: If a table with the same name exists in the development environment, Dataphin reports an error that the table already exists when you click Create.
If there are no matching items, you can still perform data integration by manually entering a table name.
Policy for missing production table
The policy to apply if the production table does not exist. You can select Do nothing or Automatic creation. The default is Automatic creation. If you select Do nothing, the table is not created when the node is published. If you select Automatic creation, a table with the same name is created in the target environment when the node is published.
Do nothing: If the target table does not exist, a message prompts you that the table does not exist when you submit the node, but you can still publish it. In this case, you must create the target table in the production environment before you can run the node.
Automatic creation: Click Edit DDL Statement. The DDL statement for the selected table is automatically filled in, and you can adjust it. Use the ${table_name} placeholder for the table name in the DDL statement. This is the only supported placeholder. It is replaced with the actual table name at runtime. If the target table does not exist, the system first tries to create it by using the DDL statement. If table creation fails, the pre-publish check fails. You can modify the DDL statement based on the error message and publish the node again. If the target table already exists, the DDL statement is not executed.
Note: This parameter is supported only in projects that use Dev-Prod mode.
Loading Policy
The policy for writing data to tables in the target Hive data source. The options are Append data, Overwrite data, and Update data. The applicable scenarios are as follows:
Append data: Appends data directly to the target table.
Overwrite data: Replaces existing data with new data.
Update data: Updates records based on the primary key. If a record does not exist, it is inserted.
NULL value replacement (optional)
This parameter is supported only for source tables in the textfile data storage format. Enter the string to replace with NULL. For example, if you enter \N, the system replaces the string \N with NULL.
Field separator (optional)
This parameter is supported only for source tables in the textfile data storage format. Enter the separator between fields. If you leave this blank, the system uses \u0001 as the separator by default.
Field separator handling (optional)
This parameter is supported only for output tables in the textfile data storage format. If your data contains the default or a custom field separator, you can configure a field separator handling policy to prevent data writing errors. You can select Keep, Remove, or Replace with.
Row delimiter handling (optional)
This parameter is supported only for output tables in the textfile data storage format. If your data contains the default or a custom row delimiter, you can configure a row delimiter handling policy to prevent data writing errors. The default row delimiter is \n. If your data contains line feed characters such as \r or \n, select a handling policy to prevent errors. You can select Keep, Remove, or Replace with.
Hadoop parameter settings (optional)
Used to adjust write parameters. You can enter different parameters for different table types. Separate multiple parameters with commas (,) in the format {"key1":"value1", "key2":"value2"}. For example, for an ORC output table with many fields, you can adjust the hive.exec.orc.default.buffer.size parameter based on the available memory. If you have enough memory, try increasing this value to improve write performance. If you run out of memory, try decreasing this value to reduce garbage collection (GC) time. The default value is 16384 bytes (16 KB). We recommend that you do not exceed 262144 bytes (256 KB).
Partition
If the selected target table is a partitioned table, you must enter partition information. For example, state_date=20190101. You can also use parameters to write data incrementally every day. For example, state_date=${bizdate}.
Field mapping
Input fields
Displays the input fields based on the output of the upstream component.
Output fields
The output fields area displays all fields of the selected table.
Important: To ensure that data is written to Hive without errors, all output fields must be mapped to input fields.
Mapping
You can manually select the field mapping based on the upstream input and the fields of the target table. The options are Map by name and Map by row.
Map by name: Maps fields that have the same name.
Map by row: Maps data between fields in the same row, even if the field names in the source and target tables are different. Only fields in the same row are mapped.
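The partition expressions used throughout this topic (for example, state_date=${bizdate}) rely on parameter substitution at run time. A sketch, assuming ${bizdate} expands to the day before the run date in yyyymmdd format:

```python
from datetime import date, timedelta

# Sketch of expanding ${bizdate} in a partition expression.
# Assumption: ${bizdate} is the business date, i.e. the day before
# the scheduled run date, formatted as yyyymmdd.
def render_partition(expr: str, run_date: date) -> str:
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    return expr.replace("${bizdate}", bizdate)

print(render_partition("state_date=${bizdate}", date(2019, 1, 2)))  # state_date=20190101
```

Because the expression is re-rendered on every scheduled run, each daily instance writes into its own partition.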
Click Confirm to complete the configuration of the Hive output component.