Hive Writer allows you to write data to Hadoop Distributed File System (HDFS) and load the data to Hive. This topic describes how Hive Writer works, its parameters, and how to configure it by using the codeless user interface (UI) and code editor.

Background information

Hive is a Hadoop-based data warehouse tool that is used to process large amounts of structured log data. Hive maps structured data files to tables and allows you to execute SQL statements to query the data in these tables.
Notice Hive Writer can write data only to Hive 2.3.x, and supports only exclusive resource groups for Data Integration, not the default resource group or custom resource groups. For more information, see Use exclusive resource groups for data integration, Use the default resource group, and Add a custom resource group.
Essentially, Hive converts Hive Query Language (HQL) or SQL statements to MapReduce programs:
  • Hive stores processed data in HDFS.
  • Hive uses MapReduce programs to analyze data at the underlying layer.
  • Hive runs the MapReduce programs on YARN.

How it works

Hive Writer connects to a Hive metastore and parses the configuration to obtain the storage path, storage format, and column delimiter of the HDFS file to which data is to be written. Then, Hive Writer writes the data to the HDFS file and loads the data from the HDFS file into the destination Hive table by using Java Database Connectivity (JDBC).

The underlying logic of Hive Writer is the same as that of HDFS Writer. You can set HDFS Writer parameters in the parameters of Hive Writer, and Data Integration transparently passes the configured parameters through to HDFS Writer.
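As a minimal illustration of the first step, the sketch below serializes rows into Hive's delimited text-file format. The delimiter values used here are Hive's TEXTFILE defaults and are assumptions for illustration; Hive Writer obtains the real storage format and column delimiter from the metastore.

```python
# Sketch: serialize rows into Hive's delimited text format before an HDFS
# upload. The \x01 field delimiter (Ctrl-A), newline row delimiter, and \N
# NULL token match Hive's TEXTFILE defaults; the real values come from the
# destination table's metastore entry.

FIELD_DELIM = "\x01"   # Hive TEXTFILE default field delimiter (Ctrl-A)
ROW_DELIM = "\n"       # rows are newline-terminated
NULL_TOKEN = "\\N"     # Hive's default representation of NULL

def serialize_rows(rows):
    """Render rows as the text payload that would be written to the HDFS file."""
    lines = []
    for row in rows:
        fields = [NULL_TOKEN if v is None else str(v) for v in row]
        lines.append(FIELD_DELIM.join(fields))
    return ROW_DELIM.join(lines) + ROW_DELIM

payload = serialize_rows([(1, "alice", 30), (2, None, 25)])
```

Once a file in this format is in place under the table's storage path, loading it into the table is a metadata operation performed by the LOAD DATA statement.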

Parameters

datasource: The connection name. It must be the same as the name of the added connection. Required: Yes. Default value: N/A.

column: The columns to which data is written. Example: "column":["id","name"]. Required: Yes. Default value: N/A.
  • Column pruning is supported: you can select specific columns to export.
  • The column parameter must explicitly specify the set of columns to which data is written and cannot be left empty.
  • The column order cannot be changed.

table: The name of the Hive table to which data is written. Required: Yes. Default value: N/A.
Note The name is case-sensitive.

partition: The partition in the Hive table to which data is written. Required: No. Default value: N/A.
  • This parameter is required for a partitioned Hive table. The sync node writes data to the partition that this parameter specifies.
  • This parameter is not required for a non-partitioned table.

writeMode: The mode in which data is loaded into the Hive table. After data is written to the HDFS file, Hive Writer executes a LOAD DATA INPATH ... [OVERWRITE] INTO TABLE statement to load the data into the Hive table. Required: Yes. Default value: N/A.
  • truncate: deletes the existing data before loading the data into the Hive table.
  • append: retains the existing data and appends the new data to the Hive table.
  • If the writeMode parameter is set to another value, the data is written to the HDFS file but is not loaded into the Hive table.
Note Setting the writeMode parameter is a high-risk operation. Check the destination directory and the value of this parameter carefully to avoid deleting data by accident.
This parameter must be used together with the hiveConfig parameter.

hiveConfig: The extended parameters for Hive, including hiveCommand, jdbcUrl, username, and password. Required: Yes. Default value: N/A.
  • hiveCommand: the full path of the Hive client. When the hive -e command is run, the LOAD DATA INPATH statement is executed to load data in the mode that the writeMode parameter specifies.
    The client that the hiveCommand parameter specifies provides the access information for Hive.
  • jdbcUrl, username, and password: the information that is required to connect to Hive by using JDBC. After Hive Writer connects to Hive over JDBC, it executes the LOAD DATA INPATH statement to load data in the mode that the writeMode parameter specifies.
    "hiveConfig": {
        "hiveCommand": "",
        "jdbcUrl": "",
        "username": "",
        "password": ""
    }
  • Hive Writer writes data to HDFS files by using an HDFS client. You can use the hiveConfig parameter to specify advanced settings for the HDFS client.
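To make the writeMode and partition semantics concrete, the sketch below builds the kind of LOAD DATA statement described above. The function name, exact quoting, and partition-clause handling are assumptions based on standard Hive LOAD DATA syntax, not Hive Writer's internals.

```python
def build_load_statement(hdfs_path, table, write_mode, partition=None):
    """Build a Hive LOAD DATA statement for the given write mode.

    truncate -> LOAD DATA ... OVERWRITE INTO TABLE (replaces existing data)
    append   -> LOAD DATA ... INTO TABLE           (keeps existing data)
    Any other mode returns None: the data stays in the HDFS file and is
    not loaded into the Hive table.
    """
    modes = {"truncate": "OVERWRITE INTO TABLE", "append": "INTO TABLE"}
    if write_mode not in modes:
        return None
    stmt = f"LOAD DATA INPATH '{hdfs_path}' {modes[write_mode]} {table}"
    if partition:
        # e.g. "year=a,month=b,day=c" -> PARTITION (year='a', month='b', day='c')
        pairs = [kv.split("=", 1) for kv in partition.split(",")]
        clause = ", ".join(f"{k}='{v}'" for k, v in pairs)
        stmt += f" PARTITION ({clause})"
    return stmt
```

This also shows why writeMode is a high-risk setting: the only difference between truncate and append is the OVERWRITE keyword, which silently replaces all existing data in the target table or partition.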

Configure Hive Writer by using the codeless UI

On the DataStudio page, double-click a data sync node, and perform the following operations on the node configuration tab that appears. For more information, see Create a sync node by using the codeless UI.

  1. Configure the connections.
    Configure the connections to the source and destination data stores for the sync node.
    Parameter Description
    Data source The datasource parameter in the preceding parameter description. Select a connection type and select the name of a connection that you have configured in DataWorks.
    Table The table parameter in the preceding parameter description.
    Partition information The partition to which data is written. The last-level partition must be specified. Hive Writer can write data to only one partition.
    Write Mode The writeMode parameter in the preceding parameter description.
  2. Configure field mapping. This is equivalent to setting the column parameter in the preceding parameter description. Fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right.
    GUI element Description
    The same name mapping Click The same name mapping to establish a mapping between fields with the same name. Note that the data types of the fields must match.
    Peer mapping Click Peer mapping to establish a mapping between fields in the same row. Note that the data types of the fields must match.
    Unmap Click Unmap to remove mappings that have been established.
    Automatic typesetting Click Automatic typesetting to sort the fields based on specified rules.
  3. Configure channel control policies.
    Parameter Description
    Maximum number of concurrent tasks expected The maximum number of concurrent threads that the sync node uses to read data from the source data store and write data to the destination data store.
    Synchronization rate Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value.
    The number of error records exceeds The maximum number of dirty data records allowed.
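In the code editor, these three channel control settings correspond to the setting block of the sync node script: the dirty-data limit maps to errorLimit.record, and the concurrency and throttling settings live under speed. The values below are illustrative only.

```json
"setting": {
    "errorLimit": {
        "record": "0"      // The number of error records exceeds: allow no dirty records.
    },
    "speed": {
        "throttle": false, // Synchronization rate: bandwidth throttling disabled.
        "concurrent": 2    // Maximum number of concurrent tasks expected.
    }
}
```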

Configure Hive Writer by using the code editor

The following example shows how to configure a sync node to write data to Hive. For more information, see Create a sync node by using the code editor.
{
    "type": "job",
    "steps": [
        {
            "stepType": "hive",
            "parameter": {
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "hive",
            "parameter": {
                "partition": "year=a,month=b,day=c",// The partition to which data is written.
                "datasource": "hive_ha_shanghai", // The connection name.
                "table": "partitiontable2", // The name of the destination table.
                "column": [// The columns in the destination table to which data is written.
                    "id",
                    "name",
                    "age"
                ],
                "writeMode": "append"// The write mode.
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": false,
            "concurrent": 2
        }
    }
}
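The script above contains inline // comments, which the standard json module rejects. If you want to lint such a script locally before pasting it into the code editor, a small string-aware stripper like the one below (an assumption for local tooling, not part of DataWorks) removes the comments without corrupting values such as jdbc:hive2:// URLs.

```python
import json

def strip_line_comments(text):
    """Remove // comments that appear outside of JSON string literals."""
    out_lines = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False          # skip the character after a backslash
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif ch == "/" and not in_string and line[i:i + 2] == "//":
                cut = i                  # comment starts here; drop the rest
                break
        out_lines.append(line[:cut].rstrip())
    return "\n".join(out_lines)

snippet = '{"url": "hdfs://host:8020", "writeMode": "append" // the write mode\n}'
job = json.loads(strip_line_comments(snippet))
```

Note that the slashes inside "hdfs://host:8020" survive because the stripper tracks whether it is inside a string literal.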