Hive Writer writes data to Hadoop Distributed File System (HDFS) and loads the data to Hive. This topic describes how Hive Writer works, the parameters that are supported by Hive Writer, and how to configure Hive Writer by using the codeless user interface (UI) and code editor.

Background information

Hive is a Hadoop-based data warehouse tool that is used to process large amounts of structured logs. Hive maps structured data files to a table and allows you to execute SQL statements to query data in the table.
Notice Hive Writer supports only exclusive resource groups for Data Integration, but not the shared resource group or custom resource groups for Data Integration. For more information, see Create and use an exclusive resource group for Data Integration, Use a shared resource group, and Create a custom resource group for Data Integration. For more information about the Hive versions that are supported by Hive Writer, see the Hive versions supported by Hive Writer section of this topic.
Hive converts Hive Query Language (HQL) or SQL statements into MapReduce programs.
  • Hive stores processed data in HDFS.
  • Hive uses MapReduce programs to analyze data at the underlying layer.
  • Hive runs MapReduce programs on YARN.

How it works

Hive Writer connects to a Hive metastore and obtains the storage path, format, and column delimiter of the HDFS file to which you want to write data. Then, Hive Writer writes data to the HDFS file and loads the data in the HDFS file to the destination Hive table by using Java Database Connectivity (JDBC).

The underlying logic of Hive Writer is the same as that of HDFS Writer. You can set parameters for HDFS Writer in the parameters of Hive Writer. Data Integration transparently transmits the configured parameters to HDFS Writer.
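
Because the parameters are passed through, you can include HDFS Writer settings directly in the parameter object of Hive Writer. The following sketch is a minimal example with hypothetical values. The fieldDelimiter and compress keys are shown only as examples of pass-through settings; check the HDFS Writer documentation for the parameters that it supports.

{
    "stepType": "hive",
    "parameter": {
        "datasource": "my_hive_datasource", // Hypothetical data source name.
        "table": "ods_log", // Hypothetical table name.
        "column": ["id", "name"],
        "writeMode": "append",
        "fieldDelimiter": ",", // Passed through to HDFS Writer.
        "compress": "gzip" // Passed through to HDFS Writer.
    }
}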

Parameters

datasource
  Required: Yes. Default value: N/A.
  The name of the data source. It must be the same as the name of the added data source.

column
  Required: Yes. Default value: N/A.
  The names of the columns to which you want to write data, such as "column": ["id", "name"].
  • You can write data to only some of the columns in the destination table.
  • The column parameter must include all the columns to which you want to write data. This parameter cannot be left empty.
  • The order of the columns cannot be changed.

table
  Required: Yes. Default value: N/A.
  The name of the Hive table to which you want to write data.
  Note The name is case-sensitive.

partition
  Required: No. Default value: N/A.
  The partition information of the Hive table.
  • If you want to write data to a partitioned Hive table, this parameter is required. Hive Writer writes data to the partition that is specified by this parameter.
  • If you want to write data to a non-partitioned table, this parameter is not required.

writeMode
  Required: Yes. Default value: N/A.
  The mode in which Hive Writer loads data from the HDFS file to the Hive table. After data is written to the HDFS file, Hive Writer executes the LOAD DATA INPATH ... [OVERWRITE] INTO TABLE statement to load the data to the Hive table. Valid values:
  • truncate: Hive Writer deletes existing data before it loads data to the Hive table.
  • append: Hive Writer retains the existing data and appends data to the Hive table.
  • Other values: Hive Writer writes data to the HDFS file but does not load the data to the Hive table.
  Note Set this parameter with caution. Check the destination directory and the value of this parameter to prevent data from being unexpectedly deleted.
  This parameter must be used together with the hiveConfig parameter.

hiveConfig
  Required: Yes. Default value: N/A.
  The extended parameters for Hive. The following fields are supported: hiveCommand, jdbcUrl, username, and password.
  • hiveCommand: the full path of the Hive client. After the hive -e command is run, the LOAD DATA INPATH statement is executed to load data in the mode that is specified by the writeMode parameter.
    The client that is specified by the hiveCommand parameter provides access information about Hive.
  • jdbcUrl, username, and password: the information that is required to connect to Hive by using JDBC. After Hive Writer connects to Hive by using JDBC, it executes the LOAD DATA INPATH statement to load data in the mode that is specified by the writeMode parameter.
    "hiveConfig": {
        "hiveCommand": "",
        "jdbcUrl": "",
        "username": "",
        "password": ""
    }
  • Hive Writer uses an HDFS client to write data to HDFS files. You can also use the hiveConfig parameter to specify advanced settings for the HDFS client.
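
The writeMode and hiveConfig parameters must be used together. The following sketch shows a minimal parameter block; the JDBC endpoint, username, and password are hypothetical placeholders.

"parameter": {
    "datasource": "my_hive_datasource", // Hypothetical data source name.
    "table": "partitiontable2",
    "column": ["id", "name"],
    "writeMode": "truncate", // Existing data is deleted before the load.
    "hiveConfig": {
        "jdbcUrl": "jdbc:hive2://hive-server:10000/default", // Hypothetical JDBC endpoint.
        "username": "hive",
        "password": "******"
    }
}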

Configure Hive Writer by using the codeless UI

On the DataStudio page, double-click the sync node that you created. On the node configuration tab that appears, set the parameters for the node. For more information, see Configure a sync node by using the codeless UI.

Perform the following steps on the configuration tab of the sync node:
  1. Configure the source and destination.
    Configure the source and destination data sources for the sync node. The following parameters are available in the Target section:
    • Connection: the name of the data source to which you want to write data. This parameter is equivalent to the datasource parameter that is described in the preceding section.
    • Table: the name of the table to which you want to write data. This parameter is equivalent to the table parameter that is described in the preceding section.
    • Partition Key Column: the partition to which Hive Writer writes data. You must specify the lowest-level partition. Hive Writer can write data only to a single partition.
    • Write Mode: the write mode. This parameter is equivalent to the writeMode parameter that is described in the preceding section.
    • Parquet Schema: the schema of a Hive table that is stored in the Parquet format. This parameter is required only if the Hive table at the underlying layer is of the Parquet type. The following example shows the configuration:
      message tmp{
          REQUIRED BINARY id;
          REQUIRED BINARY name;
          REQUIRED BINARY cyctime1;
      }
  2. Configure field mappings. This operation is equivalent to setting the column parameter that is described in the preceding section. Fields in the source on the left have a one-to-one mapping with fields in the destination on the right. The following operations are supported:
    • Map Fields with the Same Name: establishes mappings between fields that have the same name. The data types of the fields must match.
    • Map Fields in the Same Line: establishes mappings between fields in the same row. The data types of the fields must match.
    • Delete All Mappings: removes all the mappings that are established.
    • Auto Layout: automatically sorts the fields based on specific rules.
  3. Configure channel control policies. The following parameters are available in the Channel section:
    • Expected Maximum Concurrency: the maximum number of parallel threads that the sync node uses to read data from the source or write data to the destination. You can configure the parallelism for the sync node on the codeless UI.
    • Bandwidth Throttling: specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
    • Dirty Data Records Allowed: the maximum number of dirty data records that are allowed.
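
These settings correspond to the setting block that is shown in the code editor example in the next section. The following sketch uses hypothetical values.

"setting": {
    "errorLimit": {
        "record": "10" // Dirty Data Records Allowed.
    },
    "speed": {
        "concurrent": 2, // Expected Maximum Concurrency.
        "throttle": true, // Bandwidth Throttling.
        "mbps": "12" // The maximum transmission rate. Takes effect only if throttle is set to true.
    }
}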

Configure Hive Writer by using the code editor

In the following code in the JSON format, a sync node is configured to write data to Hive. For more information about how to configure a sync node by using the code editor, see Create a sync node by using the code editor.
{
    "type": "job",
    "steps": [
        {
            "stepType": "hive",
            "parameter": {
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "hive",
            "parameter": {
                "partition": "year=a,month=b,day=c", // The partition information of the Hive table.
                "datasource": "hive_ha_shanghai", // The name of the data source.
                "table": "partitiontable2", // The Hive table to which you want to write data.
                "column": [ // The names of the columns to which you want to write data.
                    "id",
                    "name",
                    "age"
                ],
                "writeMode": "append" // The mode in which Hive Writer loads data to the Hive table.
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle":true,// Specifies whether to enable bandwidth throttling. A value of false indicates that bandwidth throttling is disabled, and a value of true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only if the throttle parameter is set to true. 
            "concurrent": 2, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    }
}

Hive versions supported by Hive Writer

Hive Writer supports the following Hive versions:
0.8.0
0.8.1
0.9.0
0.10.0
0.11.0
0.12.0
0.13.0
0.13.1
0.14.0
1.0.0
1.0.1
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
2.0.1
2.1.0
2.1.1
2.2.0
2.3.0
2.3.1
2.3.2
2.3.3
2.3.4
2.3.5
2.3.6
2.3.7
3.0.0
3.1.0
3.1.1
3.1.2
0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3