DataWorks: Hive data source

Last Updated: Mar 13, 2025

DataWorks offers Hive Reader and Hive Writer to facilitate data reading from and writing to Hive data sources. This topic describes the capabilities of data synchronization with Hive data sources.

Background information

Hive is a data warehouse tool built on Hadoop that processes vast amounts of structured logs. It maps structured data files to tables and enables SQL statement execution for data querying. At its core, Hive is an SQL parsing engine that translates HQL into MapReduce programs, which run on YARN, and uses HDFS for data storage.

Hive Reader connects to a Hive metastore to retrieve metadata for your Hive table. It can read data in two ways:

  • Reading Data Based on HDFS Files

    Hive Reader connects to the Hive metastore, retrieves the storage path, file format, and column delimiter for the corresponding HDFS file, and then reads the data from it.

  • Reading Data Using Hive JDBC

    Hive Reader uses the Hive JDBC client to connect to the HiveServer2 service. In this mode, you can filter data with a WHERE clause and read data directly by executing SQL statements.

Hive Writer connects to the Hive metastore, obtains the storage path, file format, and column delimiter for the HDFS file, and writes data to it. It then uses the Hive JDBC client to execute the LOAD DATA SQL statement to load the data into Hive tables.

The underlying logic of Hive Writer is identical to that of HDFS Writer. You can configure HDFS Writer parameters within Hive Writer settings, and Data Integration will pass them on to HDFS Writer seamlessly.
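
For example, a writer step can carry HDFS Writer-style settings alongside the regular Hive Writer parameters. The following is a minimal sketch, assuming a hypothetical data source and table; fieldDelimiter and compress are standard HDFS Writer parameters shown here only as illustrative placeholders, and whether a given HDFS Writer parameter applies depends on the storage format of your table.

    {
        "stepType": "hive",
        "parameter": {
            "datasource": "hive_demo_****", // Hypothetical data source name.
            "table": "ods_demo_table", // Hypothetical destination table.
            "column": ["id", "name"],
            "writeMode": "append",
            "fieldDelimiter": "\t", // HDFS Writer-style parameter: assumed column delimiter of the HDFS file.
            "compress": "gzip" // HDFS Writer-style parameter: assumed compression format.
        },
        "name": "Writer",
        "category": "writer"
    }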

Supported versions

Hive Reader and Hive Writer support the following Hive versions:

0.8.0
0.8.1
0.9.0
0.10.0
0.11.0
0.12.0
0.13.0
0.13.1
0.14.0
1.0.0
1.0.1
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
2.0.1
2.1.0
2.1.1
2.2.0
2.3.0
2.3.1
2.3.2
2.3.3
2.3.4
2.3.5
2.3.6
2.3.7
3.0.0
3.1.0
3.1.1
3.1.2
3.1.3
0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3
3.1.1-cdh7.1.1

Limits

  • Hive data sources are compatible with serverless resource groups (recommended) or exclusive resource groups for Data Integration.

  • Hive Reader can only read files in text, ORC, and Parquet formats.

  • Batch synchronization tasks generate temporary files on the server when synchronizing data to a Hive data source. These files are automatically deleted after the task completes. Ensure the number of temporary files does not exceed the HDFS directory limit to prevent service unavailability.

    Note

    On the server, the dfs.namenode.fs-limits.max-directory-items parameter sets the maximum number of directories or files (not counted recursively) that a single directory can contain. The default value is 1,048,576, and the valid range is 1 to 6,400,000. To prevent this issue, increase the value of this parameter in HDFS or delete unnecessary files.

  • Access to a Hive data source can be secured with Kerberos authentication or SSL authentication. If no identity authentication is needed, select No Authentication under Authentication Options when adding the data source.

  • When using Kerberos authentication for a Hive data source in DataWorks, if both HiveServer2 and the metastore require Kerberos but have different principals, configure the following settings in the advanced parameters:

    {
        "hive.metastore.kerberos.principal": "your metastore principal"
    }

Supported field types

Batch data read

The following Hive data types are supported, grouped by type classification:

  • String: CHAR, VARCHAR, STRING

  • Integer: TINYINT, SMALLINT, INT, INTEGER, BIGINT

  • Floating point: FLOAT, DOUBLE, DECIMAL

  • Date and time: TIMESTAMP, DATE

  • Boolean: BOOLEAN

Preparations before data synchronization

Preparation steps vary depending on the data source configuration mode:

Alibaba Cloud instance mode

To synchronize an OSS table under this instance, select an appropriate Access Identity: Alibaba Cloud Account, Alibaba Cloud Sub-account, or RAM Role. Make sure that the selected identity has the required OSS permissions; otherwise, data synchronization may fail.

Important

A connectivity test does not verify data read and write permissions.

Connection string mode

DLF configuration

If your Hive data source metadata is managed with DLF, include the following in the Advanced Parameters when configuring the Hive data source:

{"dlf.catalog.id" : "my_catalog_xxxx"}

In this example, my_catalog_xxxx is the catalog name that is specified by the dlf.catalog.id parameter in the EMR Hive configuration.

High availability configuration

For an EMR Hive cluster with High Availability enabled, configure the HA information in the Advanced Parameters when setting up the Hive data source. The format is as follows. Visit the EMR console to locate the cluster and click Cluster Service in the Actions column for the necessary configuration values.

{
    // Advanced parameters related to high availability.
    "dfs.nameservices": "testDfs",
    "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
    "dfs.namenode.rpc-address.testDfs.namenode1": "",
    "dfs.namenode.rpc-address.testDfs.namenode2": "",
    "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    // If you use OSS as the underlying storage, you can also configure the following parameters that are required for connecting to OSS.
    "fs.oss.accessKeyId": "<yourAccessKeyId>",
    "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
    "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
}

Configuration of an OSS external table

When using OSS as the underlying storage, consider the following:

  • Set defaultFS with oss:// as the prefix, such as oss://bucketName.

  • For OSS external tables, include OSS-related information in the Advanced Parameters when configuring the Hive data source.

    {
        "fs.oss.accessKeyId":"<yourAccessKeyId>",
        "fs.oss.accessKeySecret":"<yourAccessKeySecret>",
        "fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
    }
  • For OSS-HDFS external tables, include OSS-HDFS-related information in the Advanced Parameters when configuring the Hive data source.

    {
        "fs.oss.accessKeyId":"<yourAccessKeyId>",
        "fs.oss.accessKeySecret":"<yourAccessKeySecret>",
        "fs.oss.endpoint":"cn-<yourRegion>.oss-dls.aliyuncs.com"
    }

CDH mode

To configure a Hive data source in CDH mode, register the CDH cluster with DataWorks.

Create a data source

When developing a data synchronization task, create the corresponding data source in DataWorks. For more information, see Resource Management. You can view the detailed description of each configuration parameter on the configuration page.

The following section details the parameters for different Authentication Options:

Kerberos authentication

Parameter

Description

keytab file

The .keytab file that is generated when a service principal is registered in a Kerberos environment.

conf file

The conf file is the configuration file of Kerberos. It is used to define various settings of Kerberos clients and servers. The main configuration files are as follows:

  • krb5.conf: The configuration file that is used by clients and libraries. It defines global default settings, realm configurations, domain mappings, application default settings, and logging options.

  • kdc.conf: The configuration file of the Key Distribution Center (KDC) server. It defines the database location, log file location, and other KDC-specific settings.

principal

An identity entity that can be a user or a service. It has a unique name and an associated encryption key.

  • The format of a user principal is username@REALM.

  • The format of a service principal is service/hostname@REALM.

SSL authentication

Parameter

Description

Truststore certificate file

The Truststore certificate file that is generated when SSL authentication is enabled. For example, the truststore.jks file.

Truststore password

The password that is set when the Truststore certificate file is generated by using SSL authentication.

Keystore certificate file

The Keystore certificate file that is generated when SSL authentication is enabled. For example, the keystore.jks file.

Keystore password

The password that is set when the Keystore certificate file is generated by using SSL authentication.

Develop a data synchronization task

For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.

Configure a batch synchronization task to synchronize data of a single table

Configure synchronization settings to implement batch synchronization of all data in a database

For the operation process, see Configure a synchronization task on the Data Integration side.

Appendix: code and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for Hive reader

Hive Reader can read data from HDFS files or use Hive JDBC:

  • Reading Data from HDFS Files

    {
        "type": "job",
        "steps": [
            {
                "stepType": "hive",
                "parameter": {
                    "partition": "pt1=a,pt2=b,pt3=c", // The partitions in the Hive table.
                    "datasource": "hive_not_ha_****", // The name of the data source.
                    "column": [ // The names of the columns from which you want to read data.
                        "id",
                        "pt2",
                        "pt1"
                    ],
                    "readMode": "hdfs", // The mode in which Hive Reader reads data.
                    "table": "part_table_1",
                    "hdfsUsername" : "hdfs",
                    "hivePartitionColumn": [
                        {
                          "type": "string",
                          "value": "The name of the first partition."
                        },
                        {
                          "type": "string",
                          "value": "The name of the second partition."
                         }
                   ]
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "hive",
                "parameter": {
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": "" // The number of error records.
            },
            "speed": {
                "concurrent": 2, // The maximum number of parallel threads.
                "throttle": true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
                "mbps":"12"// The maximum transmission rate. Unit: MB/s.
            }
        }
    }
  • Reading Data Using Hive JDBC

    {
        "type": "job",
        "steps": [
            {
                "stepType": "hive",
                "parameter": {
                    "querySql": "select id,name,age from part_table_1 where pt2='B'",
                    "datasource": "hive_not_ha_****",  // The name of the data source.
                    "column": [ // The names of the columns from which you want to read data.
                        "id",
                        "name",
                        "age"
                    ],
                    "where": "",
                    "table": "part_table_1",
                    "readMode": "jdbc" // The mode in which Hive Reader reads data.
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "hive",
                "parameter": {
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": ""
            },
            "speed": {
                "concurrent": 2,  // The maximum number of parallel threads.
                "throttle": true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
                "mbps":"12"// The maximum transmission rate. Unit: MB/s.
                
            }
        }
    }

Parameters in code for Hive reader

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source.

Yes

None

table

The name of the table from which you want to read data.

Note

The name is case-sensitive.

Yes

None

readMode

The mode in which Hive Reader reads data:

  • If Hive Reader reads data based on HDFS files, set this parameter to "readMode":"hdfs".

  • If Hive Reader reads data by using Hive JDBC, set this parameter to "readMode":"jdbc".

Note
  • If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause to filter data. In this case, MapReduce tasks may be generated at the underlying layer of the Hive database, which results in low data reading efficiency.

  • If Hive Reader reads data based on HDFS files, you cannot specify a WHERE clause to filter data. In this case, Hive Reader directly reads data from the HDFS file that corresponds to your Hive table, which improves data reading efficiency.

  • If Hive Reader reads data based on HDFS files, data of views cannot be read.

No

None

partition

The partitions in the Hive table:

  • This parameter is not required if Hive Reader reads data by using Hive JDBC.

  • If the Hive table from which you want to read data is a partitioned table, you must configure the partition parameter. The synchronization task reads data from the partitions that correspond to the partition parameter.

    For single-level partitions, you can use an asterisk (*) as a wildcard when you configure this parameter. Wildcards are not supported for multi-level partitions. A wildcard example is shown in the sketch after this parameter list.

  • If the Hive table from which you want to read data is not a partitioned table, you do not need to configure the partition parameter.

No

None

column

The names of the columns from which you want to read data. Example: "column": ["id", "name"].

  • You can select specific columns to read.

  • The order of the columns can be changed. You can configure Hive Reader to synchronize data from specific columns in an order different from the order specified in the schema of the table.

  • You can specify partition key columns.

  • You can configure constants.

  • You must specify the columns from which you want to read data. The column parameter cannot be left empty.

Yes

None

querySql

If Hive Reader reads data by using Hive JDBC, you can configure the querySql parameter to read data.

No

None

where

If Hive Reader reads data by using Hive JDBC, you can specify a where clause to filter data.

No

None

hdfsUsername

If Hive Reader reads data based on HDFS files, the user that is specified on the Hive data source page is used by default. If anonymous logon is configured on the data source page, the admin account is used by default. If a permission issue occurs during a synchronization task, you must configure the hdfsUsername parameter in script mode.

No

None

hivePartitionColumn

If you want to synchronize the values of partition key columns to downstream, you can configure the hivePartitionColumn parameter in script mode.

No

None
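
The following reader parameter sketch illustrates the asterisk wildcard for a single-level partitioned table. The data source name, table name, and partition key name are hypothetical placeholders, and the wildcard value follows the single-level partition rule described above.

    {
        "stepType": "hive",
        "parameter": {
            "datasource": "hive_demo_****", // Hypothetical data source name.
            "table": "single_part_table", // Hypothetical table with a single-level partition key pt.
            "column": ["id", "name"],
            "readMode": "hdfs",
            "partition": "pt=*", // The asterisk matches all values of the single partition key.
            "hdfsUsername": "hdfs"
        },
        "name": "Reader",
        "category": "reader"
    }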

Code for Hive writer

{
    "type": "job",
    "steps": [
        {
            "stepType": "hive",
            "parameter": {
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "hive",
            "parameter": {
                "partition": "year=a,month=b,day=c", // The partition to which you want to write data.
                "datasource": "hive_ha_shanghai", // The name of the data source.
                "table": "partitiontable2", // The name of the table to which you want to write data.
                "column": [ // The names of the columns to which you want to write data.
                    "id",
                    "name",
                    "age"
                ],
                "writeMode": "append" ,// The write mode.
                "hdfsUsername" : "hdfs"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle":true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
            "concurrent":2, // The maximum number of parallel threads.
            "mbps":"12"// The maximum transmission rate. Unit: MB/s.
        }
    }
}

Parameters in code for Hive writer

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source.

Yes

None

column

The names of the columns to which you want to write data. Example: "column": ["id", "name"].

  • You can write data to some of the columns.

  • You must specify the columns to which you want to write data. The column parameter cannot be left empty.

  • The order of the columns cannot be changed.

Yes

None

table

The name of the Hive table to which you want to write data.

Note

The name is case-sensitive.

Yes

None

partition

The partitions in the Hive table:

  • If the Hive table to which you want to write data is a partitioned table, you must configure the partition parameter. The synchronization task writes data to the partitions that correspond to the partition parameter.

  • If the Hive table to which you want to write data is not a partitioned table, you do not need to configure the partition parameter.

No

None

writeMode

The mode in which Hive Writer loads data into the Hive table. After data is written to HDFS files, Hive Writer executes a LOAD DATA INPATH ... [OVERWRITE] INTO TABLE statement to load the data into the Hive table. A combined writer parameter sketch that uses writeMode is provided at the end of this topic.

The writeMode parameter specifies the data loading behavior:

  • If the writeMode parameter is set to truncate, data is cleared before it is loaded.

  • If the writeMode parameter is set to append, existing data is retained.

  • If the writeMode parameter is set to another value, data is written to HDFS files without being loaded to the Hive table.

Note

The writeMode parameter is important. Pay attention to the data write directory and the behavior specified by the writeMode parameter to avoid data loss.

The data loading behavior must be used together with the hiveConfig parameter. Pay attention to your configuration.

Yes

None

hiveConfig

You can configure advanced Hive parameters in the hiveConfig parameter, including hiveCommand, jdbcUrl, username, and password:

  • hiveCommand: the full path of the Hive client tool. After data is written, Hive Writer runs hive -e to execute the LOAD DATA INPATH statement that corresponds to the writeMode setting.

    The Hive access information is provided by the client that the hiveCommand parameter specifies.

  • jdbcUrl, username, and password: the JDBC access information for Hive. Hive Writer connects to Hive through the Hive JDBC driver and then executes the LOAD DATA INPATH statement that corresponds to the writeMode setting.

    "hiveConfig": {
        "hiveCommand": "",
        "jdbcUrl": "",
        "username": "",
        "password": ""
            }
  • Hive Writer uses an HDFS client to write data to HDFS files. You can also configure advanced parameters for the HDFS client in the hiveConfig parameter.

Yes

None

hdfsUsername

When you write data to a Hive table, the user that is specified on the Hive data source page is used by default. If anonymous logon is configured on the data source page, the admin account is used by default. If a permission issue occurs during a synchronization task, you must configure the hdfsUsername parameter in script mode.

No

None

enableColumnExchange

If this parameter is set to true, the column order can be changed.

Note

Only the text format is supported.

No

None

nullFormat

Data Integration provides the nullFormat parameter to define which strings can be considered as null.

For example, if you set nullFormat to "null" and a source value is the string "null", Data Integration treats that value as a null field.

Note

The string "null" (n, u, l, l) is different from the actual null value.

No

None
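
To tie together the writeMode, enableColumnExchange, and nullFormat parameters described above, the following is a minimal writer parameter sketch. The data source name, table name, and partition value are hypothetical placeholders, and the sketch assumes a text-format table because enableColumnExchange supports only the text format.

    {
        "stepType": "hive",
        "parameter": {
            "datasource": "hive_demo_****", // Hypothetical data source name.
            "table": "ods_demo_table", // Hypothetical text-format destination table.
            "partition": "pt=20250101", // Hypothetical partition value.
            "column": ["id", "name", "age"],
            "writeMode": "truncate", // Clear existing data before loading.
            "enableColumnExchange": true, // Allow the column order to be changed (text format only).
            "nullFormat": "null", // Treat the string "null" in the source as a null field.
            "hdfsUsername": "hdfs"
        },
        "name": "Writer",
        "category": "writer"
    }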