Connect DataWorks Data Integration to Apache Hive for bidirectional offline data synchronization. This topic explains how reading and writing work, which Hive versions are supported, and how to configure a data source and sync tasks.
Supported operations
| Operation | Single-table sync | Full-database sync |
|---|---|---|
| Read (Hive as source) | Yes | Yes |
| Write (Hive as destination) | Yes | No |
How it works
The Hive Reader plugin and Hive Writer plugin use the Hive Metastore to access table metadata — including the Hadoop Distributed File System (HDFS) file storage path, file format, and separator.
Reading
Two read modes are available. Choose based on your filtering requirements.
| Mode | How it works | Supports WHERE filter | Performance |
|---|---|---|---|
| hdfs | Reads table data directly from HDFS files | No | Higher — bypasses the query engine |
| jdbc | Connects to HiveServer2 through a Hive JDBC client | Yes | Lower — may trigger a MapReduce job |
Writing
The Hive Writer plugin writes data to HDFS files, then runs a LOAD DATA SQL statement through a Hive JDBC client to load the data into the target Hive table. The underlying logic mirrors the HDFS Writer plugin — HDFS Writer parameters can be configured inside Hive Writer and are passed through directly.
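The two-phase write described above can be sketched in a few lines; this is a minimal illustration of the staging-then-LOAD logic, not the plugin's actual implementation (the staging path, table, and partition values are hypothetical):

```python
def build_load_statement(stage_path, table, write_mode, partition=None):
    """Compose the LOAD DATA statement Hive Writer would run after
    staging files on HDFS. 'truncate' maps to OVERWRITE; 'append'
    keeps existing data; any other value skips the load entirely."""
    if write_mode not in ("truncate", "append"):
        return None  # files stay on HDFS; nothing is loaded into the table
    overwrite = "OVERWRITE " if write_mode == "truncate" else ""
    stmt = f"LOAD DATA INPATH '{stage_path}' {overwrite}INTO TABLE {table}"
    if partition:  # e.g. "year=a,month=b" -> PARTITION (year='a', month='b')
        parts = ", ".join(
            "{}='{}'".format(*kv.split("=")) for kv in partition.split(",")
        )
        stmt += f" PARTITION ({parts})"
    return stmt

print(build_load_statement("/tmp/stage_xxx", "partitiontable2",
                           "append", "year=a,month=b,day=c"))
```

This mirrors the writeMode semantics documented in the writer parameter table below: truncate overwrites, append retains, and any other value leaves the files on HDFS without loading them.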
Supported versions
Standard releases: 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.3.5, 2.3.6, 2.3.7, 3.0.0, 3.1.0, 3.1.1, 3.1.2, 3.1.3
CDH releases (selected versions): 0.8.1-cdh4.0.0 through 3.1.1-cdh7.1.1, including CDH 4.x, CDH 5.x (up to 5.16.99), and CDH 6.x/7.x variants.
Limits
| Limit | Detail |
|---|---|
| Supported resource groups | Serverless resource groups (recommended) and exclusive resource groups for Data Integration |
| Readable file formats | TextFile, ORCFile, or ParquetFile |
| Authentication | Kerberos and SSL. If no authentication is required, select No Authentication for Authentication Options when adding the data source |
| Hive views | Not supported in HDFS read mode |
| Multi-level partition wildcard | The * wildcard is supported for single-level partitions only |
Temporary files during sync tasks
Offline sync tasks generate temporary files on the HDFS server. These files are deleted automatically after the task completes. Monitor the HDFS directory file count to prevent the file system from becoming unavailable.
The dfs.namenode.fs-limits.max-directory-items parameter controls the maximum number of files or directories a single HDFS directory can contain. The default is 1,048,576 and the range is 1–6,400,000. To avoid hitting this limit, increase this parameter value or delete unnecessary files.
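For reference, the limit is raised in the NameNode's hdfs-site.xml; a sketch only — the value shown is illustrative and should match your cluster's capacity planning:

```xml
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>3200000</value>
</property>
```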
Kerberos with split principals
If both HiveServer2 and the Hive Metastore have Kerberos enabled but use different principals, add the following to the extension parameters:
{
"hive.metastore.kerberos.principal": "<your-metastore-principal>"
}
Supported field types
The following Hive data types are supported for offline reads.
| Category | Hive data types |
|---|---|
| String | CHAR, VARCHAR, STRING |
| Integer | TINYINT, SMALLINT, INT, INTEGER, BIGINT |
| Floating-point | FLOAT, DOUBLE, DECIMAL |
| Date and time | TIMESTAMP, DATE |
| Boolean | BOOLEAN |
Prerequisites
Before configuring a Hive data source, complete the setup steps for your deployment mode.
Alibaba Cloud instance mode
Select an Access Identity with the necessary OSS permissions for the tables you want to sync. Supported identities are Alibaba Cloud Account, RAM User, and RAM Role.
The connectivity test does not verify data read and write permissions. Insufficient permissions will cause sync tasks to fail.
Connection string mode
DLF configuration
If your Hive data source is from EMR and uses DLF (Data Lake Formation) for metadata management, add the following to the Extension Parameters field:
{"dlf.catalog.id" : "my_catalog_xxxx"}
Replace my_catalog_xxxx with the value of dlf.catalog.id from your EMR Hive configuration.
High availability (HA) configuration
If the EMR Hive cluster has high availability (HA) enabled, turn on High-availability Mode and configure the HA settings in Extension Parameters. Get the configuration values from the EMR console by navigating to the target cluster and clicking Cluster Services in the Actions column.
{
"dfs.nameservices": "testDfs",
"dfs.ha.namenodes.testDfs": "namenode1,namenode2",
"dfs.namenode.rpc-address.testDfs.namenode1": "",
"dfs.namenode.rpc-address.testDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"fs.oss.accessKeyId": "<yourAccessKeyId>",
"fs.oss.accessKeySecret": "<yourAccessKeySecret>",
"fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
}
The last three OSS fields (fs.oss.*) are required only if the underlying storage is OSS.
OSS external table configuration
If the underlying storage is OSS:
- Set defaultFS with the oss:// prefix, for example, oss://bucketName.
- For OSS external tables, add the following to Extension Parameters:

  {
    "fs.oss.accessKeyId": "<yourAccessKeyId>",
    "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
    "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
  }

- For OSS-HDFS external tables, use the OSS-HDFS endpoint format instead:

  {
    "fs.oss.accessKeyId": "<yourAccessKeyId>",
    "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
    "fs.oss.endpoint": "cn-<yourRegion>.oss-dls.aliyuncs.com"
  }
CDH mode
Register the CDH cluster with DataWorks before configuring a Hive data source in CDH mode.
Create a data source
Create a Hive data source in DataWorks before developing sync tasks. For the full procedure, see Data source management. Parameter descriptions are shown as inline hints on the configuration page.
The following tables describe parameters specific to each Authentication Options value.
Kerberos authentication
| Parameter | Description |
|---|---|
| keytab file | The .keytab file generated when a service principal is registered in the Kerberos environment. |
| conf file | The Kerberos configuration file. The main files are krb5.conf (client and library settings: global defaults, realm configurations, domain mappings, logging) and kdc.conf (Key Distribution Center server settings: database location, log file location). |
| principal | The identity entity — a user or a service — with a unique name and an associated encryption key. User format: username@REALM. Service format: service/hostname@REALM. |
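As a quick sanity check on the two principal formats described above, a small sketch (the regex and example principals are illustrative, not part of DataWorks):

```python
import re

# username@REALM (user) or service/hostname@REALM (service)
PRINCIPAL_RE = re.compile(r"^[\w.-]+(/[\w.-]+)?@[A-Z0-9.-]+$")

def principal_kind(principal):
    """Classify a Kerberos principal as 'user', 'service', or None."""
    if not PRINCIPAL_RE.match(principal):
        return None
    return "service" if "/" in principal.split("@")[0] else "user"

print(principal_kind("alice@EXAMPLE.COM"))              # user
print(principal_kind("hive/emr-header-1@EXAMPLE.COM"))  # service
```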
SSL authentication
| Parameter | Description |
|---|---|
| Truststore certificate file | The Truststore certificate file generated when SSL authentication is enabled, such as truststore.jks. |
| Truststore password | The password for the Truststore certificate file. |
| Keystore certificate file | The Keystore certificate file generated when SSL authentication is enabled, such as keystore.jks. |
| Keystore password | The password for the Keystore certificate file. |
Develop a sync task
Configure a single-table offline sync task
- Codeless UI: Configure a task in the codeless UI.
- Code editor: Configure a task in the code editor. For the full parameter reference and script examples, see Script reference below.
Configure a full-database offline read task
Script reference
Use the following script examples and parameter tables when configuring sync tasks in the code editor. For the general script format, see Configure a task in the code editor.
Reader script examples
Two read modes are available. All examples use "stepType": "hive" for the reader step.
Read using HDFS files (faster; no WHERE clause filtering; views not supported)
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
"datasource": "hive_not_ha_****",
"table": "part_table_1",
"readMode": "hdfs",
"partition": "pt1=a,pt2=b,pt3=c",
"column": ["id", "pt2", "pt1"],
"fileSystemUsername": "hdfs",
"hivePartitionColumn": [
{"type": "string", "value": "partition name 1"},
{"type": "string", "value": "partition name 2"}
],
"successOnNoFile": true
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {"hops": [{"from": "Reader", "to": "Writer"}]},
"setting": {
"errorLimit": {"record": ""},
"speed": {
"concurrent": 2,
"throttle": true,
"mbps": "12"
}
}
}
Read using Hive JDBC (supports WHERE clause and SQL queries; may trigger MapReduce)
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
"datasource": "hive_not_ha_****",
"table": "part_table_1",
"readMode": "jdbc",
"querySql": "select id,name,age from part_table_1 where pt2='B'",
"column": ["id", "name", "age"],
"where": "",
"session": ["mapred.task.timeout=600000"]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {"hops": [{"from": "Reader", "to": "Writer"}]},
"setting": {
"errorLimit": {"record": ""},
"speed": {
"concurrent": 2,
"throttle": true,
"mbps": "12"
}
}
}
Reader parameters
| Parameter | Required | Description |
|---|---|---|
| datasource | Yes | The data source name. Must match the name you registered. |
| table | Yes | The table to read. Case-sensitive. |
| column | Yes | The columns to read, for example ["id", "name"]. Cannot be empty. Supports column cropping, reordering, partition key columns, and constants. |
| readMode | No | Read mode: hdfs to read HDFS files directly, or jdbc to read through Hive JDBC. |
| partition | No | Partition filter for partitioned tables. Supports * as a wildcard for single-level partitions only — not for multi-level partitions. Not required for non-partitioned tables or JDBC mode. |
| querySql | No | A SQL query to use in JDBC mode, for example "SELECT id, name FROM table WHERE pt='2024-01-01'". |
| where | No | A WHERE clause for data filtering in JDBC mode. |
| session | No | Session-level Hive configuration for JDBC reads, for example ["SET hive.exec.parallel=true"]. |
| fileSystemUsername | No | The HDFS user for HDFS-mode reads. Defaults to the user set on the data source page. If anonymous login is configured, defaults to the admin account. Set this parameter when a permission issue occurs. |
| hivePartitionColumn | No | Partition field values to sync to the downstream destination. Switch to the code editor to configure. |
| successOnNoFile | No | In HDFS mode, whether the task runs normally when the HDFS directory is empty. |
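For instance, a reader parameter block using the single-level partition wildcard might look like the following fragment; the table and partition key names are illustrative, and whether pt1=* matches your data depends on the table's partition layout:

```json
"parameter": {
  "datasource": "hive_not_ha_****",
  "table": "part_table_1",
  "readMode": "hdfs",
  "partition": "pt1=*",
  "column": ["id", "pt1"]
}
```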
Writer script example
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
"datasource": "hive_ha_shanghai",
"table": "partitiontable2",
"partition": "year=a,month=b,day=c",
"column": ["id", "name", "age"],
"writeMode": "append",
"fileSystemUsername": "hdfs"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {"hops": [{"from": "Reader", "to": "Writer"}]},
"setting": {
"errorLimit": {"record": ""},
"speed": {
"concurrent": 2,
"throttle": true,
"mbps": "12"
}
}
}
Writer parameters
| Parameter | Required | Description |
|---|---|---|
| datasource | Yes | The data source name. Must match the name you registered. |
| table | Yes | The Hive table to write to. Case-sensitive. |
| column | Yes | The columns to write, for example ["id", "name"]. Cannot be empty. Column reordering is not supported by default — use enableColumnExchange to enable it for Text format. |
| writeMode | Yes | Controls the LOAD DATA INPATH behavior after writing to HDFS. Allowed values: truncate (clears existing data before loading), append (retains existing data), or any other value (writes to HDFS only, without loading into the Hive table). This is a high-risk parameter — verify the write directory and expected behavior before use. |
| hiveConfig | Yes | Hive connection settings for the LOAD DATA operation. Configure either hiveCommand (full path to the Hive client, used with hive -e) or jdbcUrl + username + password (Hive JDBC access). Also accepts advanced HDFS client parameters. |
| partition | No | The partition to write to. Required when writing to a partitioned table. |
| fileSystemUsername | No | The HDFS user for write operations. Defaults to the user set on the data source page. If anonymous login is configured, defaults to the admin account. |
| enableColumnExchange | No | Set to true to enable column reordering. Supported for Text format only. |
| nullFormat | No | Defines which string values are treated as null. For example, "nullFormat": "null" interprets the string null as a null value. Note that the string "null" is distinct from an actual null value. |
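In JDBC form, the hiveConfig parameter described above might look like the following fragment; the host, port, database, and credentials are placeholders:

```json
"hiveConfig": {
  "jdbcUrl": "jdbc:hive2://<host>:10000/<database>",
  "username": "<user>",
  "password": "<password>"
}
```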