
DataWorks: Hive

Last Updated: Mar 27, 2026

Connect DataWorks Data Integration to Apache Hive for bidirectional offline data synchronization. This topic explains how reading and writing work, which Hive versions are supported, and how to configure a data source and sync tasks.

Supported operations

Operation | Single-table sync | Full-database sync
Read (Hive as source) | Yes | Yes
Write (Hive as destination) | Yes | No

How it works

The Hive Reader plugin and Hive Writer plugin use the Hive Metastore to access table metadata — including the Hadoop Distributed File System (HDFS) file storage path, file format, and separator.

Reading

Two read modes are available. Choose based on your filtering requirements.

Mode | How it works | Supports WHERE filter | Performance
hdfs | Reads table data directly from HDFS files | No | Higher (bypasses the query engine)
jdbc | Connects to HiveServer2 through a Hive JDBC client | Yes | Lower (may trigger a MapReduce job)
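
The fragments below sketch how the two modes are typically distinguished in a job script; complete, runnable examples appear in the Script reference section of this topic. The table and column names are placeholders.

{
  "readMode": "hdfs",
  "table": "ods_demo",
  "column": ["id", "name"],
  "partition": "pt=20240101"
}

{
  "readMode": "jdbc",
  "table": "ods_demo",
  "column": ["id", "name"],
  "where": "pt = '20240101'"
}

In hdfs mode, filtering is limited to selecting partitions; in jdbc mode, the where value becomes part of the query that runs through HiveServer2.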

Writing

The Hive Writer plugin writes data to HDFS files, then runs a LOAD DATA SQL statement through a Hive JDBC client to load the data into the target Hive table. The underlying logic mirrors the HDFS Writer plugin — HDFS Writer parameters can be configured inside Hive Writer and are passed through directly.
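
Because HDFS Writer parameters pass through, file-format settings can be placed directly in the Hive Writer parameter block. The fragment below is a sketch only: it assumes the HDFS Writer parameters fileType and fieldDelimiter and a text-format target table, and the data source and table names are placeholders. Confirm the exact pass-through parameters against the HDFS Writer documentation for your DataWorks version.

{
  "datasource": "hive_writer_demo",
  "table": "ods_orders_text",
  "column": ["id", "name", "age"],
  "writeMode": "append",
  "fileType": "text",
  "fieldDelimiter": "\t"
}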

Supported versions

Standard releases: 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.3.5, 2.3.6, 2.3.7, 3.0.0, 3.1.0, 3.1.1, 3.1.2, 3.1.3

CDH releases (selected versions): 0.8.1-cdh4.0.0 through 3.1.1-cdh7.1.1, including CDH 4.x, CDH 5.x (up to 5.16.99), and CDH 6.x/7.x variants.

Full CDH version list

0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3
3.1.1-cdh7.1.1

Limits

Limit | Detail
Supported resource groups | Serverless resource groups (recommended) and exclusive resource groups for Data Integration
Readable file formats | TextFile, ORCFile, or ParquetFile
Authentication | Kerberos and SSL. If no authentication is required, select No Authentication for Authentication Options when adding the data source.
Hive views | Not supported in HDFS read mode
Multi-level partition wildcard | The * wildcard is supported for single-level partitions only

Temporary files during sync tasks

Offline sync tasks generate temporary files on the HDFS server. These files are deleted automatically after the task completes, but many tasks running concurrently can still push a directory toward the HDFS item limit before cleanup happens. Monitor the file count in the HDFS directory to prevent the file system from becoming unavailable.

The dfs.namenode.fs-limits.max-directory-items parameter controls the maximum number of files or directories a single HDFS directory can contain. The default is 1,048,576 and the range is 1–6,400,000. To avoid hitting this limit, increase this parameter value or delete unnecessary files.

Kerberos with split principals

If both HiveServer2 and the Hive Metastore have Kerberos enabled but use different principals, add the following to the extension parameters:

{
  "hive.metastore.kerberos.principal": "<your-metastore-principal>"
}

Supported field types

The following Hive data types are supported for offline reads.

Category | Hive data types
String | CHAR, VARCHAR, STRING
Integer | TINYINT, SMALLINT, INT, INTEGER, BIGINT
Floating-point | FLOAT, DOUBLE, DECIMAL
Date and time | TIMESTAMP, DATE
Boolean | BOOLEAN

Prerequisites

Before configuring a Hive data source, complete the setup steps for your deployment mode.

Alibaba Cloud instance mode

Select an Access Identity with the necessary OSS permissions for the tables you want to sync. Supported identities are Alibaba Cloud Account, RAM User, and RAM Role.

Important

The connectivity test does not verify data read and write permissions. Insufficient permissions will cause sync tasks to fail.

Connection string mode

DLF configuration

If your Hive data source is from EMR and uses DLF (Data Lake Formation) for metadata management, add the following to the Extension Parameters field:

{"dlf.catalog.id" : "my_catalog_xxxx"}

Replace my_catalog_xxxx with the value of dlf.catalog.id from your EMR Hive configuration.

High availability (HA) configuration

If the EMR Hive cluster has high availability (HA) enabled, turn on High-availability Mode and configure the HA settings in Extension Parameters. Get the configuration values from the EMR console by navigating to the target cluster and clicking Cluster Services in the Actions column.

{
  "dfs.nameservices": "testDfs",
  "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
  "dfs.namenode.rpc-address.testDfs.namenode1": "",
  "dfs.namenode.rpc-address.testDfs.namenode2": "",
  "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
  "fs.oss.accessKeyId": "<yourAccessKeyId>",
  "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
  "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
}
Set the two dfs.namenode.rpc-address values to the RPC addresses (host:port) of the corresponding NameNodes. The last three fs.oss.* fields are required only if the underlying storage is OSS.

OSS external table configuration

If the underlying storage is OSS:

  • Set defaultFS with the oss:// prefix, for example, oss://bucketName.

  • For OSS external tables, add the following to Extension Parameters:

    {
      "fs.oss.accessKeyId": "<yourAccessKeyId>",
      "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
      "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
    }
  • For OSS-HDFS external tables, use the OSS-HDFS endpoint format instead:

    {
      "fs.oss.accessKeyId": "<yourAccessKeyId>",
      "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
      "fs.oss.endpoint": "cn-<yourRegion>.oss-dls.aliyuncs.com"
    }

CDH mode

Register the CDH cluster with DataWorks before configuring a Hive data source in CDH mode.

Create a data source

Create a Hive data source in DataWorks before developing sync tasks. For the full procedure, see Data source management. Parameter descriptions are shown as inline hints on the configuration page.

The following tables describe parameters specific to each Authentication Options value.

Kerberos authentication

Parameter | Description
keytab file | The .keytab file generated when a service principal is registered in the Kerberos environment.
conf file | The Kerberos configuration file. The main files are krb5.conf (client and library settings: global defaults, realm configurations, domain mappings, logging) and kdc.conf (Key Distribution Center server settings: database location, log file location).
principal | The identity entity (a user or a service) with a unique name and an associated encryption key. User format: username@REALM. Service format: service/hostname@REALM.

SSL authentication

Parameter | Description
Truststore certificate file | The Truststore certificate file generated when SSL authentication is enabled, such as truststore.jks.
Truststore password | The password for the Truststore certificate file.
Keystore certificate file | The Keystore certificate file generated when SSL authentication is enabled, such as keystore.jks.
Keystore password | The password for the Keystore certificate file.

Develop a sync task

Configure a single-table offline sync task

Configure a full-database offline read task

See Offline sync task for an entire database.

Script reference

Use the following script examples and parameter tables when configuring sync tasks in the code editor. For the general script format, see Configure a task in the code editor.

Reader script examples

Two read modes are available. All examples use "stepType": "hive" for the reader step.

Read using HDFS files (faster; no WHERE clause filtering; views not supported)

{
  "type": "job",
  "steps": [
    {
      "stepType": "hive",
      "parameter": {
        "datasource": "hive_not_ha_****",
        "table": "part_table_1",
        "readMode": "hdfs",
        "partition": "pt1=a,pt2=b,pt3=c",
        "column": ["id", "pt2", "pt1"],
        "fileSystemUsername": "hdfs",
        "hivePartitionColumn": [
          {"type": "string", "value": "partition name 1"},
          {"type": "string", "value": "partition name 2"}
        ],
        "successOnNoFile": true
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "hive",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "version": "2.0",
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
  "setting": {
    "errorLimit": {"record": ""},
    "speed": {
      "concurrent": 2,
      "throttle": true,
      "mbps": "12"
    }
  }
}

Read using Hive JDBC (supports WHERE clause and SQL queries; may trigger MapReduce)

{
  "type": "job",
  "steps": [
    {
      "stepType": "hive",
      "parameter": {
        "datasource": "hive_not_ha_****",
        "table": "part_table_1",
        "readMode": "jdbc",
        "querySql": "select id,name,age from part_table_1 where pt2='B'",
        "column": ["id", "name", "age"],
        "where": "",
        "session": ["mapred.task.timeout=600000"]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "hive",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "version": "2.0",
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
  "setting": {
    "errorLimit": {"record": ""},
    "speed": {
      "concurrent": 2,
      "throttle": true,
      "mbps": "12"
    }
  }
}

Reader parameters

Parameter | Required | Description
datasource | Yes | The data source name. Must match the name you registered.
table | Yes | The table to read. Case-sensitive.
column | Yes | The columns to read, for example ["id", "name"]. Cannot be empty. Supports column cropping, reordering, partition key columns, and constants.
readMode | No | Read mode: hdfs to read HDFS files directly, or jdbc to read through Hive JDBC.
partition | No | Partition filter for partitioned tables. Supports * as a wildcard for single-level partitions only, not for multi-level partitions (see the sketch after this table). Not required for non-partitioned tables or JDBC mode.
querySql | No | A SQL query to use in JDBC mode, for example "SELECT id, name FROM table WHERE pt='2024-01-01'".
where | No | A WHERE clause for data filtering in JDBC mode.
session | No | Session-level Hive configuration for JDBC reads, for example ["SET hive.exec.parallel=true"].
fileSystemUsername | No | The HDFS user for HDFS-mode reads. Defaults to the user set on the data source page. If anonymous login is configured, defaults to the admin account. Set this parameter when a permission issue occurs.
hivePartitionColumn | No | Partition field values to sync to the downstream destination. Switch to the code editor to configure.
successOnNoFile | No | In HDFS mode, whether the task succeeds when the HDFS directory is empty.
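
For reference, the fragment below sketches a reader parameter block that uses the single-level partition wildcard. It assumes a table partitioned by a single column named pt; the data source and table names are placeholders.

{
  "datasource": "hive_demo",
  "table": "ods_orders",
  "readMode": "hdfs",
  "partition": "pt=*",
  "column": ["id", "name"],
  "successOnNoFile": true
}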

Writer script example

{
  "type": "job",
  "steps": [
    {
      "stepType": "hive",
      "parameter": {},
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "hive",
      "parameter": {
        "datasource": "hive_ha_shanghai",
        "table": "partitiontable2",
        "partition": "year=a,month=b,day=c",
        "column": ["id", "name", "age"],
        "writeMode": "append",
        "fileSystemUsername": "hdfs"
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "version": "2.0",
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
  "setting": {
    "errorLimit": {"record": ""},
    "speed": {
      "concurrent": 2,
      "throttle": true,
      "mbps": "12"
    }
  }
}

Writer parameters

Parameter | Required | Description
datasource | Yes | The data source name. Must match the name you registered.
table | Yes | The Hive table to write to. Case-sensitive.
column | Yes | The columns to write, for example ["id", "name"]. Cannot be empty. Column reordering is not supported by default; use enableColumnExchange to enable it for Text format.
writeMode | Yes | Controls the LOAD DATA INPATH behavior after writing to HDFS. Allowed values: truncate (clears existing data before loading), append (retains existing data), or any other value (writes to HDFS only, without loading into the Hive table). This is a high-risk parameter; verify the write directory and expected behavior before use.
hiveConfig | Yes | Hive connection settings for the LOAD DATA operation. Configure either hiveCommand (full path to the Hive client, used with hive -e) or jdbcUrl + username + password (Hive JDBC access). Also accepts advanced HDFS client parameters (see the sketch after this table).
partition | No | Partition for partitioned tables. Required when writing to a partitioned table.
fileSystemUsername | No | The HDFS user for write operations. Defaults to the user set on the data source page. If anonymous login is configured, defaults to the admin account.
enableColumnExchange | No | Set to true to enable column reordering. Supported for Text format only.
nullFormat | No | Defines which string values are treated as null. For example, "nullFormat": "null" interprets the string null as a null value. Note that the string "null" is distinct from an actual null value.
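
The writer script example above does not show hiveConfig. If your task needs it, the fragment below sketches the JDBC variant described in the table, assuming hiveConfig accepts a nested object with jdbcUrl, username, and password keys; the connection URL, credentials, and table names are placeholders.

{
  "datasource": "hive_ha_shanghai",
  "table": "partitiontable2",
  "partition": "year=a,month=b,day=c",
  "column": ["id", "name", "age"],
  "writeMode": "truncate",
  "hiveConfig": {
    "jdbcUrl": "jdbc:hive2://<host>:10000/<database>",
    "username": "<your-username>",
    "password": "<your-password>"
  }
}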

What's next