DataWorks offers Hive Reader and Hive Writer to facilitate data reading from and writing to Hive data sources. This topic describes the capabilities of data synchronization with Hive data sources.
Background information
Hive is a data warehouse tool built on Hadoop that processes vast amounts of structured logs. It maps structured data files to tables and enables SQL statement execution for data querying. At its core, Hive is an SQL parsing engine that translates HQL into MapReduce programs, which run on YARN, and uses HDFS for data storage.
Hive Reader connects to a Hive metastore to retrieve metadata for your Hive table. It can read data in two ways:
-
Reading Data Based on HDFS Files
Hive Reader connects to the Hive metastore, retrieves the storage path, file format, and column delimiter for the corresponding HDFS file, and then reads the data from it.
-
Reading Data Using Hive JDBC
The Hive Reader plugin uses the Hive JDBC client to connect to HiveServer2 and supports filtering data with WHERE clauses and reading data directly with SQL statements.
Hive Writer connects to the Hive metastore, obtains the storage path, file format, and column delimiter for the HDFS file, and writes data to it. It then uses the Hive JDBC client to execute the LOAD DATA SQL statement to load the data into Hive tables.
The underlying logic of Hive Writer is identical to that of HDFS Writer. You can configure HDFS Writer parameters within Hive Writer settings, and Data Integration will pass them on to HDFS Writer seamlessly.
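Because Hive Writer shares its underlying logic with HDFS Writer, HDFS Writer parameters can be placed directly in the Hive Writer parameter block and are passed through by Data Integration. The following snippet is a minimal sketch of this pass-through; the data source and table names are placeholders, and the parameter names fileType, fieldDelimiter, and compress are assumptions taken from HDFS Writer rather than settings confirmed for Hive Writer.
{
    "stepType": "hive",
    "parameter": {
        "datasource": "hive_datasource_name", // Hypothetical data source name.
        "table": "target_table",              // Hypothetical destination table.
        "column": ["id", "name"],
        "writeMode": "append",
        // HDFS Writer parameters passed through by Data Integration (assumed names; verify before use).
        "fileType": "text",
        "fieldDelimiter": "\t",
        "compress": "gzip"
    },
    "name": "Writer",
    "category": "writer"
}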
Supported versions
Hive Reader and Hive Writer support the following Hive versions:
0.8.0
0.8.1
0.9.0
0.10.0
0.11.0
0.12.0
0.13.0
0.13.1
0.14.0
1.0.0
1.0.1
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
2.0.1
2.1.0
2.1.1
2.2.0
2.3.0
2.3.1
2.3.2
2.3.3
2.3.4
2.3.5
2.3.6
2.3.7
3.0.0
3.1.0
3.1.1
3.1.2
3.1.3
0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3
3.1.1-cdh7.1.1
Limits
-
Hive data sources support serverless resource groups (recommended) and exclusive resource groups for Data Integration.
-
Hive Reader can only read files in text, ORC, and Parquet formats.
-
Batch synchronization tasks generate temporary files on the server when synchronizing data to a Hive data source. These files are automatically deleted after the task completes. Ensure the number of temporary files does not exceed the HDFS directory limit to prevent service unavailability.
Note
On the HDFS server, the dfs.namenode.fs-limits.max-directory-items parameter defines the maximum number of items (directories or files, non-recursive) that a single directory can contain. The default value is 1,048,576, and the valid range is 1 to 6,400,000. To prevent this issue, increase the value of dfs.namenode.fs-limits.max-directory-items in HDFS or delete files that are no longer needed.
-
Access to a Hive data source can be secured with Kerberos authentication or SSL authentication. If no identity authentication is needed, select No Authentication under Authentication Options when adding the data source.
-
When using Kerberos authentication for a Hive data source in DataWorks, if both HiveServer2 and the metastore require Kerberos but have different principals, configure the following settings in the advanced parameters:
{
"hive.metastore.kerberos.principal": "your metastore principal"
}
Supported field types
Batch data read
| Type classification | Hive data type |
| --- | --- |
| String | CHAR, VARCHAR, STRING |
| Integer | TINYINT, SMALLINT, INT, INTEGER, BIGINT |
| Floating point | FLOAT, DOUBLE, DECIMAL |
| Date and time | TIMESTAMP, DATE |
| Boolean | BOOLEAN |
Preparations before data synchronization
Preparation steps vary depending on the data source configuration mode:
Alibaba Cloud instance mode
To synchronize an OSS table under this instance, select the appropriate Access Identity. Options include Alibaba Cloud Account, Alibaba Cloud Sub-account, and RAM Role. Ensure the chosen access identity has the necessary OSS permissions, or data synchronization may fail.
Important
A connectivity test does not verify data read and write permissions.
Connection string mode
DLF configuration
If the metadata of your Hive data source is managed by Data Lake Formation (DLF), include the following in the Advanced Parameters when configuring the Hive data source:
{"dlf.catalog.id" : "my_catalog_xxxx"}
In this example, my_catalog_xxxx is the name specified by the dlf.catalog.id parameter in the EMR Hive configuration.
High availability configuration
For an EMR Hive cluster with High Availability enabled, configure the HA information in the Advanced Parameters when setting up the Hive data source. The format is as follows. Visit the EMR console to locate the cluster and click Cluster Service in the Actions column for the necessary configuration values.
{
//Advanced parameters related to high availability.
"dfs.nameservices":"testDfs",
"dfs.ha.namenodes.testDfs":"namenode1,namenode2",
"dfs.namenode.rpc-address.testDfs.namenode1": "",
"dfs.namenode.rpc-address.testDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
// If you use OSS as the underlying storage, you can configure the following parameters that are required for connecting to OSS in the advanced parameters.
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
}
Configuration of an OSS external table
When using OSS as the underlying storage, consider the following:
-
Set defaultFS to a value that starts with the oss:// prefix, such as oss://bucketName.
-
For OSS external tables, include OSS-related information in the Advanced Parameters when configuring the Hive data source.
{
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
}
-
For OSS-HDFS external tables, include OSS-HDFS-related information in the Advanced Parameters when configuring the Hive data source.
{
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"cn-<yourRegion>.oss-dls.aliyuncs.com"
}
Create a data source
When developing a data synchronization task, create the corresponding data source in DataWorks. For more information, see Resource Management. You can view the detailed description of each configuration parameter on the configuration page.
The following section details the parameters for different Authentication Options:
Kerberos authentication
| Parameter | Description |
| --- | --- |
| keytab file | The .keytab file that is generated when a service principal is registered in a Kerberos environment. |
| conf file | The Kerberos configuration file, which defines the settings of Kerberos clients and servers. The main configuration files are krb5.conf, the configuration file used by clients and libraries, which defines global default settings, realm configurations, domain mappings, application default settings, and logging options; and kdc.conf, the configuration file of the Key Distribution Center (KDC) server, which defines the database location, log file location, and other KDC-specific settings. |
| principal | An identity entity that can be a user or a service. It has a unique name and an associated encryption key. |
SSL authentication
| Parameter | Description |
| --- | --- |
| Truststore certificate file | The truststore certificate file that is generated when SSL authentication is enabled, such as truststore.jks. |
| Truststore password | The password that is set when the truststore certificate file is generated. |
| Keystore certificate file | The keystore certificate file that is generated when SSL authentication is enabled, such as keystore.jks. |
| Keystore password | The password that is set when the keystore certificate file is generated. |
Develop a data synchronization task
For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Appendix: code and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Code for Hive reader
Hive Reader can read data from HDFS files or use Hive JDBC:
-
Reading Data from HDFS Files
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
"partition": "pt1=a,pt2=b,pt3=c", // The partitions in the Hive table.
"datasource": "hive_not_ha_****", // The name of the data source.
"column": [ // The names of the columns from which you want to read data.
"id",
"pt2",
"pt1"
],
"readMode": "hdfs", // The mode in which Hive Reader reads data.
"table": "part_table_1",
"hdfsUsername" : "hdfs",
"hivePartitionColumn": [
{
"type": "string",
"value": "The name of the first partition."
},
{
"type": "string",
"value": "The name of the second partition."
}
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "" // The number of error records.
},
"speed": {
"concurrent": 2, // The maximum number of parallel threads.
"throttle": true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"mbps":"12"// The maximum transmission rate. Unit: MB/s.
}
}
}
-
Reading Data Using Hive JDBC
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
"querySql": "select id,name,age from part_table_1 where pt2='B'",
"datasource": "hive_not_ha_****", // The name of the data source.
"column": [ // The names of the columns from which you want to read data.
"id",
"name",
"age"
],
"where": "",
"table": "part_table_1",
"readMode": "jdbc" // The mode in which Hive Reader reads data.
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 2, // The maximum number of parallel threads.
"throttle": true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"mbps":"12"// The maximum transmission rate. Unit: MB/s.
}
}
}
Parameters in code for Hive reader
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | None |
| table | The name of the table from which you want to read data. Note: The name is case-sensitive. | Yes | None |
| readMode | The mode in which Hive Reader reads data. To read data based on HDFS files, set this parameter to "readMode":"hdfs". To read data by using Hive JDBC, set this parameter to "readMode":"jdbc". Note: If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause to filter data, but MapReduce tasks may be generated at the underlying layer of the Hive database, which results in low data reading efficiency. If Hive Reader reads data based on HDFS files, you cannot specify a WHERE clause, but Hive Reader directly reads the HDFS files that correspond to your Hive table, which improves data reading efficiency. Data of views cannot be read based on HDFS files. | No | None |
| partition | The partitions from which you want to read data. This parameter is not required if Hive Reader reads data by using Hive JDBC. If the Hive table is a partitioned table, you must configure the partition parameter, and the synchronization task reads data from the partitions that match this parameter. For single-level partitions, you can use asterisks (*) as wildcards; for multi-level partitions, wildcards are not supported. If the Hive table is not a partitioned table, you do not need to configure this parameter. | No | None |
| column | The names of the columns from which you want to read data, for example, "column": ["id", "name"]. You can select specific columns to read, read columns in an order different from the order specified in the table schema, read partition key columns, and configure constants. The column parameter is required and cannot be left empty. | Yes | None |
| querySql | If Hive Reader reads data by using Hive JDBC, you can configure the querySql parameter to read data with an SQL statement. | No | None |
| where | If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause to filter data. | No | None |
| hdfsUsername | If Hive Reader reads data based on HDFS files, the user that is specified on the Hive data source page is used by default. If anonymous logon is configured for the data source, the admin account is used by default. If a permission issue occurs during a synchronization task, configure the hdfsUsername parameter in script mode. | No | None |
| hivePartitionColumn | If you want to synchronize the values of partition key columns to the destination, configure the hivePartitionColumn parameter in script mode. | No | None |
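The following snippet is a minimal sketch of a reader parameter block that combines the partition wildcard with the hivePartitionColumn parameter. The data source, table, and partition key column names (hive_datasource_name, ods_log, and pt) are hypothetical placeholders, and the value field of hivePartitionColumn follows the example shown earlier in this topic.
{
    "stepType": "hive",
    "parameter": {
        "datasource": "hive_datasource_name", // Hypothetical data source name.
        "table": "ods_log",                   // Hypothetical single-level partitioned table.
        "readMode": "hdfs",                   // Read data based on HDFS files.
        "partition": "pt=*",                  // Wildcard that matches all values of a single-level partition.
        "column": ["id", "name"],
        "hdfsUsername": "hdfs",
        "hivePartitionColumn": [
            {
                "type": "string",
                "value": "pt"                 // Synchronize the values of the pt partition key column to the destination.
            }
        ]
    },
    "name": "Reader",
    "category": "reader"
}
In this sketch, the asterisk matches every value of the pt partition, and hivePartitionColumn sends the matched partition values to the destination alongside the regular columns.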
Code for Hive writer
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
"partition": "year=a,month=b,day=c", // The partition to which you want to write data.
"datasource": "hive_ha_shanghai", // The name of the data source.
"table": "partitiontable2", // The name of the table to which you want to write data.
"column": [ // The names of the columns to which you want to write data.
"id",
"name",
"age"
],
"writeMode": "append" ,// The write mode.
"hdfsUsername" : "hdfs"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"throttle":true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":2, // The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate. Unit: MB/s.
}
}
}
Parameters in code for Hive writer
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | None |
| column | The names of the columns to which you want to write data, for example, "column": ["id", "name"]. You can write data to some of the columns, but the column parameter cannot be left empty and the order of the columns cannot be changed. | Yes | None |
| table | The name of the Hive table to which you want to write data. Note: The name is case-sensitive. | Yes | None |
| partition | The partitions to which you want to write data. If the Hive table is a partitioned table, you must configure the partition parameter, and the synchronization task writes data to the partitions that match this parameter. If the Hive table is not a partitioned table, you do not need to configure this parameter. | No | None |
| writeMode | The mode in which Hive Writer loads data into the Hive table. After data is written to HDFS files, Hive Writer executes the LOAD DATA INPATH (overwrite) INTO TABLE statement to load the data into the Hive table. The writeMode parameter specifies the loading behavior: truncate clears existing data before loading, append retains existing data, and any other value writes data to HDFS files without loading it into the Hive table. Note: The writeMode parameter is important. Check the data write directory and the behavior specified by this parameter to avoid data loss. The loading behavior must be used together with the hiveConfig parameter. | Yes | None |
| hiveConfig | Advanced Hive parameters, including hiveCommand, jdbcUrl, username, and password. hiveCommand is the full path of the Hive client; Hive Writer runs hive -e through this client to execute the LOAD DATA INPATH statement that corresponds to the writeMode setting, and the client provides the Hive access information. jdbcUrl, username, and password specify the JDBC access information for Hive; Hive Writer connects through the Hive JDBC driver and then executes the LOAD DATA INPATH statement that corresponds to the writeMode setting. Example: "hiveConfig": {"hiveCommand": "", "jdbcUrl": "", "username": "", "password": ""}. Hive Writer uses an HDFS client to write data to HDFS files, and you can also configure advanced parameters for the HDFS client in the hiveConfig parameter. | Yes | None |
| hdfsUsername | When you write data to a Hive table, the user that is specified on the Hive data source page is used by default. If anonymous logon is configured for the data source, the admin account is used by default. If a permission issue occurs during a synchronization task, configure the hdfsUsername parameter in script mode. | No | None |
| enableColumnExchange | If this parameter is set to true, the column order can be changed. Note: Only the text format is supported. | No | None |
| nullFormat | Defines which string represents a null value. For example, if you set "nullFormat":"null" and a source value is the string "null", Data Integration treats it as a null field. Note: The string "null" (the characters n, u, l, l) is different from an actual null value. | No | None |
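The following snippet is a minimal sketch of a writer parameter block that combines several of the parameters above: writeMode set to truncate, hdfsUsername, nullFormat, enableColumnExchange, and hiveConfig with JDBC access information. The data source name, table name, partition value, and JDBC connection details are hypothetical placeholders, and the value format of enableColumnExchange should be verified in your environment.
{
    "stepType": "hive",
    "parameter": {
        "datasource": "hive_datasource_name", // Hypothetical data source name.
        "table": "dwd_orders",                // Hypothetical destination table.
        "partition": "pt=20240101",           // Hypothetical partition value.
        "column": ["id", "name", "age"],
        "writeMode": "truncate",              // Clear existing data before loading.
        "hdfsUsername": "hdfs",
        "nullFormat": "null",                 // Treat the string "null" in the source as a null field.
        "enableColumnExchange": "true",       // Allow the column order to be changed (text format only).
        "hiveConfig": {
            "jdbcUrl": "jdbc:hive2://<host>:10000/default", // JDBC access information for Hive.
            "username": "<hive_user>",
            "password": "<hive_password>"
        }
    },
    "name": "Writer",
    "category": "writer"
}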