DataWorks provides Hive Reader and Hive Writer for you to read data from and write data to Hive data sources. This topic describes the capabilities of synchronizing data from or to Hive data sources.
Background information
Hive is a Hadoop-based data warehouse tool that is used to process large amounts of structured data, such as logs. Hive maps structured data files to tables and allows you to execute SQL statements to query the data in those tables. Essentially, Hive is an engine that parses SQL statements. It uses MapReduce to analyze data at the underlying layer and uses Hadoop Distributed File System (HDFS) to store the processed data. Hive converts Hive Query Language (HQL) statements into MapReduce programs and runs the programs on YARN.
Hive Reader connects to a Hive metastore to obtain the metadata of your Hive table. Hive Reader can read data based on HDFS files or by using Hive Java Database Connectivity (JDBC):
Read data based on HDFS files
Hive Reader connects to a Hive metastore and obtains the storage path, format, and column delimiter of the HDFS file that corresponds to your Hive table. Then, Hive Reader reads data from the HDFS file.
Read data by using Hive JDBC
Hive Reader connects to HiveServer2 by using Hive JDBC to read data. Hive Reader allows you to specify WHERE clauses to filter data and execute SQL statements to read data.
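The choice between the two modes is made with the readMode parameter in the synchronization task script. The following fragment is a minimal sketch that only highlights this switch; the data source and table names are placeholders, and complete task scripts are provided in the appendix of this topic.
{
    "stepType": "hive",
    "parameter": {
        "datasource": "my_hive_source", // Placeholder data source name.
        "table": "ods_log", // Placeholder table name.
        "readMode": "hdfs" // Read data based on HDFS files. Set this value to "jdbc" to read data by using Hive JDBC, in which case the where or querySql parameter can be used to filter data.
    }
}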
Hive Writer connects to a Hive metastore and obtains the storage path, format, and column delimiter of the HDFS file to which you want to write data. Then, Hive Writer writes data to the HDFS file and executes the LOAD DATA statement on the Hive client by using JDBC to load the data from the HDFS file into the destination Hive table.
The underlying logic of Hive Writer is the same as that of HDFS Writer. You can configure parameters for HDFS Writer in the parameters of Hive Writer. Data Integration transparently transmits the configured parameters to HDFS Writer.
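For example, the following fragment is a hypothetical sketch of a Hive Writer configuration that passes HDFS Writer parameters through. The fieldDelimiter and compress parameters are HDFS Writer parameters that are used here only for illustration; verify their names and valid values against the HDFS Writer documentation before you use them.
{
    "stepType": "hive",
    "parameter": {
        "datasource": "my_hive_target", // Placeholder data source name.
        "table": "ods_log", // Placeholder destination table.
        "column": ["id", "name"],
        "writeMode": "append",
        "fieldDelimiter": "\t", // Assumed HDFS Writer parameter that Data Integration passes through.
        "compress": "gzip" // Assumed HDFS Writer parameter that Data Integration passes through.
    }
}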
Supported Hive versions
Hive versions supported by Hive Reader and Hive Writer
0.8.0
0.8.1
0.9.0
0.10.0
0.11.0
0.12.0
0.13.0
0.13.1
0.14.0
1.0.0
1.0.1
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
2.0.1
2.1.0
2.1.1
2.2.0
2.3.0
2.3.1
2.3.2
2.3.3
2.3.4
2.3.5
2.3.6
2.3.7
3.0.0
3.1.0
3.1.1
3.1.2
3.1.3
0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3
3.1.1-cdh7.1.1
Limits
Hive data sources support serverless resource groups (recommended) and exclusive resource groups for Data Integration.
Hive Reader can read files only in the text, Optimized Row Columnar (ORC), and Parquet formats.
When you run a batch synchronization task to synchronize data to a Hive data source, temporary files are generated on the server and are automatically deleted after the task finishes running. DataWorks cannot guarantee that the number of temporary files stays below the upper limit on the number of files that are allowed in an HDFS directory. Monitor this limit and make sure that the number of temporary files does not exceed it. This prevents the HDFS service from becoming unavailable because of excessive temporary files.
Note You can change the value of the dfs.namenode.fs-limits.max-directory-items parameter on the server to define the maximum number of non-recursive directories or files that a single HDFS directory can contain. Valid values: 1 to 6400000. Default value: 1048576. To prevent the number of generated temporary files from exceeding this limit, specify a larger value for the dfs.namenode.fs-limits.max-directory-items parameter or delete files that are no longer required from the directory.
Hive data sources support Kerberos authentication and SSL authentication. If you do not need to perform identity authentication for a Hive data source, set the Authentication Method parameter to No Authentication when you add the data source.
If a Hive data source is accessed by using a Kerberos-authenticated identity in DataWorks, Kerberos authentication is enabled for both HiveServer2 and the Hive metastore, and the two services use different principals, you must add the following configuration to the extended parameters:
{
"hive.metastore.kerberos.principal": "your metastore principal"
}
Data type mappings
Batch data read
Category | Hive data type |
String | CHAR, VARCHAR, and STRING |
Integer | TINYINT, SMALLINT, INT, INTEGER, and BIGINT |
Floating point | FLOAT, DOUBLE, and DECIMAL |
Date and time | TIMESTAMP and DATE |
Boolean | BOOLEAN |
Preparations before data synchronization
The preparations vary based on the data source configuration mode.
Alibaba Cloud Instance Mode
If you select this mode and want to synchronize Object Storage Service (OSS) tables in the selected instance, you must select an access identity, which can be an Alibaba Cloud account, a RAM user, or a RAM role. Make sure that the selected access identity has the required OSS permissions. Otherwise, data synchronization will fail due to insufficient read and write permissions.
Important A connectivity test does not verify the data read and write permissions.
Connection String Mode
Configuration for the use of DLF to manage Hive metadata
If you use Data Lake Formation (DLF) to manage metadata of Hive data sources, you need to add the following configuration in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source:
{"dlf.catalog.id" : "my_catalog_xxxx"}
In this configuration, my_catalog_xxxx is the value of the dlf.catalog.id parameter that is configured for Hive in EMR.
High availability configuration
If High Service Availability is turned on for the E-MapReduce (EMR) Hive cluster, you must configure items related to the high availability feature in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example. You can perform the following operations to obtain the values of the configuration items: Log on to the EMR console. Find the desired cluster and click Services in the Actions column.
{
// Advanced parameters related to high availability.
"dfs.nameservices":"testDfs",
"dfs.ha.namenodes.testDfs":"namenode1,namenode2",
"dfs.namenode.rpc-address.testDfs.namenode1": "",
"dfs.namenode.rpc-address.testDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
// If you use OSS as the underlying storage, you can configure the following parameters that are required for connecting to OSS in the advanced parameters. You can also use another service as the underlying storage.
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
}
Configuration of an OSS external table
If you use OSS as the underlying storage, you must take note of the following items:
The value of the defaultFS parameter must start with oss://. For example, the value can be in the oss://bucketName format.
If you use the Hive data source as the source of a data synchronization task and you want to synchronize data from an OSS external table, you also need to configure items related to OSS in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example:
{
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
}
If you use the Hive data source as the source of a data synchronization task and you want to synchronize data from an OSS-HDFS external table, you also need to configure items related to OSS-HDFS in the Extended parameters field of the Add Hive data source dialog box when you configure the Hive data source. The following code provides an example:
{
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"cn-<yourRegion>.oss-dls.aliyuncs.com"
}
CDH Cluster Mode
If you want to add a Hive data source in Cloudera's Distribution Including Apache Hadoop (CDH) mode, you must register a CDH cluster to DataWorks first. For more information, see Register a CDH or CDP cluster to DataWorks.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.
The following information describes parameter settings for different authentication methods.
Kerberos authentication
Parameter | Description |
Keytab File | The .keytab file that is generated after a service principal is registered in the Kerberos environment. |
CONF File | The Kerberos configuration files that are used to define settings for Kerberos clients and servers. Main configuration files:
krb5.conf: used by clients and libraries to define global default settings, realm configurations, domain name mappings, default application settings, and logging options.
kdc.conf: the configuration file of the Key Distribution Center (KDC) server, which defines database locations, log file locations, and other KDC-specific settings.
|
Principal | The identity entity, which can be a user or a service and has a unique name and associated encryption key. |
SSL authentication
Parameter | Description |
Truststore File | The truststore certificate file that is generated after SSL authentication is enabled. Example: truststore.jks. |
Truststore Password | The password specified when the truststore certificate file is generated. |
Keystore File | The keystore certificate file that is generated after SSL authentication is enabled. Example: keystore.jks. |
Keystore Password | The password specified when the keystore certificate file is generated. |
Develop a data synchronization task
For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a batch synchronization task to synchronize data of a single table
Configure synchronization settings to implement batch synchronization of all data in a database
For more information about the configuration procedure, see Configure a synchronization task in Data Integration.
Appendix: Code and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Code for Hive Reader
Hive Reader can read data based on HDFS files or by using Hive JDBC.
Read data based on HDFS files
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
"partition": "pt1=a,pt2=b,pt3=c", // The partitions.
"datasource": "hive_not_ha_****", // The name of the data source.
"column": [ // The names of the columns.
"id",
"pt2",
"pt1"
],
"readMode": "hdfs", // The mode in which Hive Reader reads data.
"table": "part_table_1",
"hdfsUsername" : "hdfs",
"hivePartitionColumn": [
{
"type": "string",
"value": "The name of the first partition."
},
{
"type": "string",
"value": "The name of the second partition."
}
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "" // The maximum number of dirty data records allowed.
},
"speed": {
"concurrent": 2, // The maximum number of parallel threads.
"throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"mbps":"12"// The maximum transmission rate.
}
}
}
Read data by using Hive JDBC
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
"querySql": "select id,name,age from part_table_1 where pt2='B'",
"datasource": "hive_not_ha_****", // The name of the data source.
"column": [ // The names of the columns.
"id",
"name",
"age"
],
"where": "",
"table": "part_table_1",
"readMode": "jdbc" // The mode in which Hive Reader reads data.
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 2, // The maximum number of parallel threads.
"throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"mbps":"12"// The maximum transmission rate.
}
}
}
Parameters in code for Hive Reader
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | No default value |
table | The name of the table from which you want to read data.
Note The name is case-sensitive. | Yes | No default value |
readMode | The mode in which Hive Reader reads data. If Hive Reader reads data based on HDFS files, set this parameter to hdfs. If Hive Reader reads data by using Hive JDBC, set this parameter to jdbc.
Note If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause to filter data. In this case, MapReduce tasks may be generated at the underlying layer of the Hive database, which results in low data reading efficiency. If Hive Reader reads data based on HDFS files, you cannot specify a WHERE clause to filter data. In this case, Hive Reader directly reads data from the HDFS file that corresponds to your Hive table, which improves data reading efficiency. If Hive Reader reads data based on HDFS files, data of views cannot be read.
| No | No default value |
partition | The partitions from which you want to read data. This parameter is not required if Hive Reader reads data by using Hive JDBC. If the Hive table from which you want to read data is a partitioned table, this parameter is required, and Hive Reader reads data only from the partitions that this parameter specifies. For single-level partitions, you can use asterisks (*) as wildcards. Wildcards are not supported for multi-level partitions. If the Hive table is a non-partitioned table, this parameter is not required. For an example that uses a wildcard, see the sketch after this table.
| No | No default value |
column | The names of the columns from which you want to read data. Separate the names with commas (,), such as "column": ["id", "name"]. You can read data from a subset of the columns and in an order that is different from the order specified in the schema of the table. You can also specify partition key columns and constants. The column parameter must explicitly specify all the columns that you want to synchronize and cannot be left empty.
| Yes | No default value |
querySql | If Hive Reader reads data by using Hive JDBC, you can configure the querySql parameter for Hive Reader to read data. | No | No default value |
where | If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause for Hive Reader to filter data. | No | No default value |
hdfsUsername | The account that is used when Hive Reader reads data from the Hive table. By default, if Hive Reader reads data based on HDFS files, the account that you configured when you add the Hive data source to DataWorks on the Data Source page is used to read data from the HDFS file. If you set the HIVE Login parameter to Anonymous when you add the Hive data source on the Data Source page, the admin account is automatically used when Hive Reader reads data from the HDFS file. If Hive Reader does not have permissions to read data during data synchronization, you must configure the hdfsUsername parameter for the related synchronization task in the code editor. | No | No default value |
hivePartitionColumn | If you want to synchronize the values of the partition fields to the destination, you can configure the hivePartitionColumn parameter for the related synchronization task in the code editor. | No | No default value |
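The following fragment is a minimal, hypothetical sketch that combines the partition wildcard and the hivePartitionColumn parameter described in the preceding table; the data source, table, and partition names are placeholders.
{
    "stepType": "hive",
    "parameter": {
        "datasource": "my_hive_source", // Placeholder data source name.
        "table": "ods_log_pt", // Placeholder single-level partitioned table.
        "readMode": "hdfs",
        "column": ["id", "name"],
        "partition": "pt=*", // Wildcard: read data from all partitions of the single-level partitioned table.
        "hivePartitionColumn": [ // Also synchronize the value of the pt partition field to the destination.
            {
                "type": "string",
                "value": "pt"
            }
        ]
    }
}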
Code for Hive Writer
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
"partition": "year=a,month=b,day=c", // The partition.
"datasource": "hive_ha_shanghai", // The name of the data source.
"table": "partitiontable2", // The name of the table.
"column": [ // The names of the columns.
"id",
"name",
"age"
],
"writeMode": "append" ,// The write mode.
"hdfsUsername" : "hdfs"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":2, // The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate.
}
}
}
Parameters in code for Hive Writer
Parameter | Description | Required | Default value |
datasource | The name of the data source. It must be the same as the name of the added data source. | Yes | No default value |
column | The names of the columns to which you want to write data. Separate the names with commas (,), such as "column": ["id", "name"]. You can write data to some of the columns. The column parameter must explicitly specify all the columns to which you want to write data and cannot be left empty. The order of the columns cannot be changed.
| Yes | No default value |
table | The name of the Hive table to which you want to write data.
Note The name is case-sensitive. | Yes | No default value |
partition | The partition to which you want to write data. If you want to write data to a partitioned Hive table, you must configure this parameter. After you configure this parameter, Hive Writer writes data to the partition that is specified by this parameter. If you want to write data to a non-partitioned table, you do not need to configure this parameter.
| No | No default value |
writeMode | The mode in which Hive Writer loads data into the Hive table. After data is written to an HDFS file, Hive Writer executes a LOAD DATA INPATH ... [OVERWRITE] INTO TABLE statement to load that data into the Hive table. This parameter specifies how the data is loaded. Valid values: truncate: Hive Writer deletes existing data before it loads data into the Hive table. append: Hive Writer retains the existing data and appends the new data to the Hive table. Other values: Hive Writer writes data to the HDFS file but does not load the data into the Hive table.
Note Before you configure this parameter, make sure that the value that you specify does not affect your business. Pay attention to the destination directory and the value of this parameter to prevent data from being unexpectedly deleted. This parameter must be used together with the hiveConfig parameter. | Yes | No default value |
hiveConfig | The extended parameters for Hive. You can configure the following parameters as extended parameters: hiveCommand, jdbcUrl, username, and password.
hiveCommand: the full path of the Hive client. After the hive -e command is run, the LOAD DATA INPATH statement is executed to load data in the mode that is specified by the writeMode parameter. The client that is specified by the hiveCommand parameter provides access information about Hive.
jdbcUrl, username, and password: the information that is required to connect to Hive by using JDBC. After Hive Writer connects to Hive by using JDBC, the LOAD DATA INPATH statement is executed to load data in the mode that is specified by the writeMode parameter.
"hiveConfig": {
"hiveCommand": "",
"jdbcUrl": "",
"username": "",
"password": ""
}
Hive Writer uses an HDFS client to write data to HDFS files. You can also use this parameter to configure advanced settings for the HDFS client. For an example, see the sketch after this table.
| Yes | No default value |
hdfsUsername | The account that is used when Hive Writer writes data to the Hive table. By default, when Hive Writer writes data to the Hive table, the account that you configured when you add the Hive data source to DataWorks on the Data Source page is used. If you set the HIVE Login parameter to Anonymous when you add the Hive data source on the Data Source page, the admin account is automatically used when Hive Writer writes data to the Hive table. If Hive Writer does not have permissions to write data during data synchronization, you must configure the hdfsUsername parameter for the related data synchronization task in the code editor. | No | No default value |
enableColumnExchange | Specifies whether to enable column order exchange. If this parameter is set to true, column order exchange is enabled.
Note Only the order of text columns can be changed. | No | No default value |
nullFormat | The string that represents a null value. You can use this parameter to define which string represents a null value. For example, if you set this parameter to null, Data Integration treats the string null as a null value.
Note The string NULL is different from a null value. Pay attention to the difference between them. | No | No default value |
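To tie the optional writer parameters together, the following fragment is a minimal, hypothetical sketch that loads data in truncate mode over a JDBC connection and defines the string that represents a null value. The JDBC URL, username, and password are placeholders; replace them with the connection information of your HiveServer2.
{
    "stepType": "hive",
    "parameter": {
        "datasource": "my_hive_target", // Placeholder data source name.
        "table": "partitiontable2",
        "partition": "year=a,month=b,day=c",
        "column": ["id", "name", "age"],
        "writeMode": "truncate", // Delete existing data before loading the new data.
        "hiveConfig": {
            "jdbcUrl": "jdbc:hive2://<yourHost>:10000/default", // Placeholder HiveServer2 JDBC connection string.
            "username": "<yourUsername>",
            "password": "<yourPassword>"
        },
        "nullFormat": "null", // The string null is treated as a null value.
        "hdfsUsername": "hdfs"
    }
}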