The Hive data source provides a bidirectional channel for reading data from and writing data to Hive. This topic describes the data synchronization capabilities that DataWorks supports for Hive.
Features
Apache Hive is a data warehouse tool built on Hadoop that is used for the statistical analysis of massive structured logs. Hive can map structured data files to a table and provides SQL query capabilities. Hive is essentially an SQL parsing engine: it uses MapReduce to analyze data, stores the processed data in the Hadoop Distributed File System (HDFS), converts HiveQL statements into MapReduce programs, and runs them on Yet Another Resource Negotiator (YARN).
The Hive Reader plugin accesses the Hive Metastore service to obtain the metadata of your configured data tables. You can read data in two ways: using HDFS files or using Hive Java Database Connectivity (JDBC).
Read data using HDFS files
The Hive Reader plugin accesses the Hive Metastore service to parse information about your configured data table, such as the HDFS file storage path, file format, and separator. Then, the plugin reads the table data from Hive by reading the HDFS files.
Read data using Hive JDBC
The Hive Reader plugin connects to the HiveServer2 service through a Hive JDBC client to read data. The Hive Reader plugin supports data filtering using where clauses and lets you read data directly using SQL statements.
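For reference, the following is a minimal sketch of the reader parameter block for each read mode. The data source name hive_test_source, the table user_log, and the column and partition names are hypothetical placeholders; complete job demos are provided in Appendix: Script demo and parameters.
// HDFS mode: the table data is read from the underlying HDFS files.
{
    "datasource": "hive_test_source", // Hypothetical data source name.
    "table": "user_log",              // Hypothetical table name.
    "readMode": "hdfs",
    "column": ["id", "name"],
    "partition": "pt=a"               // Partition to read if the table is partitioned.
}
// JDBC mode: the data is read through HiveServer2 and can be filtered.
{
    "datasource": "hive_test_source",
    "table": "user_log",
    "readMode": "jdbc",
    "column": ["id", "name"],
    "where": "pt = 'a'"               // Filter condition, supported in jdbc mode.
}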
The Hive Writer plugin accesses the Hive Metastore service to parse information about your configured data table, such as the HDFS file storage path, file format, and separator. The plugin writes data to HDFS files and then runs the LOAD DATA SQL statement through a Hive JDBC client to load the data from the HDFS files into the Hive table.
The underlying logic of the Hive Writer plugin is the same as that of the HDFS Writer plugin. You can configure HDFS Writer-related parameters in the Hive Writer plugin. These parameters are passed to the HDFS Writer plugin.
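Because these parameters are passed through, you can place HDFS Writer settings directly in the Hive Writer parameter block. The following is a minimal sketch under that assumption; the parameter names fileType and fieldDelimiter come from the HDFS Writer plugin, and all names and values shown here are hypothetical, so verify them against the HDFS Writer documentation before use.
{
    "datasource": "hive_test_source", // Hypothetical data source name.
    "table": "user_log",              // Hypothetical destination table.
    "column": ["id", "name"],
    "writeMode": "append",
    // HDFS Writer-related parameters passed through to the underlying plugin (assumed names; verify before use).
    "fileType": "text",               // Format of the HDFS files that are written.
    "fieldDelimiter": "\u0001"        // Column separator used in the written files.
}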
Supported versions
Limits
The Hive data source supports serverless resource groups (recommended) and exclusive resource groups for Data Integration.
You can read only files in TextFile, ORCFile, or ParquetFile format.
When you use Data Integration to run an offline synchronization task to a Hive cluster, temporary files are generated on the server. These files are automatically deleted after the sync task is complete. To prevent the file system from becoming unavailable, make sure that the number of files in the HDFS directory on the server does not exceed the directory limit. DataWorks does not guarantee that the number of files remains within the HDFS directory limit.
Note: On the server, the dfs.namenode.fs-limits.max-directory-items parameter defines the maximum number of directories or files that a single directory can contain. The default value is 1,048,576, and the value can range from 1 to 6,400,000. To prevent this issue, increase the value of this HDFS parameter or delete unnecessary files.
You can use Kerberos identity authentication or SSL authentication to access a Hive data source. If identity authentication is not required, select No Authentication for Authentication Options when you add the data source.
When you use Kerberos authentication to access a Hive data source in DataWorks, if both HiveServer2 and the metastore have Kerberos authentication enabled but use different principals, you must add the following configuration to the extended parameters:
{ "hive.metastore.kerberos.principal": "<your metastore principal>" }
Supported field types
The following table lists the field types supported by the Hive data source for offline reads.
Category | Hive data type |
String | CHAR, VARCHAR, STRING |
Integer | TINYINT, SMALLINT, INT, INTEGER, BIGINT |
Floating-point | FLOAT, DOUBLE, DECIMAL |
Date and time | TIMESTAMP, DATE |
Boolean | BOOLEAN |
Preparations
The required preparations vary based on the data source configuration mode.
Alibaba Cloud instance mode
If the tables that you want to synchronize in this instance are stored in OSS, select the corresponding Access Identity. The supported access identities are Alibaba Cloud Account, RAM User, and RAM Role. Make sure that the selected access identity has the required OSS permissions. Otherwise, the data synchronization fails due to insufficient read and write permissions.
The connectivity test does not verify data read and write permissions.
Connection string mode
DLF configuration
If your Hive data source is from EMR and uses DLF for metadata management, you must add the following content to the Extended Parameters field:
{"dlf.catalog.id" : "my_catalog_xxxx"}In the code, my_catalog_xxxx is the name that corresponds to the dlf.catalog.id parameter in your EMR Hive configuration.
High availability (HA) configuration
If the EMR Hive cluster that you want to sync has High Availability enabled, you must enable High-availability Mode and configure the relevant HA information in the Extended Parameters section in the following format. You can go to the EMR console, find the target cluster, and then click Cluster Services in the Actions column to obtain the relevant configuration values.
{
// The following code provides an example of HA configurations.
"dfs.nameservices":"testDfs",
"dfs.ha.namenodes.testDfs":"namenode1,namenode2",
"dfs.namenode.rpc-address.testDfs.namenode1": "",
"dfs.namenode.rpc-address.testDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
// (Optional) If the underlying storage is OSS, you must configure the following parameters in the extended parameters to connect to the OSS service.
"fs.oss.accessKeyId":"<yourAccessKeyId>",
"fs.oss.accessKeySecret":"<yourAccessKeySecret>",
"fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com"
}
OSS external table configuration
If the underlying storage is OSS, note the following:
The defaultFS configuration must be prefixed with oss://, for example, oss://bucketName.
If the table that you want to sync is an OSS external table, you must enter the OSS-related information in the Extended Parameters field when you configure the Hive data source.
{ "fs.oss.accessKeyId":"<yourAccessKeyId>", "fs.oss.accessKeySecret":"<yourAccessKeySecret>", "fs.oss.endpoint":"oss-cn-<yourRegion>-internal.aliyuncs.com" }If the table to sync is an OSS-HDFS external table, you must configure the related OSS-HDFS information in the Extension Parameters field when you configure the Hive data source.
{ "fs.oss.accessKeyId":"<yourAccessKeyId>", "fs.oss.accessKeySecret":"<yourAccessKeySecret>", "fs.oss.endpoint":"cn-<yourRegion>.oss-dls.aliyuncs.com" }
CDH mode
To use the CDH mode to configure a Hive data source, you must register the CDH cluster with DataWorks.
Create a data source
When you develop a data synchronization task, you must create a corresponding data source in DataWorks. For more information about the procedure, see Data source management. For detailed explanations of the configuration parameters, see the text prompts for each parameter on the configuration page.
The following describes the parameters for different Authentication Options:
Kerberos authentication
Parameter | Description |
keytab file | The .keytab file generated when a service principal is registered in the Kerberos environment. |
conf file | The Kerberos configuration file generated in the Kerberos environment, typically the krb5.conf file. |
principal | An identity entity, which can be a user or a service. It has a unique name and an associated encryption key. |
SSL authentication
Parameter | Description |
Truststore certificate file | The Truststore certificate file generated when SSL authentication is enabled, such as the truststore.jks file. |
Truststore password | The password set for the Truststore certificate file when SSL authentication is enabled. |
Keystore certificate file | The Keystore certificate file generated when SSL authentication is enabled, such as the keystore.jks file. |
Keystore password | The password set for the Keystore certificate file when SSL authentication is enabled. |
Develop a data synchronization task
For information about where and how to configure a synchronization task, see the following configuration guides.
Configure an offline sync task for a single table
For more information about the procedure, see Configure a task in the codeless UI and Configure a task in the code editor.
For all parameters and a script demo for the code editor, see Appendix: Script demo and parameters.
Configure an offline read sync task for an entire database
For more information about the procedure, see Offline sync task for an entire database.
Appendix: Script demo and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configuration in the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script demo
You can read data using HDFS files or Hive JDBC:
Read data using HDFS files
{ "type": "job", "steps": [ { "stepType": "hive", "parameter": { "partition": "pt1=a,pt2=b,pt3=c", // Partition information "datasource": "hive_not_ha_****", // Data source name "column": [ // Columns to read "id", "pt2", "pt1" ], "readMode": "hdfs", // Read mode "table": "part_table_1", "fileSystemUsername" : "hdfs", "hivePartitionColumn": [ { "type": "string", "value": "partition name 1" }, { "type": "string", "value": "partition name 2" } ], "successOnNoFile":true }, "name": "Reader", "category": "reader" }, { "stepType": "hive", "parameter": { }, "name": "Writer", "category": "writer" } ], "version": "2.0", "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] }, "setting": { "errorLimit": { "record": "" // Number of error records }, "speed": { "concurrent": 2, // Job concurrency "throttle": true,// If throttle is set to false, the mbps parameter does not take effect, which means that traffic is not throttled. If throttle is set to true, traffic is throttled. "mbps":"12"// Throttling } } }Read data using Hive JDBC
{ "type": "job", "steps": [ { "stepType": "hive", "parameter": { "querySql": "select id,name,age from part_table_1 where pt2='B'", "datasource": "hive_not_ha_****", // Data source name "session": [ "mapred.task.timeout=600000" ], "column": [ // Columns to read "id", "name", "age" ], "where": "", "table": "part_table_1", "readMode": "jdbc" // Read mode }, "name": "Reader", "category": "reader" }, { "stepType": "hive", "parameter": { }, "name": "Writer", "category": "writer" } ], "version": "2.0", "order": { "hops": [ { "from": "Reader", "to": "Writer" } ] }, "setting": { "errorLimit": { "record": "" }, "speed": { "concurrent": 2, // Job concurrency "throttle": true,// If throttle is set to false, the mbps parameter does not take effect, which means that traffic is not throttled. If throttle is set to true, traffic is throttled. "mbps":"12"// Throttling } } }
Reader script parameters
Parameter | Description | Required | Default value |
datasource | The name of the data source. The name must be the same as the one you added. | Yes | None |
table | The name of the table to synchronize. Note: The name is case-sensitive. | Yes | None |
readMode | The read mode. Set this parameter to hdfs to read the table data from HDFS files, or to jdbc to read data through Hive JDBC. Note: Filtering with a where clause and reading data directly by using SQL statements are supported only in jdbc mode. | No | None |
partition | The partition information of the Hive table from which data is read, for example, pt1=a,pt2=b,pt3=c. | No | None |
session | The session-level configuration for Hive JDBC reads. You can set client parameters. For example, SET hive.exec.parallel=true | No | None |
column | The columns to read, for example, ["id", "name", "age"]. | Yes | None |
querySql | When you read data using Hive JDBC, you can directly configure the querySql parameter to read data. | No | None |
where | When you read data using Hive JDBC, you can set the where parameter to filter data. | No | None |
fileSystemUsername | When you read data using HDFS, the user configured on the Hive data source page is used by default. If anonymous logon is configured on the data source page, the admin account is used by default. If a permission issue occurs during the sync task, you must switch to the code editor and configure the fileSystemUsername parameter. | No | None |
hivePartitionColumn | If you want to synchronize the values of partition fields to a downstream destination, switch to the code editor and configure the hivePartitionColumn parameter, as shown in the sketch after this table. | No | None |
successOnNoFile | Specifies whether the sync task succeeds if the directory to read is empty when data is read in HDFS mode. | No | None |
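For example, the following sketch shows a reader parameter block that uses hivePartitionColumn to append the values of the partition fields pt1 and pt2 to the output, together with successOnNoFile. The data source name, table name, and partition names are hypothetical placeholders.
{
    "datasource": "hive_test_source", // Hypothetical data source name.
    "table": "user_log",              // Hypothetical partitioned table.
    "readMode": "hdfs",
    "partition": "pt1=a,pt2=b",
    "column": ["id", "name"],
    "hivePartitionColumn": [          // Append the values of these partition fields to the output.
        { "type": "string", "value": "pt1" },
        { "type": "string", "value": "pt2" }
    ],
    "successOnNoFile": true           // Let the task succeed even if the directory to read is empty.
}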
Writer script demo
{
"type": "job",
"steps": [
{
"stepType": "hive",
"parameter": {
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hive",
"parameter": {
"partition": "year=a,month=b,day=c", // Partition configuration
"datasource": "hive_ha_shanghai", // Data source
"table": "partitiontable2", // Destination table
"column": [ // Column configuration
"id",
"name",
"age"
],
"writeMode": "append" ,// Write mode
"fileSystemUsername" : "hdfs"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"throttle":true,// If throttle is set to false, the mbps parameter does not take effect, which means that traffic is not throttled. If throttle is set to true, traffic is throttled.
"concurrent":2, // Job concurrency.
"mbps":"12"// Throttling
}
}
}
Writer script parameters
Parameter | Description | Required | Default value |
datasource | The name of the data source. The name must be the same as the one you added. | Yes | None |
column | The columns to write, for example, ["id", "name", "age"]. | Yes | None |
table | The name of the Hive table to write to. Note: The name is case-sensitive. | Yes | None |
partition | The partition information of the Hive table to which data is written, for example, year=a,month=b,day=c. | No | None |
writeMode | The mode in which data is loaded into the Hive table. After the data is written to an HDFS file, the Hive Writer plugin runs the LOAD DATA SQL statement to load the data into the Hive table. The writeMode parameter specifies the data loading behavior. For example, append, which is used in the preceding demo, loads the data without deleting the existing data. Note: The writeMode parameter is a high-risk parameter. Pay attention to the write directory and the behavior of this parameter to prevent accidental data deletion. The data loading behavior must be used together with the hiveConfig parameter, so check your configuration carefully. | Yes | None |
hiveConfig | The advanced Hive extension parameters, including hiveCommand, jdbcUrl, username, and password. | Yes | None |
fileSystemUsername | When you write data to a Hive table, the user configured on the Hive data source page is used by default. If anonymous logon is configured on the data source page, the admin account is used by default. If a permission issue occurs during the sync task, you must switch to the code editor and configure the fileSystemUsername parameter. | No | None |
enableColumnExchange | If this parameter is set to true, column reordering is enabled. Note: Only the Text format is supported. | No | None |
nullFormat | Data Integration provides the nullFormat parameter to define which strings can be interpreted as null. For example, if you set nullFormat to the string "null", source values that equal "null" are interpreted as null values. Note: The string "null" (the four characters n, u, l, l) is different from an actual null value. See the sketch after this table. | No | None |
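As a closing illustration, the following sketch shows a writer parameter block that combines nullFormat and enableColumnExchange with the documented parameters. The data source name, table name, partition, and values are hypothetical placeholders; confirm the resulting behavior against your destination data before use.
{
    "datasource": "hive_test_source", // Hypothetical data source name.
    "table": "user_log",              // Hypothetical destination table.
    "partition": "pt=a",
    "column": ["id", "name"],
    "writeMode": "append",
    "nullFormat": "null",             // Strings equal to "null" are interpreted as null, per the nullFormat row above.
    "enableColumnExchange": true      // Enable column reordering; supported only for the Text format.
}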