DataWorks provides Hive Reader and Hive Writer for you to read data from and write data to Hive data sources. You can use the codeless user interface (UI) or code editor to configure data synchronization nodes for Hive data sources. This topic describes how to add a Hive data source.

Background information

Workspaces in standard mode support the data source isolation feature. You can add data sources separately for the development and production environments to isolate the data sources. This helps keep your data secure. For more information about the feature, see Isolate a data source in the development and production environments.
If you use Object Storage Service (OSS) as the underlying storage, you must take note of the following items:
  • The value of the defaultFS parameter must start with oss://. For example, the value can be oss://bucketName.
  • You must configure the parameters that are required for connecting to OSS in the advanced parameters. The following code provides an example:
    {
        "fs.oss.accessKeyId": "<yourAccessKeyId>",
        "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
        "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
    }
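
If you want to verify these settings outside DataWorks, the following minimal sketch shows how the same fs.oss.* keys can be applied to a Hadoop Configuration and used to list an OSS bucket. It assumes that the hadoop-aliyun OSS connector is on the classpath; the bucket name, credentials, and region are placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OssConnectivityCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The same keys that go into the advanced parameters of the data source.
            conf.set("fs.oss.accessKeyId", "<yourAccessKeyId>");
            conf.set("fs.oss.accessKeySecret", "<yourAccessKeySecret>");
            conf.set("fs.oss.endpoint", "oss-cn-<yourRegion>-internal.aliyuncs.com");
            // May be required if the connector does not register the oss:// scheme itself.
            conf.set("fs.oss.impl", "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem");

            // defaultFS must start with oss://, for example oss://bucketName.
            try (FileSystem fs = FileSystem.get(URI.create("oss://bucketName"), conf)) {
                // List the bucket root to confirm that the connection settings work.
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }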

Limits

  • You can use only exclusive resource groups for Data Integration to read data from or write data to Hive data sources. For more information about exclusive resource groups for Data Integration, see Create and use an exclusive resource group for Data Integration.
  • Hive data sources support only Kerberos authentication as a special authentication method. If you do not need to perform identity authentication for a Hive data source, set the Special Authentication Method parameter to None when you add the data source.
  • If Kerberos authentication is enabled for both HiveServer2 and the Hive metastore of a Hive data source that is accessed by using a Kerberos-authenticated identity in DataWorks, and the two services use different principals, you must add the following configuration to the extended parameters:
    {
        "hive.metastore.kerberos.principal": "your metastore principal"
    }
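
The following minimal sketch illustrates, outside DataWorks, how the two principals relate when HiveServer2 and the metastore use different Kerberos identities: the client logs on with a keytab, the metastore principal is set as a configuration key, and the HiveServer2 principal is appended to the JDBC URL. All principals, host names, and paths are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class DualPrincipalConnect {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            // Mirrors the extended parameter above: the metastore has its own principal.
            conf.set("hive.metastore.kerberos.principal", "metastore/_HOST@EXAMPLE.COM");
            UserGroupInformation.setConfiguration(conf);
            // Log on with the client keytab before opening any connection.
            UserGroupInformation.loginUserFromKeytab(
                    "hive/client@EXAMPLE.COM", "/path/to/client.keytab");

            // The HiveServer2 principal is appended to the JDBC URL.
            String url = "jdbc:hive2://hs2-host:10000/default;principal=hive/hs2-host@EXAMPLE.COM";
            try (Connection conn = DriverManager.getConnection(url)) {
                System.out.println("Connected: " + !conn.isClosed());
            }
        }
    }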

Add a Hive data source

  1. Go to the Data Source page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. After you select the region where the required workspace resides, find the workspace and click Data Integration in the Actions column.
    4. In the left-side navigation pane of the Data Integration page, choose Data Source > Data Sources to go to the Data Source page.
  2. On the Data Source page, click Create Data Source in the upper-left corner.
  3. In the Add data source dialog box, click Hive in the Big Data Storage section.
  4. In the Add Hive data source dialog box, configure the parameters.
    You can add a Hive data source in one of the following modes: Alibaba Cloud instance mode, connection string mode, or built-in mode of CDH.
    • The following parameters are available when you add a Hive data source in Alibaba Cloud instance mode:
      Data Source Type: The mode in which you want to add the data source to DataWorks. Set this parameter to Alibaba Cloud Instance Mode.
      Data Source Name: The name of the data source. The name can contain only letters, digits, and underscores (_), and must start with a letter.
      Data Source Description: The description of the data source. The description cannot exceed 80 characters in length.
      Environment: The environment in which the data source is used. Valid values: Development and Production.
        Note: This parameter is displayed only if the workspace is in standard mode.
      Region: The region where the data source resides.
      Cluster ID: The ID of the EMR cluster that you want to add to DataWorks as a data source. You can log on to the EMR console to view the ID.
      EMR instance account ID: The ID of the Alibaba Cloud account that is used to purchase the EMR cluster. To view the ID, log on to the Alibaba Cloud Management Console with the Alibaba Cloud account, move the pointer over the profile picture in the upper-right corner, and then select Security Settings.
      Database Name: The name of the Hive metadatabase that you want to access.
      HIVE Login: The mode in which you want to connect to the Hive metadatabase. Valid values: Login with username and password(LDAP) and Anonymous. If you select Login with username and password(LDAP), enter the username and password that you can use to connect to the Hive metadatabase in the Hive username and Hive password fields.
      Metadata Storage Type: The metadata storage type that you selected when you purchased the EMR cluster. Valid values: DLF and Hive MetaStore. The value DLF indicates that the metadata of your EMR cluster is stored in Alibaba Cloud Data Lake Formation (DLF). For information about Alibaba Cloud DLF, see Overview.
      Hive Version: The Hive version that you want to use.
      defaultFS: The address of the active NameNode in Hadoop Distributed File System (HDFS). Configure this parameter in the format of hdfs://<Host IP address>:<Port number>.
      Extended parameters: The advanced parameters of Hive, such as those related to high availability. The following code provides an example:
      {
          "dfs.nameservices": "testDfs",
          "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
          "dfs.namenode.rpc-address.testDfs.namenode1": "",
          "dfs.namenode.rpc-address.testDfs.namenode2": "",
          "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      }
      Special Authentication Method: Specifies whether to enable identity authentication. Default value: None. You can alternatively set this parameter to Kerberos Authentication. For more information about Kerberos authentication, see Configure Kerberos authentication.
      Keytab File: If you set the Special Authentication Method parameter to Kerberos Authentication, you must select the required keytab from the Keytab File drop-down list. If no keytab is available, you can click Add Authentication File to upload one.
      CONF File: If you set the Special Authentication Method parameter to Kerberos Authentication, you must select the required CONF file from the CONF File drop-down list. If no CONF file is available, you can click Add Authentication File to upload one.
      principal: The Kerberos principal. Specify this parameter in the format of <Principal name>/<Instance name>@<Domain name>, such as ****/hadoopclient@**.***.

    • The following parameters are available when you add a Hive data source in connection string mode:
      Data Source Type: The mode in which you want to add the data source to DataWorks. Set this parameter to Connection String Mode.
      Data Source Name: The name of the data source. The name can contain only letters, digits, and underscores (_), and must start with a letter.
      Data Source Description: The description of the data source. The description cannot exceed 80 characters in length.
      Environment: The environment in which the data source is used. Valid values: Development and Production.
        Note: This parameter is displayed only if the workspace is in standard mode.
      HIVE JDBC URL: The Java Database Connectivity (JDBC) URL of the Hive metadatabase. If you set Special Authentication Method to Kerberos Authentication, you must append the Kerberos principal that you specify to the value of this parameter. Example: jdbc:hive2://***.**.*.***:10000/default;principal=<your principal>. A minimal JDBC connection sketch is provided after this procedure.
      Database Name: The name of the Hive metadatabase. You can run the show databases command on the Hive client to query the created metadatabases.
      HIVE Login: The mode in which you want to connect to the Hive metadatabase. Valid values: Login with username and password(LDAP) and Anonymous. If you select Login with username and password(LDAP), enter the username and password that you can use to connect to the Hive metadatabase in the Hive username and Hive password fields.
      Metadata Storage Type: The metadata storage type that you selected when you purchased the EMR cluster. Valid values: DLF and Hive MetaStore. The value DLF indicates that the metadata of your EMR cluster is stored in Alibaba Cloud DLF. For information about Alibaba Cloud DLF, see Overview.
      Hive Version: The Hive version that you want to use.
      metastoreUris: The Uniform Resource Identifiers (URIs) of the Hive metastore. Configure this parameter in the format of thrift://ip1:port1,thrift://ip2:port2.
      defaultFS: The address of the active NameNode in HDFS. Configure this parameter in the format of hdfs://<Host IP address>:<Port number>.
      Extended parameters: The advanced parameters of Hive. The following code provides an example:
      {
          // Advanced parameters related to high availability.
          "dfs.nameservices": "testDfs",
          "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
          "dfs.namenode.rpc-address.testDfs.namenode1": "",
          "dfs.namenode.rpc-address.testDfs.namenode2": "",
          "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
          // If you use OSS as the underlying storage, configure the following parameters that are required for connecting to OSS. You can also use another service as the underlying storage.
          "fs.oss.accessKeyId": "<yourAccessKeyId>",
          "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
          "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
      }
      Special Authentication Method: Specifies whether to enable identity authentication. Default value: None. You can alternatively set this parameter to Kerberos Authentication. For more information about Kerberos authentication, see Configure Kerberos authentication. If you set this parameter to Kerberos Authentication, you must append the Kerberos principal that you specify to the value of the HIVE JDBC URL parameter. Example: jdbc:hive2://***.**.*.***:10000/default;principal=hive/**@**.***.
      Keytab File: If you set the Special Authentication Method parameter to Kerberos Authentication, you must select the required keytab from the Keytab File drop-down list. If no keytab is available, you can click Add Authentication File to upload one.
      CONF File: If you set the Special Authentication Method parameter to Kerberos Authentication, you must select the required CONF file from the CONF File drop-down list. If no CONF file is available, you can click Add Authentication File to upload one.
      principal: The Kerberos principal. Specify this parameter in the format of <Principal name>/<Instance name>@<Domain name>, such as ****/hadoopclient@**.***.

    • The following parameters are available when you add a Hive data source in built-in mode of CDH:
      Data Source Type: The mode in which you want to add the data source to DataWorks. Set this parameter to Built-in Mode of CDH.
      Data Source Name: The name of the data source. The name can contain only letters, digits, and underscores (_), and must start with a letter.
      Data Source Description: The description of the data source. The description cannot exceed 80 characters in length.
      Environment: The environment in which the data source is used. Valid values: Development and Production.
        Note: This parameter is displayed only if the workspace is in standard mode.
      Select CDH Cluster: The CDH cluster that you want to use.
      Special Authentication Method: Specifies whether to enable identity authentication. Default value: None. You can alternatively set this parameter to Kerberos Authentication. For more information about Kerberos authentication, see Configure Kerberos authentication.
      Keytab File: If you set the Special Authentication Method parameter to Kerberos Authentication, you must select the required keytab from the Keytab File drop-down list. If no keytab is available, you can click Add Authentication File to upload one.
      CONF File: If you set the Special Authentication Method parameter to Kerberos Authentication, you must select the required CONF file from the CONF File drop-down list. If no CONF file is available, you can click Add Authentication File to upload one.
      principal: The Kerberos principal. Specify this parameter in the format of <Principal name>/<Instance name>@<Domain name>, such as ****/hadoopclient@**.***.

  5. Set the Resource Group connectivity parameter to Data Integration.
  6. Find the desired resource group in the resource group list in the lower part of the dialog box and click Test connectivity in the Actions column.
    A synchronization node can use only one type of resource group. To ensure that your synchronization nodes can run normally, you must test the connectivity of all the resource groups for Data Integration on which your synchronization nodes will be run. If you want to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Establish a network connection between a resource group and a data source.
    Note
    • By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.
    • If you want to test the network connectivity between the shared resource group or a custom resource group and the data source, click Advanced below the resource group list. In the Warning message, click Confirm. Then, all available shared and custom resource groups appear in the resource group list.
  7. After the data source passes the connectivity test, click Complete.
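
For reference, the following minimal sketch shows how the connection parameters described in this procedure map to a standard Hive JDBC connection. It assumes that the Hive JDBC driver is on the classpath; the host, database, and credentials are placeholders that correspond to the HIVE JDBC URL, Database Name, Hive username, and Hive password parameters.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcCheck {
        public static void main(String[] args) throws Exception {
            // LDAP or anonymous login: pass the Hive username and password directly.
            // For Kerberos Authentication, append the principal to the URL instead:
            // jdbc:hive2://hive-host:10000/default;principal=hive/hive-host@EXAMPLE.COM
            String url = "jdbc:hive2://hive-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hiveUser", "hivePassword");
                 Statement stmt = conn.createStatement();
                 // The command that this topic mentions for listing metadatabases.
                 ResultSet rs = stmt.executeQuery("show databases")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }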

Hive versions supported by Hive Reader and Hive Writer

Hive Reader and Hive Writer support the following Hive versions:
0.8.0
0.8.1
0.9.0
0.10.0
0.11.0
0.12.0
0.13.0
0.13.1
0.14.0
1.0.0
1.0.1
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
2.0.1
2.1.0
2.1.1
2.2.0
2.3.0
2.3.1
2.3.2
2.3.3
2.3.4
2.3.5
2.3.6
2.3.7
3.0.0
3.1.0
3.1.1
3.1.2
0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3
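
If you are unsure which Hive version your cluster runs, the following minimal sketch reads the version from HiveServer2 through standard JDBC metadata so that you can compare it against the preceding list. It assumes a reachable HiveServer2 and the Hive JDBC driver on the classpath; the URL and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;

    public class HiveVersionCheck {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "hiveUser", "hivePassword")) {
                DatabaseMetaData meta = conn.getMetaData();
                // For Hive, the product version reflects the server's Hive release.
                System.out.println(meta.getDatabaseProductName()
                        + " " + meta.getDatabaseProductVersion());
            }
        }
    }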