DataWorks provides HDFS Reader and HDFS Writer for you to read data from and write data to HDFS data sources. You can use the codeless user interface (UI) or code editor to configure synchronization nodes for HDFS data sources.

Background information

Workspaces in standard mode support the data source isolation feature. You can add data sources separately for the development and production environments to isolate the data sources. This helps keep your data secure. For more information about the feature, see Isolate connections between the development and production environments.
If you use Object Storage Service (OSS) as the storage, you must take note of the following items:
  • The value of the defaultFS parameter must start with oss://. For example, the value can be `oss://IP:PORT` or `oss://nameservice`.
  • You must configure the parameters that are required for connecting to OSS in the advanced parameters of Hive. The following code provides an example:
    {
        "hadoopConfig": {
            "fs.oss.accessKeyId": "<yourAccessKeyId>",
            "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
            "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
        }
    }
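  In the code editor, HDFS Reader and HDFS Writer accept defaultFS and hadoopConfig as sibling parameters. The following snippet is a minimal sketch of how the two settings fit together for an OSS-backed source; the nameservice value and credentials are placeholders, not values from your environment:
    {
        "defaultFS": "oss://nameservice",
        "hadoopConfig": {
            "fs.oss.accessKeyId": "<yourAccessKeyId>",
            "fs.oss.accessKeySecret": "<yourAccessKeySecret>",
            "fs.oss.endpoint": "oss-cn-<yourRegion>-internal.aliyuncs.com"
        }
    }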

Procedure

  1. Go to the Data Source page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. After you select the region where the required workspace resides, find the workspace and click Data Integration in the Actions column.
    4. In the left-side navigation pane of the Data Integration page, choose Data Source > Data Sources to go to the Data Source page.
  2. On the Data Source page, click Add data source in the upper-right corner.
  3. In the Add data source dialog box, click HDFS in the Semi-structured storage section.
  4. In the Add HDFS data source dialog box, configure the parameters.
    You can add an HDFS data source in one of two modes: Connection string mode or Built-in Mode of CDH Cluster.
    • The following table describes the parameters you must configure if you add an HDFS data source by using Connection string mode.
      Data Source Name: The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.
      Data source description: The description of the data source. The description can be a maximum of 80 characters in length.
      Environment: The environment in which the data source is used. Valid values: Development and Production.
        Note: This parameter is displayed only if the workspace is in standard mode.
      DefaultFS: The address of the NameNode in HDFS. Specify this parameter in the format of hdfs://<Host IP address>:<Port number>.
      Connection Extension Parameters: The advanced parameter hadoopConfig for HDFS Reader and HDFS Writer. You can configure advanced Hadoop settings, such as those related to high availability. For a sample high-availability configuration, see the sketch after these parameter tables.
      Special Authentication Method: Specifies whether identity authentication is required. Default value: None. You can also set this parameter to Kerberos Authentication. For more information about Kerberos authentication, see Configure Kerberos authentication.
      Keytab File: If you set Special Authentication Method to Kerberos Authentication, you must select the desired keytab file from the Keytab File drop-down list. If no keytab file is available, click Add Authentication File to upload one.
      CONF File: If you set Special Authentication Method to Kerberos Authentication, you must select the desired CONF file from the CONF File drop-down list. If no CONF file is available, click Add Authentication File to upload one.
      Principal: The Kerberos principal. Specify this parameter in the format of Principal name/Instance name@Domain name, such as ****/hadoopclient@**.***.

    • The following table describes the parameters you must configure if you add an HDFS data source by using Built-in Mode of CDH Cluster.
      Data Source Name: The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.
      Data source description: The description of the data source. The description can be a maximum of 80 characters in length.
      Environment: The environment in which the data source is used. Valid values: Development and Production.
        Note: This parameter is displayed only if the workspace is in standard mode.
      DefaultFS: The address of the NameNode in HDFS. Specify this parameter in the format of hdfs://<Host IP address>:<Port number>.
      Connection Extension Parameters: The advanced parameter hadoopConfig for HDFS Reader and HDFS Writer. You can configure advanced Hadoop settings, such as those related to high availability. For a sample high-availability configuration, see the sketch after these parameter tables.
      Special Authentication Method: Specifies whether identity authentication is required. Default value: None. You can also set this parameter to Kerberos Authentication. For more information about Kerberos authentication, see Configure Kerberos authentication.
      Keytab File: If you set Special Authentication Method to Kerberos Authentication, you must select the desired keytab file from the Keytab File drop-down list. If no keytab file is available, click Add Authentication File to upload one.
      CONF File: If you set Special Authentication Method to Kerberos Authentication, you must select the desired CONF file from the CONF File drop-down list. If no CONF file is available, click Add Authentication File to upload one.
      Principal: The Kerberos principal. Specify this parameter in the format of Principal name/Instance name@Domain name, such as ****/hadoopclient@**.***.
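    For reference, the following snippet is a minimal sketch of a Connection Extension Parameters value for a high-availability cluster, following the same hadoopConfig shape as the OSS example in the Background information section. It uses the standard Hadoop HA client properties; the nameservice name, NameNode IDs, and host addresses are placeholders for your own cluster, and DefaultFS would then be set to hdfs://nameservice1:
      {
          "hadoopConfig": {
              "dfs.nameservices": "nameservice1",
              "dfs.ha.namenodes.nameservice1": "namenode1,namenode2",
              "dfs.namenode.rpc-address.nameservice1.namenode1": "<host1>:8020",
              "dfs.namenode.rpc-address.nameservice1.namenode2": "<host2>:8020",
              "dfs.client.failover.proxy.provider.nameservice1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
          }
      }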

  5. Set Resource Group connectivity to Data Integration.
  6. Find the desired resource group in the resource group list in the lower part of the dialog box and click Test connectivity in the Actions column.
    A synchronization node can use only one type of resource group. To ensure that your synchronization nodes can run normally, test the connectivity of every resource group for Data Integration on which your synchronization nodes will run. If you want to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Select a network connectivity solution.
    Note
    • By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.
    • If you want to test the network connectivity between the shared resource group or a custom resource group and the data source, click Advanced below the resource group list. In the Warning message, click Confirm. Then, all available shared and custom resource groups appear in the resource group list.
  7. After the data source passes the connectivity test, click Complete.

What to do next

You have learned how to add an HDFS data source and can now proceed to the subsequent tutorials, which describe how to configure HDFS Reader and HDFS Writer. For more information, see HDFS Reader and HDFS Writer.
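For orientation, the following snippet is a minimal sketch of what an HDFS Reader configuration for a comma-delimited text file can look like in the code editor. The path, column layout, and delimiter are placeholders, the surrounding node JSON is omitted, and the exact parameter set may differ by version; see the HDFS Reader topic for the authoritative reference:
    {
        "stepType": "hdfs",
        "parameter": {
            "defaultFS": "hdfs://<Host IP address>:<Port number>",
            "path": "/user/hive/warehouse/mytable/*",
            "fileType": "text",
            "fieldDelimiter": ",",
            "encoding": "UTF-8",
            "column": [
                { "index": 0, "type": "string" },
                { "index": 1, "type": "long" }
            ]
        }
    }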