Register a CDH or CDP cluster to DataWorks

Last Updated: Dec 18, 2023

You can register Cloudera's Distribution Including Apache Hadoop (CDH) clusters and Cloudera Data Platform (CDP) clusters to DataWorks. After registration, you can use the data development and governance features of DataWorks to manage CDH or CDP data. These features include task development, task scheduling, metadata management in Data Map, and Data Quality.

Background information

  • CDH is the open source platform distribution of Cloudera. CDH provides out-of-the-box features such as cluster management, cluster monitoring, and cluster diagnostics. CDH also supports a variety of components to help you run end-to-end big data workflows.

  • CDP is the data platform from Cloudera that succeeds CDH. It combines the capabilities of CDH and Hortonworks Data Platform (HDP) in a single distribution.

You can register CDH and CDP clusters to DataWorks. Then, you can use DataWorks features such as task development, task scheduling, metadata management in Data Map, and data quality monitoring to develop and manage data in the clusters based on your business requirements.

Prerequisites

Limits

  • Only exclusive resource groups for scheduling can be used to run CDH or CDP tasks.

  • You can register only clusters of CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, and CDP 7.1.7 to DataWorks.

  • You can register a CDH or CDP cluster to DataWorks only in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Zhangjiakou), and China (Chengdu).
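The limits above can be checked before you start the registration. The following Python snippet is a minimal sketch, not part of DataWorks: the function name `check_registration_limits` and the resource group label are hypothetical, and the version and region values are copied from the list above.

```python
# Supported versions and regions, copied from the limits listed above.
SUPPORTED_VERSIONS = {
    "CDH 5.16.2", "CDH 6.1.1", "CDH 6.2.1", "CDH 6.3.2", "CDP 7.1.7",
}
SUPPORTED_REGIONS = {
    "China (Beijing)", "China (Shanghai)", "China (Hangzhou)",
    "China (Shenzhen)", "China (Zhangjiakou)", "China (Chengdu)",
}
# Hypothetical label for the only resource group type that can run CDH or CDP tasks.
EXCLUSIVE_SCHEDULING = "exclusive resource group for scheduling"


def check_registration_limits(version, region, resource_group_type):
    """Return a list of problems; an empty list means no limit is violated."""
    problems = []
    if version not in SUPPORTED_VERSIONS:
        problems.append(f"unsupported cluster version: {version}")
    if region not in SUPPORTED_REGIONS:
        problems.append(f"unsupported region: {region}")
    if resource_group_type != EXCLUSIVE_SCHEDULING:
        problems.append(
            "CDH or CDP tasks require an exclusive resource group for scheduling"
        )
    return problems


if __name__ == "__main__":
    for problem in check_registration_limits(
        "CDH 6.3.2", "China (Beijing)", EXCLUSIVE_SCHEDULING
    ):
        print(problem)
```

An empty result means that the cluster version, region, and resource group type all fall within the documented limits.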

Step 1: Go to the cluster registration page

  1. Go to the Management Center page.

    Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane of the SettingCenter page, click Open Source Clusters. On the Open Source Clusters page, click Registering a cluster. In the dialog box that appears, click CDH to go to the cluster registration page.

Step 2: Register a CDH or CDP cluster

Note
  • If you use a workspace in standard mode, you must register the cluster in the development and production environments. For information about the modes of workspaces, see Differences between workspaces in basic mode and workspaces in standard mode.

  • The procedure of registering a CDP cluster to DataWorks is similar to the procedure of registering a CDH cluster to DataWorks. This topic describes how to register a CDH cluster to DataWorks.

  1. Configure the basic information about the cluster.

    Parameters and descriptions:

    • Cluster Display Name: The name of the cluster in DataWorks. The name must be unique within the current tenant.

    • Cluster Version: The version of the cluster that you want to register. Valid values: CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, and CDP 7.1.7. The required parameters vary based on the cluster version. You can view the parameters that you must configure in the DataWorks console.

    • Cluster Name: The name of the cluster that you want to register. This parameter is used to determine the source of the configuration information that is required when you register a cluster. You can select a cluster that is registered to another DataWorks workspace or create a cluster.

        • If you select a cluster that is registered to another DataWorks workspace, you can reference the configuration information about the cluster.

        • If you create a cluster, you must configure the cluster before you can register the cluster.

  2. Configure the cluster connection information.

    Select versions for required components that are deployed in the cluster based on your business requirements and enter the component addresses that you obtained. For more information about how to obtain component addresses, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.

  3. Add configuration files.

    You can upload configuration files of required components that are deployed in the cluster based on your business requirements. For more information about how to obtain configuration files, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.


  4. Configure the default access identity for the cluster.

    Configure the identity that is used to access the CDH cluster when you run CDH tasks in DataWorks. The supported identities vary based on the runtime environment.

    Note

    If the Default Access Identity parameter is set to a value other than Cluster Account, tasks fail to run in either of the following cases: the required account mapping is not configured, or the Mapping Type parameter is set to No Authentication.

    Default access identities by runtime environment:

    • Development environment

        • Cluster Account: A fixed cluster account is used to access the CDH cluster regardless of which account runs CDH tasks in DataWorks, for example, an Alibaba Cloud account or a RAM user that is assigned the Development role.

        • Cluster account mapped by task performer: You must configure a mapping between the DataWorks tenant member that runs CDH tasks and a specific cluster account. After the configuration is complete, the mapped cluster account is used to access the CDH cluster.

        References: Configure mappings between tenant member accounts and cluster accounts

    • Production environment

        • Cluster Account: A fixed cluster account is used to access the CDH cluster regardless of which account runs CDH tasks in DataWorks, for example, an Alibaba Cloud account or a RAM user that is assigned the Development role.

        • Cluster Account Mapped to Account of Node Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User: If you select one of these values for the Default Access Identity parameter, you must configure a mapping between the account that runs CDH tasks and a specific CDH cluster account. After the configuration is complete, the mapped CDH cluster account is used to run CDH tasks in DataWorks.

  5. Click Complete Registration. The CDH cluster is registered to DataWorks.
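The access identity rules in step 4 can be summarized as a small decision function. The following Python snippet is only a sketch of the behavior described in this topic, not DataWorks code. The function `resolve_cluster_account` and its parameter names are hypothetical. `dataworks_account` stands for whichever account the selected identity refers to, such as the task performer, the node owner, an Alibaba Cloud account, or a RAM user.

```python
from typing import Dict

CLUSTER_ACCOUNT = "Cluster Account"
NO_AUTHENTICATION = "No Authentication"


def resolve_cluster_account(
    default_access_identity: str,
    fixed_cluster_account: str,
    mapping_type: str,
    account_mappings: Dict[str, str],
    dataworks_account: str,
) -> str:
    """Return the cluster account a task would use, or raise if the task would fail."""
    if default_access_identity == CLUSTER_ACCOUNT:
        # A fixed cluster account is used regardless of who runs the task.
        return fixed_cluster_account
    # Any other identity relies on a mapping to a cluster account.
    if mapping_type == NO_AUTHENTICATION:
        raise RuntimeError("Mapping Type is No Authentication: the task fails")
    mapped = account_mappings.get(dataworks_account)
    if mapped is None:
        raise RuntimeError(
            f"no cluster account is mapped to {dataworks_account}: the task fails"
        )
    return mapped
```

For example, under these assumptions a call with the Default Access Identity set to Cluster Account always returns the fixed account, whereas any mapped identity returns the account found in `account_mappings` and raises an error if no mapping exists.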

Step 3: Initialize a resource group

You must initialize the resource group that you use in the following cases:

  • You register a CDH cluster to DataWorks for the first time.

  • The service configurations of your CDH cluster change. For example, you modify the core-site.xml configuration file of the cluster.

  • The version of a component in your CDH cluster is updated.

Initialization ensures that the resource group can access the CDH cluster and that CDH tasks run as expected with the current environment configurations of the resource group.

To initialize the resource group, go to the Open Source Clusters page in SettingCenter, find the CDH cluster that is registered to DataWorks, and then click Initialize Resource Group in the section that displays the information about the CDH cluster.

Note
  • DataWorks allows you to use only exclusive resource groups for scheduling to run CDH tasks. Therefore, you can select only an exclusive resource group for scheduling when you initialize a resource group.

  • If no exclusive resource group for scheduling is available, create an exclusive resource group for scheduling based on your business requirements. For more information about how to create an exclusive resource group for scheduling, see Create and use an exclusive resource group for scheduling.
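If you track cluster changes outside DataWorks, one way to notice when a resource group must be initialized again is to fingerprint the configuration files and component versions of the cluster. The following Python snippet is a minimal sketch under that assumption. The functions `fingerprint` and `needs_initialization` are hypothetical helpers, and the component version in the usage example is illustrative only. DataWorks itself does not expose such a check in this topic.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(config_files: Dict[str, bytes], component_versions: Dict[str, str]) -> str:
    """Hash configuration file contents and component versions into one digest."""
    digest = hashlib.sha256()
    for name in sorted(config_files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(config_files[name]).digest())
    for component in sorted(component_versions):
        digest.update(f"{component}={component_versions[component]}".encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def needs_initialization(previous: Optional[str], current: str) -> bool:
    """True on first registration (no previous fingerprint) or after any change."""
    return previous is None or previous != current


if __name__ == "__main__":
    # Illustrative values only: a file body and a component version.
    current = fingerprint({"core-site.xml": b"<configuration/>"}, {"Hive": "2.1.1"})
    print(needs_initialization(None, current))
```

Store the fingerprint after each initialization and compare it with a fresh one whenever the cluster is reconfigured or upgraded.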

What to do next

  • Configure identity mappings for the CDH cluster: If you set the Default Access Identity parameter to a value other than Cluster Account when you register the CDH cluster to DataWorks, you must configure identity mappings for the CDH cluster. The identity mappings are used to isolate and control the permissions on the CDH cluster in DataWorks.

  • Data development: You can create CDH Hive, CDH Spark, CDH MapReduce, CDH Impala, or CDH Presto nodes in DataStudio for data development. For more information, see Develop CDH nodes in DataWorks.