Use CDH or CDP clusters in DataWorks

You can register Cloudera's Distribution Including Apache Hadoop (CDH) clusters and Cloudera Data Platform (CDP) clusters to DataWorks. After a cluster is registered, you can use the data development and governance features of DataWorks to manage the data in the cluster. These features include task development, task scheduling, metadata management in Data Map, and Data Quality.

Background information

CDH is the open source platform distribution of Cloudera. CDH provides out-of-the-box features such as cluster management, cluster monitoring, and cluster diagnostics. CDH also supports a variety of components to help you run end-to-end big data workflows.
CDP is Cloudera's data platform and the successor to CDH. It combines technologies from CDH and Hortonworks Data Platform (HDP) in a single distribution.

You can register CDH and CDP clusters to DataWorks. Then, you can use DataWorks features such as task development, task scheduling, metadata management in Data Map, and data quality monitoring to develop and manage data in the clusters based on your business requirements.

Prerequisites

The identity that you want to use is prepared and granted the required permissions. Only the following identities can register a CDH or CDP cluster:
- An Alibaba Cloud account.
- A DataWorks workspace member that is assigned the Workspace Administrator role. For more information about how to assign roles to members, see Add a RAM user to a workspace as a member and assign roles to the member.
- A DataWorks workspace member to which the AliyunDataWorksFullAccess policy is attached. For information about how to grant permissions, see Grant permissions to a RAM user and Grant permissions to a RAM role. For information about how to add a user to a DataWorks workspace as a member, see Add a RAM user to a workspace as a member and assign roles to the member.
A CDH or CDP cluster is deployed, and the required configuration information about the cluster is obtained. For more information, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.

Limits

Only exclusive resource groups for scheduling can be used to run CDH or CDP tasks.
You can register only clusters of CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, and CDP 7.1.7 to DataWorks.
You can register a CDH or CDP cluster to DataWorks only in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Zhangjiakou), and China (Chengdu).
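The limits above can be checked before you open the console. The following is a minimal sketch, not a DataWorks API: the version and region values are copied from this section, while the function name and the "exclusive-scheduling" label for the resource group type are hypothetical names chosen for illustration.

```python
from typing import List

# Values copied from the Limits section of this topic.
SUPPORTED_VERSIONS = {"CDH 5.16.2", "CDH 6.1.1", "CDH 6.2.1", "CDH 6.3.2", "CDP 7.1.7"}
SUPPORTED_REGIONS = {
    "China (Beijing)", "China (Shanghai)", "China (Hangzhou)",
    "China (Shenzhen)", "China (Zhangjiakou)", "China (Chengdu)",
}
# Hypothetical label for an exclusive resource group for scheduling.
EXCLUSIVE_SCHEDULING = "exclusive-scheduling"


def registration_blockers(version: str, region: str, resource_group_type: str) -> List[str]:
    """Return the limits that a planned registration would violate (empty if none)."""
    problems = []
    if version not in SUPPORTED_VERSIONS:
        problems.append(f"unsupported cluster version: {version}")
    if region not in SUPPORTED_REGIONS:
        problems.append(f"unsupported region: {region}")
    if resource_group_type != EXCLUSIVE_SCHEDULING:
        problems.append("only exclusive resource groups for scheduling can run CDH or CDP tasks")
    return problems


if __name__ == "__main__":
    for problem in registration_blockers("CDH 6.3.2", "China (Shanghai)", EXCLUSIVE_SCHEDULING):
        print(problem)
```

Returning a list of problems instead of a single Boolean value lets a caller report every violated limit at once.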

Step 1: Go to the cluster registration page

Go to the Management Center page.
Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

In the left-side navigation pane of the SettingCenter page, click Open Source Clusters. On the Open Source Clusters page, click Registering a cluster. In the dialog box that appears, click CDH to go to the cluster registration page.

Step 2: Register a CDH or CDP cluster

Note

If you use a workspace in standard mode, you must register the cluster in the development and production environments. For information about the modes of workspaces, see Differences between workspaces in basic mode and workspaces in standard mode.
The procedure for registering a CDP cluster to DataWorks is similar to the procedure for registering a CDH cluster. This topic uses a CDH cluster as an example.

Configure the basic information about the cluster.

Parameter	Description
Cluster Display Name	The name of the cluster in DataWorks. The name must be unique within the current tenant.
Cluster Version	The version of the cluster that you want to register. Valid values: CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, and CDP 7.1.7. The required parameters vary based on the cluster version. You can view the parameters that you must configure in the DataWorks console.
Cluster Name	The name of the cluster that you want to register. This parameter is used to determine the source of the configuration information that is required when you register a cluster. You can select a cluster that is registered to another DataWorks workspace or create a cluster. If you select a cluster that is registered to another DataWorks workspace, you can reference the configuration information about the cluster. If you create a cluster, you must configure the cluster before you can register the cluster.

Configure the cluster connection information.
Select versions for required components that are deployed in the cluster based on your business requirements and enter the component addresses that you obtained. For more information about how to obtain component addresses, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.
Add configuration files.
You can upload configuration files of required components that are deployed in the cluster based on your business requirements. For more information about how to obtain configuration files, see Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.
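Before you upload a configuration file, it can help to confirm that the file is well formed and contains the properties that you expect. The following is a minimal sketch that assumes the standard Hadoop *-site.xml layout (a configuration root element that contains property elements with name and value children). The host name in the sample is a placeholder, and the function is illustrative rather than part of DataWorks.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def read_hadoop_properties(xml_text: str) -> Dict[str, str]:
    """Parse a Hadoop-style *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    if root.tag != "configuration":
        raise ValueError(f"unexpected root element: {root.tag}")
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Placeholder content; a real core-site.xml comes from your own cluster.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.internal:8020</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for key, value in read_hadoop_properties(SAMPLE).items():
        print(f"{key} = {value}")
```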

Configure the default access identity for the cluster.

Configure the identity that is used to access the CDH cluster when you run CDH tasks in DataWorks. The supported identities vary based on the runtime environment.

Note

If the Default Access Identity parameter is set to a value other than Cluster Account, but no required account mapping is configured or the Mapping Type parameter is set to No Authentication, tasks will fail to run.

Runtime environment	Default access identity	References
Development environment	Cluster Account: A fixed cluster account is used to access the CDH cluster, regardless of which identity runs CDH tasks in DataWorks (for example, an Alibaba Cloud account or a RAM user that is assigned the Development role). Cluster account mapped by task performer: You must configure a mapping between the DataWorks tenant member that runs CDH tasks and a specific cluster account. After the configuration is complete, the mapped cluster account is used to access the CDH cluster.	Configure mappings between tenant member accounts and cluster accounts
Production environment	Cluster Account: A fixed cluster account is used to access the CDH cluster, regardless of which identity runs CDH tasks in DataWorks (for example, an Alibaba Cloud account or a RAM user that is assigned the Development role). Cluster Account Mapped to Account of Node Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User: If you select one of these values for the Default Access Identity parameter, you must configure a mapping between the account that runs CDH tasks and a specific CDH cluster account. After the configuration is complete, the mapped CDH cluster account is used to run CDH tasks in DataWorks.
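The access-identity rules above can be summarized as a small decision function. This is an illustrative model only and does not reflect how DataWorks is implemented. The function name, the parameter names, and the sample account names are all hypothetical. Only the values "Cluster Account" and "No Authentication" come from this topic.

```python
from typing import Dict, Optional

CLUSTER_ACCOUNT = "Cluster Account"      # value named in this topic
NO_AUTHENTICATION = "No Authentication"  # Mapping Type value named in this topic


def resolve_cluster_account(
    default_identity: str,
    fixed_account: str,
    runner: str,
    mappings: Dict[str, str],
    mapping_type: Optional[str],
) -> str:
    """Return the cluster account a task would use, or raise if the task would fail."""
    if default_identity == CLUSTER_ACCOUNT:
        # A fixed account is used regardless of who runs the task.
        return fixed_account
    # Any other default identity requires a usable account mapping.
    if mapping_type is None or mapping_type == NO_AUTHENTICATION:
        raise RuntimeError("task fails: a mapping is required but the mapping type is not usable")
    if runner not in mappings:
        raise RuntimeError(f"task fails: no cluster account is mapped to {runner!r}")
    return mappings[runner]


if __name__ == "__main__":
    # Hypothetical accounts and mapping type, for illustration only.
    print(resolve_cluster_account(CLUSTER_ACCOUNT, "hive", "alice", {}, None))
    print(resolve_cluster_account(
        "Cluster account mapped by task performer",
        "hive", "alice", {"alice": "etl_alice"}, "ExampleAuth",
    ))
```

The model raises an error instead of falling back to the fixed account, which mirrors the note that tasks fail when a required mapping is missing.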

Click Complete Registration. The CDH cluster is registered to DataWorks.

Step 3: Initialize a resource group

You must initialize the resource group that you use in the following situations: the first time you register a CDH cluster to DataWorks, after the service configurations of the CDH cluster change (for example, after you modify the core-site.xml configuration file), and after the version of a component in the CDH cluster is updated. Initialization ensures that the resource group can access the CDH cluster and that CDH tasks run as expected with the current environment configurations of the resource group.

To initialize a resource group, go to the Open Source Clusters page in SettingCenter, find the CDH cluster that is registered to DataWorks, and then click Initialize Resource Group in the section that displays the information about the cluster.

Note

DataWorks allows you to use only exclusive resource groups for scheduling to run CDH tasks. Therefore, you can select only an exclusive resource group for scheduling when you initialize a resource group.
If no exclusive resource group for scheduling is available, create an exclusive resource group for scheduling based on your business requirements. For more information about how to create an exclusive resource group for scheduling, see Create and use an exclusive resource group for scheduling.
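Because a configuration change on the cluster requires you to initialize the resource group again, you may want a simple way to notice such changes. The following is a minimal sketch under stated assumptions: you keep your own copies of the cluster configuration files, and the helper names are hypothetical. It only detects changes. The initialization itself is still performed in the DataWorks console.

```python
import hashlib
from typing import Dict, List


def fingerprint(config_files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each configuration file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in config_files.items()}


def changed_files(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """List files that were added, removed, or modified between two fingerprints."""
    names = set(previous) | set(current)
    return sorted(name for name in names if previous.get(name) != current.get(name))


if __name__ == "__main__":
    # Placeholder file contents; real files come from your own cluster.
    at_last_init = fingerprint({"core-site.xml": b"<configuration/>"})
    now = fingerprint({"core-site.xml": b"<configuration><property/></configuration>"})
    changed = changed_files(at_last_init, now)
    if changed:
        print("Initialize the resource group again; changed files:", ", ".join(changed))
```

Comparing digests instead of file timestamps avoids false alarms when a file is rewritten without any change to its content.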

What to do next

Configure identity mappings for the CDH cluster: If you set the Default Access Identity parameter to a value other than Cluster Account when you register the CDH cluster to DataWorks, you must configure identity mappings for the CDH cluster. The identity mappings are used to isolate and control the permissions on the CDH cluster in DataWorks.
Data development: You can create CDH Hive, CDH Spark, CDH MapReduce, CDH Impala, or CDH Presto nodes in DataStudio for data development. For more information, see Develop CDH nodes in DataWorks.