Cloudera's Distribution Including Apache Hadoop (CDH) and Cloudera Data Platform (CDP) can be integrated into DataWorks. You can configure your CDH or CDP clusters as storage and compute engine instances in DataWorks and then use the data development and governance features of DataWorks, such as node development, node scheduling, metadata management in DataMap, and Data Quality, to manage CDH or CDP data. The operations required to integrate CDP into DataWorks and use it are similar to those for CDH. This topic describes how to integrate CDH into DataWorks and use CDH in DataWorks.

Prerequisites

  • A CDH cluster is deployed on an Elastic Compute Service (ECS) instance.

    The CDH cluster can also be deployed in an environment other than Alibaba Cloud ECS. In this case, you must ensure that the environment can connect to Alibaba Cloud, for example, over Express Connect or VPN Gateway.

  • DataWorks is activated, and a workspace is created. The workspace is used to connect to the CDH cluster.
    Note You do not need to associate compute engine instances with this workspace. Therefore, you do not need to select an engine when you create the workspace. For more information about how to create a workspace, see Create a workspace.
  • An account to which the workspace administrator role is assigned is created. Only such accounts can be used to associate CDH clusters with a DataWorks workspace. For more information about how to assign the workspace administrator role to an account, see Manage permissions on workspace-level services.
  • An exclusive resource group for scheduling is created in the DataWorks workspace. For more information, see Exclusive resource group mode.

After you complete the steps in this topic, you can develop and run CDH nodes in DataWorks and view the status of the nodes in DataWorks Operation Center. For more information, see Develop CDH nodes in DataWorks and Configure O&M and monitoring and alerting settings for a CDH node.

You can also use the Data Quality and DataMap services of DataWorks to manage CDH data and nodes. For more information, see Configure monitoring rules for a CDH node in Data Quality and Use DataMap to collect data of a CDH cluster.

Limits

  • To use the features of CDH in DataWorks, you must purchase and use a DataWorks exclusive resource group for scheduling.
  • The CDH cluster must be connected to the exclusive resource group for scheduling.
  • DataWorks supports CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, and CDH 6.3.2.

Step 1: Obtain the configuration information of the CDH cluster

  1. Obtain the version information of the CDH cluster. The version information is required when you add the configurations of the CDH cluster to DataWorks.
    Log on to the Cloudera Manager Admin Console. On the page that appears, you can view the version information next to the cluster name.
  2. Obtain the host and component addresses of the CDH cluster. The addresses are required when you add the configurations of the CDH cluster to DataWorks. You can use one of the following methods to obtain the addresses:
    • Method 1: Use the DataWorks JAR package.
      1. Log on to a host in the CDH cluster and download the DataWorks JAR package.
        wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar
      2. Run the JAR package.
        export PATH=$PATH:/usr/java/jdk1.8.0_181-cloudera/bin
        java -jar dw-tools.jar <user> <password>
        Set <user> and <password> to the username and password that are used to log on to the Cloudera Manager Admin Console.
      3. View the host and component addresses of the CDH cluster in the returned results. Then, record the addresses.
    • Method 2: Obtain the addresses from the Cloudera Manager Admin Console.
      Log on to the Cloudera Manager Admin Console and select Roles from the Hosts drop-down list. Find the components that you want to configure based on keywords and icons. Then, view and record the hostnames displayed on the left, and assemble the component addresses from the hostnames and the address format. For more information about the default port numbers in the addresses, see the returned results in Method 1. Components:
      • HS2: HiveServer2
      • HMS: Hive Metastore
      • ID: Impala Daemon
      • RM: YARN ResourceManager
  3. Obtain the configuration files of the CDH cluster. The configuration files must be uploaded when you add the configurations of the CDH cluster to DataWorks.
    1. Log on to the Cloudera Manager Admin Console.
    2. On the Status tab, click the drop-down arrow on the right of the cluster name and select View Client Configuration URLs.
    3. In the Client Configuration URLs dialog box, download the YARN configuration package.
  4. Obtain the network information of the CDH cluster. The network information is used to configure network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.
    1. Log on to the ECS console.
    2. In the left-side navigation pane, choose Instances & Images > Instances. In the top navigation bar, select the region where the ECS instance that hosts the CDH cluster resides. On the Instances page, find the ECS instance and click its ID. On the Instance Details tab of the page that appears, view the information about the instance, such as security group, VPC, and vSwitch. Then, record the information.
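To illustrate step 2, the recorded hostnames and the common CDH default ports can be assembled into component addresses similar to the following. The hostnames here are hypothetical; substitute the ones you recorded, and confirm the ports against the Method 1 output for your cluster.

```shell
# Hypothetical component addresses built from recorded hostnames and
# common CDH default ports. Replace the hostnames with your own values.
echo "jdbc:hive2://cdh-master-1:10000/"   # HS2: HiveServer2 JDBC URL
echo "thrift://cdh-master-1:9083"         # HMS: Hive Metastore URI
echo "jdbc:impala://cdh-worker-1:21050/"  # ID: Impala Daemon JDBC URL
echo "cdh-master-1:8032"                  # RM: yarn.resourcemanager.address
echo "cdh-master-1:8088"                  # RM web UI port, used later for jobhistory.webapp.address
```
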

Step 2: Configure network connectivity

By default, DataWorks exclusive resource groups for scheduling are not connected to the networks of resources for other Alibaba Cloud services after the resource groups are created. Therefore, before you use CDH, you must obtain the network information of your CDH cluster. Then, associate your DataWorks exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed. This ensures network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.

  1. Go to the network configuration page of the exclusive resource group for scheduling.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Resource Groups. The Exclusive Resource Groups tab appears.
    3. Find the desired exclusive resource group for scheduling and click Network Settings in the Actions column.
  2. Associate the exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed.
    On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, select the VPC, zone, vSwitch, and security group that you recorded in step 4 of Step 1.
  3. Configure hosts.
    Click the Hostname-to-IP Mapping tab. On this tab, click Batch Modify. In the Batch Modify Hostname-to-IP Mappings dialog box, enter the host addresses that you recorded in step 2 of Step 1.
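For reference, hostname-to-IP mappings follow the standard "IP hostname" format, one entry per line. The values below are hypothetical; use the private IP addresses and hostnames recorded from your own CDH cluster.

```shell
# Hypothetical hostname-to-IP mappings in the standard "IP hostname" format.
# Replace these with the values recorded from your CDH cluster.
cat <<'EOF'
192.168.0.10 cdh-master-1
192.168.0.11 cdh-worker-1
192.168.0.12 cdh-worker-2
EOF
```
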

Step 3: Add the configurations of the CDH cluster to DataWorks

Only workspace administrators can add the configurations of CDH clusters to DataWorks. Therefore, you must use an account to which the workspace administrator role is assigned to perform this operation.

  1. Go to the Workspace page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace, move the pointer over the More icon in the Actions column, and then select Workspace Settings.
  2. In the left-side navigation pane, choose Open Source Clusters > CDH Clusters.
  3. On the CDH Cluster Configuration page, click Create Now. In the Create CDH Cluster Configuration dialog box, enter the component addresses that you recorded in Step 1: Obtain the configuration information of the CDH cluster in the related fields.
    Configuration information:
    • Cluster name: the name of your CDH cluster. You can customize the name.
    • Versions: Select the CDH cluster version and component versions based on your business requirements.
    • Addresses: Enter the recorded component addresses.
      • jobhistory.webapp.address for YARN: Use the value of yarn.resourcemanager.address with the port number changed to 8088.
      • JDBC URL for Presto: Presto is not a default component for CDH. You must configure this parameter based on your business requirements.
  4. Upload configuration files and associate the CDH cluster with the workspace.
  5. Configure mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
    If you want to isolate permissions on the data that can be accessed by using different Alibaba Cloud accounts or RAM users in a CDH cluster, enable Kerberos Account Authentication and configure the mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
    Note Kerberos Account specifies an account that you use to access the CDH cluster. You can use the Sentry or Ranger component to configure different permissions for different Kerberos accounts in the CDH cluster to isolate permissions on data. The Alibaba Cloud accounts or RAM users that are mapped to the same Kerberos account have the same permissions on the data in the CDH cluster. Specify a Kerberos account (also referred to as a Kerberos principal) in the format of Instance name@Domain name, such as cdn_test@HADOOP.COM.
  6. Click Confirm.
    After the configurations of the CDH cluster are added to DataWorks, you can associate the CDH cluster with the workspace as a compute engine instance. Then, you can develop and run CDH nodes in the workspace.
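As a reference for step 5, a Kerberos principal in the Instance name@Domain name format splits into an instance name and a domain (realm). The sketch below uses the example principal from the note above; any real principal you configure must come from your own cluster.

```shell
# A Kerberos principal in the "Instance name@Domain name" format,
# using the example principal from the note above.
principal="cdn_test@HADOOP.COM"
echo "${principal%@*}"   # instance name: cdn_test
echo "${principal#*@}"   # domain (realm): HADOOP.COM
```
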

Step 4: Associate the CDH cluster with the workspace as a compute engine instance

  1. On the Workspaces page, click Workspace Settings in the Actions column that corresponds to the workspace.
  2. In the lower part of the Workspace Settings panel, click More. In the Compute Engine Information section of the Configuration page, click the CDH tab. On the CDH tab, click Add Instance. In the Add CDH Compute Engine dialog box, configure the parameters.
    You can set Access Mode to Shortcut mode or Security mode. If Security mode is selected, the permissions on the data of a CDH node that is run by using different Alibaba Cloud accounts or RAM users can be isolated. The parameters that need to be configured vary based on the value of the Access Mode parameter.
    • The following figure shows the parameters you must configure if you set Access Mode to Shortcut mode.
    • The following figure shows the parameters you must configure if you set Access Mode to Security mode.
    1. Configure Instance Display Name.
    2. Configure Access Mode.
      • Shortcut mode

        If this access mode is used, multiple Alibaba Cloud accounts or RAM users map to the same CDH cluster account. These Alibaba Cloud accounts or RAM users can access data within the same CDH cluster account. In this case, permissions on data are not isolated.

      • Security mode

        If this access mode is used, you can configure the mappings between the Alibaba Cloud accounts or RAM users and CDH cluster accounts to isolate the permissions on the data of a CDH node that is run by using the Alibaba Cloud accounts or RAM users.

    3. Select the CDH cluster whose configurations you added.
      If Shortcut mode is selected for Access Mode, you must select a CDH cluster whose Authentication Type is not set to Kerberos Account Authentication. If Security mode is selected for Access Mode, you must select a CDH cluster whose Authentication Type is set to Kerberos Account Authentication. For more information about how to check whether Kerberos Account Authentication is enabled for the CDH cluster, see Create a workspace.
    4. Configure access authentication information for the CDH cluster.
      • Shortcut mode

        You can use only the specified accounts, such as admin and hadoop. These accounts are used only to issue nodes.

      • Security mode
        You can configure Account for Scheduling Nodes based on your business requirements. This identity is used to automatically schedule and run a node after a CDH node is committed. You must configure mappings between the Alibaba Cloud accounts or RAM users and CDH cluster accounts. For more information about how to configure the mappings, see Configure mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
        Note On the DataStudio page, CDH nodes are run by using the CDH cluster account that is mapped to the Alibaba Cloud account or RAM user that is used to log on to the DataWorks console. Therefore, you must configure identity mappings not only for the account used for scheduling but also for workspace developers. Otherwise, nodes may fail to run.
    5. Select the created exclusive resource group for scheduling.
    6. Click Test Connectivity.
      If the connectivity test fails, the exclusive resource group for scheduling is not associated with the VPC in which the CDH cluster is deployed, or hostname-to-IP mappings are not configured for the resource group. For more information about how to configure the network settings of the exclusive resource group for scheduling, see Step 2: Configure network connectivity.
  3. Click Confirm.
    Then, the system starts to initialize the exclusive resource group for scheduling. During the initialization, the system installs the client that is used to access the CDH cluster and uploads the configuration files of the CDH cluster. After the value of Initialization Status of Resource Group on the CDH tab changes from Preparing to Complete, the CDH cluster is associated with the workspace as a compute engine instance.
  4. Click Test Connectivity next to Test Service Connectivity on the CDH tab. Then, DataWorks runs a test task to check whether the client is installed and the configuration files are uploaded.

Develop CDH nodes in DataWorks

After the CDH compute engine instance is associated with the workspace, you can create and run CDH Hive, CDH Spark, CDH MR, CDH Impala, or CDH Presto nodes in DataStudio. You can also configure properties for the nodes. In this section, a CDH Hive node is created and run to demonstrate how to use a CDH node for data development.

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace and click Data Analytics in the Actions column.
  2. Create a workflow on the DataStudio page. In the Scheduled Workflow pane of the DataStudio page, move the pointer over Create and click Create Workflow. In the Create Workflow dialog box, configure the parameters and click Create.
  3. In the Scheduled Workflow pane, click Business Flow, find the created workflow, and then click the workflow name. Right-click CDH and choose Create Node > CDH Hive.
  4. In the code editor, write SQL code for the CDH Hive node and click the Run icon in the top toolbar. In the Parameters dialog box, select the created exclusive resource group for scheduling and click OK. After the code finishes running, you can view the results.
  5. If you want to configure properties for the node, click the Properties tab in the right-side navigation pane. On the Properties tab, configure time properties, resource properties, and scheduling dependencies for the node. Then, commit the node. After the node is committed, the system runs the node based on the configured properties. For more information about how to configure properties for a node, see Configure basic properties.
  6. Go to Operation Center and view the status of the node on the Cycle Task page. For more information, see View and manage auto triggered nodes.
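For step 4, a minimal Hive SQL script for the CDH Hive node might look like the following. The database and table names (demo_db, ods_orders) are purely illustrative; the script is printed here only for reference.

```shell
# Illustrative Hive SQL for a CDH Hive node; demo_db and ods_orders are
# hypothetical names. ${bizdate} is left as a literal scheduling parameter.
cat <<'EOF'
CREATE TABLE IF NOT EXISTS demo_db.ods_orders (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (ds STRING);

SELECT order_id, amount
FROM demo_db.ods_orders
WHERE ds = '${bizdate}'
LIMIT 10;
EOF
```
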

Configure O&M and monitoring and alerting settings for a CDH node

You can use the intelligent monitoring feature provided by DataWorks Operation Center to monitor CDH nodes. This feature allows you to customize alert rules and configure alerting for CDH nodes. If errors occur on the CDH nodes, the system generates alerts based on the configured alert rules. For more information about how to create custom alert rules, see Create a custom alert rule. For more information about how to configure alerting for nodes, see Manage baselines.

Configure monitoring rules for a CDH node in Data Quality

When you use CDH in DataWorks, you can use the Data Quality service of DataWorks to query and compare data generated by a CDH node, monitor the quality of data generated by a CDH node, and scan SQL code of and perform intelligent alerting on a CDH node. For more information about the Data Quality service, see Overview.

Use DataMap to collect data of a CDH cluster

When you use CDH in DataWorks, you can use the DataMap service of DataWorks to collect the metadata of Hive databases, tables, fields, and partitions in a CDH cluster. This facilitates global data searches, viewing of metadata details, data preview, data lineage management, and data category management.
Note You can use DataMap to collect the metadata only of Hive databases in CDH clusters.
For more information about the DataMap service and related configurations, see Overview.