Cloudera's Distribution including Apache Hadoop (CDH) and Cloudera Data Platform (CDP) can be integrated into DataWorks. This allows you to configure your CDH or CDP clusters as storage and compute engine instances in DataWorks. This way, you can use the data development and governance features provided by DataWorks, such as node development, node scheduling, metadata management in DataMap, and data quality monitoring in Data Quality, to manage CDH or CDP data. The operations required to integrate CDP into DataWorks and use CDP are similar to those for CDH. This topic describes how to integrate CDH into DataWorks and use CDH in DataWorks.
Prerequisites
- A CDH cluster is deployed on an Elastic Compute Service (ECS) instance.
The CDH cluster can also be deployed in an environment other than Alibaba Cloud ECS. You must make sure that the environment can be connected to Alibaba Cloud. You can use Express Connect and VPN Gateway to ensure the network connectivity between the environment and Alibaba Cloud.
- DataWorks is activated, and a workspace is created. The workspace is used to connect to the CDH cluster. Note: You do not need to associate compute engine instances with the workspace that you want to use to connect to the CDH cluster. Therefore, when you create the workspace, you do not need to select a compute engine. For more information about how to create a workspace, see Create a workspace.
- An account to which the workspace administrator role is assigned is created. Only accounts to which the workspace administrator role is assigned can be used to associate CDH clusters with a DataWorks workspace. For more information about how to assign the workspace administrator role to an account, see Manage permissions on workspace-level services.
- An exclusive resource group for scheduling is created in the DataWorks workspace. For more information, see Exclusive resource group mode.
After you complete the preceding operations, you can develop and run CDH nodes in DataWorks and view the status of the nodes in DataWorks Operation Center. For more information, see Develop CDH nodes in DataWorks and Configure O&M and monitoring and alerting settings for a CDH node.
You can also use the Data Quality and DataMap services of DataWorks to manage CDH data and nodes. For more information, see Configure monitoring rules for a CDH node in Data Quality and Use DataMap to collect data of a CDH cluster.
Limits
- To use the features of CDH in DataWorks, you must purchase and use a DataWorks exclusive resource group for scheduling.
- The CDH cluster must be connected to the exclusive resource group for scheduling.
- DataWorks supports CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, and CDH 6.3.2.
Step 1: Obtain the configuration information of the CDH cluster
- Obtain the version information of the CDH cluster. The version information is required when you add the configurations of the CDH cluster to DataWorks. Log on to the Cloudera Manager Admin Console. On the page that appears, you can view the version information next to the cluster name, as shown in the following figure.
- Obtain the host and component addresses of the CDH cluster. The addresses are required when you add the configurations of the CDH cluster to DataWorks. You can use one of the following methods to obtain the addresses:
- Method 1: Use the DataWorks JAR package.
- Log on to the Cloudera Manager Admin Console and download the DataWorks JAR package.
wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar
- Run the JAR package.
export PATH=$PATH:/usr/java/jdk1.8.0_181-cloudera/bin
java -jar dw-tools.jar <user> <password>
Set <user> to the username and <password> to the password that are used to log on to the Cloudera Manager Admin Console.
- View the host and component addresses of the CDH cluster in the returned results. Then, record the addresses.
- Method 2: Obtain the addresses from the Cloudera Manager Admin Console. Log on to the Cloudera Manager Admin Console and select Roles from the Hosts drop-down list. Find the components that you want to configure based on keywords and icons. Then, view and record the hostnames displayed on the left, and complete component addresses based on the hostnames and the address format. For more information about the default port numbers in the addresses, see the returned results in Method 1.
Components:
- HS2: HiveServer2
- HMS: Hive Metastore
- ID: Impala Daemon
- RM: YARN ResourceManager
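For reference, these components commonly listen on the following default ports. The hostnames below are placeholders and the ports are typical defaults, not authoritative values for your cluster; confirm the actual ports against the output of Method 1 or your Cloudera Manager settings.
- HiveServer2 (HS2): <hs2-hostname>:10000, usually referenced as jdbc:hive2://<hs2-hostname>:10000
- Hive Metastore (HMS): <hms-hostname>:9083, usually referenced as thrift://<hms-hostname>:9083
- Impala Daemon (ID): <impalad-hostname>:21050 for JDBC connections
- YARN ResourceManager (RM): <rm-hostname>:8032 (yarn.resourcemanager.address)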
- Obtain the configuration files of the CDH cluster. The configuration files must be uploaded when you add the configurations of the CDH cluster to DataWorks.
- Obtain the network information of the CDH cluster. The network information is used to configure network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.
Step 2: Configure network connectivity
By default, DataWorks exclusive resource groups for scheduling are not connected to the networks of resources for other Alibaba Cloud services after the resource groups are created. Therefore, before you use CDH, you must obtain the network information of your CDH cluster. Then, associate your DataWorks exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed. This ensures network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.
- Go to the network configuration page of the exclusive resource group for scheduling.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Resource Groups. The Exclusive Resource Groups tab appears.
- Find the desired exclusive resource group for scheduling and click Network Settings in the Actions column.
- Associate the exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed. On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, select the VPC, zone, vSwitch, and security group that you recorded in Step 1.
- Configure hosts. Click the Hostname-to-IP Mapping tab. On this tab, click Batch Modify. In the Batch Modify Hostname-to-IP Mappings dialog box, enter the host addresses that you recorded in Step 1.
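For example, the mappings might look like the following lines, where each line pairs a CDH host IP address with its hostname. The IP addresses and hostnames are placeholders, and the exact syntax that the dialog box expects may differ, so follow the format hint displayed in the dialog box.
192.168.0.10 cdh-master-1
192.168.0.11 cdh-worker-1
192.168.0.12 cdh-worker-2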
Step 3: Add the configurations of the CDH cluster to DataWorks
Only workspace administrators can add the configurations of CDH clusters to DataWorks. Therefore, you must use an account to which the workspace administrator role is assigned to perform this operation.
- Go to the Workspace page.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- On the Workspaces page, find the desired workspace, move the pointer over the icon in the Actions column, and then select Workspace Settings.
- In the left-side navigation pane, choose CDH Cluster Configuration.
- On the CDH Cluster Configuration page, click Create Now. In the Create CDH Cluster Configuration dialog box, enter the component addresses that you recorded in Step 1: Obtain the configuration information of the CDH cluster in the related fields.
Configuration information:
- Cluster name: the name of your CDH cluster. You can customize the name.
- Versions: Select the CDH cluster version and component versions based on your business requirements.
- Addresses: Enter the recorded component addresses. Take note of the following parameters:
- jobhistory.webapp.address for YARN: Use the host in the value of yarn.resourcemanager.address, but change the port number to 8088.
- JDBC URL for Presto: Presto is not a default component of CDH. You must configure this parameter based on your business requirements. For a format example, see the sketch after these configuration steps.
- Upload configuration files and associate the CDH cluster with the workspace.
- Configure mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts. If you want to isolate permissions on the data that can be accessed by using different Alibaba Cloud accounts or RAM users in a CDH cluster, enable Kerberos Account Authentication and configure the mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts. Note: Kerberos Account specifies an account that you use to access the CDH cluster. You can use the Sentry or Ranger component to configure different permissions for different Kerberos accounts in the CDH cluster to isolate permissions on data. The Alibaba Cloud accounts or RAM users that are mapped to the same Kerberos account have the same permissions on the data in the CDH cluster. Specify a Kerberos account (also referred to as a Kerberos principal) in the format of Instance name@Domain name, such as cdn_test@HADOOP.COM.
- Click Confirm. After the configurations of the CDH cluster are added to DataWorks, you can associate the CDH cluster with the workspace as a compute engine instance. Then, you can develop and run CDH nodes in the workspace.
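If your CDH cluster includes Presto, the JDBC URL for Presto generally follows the standard Presto JDBC format shown below. The hostname is a placeholder, 8080 is a common default coordinator port, and hive and default are example catalog and schema names; adjust all of them to match your deployment.
jdbc:presto://<presto-coordinator-hostname>:8080/hive/default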
Step 4: Associate the CDH cluster with the workspace as a compute engine instance
- On the Workspaces page, click Workspace Settings in the Actions column that corresponds to the workspace.
- In the lower part of the Workspace Settings panel, click More. In the Compute Engine Information section of the Configuration page, click the CDH tab. On the CDH tab, click Add Instance. In the Add CDH Compute Engine dialog box, configure the parameters. You can set Access Mode to Shortcut mode or Security mode. If Security mode is selected, the permissions on the data of a CDH node that is run by using different Alibaba Cloud accounts or RAM users can be isolated. The parameters that need to be configured vary based on the value of the Access Mode parameter.
- The following figure shows the parameters you must configure if you set Access Mode to Shortcut mode.
- The following figure shows the parameters you must configure if you set Access Mode to Security mode.
- Click Confirm. Then, the system starts to initialize the exclusive resource group for scheduling. During the initialization, the system installs the client that is used to access the CDH cluster and uploads the configuration files of the CDH cluster. After the value of Initialization Status of Resource Group on the CDH tab changes from Preparing to Complete, the CDH cluster is associated with the workspace as a compute engine instance.
- Click Test Connectivity next to Test Service Connectivity on the CDH tab. Then, DataWorks runs a test task to check whether the client is installed and the configuration files are uploaded.
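If the connectivity test fails, a quick manual check from an ECS instance in the same VPC and vSwitch as the exclusive resource group can help you determine whether the required ports are reachable. The following commands are only a sketch: the hostname is a placeholder and the ports are common defaults for HiveServer2 and Hive Metastore.
# Check whether the HiveServer2 and Hive Metastore ports are reachable (placeholder hostname, default ports).
nc -vz cdh-master-1 10000
nc -vz cdh-master-1 9083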
Develop CDH nodes in DataWorks
After the CDH compute engine instance is associated with the workspace, you can create and run CDH Hive, CDH Spark, CDH MR, CDH Impala, or CDH Presto nodes in DataStudio. You can also configure properties for the nodes. In this section, a CDH Hive node is created and run to demonstrate how to use a CDH node for data development.
- Go to the DataStudio page.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- On the Workspaces page, find the desired workspace and click Data Analytics in the Actions column.
- Create a workflow on the DataStudio page. In the Scheduled Workflow pane of the DataStudio page, move the pointer over Create and click Create Workflow. In the Create Workflow dialog box, configure the parameters and click Create.
- In the Scheduled Workflow pane, click Business Flow, find the created workflow, and then click the workflow name. Right-click CDH and choose Create Node > CDH Hive.
- In the code editor, write SQL code for the CDH Hive node and click the run icon in the top toolbar. In the Parameters dialog box, select the created exclusive resource group for scheduling and click OK. After the code finishes running, you can view the results. For a minimal example statement, see the sketch after these steps.
- If you want to configure properties for the node, click the Properties tab in the right-side navigation pane. On the Properties tab, configure time properties, resource properties, and scheduling dependencies for the node. Then, commit the node. After the node is committed, the system runs the node based on the configured properties. For more information about how to configure properties for a node, see Configure basic properties.
- Go to Operation Center and view the status of the node on the Cycle Task page. For more information, see View and manage auto triggered nodes.
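As a reference for the SQL that a CDH Hive node runs, the following beeline command executes a trivial statement directly against HiveServer2. This is only a hypothetical smoke test with a placeholder hostname and the default port 10000; in DataStudio, you paste only the SQL statement (for example, SHOW DATABASES;) into the CDH Hive node editor and run it with the exclusive resource group for scheduling.
# Hypothetical smoke test against HiveServer2 (placeholder hostname, default port 10000).
beeline -u "jdbc:hive2://cdh-master-1:10000/default" -e "SHOW DATABASES;"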
Configure O&M and monitoring and alerting settings for a CDH node
You can use the intelligent monitoring feature provided by DataWorks Operation Center to monitor CDH nodes. This feature allows you to customize alert rules and configure alerting for CDH nodes. If errors occur on the CDH nodes, the system generates alerts based on the configured alert rules. For more information about how to create custom alert rules, see Create a custom alert rule. For more information about how to configure alerting for nodes, see Manage baselines.
Configure monitoring rules for a CDH node in Data Quality
When you use CDH in DataWorks, you can use the Data Quality service of DataWorks to query and compare data generated by a CDH node, monitor the quality of data generated by a CDH node, and scan SQL code of and perform intelligent alerting on a CDH node. For more information about the Data Quality service, see Overview.