Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity

Last Updated: Dec 18, 2023

You can register Cloudera's Distribution Including Apache Hadoop (CDH) and Cloudera Data Platform (CDP) clusters to DataWorks. After registration, you can use the data development and governance features of DataWorks, such as task development, task scheduling, metadata management in Data Map, and Data Quality, to manage CDH or CDP data. Before you register a CDH or CDP cluster to DataWorks, you must obtain the required configuration information about the cluster and configure network connectivity between the cluster and a specific resource group. This topic describes how to obtain the configuration information about a CDH cluster and configure network connectivity between the CDH cluster and a specific resource group.

Background information

  • CDH is the open source platform distribution of Cloudera. CDH provides out-of-the-box features such as cluster management, cluster monitoring, and cluster diagnostics. CDH also supports a variety of components to help you run end-to-end big data workflows.

  • CDP is the Cloudera platform that succeeds CDH. It combines technologies from CDH and Hortonworks Data Platform (HDP) into a single data platform.

You can register CDH and CDP clusters to DataWorks. Then, you can use DataWorks features such as task development, task scheduling, metadata management in Data Map, and data quality monitoring to develop and manage data in the clusters based on your business requirements.

Prerequisites

  • A CDH cluster is deployed on an Elastic Compute Service (ECS) instance.

    The CDH cluster can also be deployed in an environment other than Alibaba Cloud ECS. In this case, you must make sure that the environment is connected to an Alibaba Cloud virtual private cloud (VPC). You can use Express Connect or VPN Gateway to establish network connectivity.

  • An exclusive resource group for scheduling is created.

    By default, DataWorks exclusive resource groups for scheduling are not connected to the networks of other cloud services after the resource groups are created. A CDH cluster must be connected to a specific exclusive resource group for scheduling before you can use the CDH cluster. For more information about how to create an exclusive resource group for scheduling, see Create and use an exclusive resource group for scheduling.

Obtain the configuration information about the CDH cluster

Perform the following steps to obtain the configuration information about the CDH cluster. The configuration information is required when you register the CDH cluster to DataWorks.

  1. Obtain the version information about the CDH cluster.

    Log on to the Cloudera Manager Admin Console. On the page that appears, you can view the version information to the right of the cluster name, as shown in the following figure.

    [Figure: CDH version information]

  2. Obtain the host and component addresses of the CDH cluster. The addresses are required when you register the CDH cluster to DataWorks. You can use one of the following methods to obtain the addresses:

    Method 1: Use the DataWorks JAR package

    1. Log on to the Cloudera Manager Admin Console and download the DataWorks JAR package.

      wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar

    2. Run the JAR package.

      export PATH=$PATH:/usr/java/jdk1.8.0_181-cloudera/bin
      java -jar dw-tools.jar <user> <password>

      Replace <user> and <password> with the username and password that are used to log on to the Cloudera Manager Admin Console.

      View the host and component addresses of the CDH cluster in the returned results. Then, record the addresses.

      [Figure: Component information]
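The JDK directory in the preceding command is specific to one installation and may not exist on your host. The following is a minimal sketch of a guard that adds the directory to PATH only when java cannot already be resolved. The function name ensure_java is hypothetical and is not part of the DataWorks tool.

```shell
# Hypothetical helper: make sure `java` can be resolved before dw-tools.jar is run.
# $1 (optional): the JDK bin directory to fall back to. The default is the
# directory used in the step above.
ensure_java() {
  jdk_bin="${1:-/usr/java/jdk1.8.0_181-cloudera/bin}"
  if ! command -v java >/dev/null 2>&1; then
    if [ -x "$jdk_bin/java" ]; then
      # Append the fallback directory, as the original export command does.
      PATH="$PATH:$jdk_bin"
      export PATH
    else
      echo "java not found on PATH or in $jdk_bin" >&2
      return 1
    fi
  fi
}

# Usage (placeholders as in the step above):
#   ensure_java && java -jar dw-tools.jar <user> <password>
```

If the function returns a non-zero status, locate the JDK that is shipped with your CDH installation and pass its bin directory as the first argument.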

    Method 2: Obtain the addresses from the Cloudera Manager Admin Console

    Log on to the Cloudera Manager Admin Console and select Roles from the Hosts drop-down list. Find the components that you want to configure based on the keywords and icons. Then, view and record the hostnames displayed on the left, and complete component addresses based on the hostnames and the address format. For more information about the default port numbers in the addresses, see the returned results in Method 1.

    [Figure: Method 2]

    Components:

    • HS2: HiveServer2

    • HMS: Hive Metastore

    • ID: Impala Daemon

    • RM: YARN ResourceManager
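The address assembly described in Method 2 can be sketched as follows. The port numbers are the stock defaults for HiveServer2 (10000), Hive Metastore (9083), the Impala daemon port that serves the HiveServer2 protocol (21050), and the YARN ResourceManager (8032). They are assumptions in this sketch, and the values returned in Method 1 take precedence. The function names and the hostname cdh-master-1 are hypothetical. The sketch prints only host:port pairs because the full address format is the one requested during registration.

```shell
# Map a component abbreviation from the list above to its stock default port.
# These ports are assumptions: confirm them against the Method 1 output.
default_port() {
  case "$1" in
    HS2) echo 10000 ;;  # HiveServer2
    HMS) echo 9083 ;;   # Hive Metastore
    ID)  echo 21050 ;;  # Impala Daemon (HiveServer2 protocol port)
    RM)  echo 8032 ;;   # YARN ResourceManager
    *)   echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# component_endpoint COMPONENT HOSTNAME prints "HOSTNAME:PORT".
component_endpoint() {
  port=$(default_port "$1") || return 1
  echo "$2:$port"
}

# Example with a hypothetical hostname recorded from the Roles page.
component_endpoint HS2 cdh-master-1
```

If your cluster uses non-default ports, edit the case branches rather than the call sites so that every address stays consistent.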

  3. Obtain the configuration files of the CDH cluster. The configuration files must be uploaded when you register the CDH cluster to DataWorks.

    1. Log on to the Cloudera Manager Admin Console.

    2. On the Status tab, click the drop-down arrow to the right of the cluster name and select View Client Configuration URLs.

       [Figure: Configuration files]

    3. In the Client Configuration URLs dialog box, download a specific configuration package. In this example, the YARN configuration package is downloaded.

       [Figure: Configuration files 2]
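After you download and decompress a configuration package, you can cross-check the addresses recorded in Step 2 against the values in the configuration files. The following is a minimal sketch that reads one property from a Hadoop-style *-site.xml file. The function name get_property is hypothetical. The sketch assumes that each <name> and <value> element fits on a single line, and it is not a general XML parser.

```shell
# get_property FILE NAME
# Print the <value> of the first property whose <name> equals NAME.
# Assumption: each <name> and <value> element is written on one line.
get_property() {
  awk -v want="$2" '
    /<name>/ {
      # Remember the most recently seen property name.
      name = $0
      sub(/.*<name>/, "", name)
      sub(/<\/name>.*/, "", name)
    }
    /<value>/ && name == want {
      value = $0
      sub(/.*<value>/, "", value)
      sub(/<\/value>.*/, "", value)
      print value
      exit
    }
  ' "$1"
}

# Usage (the path is hypothetical and depends on where you decompress the package):
#   get_property yarn-conf/yarn-site.xml yarn.resourcemanager.address
```

The function prints nothing if the property is absent, so an empty result means that the file does not define the property explicitly.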

  4. Obtain the network information about the CDH cluster. The network information is used to configure network connectivity between the CDH cluster and the DataWorks exclusive resource group for scheduling.

    1. Log on to the ECS console.

    2. In the left-side navigation pane, choose Instances & Images > Instances. In the top navigation bar, select the region where the ECS instance that hosts the CDH cluster resides. On the Instances page, find the ECS instance and click its ID. On the Instance Details tab of the page that appears, view and record the network information about the instance, such as the security group, VPC, and vSwitch.

Configure network connectivity

By default, a DataWorks exclusive resource group for scheduling is not connected to the networks of other cloud services after it is created. Before you use a CDH cluster, obtain the network information about the cluster. Then, associate the exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed. This ensures network connectivity between the CDH cluster and the exclusive resource group for scheduling.

  1. Go to the network configuration page of the exclusive resource group for scheduling.

    1. Log on to the DataWorks console.

    2. In the left-side navigation pane, click Resource Groups. The Exclusive Resource Groups tab appears.

    3. Find the desired exclusive resource group for scheduling and click Network Settings in the Actions column.

  2. Associate the exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed.

    On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, select the VPC, zone, vSwitch, and security group that are recorded in Step 4 in the "Obtain the configuration information about the CDH cluster" section.

  3. Configure hosts.

    Click the Hostname-to-IP Mapping tab. On this tab, click Batch Modify. In the Batch Modify Hostname-to-IP Mappings dialog box, enter the host addresses that are recorded in Step 2 in the "Obtain the configuration information about the CDH cluster" section.

    [Figure: Hosts configuration]
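If the cluster hosts already resolve each other through an /etc/hosts-style file, you can collect the mappings from that file instead of typing them by hand. The following is a minimal sketch. The function name hosts_mappings and the hostnames in the usage example are hypothetical. The sketch assumes that the dialog box accepts one "IP hostname" pair per line, so confirm the expected format in the dialog box before you paste the output.

```shell
# hosts_mappings HOSTS_FILE HOSTNAME...
# For each HOSTNAME, print "IP HOSTNAME" using the first matching entry
# in an /etc/hosts-style file. Hostnames without an entry are skipped.
hosts_mappings() {
  file="$1"
  shift
  for h in "$@"; do
    awk -v h="$h" '
      /^[[:space:]]*#/ { next }            # skip comment lines
      {
        # Field 1 is the IP address. The remaining fields are names and aliases.
        for (i = 2; i <= NF; i++) {
          if ($i == h) { print $1, h; exit }
        }
      }
    ' "$file"
  done
}

# Usage on a cluster node (hypothetical hostnames):
#   hosts_mappings /etc/hosts cdh-master-1 cdh-worker-1 cdh-worker-2
```

Compare the number of output lines with the number of hostnames that you passed. A shorter output means that at least one host has no entry in the file.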

What to do next

After you complete the preparations, you can register the CDH cluster to DataWorks for data development. For more information, see Register a CDH or CDP cluster to DataWorks.