Cloudera's Distribution Including Apache Hadoop (CDH) and Cloudera Data Platform (CDP) can be integrated into DataWorks. You can configure your CDH or CDP clusters as storage and compute engine instances in DataWorks and then use the data development and governance features of DataWorks, such as node development, node scheduling, metadata management in DataMap, and Data Quality, to manage CDH or CDP data. The operations required to integrate CDP into DataWorks and use it are similar to those for CDH. This topic describes how to integrate CDH into DataWorks and use CDH in DataWorks.

Prerequisites

  • A CDH cluster is deployed on an Elastic Compute Service (ECS) instance.

    The CDH cluster can also be deployed in an environment other than Alibaba Cloud ECS. In this case, you must ensure that the environment can connect to Alibaba Cloud, for example, over Express Connect or VPN Gateway.

  • DataWorks is activated, and a workspace is created. The workspace is used to connect to the CDH cluster.
    Note You do not need to associate compute engine instances with this workspace. Therefore, you do not need to select an engine when you create the workspace. For more information about how to create a workspace, see Create a workspace.
  • An account to which the workspace administrator role is assigned is created. Only such accounts can be used to associate CDH clusters with a DataWorks workspace. For more information about how to assign the workspace administrator role to an account, see Manage permissions on workspace-level services.
  • An exclusive resource group for scheduling is created in the DataWorks workspace. For more information, see Exclusive resource group mode.

After you complete the steps in this topic, you can develop and run CDH nodes in DataWorks and view the status of the nodes in DataWorks Operation Center. For more information, see Develop CDH nodes in DataWorks and Configure O&M and monitoring and alerting settings for a CDH node.

You can also use the Data Quality and DataMap services of DataWorks to manage CDH data and nodes. For more information, see Configure monitoring rules for a CDH node in Data Quality and Use DataMap to collect data of a CDH cluster.

Limits

  • To use the features of CDH in DataWorks, you must purchase and use a DataWorks exclusive resource group for scheduling.
  • The CDH cluster must be connected to the exclusive resource group for scheduling.
  • DataWorks supports CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, and CDH 6.3.2.

Step 1: Obtain the configuration information of the CDH cluster

  1. Obtain the version information of the CDH cluster. The version information is required when you add the configurations of the CDH cluster to DataWorks.
    Log on to the Cloudera Manager Admin Console. On the page that appears, you can view the version information next to the cluster name.
  2. Obtain the host and component addresses of the CDH cluster. The addresses are required when you add the configurations of the CDH cluster to DataWorks. You can use one of the following methods to obtain the addresses:
    • Method 1: Use the DataWorks JAR package.
      1. Log on to a host in the CDH cluster and download the DataWorks JAR package.
        wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar
      2. Run the JAR package.
        export PATH=$PATH:/usr/java/jdk1.8.0_181-cloudera/bin
        java -jar dw-tools.jar <user> <password>
        Set <user> and <password> to the username and password that are used to log on to the Cloudera Manager Admin Console.
      3. View the host and component addresses of the CDH cluster in the returned results. Then, record the addresses.
    • Method 2: Obtain the addresses from the Cloudera Manager Admin Console.
      Log on to the Cloudera Manager Admin Console and select Roles from the Hosts drop-down list. Find the components that you want to configure based on keywords and icons. Then, view and record the hostnames displayed on the left, and assemble the component addresses from the hostnames and the address format. For more information about the default port numbers in the addresses, see the returned results in Method 1. Components:
      • HS2: HiveServer2
      • HMS: Hive Metastore
      • ID: Impala Daemon
      • RM: YARN ResourceManager
  3. Obtain the configuration files of the CDH cluster. The configuration files must be uploaded when you add the configurations of the CDH cluster to DataWorks.
    1. Log on to the Cloudera Manager Admin Console.
    2. On the Status tab, click the drop-down arrow on the right of the cluster name and select View Client Configuration URLs.
    3. In the Client Configuration URLs dialog box, download the YARN configuration package.
  4. Obtain the network information of the CDH cluster. The network information is used to configure network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.
    1. Log on to the ECS console.
    2. In the left-side navigation pane, choose Instances & Images > Instances. In the top navigation bar, select the region where the ECS instance that hosts the CDH cluster resides. On the Instances page, find the ECS instance and click its ID. On the Instance Details tab of the page that appears, view the information about the instance, such as security group, VPC, and vSwitch. Then, record the information.
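To illustrate step 2, the recorded hostnames and the common CDH default ports can be assembled into component addresses similar to the following. The hostnames here are hypothetical; substitute the ones you recorded, and confirm the ports against the Method 1 output for your cluster.

```shell
# Hypothetical component addresses built from recorded hostnames and
# common CDH default ports. Replace the hostnames with your own values.
echo "jdbc:hive2://cdh-master-1:10000/"   # HS2: HiveServer2 JDBC URL
echo "thrift://cdh-master-1:9083"         # HMS: Hive Metastore URI
echo "jdbc:impala://cdh-worker-1:21050/"  # ID: Impala Daemon JDBC URL
echo "cdh-master-1:8032"                  # RM: yarn.resourcemanager.address
echo "cdh-master-1:8088"                  # RM web UI port, used later for jobhistory.webapp.address
```
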

Step 2: Configure network connectivity

By default, DataWorks exclusive resource groups for scheduling are not connected to the networks of resources for other Alibaba Cloud services after the resource groups are created. Therefore, before you use CDH, you must obtain the network information of your CDH cluster. Then, associate your DataWorks exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed. This ensures network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.

  1. Go to the network configuration page of the exclusive resource group for scheduling.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Resource Groups. The Exclusive Resource Groups tab appears.
    3. Find the desired exclusive resource group for scheduling and click Network Settings in the Actions column.
  2. Associate the exclusive resource group for scheduling with the VPC in which the CDH cluster is deployed.
    On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, select the VPC, zone, vSwitch, and security group that you recorded in step 4 of Step 1.
  3. Configure hosts.
    Click the Hostname-to-IP Mapping tab. On this tab, click Batch Modify. In the Batch Modify Hostname-to-IP Mappings dialog box, enter the host addresses that you recorded in step 2 of Step 1.
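For reference, hostname-to-IP mappings follow the standard "IP hostname" format, one entry per line. The values below are hypothetical; use the private IP addresses and hostnames recorded from your own CDH cluster.

```shell
# Hypothetical hostname-to-IP mappings in the standard "IP hostname" format.
# Replace these with the values recorded from your CDH cluster.
cat <<'EOF'
192.168.0.10 cdh-master-1
192.168.0.11 cdh-worker-1
192.168.0.12 cdh-worker-2
EOF
```
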

Step 3: Add the configurations of the CDH cluster to DataWorks

Only workspace administrators can add the configurations of CDH clusters to DataWorks. Therefore, you must use an account to which the workspace administrator role is assigned to perform this operation.

  1. Go to the Workspace page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace, move the pointer over the More icon in the Actions column, and then select Workspace Settings.
  2. In the left-side navigation pane, choose Open Source Clusters > CDH Clusters.
  3. On the CDH Cluster Configuration page, click Create Now. In the Create CDH Cluster Configuration dialog box, enter the component addresses that you recorded in Step 1: Obtain the configuration information of the CDH cluster in the related fields.
    Configuration information:
    • Cluster name: the name of your CDH cluster. You can customize the name.
    • Versions: Select the CDH cluster version and component versions based on your business requirements.
    • Addresses: Enter the recorded component addresses.
      • jobhistory.webapp.address for YARN: Use the value of yarn.resourcemanager.address with the port number changed to 8088.
      • JDBC URL for Presto: Presto is not a default component for CDH. You must configure this parameter based on your business requirements.
  4. Upload configuration files and associate the CDH cluster with the workspace.
  5. Configure mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
    If you want to isolate permissions on the data that can be accessed by using different Alibaba Cloud accounts or RAM users in a CDH cluster, enable Kerberos Account Authentication and configure the mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
    Note Kerberos Account specifies an account that you use to access the CDH cluster. You can use the Sentry or Ranger component to configure different permissions for different Kerberos accounts in the CDH cluster to isolate permissions on data. The Alibaba Cloud accounts or RAM users that are mapped to the same Kerberos account have the same permissions on the data in the CDH cluster. Specify a Kerberos account (also referred to as a Kerberos principal) in the format of Instance name@Domain name, such as cdn_test@HADOOP.COM.
  6. Click Confirm.
    After the configurations of the CDH cluster are added to DataWorks, you can associate the CDH cluster with the workspace as a compute engine instance. Then, you can develop and run CDH nodes in the workspace.
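As a reference for step 5, a Kerberos principal in the Instance name@Domain name format splits into an instance name and a domain (realm). The sketch below uses the example principal from the note above; any real principal you configure must come from your own cluster.

```shell
# A Kerberos principal in the "Instance name@Domain name" format,
# using the example principal from the note above.
principal="cdn_test@HADOOP.COM"
echo "${principal%@*}"   # instance name: cdn_test
echo "${principal#*@}"   # domain (realm): HADOOP.COM
```
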

Step 4: Associate the CDH cluster with the workspace as a compute engine instance

  1. On the Workspaces page, click Workspace Settings in the Actions column that corresponds to the workspace.
  2. In the lower part of the Workspace Settings panel, click More. In the Compute Engine Information section of the Configuration page, click the CDH tab. On the CDH tab, click Add Instance. In the Add CDH Compute Engine dialog box, configure the parameters.
    You can set Access Mode to Shortcut mode or Security mode. If Security mode is selected, the permissions on the data of a CDH node that is run by using different Alibaba Cloud accounts or RAM users can be isolated. The parameters that need to be configured vary based on the value of the Access Mode parameter.
    • The following figure shows the parameters you must configure if you set Access Mode to Shortcut mode.
    • The following figure shows the parameters you must configure if you set Access Mode to Security mode.
    1. Configure Instance Display Name.
    2. Configure Access Mode.
      • Shortcut mode

        If this access mode is used, multiple Alibaba Cloud accounts or RAM users map to the same CDH cluster account. These Alibaba Cloud accounts or RAM users can access data within the same CDH cluster account. In this case, permissions on data are not isolated.

      • Security mode

        If this access mode is used, you can configure the mappings between the Alibaba Cloud accounts or RAM users and CDH cluster accounts to isolate the permissions on the data of a CDH node that is run by using the Alibaba Cloud accounts or RAM users.

    3. Select the CDH cluster whose configurations you added.
      If Shortcut mode is selected for Access Mode, you must select a CDH cluster whose Authentication Type is not set to Kerberos Account Authentication. If Security mode is selected for Access Mode, you must select a CDH cluster whose Authentication Type is set to Kerberos Account Authentication. For more information about how to check whether Kerberos Account Authentication is enabled for the CDH cluster, see Create a workspace.
    4. Configure access authentication information for the CDH cluster.
      • Shortcut mode

        You can use only the specified accounts, such as admin and hadoop. These accounts are used only to issue nodes.

      • Security mode
        You can configure Account for Scheduling Nodes based on your business requirements. This identity is used to automatically schedule and run a node after a CDH node is committed. You must configure mappings between the Alibaba Cloud accounts or RAM users and CDH cluster accounts. For more information about how to configure the mappings, see Configure mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
        Note On the DataStudio page, CDH nodes are run by using the CDH cluster account that is mapped to the Alibaba Cloud account or RAM user that is used to log on to the DataWorks console. Therefore, you must configure identity mappings not only for the account used for scheduling but also for workspace developers. Otherwise, nodes may fail to run.
    5. Select the created exclusive resource group for scheduling.
    6. Click Test Connectivity.
      If the connectivity test fails, the exclusive resource group for scheduling is not associated with the VPC in which the CDH cluster is deployed, or hostname-to-IP mappings are not configured for the resource group. For more information about how to configure the network settings of the exclusive resource group for scheduling, see Step 2: Configure network connectivity.
  3. Click Confirm.
    Then, the system starts to initialize the exclusive resource group for scheduling. During the initialization, the system installs the client that is used to access the CDH cluster and uploads the configuration files of the CDH cluster. After the value of Initialization Status of Resource Group on the CDH tab changes from Preparing to Complete, the CDH cluster is associated with the workspace as a compute engine instance.
  4. Click Test Connectivity next to Test Service Connectivity on the CDH tab. Then, DataWorks runs a test task to check whether the client is installed and the configuration files are uploaded.

Develop CDH nodes in DataWorks

After the CDH compute engine instance is associated with the workspace, you can create and run CDH Hive, CDH Spark, CDH MR, CDH Impala, or CDH Presto nodes in DataStudio. You can also configure properties for the nodes. In this section, a CDH Hive node is created and run to demonstrate how to use a CDH node for data development.

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace and click Data Analytics in the Actions column.
  2. Create a workflow on the DataStudio page. In the Scheduled Workflow pane of the DataStudio page, move the pointer over Create and click Create Workflow. In the Create Workflow dialog box, configure the parameters and click Create.
  3. In the Scheduled Workflow pane, click Business Flow, find the created workflow, and then click the workflow name. Right-click CDH and choose Create Node > CDH Hive.
  4. In the code editor, write SQL code for the CDH Hive node and click the Run icon in the top toolbar. In the Parameters dialog box, select the created exclusive resource group for scheduling and click OK. After the code finishes running, you can view the results.
  5. If you want to configure properties for the node, click the Properties tab in the right-side navigation pane. On the Properties tab, configure time properties, resource properties, and scheduling dependencies for the node. Then, commit the node. After the node is committed, the system runs the node based on the configured properties. For more information about how to configure properties for a node, see Configure basic properties.
  6. Go to Operation Center and view the status of the node on the Cycle Task page. For more information, see View and manage auto triggered nodes.
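For step 4, a minimal Hive SQL script for the CDH Hive node might look like the following. The database and table names (demo_db, ods_orders) are purely illustrative; the script is printed here only for reference.

```shell
# Illustrative Hive SQL for a CDH Hive node; demo_db and ods_orders are
# hypothetical names. ${bizdate} is left as a literal scheduling parameter.
cat <<'EOF'
CREATE TABLE IF NOT EXISTS demo_db.ods_orders (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (ds STRING);

SELECT order_id, amount
FROM demo_db.ods_orders
WHERE ds = '${bizdate}'
LIMIT 10;
EOF
```
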

Configure O&M and monitoring and alerting settings for a CDH node

You can use the intelligent monitoring feature provided by DataWorks Operation Center to monitor CDH nodes. This feature allows you to customize alert rules and configure alerting for CDH nodes. If errors occur on the CDH nodes, the system generates alerts based on the configured alert rules. For more information about how to create custom alert rules, see Create a custom alert rule. For more information about how to configure alerting for nodes, see Manage baselines.

Configure monitoring rules for a CDH node in Data Quality

When you use CDH in DataWorks, you can use the Data Quality service of DataWorks to query and compare data generated by a CDH node, monitor the quality of data generated by a CDH node, and scan SQL code of and perform intelligent alerting on a CDH node. For more information about the Data Quality service, see Overview.

Use DataMap to collect data of a CDH cluster

When you use CDH in DataWorks, you can use the DataMap service of DataWorks to collect the metadata of Hive databases, tables, fields, and partitions in a CDH cluster. This facilitates global data searches, viewing of metadata details, data preview, data lineage management, and data category management.
Note You can use DataMap to collect the metadata only of Hive databases in CDH clusters.
For more information about the DataMap service and related configurations, see Overview.