Cloudera's Distribution including Apache Hadoop (CDH) can be integrated into DataWorks. This allows you to configure your CDH clusters as storage and compute engines in DataWorks. This way, you can use DataWorks features, such as node development, node scheduling, Data Map (metadata management), and Data Quality, to develop and manage CDH data and nodes. This topic describes how to integrate CDH into DataWorks and use CDH in DataWorks.

Prerequisites

  • A CDH cluster is deployed on an Elastic Compute Service (ECS) instance.

    The CDH cluster can also be deployed in an environment other than Alibaba Cloud ECS. You must make sure that the environment can be connected to Alibaba Cloud. You can use Express Connect or VPN Gateway to ensure network connectivity between the environment and Alibaba Cloud.

  • DataWorks is activated, and a workspace is created to connect to the CDH cluster.
    Note The workspaces that are used to connect to CDH clusters do not need to be associated with compute engines. Therefore, when you create a workspace, you do not need to select an engine. For more information about how to create a workspace, see Create a workspace.
  • An account that has administrative permissions on the workspace is created. Only workspace administrators can add CDH cluster configurations to DataWorks. For more information about how to grant administrative permissions on a workspace to an account, see Manage members and roles.
  • A DataWorks exclusive resource group for scheduling is created. For more information, see Exclusive resource group mode.
Before you use CDH in DataWorks, you must perform the following operations to integrate CDH into DataWorks:
  1. Step 1: Obtain the configuration information of the CDH cluster
  2. Step 2: Configure network connectivity
  3. Step 3: Add the configurations of the CDH cluster to DataWorks

After you complete the preceding operations, you can develop and run CDH nodes in DataWorks and view the status of the nodes in DataWorks Operation Center. For more information, see Use DataWorks to develop nodes and Configure O&M and monitoring settings.

You can also use the Data Quality and Data Map services of DataWorks to manage CDH data and nodes. For more information, see Configure data quality rules and Use Data Map to collect data.

Limits

  • To use CDH features in DataWorks, you must create and use a DataWorks exclusive resource group for scheduling.
  • The CDH cluster must be connected to the exclusive resource group for scheduling.

Step 1: Obtain the configuration information of the CDH cluster

  1. Obtain the version information of the CDH cluster. The version information is required when you add the configurations of the CDH cluster to DataWorks.
    Log on to the Cloudera Manager Admin Console. On the page that appears, you can view the version information on the right side of the cluster name.
  2. Obtain the host and component addresses of the CDH cluster. The addresses are required when you add the configurations of the CDH cluster to DataWorks.
    • Method 1: Use the DataWorks JAR package to obtain the addresses.
      1. Log on to the Cloudera Manager Admin Console and download the DataWorks JAR package.
        wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar
      2. Run the JAR package.
        # Make the CDH-provided JDK available. Adjust the JDK path to match your environment.
        export PATH=$PATH:/usr/java/jdk1.8.0_181-cloudera/bin
        java -jar dw-tools.jar <user> <password>
        Set <user> and <password> to the username and password that you use to log on to the Cloudera Manager Admin Console.
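        For example, a hypothetical invocation with an administrator account named admin (replace both values with your own credentials):
        java -jar dw-tools.jar admin Admin@1234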
      3. View the host and component addresses of the CDH cluster in the returned results. Then, record the addresses.
    • Method 2: Obtain the addresses from the Cloudera Manager Admin Console.
      Log on to the Cloudera Manager Admin Console and select Roles from the Hosts drop-down list. Find the components that you want to configure based on keywords and icons. Then, view and record the hostnames displayed on the left, and complete the component addresses based on the hostnames and the address format. For more information about the default port numbers in the addresses, see the returned results in Method 1. For sample addresses, see the example at the end of Step 1. The following abbreviations identify the components:
      • HS2: HiveServer2
      • HMS: Hive Metastore
      • ID: Impala Daemon
      • RM: YARN ResourceManager
  3. Obtain the configuration files of the CDH cluster. The configuration files must be uploaded when you add the configurations of the CDH cluster to DataWorks.
    1. Log on to the Cloudera Manager Admin Console.
    2. On the Status tab, click the drop-down arrow on the right of the cluster name and select View Client Configuration URLs.
    3. In the Client Configuration URLs dialog box, download the YARN configuration package.
  4. Obtain the network information of the CDH cluster. The network information is used when you configure network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.
    1. Log on to the ECS console.
    2. In the left-side navigation pane, choose Instances & Images > Instances. In the top navigation bar, select the region where the ECS instance that hosts the CDH cluster resides. On the Instances page, find the ECS instance and click its ID. On the Instance Details tab of the page that appears, view the security group, VPC, and vSwitch to which the ECS instance belongs. Then, record the information.
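
The following example shows what the recorded component addresses might look like. The hostnames are hypothetical and the ports are common CDH defaults; verify both against the returned results in Method 1:
  HiveServer2 (HS2):          jdbc:hive2://cdh-master-1:10000
  Hive Metastore (HMS):       thrift://cdh-master-1:9083
  Impala Daemon (ID):         cdh-worker-1:21050
  YARN ResourceManager (RM):  cdh-master-1:8032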

Step 2: Configure network connectivity

By default, DataWorks exclusive resource groups for scheduling are not connected to the networks of resources for other Alibaba Cloud services after the resource groups are created. Therefore, before you use CDH, you must obtain the network information of your CDH cluster. Then, associate your DataWorks exclusive resource group for scheduling with the VPC to which the CDH cluster belongs. This ensures network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.

  1. Go to the network configuration page of the exclusive resource group for scheduling.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Resource Groups. The Exclusive Resource Groups tab appears.
    3. Find the desired exclusive resource group for scheduling and click Network Settings in the Actions column.
  2. Associate the exclusive resource group for scheduling with the VPC to which the CDH cluster belongs.
    On the VPC Binding tab, click Add Binding. In the Add VPC Binding panel, select the VPC, vSwitch, and security group that you recorded in step 4 of Step 1. Then, click OK.
  3. Configure hosts.
    Click the Hostname-to-IP Mapping tab. On this tab, click Batch Modify. In the Batch Modify Hostname-to-IP Mappings dialog box, enter the host addresses that you recorded in step 2 of Step 1, as shown in the following example.
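    A minimal sketch of the mapping format, assuming hypothetical hosts. Each line maps one IP address to one hostname of the CDH cluster:
    192.168.0.1 cdh-master-1
    192.168.0.2 cdh-worker-1
    192.168.0.3 cdh-worker-2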

Step 3: Add the configurations of the CDH cluster to DataWorks

Only workspace administrators can add the configurations of CDH clusters to DataWorks. Therefore, you must use an account that has the administrative permissions on your workspace to perform this operation.

  1. Go to the Workspace Management tab.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace and click Workspace Settings in the Actions column.
    4. In the Workspace Settings panel, click More.
  2. In the left-side navigation pane of the page that appears, choose Opensource Cluster Management > Hadoop Config.
  3. On the CDH Cluster Configuration page, click Create Now. In the Create CDH Cluster Configuration dialog box, enter the component addresses that you recorded in Step 1: Obtain the configuration information of the CDH cluster in the related fields.
    Configuration information:
    • Cluster name: the name of your CDH cluster. You can customize the name.
    • Versions: Select the CDH cluster version and component versions based on actual conditions.
    • Addresses: Enter the recorded component addresses. Take note of the following fields:
      • jobhistory.webapp.address for YARN: Use the host in the value of yarn.resourcemanager.address and change the port number to 8088. For example, if yarn.resourcemanager.address is cdh-master-1:8032, set this field to cdh-master-1:8088.
      • JDBC URL for Presto: Presto is not a default component of CDH. You must configure this parameter based on actual conditions, for example, a URL in the format jdbc:presto://<host>:<port>/<catalog>/<schema>, such as jdbc:presto://cdh-master-1:8080/hive/default.
  4. Upload configuration files and associate the CDH cluster with the workspace.
  5. Configure mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
    If you want to isolate permissions on the data that can be accessed by different Alibaba Cloud accounts or RAM users in a CDH cluster, enable Kerberos Account Authentication and configure the mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
    Note Kerberos Account specifies an account that you use to access the CDH cluster. You can use the Sentry or Ranger component to configure different permissions for different Kerberos accounts in the CDH cluster to isolate data permissions. The Alibaba Cloud accounts or RAM users that are mapped to the same Kerberos account have the same permissions on the data in the CDH cluster. Specify a Kerberos account (also referred to as a Kerberos principal) in the format of Instance name@Domain name, such as cdn_test@HADOOP.COM.
  6. Click Confirm.
    After the configurations of the CDH cluster are added to DataWorks, you can add the CDH cluster to the associated workspace as a compute engine instance. Then, you can develop and run CDH nodes in the workspace.

Step 4: Add the CDH cluster to the associated workspace as a compute engine instance

  1. On the Workspaces page, click Workspace Settings in the Actions column that corresponds to the associated workspace.
  2. In the lower part of the Workspace Settings panel, click More. In the Compute Engine Information section of the Configuration page, click the CDH tab. On the CDH tab, click Add Instance. In the Add CDH Compute Engine dialog box, configure the parameters.
    You can set Access Mode to Shortcut mode or Security mode. If you select Security mode, data permissions can be isolated for nodes that are run by different Alibaba Cloud accounts or RAM users. The parameters that you must configure vary based on the value of the Access Mode parameter.
    1. Specify Instance Display Name.
    2. Specify Access Mode.
      • Shortcut mode

        If this access mode is used, multiple Alibaba Cloud accounts or RAM users are mapped to the same CDH cluster account and access data as that account. In this case, data permissions are not isolated.

      • Security mode

        If this access mode is used, you can configure mappings between Alibaba Cloud accounts or RAM users and CDH cluster accounts to isolate data permissions for the nodes that are run by different Alibaba Cloud accounts or RAM users.

    3. Select the CDH cluster whose configurations you added.
      If Shortcut mode is selected for Access Mode, you must select a CDH cluster whose Authentication Type is not set to Kerberos Account Authentication. If Security mode is selected for Access Mode, you must select a CDH cluster whose Authentication Type is set to Kerberos Account Authentication. For more information about how to check whether Kerberos Account Authentication is selected for the CDH cluster, see Step 3: Add the configurations of the CDH cluster to DataWorks.
    4. Configure access authentication information for the CDH cluster.
      • Shortcut mode

        You can use only specific cluster accounts, such as admin and hadoop. These accounts are used only to commit nodes.

      • Security mode
        You can set Account for Scheduling Nodes based on your business requirements. This identity is used to automatically schedule and run a node after the node is committed. You must configure mappings between the Alibaba Cloud accounts or RAM users and CDH cluster accounts. For more information about how to configure the mappings, see Configure mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts.
        Note On the DataStudio page, the identity used to run nodes is the CDH cluster account that is mapped to the logon Alibaba Cloud account or RAM user. Therefore, you must configure the identity mappings not only for scheduling access identities, but also for the workspace developers to prevent nodes from failing to run.
    5. Select the created exclusive resource group for scheduling.
    6. Click Test Connectivity.
      If the connectivity test fails, the exclusive resource group for scheduling may not be associated with the VPC to which the CDH cluster belongs, or hostname-to-IP mappings may not be configured. For more information about how to configure the network settings of the exclusive resource group for scheduling, see Step 2: Configure network connectivity.
  3. Click Confirm.
    Then, the system starts to initialize the exclusive resource group for scheduling. During the initialization, the system installs the client that is used to access the CDH cluster and uploads the configuration files of the CDH cluster. After the value of Initialization Status of Resource Group on the CDH tab changes from Preparing to Complete, the CDH cluster is added to the workspace as a compute engine instance.
  4. Click Test Connectivity next to Test Service Connectivity on the CDH tab. Then, DataWorks runs a test task to check whether the client is installed and the configuration files are uploaded.
    If the test fails, you can view the logs and submit a ticket to the technical support personnel of DataWorks.

Use DataWorks to develop nodes

After you add the CDH compute engine instance, you can create and run CDH Hive, CDH Spark, CDH MR, CDH Impala, or CDH Presto nodes in DataStudio. You can also configure properties for the nodes. In this section, a CDH Hive node is created and run to demonstrate how to use a CDH node to develop data.

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace and click Data Analytics in the Actions column.
  2. On the DataStudio page, move the pointer over the Create icon and click Workflow. In the Create Workflow dialog box, configure the parameters and click Create.
  3. In the left-side navigation pane, click Business Flow, find the created workflow, and then click the workflow name. Right-click CDH and choose Create > CDH Hive.
  4. In the code editor, write SQL code for the CDH Hive node and click the Run icon in the top toolbar. In the Parameters dialog box, select the desired exclusive resource group for scheduling and click OK. After the code finishes running, you can view the results. A sample query is provided after this procedure.
  5. If you want to configure properties for the node, click the Properties tab in the right-side navigation pane. On the Properties tab, configure time properties, resource properties, and scheduling dependencies for the node. Then, commit the node. After the node is committed, the system runs the node based on the configured properties. For more information about how to configure properties for a node, see Basic properties.
  6. Go to the Operation Center page and view the status of the node on the Cycle Task page. For more information, see View auto triggered nodes.
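
The following is a minimal sample query that you can run in the CDH Hive node created in this procedure. The database and table names are hypothetical; replace them with your own objects:
  -- Create a demo table, insert a few rows, and query them.
  CREATE TABLE IF NOT EXISTS default.dw_cdh_demo (id INT, name STRING);
  INSERT INTO default.dw_cdh_demo VALUES (1, 'hello'), (2, 'dataworks');
  SELECT id, name FROM default.dw_cdh_demo;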

Configure O&M and monitoring settings

CDH nodes support the intelligent monitoring feature provided by DataWorks Operation Center. This feature allows you to customize alert rules and configure alerting for CDH nodes. The system automatically generates alerts based on the configured alert rules if errors occur when the CDH nodes are run. For more information about how to customize alert rules, see Manage custom alert rules. For more information about how to configure alerting for nodes, see Manage baselines.

Configure data quality rules

When you use CDH in DataWorks, you can use the Data Quality service of DataWorks to query and compare data, monitor data quality, scan SQL code, and perform intelligent alerting. For more information about the Data Quality service, see Overview.

Use Data Map to collect data

When you use CDH in DataWorks, you can use the Data Map service of DataWorks to collect the metadata of Hive databases, tables, fields, and partitions in the CDH cluster. This facilitates global data searches, viewing of metadata details, data preview, data lineage management, and data category management.
Note You can use Data Map to collect metadata only from Hive databases in CDH clusters.
For more information about the Data Map service and related configurations, see Overview.

If you want to monitor the metadata changes of Hive databases in a CDH cluster in real time or view lineage and metadata change records in Data Map, associate DataWorks Hive hooks with the CDH cluster. Then, use Log Service to collect the logs generated by the hooks.

After the Hive hooks are configured, metadata changes are recorded in the log file /tmp/hive/hook.event.*.log on the HiveServer2 and Hive Metastore hosts. In this case, you can use Log Service to collect the change records for DataWorks to read. Download the DataWorks tool dw-tools.jar, create a config.json file in the same directory, and then specify the configuration items in the file. Then, run the tool to enable log collection with one click.

To configure Hive hooks and collect logs from the hooks, perform the following steps:

  1. Configure Hive hooks.
    1. Log on to the HiveServer2 and Hive Metastore hosts and go to the /var/lib/hive directory to download DataWorks Hive hooks.
      # Download dataworks-hive-hook-2.1.1.jar for CDH 6.X clusters.
      wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dataworks-hive-hook-2.1.1.jar
      # Download dataworks-hive-hook-1.1.0-cdh5.16.2.jar for CDH 5.X clusters.
      wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dataworks-hive-hook-1.1.0-cdh5.16.2.jar
    2. Log on to the Cloudera Manager Admin Console and click Hive below the cluster name. On the page that appears, click the Configuration tab. Then, set Hive Auxiliary JARs Directory to /var/lib/hive.
    3. For Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, specify the Name and Value fields based on the following information:
      <property>
        <name>hive.exec.post.hooks</name>
        <value>com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger,com.aliyun.dataworks.meta.hive.hook.LineageLoggerHook</value>
      </property>
    4. For Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml, specify the Name and Value fields based on the following information:
      <property>
        <name>hive.metastore.event.listeners</name>
        <value>com.aliyun.dataworks.meta.hive.listener.MetaStoreListener</value>
      </property>
      <property>
        <name>hive.metastore.pre.event.listeners</name>
        <value>com.aliyun.dataworks.meta.hive.listener.MetaStorePreAuditListener</value>
      </property>
    5. After the Hive hooks are configured, you must perform configurations on clients as prompted in the Cloudera Manager Admin Console. Then, restart the Hive service.
      Note If the restart fails, retain the logs for troubleshooting. To prevent normal operations from being affected, you can remove the added information and restart the Hive service again. If the restart succeeds after the information is added, check whether the log files whose names start with hook.event, such as hook.event.1608728145871.log, are generated in the /tmp/hive/ directory on the hosts.
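      A quick check on each host, assuming the default log location described above (the numeric suffix is a timestamp):
      ls /tmp/hive/hook.event.*.log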
  2. Collect logs from the Hive hooks.
    1. Log on to the Cloudera Manager Admin Console and download the DataWorks JAR package.
      wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar
    2. Create a config.json file in the directory in which the DataWorks tool is stored. Then, modify the file based on the following code and save the file:
      // config.json
      {
          "accessId": "<accessId>",
          "accessKey": "<accessKey>",
          "endpoint": "cn-shanghai-intranet.log.aliyuncs.com",
          "project": "onefall-test-pre",
          "clusterId": "1234",
          "ipList": "192.168.0.1,192.168.0.2,192.168.0.3"
      }
      Configuration information:
      • accessId: the AccessKey ID of your Alibaba Cloud account.
      • accessKey: the AccessKey secret of your Alibaba Cloud account.
      • endpoint: the internal endpoint that is used to access your Log Service project. For more information, see Endpoints.
      • project: the name of your Log Service project. For more information about how to obtain the name, see Manage a project.
      • clusterId: the ID of the CDH cluster generated for DataWorks. You can submit a ticket to obtain the ID.
      • ipList: the IP addresses of all HiveServer2 and Hive Metastore hosts. Separate the IP addresses with commas (,). The hosts are those on which the DataWorks Hive hooks are deployed.
    3. Run the tool with the config.json file.
      java -cp dw-tools.jar com.aliyun.dataworks.tools.CreateLogConfig config.json
    4. Install the Logtail client.
      wget http://logtail-release-cn-shanghai.oss-cn-shanghai.aliyuncs.com/linux64/logtail.sh -O logtail.sh; chmod 755 logtail.sh; ./logtail.sh install cn-shanghai
      Replace cn-shanghai with the region where your Log Service project resides.
  3. After you complete the preceding steps, a Logstore named hive-event, a Logtail configuration named hive-event-config, and a log group named hive-servers are generated in your Log Service project. You can view and record the ID of your Alibaba Cloud account, the endpoint of your Log Service project, and other information about the project. Then, submit a ticket to send the recorded information to the technical support personnel of DataWorks. This way, the technical personnel can perform subsequent configurations.