Cloudera's Distribution including Apache Hadoop (CDH) can be integrated into DataWorks. This allows you to configure your CDH clusters as storage and computing engines in DataWorks. Then, you can use DataWorks features, such as node development, node scheduling, Data Map (metadata management), and Data Quality, to develop and manage CDH data and nodes. This topic describes how to integrate CDH into DataWorks and use CDH in DataWorks.

Prerequisites

  • A CDH cluster is deployed on an Elastic Compute Service (ECS) instance. Network settings such as a virtual private cloud (VPC) and a security group are configured for the CDH cluster.
  • DataWorks is activated, and a workspace is created to connect to the CDH cluster.
    Note The workspaces that are used to connect to CDH clusters do not need to be associated with computing engines. Therefore, when you create the workspace, you do not need to select an engine. For more information about how to create a workspace, see Create a workspace.
  • An account that has administrative permissions on the workspace is created. Only workspace administrators can add CDH cluster configurations to DataWorks. For more information about how to grant administrative permissions on a workspace to an account, see Add members.
  • A DataWorks exclusive resource group for scheduling is created. For more information, see Exclusive resource group mode.
Before you use CDH in DataWorks, you must perform the following operations to integrate CDH into DataWorks:
  1. Step 1: Obtain the configuration information of the CDH cluster
  2. Step 2: Configure network connectivity
  3. Step 3: Add the configurations of the CDH cluster to DataWorks

After you complete the preceding operations, you can develop and run CDH nodes in DataWorks and view the status of the nodes in DataWorks Operation Center. For more information, see Use DataWorks to develop nodes and Configure O&M and monitoring settings.

You can also use the Data Quality and Data Map services of DataWorks to manage CDH data and nodes. For more information, see Configure data quality rules and Use Data Map to collect data.

Limits

  • To use CDH features in DataWorks, you must create a DataWorks exclusive resource group for scheduling.
  • Only CDH clusters that are deployed on Alibaba Cloud ECS instances can be connected to DataWorks.
  • The exclusive resource group for scheduling must be associated with the VPC to which the CDH cluster belongs.
  • You can use the Data Map service of DataWorks to collect the information of CDH clusters, such as database metadata. However, you can collect the information of only Hive databases.

Step 1: Obtain the configuration information of the CDH cluster

  1. Obtain the version of the CDH cluster. The version information is required when you add the configurations of the CDH cluster to DataWorks.
    Log on to the Cloudera Manager Admin Console. You can view the version information on the right side of the cluster name.
  2. Obtain the host and component addresses of the CDH cluster. The addresses are required when you add the configurations of the CDH cluster to DataWorks.
    • Method 1: Use the DataWorks JAR package to obtain the addresses.
      1. Log on to the Cloudera Manager Admin Console and download the DataWorks JAR package.
        wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar
      2. Run the JAR package.
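        # Make the cluster's bundled JDK available on PATH. Adjust the path to match your JDK installation.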
        export PATH=$PATH:/usr/java/jdk1.8.0_181-cloudera/bin
        java -jar dw-tools.jar <user> <password>
        Set <user> and <password> to the username and password of the Cloudera Manager Admin Console.
      3. View the host and component addresses of the CDH cluster in the returned results. Then, record the addresses.
    • Method 2: Obtain the addresses from the Cloudera Manager Admin Console.
      Log on to the Cloudera Manager Admin Console and select Roles from the Hosts drop-down list. Find the components that you want to configure based on keywords and icons. Then, view and record the hostnames that are displayed on the left, and construct the component addresses from the hostnames and the address formats. For more information about the default port numbers that are required in the addresses, see the results returned in Method 1. The following abbreviations identify the components (example address formats are provided after this list):
      • HS2: HiveServer2
      • HMS: Hive Metastore
      • ID: Impala Daemon
      • RM: YARN ResourceManager
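      For reference, these components typically listen on the following default ports. The hostnames below are placeholders, and your cluster may use non-default ports, so verify the values against the results returned in Method 1.
        HiveServer2 (HS2): jdbc:hive2://<hostname>:10000
        Hive Metastore (HMS): thrift://<hostname>:9083
        Impala Daemon (ID): jdbc:impala://<hostname>:21050
        YARN ResourceManager (RM): <hostname>:8032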
  3. Obtain the configuration files of the CDH cluster. The configuration files need to be uploaded when you add the configurations of the CDH cluster to DataWorks.
    1. Log on to the Cloudera Manager Admin Console.
    2. On the Status tab, click the drop-down arrow on the right of the cluster name and select View Client Configuration URLs.
    3. In the Client Configuration URLs dialog box, download the YARN configuration package.
  4. Obtain the network information of the CDH cluster. The network information is used when you configure network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.
    Only CDH clusters that are deployed on Alibaba Cloud ECS instances can be connected to DataWorks.
    1. Log on to the ECS console.
    2. In the left-side navigation pane, click Instances. In the top navigation bar, select the region where the ECS instance that hosts the CDH cluster resides. On the Instances page, find the ECS instance and click its ID. On the Instance Details tab of the page that appears, view the security group, VPC, and vSwitch to which the ECS instance belongs. Then, record the information.

Step 2: Configure network connectivity

By default, DataWorks exclusive resource groups for scheduling are not connected to the networks of resources for other Alibaba Cloud services after the resource groups are created. Therefore, before you use CDH, you must obtain the network information of your CDH cluster. Then, associate your DataWorks exclusive resource group for scheduling with the VPC to which the CDH cluster belongs. This ensures network connectivity between the CDH cluster and DataWorks exclusive resource group for scheduling.

  1. Go to the network configuration page of the exclusive resource group for scheduling.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Resource Groups. The Exclusive Resource Groups tab appears.
    3. Click Network Settings in the Actions column that corresponds to your exclusive resource group for scheduling.
  2. Associate the exclusive resource group for scheduling with the VPC to which the CDH cluster belongs.
    On the VPC Binding tab, click Add Binding. In the Add VPC Binding panel, select the VPC, vSwitch, and security group that you recorded in sub-step 4 of Step 1. Then, click OK.
  3. Configure hosts.
    Click the Hostname-to-IP Mapping tab. On this tab, click Batch Modify. In the Batch Modify Hostname-to-IP Mappings dialog box, enter the hostname-to-IP mappings for the host addresses that you recorded in sub-step 2 of Step 1.
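    The mappings use the standard hosts format of one IP address followed by the hostname per line. The following values are hypothetical examples:
      192.168.0.1 cdh-master-1
      192.168.0.2 cdh-worker-1
      192.168.0.3 cdh-worker-2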

Step 3: Add the configurations of the CDH cluster to DataWorks

Only workspace administrators can add the configurations of CDH clusters to DataWorks. Therefore, you must use an account that has the administrative permissions on your workspace to perform this operation.

  1. Go to the Workspace Management tab.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace and click Workspace Settings in the Actions column.
    4. In the Workspace Settings panel, click More.
  2. In the left-side navigation pane of the page that appears, click Hadoop Config.
  3. On the CDH Cluster Configuration page, click Create Now. In the Create CDH Cluster Configuration dialog box, enter the component addresses that you recorded in sub-step 2 of Step 1 in the related fields.
    Configuration information:
    • Cluster name: the name of your CDH cluster. You can customize the name.
    • Versions: Select the CDH cluster version and component versions based on actual conditions.
    • Addresses: Enter the recorded component addresses.
      • jobhistory.webapp.address for YARN: Change the port number in the value of yarn.resourcemanager.address to 8088. For example, if yarn.resourcemanager.address is <hostname>:8032, set this parameter to <hostname>:8088.
      • JDBC URL for Presto: Presto is not a default component of CDH. Specify this parameter based on your actual deployment, for example, jdbc:presto://<hostname>:8080/hive/default.
  4. Upload the configuration files that you obtained in sub-step 3 of Step 1 and associate the CDH cluster with a workspace.
  5. Click Confirm.
    After the configurations of the CDH cluster are added to DataWorks, you can add the CDH cluster to the associated workspace as a compute engine instance. Then, you can develop and run CDH nodes in the workspace.

Step 4: Add the CDH cluster to the associated workspace as a compute engine instance

  1. On the Workspaces page, click Workspace Settings in the Actions column that corresponds to the associated workspace.
  2. In the lower part of the Workspace Settings panel, click More. In the Computing Engine information section of the Workspace Management tab, click the CDH tab. Then, on the CDH tab, click Add instances. In the dialog box that appears, configure the parameters.
    1. Specify Instance display name.
    2. Select the CDH cluster whose configurations you added in Step 3: Add the configurations of the CDH cluster to DataWorks.
    3. Configure access authentication information for the CDH cluster.
      You must specify an account that is used to access the CDH cluster. We recommend that you use the admin account. No password is required.
    4. Select the created exclusive resource group for scheduling.
    5. Click Test connectivity.
      If the connectivity test fails, the exclusive resource group for scheduling may not be associated with the VPC to which the CDH cluster belongs, or hostname-to-IP mappings may not be configured for the resource group. For more information about how to configure the network settings of the exclusive resource group for scheduling, see Step 2: Configure network connectivity.
  3. Click Confirm.
    Then, the system starts to initialize the exclusive resource group for scheduling. During the initialization, the system installs the client that is used to access the CDH cluster and uploads the configuration files of the CDH cluster. After the value of Initialization Status of Resource Group on the CDH tab changes from Preparing to Complete, the CDH cluster is added to the workspace as a compute engine instance.
  4. Click Test connectivity next to Test Service Connectivity on the CDH tab. Then, DataWorks runs a test task to check whether the client is installed and the configuration files are uploaded.
    If the test fails, you can view logs and submit a ticket to the technical support personnel of DataWorks.

Use DataWorks to develop nodes

After you add the CDH compute engine instance, you can create and run CDH Hive, Spark, MapReduce, Impala, or Presto nodes in DataStudio. You can also configure scheduling properties for the nodes. In this section, a CDH Hive node is used as an example.

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace and click Data Analytics in the Actions column.
  2. On the DataStudio page, move the pointer over the Create icon and click Workflow. In the Create Workflow dialog box, configure the parameters and click Create.
  3. In the left-side navigation pane, click Business Flow, find the created workflow, and then click the workflow name. Right-click CDH and choose Create > CDH Hive.
  4. In the code editor, write Hive SQL code. After you write the code, click the Run icon in the top toolbar, select the exclusive resource group for scheduling, and run the node. After the node is run, you can view the execution results of the Hive SQL code. A sample query is provided after this list.
  5. If you want to configure scheduling properties for the node, click the Properties tab in the right-side navigation pane. In the panel that appears, configure time properties, resource properties, and scheduling dependencies. Then, click Submit in the top toolbar. After the node is committed, the system runs the node based on the configured scheduling properties. For more information about how to configure the scheduling properties, see Basic properties.
  6. Go to the Operation Center page and view the running status of the node on the Cycle Task tab. For more information, see View auto triggered nodes.
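The following minimal Hive SQL sketch shows the kind of query you might run in a CDH Hive node to verify that the node works end to end. The table name is a hypothetical placeholder:
  -- Create a sample table, insert one row, and query it back.
  CREATE TABLE IF NOT EXISTS dw_cdh_demo (id INT, name STRING);
  INSERT INTO dw_cdh_demo VALUES (1, 'hello_cdh');
  SELECT id, name FROM dw_cdh_demo;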

Configure O&M and monitoring settings

CDH nodes support the intelligent monitoring feature provided by DataWorks Operation Center. This feature allows you to customize alert rules and configure node alerting. The system automatically generates alerts based on the customized alert rules. For more information about how to customize alert rules, see Manage custom alert rules. For more information about how to configure node alerting, see Manage baselines.

Configure data quality rules

When you use CDH in DataWorks, you can use the Data Quality service of DataWorks to query and compare data, monitor data quality, scan SQL code, and perform intelligent alerting. For more information about the Data Quality service, see Overview.

Use Data Map to collect data

When you use CDH in DataWorks, you can use the Data Map service of DataWorks to collect the metadata of Hive databases, tables, fields, and partitions in the CDH cluster. This facilitates global data searches, viewing of metadata details, data preview, data lineage management, and data category management.
Note You cannot use Data Map to collect the metadata of other types of databases in CDH clusters.
For more information about the Data Map service and related configurations, see Overview.

If you want to monitor the metadata changes of Hive databases in a CDH cluster in real time or view lineage and metadata change records in Data Map, associate DataWorks Hive hooks with the CDH cluster. Then, use Log Service to collect the logs generated by the hooks.

After the Hive hooks are configured, metadata changes are recorded in log files that match /tmp/hive/hook.event.*.log on the HiveServer2 and Hive Metastore hosts. You can then use Log Service to collect the change records for DataWorks to read. Download the DataWorks tool dw-tools.jar, create a config.json file in the same directory, and specify the configuration items in the file. Then, run the tool to enable log collection with one click.

To configure Hive hooks and collect logs from the hooks, perform the following steps:

  1. Configure Hive hooks.
    1. Log on to the HiveServer2 and Hive Metastore hosts and go to the /var/lib/hive directory to download DataWorks Hive hooks.
      # Download dataworks-hive-hook-2.1.1.jar for CDH 6.x clusters.
      wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dataworks-hive-hook-2.1.1.jar
      # Download dataworks-hive-hook-1.1.0-cdh5.16.2.jar for CDH 5.x clusters.
      wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dataworks-hive-hook-1.1.0-cdh5.16.2.jar
    2. Log on to the Cloudera Manager Admin Console and click Hive below the cluster name. On the page that appears, click the Configuration tab. Then, set Hive Auxiliary JARs Directory to /var/lib/hive.
    3. For Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, specify the Name and Value fields based on the following information:
      <property>
        <name>hive.exec.post.hooks</name>
        <value>com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger,com.aliyun.dataworks.meta.hive.hook.LineageLoggerHook</value>
      </property>
    4. For Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml, specify the Name and Value fields based on the following information:
      <property>
        <name>hive.metastore.event.listeners</name>
        <value>com.aliyun.dataworks.meta.hive.listener.MetaStoreListener</value>
      </property>
      <property>
        <name>hive.metastore.pre.event.listeners</name>
        <value>com.aliyun.dataworks.meta.hive.listener.MetaStorePreAuditListener</value>
      </property>
    5. After the Hive hooks are configured, deploy the client configuration as prompted in the Cloudera Manager Admin Console. Then, restart the Hive service.
      Note If the restart fails, keep the logs for troubleshooting. To prevent normal operations from being affected, you can remove the added configuration and restart the Hive service. If the restart succeeds after the configuration is added, check whether log files whose names start with hook.event, such as hook.event.1608728145871.log, are generated in the /tmp/hive/ directory on the hosts. A quick check is shown below.
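      You can run the following check on the HiveServer2 and Hive Metastore hosts. The timestamp in the file names varies:
        # List hook event logs. At least one file should appear after a Hive statement is executed.
        ls -l /tmp/hive/hook.event.*.log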
  2. Collect logs from the Hive hooks.
    1. Log on to the Cloudera Manager Admin Console and download the DataWorks JAR package.
      wget https://dataworks-public-tools.oss-cn-shanghai.aliyuncs.com/dw-tools.jar
    2. Create a config.json file in the directory where the DataWorks tool is located. Then, modify the file based on the following code and save the file:
      // config.json
      {
          "accessId": "<accessId>",
          "accessKey": "<accessKey>",
          "endpoint": "cn-shanghai-intranet.log.aliyuncs.com",
          "project": "onefall-test-pre",
          "clusterId": "1234",
          "ipList": "192.168.0.1,192.168.0.2,192.168.0.3"
      }
      Fields:
      • accessId: the AccessKey ID of your Alibaba Cloud account.
      • accessKey: the AccessKey secret of your Alibaba Cloud account.
      • endpoint: the internal endpoint that is used to access your Log Service project. For more information, see Endpoints.
      • project: the name of your Log Service project. For more information about how to obtain the name, see Manage a project.
      • clusterId: the ID of the CDH cluster generated for DataWorks. You can submit a ticket to obtain the ID.
      • ipList: the IP addresses of all HiveServer2 and Hive Metastore hosts. Separate the IP addresses with commas (,). The hosts are those where the DataWorks Hive hooks are deployed.
    3. Run the tool with the config.json file.
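      # Create the Log Service collection configuration that is defined in config.json.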
      java -cp dw-tools.jar com.aliyun.dataworks.tools.CreateLogConfig config.json
    4. Install the client.
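      # Download and install Logtail, the Log Service data collection agent, for the specified region.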
      wget http://logtail-release-cn-shanghai.oss-cn-shanghai.aliyuncs.com/linux64/logtail.sh -O logtail.sh; chmod 755 logtail.sh; ./logtail.sh install cn-shanghai
      Replace cn-shanghai with the ID of the region where your Log Service project resides, for example, cn-hangzhou.
  3. After you complete the preceding steps, a Logstore named hive-event, a Logtail configuration named hive-event-config, and a machine group named hive-servers are generated in your Log Service project. View and record your Alibaba Cloud account ID, the endpoint of your Log Service project, and other information about the project. Then, submit a ticket to send the recorded information to the technical support personnel of DataWorks, who perform the subsequent configurations.