
DataWorks:Develop tasks based on a self-managed Hadoop cluster

Last Updated: Feb 07, 2026

This topic describes how to associate a self-managed Hadoop cluster with a workspace in DataWorks to develop tasks. This topic also describes how to configure a custom runtime environment for a self-managed Hadoop cluster.

Background information

When you register a Cloudera's Distribution Including Apache Hadoop (CDH) cluster in DataWorks, you can configure the endpoint of a self-managed Hadoop cluster so that DataWorks can access the cluster. You can then use the default CDH image of DataWorks to build a runtime environment that contains the required component versions, and schedule the jobs of the self-managed Hadoop cluster in DataWorks.

Prerequisites

Before you create a custom image, prepare the cluster environment, the DataWorks environment, and an Object Storage Service (OSS) bucket:

  • A self-managed Hadoop cluster is created.

  • DataWorks is activated, a DataWorks workspace is created, and a serverless resource group is created.

    If you want to download an installation package from the OSS public endpoint, make sure that the serverless resource group can access data sources over the Internet. To enable access to data sources over the Internet, you must configure a network address translation (NAT) gateway for the virtual private cloud (VPC) with which the serverless resource group is associated. For more information, see the Overview of network connectivity solutions section in the "Network connectivity solutions" topic.

  • OSS is activated and a bucket is created. The bucket is used to upload and store the Spark and Hadoop installation packages that you want to configure, so that the script of the custom image can download them. For a minimal command-line sketch of the bucket setup, see the example after this list.
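
If you set up the bucket from the command line, the following is a minimal sketch that uses the ossutil tool. The bucket name hadoop-pkg-demo is a hypothetical placeholder; bucket names must be globally unique.

    # Configure ossutil with your AccessKey pair and region endpoint (interactive).
    ossutil config

    # Create the bucket that will hold the installation packages.
    ossutil mb oss://hadoop-pkg-demo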

Step 1: Associate a self-managed Hadoop cluster with a DataWorks workspace

You can associate your self-managed Hadoop cluster with DataWorks as a computing resource. The association method differs between workspaces that have Use Data Studio (New Version) enabled and those that do not. Follow the instructions in the document that corresponds to your workspace environment.

Step 2: Configure a custom runtime environment for the self-managed cluster

DataWorks lets you build a custom image based on the default CDH image of DataWorks. The custom image is used as the runtime environment where jobs in your self-managed cluster are run in DataWorks. You can perform the operations in the following sections to prepare installation packages and build a custom image.

Prepare the environment installation packages for the custom cluster

Before you create a custom image, you must obtain the installation packages of the required components. The installation packages can be extracted from the existing self-managed Hadoop cluster. You can also directly download the installation packages of the required components. After you obtain the installation packages, upload the packages to an OSS bucket.

  1. Obtain the installation packages of the required components.

    • Locate the installation directory of the required components in the self-managed Hadoop cluster and extract the installation packages.

    • Download the installation packages of the components of the required versions.

      In this example, the open source Spark installation package and open source Hadoop installation package are used. You can download them from the official Apache release archives; a command-line sketch of the download and upload steps follows this list.

      Note

      This example uses the installation packages for Spark 3.4.2 and Hadoop 3.2.1.

  2. Upload the Spark and Hadoop installation packages to an OSS bucket.
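
The following shell sketch covers the download route and the upload step. The Apache archive URLs follow the standard layout for these releases; the bucket name (hadoop-pkg-demo) and the packages/ prefix are hypothetical placeholders, so substitute your own values.

    # Download the open source release packages from the Apache archives.
    wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz
    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

    # Upload both packages to OSS so that the image build script can fetch them later.
    ossutil cp spark-3.4.2-bin-hadoop3.tgz oss://hadoop-pkg-demo/packages/
    ossutil cp hadoop-3.2.1.tar.gz oss://hadoop-pkg-demo/packages/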

Build a new image based on the installation packages

To create a custom image, you must write a script to download the Spark and Hadoop installation packages that are stored in an OSS bucket, and then install the packages in the CDH image. After the installation is complete, build and publish the custom image for data development.

  1. Create a custom image.

    1. Log on to the DataWorks console. In the top navigation bar, select a desired region. In the navigation pane on the left, click Image Management.

    2. On the Custom Images tab of the Image Management page, click Create Image. The Create Image panel appears. The following table describes the key parameters used to create a custom image. For more information about the parameters, see Manage images.

      • Image Name/ID

        Description: Select the base image from the drop-down list. To create a custom image based on a Hadoop cluster, select the official CDH image provided by DataWorks.

        Example: Select dataworks_cdh_custom_task_pod from the drop-down list.

      • Supported Task Type

        Description: The CDH image supports tasks of the following types: CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala. Select task types as needed.

        Example: In this example, all task types that are supported by the CDH image are selected.

      • Installation Package

        Description: Write a script that downloads the Spark and Hadoop installation packages stored in the OSS bucket and installs them. You can replace the installation packages in the sample code as needed.

        Example: Select Script from the drop-down list and configure the following script code.

      mkdir -p /opt/taobao/tbdpapp/cdh/custom
      
      # Download and install the Spark package from OSS.
      wget -O spark-3.4.2-bin-hadoop3.tgz "{Download URL in OSS}"
      tar zxf spark-3.4.2-bin-hadoop3.tgz
      mv spark-3.4.2-bin-hadoop3 /opt/taobao/tbdpapp/cdh/custom
      
      # Download and install the Hadoop package from OSS.
      wget -O hadoop-3.2.1.tar.gz "{Download URL in OSS}"
      tar zxf hadoop-3.2.1.tar.gz
      mv hadoop-3.2.1 /opt/taobao/tbdpapp/cdh/custom
      
      # Prepend the Hadoop and Spark bin directories to the PATH of the admin user.
      # Single quotes keep $PATH unexpanded so it is resolved when .bashrc is
      # sourced, not when the image is built.
      printf '\nexport PATH=/opt/taobao/tbdpapp/cdh/custom/hadoop-3.2.1/bin:/opt/taobao/tbdpapp/cdh/custom/spark-3.4.2-bin-hadoop3/bin:$PATH\n' >> /home/admin/.bashrc
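
      Optionally, you can append a quick check to the end of the script so that a failed download or extraction surfaces during the image build instead of at task run time. This is a sketch, not part of the official sample, and it assumes the base image provides a Java runtime:

      # Fail fast if either component did not install correctly.
      /opt/taobao/tbdpapp/cdh/custom/hadoop-3.2.1/bin/hadoop version
      /opt/taobao/tbdpapp/cdh/custom/spark-3.4.2-bin-hadoop3/bin/spark-submit --version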
      Note
      • Replace {Download URL in OSS} with the actual download URL of each package. For more information about file download URLs, see Use object URLs.

        • If an OSS object is publicly readable, you can use its URL directly.

        • If an OSS object is private, use a signed download URL and make sure that the URL is within its validity period.

      • The component versions in the sample code are for reference only. Use the versions of the packages that you uploaded to OSS.
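
      For a private object, you can generate a signed URL with ossutil. The object path and timeout value below are illustrative:

      # Generate a download URL that stays valid for 3600 seconds.
      ossutil sign oss://hadoop-pkg-demo/packages/spark-3.4.2-bin-hadoop3.tgz --timeout 3600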

    3. After the configuration is complete, click OK.

  2. Build and publish the custom image.

    After you configure the custom image, you must build and publish the image so that you can use the image in Data Studio. You can perform the following steps to build and publish the custom image:

    1. After the custom image is created, click Publish in the Actions column to test the custom image.


    2. In the Publish Image panel, select a resource group from the Test Resource Group drop-down list to test the image. If the test succeeds, click Publish to publish the image.


    Note

    If you want to download an installation package from the OSS public endpoint, make sure that the test resource group can access data sources over the Internet. To enable access to data sources over the Internet, you must configure a NAT gateway for the VPC with which the resource group is associated. For more information, see the Overview of network connectivity solutions section in the "Network connectivity solutions" topic.
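
    If you need to set up that Internet access path from scratch, the following aliyun CLI outline sketches the main calls. All resource IDs are hypothetical placeholders, the exact parameters depend on your VPC layout, and you should treat this as an orientation aid rather than a complete procedure:

      # Create an enhanced NAT gateway in the VPC of the resource group.
      aliyun vpc CreateNatGateway --RegionId cn-hangzhou --VpcId vpc-xxxx --VSwitchId vsw-xxxx --NatType Enhanced

      # Allocate an elastic IP address and associate it with the NAT gateway.
      aliyun vpc AllocateEipAddress --RegionId cn-hangzhou
      aliyun vpc AssociateEipAddress --RegionId cn-hangzhou --AllocationId eip-xxxx --InstanceId ngw-xxxx --InstanceType Nat

      # Add an SNAT entry so that hosts in the vSwitch can reach the Internet through the EIP.
      aliyun vpc CreateSnatEntry --RegionId cn-hangzhou --SnatTableId stb-xxxx --SourceVSwitchId vsw-xxxx --SnatIp <EIP address>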

Step 3: Use the custom environment to run tasks

After the image is published, you can use it for data development. The method for using a custom image varies depending on whether your workspace uses the new version of Data Studio.

Log on to the DataWorks console. Switch to the required region, and in the navigation pane on the left, choose Data Development & O&M > Data Development. From the drop-down list, select the required workspace and click Go to Data Studio.

Run tasks in Data Studio (new version)

  1. Create a CDH node.

    On the Data Studio page, choose Create Node > CDH > CDH Hive. In the popover that appears, enter the node name and press Enter to create a CDH Hive node.


  2. Configure an image for the CDH Hive node.

    • Debugging Configurations

      1. Double-click the name of the CDH Hive node. The configuration tab of the node appears. In the right-side navigation pane of the configuration tab, click Run Configuration. The Debugging Configurations tab appears.

      2. Click DataWorks Configuration to navigate to the DataWorks Configurations section. In this section, configure the following parameters:

        • Resource Group: Select a serverless resource group.

        • Image: Select a custom image that is published and associated with the current workspace.


    • Scheduling Configuration

      1. Double-click the name of the CDH Hive node. The configuration tab of the node appears. In the right-side navigation pane of the configuration tab, click Scheduling Configuration. The Properties tab appears.

      2. Click Scheduling Policy to navigate to the Scheduling Policies section. In this section, configure the parameters.

        • Scheduling Resource Groups: Select a serverless resource group.

        • Image: Select a custom image that is published and associated with the current workspace.

Note
  • The CDH image supports the following node types: CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala.

  • For the task node to run smoothly, the resource group that you select for Scheduling Resource Groups must match the Test Resource Group that you selected when you published the image.

  • If the desired resource group is not displayed, it may not be associated with the current workspace. To associate the resource group, go to the Resource Groups page, find the resource group, and click Associate Workspace in the Actions column.
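
For reference, a CDH Spark node ultimately calls the spark-submit binary that the image script placed on the PATH. A minimal equivalent manual invocation, assuming the cluster's configuration is reachable through HADOOP_CONF_DIR (the path below is illustrative), looks like the following; the example jar ships with the Spark release:

    # Point Spark at the self-managed cluster's configuration files.
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Submit the bundled SparkPi example to YARN through the binaries installed by the custom image.
    spark-submit --master yarn --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      /opt/taobao/tbdpapp/cdh/custom/spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar 100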

Run tasks in DataStudio (old version)

  1. Create a CDH node.

    1. Choose Create Node > CDH > CDH Hive, and configure the following parameters:

      • Engine Instance: Select the CDH cluster that you registered when you associated the self-managed Hadoop cluster with the workspace.

      • Node Type: CDH Hive.

      • Path: Select the workflow in which the CDH Hive node resides. In this example, select Workflow.

      • Name: Enter a custom node name.

      After the configuration is complete, click Confirm.

    2. Double-click the name of the CDH Hive node. The configuration tab of the node appears.

      After you develop the code for the CDH Hive node, you can test the node and configure an image for it.

      • Run the node with parameters.

        Click the Run with Parameters button in the top toolbar. In the Node Test Run Configuration Parameters dialog box, configure the following parameters:

        • Resource Group Name: Select a serverless resource group.

        • Image: Select a custom image that is published and associated with the current workspace, such as dw_cdh_mirroring.

      • Configure the scheduling properties.

        In the top toolbar, open the parameter configuration dialog box and configure the following parameters:

        • Resource Group: Select a serverless resource group.

        • Image: Select the custom image that is published and associated with the current workspace, such as dw_cdh_mirroring.

Note
  • The CDH image supports the following node types: CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala.

  • For the task node to run smoothly, the resource group that you select for Scheduling Resource Groups must match the Test Resource Group that you selected when you published the image.

  • If the desired resource group is not displayed, it may not be associated with the current workspace. To associate the resource group, go to the Resource Groups page, find the resource group, and click Associate Workspace in the Actions column.