DataWorks: Develop tasks on a self-managed Hadoop cluster

Last Updated: Jun 20, 2026

This document describes how to connect a self-managed Hadoop cluster to DataWorks for task development and how to create a custom runtime environment for the cluster.

Background

You can connect DataWorks to your self-managed Hadoop cluster by providing its service address when you register it as a CDH cluster. You can then use the default CDH image in DataWorks to create a runtime environment that matches your cluster's component versions. This allows you to schedule and run jobs from your self-managed Hadoop cluster directly within DataWorks.

Prerequisites

Before you create a custom image, you must prepare the cluster, DataWorks, and Object Storage Service (OSS) environments.

  • You have a self-managed Hadoop cluster.

  • You have activated DataWorks and created a DataWorks workspace and a serverless resource group.

    If you need to download installation packages from a public OSS endpoint, the serverless resource group requires internet access. To enable this, configure a NAT gateway for the serverless resource group. For more information, see Overview of network connectivity solutions.

  • You have activated Object Storage Service (OSS) and created a bucket. The bucket stores the custom Spark and Hadoop installation packages so that the image creation script can download them.

Step 1: Connect a self-managed cluster to DataWorks

Bind your self-managed Hadoop cluster to DataWorks as a compute resource. The binding process differs depending on whether your workspace uses Data Studio (New Version), so follow the documentation that matches your workspace.

Step 2: Customize the runtime environment

DataWorks allows you to build a custom image based on the default official CDH image. This custom image serves as the task runtime environment in DataWorks for your self-managed cluster. Follow the steps below to prepare your installation packages and build the new image.

Prepare installation packages

Before you create a custom image, obtain the required component installation packages. You can either extract these packages from your existing self-managed Hadoop cluster or download them directly. After obtaining the packages, upload them to Object Storage Service (OSS).

  1. Obtain the component installation packages.

    • Locate the installation directory of the required components in your self-managed Hadoop cluster and extract the packages.

    • Download the required versions of the component installation packages.

      This example uses open source Spark and Hadoop installation packages. You can find them at the following locations:

      Note

      This example uses the installation packages for Spark 3.4.2 and Hadoop 3.2.1.

  2. Upload the downloaded Spark and Hadoop installation packages to your OSS bucket.
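
The packaging and upload steps above can be sketched in shell as follows. This is a minimal sketch rather than the documented procedure: package_component and oss_object_url are hypothetical helper names, the directory, bucket, and endpoint values are placeholders, and the commented ossutil command assumes the ossutil CLI is installed and configured for your account. The URL format shown applies to publicly readable objects only; a private object needs a signed URL instead.

```shell
# package_component SRC_DIR OUT_TGZ
# Packs an installed component directory into a .tgz whose single top-level
# entry is the directory name (for example, hadoop-3.2.1/).
package_component() {
  local src_dir="$1"
  local out_tgz="$2"
  tar -czf "$out_tgz" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# oss_object_url BUCKET ENDPOINT OBJECT_KEY
# Prints the download URL of a publicly readable object.
oss_object_url() {
  printf 'https://%s.%s/%s\n' "$1" "$2" "$3"
}

# Example usage (all values are placeholders):
#   package_component /opt/hadoop-3.2.1 hadoop-3.2.1.tar.gz
#   ossutil cp hadoop-3.2.1.tar.gz oss://your-bucket/packages/   # assumes ossutil is configured
#   oss_object_url your-bucket oss-cn-hangzhou.aliyuncs.com packages/hadoop-3.2.1.tar.gz
```

Keeping the archive's top-level directory name equal to the component directory name matters because the custom script in the next section moves that directory by name after unpacking.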

Build a new image

To create a custom image, you write a script that downloads the Spark and Hadoop installation packages from your OSS bucket and installs them into the base CDH image. After the installation is complete, you build and publish the custom image for use in data development.

  1. Create a custom image.

    1. Log on to the DataWorks console. In the top navigation bar, select the region where your workspace resides. In the left-side navigation pane, click Image Management, and then click the Custom Image tab.

    2. Click Create Image. Configure the parameters for the custom image. For more information, see Image Management. The following table describes the key parameters.

      Parameter: Image name/ID
      Description: You can choose from various base images. To build a custom image for a Hadoop cluster, select the official CDH image provided by DataWorks.
      Example: From the drop-down list, select dataworks_cdh_custom_task_pod.

      Parameter: Supported task types
      Description: The CDH image supports CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala tasks. Select the task types you need.
      Example: For this example, select all task types supported by the CDH image.

      Parameter: Installation package
      Description:
      • In this section, you must provide a script to download and install the Hadoop and Spark packages that you uploaded to OSS.
      • You can customize the script to replace the packages as needed.
      Example: From the drop-down list, select Script.

      Parameter: Custom script
      Example:


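      # Create the directory that holds the custom component installations.
      mkdir -p /opt/taobao/tbdpapp/cdh/custom

      # Download and unpack Spark.
      wget -O spark-3.4.2-bin-hadoop3.tgz "{Your OSS Download URL}"
      tar zxf spark-3.4.2-bin-hadoop3.tgz
      mv spark-3.4.2-bin-hadoop3 /opt/taobao/tbdpapp/cdh/custom

      # Download and unpack Hadoop.
      wget -O hadoop-3.2.1.tar.gz "{Your OSS Download URL}"
      tar zxf hadoop-3.2.1.tar.gz
      mv hadoop-3.2.1 /opt/taobao/tbdpapp/cdh/custom

      # Add the component bin directories to PATH for the admin user.
      # printf is used because echo does not expand "\n" in every shell;
      # single quotes keep $PATH literal so that it is resolved at login.
      printf '\nexport PATH=/opt/taobao/tbdpapp/cdh/custom/hadoop-3.2.1/bin:/opt/taobao/tbdpapp/cdh/custom/spark-3.4.2-bin-hadoop3/bin:$PATH\n' >> /home/admin/.bashrc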
      Note
      • Replace {Your OSS Download URL} with the actual download URL of your package. For more information, see Use object URLs.

        • If an OSS object is public: Provide the download URL.

        • If an OSS object is private: Provide a signed URL and ensure it has not expired.

      • The versions in this sample code are for reference only. Use the versions that correspond to the packages you uploaded to OSS.

    3. After configuring the parameters, click OK to create the image.

  2. Build and publish the custom image.

    After you create the custom image, you must build and publish it before you can use it in Data Studio. Follow these steps to build and publish the image:

    1. After the custom image is created, find it in the list and click Deploy in the Actions column to begin the testing and publishing process.

      The image's status changes to Not Tested.

    2. In the Publish Image panel, select a resource group from the Test Resource Group drop-down list to test the image. After the test is successful, click Deploy.

    Note

    If you need to download installation packages from a public OSS endpoint, the test resource group requires internet access. To enable this, configure a NAT gateway for the serverless resource group. For more information, see Overview of network connectivity solutions.
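
The custom script used when creating the image can be made more defensive, which helps when a test run fails because a download URL has expired or the resource group has no internet access. The following is a sketch under stated assumptions, not the official script: install_package and missing_tools are hypothetical helper names, and this document does not state whether a non-zero exit status aborts the image build, so treat the error handling as a diagnostic aid.

```shell
# install_package URL ARCHIVE_NAME DEST_DIR
# Downloads an archive with wget, unpacks it under DEST_DIR, and returns
# non-zero if any step fails so that a broken download is not silently ignored.
install_package() {
  local url="$1"
  local archive="$2"
  local dest="$3"
  local workdir
  workdir="$(mktemp -d)" || return 1
  wget -O "$workdir/$archive" "$url" || return 1
  mkdir -p "$dest" || return 1
  tar -xzf "$workdir/$archive" -C "$dest" || return 1
  rm -rf "$workdir"
}

# missing_tools TOOL...
# Prints the name of every tool that does not resolve on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example usage (replace the URL placeholder, as in the custom script):
#   install_package "{Your OSS Download URL}" hadoop-3.2.1.tar.gz /opt/taobao/tbdpapp/cdh/custom
#   missing_tools hadoop spark-submit
```

Unpacking directly into the destination avoids the separate mv step, and an empty result from missing_tools suggests that the PATH entry written to .bashrc has taken effect in the current shell.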

Step 3: Run tasks with the custom environment

After the image is published, you can use it for data development. The method for using the custom image depends on whether your workspace uses the new version of Data Studio.

Log on to the DataWorks console. Select your region, and in the left-side navigation pane, choose Data Development & O&M > Data Studio. Select your workspace from the drop-down list and enter Data Studio/Data Development.

New Data Studio

  1. Create a CDH node.

    On the Data Studio page, click the + icon and choose Create Node > cdh > CDH Hive. Enter a name for the node and press Enter.

  2. Configure the image for the CDH Hive node.

    • Run Configuration

      1. Double-click the CDH Hive node to open the editor tab. On the right side, click Run Configuration.

      2. Click the DataWorks configuration tab and select the image you created.

        • Resource Group: Select the serverless resource group.

        • Image: Select the custom image that you published and associated with the current workspace.

      The default value for Compute CU is 0.25. This node uses the default value, so no changes are needed.

    • Scheduling Configuration

      1. Double-click the CDH Hive node to open the editor tab. On the right side, click Scheduling.

      2. Click the Scheduling Policy tab and configure the parameters.

        • Resource Group for Scheduling: Select the serverless resource group.

        • Image: Select the custom image that you published and associated with the current workspace.

Note
  • The CDH image supports the following node types: CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala.

  • To ensure that the task node runs smoothly, make sure the Resource Group for Scheduling is the same as the Test Resource Group you selected when you published the image.

  • If your target resource group is not displayed, check whether it is associated with the current workspace. You can go to the Resource Group List page, find the target resource group, and click Bind to Workspace in the Operation column.

Previous Data Studio

  1. Create a CDH node.

    1. Click the + icon and choose Create Node > cdh > CDH Hive.

      Parameter: Engine Instance
      Description: Select the CDH cluster that you registered when connecting your self-managed cluster to DataWorks.

      Parameter: Node Type
      Description: CDH Hive.

      Parameter: Path
      Description:
      • Select the workflow where the CDH Hive node is located.
      • Example: Workflow.

      Parameter: Name
      Description: Enter a custom name for the node.

      After you configure the parameters, click OK.

    2. Double-click the CDH Hive node to open the editor tab.

      After you finish writing the CDH Hive code, you can configure the image for the node and run a test.

      • Run with parameters.

        In the toolbar, click the Run icon. In the configuration dialog box, select your custom image, for example, dw_cdh_image.

        • Resource Group Name: Select a serverless resource group.

        • Image: Select the custom image that you published and associated with the current workspace.

        The Run CU parameter uses the default value of 0.25 and does not need to be changed. In the Custom Parameter area, enter values for any custom parameters, and then click Run.

      • Scheduling configuration.

        In the toolbar, click the Properties icon. In the configuration dialog box, select your custom image, for example, dw_cdh_image.

        • Resource Group for Scheduling: Select a serverless resource group.

        • Image: Select the custom image that you published and associated with the current workspace.

        The Scheduling CU parameter defaults to 0.25, and the Effective Date defaults to Perpetual.

Note
  • The CDH image supports the following node types: CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala.

  • To ensure that the task node runs smoothly, make sure the Resource Group for Scheduling is the same as the Test Resource Group you selected when you published the image.

  • If your target resource group is not displayed, check whether it is associated with the current workspace. You can go to the Resource Group List page, find the target resource group, and click Bind to Workspace in the Operation column.