This document describes how to connect a self-managed Hadoop cluster to DataWorks for task development and how to create a custom runtime environment for the cluster.
Background
You can connect DataWorks to your self-managed Hadoop cluster by providing its service address when you register it as a CDH cluster. You can then use the default CDH image in DataWorks to create a runtime environment that matches your cluster's component versions. This allows you to schedule and run jobs from your self-managed Hadoop cluster directly within DataWorks.
Prerequisites
Before you create a custom image, you must prepare the cluster, DataWorks, and Object Storage Service (OSS) environments.
- You have a self-managed Hadoop cluster.
- You have activated DataWorks and created a DataWorks workspace and a serverless resource group.
  If you need to download installation packages from a public OSS endpoint, the serverless resource group requires internet access. To enable this, configure a NAT gateway for the serverless resource group. For more information, see Overview of network connectivity solutions.
- You have activated Object Storage Service (OSS) and created a bucket. Use this bucket to upload and store the custom Spark installation package and Hadoop installation package, making them accessible to the image creation script.
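Before you build the image, it can help to confirm that the network path to the public OSS endpoint is open. The following is a minimal sketch, assuming it runs in an environment that shares the resource group's network. The function name `can_reach` and the endpoint in the usage comment are illustrative assumptions and do not come from DataWorks.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Usage (hypothetical endpoint; replace it with your bucket's public endpoint):
#   can_reach("oss-cn-hangzhou.aliyuncs.com")
```

A result of `False` suggests that the resource group has no route to the internet, which is the case the NAT gateway is meant to address. The check only tests TCP reachability and does not verify bucket permissions.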
Step 1: Connect a self-managed cluster to DataWorks
Bind your self-managed Hadoop cluster to DataWorks as a compute resource. The binding process differs depending on whether your workspace uses Data Studio (New Version), so follow the documentation that matches your workspace:
- For workspaces using the new Data Studio: Connect a compute engine.
- For workspaces using the previous version of Data Studio: Previous Data Studio: Connect a CDH compute engine.
Step 2: Customize the runtime environment
DataWorks allows you to build a custom image based on the default official CDH image. This custom image serves as the task runtime environment in DataWorks for your self-managed cluster. Follow the steps below to prepare your installation packages and build the new image.
Prepare installation packages
Before you create a custom image, obtain the required component installation packages. You can either extract these packages from your existing self-managed Hadoop cluster or download them directly. After obtaining the packages, upload them to Object Storage Service (OSS).
- Obtain the component installation packages.
  - Locate the installation directory of the required components in your self-managed Hadoop cluster and extract the packages.
  - Download the required versions of the component installation packages.
    This example uses open source Spark and Hadoop installation packages. You can find them at the following locations:
    - Spark open source package: Apache Spark Archives.
    - Hadoop open source package: Apache Hadoop Archives.
    Note: This example uses the installation packages for Spark 3.4.2 and Hadoop 3.2.1.
- Upload the downloaded Spark and Hadoop installation packages to your OSS bucket.
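As an illustration of the download step, the sketch below derives archive URLs for the versions used in this example and computes a SHA-512 digest of a downloaded file for comparison with the published checksum. It assumes the usual directory layout of archive.apache.org. The `bin-hadoop3` build variant and the helper names are assumptions, so choose the build that matches your cluster.

```python
import hashlib

# Assumed root of the Apache release archive.
ARCHIVE_ROOT = "https://archive.apache.org/dist"


def spark_archive_url(version: str, build: str = "bin-hadoop3") -> str:
    """Build the archive URL for a Spark release (build variant is an assumption)."""
    return f"{ARCHIVE_ROOT}/spark/spark-{version}/spark-{version}-{build}.tgz"


def hadoop_archive_url(version: str) -> str:
    """Build the archive URL for a Hadoop binary release."""
    return f"{ARCHIVE_ROOT}/hadoop/common/hadoop-{version}/hadoop-{version}.tar.gz"


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(spark_archive_url("3.4.2"))
print(hadoop_archive_url("3.2.1"))
```

Comparing the digest with the checksum file published next to each archive helps catch a corrupted download before you upload the package to OSS.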
Build a new image
To create a custom image, you write a script that downloads the Spark and Hadoop installation packages from your OSS bucket and installs them into the base CDH image. After the installation is complete, you build and publish the custom image for use in data development.
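The logic such a script performs can be sketched as follows. This document does not show the script itself, so this is only a Python illustration of the download-and-unpack logic, not the script format that DataWorks expects. The function name, the URLs, and the install directories are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path


def install_package(url: str, install_dir: str, work_dir: str = "/tmp") -> Path:
    """Download a gzip-compressed tar archive from url and unpack it into install_dir."""
    target = Path(install_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Save the archive under work_dir, keeping the file name from the URL.
    archive = Path(work_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))

    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: reject unsafe members such as absolute paths.
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)

    archive.unlink()  # Remove the archive to keep the image small.
    return target


# Usage (hypothetical bucket URLs and target directories):
#   install_package("https://<bucket>.<endpoint>/spark-3.4.2-bin-hadoop3.tgz", "/opt/spark")
#   install_package("https://<bucket>.<endpoint>/hadoop-3.2.1.tar.gz", "/opt/hadoop")
```

Deleting the archive after extraction is a design choice to reduce image size. Where the components must be installed so that the CDH task types pick them up is not stated here and depends on your cluster configuration.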
- Create a custom image.
  - Log on to the DataWorks console. In the top navigation bar, select the region where your workspace resides. In the left-side navigation pane, click Image Management, and then click the Custom Image tab.
  - Click Create Image. Configure the parameters for the custom image. For more information, see Image Management. The following table describes the key parameters.

    | Parameter | Description | Example |
    | --- | --- | --- |
    | Image name/ID | You can choose from various base images. To build a custom image for a Hadoop cluster, select the official CDH image provided by DataWorks. | From the drop-down list, select dataworks_cdh_custom_task_pod. |
    | Supported task types | The CDH image supports CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala tasks. Select the task types you need. | For this example, select all task types supported by the CDH image. |
    | Installation package | In this section, you must provide a script to download and install the Hadoop and Spark packages that you uploaded to OSS. You can customize the script to replace the packages as needed. | From the drop-down list, select Script. |

  - After configuring the parameters, click Determine to create the image.
- Build and publish the custom image.
  After you create the custom image, you must build and publish it before you can use it in Data Studio. Follow these steps to build and publish the image:
  - After the custom image is created, find it in the list and click Deploy in the Actions column to begin the testing and publishing process.
    The image's status changes to Not Tested.
  - In the Publish Image panel, select a resource group from the Test Resource Group drop-down list to test the image. After the test is successful, click Deploy.
    Note: If you need to download installation packages from a public OSS endpoint, the test resource group requires internet access. To enable this, configure a NAT gateway for the serverless resource group. For more information, see Overview of network connectivity solutions.
Step 3: Run tasks with the custom environment
After the image is published, you can use it for data development. The method for using the custom image depends on whether your workspace uses the new version of Data Studio.
Log on to the DataWorks console. Select your region, and in the left-side navigation pane, choose . Select your workspace from the drop-down list and enter Data Studio/Data Development.
+ icon and choose
+ icon and choose
Run icon. In the configuration dialog box, select your custom image, for example,
Properties icon. In the configuration dialog box, select your custom image, for example,