This topic describes how to associate a self-managed Hadoop cluster with a workspace in DataWorks to develop tasks. This topic also describes how to configure a custom runtime environment for a self-managed Hadoop cluster.
Background information
When you register a Cloudera's Distribution Including Apache Hadoop (CDH) cluster in DataWorks, you can specify the endpoint of a self-managed Hadoop cluster instead. This allows DataWorks to access the self-managed Hadoop cluster. You can then use the default CDH image of DataWorks to build a runtime environment that contains the components of the required versions and schedule the jobs of the self-managed Hadoop cluster in DataWorks.
Prerequisites
Before you create a custom image, make sure that the cluster environment, the DataWorks environment, and an Object Storage Service (OSS) bucket are prepared:
A self-managed Hadoop cluster is created.
DataWorks is activated, a DataWorks workspace is created, and a serverless resource group is created.
If you want to download an installation package from the OSS public endpoint, make sure that the serverless resource group can access data sources over the Internet. To enable access to data sources over the Internet, you must configure a network address translation (NAT) gateway for the virtual private cloud (VPC) with which the serverless resource group is associated. For more information, see the Overview of network connectivity solutions section in the "Network connectivity solutions" topic.
OSS is activated and a bucket is created. The bucket is used to upload and store the Spark installation package and Hadoop installation package that you want to configure. The script of the custom image reads the installation packages from this bucket.
Step 1: Associate a self-managed Hadoop cluster with a DataWorks workspace
You can associate your self-managed Hadoop cluster with DataWorks as a computing resource. The association method differs depending on whether your workspace uses the new version of Data Studio. Follow the instructions in the document that corresponds to your workspace environment.
To associate a computing resource with a workspace that joined the public preview of Data Studio, see Associate a computing resource with a workspace (Participate in Public Preview of Data Studio turned on).
To associate a computing resource with a workspace that has not joined the public preview of Data Studio, see DataStudio (old version): Associate a CDH computing resource.
Step 2: Configure a custom runtime environment for the self-managed cluster
DataWorks lets you build a custom image based on the default CDH image of DataWorks. The custom image is used as the runtime environment where jobs in your self-managed cluster are run in DataWorks. You can perform the operations in the following sections to prepare installation packages and build a custom image.
Prepare the environment installation packages for the custom cluster
Before you create a custom image, you must obtain the installation packages of the required components. The installation packages can be extracted from the existing self-managed Hadoop cluster. You can also directly download the installation packages of the required components. After you obtain the installation packages, upload the packages to an OSS bucket.
Obtain the installation packages of the required components.
Locate the installation directory of the required components in the self-managed Hadoop cluster and extract the installation packages.
Download the installation packages of the components of the required versions.
In this example, the open source Spark installation package and open source Hadoop installation package are used. Download URLs:
Spark open source package: Apache Spark Archives.
Hadoop open source package: Apache Hadoop releases.
Note: This example uses the installation packages for Spark 3.4.2 and Hadoop 3.2.1.
Upload the Spark and Hadoop installation packages to an OSS bucket.
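The preparation steps above can be sketched as a small shell script. This is a minimal sketch, not the exact commands required by DataWorks: the bucket name and the `packages/` prefix are placeholder assumptions, and the script requires `curl` plus an installed and configured `ossutil` CLI.

```shell
#!/bin/sh
# Build the Apache archive URL for a Spark release (binary built against Hadoop 3).
spark_url() {
  echo "https://archive.apache.org/dist/spark/spark-$1/spark-$1-bin-hadoop3.tgz"
}

# Build the Apache archive URL for a Hadoop release.
hadoop_url() {
  echo "https://archive.apache.org/dist/hadoop/common/hadoop-$1/hadoop-$1.tar.gz"
}

# Download a package and copy it into the OSS bucket that the image script reads.
stage_to_oss() {
  url="$1"; bucket="$2"
  pkg="$(basename "$url")"
  curl -fSLO "$url"
  ossutil cp "$pkg" "oss://$bucket/packages/$pkg"
}

# Example usage (replace my-dataworks-bucket with your bucket):
# stage_to_oss "$(spark_url 3.4.2)" my-dataworks-bucket
# stage_to_oss "$(hadoop_url 3.2.1)" my-dataworks-bucket
```

Keeping the versions in one place makes it easy to swap in different component versions later without touching the rest of the script.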
Build a new image based on the installation packages
To create a custom image, you must write a script to download the Spark and Hadoop installation packages that are stored in an OSS bucket, and then install the packages in the CDH image. After the installation is complete, build and publish the custom image for data development.
Create a custom image.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the navigation pane on the left, click Image Management to go to the Image Management page.
On the Custom Images tab, click Create Image. In the Create Image panel that appears, configure the parameters. The following table describes the key parameters used to create a custom image. For more information about the parameters, see Manage images.
Parameter: Image Name/ID
Description: Select the image on which the custom image is based. If you want to create a custom image based on a Hadoop cluster, select the official CDH image provided by DataWorks.
Example: Select dataworks_cdh_custom_task_pod from the drop-down list.

Parameter: Supported Task Type
Description: The CDH image supports tasks of the following types: CDH Hive, CDH Spark, CDH Spark SQL, CDH MR, CDH Presto, and CDH Impala. You can select task types as needed.
Example: In this example, all task types that are supported by the CDH image are selected.

Parameter: Installation Package
Description: You must write a script to download and install the Spark and Hadoop installation packages that are stored in the OSS bucket. You can replace the installation packages in the sample code as needed.
Example: Select Script from the drop-down list.
After the configuration is complete, click OK.
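The installation script that the image runs might look like the following sketch. The bucket path, package names, and install prefix are placeholder assumptions, and `ossutil` is assumed to be available in the image; adapt the script to the sample code shown in the console.

```shell
#!/bin/sh
set -e

# Copy one package from OSS and unpack it under the given install prefix.
install_pkg() {
  oss_path="$1"; pkg="$2"; prefix="$3"
  ossutil cp "${oss_path}/${pkg}" "/tmp/${pkg}"
  mkdir -p "$prefix"
  tar -xzf "/tmp/${pkg}" -C "$prefix"
}

# Example usage inside the image build (placeholder bucket and versions):
# install_pkg oss://my-dataworks-bucket/packages spark-3.4.2-bin-hadoop3.tgz /opt
# install_pkg oss://my-dataworks-bucket/packages hadoop-3.2.1.tar.gz /opt
# export SPARK_HOME=/opt/spark-3.4.2-bin-hadoop3
# export HADOOP_HOME=/opt/hadoop-3.2.1
# export PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH"
```

Exporting SPARK_HOME and HADOOP_HOME after extraction lets scheduled jobs locate the binaries without hard-coded paths.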
Build and publish the custom image.
After you configure the custom image, you must build and publish the image so that you can use the image in Data Studio. You can perform the following steps to build and publish the custom image:
After the custom image is created, click Publish in the Actions column to test the custom image.

In the Publish Image panel, select a resource group from the Test Resource Group drop-down list to test the image. If the test succeeds, click Publish to publish the image.

NoteIf you want to download an installation package from the OSS public endpoint, make sure that the test resource group can access data sources over the Internet. To enable access to data sources over the Internet, you must configure a NAT gateway for the VPC with which the resource group is associated. For more information, see the Overview of network connectivity solutions section in the "Network connectivity solutions" topic.
Step 3: Use the custom environment to run tasks
After the image is published, you can use it for data development. The method for using a custom image varies depending on whether your workspace uses the new version of Data Studio.
Log on to the DataWorks console. Switch to the required region, select the required workspace from the drop-down list in the navigation pane on the left, and click Go to Data Studio.
In Data Studio, open the task that you want to run and click the parameter configuration icon in the top toolbar. In the parameter configuration dialog box that appears, select the Image that you created.