
DataWorks: Data Studio (legacy version): Associate a CDH computing resource

Last Updated: Mar 26, 2026

DataWorks connects to Cloudera Distribution for Hadoop (CDH) and Cloudera Data Platform (CDP) clusters. After you register a cluster, you can use it for task development and scheduling, metadata management in Data Map, and Data Quality monitoring, all without migrating the cluster itself.

Supported scope

Check the following constraints before you start. If your setup falls outside this scope, the registration will not succeed.

  • Cluster versions: CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, CDP 7.1.7, Custom Version

  • Resource group types: Serverless resource group (recommended), exclusive resource group for scheduling (earlier version only)

  • Regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Zhangjiakou), China (Chengdu), Germany (Frankfurt)

Additional constraints:

  • New users can purchase only serverless resource groups.

  • Custom Version clusters can only use an exclusive resource group for scheduling of an earlier version, and require a support ticket after registration to initialize the environment.

Prerequisites

Before you begin, obtain the cluster component information and configure network connectivity between the cluster and your resource group. See Preparations: Obtain CDH or CDP cluster information and configure network connectivity.

Step 1: Go to the cluster registration page

Note If your workspace runs in standard mode, register clusters for both the development and production environments. CDH and CDP development operations are essentially the same; this topic uses a CDH cluster as an example.
  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center, select the workspace from the drop-down list, and then click Go to Management Center.

  2. In the left-side navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster, and select CDH as the open source cluster type.

Step 2: Configure basic information

  • Display Name of Cluster: The cluster's display name in DataWorks. Must be unique.

  • Cluster Version: The version to register. Fixed-component versions available: CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, CDP 7.1.7. If none fits your needs, select Custom Version to configure component versions manually.

  • Cluster Name: Determines where the cluster's configuration information comes from. Select Registered cluster to reference a cluster already registered in another workspace, or New cluster to enter the configuration yourself.
Note Components to configure vary by cluster version — the actual components shown on the page are the authoritative reference. Custom Version clusters require an exclusive resource group for scheduling and a support ticket after registration.

Step 3: Configure cluster connection information

Select the component versions deployed in your cluster and enter the component addresses you obtained from the preparations guide. See Preparations: Obtain CDH or CDP cluster information and configure network connectivity.

Note If you use a serverless resource group to access CDH components by domain name, configure authoritative resolution for those domain names in Alibaba Cloud DNS PrivateZone. See Add a built-in authoritative domain name and Set the scope of a domain name.
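
As a reference, the connection values for a typical CDH cluster follow the patterns below. The hostname is hypothetical and the ports are common CDH defaults; use the actual addresses you collected during preparation.

```properties
# HiveServer2 JDBC address (default HiveServer2 port is 10000)
hive.jdbc.url=jdbc:hive2://cdh-master-1.example.com:10000
# Hive Metastore Thrift URI (default Metastore port is 9083)
hive.metastore.uris=thrift://cdh-master-1.example.com:9083
# YARN ResourceManager address (default RPC port is 8032)
yarn.resourcemanager.address=cdh-master-1.example.com:8032
# HDFS NameNode address (default NameNode RPC port is 8020)
fs.defaultFS=hdfs://cdh-master-1.example.com:8020
```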

Step 4: Add cluster configuration files

Upload the configuration files for the components you use. See Preparations: Obtain CDH or CDP cluster information and configure network connectivity for how to get these files.

  • core-site.xml: Global configurations of the Hadoop Core library, including common I/O settings for Hadoop Distributed File System (HDFS) and MapReduce. Required when running Spark or MapReduce tasks.

  • hdfs-site.xml: HDFS configurations, including data block size, number of replicas, and path names.

  • mapred-site.xml: MapReduce parameters, such as the execution method and scheduling behavior. Required when running MapReduce tasks.

  • yarn-site.xml: YARN daemon configurations for the resource manager, node manager, and application runtime settings. Required when running Spark or MapReduce tasks, or when Kerberos is the account mapping type (see Step 5).

  • hive-site.xml: Hive parameters, including database connection information, Hive Metastore settings, and the execution engine. Required when Kerberos is the account mapping type (see Step 5).

  • spark-defaults.conf: Default Spark job settings, such as memory size and CPU cores. Required when running Spark tasks.

  • config.properties: Presto server configurations, the global properties for coordinator and worker nodes. Required when using the Presto component with OPEN LDAP or Kerberos as the account mapping type.

  • presto.jks: Java KeyStore (JKS) file for SSL/TLS encrypted communication in Presto.
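
For orientation, a minimal core-site.xml typically carries at least the default file system URI. The fragment below is illustrative only, with a hypothetical hostname; upload the actual file exported from your cluster.

```xml
<!-- Illustrative fragment of core-site.xml; the hostname is hypothetical. -->
<configuration>
  <property>
    <!-- Default file system URI used by HDFS clients -->
    <name>fs.defaultFS</name>
    <value>hdfs://cdh-master-1.example.com:8020</value>
  </property>
  <property>
    <!-- Read/write buffer size in bytes for sequence files -->
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
```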

Step 5: Configure the default access identity

Choose the account used to access the CDH cluster when tasks run in DataWorks.

Note If you set Default Access Identity to any account other than a cluster account and no account mapping is configured (or the mapping type is set to no authentication), all tasks will fail.

Cluster account

A fixed cluster account runs all CDH tasks, regardless of which DataWorks user (Alibaba Cloud account or RAM user) triggered them. This applies to both development and production environments.

Mapped account

DataWorks maps each task executor's account to a cluster account before running the task. The mapping rules differ by environment:

  • Development environment: Map the task executor's account (Alibaba Cloud account or RAM user with development permissions) to a cluster account.

  • Production environment: Map the task owner, Alibaba Cloud account, or RAM user to a cluster account.

To configure mappings, see Set cluster identity mappings.

After configuring all settings, click Complete Creation to register the cluster.

Step 6: Initialize a resource group

Initialize the resource group the first time you associate a cluster, and again whenever the cluster service configuration changes or a component is upgraded (for example, after modifying core-site.xml). This ensures the resource group can access the cluster using the current environment configuration.

On the Cluster Management page, find the registered cluster, click Initialize Resource Group in the upper-right corner, select the resource group, and confirm.

Note Only serverless resource groups and exclusive resource groups for scheduling are supported. If no resource group is available, create one: Use serverless resource groups or Use exclusive resource groups for scheduling. For Custom Version clusters, submit a ticket instead.

(Optional) Set a YARN resource queue

YARN resource queues partition cluster computing resources to prevent task interference across modules. On the Cluster Management page, find the cluster, go to the YARN Resource Queue tab, and click Edit YARN Resource Queue to configure the settings.
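
On the engine side, queue selection is governed by the standard Hadoop and Spark queue properties. The keys below are the standard configuration names; the queue name itself is hypothetical.

```properties
# Queue for MapReduce jobs (standard Hadoop property; queue name is hypothetical)
mapreduce.job.queuename=dataworks_dev
# Queue for Spark on YARN jobs (standard Spark property)
spark.yarn.queue=dataworks_dev
```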

(Optional) Set Spark properties

Assign dedicated Spark properties to tasks in different modules.

  1. On the Cluster Management page, find the cluster.

  2. Click the Spark-related Parameter tab, then click Edit Spark-related Parameter.

  3. Under the target module, click Add. Enter the Spark Property Name and Spark Property Value. See Spark configuration reference for available properties.
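
For example, a module that runs memory-heavy Spark tasks might be given entries such as the following. The values are illustrative; the property names are standard Spark configuration keys.

```properties
# Memory allocated to each executor
spark.executor.memory=4g
# CPU cores per executor
spark.executor.cores=2
# Number of executors requested per job
spark.executor.instances=4
```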

What's next

  • Set cluster identity mappings: If the default access identity is not a cluster account, configure mappings between DataWorks accounts and cluster accounts to enforce data permission isolation.

  • After you configure the CDH computing resource, you can use CDH-related nodes in Data Studio to perform data development operations.