
DataWorks: Data Studio: Associate a CDH computing resource

Last Updated: Mar 26, 2026

To develop and manage tasks on a Cloudera Distribution Including Apache Hadoop (CDH) cluster in DataWorks, register the cluster as a computing resource. Once registered, the computing resource is available for data synchronization and data development tasks.

Available regions: China (Beijing), China (Shanghai), China (Shenzhen), China (Hangzhou), China (Zhangjiakou), China (Chengdu), and Germany (Frankfurt).

Prerequisites

Before you begin, make sure you have:

  • A RAM user added to the workspace with the Workspace Administrator role

  • A CDH cluster deployed — DataWorks supports CDH clusters deployed outside Alibaba Cloud ECS, as long as the deployment environment is connected to an Alibaba Cloud virtual private cloud (VPC). See Network connectivity for IDC data sources

  • A resource group associated with the workspace, with network connectivity confirmed

Permissions

Operator | Required permissions
Alibaba Cloud account | None
RAM user or RAM role | O&M and Workspace Administrator roles, or the AliyunDataWorksFullAccess permission. See Grant users space administrator permissions.

Go to the computing resource list

  1. Log on to the DataWorks console and switch to the target region.

  2. In the left navigation pane, choose More > Management Center. Select your workspace and click Go To Management Center.

  3. In the left navigation pane, click Computing Resource.

Associate a CDH computing resource

  1. On the Computing Resource page, click Associate Computing Resource.

  2. On the Associate Computing Resource page, set the computing resource type to CDH. You are redirected to the Associate CDH Computing Resource page.

  3. Configure the parameters described below, then click Confirm.

Cluster version and name

Parameter | Description
Cluster version | The CDH or CDP version to register. For supported versions and their fixed component versions, see Cluster connection information. Select Custom version to specify component versions manually.
Cluster name | Select an existing cluster registered in another workspace to load its configuration, or enter a name to create a new configuration.
Computing resource instance name | A display name for this computing resource. At runtime, tasks reference computing resources by this name.

Note: Clusters registered as a custom version support only legacy exclusive resource groups for scheduling. After registration, submit a ticket to initialize the environment.

Cluster connection information

Configure connection endpoints for the Hadoop components your tasks will use. The system automatically detects component versions for the selected cluster version.

Component | Connection format | When to configure
Hive (HiveServer2) | jdbc:hive2://<host>:<port>/<database> | Submit Hive jobs
Hive (Metastore) | thrift://<host>:<port> | Submit Hive jobs
Impala | jdbc:impala://<host>:<port>/<schema> | Submit Impala jobs
Spark | Select a default version from the list | Run Spark tasks
YARN (ResourceManager address) | http://<host>:<port> | Submit Spark or MapReduce tasks
YARN (JobHistory webapp address) | http://<host>:<port2> | View historical task details in the JobHistory Server web UI
MapReduce | Select a default version from the list | Run MapReduce tasks
Presto | jdbc:presto://<host>:<port>/<catalog>/<schema> | Submit Presto jobs (not a default CDH component)

To look up connection parameters for your cluster, see Obtain CDH or CDP cluster information and configure network connectivity.
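
For illustration only, the connection formats in the table above can be assembled from their parts as shown below. The host names, ports, and database, schema, and catalog names used here are hypothetical placeholders, not values from a real cluster; substitute the endpoints you obtain from your own CDH or CDP cluster.

```python
# Sketch: assemble JDBC connection strings in the formats listed above.
# All hosts, ports, and database/schema/catalog names are hypothetical.

def hive_jdbc_url(host: str, port: int, database: str) -> str:
    """HiveServer2 endpoint: jdbc:hive2://<host>:<port>/<database>"""
    return f"jdbc:hive2://{host}:{port}/{database}"

def impala_jdbc_url(host: str, port: int, schema: str) -> str:
    """Impala endpoint: jdbc:impala://<host>:<port>/<schema>"""
    return f"jdbc:impala://{host}:{port}/{schema}"

def presto_jdbc_url(host: str, port: int, catalog: str, schema: str) -> str:
    """Presto endpoint: jdbc:presto://<host>:<port>/<catalog>/<schema>"""
    return f"jdbc:presto://{host}:{port}/{catalog}/{schema}"

print(hive_jdbc_url("cdh-master-1", 10000, "default"))
# jdbc:hive2://cdh-master-1:10000/default
```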

If you use a serverless resource group and access CDH components by domain name, configure authoritative resolution for the CDH component domain names and set their effective scope in Alibaba Cloud DNS PrivateZone.

Cluster configuration files

Upload the configuration files that correspond to the tasks you plan to run.

File | Description | Upload when
Core-site file | Global Hadoop Distributed File System (HDFS) and MapReduce I/O settings | Running Spark or MapReduce tasks
Hdfs-site file | HDFS settings: block size, replication factor, and path names |
Mapred-site file | MapReduce execution mode and scheduling behavior | Running MapReduce tasks
Yarn-site file | YARN resource manager, node manager, and application runtime settings | Running Spark or MapReduce tasks, or using Kerberos account mapping
Hive-site file | Hive database connection, metastore, and execution engine settings | Using Kerberos account mapping
Spark-defaults file | Default Spark job settings (spark-defaults.conf): memory, CPU cores, and other runtime parameters | Running Spark tasks
Config.properties file | Presto coordinator and worker node settings | Using Presto with OPEN LDAP or Kerberos authentication
Presto.jks file | SSL/TLS certificates for encrypted Presto communication |
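
The Spark-defaults file uses the standard spark-defaults.conf layout: one property per line, the property name and value separated by whitespace, and lines starting with # treated as comments. As a rough sketch (the property names and values shown are hypothetical examples, not recommendations for your cluster), such a file can be read like this:

```python
# Sketch: parse spark-defaults.conf-style text (one "name value" pair per
# line, '#' starts a comment). The sample values below are hypothetical.

def parse_spark_defaults(text: str) -> dict:
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # split on first run of whitespace
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props

sample = """
# hypothetical defaults
spark.executor.memory   4g
spark.executor.cores    2
"""
print(parse_spark_defaults(sample))
```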

Default access identity

Set the cluster identity used when tasks run against the CDH cluster. To configure identity mappings, go to the Account Mapping tab on the Computing Resources page. See Set the cluster identity mapping.

Environment | Available options
Development environment | Cluster account; Mapped cluster account of task executor
Production environment | Cluster account; Mapped cluster account of task owner; Mapped cluster account of Alibaba Cloud account; Mapped cluster account of RAM user

Initialize the resource group

Initialize the resource group when you register a cluster for the first time or after changing cluster service configurations (for example, modifying core-site.xml). Initialization makes sure the resource group can reach the CDH cluster after network connectivity is configured.

  1. On the Computing Resource page, find the CDH computing resource you created.

  2. In the upper-right corner, click Initialize Resource Group.

  3. Click Initialize next to the target resource group, then click OK.

More operations

Set a YARN resource queue (optional)

On the Computing Resource page, find the CDH cluster. On the YARN Resource Queue tab, click Edit YARN Resource Queue to assign dedicated YARN resource queues to tasks in different modules.

Set Spark parameters (optional)

On the Computing Resource page, find the CDH cluster. On the Spark-related Parameter tab, click Edit Spark-related Parameter. Click Add under the target module, then enter the Spark Property Name and Spark Property Value. For a full list of Spark properties, see Spark configuration.

Configure host mappings for Kerberos authentication (optional)

When using a serverless resource group with a CDH cluster that has Kerberos authentication enabled, task submission can fail if DNS cannot resolve a cluster IP address to the hostname registered in Kerberos.

The Host Configuration feature lets you define a static IP-to-hostname mapping table for the computing resource. DataWorks uses this mapping when accessing your CDH cluster, making sure Kerberos authentication succeeds.

To configure host mappings:

  1. On the Computing Resource page, find the CDH computing resource and click Host Configuration.

  2. In the dialog box, enter the mappings in the following format. Each line is one mapping record:

    <IP address> <Hostname>

    Separate the IP address and hostname with one or more spaces. Configure mappings for all key nodes involved in Kerberos authentication and task execution, including NameNode, ResourceManager, and NodeManagers.

  3. Click OK to save. The configured hostnames appear on the computing resource card, confirming the configuration is active.

Important

Host configuration applies only to the current computing resource and does not affect other computing resources in the workspace.
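
As a sanity check before saving, mapping lines in the format described above can be validated with a short script. This is an illustrative sketch only; the IP addresses and hostnames used here are hypothetical, and the check is not part of DataWorks itself.

```python
# Sketch: validate "<IP address> <Hostname>" mapping lines, one record per
# line, separated by one or more spaces. Hosts below are hypothetical.
import ipaddress

def parse_host_mappings(text: str) -> dict:
    mappings = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected '<IP address> <Hostname>'")
        ip, hostname = parts
        ipaddress.ip_address(ip)  # raises ValueError for an invalid IP
        mappings[ip] = hostname
    return mappings

sample = """
192.168.1.10    namenode01.cluster.local
192.168.1.20    resourcemanager01.cluster.local
"""
print(parse_host_mappings(sample))
```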

What's next

After configuring CDH computing resources, use CDH-related nodes in Data Studio for data development.