DataWorks: CDP and CDH development guidelines

Last Updated: Mar 25, 2026

DataWorks integrates with Cloudera Data Platform (CDP) and Cloudera's Distribution including Apache Hadoop (CDH) clusters to give you workflow orchestration, timed scheduling, and metadata management for your Hive, Spark, Presto, Impala, and HBase workloads. This topic covers billing, environment setup, access control, and the DataWorks modules available for CDP/CDH development.

Background information

  • CDH (Cloudera's Distribution including Apache Hadoop) is an open-source platform distribution from Cloudera. It provides out-of-the-box cluster management, monitoring, and diagnostics, and supports various components for end-to-end big data workflows.

  • CDP (Cloudera Data Platform) is Cloudera's data platform that succeeds CDH and Hortonworks Data Platform (HDP). It combines components from both distributions for data engineering, data warehousing, and analytics workloads.

You can register CDH and CDP clusters in DataWorks to perform data development and administration operations, including task development, scheduling, metadata management (Data Map), and Data Quality.

Limitations

  • CDP/CDH tasks run only on Serverless resource groups (recommended) or legacy exclusive resource groups for scheduling.

    Serverless resource groups are general-purpose resource groups that support both data synchronization and task scheduling. New users can purchase only Serverless resource groups. If you registered a Custom Version cluster, only a legacy exclusive resource group is supported.

  • CDP/CDH cluster registration is supported only in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Zhangjiakou), China (Chengdu), and Germany (Frankfurt).

Billing

Using DataWorks with CDP/CDH incurs costs from DataWorks and from other products you connect.

DataWorks costs

| Cost item | Description |
| --- | --- |
| DataWorks edition | Activating Standard, Professional, or Enterprise Edition incurs edition fees. Basic Edition covers data migration, development, scheduling, and simple data governance for CDP/CDH. |
| Scheduling resource group | Tasks require a resource group to run. Purchase a Serverless resource group (recommended) or a legacy exclusive resource group. A Serverless resource group covers both scheduling and data synchronization. |
| Data synchronization resource group | A data synchronization task consumes both scheduling and synchronization resources from the resource group. |

Scheduling fees are not charged for tasks that you run by clicking Run or Run with Parameters in DataStudio, for failed tasks, or for dry-run tasks.

For more information about how scheduling costs are calculated, see Issuing logic of scheduling tasks in DataWorks.
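
The exemptions listed above can be expressed as a simple filter. The following is a minimal sketch, assuming a hypothetical `TaskInstance` record with `trigger`, `status`, and `dry_run` fields. These names are illustrative and are not DataWorks API values. The sketch models only the three exemptions stated in this topic, not the full fee calculation described in the referenced billing topic.

```python
from dataclasses import dataclass

# Hypothetical labels for instances started manually from DataStudio
# (Run or Run with Parameters). Not actual DataWorks identifiers.
MANUAL_TRIGGERS = {"run", "run_with_parameters"}


@dataclass
class TaskInstance:
    name: str
    trigger: str          # "scheduled", "run", or "run_with_parameters"
    status: str           # "succeeded" or "failed"
    dry_run: bool = False


def is_billable(instance: TaskInstance) -> bool:
    """Return True if the instance would count toward scheduling fees."""
    if instance.trigger in MANUAL_TRIGGERS:
        return False      # manual runs from DataStudio are exempt
    if instance.status == "failed":
        return False      # failed tasks are exempt
    if instance.dry_run:
        return False      # dry-run tasks are exempt
    return True


def count_billable(instances) -> int:
    """Count the instances that are not covered by any exemption."""
    return sum(1 for i in instances if is_billable(i))
```

For example, given one successful scheduled instance, one manual run, one failed instance, and one dry run, `count_billable` counts only the first.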

Other costs

The following costs are billed by the respective products, not DataWorks.

Important: For billing details, see the billing documentation for each product or Product Billing.

| Cost item | Description |
| --- | --- |
| Database | Upstream/downstream database fees incurred during data synchronization |
| Compute and storage | Fees from the compute engine when running compute tasks |
| Network | Fees for network products such as Express Connect, Internet Shared Bandwidth, or elastic IP addresses (EIPs) used to connect DataWorks to your cluster |

Prerequisites

Before you begin, ensure that you have:

  • Activated DataWorks. See Purchase.

  • Deployed a CDP or CDH cluster and registered it with DataWorks. DataWorks supports clusters deployed outside the Alibaba Cloud ECS environment. The cluster network must connect to your Alibaba Cloud virtual private cloud (VPC) via Express Connect or VPN. See Register a CDH or CDP cluster.

  • Purchased a Serverless resource group. By default, a Serverless resource group cannot reach other cloud product networks after purchase. Establish network connectivity between the cluster and the resource group before running tasks. See Use a Serverless resource group.

  • Created a DataWorks workspace. See Configure a workspace.
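
Before running tasks, it can help to spot-check the network prerequisite above. The following is a minimal sketch, assuming you run it from a host in the same VPC that the resource group is bound to. The port numbers are the common defaults for HiveServer2, the Hive metastore, and Impala, and are assumptions here; confirm them against your cluster configuration. The function names are hypothetical.

```python
import socket

# Assumed default service ports; your cluster may use different values.
DEFAULT_PORTS = {
    "hiveserver2": 10000,
    "hive_metastore": 9083,
    "impala": 21050,
}


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_cluster(host: str, ports=DEFAULT_PORTS) -> dict:
    """Map each service name to whether its port is reachable on host."""
    return {name: can_connect(host, port) for name, port in ports.items()}
```

A successful TCP connection shows only that the route and security rules allow traffic. It does not verify authentication or that the service is healthy.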

Development process overview

| Stage | What to do |
| --- | --- |
| Billing | Understand DataWorks and related product costs before you start. |
| Environment preparation | Purchase the right DataWorks edition and resource group, register the cluster, and prepare the development environment. |
| Access control | Grant RAM users the permissions they need to develop and manage CDP/CDH tasks. |
| Data integration | Synchronize data between CDP/CDH Hive or HBase and other data sources. |
| Data development and O&M | Build and schedule compute tasks in DataStudio, then monitor and operate them in Operation Center. |
| Data governance | Manage metadata and perform data governance with Data Map and Data Governance Center. |
| Data analysis and services | Analyze data and publish APIs with DataAnalysis and DataService Studio. |
| Open Platform | Integrate your applications with DataWorks via OpenAPI, OpenEvent, and Extensions. |

Environment preparation

Resource preparation

| Category | Description | Reference |
| --- | --- | --- |
| Edition selection | Basic Edition covers data migration, development/scheduling, and simple data governance. For professional data governance and security, choose Standard, Professional, or Enterprise Edition. | Features of different DataWorks editions |
| Resource group selection | Serverless resource groups (recommended) support both task scheduling and data synchronization. Legacy exclusive resource groups are supported but not available to new users. | Use a Serverless resource group |

Development environment preparation

Register a CDP or CDH cluster in your DataWorks workspace to enable data development in DataStudio and collaborative development across workspace members.

| Category | Description | Reference |
| --- | --- | --- |
| Data synchronization environment | Create the Hive component as a DataWorks data source before running any Hive-based data synchronization tasks. | Supported data sources and sync solutions |
| Data development and analysis environment | Add the cluster to DataWorks to enable data development, analysis, and periodic task scheduling. | Register a CDH or CDP cluster |
| Collaborative development environment | Add RAM users to the workspace and grant them the Developer role, then add the workspace members to the CDP or CDH cluster environment. | Add members to a workspace |

Access control

DataWorks provides product-level and module-level access control.

Data access control

Configure cluster account mappings for RAM users added to your workspace. This grants workspace members the permissions of their mapped cluster account. See Configure cluster identity mappings.
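
Conceptually, an identity mapping is a lookup from a workspace member to the cluster account whose permissions that member inherits. The following is a minimal sketch of that idea, assuming a hypothetical in-memory mapping. The user and account names are invented for illustration, and the actual mappings are configured in DataWorks rather than in code.

```python
from typing import Optional

# Hypothetical mapping: workspace member (RAM user) -> cluster account.
ACCOUNT_MAPPINGS = {
    "ram_user_alice": "hive_etl",
    "ram_user_bob": "analyst_ro",
}


def resolve_cluster_account(ram_user: str, mappings=ACCOUNT_MAPPINGS) -> Optional[str]:
    """Return the cluster account a member runs as, or None if unmapped."""
    return mappings.get(ram_user)


def unmapped_members(members, mappings=ACCOUNT_MAPPINGS) -> list:
    """List workspace members that still need a cluster account mapping."""
    return [m for m in members if m not in mappings]
```

A check like `unmapped_members` is useful as a review step: it lists members who have no mapped cluster account and therefore do not receive cluster permissions through a mapping.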

Feature module access control

Two levels of module access control are available:

  • Global module access control — manages permissions for DataWorks feature modules (such as restricting access to Data Map) and the DataWorks console (such as allowing users to delete workspaces). See Global module access control.

  • Workspace-level module access control — manages permissions for workspace-level modules (such as allowing access to DataStudio for development) and global modules (such as restricting access to Data Security Guard). See Workspace-level module access control.

For a complete guide to RAM user authorization, see Guidance on RAM user authorization.

Get started

Data integration

Data Integration lets you read from and write to CDP/CDH Hive and CDP/CDH HBase. Register the Hive or HBase component as a DataWorks data source, then run offline, full, or incremental synchronization tasks. For more information, see Data Integration.

Data development and O&M

| Module | Description | References |
| --- | --- | --- |
| Data modeling | Structures and governs your data warehouse using planning, standards, dimensional modeling, and metric modules, aligned with the Alibaba data mid-end methodology. | Data modeling |
| DataStudio | Develop and schedule CDP/CDH tasks without complex command lines. Supports CDH Hive, Spark, MR, Presto, and Impala nodes, as well as general-purpose node types. | Create a CDH Hive node · Create a CDH Spark node · Create a CDH MR node · Create a CDH Presto node · Create a CDH Impala node |
| Operation Center | Monitor scheduled tasks and perform O&M operations. Provides intelligent diagnostics, task reruns, and baseline management to ensure timely task output. | Perform basic O&M operations on scheduled tasks |
| Data Quality | Monitors data throughout the R&D lifecycle using rules tied to your scheduling processes. Identifies data quality issues early and prevents them from affecting downstream systems. | Data Quality overview |

General-purpose nodes available in DataStudio

In addition to CDP/CDH compute nodes, DataStudio provides general-purpose nodes for complex workflows:

| Node type | Use case |
| --- | --- |
| Zero load node | Manage workflow dependencies without executing compute logic |
| HTTP Trigger node | Trigger DataWorks node scheduling from an external scheduling system |
| OSS object inspection node | Check for the presence of OSS objects before proceeding |
| FTP Check node | Check for files on an FTP server before proceeding |
| Assignment node | Pass input and output parameters between nodes |
| Parameter node | Manage scheduling parameters |
| Do-while node | Execute node code in a loop |
| For-each node | Iterate over assignment node outputs |
| Branch node | Route workflow execution based on conditions |
| Shell node | Run shell scripts |
| MySQL database node | Run MySQL operations |

References: Zero load node · HTTP Trigger node · OSS object inspection node · FTP Check node · Assignment node · Parameter node · Do-while node · For-each node · Branch node

Task lifecycle in DataStudio

After developing a node task, complete the following steps before it runs in production:

  1. Configure scheduling properties — set dependencies and scheduling parameters so DataWorks can run the task periodically.

  2. Debug the task — validate the task locally to avoid wasting compute resources in production. See Task debugging process.

  3. Deploy to production — tasks run in Operation Center only after deployment. See Publish tasks.

  4. Manage tasks — deploy, undeploy, or update scheduling properties in bulk. See Batch operations.

  5. Apply process controls — use code review, smoke testing, and custom review logic to govern task changes. See Development process control.
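
Step 1 above depends on getting node dependencies right: a node runs only after all of its upstream nodes finish. The following is a minimal sketch of how a dependency graph resolves to a valid run order, using Python's standard `graphlib` module (Python 3.9 or later). The node names and the `run_order` helper are hypothetical, and this illustrates the general technique (topological ordering), not DataWorks' scheduler implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key maps to the set of nodes it depends on.
dependencies = {
    "start": set(),                                   # zero load node as workflow root
    "ods_orders": {"start"},
    "ods_users": {"start"},
    "dwd_order_detail": {"ods_orders", "ods_users"},  # waits for both upstream nodes
    "ads_daily_report": {"dwd_order_detail"},
}


def run_order(deps: dict) -> list:
    """Return one valid execution order in which upstream nodes come first.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(run_order(dependencies))
```

Nodes with no dependency on each other, such as `ods_orders` and `ods_users` here, may appear in either order.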

Data governance

After registering a CDP/CDH cluster, DataWorks automatically collects metadata from the cluster. The following modules support data governance.

| Module | Description | References |
| --- | --- | --- |
| Data Map | Enterprise-level metadata management: inventory data assets, view lineage, and search data objects. Table-level and field-level lineage is available for CDH Hive, CDH Spark, CDH Spark SQL, and CDH Impala nodes. | Data Map overview · Lineage display for different data sources |
| Security Center / Data Security Guard / Approval Center | Integrated data security governance: asset classification, sensitive data detection, authorization management, data masking, access auditing, and fraud detection. Note: Approval Center does not support custom approval flows for CDH/CDP tables. | Security Center overview · Data Security Guard overview · Approval Center overview |
| Data Governance Center | Proactive governance: establish domain rules, automatically detect assets that need optimization, and apply pre-emptive and post-event governance policies. Note: Only global check items and governance items apply to CDH/CDP data. | Data Governance Center overview |
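
Table-level lineage is, in effect, a directed graph from source tables to the tables derived from them. The following is a minimal sketch of how such a graph answers the question "what is affected if this table changes?". It assumes a hypothetical in-memory edge list with invented table names, and it does not call Data Map or represent its data model.

```python
from collections import deque

# Hypothetical lineage edges: upstream table -> tables derived from it.
LINEAGE = {
    "ods.orders": ["dwd.order_detail"],
    "ods.users": ["dwd.order_detail"],
    "dwd.order_detail": ["ads.daily_report", "ads.user_profile"],
}


def downstream_tables(table: str, lineage=LINEAGE) -> set:
    """Return every table reachable downstream of `table` (breadth-first)."""
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:      # skip tables already visited
                seen.add(child)
                queue.append(child)
    return seen
```

With these sample edges, a change to `ods.orders` reaches `dwd.order_detail` and both `ads` tables, while `ads.daily_report` has no downstream tables.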

Data analysis and services

| Module | Description | References |
| --- | --- | --- |
| DataAnalysis | SQL-based online data analysis, business insight, data editing and sharing, chart cards, and visualized reports for daily reporting. | DataAnalysis overview |
| DataService Studio | Centralized API service management for internal and external systems, enabling unified data access and sharing. | DataService Studio overview |

Open Platform

DataWorks provides APIs and event subscriptions so your application systems can integrate with DataWorks for data management, governance, O&M, and real-time response to business changes.

| Module | Description | References |
| --- | --- | --- |
| OpenAPI | Call DataWorks API operations to integrate your applications, automate data workflows, and reduce manual O&M. | OpenAPI |
| OpenEvent | Subscribe to DataWorks change events so your applications can detect and respond to changes immediately. | OpenEvent overview |
| Extensions | Register local programs as extensions to manage extension point events and processes in your workspace. | Extensions overview |