DataWorks integrates with Cloudera Data Platform (CDP) and Cloudera's Distribution including Apache Hadoop (CDH) clusters to give you workflow orchestration, periodic scheduling, and metadata management for your Hive, Spark, Presto, Impala, and HBase workloads. This topic covers billing, environment setup, access control, and the DataWorks modules available for CDP/CDH development.
Background information
CDH (Cloudera's Distribution including Apache Hadoop) is an open-source platform distribution from Cloudera. It provides out-of-the-box cluster management, monitoring, and diagnostics, and supports various components for end-to-end big data workflows.
CDP (Cloudera Data Platform) is Cloudera's successor to CDH. It combines the capabilities of CDH and Hortonworks Data Platform (HDP) in a single platform for data engineering, data warehousing, and analytics workloads.
You can register CDH and CDP clusters in DataWorks to perform data development and administration operations, including task development, scheduling, metadata management (Data Map), and Data Quality.
Limitations
CDP/CDH tasks run only on Serverless resource groups (recommended) or legacy exclusive resource groups for scheduling.
Serverless resource groups are general-purpose resource groups that support both data synchronization and task scheduling. New users can purchase only Serverless resource groups. If you registered a Custom Version cluster, only legacy exclusive resource groups are supported.
CDP/CDH cluster registration is supported only in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Zhangjiakou), China (Chengdu), and Germany (Frankfurt).
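The limitations above can be expressed as a small pre-flight check. The following is a minimal sketch in Python, not a DataWorks API: the function name `check_registration` and the string values `serverless` and `legacy_exclusive` are hypothetical labels chosen for illustration. The region names are copied from the list above.

```python
# Hypothetical pre-flight check. Names and string values are illustrative,
# not DataWorks identifiers.
SUPPORTED_REGIONS = {
    "China (Beijing)",
    "China (Shanghai)",
    "China (Hangzhou)",
    "China (Shenzhen)",
    "China (Zhangjiakou)",
    "China (Chengdu)",
    "Germany (Frankfurt)",
}

RESOURCE_GROUPS = {"serverless", "legacy_exclusive"}


def check_registration(region: str, resource_group: str,
                       custom_version: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the combination is allowed."""
    problems = []
    if region not in SUPPORTED_REGIONS:
        problems.append(f"Cluster registration is not supported in {region}.")
    if resource_group not in RESOURCE_GROUPS:
        problems.append(f"Unsupported resource group type: {resource_group}.")
    elif custom_version and resource_group != "legacy_exclusive":
        problems.append(
            "Custom Version clusters require a legacy exclusive resource group."
        )
    return problems
```

For example, `check_registration("China (Hangzhou)", "serverless", custom_version=True)` reports one problem, because a Custom Version cluster cannot use a Serverless resource group.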
Billing
Using DataWorks with CDP/CDH incurs costs from DataWorks and from other products you connect.
DataWorks costs
| Cost item | Description |
|---|---|
| DataWorks edition | Activating Standard, Professional, or Enterprise Edition incurs edition fees. Basic Edition covers data migration, development, scheduling, and simple data governance for CDP/CDH. |
| Scheduling resource group | Tasks require a resource group to run. Purchase a Serverless resource group (recommended) or a legacy exclusive resource group. A Serverless resource group covers both scheduling and data synchronization. |
| Data synchronization resource group | A data synchronization task consumes both scheduling and synchronization resources from the resource group. |
Scheduling fees are not charged for tasks that you run by clicking Run or Run with Parameters in DataStudio, for failed tasks, or for dry-run tasks.
For more information about how scheduling costs are calculated, see Issuing logic of scheduling tasks in DataWorks.
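The exemptions from scheduling fees can be summarized as a simple predicate. The sketch below is illustrative only: the `TaskRun` class and its field names are hypothetical, and the function covers only the exemptions stated above, not how the fee amount is calculated.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    # Hypothetical fields that mirror the conditions stated in this topic.
    triggered_from_datastudio: bool  # started with Run or Run with Parameters
    succeeded: bool
    dry_run: bool


def incurs_scheduling_fee(run: TaskRun) -> bool:
    """Return True if the run counts toward scheduling fees under the stated rules."""
    if run.triggered_from_datastudio:
        return False  # manual runs in DataStudio are not charged
    if run.dry_run or not run.succeeded:
        return False  # dry-run and failed tasks are not charged
    return True
```

Under these assumptions, only a successful, non-dry-run task that was not started manually in DataStudio counts toward scheduling fees.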
Other costs
The following costs are billed by the respective products, not DataWorks.
For billing details, see the billing documentation for each product or Product Billing.
| Cost item | Description |
|---|---|
| Database | Upstream/downstream database fees incurred during data synchronization |
| Compute and storage | Fees from the compute engine when running compute tasks |
| Network | Fees for network products such as Express Connect, Internet Shared Bandwidth, or elastic IP addresses (EIPs) used to connect DataWorks to your cluster |
Prerequisites
Before you begin, ensure that you have:
Activated DataWorks. See Purchase.
Deployed a CDP or CDH cluster and registered it with DataWorks. DataWorks also supports clusters deployed outside the Alibaba Cloud ECS environment. In that case, connect the cluster network to your Alibaba Cloud virtual private cloud (VPC) through Express Connect or a VPN. See Register a CDH or CDP cluster.
Purchased a Serverless resource group. By default, a newly purchased Serverless resource group has no network connectivity to other cloud products. Establish network connectivity between the cluster and the resource group before running tasks. See Use a Serverless resource group.
Created a DataWorks workspace. See Configure a workspace.
Development process overview
| Stage | What to do |
|---|---|
| Billing | Understand DataWorks and related product costs before you start. |
| Environment preparation | Purchase the right DataWorks edition and resource group, register the cluster, and prepare the development environment. |
| Access control | Grant RAM users the permissions they need to develop and manage CDP/CDH tasks. |
| Data integration | Synchronize data between CDP/CDH Hive or HBase and other data sources. |
| Data development and O&M | Build and schedule compute tasks in DataStudio, then monitor and operate them in Operation Center. |
| Data governance | Manage metadata and perform data governance with Data Map and Data Governance Center. |
| Data analysis and services | Analyze data and publish APIs with DataAnalysis and DataService Studio. |
| Open Platform | Integrate your applications with DataWorks via OpenAPI, OpenEvent, and Extensions. |
Environment preparation
Resource preparation
| Category | Description | Reference |
|---|---|---|
| Edition selection | Basic Edition covers data migration, development/scheduling, and simple data governance. For professional data governance and security, choose Standard, Professional, or Enterprise Edition. | Features of different DataWorks editions |
| Resource group selection | Serverless resource groups (recommended) support both task scheduling and data synchronization. Legacy exclusive resource groups are supported but not available to new users. | Use a Serverless resource group |
Development environment preparation
Register a CDP or CDH cluster in your DataWorks workspace to enable data development in DataStudio and collaborative development across workspace members.
| Category | Description | Reference |
|---|---|---|
| Data synchronization environment | Create the Hive component as a DataWorks data source before running any Hive-based data synchronization tasks. | Supported data sources and sync solutions |
| Data development and analysis environment | Add the cluster to DataWorks to enable data development, analysis, and periodic task scheduling. | Register a CDH or CDP cluster |
| Collaborative development environment | Add RAM users to the workspace and grant them the Developer role, then add the workspace members to the CDP or CDH cluster environment. | Add members to a workspace |
Access control
DataWorks provides product-level and module-level access control.
Data access control
Configure cluster account mappings for RAM users added to your workspace. This grants workspace members the permissions of their mapped cluster account. See Configure cluster identity mappings.
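Conceptually, an identity mapping is a lookup from a workspace member to the cluster account whose permissions that member uses. The sketch below is a plain Python illustration of that idea, not DataWorks code. The RAM user names, cluster account names, and the `resolve_cluster_account` function are hypothetical. The sketch also assumes that an unmapped user is rejected, which may differ from the actual DataWorks behavior described in Configure cluster identity mappings.

```python
# Hypothetical mapping table: RAM user name -> cluster account.
# All names are placeholders.
IDENTITY_MAPPINGS = {
    "ram_dev_alice": "hive_etl",
    "ram_dev_bob": "hive_readonly",
}


def resolve_cluster_account(ram_user: str, mappings: dict[str, str]) -> str:
    """Return the cluster account a RAM user acts as, or raise if none is mapped."""
    try:
        return mappings[ram_user]
    except KeyError:
        # Assumption for this sketch: unmapped users are rejected.
        raise PermissionError(
            f"No cluster account is mapped to RAM user {ram_user!r}."
        ) from None
```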
Feature module access control
Two levels of module access control are available:
Global module access control — manages permissions for DataWorks feature modules (such as restricting access to Data Map) and the DataWorks console (such as allowing users to delete workspaces). See Global module access control.
Workspace-level module access control — manages permissions for workspace-level modules (such as allowing access to DataStudio for development) and global modules (such as restricting access to Data Security Guard). See Workspace-level module access control.
For a complete guide to RAM user authorization, see Guidance on RAM user authorization.
Get started
Data integration
Data Integration lets you read from and write to CDP/CDH Hive and CDP/CDH HBase. Register the Hive or HBase component as a DataWorks data source, then run offline, full, or incremental synchronization tasks. For more information, see Data Integration.
Data development and O&M
| Module | Description | References |
|---|---|---|
| Data modeling | Structures and governs your data warehouse using planning, standards, dimensional modeling, and metric modules — aligned with the Alibaba data mid-end methodology. | Data modeling |
| DataStudio | Develop and schedule CDP/CDH tasks without complex command lines. Supports CDH Hive, Spark, MR, Presto, and Impala nodes, as well as general-purpose node types. | Create a CDH Hive node · Create a CDH Spark node · Create a CDH MR node · Create a CDH Presto node · Create a CDH Impala node |
| Operation Center | Monitor scheduled tasks and perform O&M operations. Provides intelligent diagnostics, task reruns, and baseline management to ensure timely task output. | Perform basic O&M operations on scheduled tasks |
| Data Quality | Monitors data throughout the R&D lifecycle using rules tied to your scheduling processes. Identifies data quality issues early and prevents them from affecting downstream systems. | Data Quality overview |
General-purpose nodes available in DataStudio
In addition to CDP/CDH compute nodes, DataStudio provides general-purpose nodes for complex workflows:
| Node type | Use case |
|---|---|
| Zero load node | Manage workflow dependencies without executing compute logic |
| HTTP Trigger node | Trigger DataWorks node scheduling from an external scheduling system |
| OSS object inspection node | Check for the presence of OSS objects before proceeding |
| FTP Check node | Check for files on an FTP server before proceeding |
| Assignment node | Pass input and output parameters between nodes |
| Parameter node | Manage scheduling parameters |
| Do-while node | Execute node code in a loop |
| For-each node | Iterate over assignment node outputs |
| Branch node | Route workflow execution based on conditions |
| Shell node | Run shell scripts |
| MySQL database node | Run MySQL operations |
References: Zero load node · HTTP Trigger node · OSS object inspection node · FTP Check node · Assignment node · Parameter node · Do-while node · For-each node · Branch node
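As a rough analogy, the control-flow nodes in the table behave like familiar programming constructs. The Python sketch below is not DataWorks node code. The function names, sample dates, and route labels are made up to show how an assignment output might feed a for-each loop, how a branch picks a path, and how a do-while body runs once before its condition is checked.

```python
from typing import Callable


def assignment_node() -> list[str]:
    # Stands in for an assignment node: its return value is passed downstream.
    return ["2024-01-01", "2024-01-02", "2024-01-03"]


def branch_node(partition: str) -> str:
    # Stands in for a branch node: pick a downstream path based on a condition.
    return "full_load" if partition.endswith("-01") else "incremental_load"


def for_each_node(items: list[str], body: Callable[[str], str]) -> list[str]:
    # Stands in for a for-each node: run the body once per upstream output item.
    return [body(item) for item in items]


def do_while_node(max_rounds: int) -> int:
    # Stands in for a do-while node: the body runs once, then repeats
    # while the loop condition still holds.
    rounds = 0
    while True:
        rounds += 1  # loop body
        if rounds >= max_rounds:  # exit once the condition no longer holds
            break
    return rounds


routes = for_each_node(assignment_node(), branch_node)
```

Here `routes` holds one route label per date produced by `assignment_node`, and `do_while_node(0)` still executes its body once.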
Task lifecycle in DataStudio
After developing a node task, complete the following steps before it runs in production:
Configure scheduling properties — set dependencies and scheduling parameters so DataWorks can run the task periodically.
Debug the task — validate the task locally to avoid wasting compute resources in production. See Task debugging process.
Deploy to production — tasks run in Operation Center only after deployment. See Publish tasks.
Manage tasks — deploy, undeploy, or update scheduling properties in bulk. See Batch operations.
Apply process controls — use code review, smoke testing, and custom review logic to govern task changes. See Development process control.
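The steps above can be read as a checklist that a task must pass before deployment. The following sketch models that checklist in Python under stated assumptions: the `NodeTask` class, its fields, and `missing_steps_before_deploy` are hypothetical and do not correspond to a DataWorks API.

```python
from dataclasses import dataclass


@dataclass
class NodeTask:
    # Hypothetical fields that mirror the lifecycle steps in this section.
    name: str
    scheduling_configured: bool = False
    debugged: bool = False
    # Only relevant if process controls such as code review are enabled.
    review_passed: bool = True


def missing_steps_before_deploy(task: NodeTask) -> list[str]:
    """Return the lifecycle steps that are still open before deployment."""
    missing = []
    if not task.scheduling_configured:
        missing.append("configure scheduling properties")
    if not task.debugged:
        missing.append("debug the task")
    if not task.review_passed:
        missing.append("pass process controls")
    return missing
```

A task is ready to deploy in this model when the returned list is empty.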
Data governance
After registering a CDP/CDH cluster, DataWorks automatically collects metadata from the cluster. The following modules support data governance.
| Module | Description | References |
|---|---|---|
| Data Map | Enterprise-level metadata management: inventory data assets, view lineage, and search data objects. Table-level and field-level lineage is available for CDH Hive, CDH Spark, CDH Spark SQL, and CDH Impala nodes. | Data Map overview · Lineage display for different data sources |
| Security Center / Data Security Guard / Approval Center | Integrated data security governance: asset classification, sensitive data detection, authorization management, data masking, access auditing, and fraud detection. Note: Approval Center does not support custom approval flows for CDH/CDP tables. | Security Center overview · Data Security Guard overview · Approval Center overview |
| Data Governance Center | Proactive governance: establish domain rules, automatically detect assets that need optimization, and apply pre-emptive and post-event governance policies. Note: Only global check items and governance items apply to CDH/CDP data. | Data Governance Center overview |
Data analysis and services
| Module | Description | References |
|---|---|---|
| DataAnalysis | SQL-based online data analysis, business insight, data editing and sharing, chart cards, and visualized reports for daily reporting. | DataAnalysis overview |
| DataService Studio | Centralized API service management for internal and external systems, enabling unified data access and sharing. | DataService Studio overview |
Open Platform
DataWorks provides APIs and event subscriptions so your application systems can integrate with DataWorks for data management, governance, O&M, and real-time response to business changes.
| Module | Description | References |
|---|---|---|
| OpenAPI | Call DataWorks API operations to integrate your applications, automate data workflows, and reduce manual O&M. | OpenAPI |
| OpenEvent | Subscribe to DataWorks change events so your applications can detect and respond to changes immediately. | OpenEvent overview |
| Extensions | Register local programs as extensions to manage extension point events and processes in your workspace. | Extensions overview |