To use instances such as MaxCompute and Hologres for data development in DataWorks, you must associate them as computing resources. This topic describes how to create and manage computing resources, which provides the foundation for task development and scheduling.
Relationship between computing resources and data sources
DataWorks supports associating various computing resources. After you associate a resource, you can develop complex data processing and periodic scheduling tasks in DataWorks. When you associate most computing resources with DataWorks, a data source with the same name is automatically created. You can then use the Data Integration module to perform operations, such as data synchronization, based on that data source. The differences between a computing resource and a data source are as follows:
A computing resource is an instance of a compute engine that executes data processing and analysis tasks.
A data source connects to different data storage services to store and manage data.
Supported computing resources
DataWorks supports associating the following computing resources for data development.
Category | Computing resource type | Instructions for associating the computing resource | Data Studio (new version) | DataStudio (legacy version) |
Offline computing | ||||
Real-time query | ||||
Real-time computing | ||||
Multimodal search | ||||
Cluster management | ||||
When you associate a MaxCompute, AnalyticDB for MySQL, AnalyticDB for PostgreSQL, AnalyticDB for Spark, ClickHouse, Hologres, Lindorm, EMR Serverless StarRocks, or OpenSearch computing resource, a data source with the same name is also created in the current workspace.
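The rule above can be sketched as a simple lookup. This is a minimal illustration only: the set contents are taken from the list above, while the name `creates_data_source` and the string spellings are assumptions made for this example and are not DataWorks API identifiers.

```python
# Computing resource types that, per the list above, also create a
# same-named data source in the current workspace when associated.
# The spellings are illustrative labels, not DataWorks API identifiers.
AUTO_DATA_SOURCE_TYPES = frozenset({
    "MaxCompute",
    "AnalyticDB for MySQL",
    "AnalyticDB for PostgreSQL",
    "AnalyticDB for Spark",
    "ClickHouse",
    "Hologres",
    "Lindorm",
    "EMR Serverless StarRocks",
    "OpenSearch",
})


def creates_data_source(resource_type: str) -> bool:
    """Return True if associating this type also creates a data source."""
    return resource_type.strip() in AUTO_DATA_SOURCE_TYPES
```

For types outside this set, such as a CDH cluster, the function returns `False`, which mirrors the fact that the source lists no automatic data source for them.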
Permissions
To create computing resources, you must be a workspace member with the O&M or Workspace Administrator role, or have the AliyunDataWorksFullAccess or AdministratorAccess access policy. For more information, see Workspace-level module permission control and Grant permissions to a RAM user.
In addition to the preceding permissions, creating certain computing resources requires other access controls. Grant the permissions as prompted on the interface.
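The base permission rule in this section can be expressed as a short check. The sketch below is an assumption-laden illustration: the function name and the way roles and policies are passed in are hypothetical, and it does not model the extra access controls that certain computing resource types require.

```python
# Workspace roles and access policies that, per this section, allow a
# user to create computing resources.
CREATOR_ROLES = frozenset({"O&M", "Workspace Administrator"})
CREATOR_POLICIES = frozenset({"AliyunDataWorksFullAccess", "AdministratorAccess"})


def can_create_computing_resource(roles, policies) -> bool:
    """Return True if any qualifying role or policy is present.

    `roles` and `policies` are plain iterables of strings; this is a
    hypothetical representation, not a DataWorks or RAM API call.
    """
    has_role = bool(CREATOR_ROLES & set(roles))
    has_policy = bool(CREATOR_POLICIES & set(policies))
    return has_role or has_policy
```

For example, a user with only a hypothetical `Developer` role and no policy does not pass the check, while the same user with `AliyunDataWorksFullAccess` attached does.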
Associate a computing resource
The procedure for associating a computing resource depends on whether your workspace uses Data Studio (new version) or DataStudio (legacy version).
Associate a computing resource in Data Studio (new version)
Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, click . From the drop-down list, select the desired workspace and click Go To Management Center.
In the navigation pane on the left, click Computing Resources. On the Computing Resources page, find the computing resource type that you want to associate and follow the instructions in the corresponding document.
Associate a computing resource in DataStudio (legacy version)
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the navigation pane on the left, click the icon to go to the Computing Resource page. Find the type of computing resource that you want to associate and follow the instructions in the corresponding document.
Computing resource management: Click Create Computing Resource in the upper-right corner to create a computing resource.
Cluster Management: Click Create Cluster in the upper-right corner of the Computing Resource page to create a cluster.
Cluster Management
Supported cluster versions/types
References for associating clusters
Associate a CDH/CDP cluster
DataWorks provides CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, and CDP 7.1.7. You can select these versions directly. The component versions for these cluster versions are fixed. For more information, see Cluster connection information. If these cluster versions do not meet your business needs, select Custom Version.
Associate an EMR cluster
Supported EMR cluster types: DataLake cluster (new data lake): EMR on ECS, Custom cluster: EMR on ECS, Hadoop cluster (old data lake): EMR on ECS, Spark cluster: EMR on ACK, and EMR Serverless Spark cluster.
Important: DataWorks supports the following EMR versions for Hadoop clusters (old data lake):
EMR-3.38.2, EMR-3.38.3, EMR-4.9.0, EMR-5.6.0, EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-5.2.1, and EMR-5.4.3
Hadoop clusters (old data lake) are no longer recommended. Migrate to a DataLake cluster as soon as possible. For more information, see Migrate a Hadoop cluster to a DataLake cluster.
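The version rules in this section can be checked with a small helper before you start the association. This is a minimal sketch: the version lists are copied from the text above, while the function names are hypothetical and the code does not call any DataWorks or EMR API.

```python
# Preset CDH/CDP versions that can be selected directly.
PRESET_CDH_CDP = ("CDH 5.16.2", "CDH 6.1.1", "CDH 6.2.1", "CDH 6.3.2", "CDP 7.1.7")

# EMR versions supported for Hadoop clusters (old data lake).
SUPPORTED_HADOOP_EMR = frozenset({
    "EMR-3.38.2", "EMR-3.38.3", "EMR-4.9.0", "EMR-5.6.0", "EMR-3.26.3",
    "EMR-3.27.2", "EMR-3.29.0", "EMR-3.32.0", "EMR-3.35.0", "EMR-4.3.0",
    "EMR-4.4.1", "EMR-4.5.0", "EMR-4.5.1", "EMR-4.6.0", "EMR-4.8.0",
    "EMR-5.2.1", "EMR-5.4.3",
})


def cdh_cdp_selection(version: str) -> str:
    """Return the preset to select, or "Custom Version" if none matches."""
    return version if version in PRESET_CDH_CDP else "Custom Version"


def is_supported_hadoop_emr(version: str) -> bool:
    """Return True if the EMR version is listed for Hadoop clusters."""
    return version in SUPPORTED_HADOOP_EMR


def sorted_hadoop_emr_versions() -> list:
    """Return the supported versions in ascending numeric order."""
    def key(v: str):
        # "EMR-3.38.2" -> (3, 38, 2)
        return tuple(int(part) for part in v.split("-", 1)[1].split("."))
    return sorted(SUPPORTED_HADOOP_EMR, key=key)
```

Sorting numerically rather than as strings matters here, because a plain string sort would place `EMR-3.38.2` before `EMR-3.4.x`-style versions incorrectly.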
Disassociate a computing resource
Disassociate computing resources with caution. Disassociating a computing resource also deletes the associated data source of the same name. This action can affect tasks that reference the computing resource or data source in modules such as Data Integration, Operation Center, DataAnalysis, DataService Studio API, and Data Quality. To ensure that your business runs as expected, carefully read the prompts on the interface before you disassociate the resource. You must also migrate all tasks from the computing resource to another one.
On the Computing Resources page, find the computing resource and click Disassociate on the right to disassociate it from the current workspace.
Appendix: Task execution environments
In a standard mode workspace, each computing resource instance has two sets of configurations: one for the development environment and one for the production environment. You can specify different databases or instances for each environment. The system automatically maps and accesses the correct computing resource based on the runtime environment. This configuration isolates development and testing from production scheduling. For example, when you execute an offline sync task in the development environment, the task automatically accesses the pre-configured development database. When the same task runs under production scheduling, it accesses the production database.
A basic mode workspace has only one environment and does not isolate development from production. For more information, see Differences between workspace modes.
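The environment mapping described above can be pictured with a small model. DataWorks performs this mapping internally; the class, field, and function names below are hypothetical and exist only to illustrate the behavior of the two workspace modes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComputingResource:
    """Illustrative model of one associated computing resource."""
    name: str
    prod_target: str                  # database or instance used in production
    dev_target: Optional[str] = None  # None models a basic mode workspace


def resolve_target(resource: ComputingResource, environment: str) -> str:
    """Pick the database or instance a task accesses in an environment."""
    if environment not in ("development", "production"):
        raise ValueError(f"unknown environment: {environment!r}")
    if resource.dev_target is None:
        # Basic mode: a single environment, no dev/prod isolation.
        return resource.prod_target
    # Standard mode: each environment maps to its own configuration.
    if environment == "development":
        return resource.dev_target
    return resource.prod_target
```

With a standard mode resource such as `ComputingResource("mc", prod_target="sales", dev_target="sales_dev")`, a task run in development resolves to `sales_dev`, while the scheduled production run resolves to `sales`. The database names are made up for this example.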
If you upgrade a basic mode workspace to standard mode, the original computing resource is split into two isolated resources: one for the production environment and one for the development environment. Workspaces that use the Data Studio (new version) do not currently support upgrades. For more information, see Upgrade a workspace mode.