DataWorks: Associate a computing resource with a workspace (Participate in Public Preview of Data Studio turned on)

Last Updated: Mar 31, 2025

If you turn on Participate in Public Preview of Data Studio when you create a workspace, you must also associate computing resources with the workspace. You can then develop and schedule tasks of the associated computing resource types in the workspace.

Prerequisites

  • A workspace is created, and Participate in Public Preview of Data Studio is turned on when you create the workspace. For more information about how to create a workspace, see Create a workspace.

    You can find the desired workspace on the Workspaces page in the DataWorks console and perform the following operations to check whether Participate in Public Preview of Data Studio is turned on:

    • Participate in Public Preview of Data Studio not turned on: Choose Shortcuts > Data Development in the Actions column. The old-version DataStudio page appears. For more information about old-version DataStudio, see Overview.

    • Participate in Public Preview of Data Studio turned on: Choose Shortcuts > DataStudio (new version) in the Actions column. The new-version Data Studio page appears. For more information about new-version Data Studio, see Data Studio (new version).

  • Related computing resources are available. Associating a computing resource in DataWorks only links an existing computing resource to a DataWorks workspace. The storage, data, and billing of the computing resource are still managed in the corresponding service.

  • A pay-as-you-go serverless resource group is automatically purchased and associated with the default workspace when you activate DataWorks. You are not charged for the resource group if you do not use it. If you create a workspace and want to perform the operations that are described in this topic on the new workspace, associate the serverless resource group with the new workspace. For more information about how to associate a resource group with a workspace, see the Step 2: Associate the resource group with a workspace section of the "Create and use a serverless resource group" topic.

  • The computing resources that you want to associate with the workspace are connected to the serverless resource group. For more information, see Network connectivity solutions.

Terms

computing resource

A computing resource is a resource instance that a compute engine uses to run data processing and analysis tasks. Examples include a MaxCompute project for which a quota group is configured and a Hologres instance. For example, when you use Alibaba Cloud MaxCompute in big data processing scenarios, you can configure a quota group to manage the amount of computing resources that your computing tasks consume.

You can associate multiple types of computing resources with a workspace. After you associate MaxCompute, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, ClickHouse, E-MapReduce (EMR), Cloudera's Distribution Including Apache Hadoop (CDH), OpenSearch, Serverless Spark, Serverless StarRocks, or Realtime Compute for Apache Flink computing resources with a workspace, you can develop and schedule the tasks of specific computing resource types in the workspace.

data source

Data sources can be connected to different data storage services. A data source contains all the information that is required to connect to a data storage service. The information includes the username, password, and host address. Before data development, you must define information about the data sources that you want to use in DataWorks. This way, when you configure a task, you can select the names of data sources to determine the database from which you want to read data and the database to which you want to write data. You can add multiple types of data sources to a workspace.
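For illustration only, the connection information that a data source encapsulates can be pictured as a simple mapping. The following sketch is hypothetical; the exact fields vary by data source type in DataWorks:

    # Illustrative sketch only: the kind of connection information that a
    # data source encapsulates. Field names are hypothetical; the exact
    # fields vary by data source type in DataWorks.
    mysql_source = {
        "name": "my_mysql_source",   # the name you select when you configure a task
        "host": "<database-host-address>",
        "port": 3306,
        "database": "<database-name>",
        "username": "<username>",
        "password": "<password>",
    }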

data catalog

A data catalog is a structured list or map that displays all data assets within an organization. The data assets include but are not limited to databases, tables, and files. In DataWorks, a data catalog records the metadata information about data assets.

relationship among computing resources, data sources, and data catalogs

Computing resources, data sources, and data catalogs are independent items, but associations exist among them.

  • When you associate a computing resource with a workspace, the system automatically adds a corresponding data source to the workspace and associates a data catalog with the workspace.

  • When you add a data source to a workspace, the system automatically associates a corresponding data catalog with the workspace.

  • When you create a data catalog in a workspace, the system does not automatically add a corresponding data source to the workspace or associate a corresponding computing resource with the workspace.
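For illustration, the cascade described in the preceding bullets can be summarized as a simple mapping. This is a sketch only, not an actual DataWorks API:

    # Illustrative sketch only: what the system automatically creates for
    # each action, mirroring the three bullets above.
    auto_created = {
        "associate_computing_resource": ["data source", "data catalog"],
        "add_data_source": ["data catalog"],
        "create_data_catalog": [],  # nothing is created automatically
    }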

Associate a computing resource with a workspace

DataWorks allows you to associate a computing resource with a workspace in multiple ways. You can select a method based on your business requirements.

Associate a computing resource with a workspace when you create the workspace

When you create a workspace, the Associate Computing Resource step appears after you configure parameters and click Create Workspace. Then, you can select computing resources based on your business requirements and associate the computing resources with the workspace.


If you turn on Participate in Public Preview of Data Studio when you create a workspace in DataWorks, you can associate different types of computing resources with the workspace in the Associate Computing Resource step. The following list describes the association by category. For more information about parameter configuration, see the section for each computing resource type in this topic.

  • Offline computing

    • MaxCompute: DataWorks cannot be directly connected to MaxCompute quota groups. You can associate only MaxCompute projects with a DataWorks workspace. After you associate a MaxCompute computing resource with a DataWorks workspace, the system automatically adds a MaxCompute data source to the workspace and associates a MaxCompute data catalog with the workspace.

    • Serverless Spark: You can associate a Serverless Spark workspace with a DataWorks workspace. For Spark computing resources, no data catalog is required for the association.

  • Real-time query

    • Hologres: DataWorks cannot be directly connected to Hologres virtual warehouses. You can associate Hologres databases with a DataWorks workspace. After you associate a Hologres computing resource with a DataWorks workspace, the system automatically adds a Hologres data source to the workspace and associates a Hologres data catalog with the workspace.

    • Serverless StarRocks: DataWorks cannot be directly connected to Serverless StarRocks queues. You can associate Serverless StarRocks instances with a DataWorks workspace. After you associate a Serverless StarRocks computing resource with a DataWorks workspace, the system automatically adds a Serverless StarRocks data source to the workspace and associates a Serverless StarRocks data catalog with the workspace.

  • Fully managed

    • Realtime Compute for Apache Flink: You can associate a Realtime Compute for Apache Flink namespace with a DataWorks workspace. For Realtime Compute for Apache Flink computing resources, no data catalog is required for the association.

  • Multimodal search

    • OpenSearch: You can associate an OpenSearch instance with a DataWorks workspace. After you associate an OpenSearch computing resource with a DataWorks workspace, the system automatically adds an OpenSearch data source to the workspace. For OpenSearch computing resources, no data catalog is required for the association.

Associate a computing resource with a DataWorks workspace on the details page of the workspace

If you do not associate a computing resource with a workspace when you create the workspace, you can associate the computing resource with the workspace on the details page of the workspace.

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Workspaces in the left-side navigation pane.

  2. On the Workspaces page, find the desired workspace and click Details in the Actions column.

  3. In the left-side navigation pane of the Workspace Details page, click Computing Resource. On the Computing Resource page, click Associate Computing Resource. In the Associate Computing Resource panel, select computing resource types based on your business requirements and configure parameters. For more information about parameter settings, see the Parameter configuration for association of different types of computing resources section in this topic.


  4. After the configuration is complete, click OK.

Associate a computing resource with a DataWorks workspace in Management Center

If you do not associate a computing resource with a DataWorks workspace when you create the workspace, you can associate the computing resource with the workspace in Management Center.

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

  2. In the left-side navigation pane of the SettingCenter page, click Computing Resource.

  3. On the Computing Resource page, click Associate Computing Resource. In the Associate Computing Resource panel, select computing resource types based on your business requirements and configure parameters. For more information about parameter settings, see the Parameter configuration for association of different types of computing resources section in this topic.


Associate a computing resource with a DataWorks workspace on the Data Studio page

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left-side navigation pane of the Data Studio page, click the icon and select Computing Resources.


  3. On the Computing Resources tab, click Associate Computing Resource. In the Associate Computing Resource panel, select computing resource types based on your business requirements and configure parameters. For more information about parameter settings, see the Parameter configuration for association of different types of computing resources section in this topic.


Parameter configuration for association of different types of computing resources

MaxCompute

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click MaxCompute. For more information about MaxCompute, see What is MaxCompute?

  2. In the "Enter information" step, configure the following parameters:

     • MaxCompute Project: Select the MaxCompute project that you want to associate with the current workspace. You can also click Create in the MaxCompute Project drop-down list to create a MaxCompute project and then select the created project from the drop-down list.

       Note
       • If you turn on Isolate Development and Production Environments when you create the current workspace, you must select a separate MaxCompute project for each of the development and production environments. You cannot select the same MaxCompute project for both environments.

       • For more information about the billing of MaxCompute computing resources, see Billable items and billing methods.

       • For more information about how to create a MaxCompute project, see Create a MaxCompute project.

     • Default Access Identity: The default identity that is used to access the MaxCompute project in the current workspace.

       • Development environment: The value is fixed as Executor.

       • Production environment: The value can be Alibaba Cloud Account, Alibaba Cloud RAM User, or Alibaba Cloud RAM Role.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

     • Connection Configuration: Select a resource group to connect to the MaxCompute compute engine. You can test the network connectivity between the resource group and the compute engine. If no resource group is associated with the current workspace, you can skip the connectivity test.

       Note
       If no resource group is available, create a resource group and associate it with the current workspace. Then, go to the details page of the workspace to test the connectivity between the resource group and the computing resource that you want to associate. For more information, see Create and use a serverless resource group.

  3. Click OK.

    Note
    • After you associate a MaxCompute computing resource with a DataWorks workspace, the system automatically adds a MaxCompute data source to the workspace and associates a MaxCompute data catalog with the workspace.

    • After you associate a MaxCompute computing resource with a DataWorks workspace, you can view the details of the MaxCompute computing resource in the DATA CATALOG pane of the Data Studio page.
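    After the association, you can optionally verify from your own environment that the MaxCompute project is reachable. The following minimal sketch assumes the PyODPS library (pip install pyodps); the credentials, project name, and endpoint are placeholders:

        # Minimal sketch: confirm that the associated MaxCompute project is
        # reachable. Assumes PyODPS; all values below are placeholders.
        from odps import ODPS

        o = ODPS(
            "<access-key-id>",
            "<access-key-secret>",
            project="<maxcompute-project>",
            endpoint="https://service.<region>.maxcompute.aliyun.com/api",
        )

        # List a few tables to confirm that the project is accessible.
        for i, table in enumerate(o.list_tables()):
            print(table.name)
            if i >= 4:
                break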

Serverless Spark

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click Serverless Spark. For more information about Serverless Spark, see What is EMR Serverless Spark?

  2. In the "Enter information" step, configure the following parameters:

     • Spark Workspace: Select the Spark workspace that you want to associate with the current workspace. You can also click Create in the Spark Workspace drop-down list to create a Spark workspace on the EMR Serverless Spark page in the EMR console, and then select the created Spark workspace from the drop-down list.

       Note
       • If you turn on Isolate Development and Production Environments when you create the current workspace, you must select a separate Spark workspace for each of the development and production environments.

       • For more information about how to create a Spark workspace, see Create a workspace.

     • Role Assignment: The first time you select a Spark workspace, click Add Service-linked Role as Workspace Administrator to ensure that DataWorks can obtain the information about the EMR Serverless Spark workspace as expected.

       Important
       After the service-linked roles are created, we recommend that you do not remove the administrator identity of the DataWorks service-linked roles AliyunServiceRoleForDataWorksOnEmr and AliyunServiceRoleForDataworksEngine from the EMR Serverless Spark workspace.

     • Default Engine Version, Default Resource Queue, and Default SQL Compute: When you create an EMR Spark node in Data Studio, the engine version, resource queue, and SQL compute that you specify in this step are used by default. If you want to specify different engine versions, resource queues, or SQL computes for individual nodes, configure the settings in the EMR Node Parameters section of the Properties tab on the configuration tab of each node.

     • Default Access Identity: The default identity that is used to access the Spark workspace in the current workspace.

       • Development environment: The value is fixed as Executor.

       • Production environment: The value can be Alibaba Cloud Account, Alibaba Cloud RAM User, or Task Owner.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

  3. Click OK.

    Note

    For Spark computing resources, no data catalog is required for the association.

Hologres

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click Hologres. For more information about Hologres, see What is Hologres?

  2. In the "Enter information" step, configure the following parameters:

     • Hologres Instance: Select the Hologres instance that you want to associate with the current workspace. You can also click Create in the Hologres Instance drop-down list to create a Hologres instance on the Hologres buy page, and then select the created instance from the drop-down list.

       Note
       • If you turn on Isolate Development and Production Environments when you create the current workspace, you must select a separate Hologres instance for each of the development and production environments.

       • For more information about how to create a Hologres instance, see Purchase a Hologres instance.

     • Hologres Virtual Warehouse: If the current Hologres instance supports virtual warehouses, you must configure a virtual warehouse for the instance. For more information, see Manage virtual warehouses.

     • Database Name: Select a database in the Hologres instance. If no database is available, you can click Create in the Database Name drop-down list to create one. For more information, see Create a database.

     • Default Access Identity: The default identity that is used to access the Hologres instance in the current workspace.

       • Development environment: The value is fixed as Executor.

       • Production environment: The value can be Alibaba Cloud Account, Alibaba Cloud RAM User, or Alibaba Cloud RAM Role.

     • Authentication Method: Specifies whether to use SSL authentication for the Hologres instance. If you select SSL Authentication, you must also configure the SSL Encryption parameter.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

     • Connection Configuration: Select a resource group to connect to the Hologres instance. You can test the network connectivity between the resource group and the Hologres instance. If no resource group is associated with the current workspace, you can skip the connectivity test.

       Note
       If no resource group is available, create a resource group and associate it with the current workspace. Then, go to the details page of the workspace to test the connectivity between the resource group and the computing resource that you want to associate. For more information, see Create and use a serverless resource group.

  3. Click OK.

    Note

    After you associate a Hologres computing resource with a DataWorks workspace, the system automatically adds a Hologres data source to the workspace and associates a Hologres data catalog with the workspace.
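    Because Hologres speaks the PostgreSQL protocol, you can optionally verify the connection and the SSL setting from your own environment. The following minimal sketch assumes psycopg2 (pip install psycopg2-binary); the endpoint, port, credentials, and certificate path are placeholders:

        # Minimal sketch: connect to the associated Hologres database over SSL.
        # All values are placeholders; confirm the endpoint and port in the
        # Hologres console.
        import psycopg2

        conn = psycopg2.connect(
            host="<hologres-endpoint>",
            port=80,  # a common Hologres endpoint port; confirm for your instance
            dbname="<database-name>",
            user="<access-key-id>",
            password="<access-key-secret>",
            sslmode="verify-ca",  # corresponds to the SSL Authentication option
            sslrootcert="/path/to/ca-certificate.crt",
        )
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            print(cur.fetchone())
        conn.close()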

Serverless StarRocks

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click Serverless StarRocks. For more information about Serverless StarRocks, see What is EMR Serverless StarRocks?

  2. In the "Enter information" step, configure the following parameters:

     • StarRocks Instance: Select the StarRocks instance that you want to associate with the current workspace. You can also click Create in the StarRocks Instance drop-down list to create a StarRocks instance on the EMR Serverless StarRocks page in the EMR console, and then select the created instance from the drop-down list.

       Note
       • If you turn on Isolate Development and Production Environments when you create the current workspace, you must select a separate StarRocks instance for each of the development and production environments.

       • For more information about how to create a StarRocks instance, see Create an instance.

     • Database Name: Select a database in the StarRocks instance. If no database is available, you must first create a database in the StarRocks instance.

     • Username and Password: The account and password that are specified when you create the StarRocks instance. The default account is admin.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

     • Connection Configuration: Select a resource group to connect to the StarRocks instance. You can test the network connectivity between the resource group and the StarRocks instance. If no resource group is associated with the current workspace, you can skip the connectivity test.

       Note
       If no resource group is available, create a resource group and associate it with the current workspace. Then, go to the details page of the workspace to test the connectivity between the resource group and the computing resource that you want to associate. For more information, see Create and use a serverless resource group.

  3. Click OK.

    Note
    • After you associate a StarRocks computing resource with a DataWorks workspace, the system automatically adds a StarRocks data source to the workspace and associates a StarRocks data catalog with the workspace.

    • After you associate a StarRocks computing resource with a DataWorks workspace, you can view the details of the StarRocks computing resource in the DATA CATALOG pane of the Data Studio page.
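    Because StarRocks is compatible with the MySQL protocol, you can optionally verify the account from your own environment. The following minimal sketch assumes PyMySQL (pip install pymysql); the endpoint, password, and database name are placeholders, and 9030 is the typical StarRocks frontend query port (confirm in your console):

        # Minimal sketch: query the associated Serverless StarRocks instance.
        # All values are placeholders.
        import pymysql

        conn = pymysql.connect(
            host="<starrocks-fe-endpoint>",
            port=9030,
            user="admin",  # the default account mentioned above
            password="<password>",
            database="<database-name>",
        )
        with conn.cursor() as cur:
            cur.execute("SHOW DATABASES")
            for (name,) in cur.fetchall():
                print(name)
        conn.close()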

Realtime Compute for Apache Flink

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click Fully Managed Flink. For more information about Realtime Compute for Apache Flink, see What is Alibaba Cloud Realtime Compute for Apache Flink?

  2. In the "Enter information" step, configure the following parameters:

     • Flink Workspace: Select the Realtime Compute for Apache Flink workspace that you want to associate with the current workspace. You can also click Create in the Flink Workspace drop-down list to create a workspace on the buy page of Realtime Compute for Apache Flink, and then select the created workspace from the drop-down list.

       Note
       • If you turn on Isolate Development and Production Environments when you create the current workspace, you must select a separate Realtime Compute for Apache Flink workspace for each of the development and production environments.

       • For more information about how to create a Realtime Compute for Apache Flink workspace, see Activate Realtime Compute for Apache Flink.

     • Flink Namespace: Select a namespace in the Realtime Compute for Apache Flink workspace. A default namespace is automatically generated after a Realtime Compute for Apache Flink workspace is created. You can also create a namespace in the management console of Realtime Compute for Apache Flink and then select the created namespace from the Flink Namespace drop-down list. For more information, see Manage namespaces.

     • Default Resource Queue In Which Namespace Is Deployed: Select a default resource queue. When you create a Realtime Compute for Apache Flink node in Data Studio, the resource queue that you select in this step is used by default.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

  3. Click OK.

    Note

    For Realtime Compute for Apache Flink computing resources, no data catalog is required for the association.

OpenSearch

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click OpenSearch. For more information about OpenSearch, see What is OpenSearch?

  2. In the "Enter information" step, configure the following parameters:

     • OpenSearch Instance: Select the OpenSearch instance that you want to associate with the current workspace. You can also click Create in the OpenSearch Instance drop-down list to create an OpenSearch instance on the OpenSearch buy page, and then select the created instance from the drop-down list.

     • Username and Password: Enter the username and password that are specified when you create the OpenSearch instance.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

  3. Click OK.

    Note

    For OpenSearch computing resources, no data catalog is required for the association.

AnalyticDB for MySQL V3.0

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click AnalyticDB for MySQL (V3.0). For more information about AnalyticDB for MySQL V3.0, see What is AnalyticDB for MySQL?

  2. In the "Enter information" step, configure the following parameters:

     • Configuration Mode: The mode in which you want to associate a computing resource. The value is fixed as Alibaba Cloud Instance Mode.

     • Alibaba Cloud Account: The value is fixed as Current Alibaba Cloud Account.

     • Region: The region in which the AnalyticDB for MySQL V3.0 instance resides.

       Note
       The system automatically creates a corresponding data source after you associate an AnalyticDB for MySQL V3.0 computing resource with a workspace. If the region that is selected when you associate the computing resource is different from the region in which the current workspace resides, the data source cannot be used for data development or periodic task scheduling in Data Studio. The data source can be used only for data synchronization in Data Integration.

     • Instance: Select the instance that you want to associate with the current workspace.

       Note
       If you turn on Isolate Development and Production Environments when you create the current workspace, you must select different instances or databases for the development and production environments.

     • Database Name: Enter the name of the database that is created in the AnalyticDB for MySQL V3.0 instance. For more information, see Create a database.

     • Username: Enter a username that has permissions on the database.

     • Password: Enter the password that corresponds to the username.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

     • Connection Configuration: Select a resource group to connect to the AnalyticDB for MySQL V3.0 compute engine. You can test the network connectivity between the resource group and the compute engine. If no resource group is associated with the current workspace, you can skip the connectivity test.

       Note
       If no resource group is available, create a resource group and associate it with the current workspace. Then, go to the details page of the workspace to test the connectivity between the resource group and the computing resource that you want to associate. For more information, see Create and use a serverless resource group.

  3. Click OK.
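    Optionally, you can verify that the username and password entered above can reach the database. AnalyticDB for MySQL is MySQL-compatible, so the following minimal sketch uses PyMySQL (pip install pymysql); the endpoint and credentials are placeholders:

        # Minimal sketch: check connectivity to the AnalyticDB for MySQL V3.0
        # database with the account configured above. All values are placeholders.
        import pymysql

        conn = pymysql.connect(
            host="<adb-mysql-endpoint>",
            port=3306,
            user="<username>",
            password="<password>",
            database="<database-name>",
        )
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())  # (1,) confirms basic connectivity
        conn.close()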

AnalyticDB for PostgreSQL

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click AnalyticDB for PostgreSQL. For more information about AnalyticDB for PostgreSQL, see Overview.

  2. In the "Enter information" step, configure the following parameters:

     • Configuration Mode: The mode in which you want to associate a computing resource. The value is fixed as Alibaba Cloud Instance Mode.

     • Alibaba Cloud Account: The value is fixed as Current Alibaba Cloud Account.

     • Region: The region in which the AnalyticDB for PostgreSQL instance that you want to associate resides.

       Note
       The system automatically creates a corresponding data source after you associate an AnalyticDB for PostgreSQL computing resource with a workspace. If the region that is selected when you associate the computing resource is different from the region in which the current workspace resides, the data source cannot be used for data development or periodic task scheduling in Data Studio. The data source can be used only for data synchronization in Data Integration.

     • Instance: Select the instance that you want to associate with the current workspace.

     • Database Name: Enter the name of the database that is created in the AnalyticDB for PostgreSQL instance. For more information, see Manage databases.

     • Username: Enter a username that has permissions on the database.

     • Password: Enter the password that corresponds to the username.

     • Authentication Method: Specifies whether to use SSL authentication for the AnalyticDB for PostgreSQL instance. If you select SSL Authentication, you must also configure the Truststore Certificate File parameter.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

     • Connection Configuration: Select a resource group to connect to the AnalyticDB for PostgreSQL compute engine. You can test the network connectivity between the resource group and the compute engine. If no resource group is associated with the current workspace, you can skip the connectivity test.

       Note
       If no resource group is available, create a resource group and associate it with the current workspace. Then, go to the details page of the workspace to test the connectivity between the resource group and the computing resource that you want to associate. For more information, see Create and use a serverless resource group.

  3. Click OK.
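    Optionally, you can verify the account from your own environment. AnalyticDB for PostgreSQL is PostgreSQL-compatible, so the following minimal sketch uses psycopg2 (pip install psycopg2-binary); the endpoint and credentials are placeholders, and 5432 is the usual PostgreSQL port:

        # Minimal sketch: check connectivity to the AnalyticDB for PostgreSQL
        # database with the account configured above. All values are placeholders.
        import psycopg2

        conn = psycopg2.connect(
            host="<adb-postgresql-endpoint>",
            port=5432,
            dbname="<database-name>",
            user="<username>",
            password="<password>",
        )
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())  # (1,) confirms basic connectivity
        conn.close()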

AnalyticDB for Spark

Note

AnalyticDB for Spark is the Spark compute engine of AnalyticDB for MySQL Enterprise Edition instances. For more information, see Spark compute engine.

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click AnalyticDB for Spark.

  2. In the "Enter information" step, configure the following parameters:

     • Configuration Mode: The mode in which you want to associate a computing resource. The value is fixed as Alibaba Cloud Instance Mode.

     • Alibaba Cloud Account: The value is fixed as Current Alibaba Cloud Account.

     • Region: The region in which the AnalyticDB for MySQL instance that you want to associate resides.

       Note
       The system automatically creates a corresponding data source after you associate an AnalyticDB for MySQL computing resource with a workspace. If the region that is selected when you associate the computing resource is different from the region in which the current workspace resides, the data source cannot be used for data development or periodic task scheduling in Data Studio. The data source can be used only for data synchronization in Data Integration.

     • Instance: Select the instance that you want to associate with the current workspace.

     • Database Name: Enter the name of the database that is created in the AnalyticDB for MySQL instance. For more information, see Create a database.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

     • Connection Configuration: Select a resource group to connect to the AnalyticDB for Spark compute engine. You can test the network connectivity between the resource group and the compute engine. If no resource group is associated with the current workspace, you can skip the connectivity test.

       Note
       If no resource group is available, create a resource group and associate it with the current workspace. Then, go to the details page of the workspace to test the connectivity between the resource group and the computing resource that you want to associate. For more information, see Create and use a serverless resource group.

  3. Click OK.

CDH

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click CDH.

  2. In the "Enter information" step, configure the parameters that are described below.

    • Basic information about CDH clusters

      • Cluster Version: The version of the cluster that you want to associate. You can select CDH 5.16.2, CDH 6.1.1, CDH 6.2.1, CDH 6.3.2, or CDP 7.1.7 from the Cluster Version drop-down list. After you select one of these cluster versions, the component versions that are compatible with the cluster version are fixed. If the provided cluster versions do not meet your business requirements, you can select Custom Version from the Cluster Version drop-down list and specify the version of each component based on your business requirements.

      • Cluster Name: The name of the cluster that you want to associate with the current workspace. This parameter determines the source of the configuration information that is required for the association. You can select a cluster that is registered to another DataWorks workspace or create a cluster.

        • If you select a cluster that is registered to another DataWorks workspace, you can reference the configuration information of that cluster.

        • If you create a cluster, you must configure the cluster before you can associate it.

      • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

    • Connection information about CDH clusters

      • Hive Connection Information: Select a Hive version from the Version drop-down list based on the cluster version. Then, configure the HiveServer2 and Metastore parameters.

      • Impala Connection Information: Select an Impala version from the Version drop-down list based on the cluster version. Then, configure the JDBC URL parameter.

      • Spark Connection Information: Select a Spark version from the Version drop-down list based on the cluster version.

      • YARN Connection Information: Select a YARN version from the Version drop-down list based on the cluster version. Then, configure the Yarn.Resourcemanager.Address and Jobhistory.Webapp.Address parameters.

      • MapReduce Connection Information: Select a MapReduce version from the Version drop-down list based on the cluster version.

      • Presto: Select a Presto version from the Version drop-down list based on the cluster version. Then, configure the JDBC URL parameter.

    • Addition of configuration files

      • Core-Site file: Contains the global configurations of the Hadoop core library, such as I/O settings that are commonly used by Hadoop Distributed File System (HDFS) and MapReduce. Upload this file if you want to run Spark or MapReduce tasks.

      • Hdfs-Site file: Contains HDFS-related configurations, such as the data block size, the number of replicas, and path names. Upload this file if you want to run Spark or MapReduce tasks.

      • Mapred-Site file: Contains MapReduce-related parameters. For example, you can use this file to configure the execution method and scheduling settings of MapReduce jobs. Upload this file if you want to run MapReduce tasks.

      • Yarn-Site file: Contains all configurations that are related to the YARN daemon, such as the configurations of resource managers, the configurations of node managers, and the runtime environment configurations of applications. Upload this file if you want to run Spark or MapReduce tasks or if Kerberos Account is selected as the account mapping type.

      • Hive-Site file: Contains the parameters that are used to configure Hive. For example, you can use this file to configure the database connection information, Hive Metastore settings, and the execution engine. Upload this file if Kerberos Account is selected as the account mapping type.

      • Spark-Defaults file: Contains the default configurations based on which Spark jobs run. You can use the spark-defaults.conf file to preconfigure parameters such as the memory size and the number of CPU cores, which are applied when a Spark application runs. Upload this file if you want to run Spark tasks.

      • Config.Properties file: Contains the configurations of a Presto server. For example, you can use this file to configure global properties for the coordinator and worker nodes in a Presto cluster. Upload this file if you want to use the Presto component and OpenLDAP Account or Kerberos Account is selected as the account mapping type.

      • Presto.Jks file: Stores security certificates, including private keys and public key certificates that are issued to applications. In the Presto query engine, the presto.jks file is used to enable SSL- or TLS-encrypted communication for the Presto process to ensure data transmission security. Upload this file if you want to use the Presto component and OpenLDAP Account or Kerberos Account is selected as the account mapping type.

  3. Click OK.
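    Before you upload the configuration files, you can run a quick local check that the files required for the tasks you plan to run are present. The following minimal sketch assumes the files were exported with the standard names shown above; the local directory is a placeholder:

        # Minimal sketch: confirm that the CDH configuration files required
        # for the tasks you plan to run exist locally before uploading them.
        import os

        required = {
            "core-site.xml": "Spark or MapReduce tasks",
            "hdfs-site.xml": "Spark or MapReduce tasks",
            "mapred-site.xml": "MapReduce tasks",
            "yarn-site.xml": "Spark or MapReduce tasks, or Kerberos account mapping",
            "hive-site.xml": "Kerberos account mapping",
            "spark-defaults.conf": "Spark tasks",
        }

        conf_dir = "/path/to/exported/configs"  # placeholder
        for name, scenario in required.items():
            status = "found" if os.path.exists(os.path.join(conf_dir, name)) else "MISSING"
            print(f"{name} ({scenario}): {status}")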

ClickHouse

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click ClickHouse.

  2. In the "Enter information" step, configure the following parameters:

     • Configuration Mode: The value is fixed as Connection String Mode.

     • JDBC URL: The JDBC URL that is used to connect to ClickHouse. You can log on to the ApsaraDB for ClickHouse console to obtain information about a ClickHouse database and the port number over which you can access the database.

     • Username: The username that you use to access the ClickHouse cluster.

     • Password: The password that you use to access the ClickHouse cluster.

     • Authentication Method: Specifies whether to enable SSL authentication for the ClickHouse cluster. If you enable SSL authentication for the ClickHouse cluster, the ClickHouse data source that is added based on the cluster cannot be used for data development or periodic task scheduling.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

     • Connection Configuration: Select a resource group to connect to the ClickHouse compute engine. You can test the network connectivity between the resource group and the compute engine. If no resource group is associated with the current workspace, you can skip the connectivity test.

       Note
       If no resource group is available, create a resource group and associate it with the current workspace. Then, go to the details page of the workspace to test the connectivity between the resource group and the computing resource that you want to associate. For more information, see Create and use a serverless resource group.

  3. Click OK.
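    For reference, a ClickHouse JDBC URL typically follows the standard clickhouse-jdbc form shown in the following sketch. The host, port, and database are placeholders to obtain from your ApsaraDB for ClickHouse console, and 8123 is the usual HTTP port:

        # Illustrative sketch: composing a ClickHouse JDBC URL in the standard
        # clickhouse-jdbc form. All values are placeholders.
        host = "<clickhouse-endpoint>"
        port = 8123  # usual HTTP port; confirm in the ApsaraDB for ClickHouse console
        database = "<database-name>"

        jdbc_url = f"jdbc:clickhouse://{host}:{port}/{database}"
        print(jdbc_url)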

EMR

  1. In the "Select a computing resource type" step of the Associate Computing Resource panel, click EMR. For more information about EMR, see What is EMR on ECS?

  2. In the "Enter information" step, configure the parameters that are described below.

    Parameters to configure if you select Current Alibaba Cloud Account

    If you set the Alibaba Cloud Account To Which Cluster Belongs parameter to Current Alibaba Cloud Account, configure the following parameters:

     • Cluster Type: The type of the EMR cluster that you want to associate with the current workspace. DataWorks allows you to associate only specific types of EMR clusters. For more information, see the Supported EMR cluster types section of the "Register an EMR cluster to DataWorks" topic.

     • Cluster: The EMR cluster that you want to associate within the current account.

     • Default Access Identity: The default identity that is used to access the EMR cluster in the current workspace.

       • Development environment: You can select Cluster Account: hadoop or Cluster Account Mapped to Account of Task Executor.

       • Production environment: You can select Cluster Account: hadoop, Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User.

       Note
       If you select Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User, you can configure a mapping between a DataWorks tenant member and a specific EMR cluster account. For more information, see Configure mappings between tenant member accounts and EMR cluster accounts. The mapped EMR cluster account is used to run EMR tasks in DataWorks. If no mapping is configured, DataWorks applies the following policies when tasks run:

       • If you set the Default Access Identity parameter to Cluster Account Mapped to RAM User and select a RAM user from the Alibaba Cloud RAM User drop-down list, the EMR cluster account that has the same name as the RAM user is automatically used to run EMR tasks in DataWorks. If Lightweight Directory Access Protocol (LDAP) or Kerberos authentication is enabled for the EMR cluster, the EMR tasks fail to run.

       • If you set the Default Access Identity parameter to Cluster Account Mapped to Alibaba Cloud Account, errors are reported when EMR tasks are run in DataWorks.

     • Pass Proxy User Information: Specifies whether to pass the proxy user information. A sketch of the passing mechanisms follows this parameter list.

       Note
       If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR cluster issues an authentication credential to each ordinary user, which is a cumbersome operation. To manage user permissions in a centralized manner, a superuser (real user) performs permission authentication on behalf of a proxy user (ordinary user). When the proxy user accesses the EMR cluster, the identity authentication information of the superuser is used. You need to add a user as a proxy user only if you want the user to be authenticated by using the identity authentication information of a superuser.

       • Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the proxy user.

         • Old-version DataStudio and DataAnalysis: The name of the Alibaba Cloud account used by the task executor is dynamically passed. The proxy user information is the account information of the task executor.

         • Operation Center: The name of the Alibaba Cloud account used by the default access identity, which is specified when you register the EMR cluster, is consistently passed. The proxy user information is the account information of the default access identity.

       • Do Not Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the account authentication method that is specified when you register the EMR cluster.

       The method that is used to pass the proxy user information varies based on the type of the EMR task:

       • EMR Kyuubi task: The proxy user information is passed by using the hive.server2.proxy.user configuration item.

       • EMR Spark task or non-JDBC-mode EMR Spark SQL task: The proxy user information is passed by using the --proxy-user option.

     • Configuration files: If you set the Cluster Type parameter to HADOOP, you must also upload the required configuration files. You can obtain the configuration files in the EMR console. For more information, see Export and import service configurations. After you export a service configuration file, change the name of the file based on the file upload requirements of the GUI.

       You can also log on to the EMR cluster that you want to associate and obtain the required configuration files from the following paths:

       /etc/ecm/hadoop-conf/core-site.xml
       /etc/ecm/hadoop-conf/hdfs-site.xml
       /etc/ecm/hadoop-conf/mapred-site.xml
       /etc/ecm/hadoop-conf/yarn-site.xml
       /etc/ecm/hive-conf/hive-site.xml
       /etc/ecm/spark-conf/spark-defaults.conf
       /etc/ecm/spark-conf/spark-env.sh

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.
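    The following minimal sketch (Python) illustrates the two passing mechanisms described above. The Kyuubi endpoint, account name, entry-point class, and JAR path are placeholders, not values from this topic:

        # Minimal sketch of the two proxy-user passing mechanisms described
        # above. Endpoint, account name, class, and JAR path are placeholders.

        # EMR Kyuubi task: the proxy user travels as the
        # hive.server2.proxy.user configuration item, for example appended to
        # a HiveServer2-style JDBC URL.
        kyuubi_jdbc_url = (
            "jdbc:hive2://<kyuubi-endpoint>:10009/default"
            ";hive.server2.proxy.user=<executor_account>"
        )

        # EMR Spark task or non-JDBC-mode EMR Spark SQL task: the proxy user
        # is passed on the spark-submit command line.
        spark_submit_args = [
            "spark-submit",
            "--proxy-user", "<executor_account>",
            "--class", "com.example.Main",  # hypothetical entry point
            "/path/to/job.jar",
        ]

        print(kyuubi_jdbc_url)
        print(" ".join(spark_submit_args))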

    Parameters to configure if you select Another Alibaba Cloud Account

    If you set the Alibaba Cloud Account To Which Cluster Belongs parameter to Another Alibaba Cloud Account, configure the following parameters:

     • UID Of Alibaba Cloud Account: The UID of the Alibaba Cloud account to which the EMR cluster that you want to associate belongs.

     • RAM Role: The RAM role that you want to use to access the EMR cluster. The RAM role must meet the following requirements:

       • The RAM role is created within the Alibaba Cloud account that you selected.

       • The RAM role is authorized to access the DataWorks service that is activated within the current logon account.

     • EMR Cluster Type: The type of the EMR cluster that you want to associate. You can associate only DataLake clusters, Hadoop clusters, and custom clusters that are created on the EMR on ECS page across accounts.

     • EMR Cluster: The EMR cluster that you want to associate.

     • Configuration files: The required configuration files. You can configure the parameters that are displayed to upload the files. For information about how to obtain the configuration files, see Export and import service configurations. After you export a service configuration file, change the name of the file based on the file upload requirements of the GUI.

       You can also log on to the EMR cluster that you want to associate and obtain the required configuration files from the following paths:

       /etc/ecm/hadoop-conf/core-site.xml
       /etc/ecm/hadoop-conf/hdfs-site.xml
       /etc/ecm/hadoop-conf/mapred-site.xml
       /etc/ecm/hadoop-conf/yarn-site.xml
       /etc/ecm/hive-conf/hive-site.xml
       /etc/ecm/spark-conf/spark-defaults.conf
       /etc/ecm/spark-conf/spark-env.sh

     • Default Access Identity: The default identity that is used to access the EMR cluster in the current workspace.

       • Development environment: You can select Cluster Account: hadoop or Cluster Account Mapped to Account of Task Executor.

       • Production environment: You can select Cluster Account: hadoop, Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User.

       Note
       If you select Cluster Account Mapped to Account of Task Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User, you can configure a mapping between a DataWorks tenant member and a specific EMR cluster account. For more information, see Configure mappings between tenant member accounts and EMR cluster accounts. The mapped EMR cluster account is used to run EMR tasks in DataWorks. If no mapping is configured, DataWorks applies the following policies when tasks run:

       • If you set the Default Access Identity parameter to Cluster Account Mapped to RAM User and select a RAM user from the RAM User drop-down list, the EMR cluster account that has the same name as the RAM user is automatically used to run EMR tasks in DataWorks. If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR tasks fail to run.

       • If you set the Default Access Identity parameter to Cluster Account Mapped to Alibaba Cloud Account, errors are reported when EMR tasks are run in DataWorks.

     • Pass Proxy User Information: Specifies whether to pass the proxy user information. The sketch that follows the parameter list for Current Alibaba Cloud Account also applies here.

       Note
       If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR cluster issues an authentication credential to each ordinary user, which is a cumbersome operation. To manage user permissions in a centralized manner, a superuser (real user) performs permission authentication on behalf of a proxy user (ordinary user). When the proxy user accesses the EMR cluster, the identity authentication information of the superuser is used. You need to add a user as a proxy user only if you want the user to be authenticated by using the identity authentication information of a superuser.

       • Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the proxy user.

         • Old-version DataStudio and DataAnalysis: The name of the Alibaba Cloud account used by the task executor is dynamically passed. The proxy user information is the account information of the task executor.

         • Operation Center: The name of the Alibaba Cloud account used by the default access identity, which is specified when you register the EMR cluster, is consistently passed. The proxy user information is the account information of the default access identity.

       • Do Not Pass: When you run a task in the EMR cluster, data access permissions are verified and managed based on the account authentication method that is specified when you register the EMR cluster.

       The method that is used to pass the proxy user information varies based on the type of the EMR task:

       • EMR Kyuubi tasks: The proxy user information is passed by using the hive.server2.proxy.user configuration item.

       • EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks: The proxy user information is passed by using the --proxy-user option.

     • Computing Resource Instance Name: The identifier of the computing resource. When a task is run, the system selects a computing resource for the task based on this name.

  3. Click OK.

What to do next

  • After you associate the specific types of computing resources that are described in the Parameter configuration for association of different types of computing resources section of this topic with a workspace, the system automatically associates the corresponding data catalogs with the workspace. You can also separately associate Data Lake Formation (DLF), MaxCompute, Hologres, or StarRocks data catalogs for visualized data query and management in new-version Data Studio.

  • After you associate a data catalog, you can go to Data Studio to view and manage tables in the data catalog.

  • After you associate a computing resource with a workspace, you can perform operations in the current workspace, such as data development, data analysis, and periodic task scheduling in Operation Center. For more information, see Data Studio (new version), DataAnalysis overview, and Getting started with Operation Center.