DataWorks: Add a MaxCompute data source

Last Updated: Feb 01, 2024

Before you can develop and manage MaxCompute tasks in DataWorks, you must add a MaxCompute project to the desired DataWorks workspace as a data source. This way, you can use the MaxCompute data source in different services of DataWorks and perform operations such as data synchronization, data development, and data analysis based on the MaxCompute data source.

Prerequisites

  • MaxCompute is activated. For more information, see Activate MaxCompute.

    Note

    We recommend that you create a MaxCompute project in the same region as the workspace to which you want to add a MaxCompute data source. If the regions are different, you can add only a cross-region data source to the workspace. The data source cannot be associated with DataStudio for data development or periodic task scheduling. The data source can be used only for data synchronization.

  • The required resource group is purchased and configured.

    After the MaxCompute data source is added, you can use the data source in scenarios such as data synchronization, development and scheduling of computing tasks, and generation of DataService Studio APIs. In these scenarios, a resource group for Data Integration, a resource group for scheduling, and a resource group for DataService Studio of DataWorks are separately required.

    You must purchase and configure the required resource group based on the use scenario of the MaxCompute data source and establish a network connection between the data source and resource group in advance. For information about resource groups provided by DataWorks and how to select a resource group, see Overview.

  • A DataWorks workspace is created, or the account that you use is added to the desired workspace as a member.

    You must add the desired MaxCompute project to the workspace as a data source. This way, you can use the data source to perform data development operations in the workspace. In addition, you must associate the purchased resource group with the workspace and establish a network connection between the resource group and data source. For information about how to create and manage a workspace, see Create and manage workspaces.

    Note

    You can add the same MaxCompute project to multiple workspaces as a data source.
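
The resource group prerequisite above can be sketched as a simple lookup. This is only an illustration: it assumes that the three resource group types correspond to the three scenarios in the order in which the text lists them, and the names `Scenario`, `REQUIRED_RESOURCE_GROUP`, and `required_resource_groups` are hypothetical, not part of any DataWorks API.

```python
from enum import Enum


class Scenario(Enum):
    """Use scenarios named in the prerequisites (enum names are illustrative)."""
    DATA_SYNCHRONIZATION = "data synchronization"
    TASK_DEVELOPMENT_AND_SCHEDULING = "development and scheduling of computing tasks"
    DATASERVICE_API_GENERATION = "generation of DataService Studio APIs"


# Assumed one-to-one mapping, following the order used in the prerequisite text.
REQUIRED_RESOURCE_GROUP = {
    Scenario.DATA_SYNCHRONIZATION: "resource group for Data Integration",
    Scenario.TASK_DEVELOPMENT_AND_SCHEDULING: "resource group for scheduling",
    Scenario.DATASERVICE_API_GENERATION: "resource group for DataService Studio",
}


def required_resource_groups(scenarios):
    """Return the sorted, de-duplicated resource group types for the given scenarios."""
    return sorted({REQUIRED_RESOURCE_GROUP[s] for s in scenarios})


if __name__ == "__main__":
    planned = [Scenario.DATA_SYNCHRONIZATION, Scenario.TASK_DEVELOPMENT_AND_SCHEDULING]
    for group in required_resource_groups(planned):
        print(group)
```

A lookup like this can serve as a checklist before purchase: list the scenarios you plan to use, then confirm that each returned resource group is associated with the workspace and can reach the data source.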

Limits

  • A MaxCompute data source can be associated with DataStudio, and therefore used for data development and periodic task scheduling, only if the MaxCompute project on which the data source is based resides in the same region and belongs to the same Alibaba Cloud account as the workspace.

  • You can add a MaxCompute project that does not belong to the current Alibaba Cloud account to a workspace within the current Alibaba Cloud account as a data source. After the data source is added, you can use only a RAM role to access the related MaxCompute project. MaxCompute data sources that are added across accounts cannot be used for data development or periodic task scheduling. For more information, see Scenario: Add a MaxCompute data source across accounts.

  • Only the Deploy and Workspace Administrator roles can be used to add data sources. For information about how to assign the roles to a member, see Add a RAM user to a workspace as a member and assign roles to the member.

    Note

    In addition to the permissions of the preceding workspace-level roles, you must also manage permissions on the MaxCompute side when you add a MaxCompute data source. You can do so by following the instructions shown in the DataWorks console. For more information, see the Permission description section in this topic.
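
The region and account limits above can be summarized as a small decision function. This is a minimal sketch of the rule as stated in this topic, not a DataWorks API; the `Location` type, the `usable_scenarios` function, and the sample region and UID values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Region and owning account of a MaxCompute project or a workspace."""
    region: str       # for example "cn-hangzhou" (illustrative value)
    account_uid: str  # UID of the owning Alibaba Cloud account (illustrative value)


def usable_scenarios(project: Location, workspace: Location) -> set:
    """Return the scenarios in which the data source can be used.

    Data synchronization is always available. Data development and periodic
    task scheduling (association with DataStudio) require the same region
    and the same Alibaba Cloud account.
    """
    scenarios = {"data synchronization"}
    same_region = project.region == workspace.region
    same_account = project.account_uid == workspace.account_uid
    if same_region and same_account:
        scenarios |= {"data development", "periodic task scheduling"}
    return scenarios


if __name__ == "__main__":
    workspace = Location("cn-hangzhou", "1001")
    print(sorted(usable_scenarios(Location("cn-hangzhou", "1001"), workspace)))
    print(sorted(usable_scenarios(Location("cn-beijing", "1001"), workspace)))
```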

Permission description

  • Use a RAM user or RAM role to add a MaxCompute data source.

  • Specify a RAM user or RAM role as the default access identity of a MaxCompute data source in the production environment.

    • If you want to set the default access identity of a MaxCompute data source in the production environment to an identity other than the current logon account, such as another Alibaba Cloud account or another role, the account or role must have the AdministratorAccess RAM policy attached. After the data source is added, the account or role is granted the permissions of the Role_Project_Scheduler role of the related MaxCompute project in the production environment. For information about how to configure the default access identity, see the "Add a data source" section in this topic.

    • The data of the MaxCompute data source added to the workspace in the production environment belongs to the default access identity that you specify for the MaxCompute data source when you add the data source in the production environment. If you want to use another account to access or perform operations on tables in the MaxCompute data source in the production environment, you must request the required permissions in Security Center. For more information, see Manage permissions on MaxCompute and Overview.

      Note

      You cannot perform fine-grained permission management on a workspace that is in basic mode. In this example, a MaxCompute data source is added to a workspace in standard mode.

Entry points for adding a data source

  1. Go to the Data Source page.

    1. Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the page that appears, click Data Source. The Data Source page appears.

  2. On the Data Source page, click Add Data Source. In the Add data source dialog box, click MaxCompute. On the Create a MaxCompute data source page, configure the parameters to add a MaxCompute data source.

    You can also go to the Data Source page in Data Integration to add a MaxCompute data source. You can add a data source only to the production environment on the Data Source page in Data Integration. After the data source is added, you must manage the data source on the Data Source page in SettingCenter. You can go to Data Integration to view the types of data sources that you can add in this service.

Add a data source

You can use one of the methods described in this section to add a MaxCompute data source to DataWorks.

Note

If you use a workspace in standard mode, you must add a data source separately in the development environment and production environment. For information about the workspace modes, see Differences between workspaces in basic mode and workspaces in standard mode.

Method 1: Add a MaxCompute data source to DataWorks based on an existing MaxCompute project

If you have a MaxCompute project, you can add a MaxCompute data source to the current workspace based on the MaxCompute project.

If you add a MaxCompute data source to DataWorks by using this method, you must make sure that the account you use is granted the odps:ListProjects permission and assigned the Super_Administrator role of the MaxCompute project.
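
As an illustration of the permission requirement above, the following sketch builds a RAM policy document that allows the `odps:ListProjects` action. The action name comes from this topic and the document layout follows the general RAM policy format (`Version`, `Statement`, `Effect`, `Action`, `Resource`). The helper name `build_ram_policy` is hypothetical, and the wildcard `Resource` is an assumption made for brevity; check the RAM documentation for the resource scope that fits your account. The Super_Administrator role is assigned separately in the MaxCompute project and is not covered by this policy.

```python
import json


def build_ram_policy(actions):
    """Build a RAM policy document that allows the given actions.

    Assumption: Resource is set to "*" for illustration only.
    """
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    policy = build_ram_policy(["odps:ListProjects"])
    # Print the JSON document so that it can be reviewed before it is used.
    print(json.dumps(policy, indent=2))
```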

Perform the following steps to add a MaxCompute data source by using this method.

  1. Configure the parameters in the Basic Configuration section.

    Parameter

    Description

    Data Source Name

    The name of the data source in DataWorks. The name must be unique within the current tenant.

    Authentication Method

    For a new data source, the value of this parameter is fixed as Alibaba Cloud account and Alibaba Cloud RAM role.

    Note

    For an existing data source that is added by using an AccessKey pair, we recommend that you change the value of this parameter to Alibaba Cloud account and Alibaba Cloud RAM role for the data source.

    Alibaba Cloud Account

    Specifies whether the MaxCompute project you want to use belongs to the current Alibaba Cloud account or another Alibaba Cloud account. Valid values:

    • Current Alibaba Cloud primary account: The MaxCompute project belongs to the current Alibaba Cloud account.

    • Other Alibaba Cloud primary accounts: The MaxCompute project belongs to another Alibaba Cloud account.

    The other parameters that you must configure vary based on the value of the Alibaba Cloud Account parameter. For more information about how to configure these parameters, see the following descriptions for Other items.

    Geography

    The region in which the MaxCompute project that you want to use resides.

    Note

    If the region that you selected is different from the region in which the workspace resides, you cannot reference the data source as a compute engine instance of the workspace after you add the data source. This means that the data source cannot be used in DataStudio or Operation Center and can be used only in Data Integration for data synchronization.

    Other items (Set the Alibaba Cloud Account parameter to Current Alibaba Cloud primary account)

    If you set the Alibaba Cloud Account parameter to Current Alibaba Cloud primary account, you must configure the following parameters:

    • MaxCompute Project Name: The name of the MaxCompute project that you want to add as a data source in the selected region.

      Note

      If you cannot select the desired MaxCompute project, assign the Super_Administrator role of the project to the current logon account. For more information, see the Permission description section in this topic.

    • Default Access Identity: The default access identity that is used to access the data source in the current workspace.

      • Development environment: The value of this parameter is fixed as Executor.

      • Production environment: The value of this parameter can be Alibaba Cloud primary account, Alibaba Cloud RAM sub-account, or Alibaba Cloud RAM role.

        Note

        • Only an Alibaba Cloud account, or a RAM user or RAM role to which the AdministratorAccess policy is attached, can select any access identity in the development environment and production environment.

        • Specify a RAM user or RAM role as the default access identity of a MaxCompute data source in the production environment.

          If you want to set the default access identity of a MaxCompute data source in the production environment to an identity other than the current logon account, such as another Alibaba Cloud account or another role, the account or role must have the AdministratorAccess RAM policy attached. After the data source is added, the account or role is granted the permissions of the Role_Project_Scheduler role of the related MaxCompute project in the production environment.

        • The data of the MaxCompute data source added to the workspace in the production environment belongs to the default access identity that you specify for the MaxCompute data source when you add the data source in the production environment. If you want to use another account to access or perform operations on tables in the MaxCompute data source in the production environment, you must request the required permissions in Security Center. For more information, see Manage permissions on MaxCompute and Overview.

          Note

          You cannot perform fine-grained permission management on a workspace that is in basic mode. In this example, a MaxCompute data source is added to a workspace in standard mode.

    Other items (Set the Alibaba Cloud Account parameter to Other Alibaba Cloud primary accounts)

    If you set the Alibaba Cloud Account parameter to Other Alibaba Cloud primary accounts, you must configure the following parameters:

    • Alibaba Cloud Primary Account UID: The UID of the Alibaba Cloud account to which the MaxCompute project you want to add as a data source belongs.

    • Opposite MaxCompute Project: The name of the MaxCompute project that you want to add to the current workspace as a data source.

    • Opposite RAM Role: The RAM role that you want to use to access the MaxCompute project. The RAM role that you select must meet the following conditions:

      • The RAM role is created within the Alibaba Cloud account that you selected.

      • The RAM role is assigned to the current logon account to allow DataWorks to access the MaxCompute project.

      • The RAM role is added to the MaxCompute project that you selected.

    Note
    • For information about how to add a MaxCompute data source across accounts, see Scenario: Add a MaxCompute data source across accounts.

    • If the MaxCompute project that you selected and the workspace belong to different Alibaba Cloud accounts, you cannot reference the data source as a compute engine instance of the workspace after you add the data source. This means that the data source cannot be used in DataStudio or Operation Center and can be used only in Data Integration for data synchronization.

    Endpoint

    The configuration method for the endpoints that DataWorks uses to access the MaxCompute project that you want to add as a MaxCompute data source. The endpoints include the endpoint of the MaxCompute service and the endpoint of the Tunnel service that you can use to upload and download local data or data of cloud data sources. The following configuration methods are supported:

    • Automatic adaption: DataWorks automatically matches endpoints based on actual situations. We recommend that you select this option.

      Note

      If the MaxCompute project that you selected and the workspace reside in different regions and you set the Endpoint parameter to Automatic adaption, DataWorks reads and downloads data over the public endpoint of the MaxCompute service by default.

    • Custom Configuration: If you select this option, you must manually configure the endpoint of the MaxCompute service and the endpoint of the Tunnel service. The endpoints vary based on the region that you selected. For more information, see Endpoints.

      Note

      You cannot change the endpoint of the Tunnel service for the default MaxCompute data source that is automatically generated when you associate a MaxCompute compute engine with a workspace for the first time.

  2. Test the network connectivity between the data source and a resource group.

    Resource groups are classified into resource groups for Data Integration, resource groups for scheduling, and resource groups for DataService Studio based on the use scenarios. For more information about different types of resource groups, see Overview.

    You can find the resource group that you want to use in the Connection Configuration section and test the network connectivity between the data source and resource group. If the network connectivity fails, tasks that use the data source cannot be run.

    Note

    After the data source is added to DataWorks, DataWorks adds the default access identity that you selected to the MaxCompute project on which the data source is based and grants that identity the related permissions on the project. Until this authorization completes, the network connectivity test may report a permission error. If this happens, wait a moment after you save the data source and then run the test again.
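
The conditional parameters in step 1 above can be sketched as a pre-flight check that you run on your own notes before you fill in the console form. This is a hypothetical model, not a DataWorks API: the `DataSourceForm` type, its field names, and the mode codes `"current"`, `"other"`, `"auto"`, and `"custom"` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Production-environment identity options as listed in the parameter description.
PROD_IDENTITIES = {
    "Alibaba Cloud primary account",
    "Alibaba Cloud RAM sub-account",
    "Alibaba Cloud RAM role",
}


@dataclass
class DataSourceForm:
    """Hypothetical stand-in for the Basic Configuration form."""
    data_source_name: str
    account_mode: str                            # "current" or "other" (illustrative codes)
    region: str
    project_name: str
    prod_access_identity: Optional[str] = None   # used when account_mode == "current"
    account_uid: Optional[str] = None            # used when account_mode == "other"
    ram_role: Optional[str] = None               # used when account_mode == "other"
    endpoint_mode: str = "auto"                  # "auto" or "custom" (illustrative codes)
    maxcompute_endpoint: Optional[str] = None
    tunnel_endpoint: Optional[str] = None


def validate(form: DataSourceForm) -> List[str]:
    """Return a list of problems; an empty list means the form is complete."""
    problems = []
    for field_name in ("data_source_name", "region", "project_name"):
        if not getattr(form, field_name):
            problems.append(f"{field_name} is required")

    if form.account_mode == "current":
        if form.prod_access_identity not in PROD_IDENTITIES:
            problems.append("prod_access_identity must be one of the listed identity types")
    elif form.account_mode == "other":
        if not form.account_uid:
            problems.append("account_uid is required for a project in another account")
        if not form.ram_role:
            problems.append("ram_role is required for a project in another account")
    else:
        problems.append(f"unknown account_mode: {form.account_mode!r}")

    if form.endpoint_mode == "custom":
        # Custom Configuration requires both endpoints to be entered manually.
        if not form.maxcompute_endpoint or not form.tunnel_endpoint:
            problems.append("custom endpoint mode needs both the MaxCompute and Tunnel endpoints")
    elif form.endpoint_mode != "auto":
        problems.append(f"unknown endpoint_mode: {form.endpoint_mode!r}")
    return problems


if __name__ == "__main__":
    form = DataSourceForm("sales_ds", "other", "cn-hangzhou", "sales_project", account_uid="1001")
    for problem in validate(form):
        print(problem)
```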

Method 2: Add a MaxCompute data source by creating a MaxCompute project

If you do not have MaxCompute projects, you can create a MaxCompute project and add the project to the current workspace as a data source.

If you use this method to add a MaxCompute data source, you must make sure that the account you use is granted the odps:CreateProject permission of MaxCompute. If you use a RAM user or RAM role to add a MaxCompute data source by creating a MaxCompute project, the RAM user or RAM role is automatically granted the permissions of the Super_Administrator role of the MaxCompute project after the data source is added.

Note

If you add a MaxCompute data source to a workspace that is in standard mode by creating a MaxCompute project, DataWorks automatically adds existing and new members in the workspace to the MaxCompute project in the development environment. In addition, the roles of the members in the workspace are mapped to the related roles of MaxCompute. For more information, see Appendix: Mappings between the built-in workspace-level roles of DataWorks and the roles of MaxCompute.

Perform the following steps to add a MaxCompute data source by using this method.

  1. Configure the parameters in the Basic Configuration section.

    Parameter

    Description

    Data Source Name

    The name of the data source in DataWorks. The name must be unique within the current tenant.

    Authentication Method

    For a new data source, the value of this parameter is fixed as Alibaba Cloud account and Alibaba Cloud RAM role.

    Alibaba Cloud Account

    The value of this parameter is fixed as Current Alibaba Cloud primary account.

    Geography

    The value of this parameter is fixed as the region in which the current workspace resides.

    Project Name

    The name of the MaxCompute project that you want to create. We recommend that you specify a name that conforms to the following requirements:

    • Production environment: Project name

    • Development environment: Project name_dev

    Computing Resource Payment Type

    The billing method of the MaxCompute project. Valid values: Pay-as-you-go and Subscription. For more information about the billing methods of MaxCompute, see Overview.

    Note

    You cannot add a MaxCompute project of the developer version to a workspace that is in standard mode as a data source.

    Default Quota

    The computing resource pool that is used by the MaxCompute project. For information about quotas required by MaxCompute, see Quota.

    Single SQL Consumption Limit

    The upper limit on the resources that a single SQL statement can consume. You can configure this parameter to prevent a single SQL statement from generating excessive fees.

    Data Type

    The data type edition of the MaxCompute project. Valid values: 2.0 data type (recommended), 1.0 data types (for users who already use 1.0 data types), and Hive compatible types (for Hive migration users). For more information about data type editions of MaxCompute, see Data type editions.

    Whether To Encrypt

    Specifies whether to encrypt data stored in the MaxCompute project by using Key Management Service (KMS). For more information, see Storage encryption.

    Default Access Identity

    The default access identity that is used to access the data source in the current workspace.

    • Development environment: The value of this parameter is fixed as Executor.

    • Production environment: The value of this parameter can be Alibaba Cloud primary account, Alibaba Cloud RAM sub-account, or Alibaba Cloud RAM role.

      Note

      • Only an Alibaba Cloud account, or a RAM user or RAM role to which the AdministratorAccess policy is attached, can select any access identity in the development environment and production environment.

      • Specify a RAM user or RAM role as the default access identity of a MaxCompute data source in the production environment.

        If you want to set the default access identity of a MaxCompute data source in the production environment to an identity other than the current logon account, such as another Alibaba Cloud account or another role, the account or role must have the AdministratorAccess RAM policy attached. After the data source is added, the account or role is granted the permissions of the Role_Project_Scheduler role of the related MaxCompute project in the production environment.

      • The data of the MaxCompute data source added to the workspace in the production environment belongs to the default access identity that you specify for the MaxCompute data source when you add the data source in the production environment. If you want to use another account to access or perform operations on tables in the MaxCompute data source in the production environment, you must request the required permissions in Security Center. For more information, see Manage permissions on MaxCompute and Overview.

        Note

        You cannot perform fine-grained permission management on a workspace that is in basic mode. In this example, a MaxCompute data source is added to a workspace in standard mode.

    Endpoint

    The configuration method for the endpoints that DataWorks uses to access the MaxCompute project that you want to add as a MaxCompute data source. The endpoints include the endpoint of the MaxCompute service and the endpoint of the Tunnel service that you can use to upload and download local data or data of cloud data sources. The following configuration methods are supported:

    • Automatic adaption: DataWorks automatically matches endpoints based on actual situations. We recommend that you select this option.

      Note

      If the MaxCompute project that you selected and the workspace reside in different regions and you set the Endpoint parameter to Automatic adaption, DataWorks reads and downloads data over the public endpoint of the MaxCompute service by default.

    • Custom Configuration: If you select this option, you must manually configure the endpoint of the MaxCompute service and the endpoint of the Tunnel service. The endpoints vary based on the region that you selected. For more information, see Endpoints.

      Note

      You cannot change the endpoint of the Tunnel service for the default MaxCompute data source that is automatically generated when you associate a MaxCompute compute engine with a workspace for the first time.

  2. Test the network connectivity between the data source and a resource group.

    Resource groups are classified into resource groups for Data Integration, resource groups for scheduling, and resource groups for DataService Studio based on the use scenarios. For more information about different types of resource groups, see Overview.

    You can find the resource group that you want to use in the Connection Configuration section and test the network connectivity between the data source and resource group. If the network connectivity fails, tasks that use the data source cannot be run.

    Note

    After the data source is added to DataWorks, DataWorks adds the default access identity that you selected to the MaxCompute project on which the data source is based and grants that identity the related permissions on the project. Until this authorization completes, the network connectivity test may report a permission error. If this happens, wait a moment after you save the data source and then run the test again.
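
If you script the connectivity check yourself, the wait described in the preceding note can be sketched as a bounded retry loop. Everything here is hypothetical: `run_test` stands for whatever callable performs your connectivity test, `PermissionNotReadyError` stands in for the permission error reported before authorization completes, and the attempt count and wait time are arbitrary defaults rather than documented values.

```python
import time


class PermissionNotReadyError(Exception):
    """Hypothetical error raised while the access identity is not yet authorized."""


def retry_connectivity_test(run_test, attempts=5, wait_seconds=10.0, sleep=time.sleep):
    """Call run_test until it succeeds or the attempts are used up.

    Only PermissionNotReadyError triggers a retry; any other exception
    propagates immediately. The last permission error is re-raised if
    every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return run_test()
        except PermissionNotReadyError:
            if attempt == attempts:
                raise
            sleep(wait_seconds)  # give the authorization time to complete


if __name__ == "__main__":
    state = {"calls": 0}

    def fake_test():
        # Simulates a test that fails twice before authorization completes.
        state["calls"] += 1
        if state["calls"] < 3:
            raise PermissionNotReadyError("authorization still in progress")
        return "connected"

    print(retry_connectivity_test(fake_test, wait_seconds=0.1))
```

Bounding the number of attempts matters here: a permission error that persists after the wait usually points to a real authorization problem rather than a delay, so the loop surfaces it instead of retrying forever.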

Subsequent operations

Before you start data development, we recommend that you read Usage notes for development of MaxCompute tasks in DataWorks. That topic covers the procedure for using MaxCompute in DataWorks, fees for data development with MaxCompute, environment preparation, and permission management.

After the data source is added, you can perform the following operations based on your business requirements:

  • Associate the data source with DataStudio:

    DataWorks DataStudio and Operation Center provide the capabilities of developing and scheduling MaxCompute tasks. If you want to develop MaxCompute tasks based on the MaxCompute data source or periodically schedule MaxCompute tasks, you must go to the DataStudio page in the DataWorks console and associate the MaxCompute data source with DataStudio.

    Note

    You can associate a MaxCompute data source with DataStudio only if the MaxCompute project based on which the data source is added resides in the same region and belongs to the same Alibaba Cloud account as the workspace to which the data source is added.

  • Perform data synchronization:

    DataWorks Data Integration provides MaxCompute Reader and MaxCompute Writer for you to read data from and write data to the MaxCompute data source. You can configure a batch or real-time synchronization task for the MaxCompute data source in DataStudio or configure a synchronization task for the MaxCompute data source in Data Integration based on your business requirements to perform data synchronization.

  • Manage the data source: You can go to the Data Source page in SettingCenter to perform management operations on the data source. For example, you can edit or delete the data source.