
MaxCompute:Migrate data across MaxCompute projects by using DataWorks

Last Updated:Sep 19, 2023

This topic describes how to migrate data across MaxCompute projects in the same region by using DataWorks.

Prerequisites

All the steps in the tutorial DataWorks for MaxCompute Workshop are completed. For more information, see Workshop introduction.

Background information

This topic uses the WorkShop2023 workspace that is created in the tutorial DataWorks for MaxCompute Workshop as the source DataWorks workspace. The workspace is associated with the source MaxCompute project. You need to create a destination DataWorks workspace and associate it with a destination MaxCompute project. Then, you can migrate tables, resources, node configurations, and data across the projects by using DataWorks.

Procedure

  1. Create a destination workspace.

    Log on to the DataWorks console, create a workspace, and associate a MaxCompute engine with the workspace. For more information, see Create a workspace and Associate a MaxCompute compute engine with a workspace.

    Note

    The WorkShop2023 workspace is in standard mode. In this example, a destination workspace named clone_test_doc in standard mode is created in DataWorks.

  2. Clone node configurations and resources across workspaces.

    You can use the cross-workspace cloning feature of DataWorks to clone the node configurations and resources from the WorkShop2023 workspace to the clone_test_doc workspace. For more information, see Clone nodes across workspaces.

    Note
    • The cross-workspace cloning feature cannot clone table schemas or data.

    • The cross-workspace cloning feature cannot clone combined nodes. If the destination workspace needs to use the combined nodes that exist in the source workspace, you need to manually create the combined nodes in the destination workspace.

    1. Go to the DataStudio page for the WorkShop2023 workspace and click Cross-project cloning in the upper-right corner. The Create Clone Task page appears.

    2. Set Target Workspace to clone_test_doc and Workflow to Workshop. Select all the nodes in the workflow and click Add to List. Click To-Be-Cloned Node List in the upper-right corner.

    3. In the pane that appears, click Clone All. In the dialog box that appears, click Clone. The selected nodes are cloned to the clone_test_doc workspace.

    4. Go to the destination workspace and check whether the nodes are cloned.

  3. Create tables.

    The cross-workspace cloning feature cannot clone table schemas. Therefore, you need to manually create required tables in the destination workspace.

    • For non-partitioned tables, we recommend that you execute a statement in the following format to synchronize both the table schema and the data from the source project:

      create table <table_name> as select * from <source_project_name>.<source_table_name>;
    • For partitioned tables, we recommend that you execute a statement in the following format to re-create the table schema, because a CREATE TABLE ... AS statement does not carry over partition settings:

      create table <table_name> (<column_name> <column_type>, ...) partitioned by (<partition_key_column> string);

    After you create tables, commit the tables to the production environment. For more information about table creation, see Create and manage MaxCompute tables.
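
    If you need to migrate many tables, the DDL statements above can be generated programmatically. The following is a minimal Python sketch; the table and column names passed in the example calls are illustrative, not taken from the workshop schema:

```python
def ctas_ddl(table, source_project, source_table):
    """Build a CREATE TABLE ... AS statement for a non-partitioned table.

    Copies both the schema and the data from the source project.
    """
    return (f"create table {table} as "
            f"select * from {source_project}.{source_table};")


def partitioned_ddl(table, columns, partition_key):
    """Build a CREATE TABLE statement for a partitioned table.

    columns: list of (name, type) tuples. The partition key is declared
    separately because CREATE TABLE ... AS does not carry partition
    settings over.
    """
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    return (f"create table {table} ({cols}) "
            f"partitioned by ({partition_key} string);")


# Illustrative calls; the column list is hypothetical.
print(ctas_ddl("ods_raw_log_d", "WorkShop2023", "ods_raw_log_d"))
print(partitioned_ddl("ods_user_info_d",
                      [("uid", "string"), ("gender", "string")],
                      "dt"))
```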

  4. Synchronize data.

    The cross-workspace cloning feature cannot clone data from the source workspace to the destination workspace. You need to manually synchronize required data to the destination workspace. To synchronize the data of the ods_user_info_d table from the source workspace to the destination workspace, perform the following steps:

    1. Create a data source.

      1. Go to the Data Integration page and click Data Source in the left-side navigation pane.

      2. On the Data Sources page, click Create Data Source. In the Add data source dialog box, select MaxCompute.

      3. Configure the Data Source Name, ODPS project name, AccessKey ID, and AccessKey Secret parameters, and click Complete. For more information, see Add a MaxCompute data source.
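
      The fields in the dialog box map, roughly, to the following connection settings. This is a hedged Python sketch that only assembles them; the project name and endpoint are placeholders, and the AccessKey pair is read from environment variables so that secrets stay out of code:

```python
import os


def maxcompute_datasource(name, project, endpoint):
    """Assemble MaxCompute data source settings (mirrors the dialog box)."""
    return {
        "data_source_name": name,
        "odps_project_name": project,
        "endpoint": endpoint,
        # Read credentials from the environment instead of hard-coding them.
        "access_key_id": os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_ID", ""),
        "access_key_secret": os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET", ""),
    }


cfg = maxcompute_datasource(
    "WorkShop2023",                       # Data Source Name
    "workshop2023_project",               # ODPS project name (placeholder)
    "http://service.odps.aliyun.com/api"  # endpoint; varies by region
)
```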

    2. Create a batch synchronization node.

      For more information, see Configure a batch synchronization node by using the codeless UI.

      1. Go to the DataStudio page, click the Data Analytics tab, and then click Workshop under Business Flow. Right-click Data Integration and choose Create Node > Offline Synchronization to create an offline synchronization node.

      2. On the configuration tab of the offline synchronization node, configure the required parameters. In this example, set the Data Source parameter in the Source section to WorkShop2023 and the Data Source parameter in the Target section to odps_first. Set Table to ods_user_info_d. After the configuration is complete, click the Properties tab in the right-side navigation pane.

      3. Click Use Root Node in the Dependencies section and commit the offline synchronization node.

    3. Backfill data for the offline synchronization node.

      1. On the DataStudio page, click the DataWorks icon in the upper-left corner and choose All Products > Operation Center.

      2. On the page that appears, choose Cycle Task Maintenance > Cycle Task in the left-side navigation pane.

      3. On the page that appears, find the offline synchronization node you created in the node list and click the node name. On the canvas that appears on the right, right-click the offline synchronization node and choose Run > Backfill Data for Current Node.

      4. In the Backfill Data dialog box, configure the required parameters. In this example, set Data Timestamp to Jun 11, 2019 - Jun 17, 2019 to synchronize data from multiple partitions. Click OK.

        Note

        Configure the data timestamp based on your business requirements.

      5. In the left-side navigation pane, choose Cycle Task Maintenance > Patch Data. On the page that appears, check the running status of the data backfill instances that are generated. If Successful appears for a data backfill instance, the related data is synchronized.
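
      Backfilling over Jun 11, 2019 to Jun 17, 2019 generates one data backfill instance per data timestamp, and each instance writes one dt partition. The partition values that a date range covers can be enumerated with a short Python sketch:

```python
from datetime import date, timedelta


def bizdates(start, end):
    """List data timestamps (bizdate, yyyymmdd) from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(d)).strftime("%Y%m%d") for d in range(days + 1)]


print(bizdates(date(2019, 6, 11), date(2019, 6, 17)))
# -> ['20190611', '20190612', '20190613', '20190614',
#     '20190615', '20190616', '20190617']
```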

    4. Verify the data synchronization result.

      On the DataStudio page, choose Create > Create Node > MaxCompute > ODPS SQL to create an ODPS SQL node. On the configuration tab of the ODPS SQL node, execute the following SQL statement to check whether data is synchronized to the destination workspace:

      select * from ods_user_info_d where dt BETWEEN '20190611' and '20190617';
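
      A per-partition row count makes gaps easier to spot than scanning raw rows. The following Python sketch builds such a verification query for a partition range:

```python
def partition_count_sql(table, start_dt, end_dt):
    """Build a per-partition row-count query to verify a backfill range."""
    return (f"select dt, count(*) as cnt from {table} "
            f"where dt between '{start_dt}' and '{end_dt}' "
            f"group by dt order by dt;")


print(partition_count_sql("ods_user_info_d", "20190611", "20190617"))
```

      If a dt value within the range is missing from the query result, the backfill instance for that data timestamp did not write the partition.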