
DataWorks: Sync data

Last Updated: Dec 10, 2025

This tutorial describes how to use DataWorks Data Integration to synchronize data between heterogeneous data sources for a data warehouse. It uses two data sources as an example: the ods_user_info_d table in MySQL, which contains basic user information, and the user_log.txt file in OSS, which contains website access log data. You will use Data Integration offline sync tasks to synchronize data from these sources to the ods_user_info_d and ods_raw_log_d tables in MaxCompute.

Prerequisites

Ensure that you have prepared the required operating environment. For more information, see Prepare the environment.

1. Create data sources

To ensure that data can be processed in subsequent steps, you must add the following data sources to your DataWorks workspace to retrieve the raw data.

  • MySQL data source: This tutorial uses a data source named user_behavior_analysis_mysql to retrieve basic user information (ods_user_info_d) from MySQL.

  • HttpFile data source: This tutorial uses a data source named user_behavior_analysis_httpfile to retrieve the user website access logs (user_log.txt) stored in OSS.

Note

If you have already created MySQL and HttpFile data sources for other user persona analysis tutorials, you can skip this step.

Create a MySQL data source (user_behavior_analysis_mysql)

The basic user information for this tutorial is stored in a MySQL database. You must create a MySQL data source to synchronize the user information (ods_user_info_d) from the database to MaxCompute.

  1. Go to the Data Source page.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the navigation pane on the left, click Data Sources to go to the Data Sources page.

  2. Click Add Data Source. Search for and select MySQL as the data source type.

  3. On the Add MySQL Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.

    The following list describes the key parameters. You can keep the default values for the other parameters.

    • Data Source Name: Enter a name for the data source. For this tutorial, enter user_behavior_analysis_mysql.

    • Data Source Description: This data source is used for DataWorks tutorials. Read data from it when you configure an offline sync task to access the test data provided by the platform. It can be read only in Data Integration scenarios; other modules cannot use it.

    • Configuration Mode: Select Connection String Mode.

    • Connection Address: Host IP address rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com, port 3306.

    • Database Name: Enter the database name. For this tutorial, enter workshop.

    • Username: Enter the username. For this tutorial, enter workshop.

    • Password: Enter the password. For this tutorial, enter workshop#2017.

    • Authentication Method: Select No authentication.

  4. In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.

    Important
    • Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.

    • If no resource group is available, follow the prompts in the connection configuration section to purchase one and then associate the purchased resource group with the workspace.

  5. Click Complete Creation.
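
If you want to sanity-check the source data before you build the sync task, you can connect to the workshop database with any MySQL client and run a query such as the following. This is an optional sketch: the column names are assumptions based on the destination table DDL used later in this tutorial.

  -- Optional sanity check, run in any MySQL client against the workshop database.
  -- The column names are assumed from the destination DDL in this tutorial.
  SELECT uid, gender, age_range, zodiac
  FROM ods_user_info_d
  LIMIT 10;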

Create an HttpFile data source (user_behavior_analysis_httpfile)

The user website access logs for this tutorial are stored in OSS. You must create an HttpFile data source to synchronize the user website access log (user_log.txt) from OSS to MaxCompute.

  1. Click Data Sources in the navigation pane on the left.

  2. Click Add Data Source. In the Add Data Source dialog box, search for and select HttpFile as the data source type.

  3. On the Add HttpFile Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.

    The following list describes the key parameters. You can keep the default values for the other parameters.

    • Data Source Name: Enter the data source name. For this tutorial, enter user_behavior_analysis_httpfile.

    • Data Source Description: This data source is used for DataWorks tutorials. Read data from it when you configure an offline sync task to access the test data provided by the platform. It can be read only in Data Integration scenarios; other modules cannot use it.

    • URL: Enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com for both the development and production environments.

  4. In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.

    Important
    • Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.

    • If no resource group is available, follow the prompts in the connection configuration section to purchase one and then associate the purchased resource group with the workspace.

  5. Click Complete Creation.

2. Build a sync pipeline

In this step, you build a sync pipeline that synchronizes the user information and website access logs to their respective MaxCompute tables. This prepares the data for further processing.

  1. Click the icon in the upper-left corner and select All Products > Data Development And Task Operation > DataStudio. Then, at the top of the page, switch to the workspace that is created for this tutorial.

  2. In the navigation pane on the left, click Data Development to go to the Data Development page. Then, in the Workspace Directories area, click the create icon, select Create Workflow, and enter a workflow name. For this tutorial, the name is set to user_profile_analysis.

  3. On the workflow canvas, drag a Zero Load Node and two Batch Synchronization nodes from the pane on the left to the canvas, and name the nodes.

    The following list shows the example node names and their functions for this tutorial.

    • Zero Load Node (workshop_start): Manages the entire user persona analysis workflow and clarifies the data forwarding path. This node is a dry-run node and requires no code editing.

    • Batch Synchronization (ods_user_info_d): Synchronizes basic user information from MySQL to the ods_user_info_d table in MaxCompute.

    • Batch Synchronization (ods_raw_log_d): Synchronizes user website access records from OSS to the ods_raw_log_d table in MaxCompute.

  4. Drag connections to set the workshop_start node as the ancestor node of the two offline sync nodes.

  5. Configure the workflow scheduling properties.

    On the workflow canvas, click Scheduling in the right-side pane and configure the parameters. The following list describes the key parameters. You can keep the default values for the other parameters.

    • Scheduling Parameters: You can configure scheduling parameters for the entire workflow, and the inner nodes of the workflow can use them directly. In this tutorial, the parameter is set to bizdate=$[yyyymmdd-1] to obtain the date of the previous day (a worked SQL sketch follows this list).

    • Scheduling Cycle: Set to Day for this tutorial.

    • Scheduling Time: Set to 00:30, which means the workflow starts daily at 00:30.

    • Scheduling Dependencies: The workflow has no upstream dependencies, so you do not need to configure any. For easier management, click Use Workspace Root Node to attach the workflow to the root node of the workspace. The workspace root node is named in the format workspace_name_root.
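
To make the parameter concrete: $[yyyymmdd-1] resolves to the day before the run date, in yyyymmdd format. The following MaxCompute SQL is an illustrative sketch that reproduces the same calculation with built-in date functions; it is not how the scheduler itself evaluates the expression.

  -- Illustration: the previous day's date in yyyymmdd format, which is the
  -- value that the scheduling expression $[yyyymmdd-1] resolves to.
  SELECT TO_CHAR(DATEADD(GETDATE(), -1, 'dd'), 'yyyymmdd') AS bizdate;
  -- For a workflow run on 2025-02-24 at 00:30, this returns 20250223.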

3. Configure sync tasks

Configure the initial node

  1. On the workflow orchestration page, hover over the workshop_start node and click Open Node.

  2. In the right-side pane of the workshop_start node configuration page, click Scheduling and configure the required parameters. The following list describes the key parameters. You can keep the default values for the other parameters.

    • Scheduling Type: This tutorial uses dry-run scheduling.

    • Resource Group: For this tutorial, select the Serverless resource group created in the Prepare the environment step.

    • Scheduling Dependencies: Because workshop_start is the initial node and has no upstream dependencies, click Use Workspace Root Node to trigger the workflow execution from the workspace root node. The workspace root node is named workspace_name_root.

Configure the user data sync pipeline (ods_user_info_d)

  1. On the workflow orchestration page, hover over the ods_user_info_d node and click Open Node.

  2. Configure the network and resources for the sync pipeline.

    • Source:

      • Data source: MySQL.

      • Data source name: user_behavior_analysis_mysql.

    • Resource Group: Select the Serverless resource group that you purchased in the Prepare the environment step.

    • Destination:

      • Data Destination: MaxCompute(ODPS).

      • Data Source Name: Select the MaxCompute computing resource that you attached in the Prepare the environment step. For example, MaxCompute_Source.

  3. Click Next to configure the sync task.

    1. Configure Source and Destination

      The following list describes the key parameters. You can keep the default values for the other parameters.

      Data source:

      • Table: Select ods_user_info_d.

      • Split key: Set to the uid field, which is the primary key of the ods_user_info_d table in MySQL. Using the primary key as the split key lets the task divide the source data into even ranges for concurrent reads (a conceptual sketch appears after these configuration steps).

      Destination:

      • Tunnel Resource Group: This tutorial uses the Common transmission resources by default. If you have an exclusive Tunnel Quota, you can select it from the drop-down list. For more information about exclusive Tunnel Quotas, see Purchase and use an exclusive resource group for Data Transmission Service.

      • Table: Click Generate Destination Table Schema to quickly create a MaxCompute table. Paste the following DDL statement into the Statement for Creating Table section and click Create Table. This table receives the user data from the source.

        CREATE TABLE IF NOT EXISTS ods_user_info_d (
         uid STRING COMMENT 'User ID',
         gender STRING COMMENT 'Gender',
         age_range STRING COMMENT 'Age range',
         zodiac STRING COMMENT 'Zodiac sign'
        )
        PARTITIONED BY (
         dt STRING
        )
        LIFECYCLE 7;

      • Partition information: For this tutorial, enter ${bizdate}. This lets you assign a constant value to the bizdate parameter during the testing phase, and dynamically assign a value during scheduled execution. For information about the variable formats and configuration methods supported by DataStudio, see Scheduling Parameters.

    2. Confirm the Field Mapping and Channel Control.

      DataWorks synchronizes data from the specified source fields to the specified destination fields based on the configured field mapping. You can also set the concurrency and configure policies for dirty data. For this tutorial, set Policy for Dirty Data Records to Disallow Dirty Data Records. You can keep the default values for the other settings. For more information, see Configure a sync task using the codeless UI.

  4. Configure debug parameters.

    In the right-side pane of the offline sync task configuration page, click Debugging Configurations. Configure the following parameters, which are used when you test the run in step 4 (Sync data).

    • Resource Group: Select the Serverless resource group that you purchased in the Prepare the environment step.

    • Script Parameters: You do not need to configure this parameter. In this tutorial, the sample code uses ${bizdate} to represent the data timestamp. When you debug the workflow in step 4, set the Value Used in This Run to a specific constant, such as 20250223. The job run then uses this constant to replace the variable defined in the job.

  5. (Optional) Configure scheduling properties.

    For this tutorial, you can keep the default values for the scheduling parameters. In the right-side pane of the offline sync task configuration page, click Scheduling. For more information about the parameters, see Node scheduling.

    • Scheduling Parameters: These parameters are already configured for the workflow. Inner nodes do not require separate configuration and can use the parameters directly in tasks or code.

    • Scheduling Policies: You can use the Time for Delayed Execution parameter to specify how long a child node waits before it runs after the workflow starts. This parameter is not set in this tutorial.

  6. In the node toolbar, click Save.
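
For context on the split key configured above: when uid is used as the split key, Data Integration divides the source table into value ranges and reads them in parallel. The following MySQL queries are a conceptual sketch of what such shard reads look like; the range boundaries shown are hypothetical, and the exact SQL that the task issues may differ.

  -- Conceptual sketch of split-key sharding with a concurrency of 2.
  -- Data Integration derives the real boundaries from the uid value range.
  SELECT uid, gender, age_range, zodiac FROM ods_user_info_d
  WHERE uid >= '0000000000' AND uid < '5000000000'; -- shard 1 (hypothetical bounds)

  SELECT uid, gender, age_range, zodiac FROM ods_user_info_d
  WHERE uid >= '5000000000';                        -- shard 2 (hypothetical bounds)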

Configure the user log sync pipeline (ods_raw_log_d)

  1. On the workflow orchestration page, hover over the ods_raw_log_d node and click Open Node.

  2. Configure the network and resources for the sync pipeline.

    • Source:

      • Data source: HttpFile.

      • Data source name: user_behavior_analysis_httpfile.

    • Resource Group: Select the Serverless resource group that you purchased in the Prepare the environment step.

    • Destination:

      • Data Destination: MaxCompute(ODPS).

      • Data Source Name: Select the MaxCompute computing resource that you attached in the Prepare the environment step. In this example, the name is MaxCompute_Source.

  3. Click Next to configure the sync task.

    • Configure Source and Destination

      The following list describes the key parameters. You can keep the default values for the other parameters.

      Data source:

      • File path: For this tutorial, enter /user_log.txt.

      • Text type: This tutorial uses the text type.

      • Column Delimiter: For this tutorial, enter |.

      • Advanced configuration > Skip Header: For this tutorial, select No so that the header is not skipped.

      Important

      After you configure the data source, click Confirm Data Structure to check whether the log file can be read correctly.

      Destination:

      • Tunnel Resource Group: This tutorial uses the Common transmission resources by default. If you have an exclusive Tunnel Quota, you can select it from the drop-down list. For more information about exclusive Tunnel Quotas, see Purchase and use an exclusive resource group for Data Transmission Service.

      • Table: Click Generate Destination Table Schema to quickly create a MaxCompute table. Paste the following DDL statement into the Statement for Creating Table section and click Create Table. Each synchronized log line is written to the single col column (see the query sketch after these configuration steps).

        CREATE TABLE IF NOT EXISTS ods_raw_log_d
        (
            col STRING
        )
        PARTITIONED BY
        (
            dt STRING
        )
        LIFECYCLE 7;

      • Partition information: For this tutorial, enter ${bizdate}. This lets you assign a constant value to the bizdate parameter during the testing phase, and dynamically assign a value during scheduled execution. For information about the variable formats and configuration methods supported by DataStudio, see Scheduling Parameters.

    • Confirm the Field Mapping and Channel Control.

      DataWorks synchronizes data from the specified source fields to the specified destination fields based on the configured field mapping. You can also set the concurrency and configure policies for dirty data. For this tutorial, set Policy for Dirty Data Records to Disallow Dirty Data Records. You can keep the default values for the other settings. For more information, see Configure a sync task using the codeless UI.

  4. Configure debug parameters.

    In the right-side pane of the offline sync task configuration page, click Debugging Configurations. Configure the following parameters, which are used when you test the run in step 4 (Sync data).

    • Resource Group: Select the Serverless resource group that you purchased in the Prepare the environment step.

    • Script Parameters: You do not need to configure this parameter. In this tutorial, the sample code uses ${bizdate} to represent the data timestamp. When you debug the workflow in step 4, set the Value Used in This Run to a specific constant, such as 20250223. The job run then uses this constant to replace the variable defined in the job.

  5. (Optional) Configure scheduling properties.

    For this tutorial, you can keep the default values for the scheduling parameters. In the right-side pane of the offline sync task configuration page, click Scheduling. For more information about the parameters, see Node scheduling.

    • Scheduling Parameters: These parameters are already configured for the workflow. Inner nodes do not require separate configuration and can use the parameters directly in tasks or code.

    • Scheduling Policies: You can use the Time for Delayed Execution parameter to specify how long a child node waits before it runs after the workflow starts. This parameter is not set in this tutorial.

  6. In the node toolbar, click Save.
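
As a follow-up to the ods_raw_log_d configuration: because the destination table has a single STRING column, each synchronized log line lands as one value of col. After you run the sync in step 4, you can inspect a few raw rows with a query like the following; the dt value assumes the example debug constant 20250223 used in this tutorial.

  -- Inspect a few synchronized log lines. Replace the dt value with the
  -- data timestamp that you actually used for the run.
  SELECT col FROM ods_raw_log_d WHERE dt='20250223' LIMIT 5;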

4. Sync data

  1. Sync the data.

    In the workflow toolbar, click Run. Set the values for the parameter variables that are defined in each node for this run. This tutorial uses 20250223, but you can modify the value as needed. Then, click OK and wait for the run to complete.

  2. Query the data sync results.

    1. In the navigation pane on the left of DataStudio, click Data Development to go to the Data Development page. Then, in the personal folder area, create a file with the .sql extension. You can specify a custom file name.

    2. At the bottom of the page, confirm that the language mode is MaxCompute SQL.

    3. In the node editing window, enter the following SQL statement to view the number of records that are imported into ods_raw_log_d and ods_user_info_d. This confirms whether the synchronized data is written to the destination tables.

      -- You need to change the partition filter condition to the actual data timestamp for your current operation. In this tutorial, the debug parameter bizdate (data timestamp) was set to 20250223.
      SELECT count(*) FROM ods_user_info_d WHERE dt='your_data_timestamp'; 
      SELECT count(*) FROM ods_raw_log_d WHERE dt='your_data_timestamp';
      • If the queries return data, the data synchronization is complete.

      • If no data is returned, ensure that the Value Used in This Run that is configured for the workflow run matches the data timestamp specified by dt in the query. You can click the workflow, click Running History in the right-side pane, and then click View in the Actions column for a run record to confirm the data timestamp value (partition=[pt=xxx]) in the workflow run log.
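
If the count queries return no rows and you are unsure which partition was actually written, you can also list the partitions of each destination table. SHOW PARTITIONS is a MaxCompute statement; the partition values it prints should match the bizdate constant you used for the run.

  -- List the partitions that currently exist in each destination table.
  SHOW PARTITIONS ods_user_info_d;
  SHOW PARTITIONS ods_raw_log_d;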

Next steps

After you configure data synchronization, you can proceed to the next tutorial to learn how to process and analyze the synchronized data. For more information, see Process data.