DataWorks: Configure and manage a real-time synchronization task

Last Updated: Apr 03, 2024

After you prepare data sources, network environments, and resources, you can create a real-time synchronization task to synchronize data to MaxCompute. This topic describes how to create a real-time synchronization task and view the status of the task.

Prerequisites

  1. The data sources that you want to use are prepared. Before you configure a data synchronization node, you must prepare the source and destination data sources so that you can select them during configuration. For information about the data source types, readers, and writers that are supported by real-time synchronization, see Data source types that support real-time synchronization.

    Note

    For information about the items that you need to understand before you prepare a data source, see Overview.

  2. An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.

  3. Network connections are established between the exclusive resource group for Data Integration and the data sources. For more information, see Establish a network connection between a resource group and a data source.

  4. The data source environments are prepared. You must create an account that can be used to access a database in the source and an account that can be used to access a database in the destination. You must also grant the accounts the permissions required to perform specific operations on the databases based on your configurations for data synchronization. For more information, see Overview.

Limits

  • You can use only exclusive resource groups for Data Integration to run real-time synchronization tasks.

  • You can run a real-time synchronization task to synchronize data only from PolarDB, Oracle, or MySQL to MaxCompute.

  • A real-time synchronization task cannot be used to synchronize data from a table that has no primary key.
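
The limits above can be checked before you configure a task. The following Python sketch is illustrative only: `TableInfo` and `check_limits` are hypothetical names, not part of any DataWorks API, and the table metadata is assumed to come from your own tooling.

```python
from dataclasses import dataclass, field
from typing import List

# Source types named in the limits above.
SUPPORTED_SOURCES = {"polardb", "oracle", "mysql"}


@dataclass
class TableInfo:
    """Hypothetical description of a source table."""
    name: str
    primary_key: List[str] = field(default_factory=list)


def check_limits(source_type: str, resource_group_kind: str,
                 tables: List[TableInfo]) -> List[str]:
    """Return human-readable limit violations (empty list if none)."""
    problems = []
    if resource_group_kind != "exclusive":
        problems.append("resource group must be an exclusive resource group")
    if source_type.lower() not in SUPPORTED_SOURCES:
        problems.append(f"unsupported source type: {source_type}")
    for table in tables:
        if not table.primary_key:
            problems.append(f"table {table.name} has no primary key")
    return problems


tables = [TableInfo("orders", ["id"]), TableInfo("logs")]
print(check_limits("MySQL", "exclusive", tables))
```

A table that is flagged for a missing primary key is a candidate for the Synchronized Primary Key setting that is described later in this topic.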

Precautions

  • If a real-time synchronization task that synchronizes data to MaxCompute uses a temporary AccessKey pair, the temporary AccessKey pair is valid for only seven days. After seven days, the temporary AccessKey pair automatically expires, and the real-time synchronization task fails. If DataWorks detects that the task failed because the temporary AccessKey pair expired, the system restarts the task. If a related alert rule is configured for the real-time synchronization task, an alert is reported.

  • Data Integration uses the channels that are provided by MaxCompute to upload and download data. You can select a channel based on your business requirements. For more information about the types of channels that are provided by MaxCompute, see Data upload scenarios and tools.
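
To anticipate the restart that is caused by an expired temporary AccessKey pair, you can estimate the expiration time yourself. The following Python sketch assumes that you know when the temporary AccessKey pair was issued. The `issued_at` value and both function names are hypothetical, and treating the exact seven-day mark as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Validity period stated above for a temporary AccessKey pair.
TEMP_KEY_VALIDITY = timedelta(days=7)


def key_expiry(issued_at: datetime) -> datetime:
    """Return the moment at which the temporary AccessKey pair expires."""
    return issued_at + TEMP_KEY_VALIDITY


def is_expired(issued_at: datetime, now: datetime) -> bool:
    """Return True once the validity period has fully elapsed (assumed boundary)."""
    return now >= key_expiry(issued_at)


issued_at = datetime(2024, 4, 3, 8, 0, tzinfo=timezone.utc)  # example value
print(key_expiry(issued_at).isoformat())
```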

Create a real-time synchronization task

  1. Create a real-time synchronization task to synchronize all data in a database.

  2. Configure a resource group.

  3. Configure the source and synchronization rules.

    1. In the Data Source section of the Configure Source and Synchronization Rules step, configure the Type and Data source parameters.

    2. Select the tables from which you want to read data.

      In the Source Table section, all tables in the selected data source are displayed in the Source Table list. Select all or some of the tables in the list, and then click the move icon to add them to the Selected Source Table list.

      Important

      If a selected table does not have a primary key, the table cannot be synchronized in real time.

    3. In the Set Mapping Rules for Table/Database Names section, click Add rule, select a rule type, and then add a rule of the selected type.

      By default, data in a source table is written to the destination table that has the same name. To write data to destination tables that have different names, configure mapping rules. A mapping rule can do the following:

      • Use a regular expression to specify the names of destination tables.

      • Concatenate built-in variables to specify the names of destination tables.

      • Synchronize data from multiple source tables to the same destination table.

      • Synchronize data from source tables whose names start with a specified prefix to destination tables whose names start with another specified prefix.

      For more information about the configuration logic, see Select the tables from which you want to read data and configure mapping rules.
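
The effect of a regular-expression rule, a many-to-one rule, and a prefix rule on table names can be sketched in Python. This is not the rule syntax that Data Integration uses. The function names, table names, and patterns are hypothetical and only show how source names could resolve to destination names.

```python
import re


def map_by_regex(source_table: str, pattern: str, replacement: str) -> str:
    """Rename a table with a regular expression; unmatched names are kept."""
    return re.sub(pattern, replacement, source_table)


def map_by_prefix(source_table: str, old_prefix: str, new_prefix: str) -> str:
    """Swap a leading prefix; names without the prefix are kept."""
    if source_table.startswith(old_prefix):
        return new_prefix + source_table[len(old_prefix):]
    return source_table


sources = ["order_01", "order_02", "doc_users"]

# Many-to-one: every sharded order_NN table maps to one destination table.
merged = {t: map_by_regex(t, r"^order_\d+$", "order_all") for t in sources}

# Prefix-to-prefix: doc_* tables map to pre_* tables.
renamed = {t: map_by_prefix(t, "doc_", "pre_") for t in sources}

print(merged)
print(renamed)
```

Tables that match no rule keep their names, which mirrors the default behavior described above.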

  4. Configure the destination tables.

    1. In the Set Destination Table step, configure the Destination and Write Mode parameters.

    2. Configure the Automatic Partitioning by Time parameter.

      You can click the Edit icon to specify the names of partition key columns based on your business requirements.

    3. Refresh mappings between source tables and destination MaxCompute tables.

      Click Refresh Source table and MaxCompute table mapping to map the source tables to destination MaxCompute tables based on the mapping rules that you configured in the Set Mapping Rules for Table/Database Names section. If no mapping rule is configured in the Set Mapping Rules for Table/Database Names section, data in the source tables is written to the MaxCompute tables that are named the same as the source tables. If no such MaxCompute tables exist in the destination, the system automatically creates the tables in the destination. You can also modify the table generation method of a destination MaxCompute table and add additional fields to a destination MaxCompute table.

      Note

      The names of destination tables are automatically generated based on the mapping rules that you configured in the Set Mapping Rules for Table/Database Names section.

      Operation

      Description

      Select a primary key for a source table

      By default, the synchronization task cannot synchronize data from a source table that does not have a primary key. To synchronize data from such a table, click the Edit icon in the Synchronized Primary Key column of the source table and specify a primary key. You can use a single field or a combination of multiple fields in the source table as the primary key. The system removes duplicate data based on the primary key during data synchronization.
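
Deduplication on a single-field or composite primary key can be illustrated with a short Python sketch. How Data Integration deduplicates internally is not described in this topic. The `deduplicate` function is hypothetical, and keeping the most recent record for each key is an assumption.

```python
from typing import Dict, List


def deduplicate(records: List[Dict], key_fields: List[str]) -> List[Dict]:
    """Keep one record per primary-key value.

    Assumption: a later record replaces an earlier record with the same key.
    """
    latest = {}
    for record in records:
        key = tuple(record[name] for name in key_fields)
        latest[key] = record
    return list(latest.values())


records = [
    {"region": "cn", "id": 1, "status": "created"},
    {"region": "cn", "id": 2, "status": "created"},
    {"region": "cn", "id": 1, "status": "paid"},
]

# A combination of two fields is used as the primary key.
print(deduplicate(records, ["region", "id"]))
```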

      Select a table generation method

      You can select Create Table or Use Existing Table from the drop-down list in the Table creation method column.

      • If you select Use Existing Table from the drop-down list in the Table creation method column, all existing MaxCompute tables are displayed in the drop-down list in the MaxComputeTable Name column. You must select the name of the table that you want to use from the drop-down list.

      • If you select Create Table from the drop-down list in the Table creation method column, the name of the table that is automatically created appears in the MaxComputeTable Name column. You can click the table name to view and modify the table creation statement.

      Add additional fields to a destination table and assign values to the fields

      You can click Edit additional fields in the Actions column of a destination table to add additional fields to the table and assign values to the fields. You can assign constants and variables to the additional fields as values.

      Note

      You can add additional fields to a destination table only if you select Create Table from the drop-down list in the Table creation method column of the table.

      Data Integration allows you to assign the following variables to additional fields as values:

      EXECUTE_TIME: the time that was consumed to execute the SQL statement
      UPDATE_TIME: the update time
      DB_NAME_SRC: the name of the original database
      DB_NAME_SRC_TRANSED: the converted name of the database
      DATASOURCE_NAME_SRC: the name of the source
      DATASOURCE_NAME_DEST: the name of the destination
      DB_NAME_DEST: the name of the destination database
      TABLE_NAME_DEST: the name of the destination table
      TABLE_NAME_SRC: the name of the source table
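
Assigning constants and variables to additional fields can be pictured as enriching each record before it is written. The following Python sketch is illustrative only: the `add_fields` function, the `${NAME}` placeholder syntax, and the sample values are assumptions. Only the variable names come from the list above.

```python
from typing import Any, Dict


def add_fields(record: Dict[str, Any], assignments: Dict[str, Any],
               context: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with additional fields appended.

    A value written as "${NAME}" is looked up in context (assumed syntax);
    any other value is treated as a constant.
    """
    enriched = dict(record)
    for field_name, value in assignments.items():
        if isinstance(value, str) and value.startswith("${") and value.endswith("}"):
            enriched[field_name] = context[value[2:-1]]
        else:
            enriched[field_name] = value
    return enriched


# Hypothetical variable values for one synchronized record.
context = {"DB_NAME_SRC": "shop", "TABLE_NAME_SRC": "orders"}
assignments = {
    "src_db": "${DB_NAME_SRC}",
    "src_table": "${TABLE_NAME_SRC}",
    "env": "prod",  # constant
}

print(add_fields({"id": 1}, assignments, context))
```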

      Modify the schema of a destination table

      By default, a MaxCompute table that is automatically created has a lifecycle of 30 days, and field type conversion may occur. For example, if a field in a destination table has a different data type from the corresponding field in the source table, the synchronization task converts the source field to a data type that can be written to the destination table. You can click the name of a destination MaxCompute table in the MaxComputeTable Name column to modify the lifecycle of the MaxCompute table or the field types of the table.

    4. Click Next.

      If you select Create Table from the drop-down list in the Table creation method column of a destination MaxCompute table, you must click Start Creating Table in the Create tables automatically dialog box to create multiple destination MaxCompute tables at a time.

  5. Configure rules for processing DDL messages.

    DDL operations may be performed on the source during data synchronization. Data Integration provides default rules to process the resulting DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Rules for processing DDL messages.

  6. Configure the resources required to run the real-time synchronization task.

    1. In the Configure Resource step, configure the parameters.

      Parameter

      Description

      Allow Dirty Data Records

      Specifies whether to allow dirty data records. If you turn on the switch, the synchronization task can continue to run even if dirty data records are generated during data synchronization. You can view the generated dirty data records in the logs of the synchronization task. If you turn off the switch, the synchronization task fails if a dirty data record is found.

      Maximum Number Of Parallel Threads Allowed For Destination

      The maximum number of parallel threads that the synchronization task uses to read data from the source table or write data to the destination table. Maximum value: 32. Specify a value for this parameter based on the specifications of your resource group and the data write capabilities of the destination.

    2. Click Complete.
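
The effect of the two parameters above can be sketched in Python. This is not how Data Integration is implemented. `validate_threads`, `write_records`, and `DirtyDataError` are hypothetical names, the lower bound of 1 is an assumption, and a conversion failure stands in for a dirty data record.

```python
from typing import Any, Callable, Iterable, List, Tuple

MAX_PARALLEL_THREADS = 32  # maximum value stated above


class DirtyDataError(Exception):
    """Raised when a dirty record is found and dirty data is not allowed."""


def validate_threads(threads: int) -> int:
    """Reject thread counts outside 1..32 (the lower bound is an assumption)."""
    if not 1 <= threads <= MAX_PARALLEL_THREADS:
        raise ValueError(f"threads must be between 1 and {MAX_PARALLEL_THREADS}")
    return threads


def write_records(records: Iterable[Any], convert: Callable[[Any], Any],
                  allow_dirty: bool) -> Tuple[List[Any], List[Any]]:
    """Convert records for writing and return (written, dirty)."""
    written, dirty = [], []
    for record in records:
        try:
            written.append(convert(record))
        except (ValueError, TypeError) as exc:
            if not allow_dirty:
                # Switch turned off: the task fails on the first dirty record.
                raise DirtyDataError(f"dirty record {record!r}: {exc}") from exc
            # Switch turned on: record the dirty data and continue.
            dirty.append(record)
    return written, dirty


print(write_records(["1", "x", "3"], int, allow_dirty=True))
```

With `allow_dirty=False`, the same input raises `DirtyDataError` at the record `"x"`, which corresponds to the task failing when the switch is turned off.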

Commit and deploy the real-time synchronization task

Commit and deploy the node.

  1. Click the Save icon in the top toolbar to save the node.

  2. Click the Submit icon in the top toolbar to commit the node.

  3. In the Commit Node dialog box, configure the Change description parameter.

  4. Click OK.

If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.

What to do next

After the real-time synchronization node is configured, you can start and manage the node on the Real Time DI page in Operation Center. To go to the Real Time DI page, log on to the DataWorks console and go to the Operation Center page. Then, in the left-side navigation pane, choose RealTime Task > RealTime DI. For more information, see O&M for real-time synchronization nodes.