How to use Data Integration for real-time data lake ingestion - DataWorks

Data Integration supports real-time synchronization of all data from databases such as MySQL and PolarDB to OSS. This topic describes how to use Data Integration to synchronize data from MySQL to a data lake in OSS in real time.

Prerequisites

A Serverless resource group or an exclusive resource group for Data Integration is purchased.
A MySQL data source and an OSS data source are created. For more information, see Create a data source for Data Integration.
Note
The binary logging feature must be enabled for the MySQL data source. For more information, see Prepare a MySQL environment.
Network connectivity between the resource group and data sources is established. For more information, see Network connectivity solutions.
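As a rough pre-flight check for the binary logging prerequisite, the following sketch inspects settings that you could obtain by running SHOW VARIABLES on the MySQL instance. The helper name is hypothetical, and the expected values for binlog_format and binlog_row_image are assumptions that this topic does not state. Treat Prepare a MySQL environment as the authoritative list of requirements.

```python
# Hypothetical helper: compares MySQL variables (as returned by SHOW VARIABLES)
# with settings commonly required for change data capture.
# Only "binary logging enabled" (log_bin = ON) is stated in this topic;
# the other two expected values are assumptions.
EXPECTED = {
    "log_bin": "ON",
    "binlog_format": "ROW",      # assumption
    "binlog_row_image": "FULL",  # assumption
}


def find_binlog_problems(variables):
    """Return a list of readable problems; an empty list means all checks pass."""
    problems = []
    for name, expected in EXPECTED.items():
        actual = str(variables.get(name, "")).upper()
        if actual != expected:
            problems.append(f"{name} is {actual or 'unset'}, expected {expected}")
    return problems


if __name__ == "__main__":
    sample = {"log_bin": "ON", "binlog_format": "MIXED", "binlog_row_image": "FULL"}
    for problem in find_binlog_problems(sample):
        print(problem)
```

The check only reads a dictionary, so it can be fed from any MySQL client that you already use.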

Limitations

Values in the primary key columns in the source cannot be NULL or empty strings. Otherwise, an error may be reported when the related synchronization task is run.
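To illustrate this limitation, the following minimal sketch scans sample rows for primary key values that are NULL or empty strings before a task is started. The function name and the row layout (a list of dictionaries) are assumptions made for illustration and are not part of Data Integration.

```python
def find_invalid_primary_keys(rows, key_columns):
    """Return the indexes of rows whose primary key contains NULL or ''."""
    invalid = []
    for index, row in enumerate(rows):
        # A missing column is treated like NULL (None).
        if any(row.get(col) is None or row.get(col) == "" for col in key_columns):
            invalid.append(index)
    return invalid


if __name__ == "__main__":
    sample_rows = [
        {"id": 1, "region": "cn"},
        {"id": None, "region": "cn"},  # NULL key value
        {"id": 3, "region": ""},       # empty-string key value
    ]
    print(find_invalid_primary_keys(sample_rows, ["id", "region"]))
```

Rows reported by such a check need to be corrected in the source before synchronization, because the task may report an error when it reads them.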

Capabilities supported by real-time synchronization of all data from a database to a data lake in OSS

Supports the synchronization of data structures in the MySQL data source, the one-time synchronization of historical data in all or specific tables in a database in the MySQL data source, and the real-time synchronization of incremental data in the MySQL data source to a data lake in OSS.
Supports the synchronization of data changes generated by table creation and column addition DDL operations. During synchronization, destination tables and their columns are adjusted automatically to match the tables that are created and the columns that are added in the source.
Important
Data changes generated by DDL operations such as table deletion and table renaming cannot be synchronized. If these DDL operations are performed on source tables, the related synchronization task fails.
Supports the automatic creation of metadatabases and metatables in Data Lake Formation (DLF) if DLF is activated in the same region as your workspace. The system automatically creates metadatabases and metatables when data is stored to a data lake in OSS.
Note
Cross-region metadatabase creation is not supported.
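The DDL behavior described above can be summarized in a small lookup. This is only a sketch of what this topic states: the operation labels and outcome strings are hypothetical, and DDL operations that the topic does not mention are deliberately left unclassified.

```python
# Outcomes as described in this topic. Labels are illustrative, not product terms.
SYNCHRONIZED = {"CREATE TABLE", "ADD COLUMN"}
FAILS_TASK = {"DROP TABLE", "RENAME TABLE"}


def ddl_outcome(operation):
    """Classify a DDL operation label by the behavior this topic describes."""
    normalized = " ".join(operation.upper().split())
    if normalized in SYNCHRONIZED:
        return "synchronized to the destination"
    if normalized in FAILS_TASK:
        return "not synchronized; the task fails"
    return "not covered by this topic"


if __name__ == "__main__":
    for op in ("create table", "add column", "drop table", "rename table", "drop column"):
        print(f"{op}: {ddl_outcome(op)}")
```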

Procedure

Step 1: Select a synchronization task type

Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
In the left-side navigation pane, click Sync Task, and then click Create Sync Task at the top of the page to go to the sync task creation page. Configure the following basic information:
- Source And Destination: MySQL→OSS
- Task Name: Customize a name for the synchronization task.
- Synchronization Type: Real-time Database Synchronization.
- Synchronization Steps: Select Full Synchronization and Incremental Synchronization.

Step 2: Configure network and resources

In the Network And Resource Configuration section, select the Resource Group for the synchronization task. You can allocate the number of CUs for Task Resource Usage.
For Source Data Source, select the added MySQL data source. For Destination Data Source, select the added OSS data source, and then click Test Connectivity.
After ensuring that both the source and destination data sources are successfully connected, click Next.

Step 3: Configure basic settings for the destination

Write Format: Supports three write formats: Hudi, Paimon, and Iceberg.
Storage Path Selection: Select the OSS path in which you want to store the synchronized data.
Location For Creating Metadatabase: You can choose whether to enable the system to create a metadatabase in DLF.
Note
Metadatabases can be automatically created only in DLF that is activated in the same region as your workspace.

Step 4: Select the tables to synchronize

In this step, you can select the tables from which you want to synchronize data in the Source Table list and click the icon to move the selected tables to the Selected Tables list.

Select Specific Tables:
- In the Database Filter and Table Filter fields of the Source Tables section, enter keywords from database or table names to find the databases and tables that you want to synchronize. Select all the databases and tables that you want to synchronize, and click the icon to move them to the Selected Tables section.
- In the Database Filter and Table Filter fields of the Selected Tables section, enter keywords from database or table names to find the databases and tables that you do not want to synchronize. Select all the databases and tables that you do not want to synchronize, and click the icon to move them back to the Source Tables section.
Use Regular Expressions To Select Tables (Supports Adding/Removing Tables By Regular Expressions During Runtime):
Filter table information through regular expressions configured in Database Filter and Table Filter. Click Confirm Selection to select the database and table data you need to synchronize.
Note
For example, to select tables whose database name starts with a and whose table name starts with order, enter a.* in the Database Filter field and order.* in the Table Filter field.
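To preview which tables a pair of patterns such as a.* and order.* would select, you can test the expressions locally. The following sketch assumes that a pattern must match the whole name, which this topic does not confirm. The function name and the sample database and table names are hypothetical.

```python
import re


def select_tables(candidates, database_pattern, table_pattern):
    """Keep (database, table) pairs whose names fully match both patterns."""
    database_re = re.compile(database_pattern)
    table_re = re.compile(table_pattern)
    return [
        (database, table)
        for database, table in candidates
        if database_re.fullmatch(database) and table_re.fullmatch(table)
    ]


if __name__ == "__main__":
    candidates = [
        ("app_db", "order_2024"),
        ("app_db", "customer"),
        ("billing", "order_2024"),
        ("a", "orders"),
    ]
    # Database names that start with "a", table names that start with "order".
    print(select_tables(candidates, r"a.*", r"order.*"))
    # prints [('app_db', 'order_2024'), ('a', 'orders')]
```

Testing patterns this way is useful because a regular expression rule can also pick up tables that are added to the source while the task is running.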

Step 5: Configure destination table mapping

After you select the tables from which you want to synchronize data, the selected tables are automatically displayed in the Mapping Rules for Destination Tables section, and the properties of the destination tables are waiting to be mapped. You must define the mappings between source tables and destination tables to determine where data is read from and written to, and then click Refresh in the Actions column. You can refresh the mappings directly, or refresh them after you configure the settings related to destination tables.

Note

You can select the tables to be synchronized and click Batch Refresh Mapping. When no mapping rules are configured, the default table name rule is ${tableName}. If a table with the same name does not exist in the destination, a new table will be automatically created.
You can click the Configure button in the Custom Destination Database Name Mapping column to customize the destination database name rules.
You can use built-in variables and manually entered strings to form the final destination database name. You can edit the built-in variables. For example, you can create a new database name rule to add a suffix to the source database name as the destination database name.
You can click Edit in the Customize Mapping Rules for Destination Table Names column to configure mapping rules for destination table names based on your business requirements.
You can concatenate a built-in variable and a specific string into a destination table name. You can edit built-in variables. For example, when you create a mapping rule, you can add a suffix to a variable that indicates a source table name to form a destination table name.
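The naming rules above amount to template substitution. The following sketch uses Python's string.Template, whose ${name} placeholder syntax happens to match the default rule ${tableName}. The function name and the _ods suffix are hypothetical, and the sketch only imitates the rule. It does not call Data Integration.

```python
from string import Template

# Default rule from this topic: the destination table keeps the source name.
DEFAULT_RULE = "${tableName}"


def destination_table_name(source_table, rule=DEFAULT_RULE):
    """Expand a naming rule by substituting the source table name."""
    return Template(rule).substitute(tableName=source_table)


if __name__ == "__main__":
    print(destination_table_name("orders"))                      # default rule
    print(destination_table_name("orders", "${tableName}_ods"))  # rule with a suffix
```

With the default rule, orders maps to orders. With the suffix rule, it maps to orders_ods.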

a. Modify data type mappings for fields

Default mappings exist between data types of source fields and data types of destination fields. You can click Edit Mapping of Field Data Types in the upper-right corner of the Mapping Rules for Destination Tables section to configure data type mappings between source fields and destination fields based on your business requirements. After the configuration is complete, click Apply and Refresh Mapping.

b. Modify the schema of a destination table to add fields to the table and assign values to the fields

If a destination table is in the To Be Created state, you can perform the following steps to add fields to the table and assign values to the fields:

Add fields to one or more destination tables.
- Add fields to a single destination table: Find the destination table to which you want to add fields and click the icon in the Destination Table Name column. In the dialog box that appears, add fields.
- Add fields to multiple destination tables at a time: Select the destination tables to which you want to add fields at a time, click Batch Modify in the lower part of the page, and then click Destination Table Schema - Batch Modify and Add Field.
Assign values to the fields. You can perform one of the following operations to assign values to the fields:
- Assign values to the fields that are added to a single destination table: Find the destination table in which you want to assign values to newly added fields and click Configure in the Value assignment column. In the Additional Field dialog box, assign values to the fields.
- Assign values to the fields that are added to multiple destination tables at a time: Select the destination tables in which you want to assign values to newly added fields, click Batch Modify in the lower part of the page, and then click Value assignment to assign values to the same fields in the selected destination tables at a time.
Note
You can click the icon to switch the value assignment method and assign constants and variables to the fields that are added to a destination table.

c. Configure DML processing rules

Data Integration provides default DML processing rules. You can also configure DML processing rules for destination tables based on your business requirements.

Configure DML processing rules for a single destination table: Find the destination table for which you want to configure DML processing rules and click Configure in the Configure DML Rule column to configure DML processing rules for the table.
Configure DML processing rules for multiple destination tables at a time: Select the destination tables for which you want to configure DML processing rules, click Batch Modify in the lower part of the page, and then click Configure DML Rule.

Step 6: Configure alerts

To prevent a failed synchronization task from delaying business data synchronization, you can configure alert rules for the synchronization task.

In the upper-right corner of the page, click Configure Alert Rule to go to the Alert Rule Configurations for Real-time Synchronization Subnode panel.
In the Configure Alert Rule panel, click Add Alert Rule. In the Add Alert Rule dialog box, configure the parameters of the alert rule.
Note
The alert rules that you configure in this step take effect for the real-time synchronization subtask that will be generated by the synchronization task. After the configuration of the synchronization task is complete, you can refer to Run and manage real-time synchronization tasks to go to the Real-time Synchronization Task page and modify alert rules configured for the real-time synchronization subtask.
Manage alert rules.
You can enable or disable alert rules that are created. You can also specify different alert recipients based on the severity levels of alerts.

Step 7: Configure advanced parameters

You can change the values of specific parameters configured for the synchronization task based on your business requirements. For example, you can specify an appropriate value for the Maximum read connections parameter to prevent the synchronization task from putting excessive pressure on the source database and affecting production workloads.

Note

To prevent unexpected errors or data quality issues, we recommend that you understand the meanings of the parameters before you change the values of the parameters.

In the upper-right corner of the configuration page, click Configure Advanced Parameters.
In the Configure Advanced Parameters panel, change the values of the desired parameters.

Step 8: Configure DDL capabilities

DDL operations may be performed on the source. You can click Configure DDL Capability in the upper-right corner of the page to configure rules to process DDL messages from the source based on your business requirements.

Note

For more information, see Configure rules to process DDL messages.

Step 9: Configure resource groups

You can click Configure Resource Group in the upper-right corner of the page to view and change the resource groups that are used to run the current synchronization task.

Step 10: Run the synchronization task

After the configuration of the synchronization task is complete, click Complete in the lower part of the page.
In the Synchronization Task section of the Data Integration page, find the created synchronization task and click Start in the Operation column.
Click the Name/ID of the synchronization task in the Tasks section and view the detailed running process of the synchronization task.

Synchronization task operations and maintenance

View task running status

After creating a synchronization task, you can view the list of currently created synchronization tasks and the basic information of each synchronization task on the synchronization task page.

You can Start or Stop a synchronization task in the Actions column, and you can Edit, View, and perform other operations on the synchronization task from the More menu.
For started tasks, you can see the basic running status in the Execution Overview, and you can also click the corresponding overview area to view execution details.
The real-time synchronization task from MySQL to OSS consists of three steps:
- Structure migration: Includes the creation method of the destination table (existing table/automatic table creation). If it is automatic table creation, DDL statements will be displayed.
- Full initialization: Includes offline synchronization table information, synchronization progress, and the number of records written.
- Real-time data synchronization: Includes real-time synchronization statistics information (real-time progress, DDL records, DML records, and alert information).

Rerun the synchronization task

In some special cases, if you add tables to or remove tables from the source, or change the schema or name of a destination table, you can click More in the Operation column of the synchronization task and then click Rerun to rerun the task after the change. During the rerun process, the synchronization task synchronizes data only from the newly added tables to the destination or only from the mapped source table to the destination table whose schema or name is changed.

If you want to rerun the synchronization task without modifying the configuration of the task, click More in the Actions column and then click Rerun to rerun the task to perform full synchronization and incremental synchronization again.
If you want to rerun the synchronization task after you add tables to or remove tables from the task, click Complete after the change. In this case, Apply Updates is displayed in the Actions column of the synchronization task. Click Apply Updates to trigger the system to rerun the synchronization task. During the rerun process, the synchronization task synchronizes data from the newly added tables to the destination. Data in the original tables is not synchronized again.
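The two rerun behaviors above differ in which tables are synchronized again. The following sketch models that difference with plain sets. The function name and table names are hypothetical, and the sketch describes only the behavior stated in this section.

```python
def tables_to_resync(previous_tables, current_tables):
    """Return the tables whose data is synchronized when the task is rerun.

    Unchanged table list (Rerun): every table goes through full and
    incremental synchronization again.
    Changed table list (Apply Updates): only newly added tables are
    synchronized; data in the original tables is not synchronized again.
    """
    previous, current = set(previous_tables), set(current_tables)
    if previous == current:
        return sorted(current)
    return sorted(current - previous)


if __name__ == "__main__":
    print(tables_to_resync(["orders", "users"], ["orders", "users"]))
    print(tables_to_resync(["orders", "users"], ["orders", "users", "payments"]))
```

The first call returns both tables, and the second returns only payments. If tables are only removed from the task, the sketch returns an empty list because no table is newly added.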