
DataWorks:Create a real-time ETL synchronization task to synchronize data from Simple Log Service to Hologres

Last Updated:Aug 09, 2023

This topic describes how to create a real-time extract, transform, and load (ETL) synchronization task to synchronize data from Simple Log Service to Hologres.

Prepare data sources

Prepare a LogHub data source

Add a LogHub data source in Simple Log Service to DataWorks. For more information, see Add a LogHub data source.

Prepare a Hologres data source

  • Obtain the information about the Hologres instance that you want to add to DataWorks as a data source

    Log on to the Hologres console. In the left-side navigation pane, click Instances. On the Instances page, find the Hologres instance that you want to add to DataWorks as a data source and click the instance name. On the Instance Details page, obtain the following information about the Hologres instance: instance ID, region, and endpoint. If the virtual private cloud (VPC) network type is enabled for the Hologres instance, you can also obtain the VPC ID and vSwitch ID.

  • Add a Hologres data source

    Associate a Hologres compute engine with the workspace that you want to use to enable the system to generate a Hologres data source. Alternatively, directly add a Hologres data source to the workspace that you want to use. For more information, see Associate a Hologres compute engine with a workspace or Add a Hologres data source.

  • Configure an IP address whitelist for the Hologres data source

    Log on to the HoloWeb console. In the top navigation bar, click the Security Center tab. On the Security Center tab, select the name of the Hologres data source from the Instance Name drop-down list and click Add IP Address to Whitelist to configure an IP address whitelist.

Create and configure a synchronization task

Configure a synchronization task to synchronize data from a single Logstore to a database of the Hologres data source.

Create and configure a synchronization task

  1. Go to the Data Integration page in the DataWorks console. In the left-side navigation pane, click Data Synchronization Node. On the Nodes page, click Create Node.

  2. Configure basic information for the synchronization task in the following sections:

• Task Name: Enter a name for the synchronization task based on your business requirements.

    • Synchronization Type: Select LogHub as the source type and Hologres as the destination type.

    • Network and Resource Configuration: Select the LogHub data source, Hologres data source, and exclusive resource group for Data Integration that you prepared, and click Test Connectivity for All Resource Groups and Data Sources to test network connectivity between the resource group and data sources.

  3. Configure the LogHub data source.

1. Click SLSSource in the wizard at the top of the configuration page and configure the LogHub data source.

    2. Select the Logstore from which you want to synchronize data.

    3. Click Data Sampling.

      In the Preview Data Output dialog box, configure the Start At and Sampled Data Records parameters and click Start Collection. The system samples data from the Logstore that you specified. You can preview the data in the Logstore. The data in the Logstore is used as input data for data preview and visualization configurations of a data processing node.

  4. Configure output fields.

    The system extracts fields based on the sampled data, such as system fields, tag-related fields, and common fields of the LogHub data source. During data sampling, only specific fields are extracted. You can click Add Output Field to add the fields that are not sampled.

    Note

If a field that is extracted does not need to be synchronized, you can click Delete in the Operation column of the field to delete the field. If a field that is extracted needs to store binary data, you can select BINARY from the Data Type column of the field.
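The field extraction described above can be pictured as grouping the keys of a sampled log record. The following sketch (not the DataWorks implementation) classifies fields by the naming conventions that Simple Log Service uses: double-underscore names such as `__time__` and `__source__` for system fields, and the `__tag__:` prefix for tag-related fields; the sample record is hypothetical.

```python
# Illustrative sketch: classify the fields of a sampled LogHub record into
# system fields, tag-related fields, and common (business) fields, following
# Simple Log Service naming conventions.

def classify_fields(record: dict) -> dict:
    groups = {"system": [], "tag": [], "common": []}
    for name in record:
        if name.startswith("__tag__:"):
            groups["tag"].append(name)          # e.g. __tag__:__hostname__
        elif name.startswith("__") and name.endswith("__"):
            groups["system"].append(name)       # e.g. __time__, __source__
        else:
            groups["common"].append(name)       # ordinary log fields
    return groups

# Hypothetical sampled record for illustration.
sample = {
    "__time__": "1691543210",
    "__source__": "192.168.0.1",
    "__tag__:__hostname__": "worker-1",
    "level": "INFO",
    "message": "request handled",
}
print(classify_fields(sample))
```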

  5. Configure a data processing node.

You can click the Add icon in the wizard at the top of the configuration page to add a data processing method. The following data processing methods are supported: Data Masking, Replace String, Data Filtering, JSON Parsing, and Edit Field and Assign Value. You can arrange the data processing methods based on your business requirements. When the synchronization task is run, data is processed based on the processing order that you specify.

    Note

    A data processing node can obtain data from only one source and can transfer data to only one destination.

    After you configure a data processing node, you can click Preview Data Output in the upper-right corner of the configuration page. In the dialog box that appears, you can click Re-obtain Output of Ancestor Node to enable the data processing node to process the data that is sampled from the specified Logstore and preview the processing result.

    In the Preview Data Output dialog box, you can change the input data or click Manually Construct Data to customize the input data. Then, you can click Preview to preview the result generated after the input data is processed by the data processing node. If an exception occurs on the data processing node or dirty data is generated, the system reports an error in real time. This can help you check the configurations of the data processing node and determine whether expected results can be obtained at the earliest opportunity.
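The ordering rule above can be sketched as a chain of single-input, single-output steps, mirroring the constraint that each processing node obtains data from one source and passes it to one destination. This is an illustrative sketch, not the DataWorks API; `mask_phone` and `filter_errors` are hypothetical stand-ins for a Data Masking node and a Data Filtering node.

```python
# Illustrative sketch: a processing chain in which each node consumes the
# output of the previous node, so the order of the nodes determines the result.

def mask_phone(record):
    # Hypothetical masking step: hide the middle digits of a "phone" field.
    if "phone" in record:
        record = {**record,
                  "phone": record["phone"][:3] + "****" + record["phone"][-4:]}
    return record

def filter_errors(record):
    # Hypothetical filtering step: keep only WARN and ERROR records.
    return record if record.get("level") in ("WARN", "ERROR") else None

pipeline = [mask_phone, filter_errors]  # order matters, as in the wizard

def run(record, steps):
    for step in steps:
        record = step(record)
        if record is None:      # record was filtered out mid-chain
            return None
    return record

print(run({"level": "ERROR", "phone": "13812345678"}, pipeline))
```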

  6. Configure the Hologres data source.

Click Hologres in the wizard at the top of the configuration page and configure the Hologres data source.

    1. Configure basic information for the Hologres data source.

      • Select the Hologres schema to which you want to write data.

      • Configure the Destination Table parameter. You can select Create Table or Use Existing Table.

      • Enter a table name or select a table name from the Table Name drop-down list.

    2. Edit the schema for the destination Hologres table that is automatically created.

      If you select Create Table for the Destination Table parameter, click Edit Table Schema. In the dialog box that appears, edit the schema for the destination Hologres table that is automatically created. You can also click Re-generate Table Schema Based on Output Column of Ancestor Node to re-generate a table schema based on the output columns of an ancestor node. You can select a column from the generated table schema and configure the column as the primary key. You can select a partition field based on your business requirements. The system creates a child partitioned table based on each value of the partition field. In most cases, you do not need to select a partition field. You can also adjust the properties of the destination Hologres table that is automatically created. Then, you can click Save to save the configurations.

      Note

      The destination Hologres table must have a primary key. Otherwise, the configurations cannot be saved.

    3. Configure mappings between fields in the source and fields in the destination.

      After you configure basic information for the Hologres data source with the Destination Table parameter set to Use Existing Table or save the table schema settings, the system automatically establishes mappings between columns in the source and columns in the destination based on the same-name mapping principle. You can modify the mappings based on your business requirements. One column in the source can map to multiple columns in the destination. Multiple columns in the source cannot map to the same column in the destination. If a column in the source has no mapped column in the destination, data in the column in the source is not synchronized to the destination.
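The mapping rules above can be expressed as a simple validity check: a source column may feed several destination columns, but two source columns must not feed the same destination column. The following sketch is illustrative only and assumes mappings are given as (source, destination) pairs.

```python
# Illustrative sketch of the column mapping rules: one source column may map
# to multiple destination columns, but multiple source columns may not map to
# the same destination column.

def validate_mappings(mappings):
    """mappings: list of (source_column, destination_column) pairs."""
    seen_dest = {}
    for src, dest in mappings:
        if dest in seen_dest and seen_dest[dest] != src:
            raise ValueError(f"destination column {dest!r} is mapped from "
                             f"both {seen_dest[dest]!r} and {src!r}")
        seen_dest[dest] = src
    return True

# Allowed: one source column feeding two destination columns.
validate_mappings([("msg", "message"), ("msg", "message_raw")])

# Rejected: two source columns feeding one destination column.
try:
    validate_mappings([("a", "col"), ("b", "col")])
except ValueError as e:
    print("rejected:", e)
```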

    4. Configure processing policies for dynamic columns generated by an ancestor data processing node.

Dynamic columns are columns that do not have fixed names. The synchronization task automatically parses the names and values of dynamic columns from the input data of the source and synchronizes the data from these columns to the destination. Only JSON parsing nodes can generate dynamic columns. If you make configurations in the Dynamic Output Field section when you configure a JSON parsing node, you must configure processing policies that control how the dynamic columns generated by that node are handled.
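The idea of dynamic columns can be sketched as follows: some column names are declared up front, while any other keys found in the JSON payload at run time become columns whose names are derived from the data itself. This is an illustrative sketch, not the behavior of the JSON Parsing node; the field names are hypothetical.

```python
import json

# Illustrative sketch: parsing a JSON log line where "level" and "message"
# are fixed, declared columns, and every other key becomes a dynamic column
# whose name is only known at run time.

def parse_dynamic_columns(raw_log: str) -> dict:
    fixed, dynamic = {}, {}
    for key, value in json.loads(raw_log).items():
        if key in ("level", "message"):   # columns declared up front
            fixed[key] = value
        else:                             # name discovered in the data itself
            dynamic[key] = value
    return {"fixed": fixed, "dynamic": dynamic}

result = parse_dynamic_columns(
    '{"level": "INFO", "message": "ok", "user_id": 42, "region": "cn-shanghai"}'
)
print(result["dynamic"])
```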

  7. Configure advanced parameters.

    Click Configure Advanced Parameters in the upper-right corner of the configuration page. In the Configure Advanced Parameters panel, configure items such as parallelism and memory resources. You can configure each item based on the data amount of the specified Logstore and the number of partitions in the Logstore. We recommend that you configure the items based on the following instructions:

    • Number of parallel threads used to read data from the Logstore = Number of partitions in the Logstore

    • Number of parallel threads used to write data to Hologres = Number of partitions in the Logstore

    • Memory size (GB) = 1.5 GB + (256 MB × Number of partitions in the Logstore)

      Note

      The performance and resource consumption of a synchronization task are affected by factors such as the data amount of the source and destination, the network environment, and the loads of DataWorks. You can refer to the preceding instructions to change the settings of the items based on your business requirements.
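The sizing guidance above is simple arithmetic and can be captured in a small helper. This is a sketch of the recommendation only, not a DataWorks API; actual settings should still be tuned per the note above.

```python
# A small helper that applies the sizing guidance: read and write parallelism
# both equal the number of Logstore partitions, and memory is 1.5 GB plus
# 256 MB per partition.

def recommended_settings(partitions: int) -> dict:
    return {
        "read_parallelism": partitions,
        "write_parallelism": partitions,
        "memory_gb": 1.5 + partitions * 256 / 1024,  # 256 MB per partition, in GB
    }

print(recommended_settings(4))
```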

  8. Perform a test on the synchronization task.

    After the preceding configuration is complete, you can click Perform Simulated Running in the upper-right corner of the configuration page to enable the synchronization task to synchronize the sampled data to the destination Hologres table. You can view the synchronization result in the destination Hologres table. If specific configurations of the synchronization task are invalid, an exception occurs during the test run, or dirty data is generated, the system reports an error in real time. This can help you check the configurations of the synchronization task and determine whether expected results can be obtained at the earliest opportunity.

    1. Click Perform Simulated Running in the upper-right corner of the configuration page. In the dialog box that appears, configure the parameters for data sampling from the specified Logstore, including the Start At and Sampled Data Records parameters.

    2. Click Start Collection to enable the synchronization task to sample data from the specified Logstore.

    3. Click Preview to enable the synchronization task to synchronize the sampled data to the destination Hologres table.

    After the preceding configuration is complete, click Complete.

Perform O&M on the synchronization task

Start the synchronization task

After you complete the configuration of the synchronization task, you are navigated to the Nodes page. You can find the created synchronization task and click Start in the Actions column to start the synchronization task.

View the running status of the synchronization task

After you complete the configuration of the synchronization task, you can find the task on the Nodes page, and click the task name or click Running Details in the Actions column to view the running details of the task. The running details page displays the following information about the synchronization task:

  • Basic information: You can view the basic information about the synchronization task, such as the data sources, resource group, and synchronization type.

  • Running status: The synchronization task has the following stages: schema migration and real-time synchronization. You can view the running status of the synchronization task in each stage.

  • Details: You can view the details of the synchronization task in the schema migration stage and real-time synchronization stage on the Schema Migration tab and the Real-time Synchronization tab.

    • Schema Migration: This tab displays information such as the generation methods of destination tables. The generation methods of destination tables include Use Existing Table and Create Table. If the generation method of a destination table is Create Table, the DDL statement that is used to create the table is displayed.

    • Real-time Synchronization: This tab displays statistics about real-time synchronization, including real-time read and write traffic, dirty data information, failovers, and run logs.

Rerun the synchronization task

  • Directly rerun the synchronization task

    Find the synchronization task on the Nodes page and choose More > Rerun in the Actions column to rerun the synchronization task without modifying the configurations of the synchronization task.

  • Modify the configurations of the synchronization task and then rerun the synchronization task

Find the synchronization task on the Nodes page, modify the configurations of the synchronization task, and then click Complete. Then, click Apply Updates in the Actions column of the synchronization task to rerun the task so that the latest configurations take effect.