This topic describes how to create a synchronization node to export data from MaxCompute to a MySQL data source.
- An ApsaraDB RDS for MySQL instance is created. The ID of the ApsaraDB RDS for MySQL instance is obtained. A whitelist is configured for the instance in the ApsaraDB for RDS console. The address information about the resource group on which the synchronization node to create will be run is added to the whitelist. For more information, see Create an ApsaraDB RDS for MySQL instance. Note If you use a custom resource group to run the synchronization node, you must add the IP addresses of the servers in the custom resource group to the whitelist of the ApsaraDB RDS for MySQL instance.
- The odps_result table is created in the ApsaraDB RDS for MySQL database to which the data is synchronized. You can create the table by executing the following statement:
After the table is created, execute the
CREATE TABLE `ODPS_RESULT` ( `education` varchar(255) NULL , `num` int(10) NULL );
desc odps_result;statement to view the table details.
Background informationYou can use Data Integration to periodically synchronize the business data generated in a business system to a DataWorks workspace. You can create SQL nodes to compute the data and use Data Integration to periodically synchronize the computing results to your specified data source for further display or use.
Data Integration can import data from and export data to various data sources, such as ApsaraDB RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, Distribute Relational Database Service (DRDS), Object Storage Service (OSS), Oracle, FTP, Dameng, Hadoop Distributed File System (HDFS), and MongoDB. For more information about the data sources, see Supported data source types, Reader plug-ins, and Writer plug-ins.
Add a data source
- Go to the Workspace Management page of a workspace.
- In the left-side navigation pane, click Data Source.
- On the Data Source page, click Add data source in the upper-right corner.
- In the Add data source dialog box, click MySQL.
- In the Add MySQL data source dialog box, configure the parameters. In this example, set the Data Source Type parameter to Alibaba Cloud Instance Mode.
Parameter Description Data Source Type The mode in which the data source is added. Set this parameter to Alibaba Cloud Instance Mode. Data Source Name The name of the data source. The name can contain only letters, digits, and underscores (_), and must start with a letter. Data Source Description The description of the data source. The description cannot exceed 80 characters in length. Environment The environment in which the data source is used. Valid values: Development and Production.Note This parameter is displayed only if the workspace is in standard mode. Region The region where the data source resides. RDS Instance ID The ID of the ApsaraDB RDS for MySQL instance. You can log on to the ApsaraDB RDS console to obtain the ID. RDS Instance Account ID The ID of the Alibaba Cloud account that is used to purchase the ApsaraDB RDS for MySQL instance. You can log on to the DataWorks console with the Alibaba Cloud account, move the pointer over the profile picture in the upper-right corner, and then view the ID of the account. Default Database NameThe name of the default ApsaraDB RDS for MySQL database. The following descriptions provide instructions for you to configure a data synchronization node or solution that uses a MySQL data source:
- When you configure a database-level real-time or batch synchronization node or solution that uses a MySQL data source, you can select one or more databases on which you have access permissions in the ApsaraDB for MySQL instance.
- If you select multiple databases when you configure a batch synchronization node, you must add a data source for each database.
Username The username that is used to log on to the database. Password The password that is used to log on to the database. Do not use at signs (@) in the password. Configure Secondary Database If the ApsaraDB RDS for MySQL instance is a primary instance and has a secondary ApsaraDB RDS for MySQL instance, you can turn on the Configure Secondary Database switch when you add the primary ApsaraDB RDS for MySQL instance to DataWorks as a data source, and select the ID of the secondary ApsaraDB RDS for MySQL instance. This prevents the primary ApsaraDB RDS for MySQL instance from being affected and ensures the performance of the primary ApsaraDB RDS for MySQL instance. If the secondary ApsaraDB RDS for MySQL instance has multiple read-only ApsaraDB RDS for MySQL instances, only data of one of the read-only instances is read.Note If you turn on the Configure Secondary Database switch when you add a MySQL data source, you can use only an exclusive resource group to run the node that uses the data source.
- Test the connectivity between the data source and resource groups. Click the Data Integration tab next to the Resource Group connectivity parameter, find the resource group for Data Integration that you want to use, and then click Test connectivity in the Actions column. Click the Schedule tab and perform the same operations to test the connectivity between the data source and the resource group for scheduling that you want to use. If the connectivity status is Connected, the resource groups are connected to the data source.Note
- A synchronization node can use only one resource group of a specific type.
- To ensure that your synchronization node can be run as expected, you must test the connectivity between the data source and all types of resource groups on which your synchronization node will be run.
- If the data source passes the network connectivity test, click Complete.
Create and configure a synchronization node
This section describes how to create and configure a synchronization node named write_result and use the node to synchronize data in the result_table table to your MySQL data source. Perform the following operations:
- Go to the DataStudio page and create a batch synchronization node named write_result.
- Configure the batch synchronization node.
- On the DataStudio page, configure the insert_data node as an ancestor node of the write_result node.
- On the configuration tab of the batch synchronization node, select ODPS from the Connection drop-down list, select the odps_first database from the next drop-down list, and then select result_table from the Table drop-down list. result_table is the source table from which the data is synchronized.
- Select the odps_result table that you create in the ApsaraDB RDS for MySQL database as the destination table.
- In the Mappings section, configure field mappings. Make sure that fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right.
- Configure channel control policies. You can configure channel control policies in the setting field, such as the maximum number of parallel threads that can be used for data synchronization, the maximum transmission rate, and settings for dirty data records.
Parameter Description executeMode Specifies whether to enable the distributed execution mode for the batch synchronization node. Valid values:
- distribute: Enables the distributed execution mode. If you enable the distributed execution mode for the batch synchronization node, the system splits the node into slices and distributes them to multiple Elastic Compute Service (ECS) instances for parallel running. In this case, the more ECS instances, the higher the data synchronization speed.
- null: Disables the distributed execution mode. If you do not enable the distributed execution mode for the batch synchronization node, the specified maximum number of parallel threads is used only for a single ECS instance to run the node.
- If your exclusive resource group contains only one ECS instance, we recommend that you do not run your batch synchronization node in distributed execution mode.
- If one ECS instance can meet your business requirements for data transmission speed, you do not need to enable the distributed execution mode. This can simplify the execution mode of your node.
- The distributed execution mode can be enabled only if the maximum number of parallel threads that you specified is greater than or equal to 8.
- Whether a batch synchronization node supports the distributed execution mode varies based on the data source type. For more information, see the topics for readers and writers.
concurrent The maximum number of parallel threads that the batch synchronization node uses to read data from the source or write data to the destination.Note The actual number of parallel threads that are used during data synchronization may be less than or equal to the specified threshold due to the specifications of the exclusive resource group for Data Integration. You are charged for the exclusive resource group for Data Integration based on the number of parallel threads that are used. For more information, see Performance metrics. throttle Specifies whether to enable throttling.
Note The bandwidth value is a metric provided by Data Integration and does not represent the actual traffic of an elastic network interface (ENI). In most cases, the ENI traffic is one to two times the channel traffic. The actual ENI traffic depends on the serialization of the data storage system.
- true: Enables throttling. If you enable throttling, you can specify a maximum transmission rate to prevent heavy read workloads on the source. The minimum value of this parameter is 1 MB/s. Note If you set the throttle parameter to true, you must also configure the mbps parameter. The mbps parameter specifies the maximum transmission rate that is allowed.
- false: Disables throttling. If you do not enable throttling, data is transmitted at the maximum transmission rate allowed by the hardware based on the specified maximum parallel threads.
errorLimit Specifies whether to allow the generation of dirty data during data synchronization.Important If a large amount of dirty data is generated during data synchronization, the overall data synchronization speed is affected.
Note Dirty data indicates data that is meaningless to business, does not match the specified data type, or leads to an exception during data synchronization. If an exception occurs when a single data record is written to the destination, the data record is considered as dirty data. Data records that fail to be written to a destination are considered as dirty data.
- If this parameter is not configured, dirty data records are allowed during data synchronization, and the batch synchronization node can continue to run if dirty data records are generated.
- If you set this parameter to 0, no dirty data records are allowed. If dirty data records are generated during data synchronization, the batch synchronization node fails.
- If you specify a value that is greater than 0 for this parameter, the following situations occur:
- If the number of dirty data records that are generated during data synchronization is less than or equal to the value that you specified, the dirty data records are ignored and are not written to the destination, and the batch synchronization node continues to run.
- If the number of dirty data records that are generated during data synchronization is greater than the value that you specified, the batch synchronization node fails.
For example, when a batch synchronization node attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, and the data fails to be written to the destination. In this case, the data is dirty data. When you configure a batch synchronization node, you can control whether dirty data is allowed. You can also specify the maximum number of dirty data records that are allowed during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specified, the synchronization node fails.Note In addition to the preceding configurations, the overall data synchronization speed of a batch synchronization node is also affected by factors such as the performance of the source and the network environment for data synchronization. For information about the data synchronization speed and performance tuning of a batch synchronization node, see Optimize the performance of batch synchronization nodes.
- Preview and save the configuration. After the node is configured, you can scroll up and down to view the node configuration. After you confirm that the configuration is correct, click the icon in the top toolbar.
Commit the synchronization node
Return to the workflow after you save the synchronization node. Click the icon in the top toolbar to commit the synchronization node to the scheduling system. The scheduling system automatically runs the node at the scheduled time from the next day based on your settings.
What to do next
Now you have learned how to create a synchronization node to export data to a specific data store. You can proceed with the next tutorial. In the next tutorial, you will learn how to configure properties and dependencies for a synchronization node. For more information, see Configure recurrence and dependencies for a node.