Data Integration is a stable, efficient, and scalable data synchronization service. It is designed to migrate and synchronize data quickly and stably between a wide range of heterogeneous data sources in complex network environments.

Limits

  • Data Integration can synchronize structured, semi-structured, and unstructured data. Structured data sources include ApsaraDB RDS and PolarDB-X. Unstructured data, such as Object Storage Service (OSS) objects and text files, must be converted to structured data. Data Integration can synchronize to MaxCompute only the data that can be abstracted into two-dimensional logical tables. Unstructured data that cannot be converted to structured data, such as MP3 files stored in OSS, cannot be synchronized to MaxCompute.
  • Data Integration supports data synchronization and exchange in the same region or across specific regions.

    Data can be transmitted between some regions over the classic network, but network connectivity cannot be ensured. If the transmission fails over the classic network, we recommend that you transmit data over the Internet.

  • Data Integration supports only data synchronization but not data consumption.

Batch synchronization

Data Integration can be used to synchronize large amounts of offline data. It facilitates data transmission between diverse structured and semi-structured data sources by providing readers and writers for the supported data sources and defining, based on simplified data types, a transmission channel between the source and the destination.

Development modes of sync nodes

You can develop sync nodes in one of the following modes:
  • Codeless user interface (UI): Data Integration provides step-by-step instructions to help you configure a sync node. This mode is easy to use but provides only limited features. For more information, see Configure a sync node by using the codeless UI.
  • Code editor: You can write a JSON script to create a sync node. This mode supports advanced features and flexible configuration, but it is intended for experienced users and has a steeper learning curve. For more information, see Create a sync node by using the code editor.
Note
  • The code that is generated for a sync node on the codeless UI can be converted to a script. This conversion is irreversible. After the conversion is complete, you cannot switch back to the codeless UI mode.
  • Before you write code, you must configure the data sources and create the destination table.
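To make the code editor mode concrete, the following Python sketch builds a minimal sync-node script as a dictionary and prints it as JSON. The field names (`type`, `steps`, `stepType`, `category`, `order`, and the data source names `my_mysql_source` and `my_odps_dest`) are illustrative assumptions modeled on commonly published script samples, not a definitive schema; compare them against the script that the codeless UI generates for your node.

```python
import json

# Illustrative sketch of a code-editor script for a sync node that reads
# from a MySQL source and writes to a MaxCompute (ODPS) destination.
# All key names and data source names below are assumptions for
# illustration only.
sync_node_script = {
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "mysql",      # reader plug-in for the source
            "category": "reader",
            "name": "Reader",
            "parameter": {
                "datasource": "my_mysql_source",  # hypothetical source name
                "table": ["orders"],
                "column": ["order_id", "amount"],
            },
        },
        {
            "stepType": "odps",       # writer plug-in for MaxCompute
            "category": "writer",
            "name": "Writer",
            "parameter": {
                "datasource": "my_odps_dest",     # hypothetical destination name
                "table": "orders",
                "column": ["order_id", "amount"],
            },
        },
    ],
    # The transmission channel runs from the reader to the writer.
    "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
}

print(json.dumps(sync_node_script, indent=2))
```

In practice, you would paste a script of this shape into the code editor rather than generate it with Python; the sketch only shows how the reader, writer, and channel fit together.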

Network connectivity

You can run a sync node on a resource group for Data Integration to synchronize data from the source to the destination. Before you run the sync node, make sure the resource group for Data Integration is connected to the data sources.

Data Integration allows you to synchronize data between heterogeneous data sources in various network environments. You can select the network solution based on the network environment in which the data sources reside to ensure the network connectivity between the resource group for Data Integration and the data sources. For more information, see Select a network connectivity solution.

Data synchronization supports data sources that reside on the classic network, in a virtual private network (VPC), or in data centers.
  • Classic network: a network that is deployed and managed by Alibaba Cloud. The classic network is shared by Alibaba Cloud accounts.
  • VPC: a network that is created on Alibaba Cloud, which can be used by only one Alibaba Cloud account. You have full control over your VPC. For example, you can customize the IP address range, divide the VPC into multiple subnets, and configure route tables and gateways.

    Based on the wide deployment of VPCs, Data Integration can automatically detect the reverse proxy for the following data sources: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, PolarDB, PolarDB-X, HybridDB for MySQL, AnalyticDB for PostgreSQL, and AnalyticDB for MySQL V3.0. This feature frees you from purchasing an Elastic Compute Service (ECS) instance in your VPC to configure sync nodes for these data sources. Instead, Data Integration uses this feature to automatically establish network connectivity to these data sources.

    When you configure sync nodes for other Alibaba Cloud data sources in a VPC, such as ApsaraDB RDS for PPAS, ApsaraDB for OceanBase, ApsaraDB for Redis, ApsaraDB for MongoDB, ApsaraDB for Memcache, Tablestore, and ApsaraDB for HBase, you must purchase an ECS instance in the same VPC. This ECS instance is used to connect to the data sources.

  • Data center: a self-managed network that is isolated from the Alibaba Cloud network.
For more information about classic networks and VPCs, see VPC FAQ.
Note You can connect to data sources over the Internet. However, the connection speed depends on your Internet bandwidth, and you are charged for the Internet traffic that is generated. We recommend that you do not connect to data sources over the Internet. For more information about the billing rules of Internet traffic generated by Data Integration, see Internet traffic generated by Data Integration.

Terms

  • parallelism

    Parallelism indicates the maximum number of parallel threads that the sync node uses to read data from or write data to data sources.

  • bandwidth throttling

    Bandwidth throttling indicates that a maximum transmission rate is specified for a sync node of Data Integration.

  • dirty data

    Dirty data is data that is meaningless to the business, does not match the specified data type, or causes an exception during data synchronization. If an exception occurs when a single data record is written to the destination data source, the record is considered dirty data. Therefore, all data records that fail to be written to the destination data source are considered dirty data. In most cases, dirty data is data that does not match the specified data type. For example, if you attempt to write VARCHAR-type data from the source table to an INT-type field in the destination table, a data conversion error occurs and the data cannot be written to the destination table. Such data is dirty data.

    Dirty data cannot be written to the destination table. When you configure a sync node, you can specify whether dirty data is allowed and set the maximum number of dirty data records that can be generated during data synchronization. If the number of dirty data records exceeds the upper limit, the sync node fails.

  • data source

    A data source is a source of the data that is processed by DataWorks, such as a database or a data warehouse. DataWorks supports various types of data sources and supports data type conversion during synchronization.

    Before you create a sync node, you can add the source and the destination data sources on the Data Source page of DataWorks. When you create a sync node, select the source and the destination data sources from the drop-down lists as needed.
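The first three terms map directly onto the settings section of a sync node script. The following Python sketch shows one plausible layout; the key names (`speed.concurrent`, `speed.throttle`, `speed.mbps`, and `errorLimit.record`) are assumptions modeled on commonly published script samples, so verify them against the script generated for your own node.

```python
import json

# Illustrative "setting" fragment of a sync node script.
# All key names are assumptions for illustration only.
setting = {
    "speed": {
        "concurrent": 3,   # parallelism: up to 3 parallel read/write threads
        "throttle": True,  # bandwidth throttling is enabled
        "mbps": 10,        # maximum transmission rate when throttled
    },
    "errorLimit": {
        "record": 100,     # sync node fails after more than 100 dirty data records
    },
}

print(json.dumps(setting, indent=2))
```

Raising `concurrent` speeds up synchronization at the cost of heavier load on the source and destination, while `throttle` caps the transmission rate to protect shared bandwidth.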
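The VARCHAR-to-INT example under the dirty data term can be illustrated with a minimal sketch: a source value that cannot be converted to the destination's INT type fails to load, and the failed record is counted as dirty data. The sample rows below are invented for illustration.

```python
# Source rows arrive as VARCHAR (strings); the destination columns are INT.
source_rows = [("1001", "250"), ("1002", "n/a"), ("1003", "75")]

loaded, dirty = [], []
for order_id, amount in source_rows:
    try:
        # Conversion to the destination INT type; a non-numeric value raises
        # ValueError, which stands in for the data conversion error.
        loaded.append((int(order_id), int(amount)))
    except ValueError:
        dirty.append((order_id, amount))  # record is counted as dirty data

print(f"loaded={len(loaded)} dirty={len(dirty)}")
```

Here the record `("1002", "n/a")` cannot be converted and becomes dirty data; if the dirty record count exceeded the configured upper limit, the sync node would fail.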

References