Data Integration is a stable, efficient, and scalable data synchronization service. It is designed to migrate and synchronize data quickly and reliably between a wide range of heterogeneous data stores in complex network environments.

Limits

  • Data Integration can synchronize structured, semi-structured, and unstructured data. Structured data stores include Relational Database Service (RDS) and Distributed Relational Database Service (DRDS). Unstructured data, such as Object Storage Service (OSS) objects and text files, must be convertible to structured data. Data Integration can synchronize to MaxCompute only data that can be abstracted to two-dimensional logical tables. Unstructured data that cannot be converted to structured data, such as MP3 files stored in OSS, cannot be synchronized to MaxCompute.
  • Data Integration supports data synchronization and exchange in the same region or across regions.

    Data can be transmitted between some regions over the classic network, but network connectivity is not guaranteed. If transmission over the classic network fails, we recommend that you transmit data over the Internet.

  • Data Integration supports only data synchronization. It does not support data consumption.
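
As a minimal sketch of what "abstracted to a two-dimensional logical table" means, the following Python snippet turns delimited text, such as the content of a text file or an OSS object, into a header and rows. The sample data, the column names, and the `to_table` helper are hypothetical and serve only as an illustration. They are not part of Data Integration.

```python
import csv
import io


def to_table(text, delimiter=","):
    """Parse delimited text into a header and a list of equal-width rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, records = rows[0], rows[1:]
    # A two-dimensional table requires every record to have
    # the same number of columns as the header.
    for record in records:
        if len(record) != len(header):
            raise ValueError(f"row {record!r} does not fit header {header!r}")
    return header, records


# Hypothetical sample content of a text file.
sample = "id,name\n1,alice\n2,bob\n"
header, records = to_table(sample)
print(header, records)
```

Content that cannot be split into rows and columns in this way, such as the bytes of an MP3 file, has no such table form.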

Batch data synchronization

Data Integration can synchronize large amounts of data between diverse structured and semi-structured data stores. It provides readers and writers for the supported data stores and, based on simplified data types, defines a transmission channel between the source and destination data stores and datasets.

[Figure: Batch synchronization]
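
The reader, channel, and writer roles described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, assuming an in-memory list as the source and destination and a bounded queue as the channel. The function names and the `SENTINEL` marker are hypothetical and do not reflect how Data Integration is implemented.

```python
import queue
import threading

SENTINEL = object()  # hypothetical marker for the end of the stream


def reader(source_rows, channel):
    """Read rows from the source and push them into the channel."""
    for row in source_rows:
        channel.put(row)
    channel.put(SENTINEL)


def writer(channel, destination):
    """Pull rows from the channel and append them to the destination."""
    while True:
        row = channel.get()
        if row is SENTINEL:
            break
        destination.append(row)


source = [(1, "alice"), (2, "bob"), (3, "carol")]
destination = []
channel = queue.Queue(maxsize=2)  # bounded buffer between reader and writer

t_reader = threading.Thread(target=reader, args=(source, channel))
t_writer = threading.Thread(target=writer, args=(channel, destination))
t_reader.start()
t_writer.start()
t_reader.join()
t_writer.join()
print(destination)
```

Because the reader and the writer only share the channel, either side can be swapped for a different data store without changing the other.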

Development modes of sync nodes

You can develop sync nodes in one of the following modes:
  • Codeless user interface (UI): Data Integration provides step-by-step instructions to help you configure a sync node. This mode is easy to use but provides only limited features. For more information, see Create a sync node by using the codeless UI.
  • Code editor: You can write a JSON script to create a sync node. This mode supports advanced features that allow flexible configuration. It is suitable for experienced users but has a steeper learning curve. For more information, see Create a sync node by using the code editor.
Note
  • The code that is generated for a sync node on the codeless UI can be converted to a script. This conversion is irreversible. After the conversion is completed, you cannot switch back to the codeless UI mode.
  • Before you write code, you must configure a connection and create the destination table.
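
To give a rough idea of what a JSON script for a sync node might look like, the following Python sketch builds a job description with one reader step and one writer step and prints it as JSON. Every key, value, connection name, and table name here is an assumption made for illustration and is not a verified schema. See Create a sync node by using the code editor for the actual script format.

```python
import json

# All keys and values below are illustrative assumptions, not a verified schema.
job = {
    "type": "job",
    "steps": [
        {
            "category": "reader",
            "name": "Reader",
            "stepType": "mysql",  # hypothetical source type
            "parameter": {
                "datasource": "my_source_connection",  # hypothetical connection
                "table": "orders",
                "column": ["id", "amount"],
            },
        },
        {
            "category": "writer",
            "name": "Writer",
            "stepType": "odps",  # hypothetical destination type
            "parameter": {
                "datasource": "my_destination_connection",  # hypothetical connection
                "table": "ods_orders",  # the destination table must already exist
                "column": ["id", "amount"],
            },
        },
    ],
    "setting": {"speed": {"concurrent": 2}},  # hypothetical concurrency setting
}


def check_columns(job_config):
    """Return True if the reader and writer declare the same number of columns."""
    reader_step, writer_step = job_config["steps"]
    return len(reader_step["parameter"]["column"]) == len(
        writer_step["parameter"]["column"]
    )


script = json.dumps(job, indent=2)
print(script)
print("columns match:", check_columns(job))
```

A check such as `check_columns` is a cheap way to catch a mismatched column mapping before a script is submitted.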

Network types

A data store can reside on the classic network or in a virtual private cloud (VPC). Support for user-created Internet data center (IDC) networks is planned.
  • Classic network: a network that is deployed by Alibaba Cloud and shared with other tenants. This network is easy to use.
  • VPC: an isolated network that is created on Alibaba Cloud and can be used by only one Alibaba Cloud account. You have full control over your VPC, including customizing the IP address range, dividing the VPC into multiple subnets, and configuring routing tables and gateways.

    Because VPCs are widely deployed, Data Integration can automatically detect the reverse proxy for some data stores, including ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, PolarDB, DRDS, HybridDB for MySQL, AnalyticDB for PostgreSQL, and AnalyticDB for MySQL 3.0. With this feature, you do not need to purchase an extra Elastic Compute Service (ECS) instance in your VPC to configure sync nodes for these data stores. Data Integration automatically provides network connectivity to them.

    When you configure sync nodes for other Alibaba Cloud data stores in a VPC, such as ApsaraDB RDS for PPAS, ApsaraDB for OceanBase, ApsaraDB for Redis, ApsaraDB for MongoDB, ApsaraDB for Memcache, Tablestore, and ApsaraDB for HBase, you must purchase an ECS instance in the same VPC. This ECS instance is used to connect to the data stores.

  • User-created IDC network: an IDC network that you deploy yourself and that is isolated from the Alibaba Cloud network.
For more information about classic networks and VPCs, see VPC FAQ.
Note: You can also connect to data stores over the Internet. However, the connection speed depends on the Internet bandwidth, and additional network connection fees apply. We recommend that you do not use the Internet.

Basic concepts

  • Concurrency

    Concurrency indicates the maximum number of concurrent threads that the sync node uses to read data from or write data to data stores.

  • Bandwidth throttling

    Bandwidth throttling specifies a maximum transmission rate for a sync node of Data Integration.

  • Dirty data

    Dirty data is data that is meaningless or that does not match the specified data type. For example, if VARCHAR data in the source table cannot be converted to the INT type of a field in the destination table, a data conversion error occurs and the data cannot be written to the destination table. Such data is dirty data.

  • Connection

    A connection in DataWorks is used to connect to a data store, which can be a database or a data warehouse. DataWorks supports various types of data stores, and supports data synchronization between data stores of different types.
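
The concepts of concurrency, bandwidth throttling, and dirty data can be sketched together in Python. This is a simplified illustration under stated assumptions, not how Data Integration works internally: `CONCURRENCY`, `DIRTY_LIMIT`, `convert`, and `min_transfer_seconds` are hypothetical names, and the dirty data threshold in particular is an assumption that the text above does not describe.

```python
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 2  # maximum number of concurrent threads used for the sync
DIRTY_LIMIT = 1  # hypothetical threshold: number of tolerated dirty records


def convert(value):
    """Try to convert a VARCHAR-like string to INT; return (ok, value)."""
    try:
        return True, int(value)
    except ValueError:
        # The value does not match the destination type, so it is dirty.
        return False, value


def min_transfer_seconds(total_bytes, max_bytes_per_second):
    """Lower bound on transfer time under a bandwidth throttle."""
    return total_bytes / max_bytes_per_second


source = ["10", "20", "abc", "30"]

# At most CONCURRENCY conversions run at the same time.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(convert, source))

written = [value for ok, value in results if ok]
dirty = [value for ok, value in results if not ok]

if len(dirty) > DIRTY_LIMIT:
    raise RuntimeError(f"too many dirty records: {dirty!r}")

print("written:", written)
print("dirty:", dirty)
print("seconds:", min_transfer_seconds(10 * 1024 * 1024, 1024 * 1024))
```

With the sample input, the three numeric strings are converted and written, and "abc" is counted as dirty. Under a throttle of 1 MB/s, transferring 10 MB takes at least 10 seconds no matter how many threads are used.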

Reference

  • For more information about how to configure a sync node, see Node configuration.
  • For more information about how to process unstructured data, such as objects that are stored in OSS, see Access unstructured data.
  • DataWorks provides a default resource group that you can use to migrate large amounts of data to the cloud for free. However, the default resource group is not suitable if a high transmission speed is required or your data stores are deployed in complex environments. In these cases, you can use exclusive or custom resource groups for Data Integration to run your sync nodes. This ensures connections to your data stores and enables a higher transmission speed. For more information, see Create and use an exclusive resource group for Data Integration and Create a custom resource group for Data Integration.