Data Integration is a stable, efficient, and scalable data synchronization service provided by Alibaba Cloud. It is designed to migrate and synchronize data between a wide range of heterogeneous data stores quickly and reliably in complex network environments.

Batch data synchronization

Data Integration can synchronize large amounts of data in batches. It transmits data between diverse structured and semi-structured data stores. For each supported data store, it provides a reader and a writer, and it defines a transmission channel between the source and destination data stores and datasets based on a simplified set of data types.
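
The reader, channel, and writer roles described above can be pictured with a minimal Python sketch. It is a conceptual illustration only, not the service's implementation: the type mapping `SIMPLIFIED_TYPES`, the functions `read_rows` and `write_rows`, and the sample table are all hypothetical names introduced here.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical mapping from source column types to a small set of simplified
# intermediate types. The real type system is not described in this topic.
SIMPLIFIED_TYPES: Dict[str, Callable[[object], object]] = {
    "varchar": str,
    "text": str,
    "int": int,
    "bigint": int,
    "double": float,
}


def read_rows(source: Iterable[dict], schema: Dict[str, str]) -> Iterator[List[object]]:
    """Reader: pull records from the source and convert each column to a simplified type."""
    for record in source:
        yield [SIMPLIFIED_TYPES[col_type](record[name]) for name, col_type in schema.items()]


def write_rows(rows: Iterable[List[object]], destination: List[List[object]]) -> int:
    """Writer: drain the channel into the destination and return the number of rows written."""
    count = 0
    for row in rows:
        destination.append(row)
        count += 1
    return count


# Sample source data, with every value stored as text.
source_table = [
    {"id": "1", "name": "alice", "score": "9.5"},
    {"id": "2", "name": "bob", "score": "7"},
]
source_schema = {"id": "bigint", "name": "varchar", "score": "double"}

destination_table: List[List[object]] = []
# The generator returned by read_rows plays the role of the transmission channel.
synced = write_rows(read_rows(source_table, source_schema), destination_table)
print(synced, destination_table)
```

Because the writer sees only the simplified types, the same writer could in principle be paired with a reader for any other source that maps its columns to those types.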

Supported data stores

Data Integration supports a wide range of data stores, including:
  • Relational databases: MySQL, SQL Server, PostgreSQL, Oracle, DM, Distributed Relational Database Service (DRDS), POLARDB, HybridDB for MySQL, AnalyticDB for PostgreSQL, AnalyticDB for MySQL 2.0, and AnalyticDB for MySQL 3.0
  • Big data storage: MaxCompute, Datahub, and Data Lake Analytics (DLA)
  • Semi-structured storage: Object Storage Service (OSS), Hadoop Distributed File System (HDFS), and File Transfer Protocol (FTP)
  • NoSQL: MongoDB, Memcache, Redis, and Table Store
  • Message queue: LogHub
  • Graph computing engine: GraphCompute
  • Real-time data: MySQL binlog and Oracle Change Data Capture (CDC)
For more information, see Supported data stores.
Note Connection configurations vary greatly between data stores. For the specific parameters that you must set, see the topics about configuring connections and sync nodes for each data store.

Development modes of sync nodes

You can develop sync nodes in either of the following modes:
  • Codeless UI: Data Integration provides step-by-step instructions to help you quickly configure a sync node. This mode is easy to use but provides only limited features.
  • Code editor: You can write a JSON script to create a sync node. This mode supports advanced features that allow flexible configuration. It is suitable for experienced users but has a steeper learning curve.
Note
  • A sync node configured on the codeless UI can be converted to a script. This conversion is irreversible: after it is completed, you cannot switch back to the codeless UI mode.
  • Before writing code, you must configure a connection and create the destination table.
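
As a rough illustration of what a JSON script for the code editor might contain, the following Python sketch builds one and serializes it. Every key name (`steps`, `category`, `stepType`, `parameter`, `setting`, and so on) and every value is an assumption made for this example, not a schema confirmed by this topic. Use the template that the code editor generates as the reference. The helper `column_counts_match` is also hypothetical.

```python
import json

# All key names and values below are illustrative assumptions, not a confirmed schema.
sync_script = {
    "type": "job",
    "steps": [
        {
            "category": "reader",
            "stepType": "mysql",
            "name": "Reader",
            "parameter": {
                "datasource": "my_mysql_connection",  # hypothetical connection name
                "table": "orders",                    # hypothetical source table
                "column": ["id", "amount", "created_at"],
            },
        },
        {
            "category": "writer",
            "stepType": "odps",
            "name": "Writer",
            "parameter": {
                "datasource": "my_maxcompute_connection",  # hypothetical connection name
                "table": "ods_orders",                     # destination table, created beforehand
                "column": ["id", "amount", "created_at"],
            },
        },
    ],
    "setting": {"speed": {"concurrent": 1}, "errorLimit": {"record": 0}},
}


def column_counts_match(script: dict) -> bool:
    """Check that the reader and the writer list the same number of columns."""
    steps = {step["category"]: step for step in script["steps"]}
    reader_columns = steps["reader"]["parameter"]["column"]
    writer_columns = steps["writer"]["parameter"]["column"]
    return len(reader_columns) == len(writer_columns)


script_text = json.dumps(sync_script, indent=2)
print(column_counts_match(sync_script))
print(script_text)
```

The script refers to a connection and a destination table by name, which is why both must exist before you write the code.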

Network types

A data store can reside on a classic network or in a Virtual Private Cloud (VPC). Support for user-created Internet data center (IDC) networks is planned.
  • Classic network: a network deployed by Alibaba Cloud, which is shared with other tenants. Networks of this type are easy to use.
  • VPC: an isolated network created on Alibaba Cloud, which is available to only one Alibaba Cloud account. You have full control over your VPC, including customizing the IP address range, dividing the VPC into multiple subnets, and configuring routing tables and gateways.

    Because VPCs are widely deployed, Data Integration can automatically detect the reverse proxy for some data stores, including ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB for POLARDB, DRDS, HybridDB for MySQL, AnalyticDB for PostgreSQL, and AnalyticDB for MySQL 3.0. With this feature, you do not need to purchase an extra Elastic Compute Service (ECS) instance in your VPC to configure sync nodes for these data stores. Data Integration provides network connectivity to them automatically.

    When you configure sync nodes for other Alibaba Cloud data stores in a VPC, such as PPAS, ApsaraDB for OceanBase, ApsaraDB for Redis, ApsaraDB for MongoDB, ApsaraDB for Memcache, Table Store, and ApsaraDB for HBase, you must purchase an ECS instance in the same VPC. This ECS instance is used to access the data stores.

  • User-created IDC network: an IDC network deployed by yourself, which is isolated from the Alibaba Cloud network.

For more information about classic networks and VPCs, see FAQ.
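
The VPC rules above can be summarized as a small lookup. The following Python sketch encodes only the two lists of data stores given in this topic. The function name `vpc_connectivity` and the returned strings are hypothetical, and any data store that is not listed is reported as unknown rather than guessed.

```python
# Data stores for which this topic says the reverse proxy is detected automatically.
AUTO_REVERSE_PROXY = frozenset({
    "ApsaraDB RDS for MySQL",
    "ApsaraDB RDS for PostgreSQL",
    "ApsaraDB RDS for SQL Server",
    "ApsaraDB for POLARDB",
    "DRDS",
    "HybridDB for MySQL",
    "AnalyticDB for PostgreSQL",
    "AnalyticDB for MySQL 3.0",
})

# Data stores for which this topic says an ECS instance in the same VPC is required.
NEEDS_ECS_IN_VPC = frozenset({
    "PPAS",
    "ApsaraDB for OceanBase",
    "ApsaraDB for Redis",
    "ApsaraDB for MongoDB",
    "ApsaraDB for Memcache",
    "Table Store",
    "ApsaraDB for HBase",
})


def vpc_connectivity(data_store: str) -> str:
    """Return how a sync node reaches the given data store when it resides in a VPC."""
    if data_store in AUTO_REVERSE_PROXY:
        return "automatic reverse proxy detection"
    if data_store in NEEDS_ECS_IN_VPC:
        return "ECS instance in the same VPC"
    return "unknown, check the documentation for this data store"


for store in ("DRDS", "Table Store", "Some other store"):
    print(store, "->", vpc_connectivity(store))
```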

Note You can access data stores over a public network. However, the access speed depends on the public network bandwidth, and additional network charges are incurred. We recommend that you do not use public network connections.

Limits

  • Data Integration can synchronize structured, semi-structured, and unstructured data. Structured data stores include RDS and DRDS. Unstructured data, such as OSS objects and text files, must be convertible to structured data. That is, Data Integration can synchronize to MaxCompute only data that can be abstracted into two-dimensional logical tables. It cannot synchronize unstructured data that cannot be converted to structured data, such as MP3 files stored in OSS.
  • Data Integration supports data synchronization and exchange in the same region or across regions.

    Data can be transmitted between some regions over a classic network, but the network connectivity is not guaranteed. If the transmission fails over a classic network, we recommend that you use a public network connection.

  • Data Integration supports only data synchronization, not data consumption.
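
To make the "two-dimensional logical table" requirement concrete, the following Python sketch checks whether a piece of delimited text can be read as rows that all have the same number of columns. It illustrates the idea only and is not how Data Integration parses files. The function `to_rows` and the sample text are hypothetical.

```python
import csv
import io
from typing import List


def to_rows(text: str, delimiter: str = ",") -> List[List[str]]:
    """Parse delimited text into rows and reject input that does not form a regular table."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip empty lines
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"rows have inconsistent column counts: {sorted(widths)}")
    return rows


sample = "1,alice,9.5\n2,bob,7.0\n"
print(to_rows(sample))
```

A text file like the sample maps cleanly to a table with three columns, whereas binary content such as an MP3 file has no row and column structure to recover.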

Reference

  • For more information about how to configure a sync node, see Create a sync node.
  • For more information about how to process unstructured data, such as objects stored in OSS, see Access OSS unstructured data.
  • DataWorks provides a default resource group that you can use to migrate large amounts of data to the cloud for free. However, the default resource group is not suitable if you require a high transmission speed or your data stores are deployed in complex environments. In these cases, you can run your sync nodes on exclusive or custom resource groups, which guarantee connections to your data stores and enable a higher transmission speed. For more information, see Use exclusive resource groups for data integration and Add a custom resource group.