
Data Sync overview

Last Updated: Apr 03, 2018

Data Integration is a data synchronization platform from Alibaba Group that provides stable, efficient, and elastically scalable services. It is designed to move and synchronize data quickly and reliably between heterogeneous data sources in complex network environments.

Introduction to offline (batch) data sync

The offline (batch) data channel defines the source and target databases and data sets, and provides a set of abstract data extraction plug-ins (Readers) and data writing plug-ins (Writers) for them. On top of this framework, it defines a simplified intermediate data transmission format, so that data can be transferred between any structured and semi-structured data sources.
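The Reader/Writer split described above can be illustrated with a minimal Python sketch. It is not the product's plug-in API: the class names `Reader`, `Writer`, `DelimitedLinesReader`, and `InMemoryTableWriter`, the `run_sync` helper, and the choice of a list of column values as the intermediate record format are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, Iterator, List, Tuple

# Hypothetical intermediate format: one record is a list of column values.
Record = List[Any]


class Reader(ABC):
    """Extracts records from a source (illustrative interface only)."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Writer(ABC):
    """Writes records to a target (illustrative interface only)."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class DelimitedLinesReader(Reader):
    """Reads delimited text lines and emits one record per line."""

    def __init__(self, lines: Iterable[str], delimiter: str = ",") -> None:
        self.lines = lines
        self.delimiter = delimiter

    def read(self) -> Iterator[Record]:
        for line in self.lines:
            yield line.split(self.delimiter)


class InMemoryTableWriter(Writer):
    """Collects records as rows of an in-memory table."""

    def __init__(self) -> None:
        self.rows: List[Tuple[Any, ...]] = []

    def write(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.rows.append(tuple(record))
            count += 1
        return count


def run_sync(reader: Reader, writer: Writer) -> int:
    """Moves every record from the reader to the writer; returns the count."""
    return writer.write(reader.read())
```

Because both sides only agree on the intermediate record format, any reader in this sketch can be paired with any writer, which is the point of the abstraction.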


Supported data sources

Data Integration supports a wide range of data sources, including the following:

  • Text storage (FTP/SFTP/OSS/Multimedia files)

  • Database (RDS/DRDS/MySQL/PostgreSQL)

  • NoSQL (Memcache/Redis/MongoDB/HBase)

  • Big data (MaxCompute/AnalyticDB/HDFS)

  • MPP database (HybridDB for MySQL)

For more information, see Supported data sources.

Note:

Configuration details differ considerably between data sources, and the parameters you need depend on your actual use case. For this reason, detailed parameter descriptions are provided on the data source configuration and job configuration pages, where you can look them up as needed.

Description of synchronization development

Synchronization development provides two development modes: Wizard and Script.

Wizard

The wizard mode provides step-by-step visual guidance and walks you through the configuration of data sync tasks. This mode has a low learning cost but lacks certain advanced functions.

Script

The script mode lets you write a data sync JSON script directly to complete data sync development. It provides a rich set of flexible functions for fine-grained configuration management. It is suitable for advanced users but has a higher learning cost.
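As a rough idea of what a script-mode job expresses, the following Python sketch assembles a reader section, a writer section, and a settings section, then serializes them to JSON. Every key and value here (`reader`, `writer`, `setting`, `channels`, the plug-in names, and the parameters) is a placeholder chosen for illustration, not the actual Data Integration script schema. Refer to the job configuration pages for the real parameters.

```python
import json
from typing import Any, Dict


def build_sync_job(
    reader_name: str,
    reader_params: Dict[str, Any],
    writer_name: str,
    writer_params: Dict[str, Any],
    channels: int = 1,
) -> Dict[str, Any]:
    """Builds a hypothetical sync job description as a plain dict."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return {
        # Hypothetical layout: one reader, one writer, shared settings.
        "reader": {"name": reader_name, "parameter": dict(reader_params)},
        "writer": {"name": writer_name, "parameter": dict(writer_params)},
        "setting": {"channels": channels},
    }


# Placeholder plug-in names and parameters, for illustration only.
job = build_sync_job(
    reader_name="example_mysql_reader",
    reader_params={"table": "orders", "column": ["id", "amount"]},
    writer_name="example_maxcompute_writer",
    writer_params={"table": "ods_orders", "column": ["id", "amount"]},
    channels=2,
)

# The JSON text is what would be edited by hand in a script-style workflow.
script = json.dumps(job, indent=2)
```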

Note:

  • The code generated in the wizard mode can be converted to script mode code. This conversion is one-way: the resulting code cannot be converted back to the wizard mode, because the capabilities of the script mode are a superset of those of the wizard mode.

  • Always configure the data source and create the target table before writing code.

Description of network types

Networks are classified as classic network, VPC, and local IDC network (in planning).

  • Classic network: A network that is centrally deployed on the Alibaba Cloud public infrastructure network and is planned and managed by Alibaba Cloud. The classic network is suitable for users who prioritize ease of use.

  • VPC network: An isolated network environment built on Alibaba Cloud. For this type of network, you have full control over your virtual network, including customizing the IP address range, partitioning network segments, and configuring routing tables and gateways.

  • Local IDC network: The network environment of your server room, which is isolated from the Alibaba Cloud network.

Supplemental instructions

  • Public network access is supported. In this case, select the classic network as the network type. Take note of the public network bandwidth and the associated network traffic charges when using this type of network. It is not recommended except in special cases.

  • To synchronize data on a network that is still in planning, you can use new local running resources along with the script mode. Alternatively, use the SHELL + DataX solution.

  • A Virtual Private Cloud (VPC) is an isolated network environment in which you can customize the IP address range, network segments, and gateways. VPC usage is expanding as VPC security improves. For this reason, Data Integration supports RDS for MySQL, RDS for SQL Server, and RDS for PostgreSQL on a VPC without requiring you to purchase an extra ECS instance on the same network as the VPC. Instead, the system automatically detects the instances through a reverse proxy to ensure connectivity. Support for other Alibaba Cloud databases, including PPAS, OceanBase, Redis, MongoDB, Memcache, TableStore, and HBase, is planned for the future. For any non-RDS data source on a VPC network, an ECS instance on the same network is required to configure Data Integration sync tasks and ensure connectivity.
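The VPC rule in the last item above can be summarized as a small lookup. This is only a sketch of the rule as stated on this page: the function name `needs_ecs_on_vpc` and the source type identifiers are hypothetical, and the set reflects the three RDS engines listed at the time of writing.

```python
# RDS engines that this page lists as reachable on a VPC through the
# reverse proxy. The identifier strings are hypothetical labels.
RDS_TYPES_SUPPORTED_ON_VPC = frozenset(
    {"rds_mysql", "rds_sqlserver", "rds_postgresql"}
)


def needs_ecs_on_vpc(source_type: str) -> bool:
    """Returns True if a data source on a VPC needs an ECS instance on the
    same network to run sync tasks, according to the rule described above."""
    return source_type.lower() not in RDS_TYPES_SUPPORTED_ON_VPC
```

For example, `needs_ecs_on_vpc("rds_mysql")` returns `False`, while `needs_ecs_on_vpc("mongodb")` returns `True`.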

Constraints and limitations

  • Only data sync tasks for structured data (such as RDS and DRDS), semi-structured data, and non-structured data (such as OSS and TXT, provided that the data in the specific task can be abstracted as structured data) are supported. In other words, Data Integration supports transmitting and synchronizing data that can be abstracted as a logical two-dimensional table. Other non-structured data, such as an MP3 audio file stored in OSS, currently cannot be synchronized to MaxCompute by Data Integration. Support for this is planned for the future.

  • Data sync and exchange are supported within a single region and between certain regions.

  • For certain regions, cross-region data transmission over the classic network is possible but not guaranteed. If you need cross-region transmission and the classic network cannot connect, consider using a public network connection instead.

  • Data Integration performs data sync (transmission) only. It does not provide a way to consume the data stream.
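To make the first constraint concrete, the following sketch shows what "abstracted as a logical two-dimensional table" can mean for a delimited text file: every line must map onto the same fixed set of columns. The function name `text_to_table` and the column names are assumptions for illustration and are not part of Data Integration.

```python
import csv
import io
from typing import Dict, List, Sequence


def text_to_table(
    text: str, columns: Sequence[str], delimiter: str = ","
) -> List[Dict[str, str]]:
    """Parses delimited text into rows keyed by column name.

    Raises ValueError if a line does not fit the column layout, that is,
    if the text cannot be treated as a two-dimensional table.
    """
    rows: List[Dict[str, str]] = []
    parsed = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, fields in enumerate(parsed, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(columns):
            raise ValueError(
                f"line {line_no}: expected {len(columns)} fields, "
                f"got {len(fields)}"
            )
        rows.append(dict(zip(columns, fields)))
    return rows
```

Text that passes this check has a row-and-column shape that a sync task could map to a target table. Content such as an audio file has no such shape, which is why it falls outside the supported cases.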

Help

  • For more information about configuring data sync tasks, see Create a Data Sync job.

  • For more information about processing non-structured data such as OSS data, see Access OSS data.
