
MaxCompute: Data migration tools

Last Updated: Mar 26, 2026

MaxCompute supports a range of tools for uploading and downloading data. Most are open source, with source code available on GitHub. Choose a tool based on your migration scenario — whether you need bulk batch transfers, incremental synchronization, or real-time log streaming.

Choose a tool

Two underlying transport mechanisms move data into and out of MaxCompute:

Transport | Best for | Managed by
Tunnel | Batch and incremental transfers (structured data from databases, files, and HDFS or Hive) | Alibaba Cloud (managed tools) or self-operated (open-source tools)
DataHub | Real-time log streaming (collects streaming data, then archives it into MaxCompute tables) | Self-operated (open-source tools)

Recommended starting point: Use an Alibaba Cloud managed tool (MaxCompute client, Data Integration, or DTS) if it supports your data source — these require less maintenance. Fall back to an open-source tool if you need broader source compatibility or a custom pipeline.

The following table maps common scenarios to the corresponding tool:

Scenario | Tool | Transport | Type
Upload or download data from the command line | MaxCompute client | Tunnel | Alibaba Cloud
Sync data between heterogeneous cloud data sources (full or incremental) | Data Integration (DataWorks) | Tunnel | Alibaba Cloud
Replicate data from ApsaraDB RDS or MySQL to MaxCompute in real time | Data Transmission Service (DTS) | Tunnel | Alibaba Cloud
Import data from relational databases, HDFS, or Hive; export to MySQL | Sqoop | Tunnel | Open source
Build visual ETL pipelines with drag-and-drop on Windows, UNIX, or Linux | Kettle | Tunnel | Open source
Collect and archive large-scale log data via Apache Flume | Apache Flume | DataHub | Open source
Collect application, system, or access logs via plug-in | Fluentd | DataHub | Open source
Build an end-to-end streaming pipeline with MaxCompute or StreamCompute | Logstash | DataHub | Open source
Sync incremental Oracle database changes to MaxCompute in real time | OGG | DataHub | Open source

Alibaba Cloud services

MaxCompute client

The MaxCompute client includes built-in Tunnel commands for uploading and downloading data, based on the Tunnel SDK. For the full command reference, see Tunnel commands. For installation and usage instructions, see MaxCompute client.

This is an open-source program. Source code is available at aliyun-odps-console.
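As a sketch, typical Tunnel commands entered inside the MaxCompute client look like the following; the project, table, and file names are placeholders, and available options may vary by client version (check `tunnel help` first):

```
-- Run inside the MaxCompute client (odpscmd).
-- Upload a local file into a table, using a comma as the field delimiter:
tunnel upload data.csv my_project.my_table -fd ",";

-- Download a table to a local file:
tunnel download my_project.my_table data.txt;
```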

Data Integration (DataWorks)

Data Integration is a data synchronization platform that supports full offline and incremental real-time synchronization, integration, and exchange between heterogeneous data storage systems on Alibaba Cloud.

Supported data sources: MaxCompute, ApsaraDB RDS (MySQL, SQL Server, and PostgreSQL), Oracle, FTP, AnalyticDB, Object Storage Service (OSS), ApsaraDB for Memcache, and PolarDB-X.

For more information, see Data Integration.

Data Transmission Service (DTS)

Data Transmission Service (DTS) supports data interaction between RDBMS, NoSQL, and OLAP data sources. It provides data migration, real-time data subscription, and real-time data synchronization.

DTS can synchronize data from ApsaraDB RDS and MySQL instances to MaxCompute tables in real time. Other source types are not supported.

For more information, see What is DTS?.

Open-source software

Sqoop

Sqoop 1.4.6 is extended to support MaxCompute. Use it to:

  • Import data from relational databases (such as MySQL), HDFS, or Hive into MaxCompute tables

  • Export data from MaxCompute tables to relational databases (such as MySQL)

Source code is available at aliyun-maxcompute-data-collectors.
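A MySQL-to-MaxCompute import with the extended Sqoop might look like the following minimal sketch. The `--odps-*` option names are assumptions based on the MaxCompute fork and may differ in your build; the JDBC connection details and credentials are placeholders. Run `sqoop help import` against your installed version to confirm the exact options.

```
# Hypothetical sketch; verify option names against your Sqoop build.
sqoop import \
  --connect jdbc:mysql://db-host:3306/source_db \
  --username db_user --password '********' \
  --table orders \
  --odps-table orders \
  --odps-project my_project \
  --odps-accessid <AccessKeyId> \
  --odps-accesskey <AccessKeySecret> \
  --odps-endpoint http://service.odps.aliyun.com/api
```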

Kettle

Kettle is an open-source extract, transform, load (ETL) tool written in Java. It runs on Windows, UNIX, and Linux, and provides a graphical interface for defining data transmission topologies using drag-and-drop components.

Source code is available at aliyun-maxcompute-data-collectors.

Apache Flume

Apache Flume is a distributed, reliable system for collecting large volumes of log data from multiple sources and storing it in a centralized location. It supports a wide range of Source and Sink plug-ins.

The DataHub Sink plug-in lets Apache Flume upload log data to DataHub in real time; DataHub then archives the data into MaxCompute tables.

Source code is available at aliyun-maxcompute-data-collectors.
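A minimal flume.conf sketch using the DataHub sink could look like the following. The sink class and `datahub.*` property names are assumptions based on the aliyun-maxcompute-data-collectors plug-in and may vary by version; the agent, source, channel, and sink names (a1, r1, c1, k1) and all credential and endpoint values are placeholders.

```
# Hypothetical flume.conf sketch; verify property names against your plug-in version.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail an application log file as the source.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

# Ship events to a DataHub topic.
a1.sinks.k1.type = com.aliyun.datahub.flume.sink.DatahubSink
a1.sinks.k1.datahub.accessId = <AccessKeyId>
a1.sinks.k1.datahub.accessKey = <AccessKeySecret>
a1.sinks.k1.datahub.endPoint = https://dh-cn-hangzhou.aliyuncs.com
a1.sinks.k1.datahub.project = my_project
a1.sinks.k1.datahub.topic = app_log_topic
a1.sinks.k1.channel = c1
```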

Fluentd

Fluentd is an open-source log collector for application logs, system logs, and access logs. Plug-ins let you filter log data and route it to data processors such as MySQL, Oracle, MongoDB, Hadoop, and Treasure Data.

The DataHub plug-in for Fluentd uploads log data to DataHub in real time and archives it into MaxCompute tables.
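A Fluentd configuration using the DataHub output plug-in might be sketched as follows. The `@type datahub` output and its parameter names are assumptions about the plug-in and may differ by version; paths, tags, and credential values are placeholders.

```
# Hypothetical td-agent.conf sketch; verify parameter names against your plug-in version.
<source>
  @type tail
  path /var/log/app/access.log
  tag app.access
  <parse>
    @type none
  </parse>
</source>

<match app.**>
  @type datahub
  access_id <AccessKeyId>
  access_key <AccessKeySecret>
  endpoint https://dh-cn-hangzhou.aliyuncs.com
  project_name my_project
  topic_name app_log_topic
</match>
```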

Logstash

Logstash is an open-source log collection and processing framework. The logstash-output-datahub plug-in imports data into DataHub, which is then archived into MaxCompute tables. Combined with MaxCompute or StreamCompute, Logstash supports end-to-end streaming pipelines from data collection through analysis.
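A Logstash pipeline using logstash-output-datahub could be sketched as below. The plug-in's option names are assumptions and may differ by version; the file path, endpoint, and credential values are placeholders.

```
# Hypothetical Logstash pipeline sketch; verify options against your plug-in version.
input {
  file {
    path => "/var/log/app/access.log"
    start_position => "beginning"
  }
}

output {
  datahub {
    access_id => "<AccessKeyId>"
    access_key => "<AccessKeySecret>"
    endpoint => "https://dh-cn-hangzhou.aliyuncs.com"
    project_name => "my_project"
    topic_name => "app_log_topic"
  }
}
```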

OGG

The DataHub plug-in for OGG (Oracle GoldenGate) synchronizes incremental changes from an Oracle database to DataHub in real time, then archives the data into MaxCompute tables.

Source code is available at aliyun-maxcompute-data-collectors.