MaxCompute: Data migration scenarios

Last Updated: Mar 17, 2026

MaxCompute provides a variety of tools that you can use to upload and download data. This topic describes three migration paths: migrating data from a Hadoop cluster, synchronizing data from a relational database, and collecting logs. The right tool depends on your source system type, whether you need batch or real-time transfer, and your operational environment.

| Scenario | Decision axis | Tools |
| --- | --- | --- |
| Migrate Hadoop data | Managed vs. distributed vs. visual | MMA, Sqoop, DataWorks |
| Synchronize data from a database | Source DB type + batch vs. real-time | DataWorks, OGG plug-in, Data Integration |
| Collect logs | Existing stack and throughput needs | Flume, Fluentd, Logstash |

Migrate Hadoop data

Choose a tool based on how much control you need over the migration process:

  • Managed service with minimal setup → use MaxCompute Migration Assist (MMA)

  • Distributed, parallel transfers on your existing cluster → use Sqoop

  • Visual workflow with scheduling and monitoring → use DataWorks

MaxCompute Migration Assist (MMA) is a managed service for migrating data from Hadoop clusters to MaxCompute.

Sqoop runs a MapReduce (MR) job on the source Hadoop cluster and transfers data to MaxCompute in a distributed manner. Use Sqoop when you need fine-grained control over parallelism or already have Sqoop pipelines in place. For details, see the Apache Sqoop documentation.

DataWorks provides a codeless UI for building migration workflows. It requires DataX as the underlying data transfer engine. Use DataWorks when you need scheduled jobs, task dependencies, or centralized monitoring across multiple migration tasks.
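The selection criteria in this section can be summarized as a simple lookup. The following Python sketch is for illustration only: the `Need` enum, the `HADOOP_TOOLS` table, and the `pick_hadoop_tool` function are hypothetical names and are not part of any MaxCompute SDK. It only restates the mapping described above.

```python
from enum import Enum


class Need(Enum):
    """Hypothetical labels for the three requirements listed in this section."""
    MANAGED = "managed service with minimal setup"
    DISTRIBUTED = "distributed, parallel transfers on the existing cluster"
    VISUAL = "visual workflow with scheduling and monitoring"


# Mapping taken from the bullet list in this section.
HADOOP_TOOLS = {
    Need.MANAGED: "MMA",
    Need.DISTRIBUTED: "Sqoop",
    Need.VISUAL: "DataWorks",
}


def pick_hadoop_tool(need: Need) -> str:
    """Return the tool this topic recommends for the given requirement."""
    return HADOOP_TOOLS[need]


print(pick_hadoop_tool(Need.DISTRIBUTED))
```

In practice a migration may combine requirements (for example, scheduled jobs on top of an existing Sqoop pipeline), so treat the lookup as a starting point rather than a rule.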

Synchronize data from a database

Tool selection depends on two factors:

  1. Source database type: each tool supports a different set of source databases.

  2. Synchronization policy: offline (batch) or real-time.

Offline batch synchronization

Supported source databases: MySQL, SQL Server, PostgreSQL, and other databases supported by DataWorks.

Use DataWorks to migrate data offline from relational databases. DataWorks provides a codeless UI to configure and schedule batch synchronization nodes.

Real-time synchronization

Two tools support real-time synchronization, each covering a different source database:

| Source database | Tool | How it works |
| --- | --- | --- |
| Oracle | OGG plug-in | Captures change data from Oracle and streams it to MaxCompute continuously |
| ApsaraDB RDS | Data Integration (in DataWorks) | Synchronizes data from ApsaraDB RDS databases in real time |

For ApsaraDB RDS setup, see Configure data sources for data synchronization from MySQL.
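The two-factor choice described in this section (source database type plus synchronization policy) can be sketched as a small function. This is a minimal illustration, not an official API: `pick_sync_tool`, `OFFLINE_SOURCES`, and `REALTIME_TOOLS` are hypothetical names, and the tables contain only the databases named in this topic. A return value of `None` means the combination is not covered here, not that it is unsupported.

```python
from typing import Optional

# Offline sources named in this topic. DataWorks supports others that are
# not listed here, so this set is intentionally incomplete.
OFFLINE_SOURCES = {"mysql", "sql server", "postgresql"}

# Real-time tools named in this topic, keyed by source database.
REALTIME_TOOLS = {
    "oracle": "OGG plug-in",
    "apsaradb rds": "Data Integration (in DataWorks)",
}


def pick_sync_tool(source_db: str, realtime: bool) -> Optional[str]:
    """Return the tool this topic lists for the source and policy, else None."""
    key = source_db.strip().lower()
    if realtime:
        return REALTIME_TOOLS.get(key)
    return "DataWorks" if key in OFFLINE_SOURCES else None


print(pick_sync_tool("Oracle", realtime=True))
print(pick_sync_tool("MySQL", realtime=False))
```

When the function returns `None`, check the DataWorks documentation for the full list of supported data sources before ruling a source out.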

Collect logs

To collect logs, you can use tools such as Flume, Fluentd, or Logstash. Choose based on your existing infrastructure:

| Tool | Best for |
| --- | --- |
| Flume | High-throughput log aggregation in Java-based environments |
| Fluentd | Multi-source collection with a flexible plugin ecosystem |
| Logstash | Logs that need parsing, enrichment, or filtering before ingestion |