MaxCompute provides a variety of tools that you can use to upload and download data. This topic describes three migration paths: migrating data from a Hadoop cluster, synchronizing data from a relational database, and collecting logs. The right tool depends on your source system type, whether you need batch or real-time transfer, and your operational environment.
| Scenario | Decision axis | Tools |
|---|---|---|
| Migrate Hadoop data | Managed vs. distributed vs. visual | MMA, Sqoop, DataWorks |
| Synchronize data from a database | Source DB type + batch vs. real-time | DataWorks, OGG plug-in, Data Integration |
| Collect logs | Existing stack and throughput needs | Flume, Fluentd, Logstash |
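The decision table above can be encoded as a small lookup. This is an illustrative helper, not an official API; only the tool names come from the table:

```python
# Hypothetical helper encoding the decision table above.
# The tool names are real; the function itself is illustrative.
def choose_tool(scenario: str, preference: str) -> str:
    """Map a migration scenario and a decision-axis preference to a tool."""
    options = {
        "hadoop": {
            "managed": "MMA",           # managed service, minimal setup
            "distributed": "Sqoop",     # parallel MR-based transfer
            "visual": "DataWorks",      # codeless UI with scheduling
        },
        "database": {
            "offline": "DataWorks",               # batch synchronization nodes
            "realtime-oracle": "OGG plug-in",     # Oracle change data capture
            "realtime-rds": "Data Integration",   # ApsaraDB RDS real-time sync
        },
        "logs": {
            "java-stack": "Flume",        # high-throughput, Java environments
            "multi-source": "Fluentd",    # flexible plugin ecosystem
            "parse-enrich": "Logstash",   # parsing/filtering before ingestion
        },
    }
    try:
        return options[scenario][preference]
    except KeyError:
        raise ValueError(f"unknown combination: {scenario}/{preference}")

print(choose_tool("hadoop", "managed"))  # → MMA
```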
## Migrate Hadoop data
Choose a tool based on how much control you need over the migration process:
- Managed service with minimal setup → use MaxCompute Migration Assist (MMA)
- Distributed, parallel transfers on your existing cluster → use Sqoop
- Visual workflow with scheduling and monitoring → use DataWorks
MaxCompute Migration Assist (MMA) is a managed service for migrating data from Hadoop clusters to MaxCompute.
Sqoop runs a MapReduce (MR) job on the original Hadoop cluster to transmit data to MaxCompute in a distributed manner. Use Sqoop when you need fine-grained control over parallelism or have existing Sqoop pipelines. For details, see the Apache Sqoop documentation.
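To illustrate the parallel-transfer model, here is a standard Apache Sqoop export with four parallel map tasks. Note this shows stock Sqoop against a generic JDBC target with placeholder connection values; writing directly to MaxCompute requires the Sqoop distribution provided for MaxCompute, so check its documentation for the MaxCompute-specific options:

```shell
# Export an HDFS directory to a JDBC target using 4 parallel map tasks.
# Stock Apache Sqoop shown; MaxCompute as a target needs the
# MaxCompute-provided Sqoop distribution. All values are placeholders.
sqoop export \
  --connect jdbc:mysql://db.example.com/warehouse \
  --username etl_user -P \
  --table orders \
  --export-dir /user/hive/warehouse/orders \
  --input-fields-terminated-by '\t' \
  -m 4   # number of map tasks in the MR job Sqoop launches
```

The `-m` flag is the knob behind "fine-grained control over parallelism": each map task transfers a slice of the data concurrently.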
DataWorks provides a codeless UI for building migration workflows. It requires DataX as the underlying data transfer engine. Use DataWorks when you need scheduled jobs, task dependencies, or centralized monitoring across multiple migration tasks.
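Since DataX is the underlying engine, a sync task ultimately corresponds to a DataX-style job configuration. The following is a minimal sketch (MySQL to MaxCompute via DataX's `mysqlreader` and `odpswriter` plug-ins); all connection values, table names, and credentials are placeholders, and the exact parameter set depends on your DataX version:

```json
{
  "job": {
    "setting": { "speed": { "channel": 2 } },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "etl_user",
            "password": "***",
            "column": ["id", "name", "gmt_create"],
            "connection": [
              {
                "table": ["orders"],
                "jdbcUrl": ["jdbc:mysql://db.example.com:3306/warehouse"]
              }
            ]
          }
        },
        "writer": {
          "name": "odpswriter",
          "parameter": {
            "project": "my_project",
            "table": "orders",
            "partition": "pt=20240101",
            "column": ["id", "name", "gmt_create"],
            "accessId": "***",
            "accessKey": "***",
            "odpsServer": "http://service.odps.aliyun.com/api"
          }
        }
      }
    ]
  }
}
```

In DataWorks you configure the same reader/writer pairing through the codeless UI rather than editing JSON by hand.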
## Synchronize data from a database
Tool selection depends on two factors:
- Source database type: offline synchronization covers common relational databases, while each real-time tool targets a specific source.
- Synchronization policy: offline (batch) or real-time.
### Offline batch synchronization
Source databases supported: MySQL, SQL Server, PostgreSQL (and others supported by DataWorks)
Use DataWorks to migrate data offline from relational databases. DataWorks provides a codeless UI to configure and schedule batch synchronization nodes.
For step-by-step configuration instructions, see Create a synchronization node.
### Real-time synchronization
Two tools support real-time synchronization, each covering a different source database:
| Source database | Tool | How it works |
|---|---|---|
| Oracle | OGG plug-in | Captures change data from Oracle and streams it to MaxCompute continuously |
| ApsaraDB RDS | Data Integration (in DataWorks) | Synchronizes data from ApsaraDB RDS databases in real time |
For ApsaraDB RDS setup, see Configure data sources for data synchronization from MySQL.
## Collect logs
To collect logs, you can use tools such as Flume, Fluentd, or Logstash. Choose based on your existing infrastructure:
| Tool | Best for |
|---|---|
| Flume | High-throughput log aggregation in Java-based environments |
| Fluentd | Multi-source collection with a flexible plugin ecosystem |
| Logstash | Logs that need parsing, enrichment, or filtering before ingestion |
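As one concrete example, a minimal Flume agent configuration that tails an application log into a memory channel follows. The file path is a placeholder, and the sink is a stand-in: actual ingestion into MaxCompute depends on which sink plug-in (e.g., a DataHub sink) you deploy, so replace the `logger` sink accordingly:

```properties
# Flume agent "a1": tail an application log into a memory channel.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Placeholder sink for demonstration; swap in the ingestion sink
# plug-in you install for MaxCompute/DataHub delivery.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Fluentd and Logstash follow the same source → buffer → output pattern in their own configuration formats.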