
ApsaraDB for SelectDB: Data import

Last Updated: Apr 03, 2025

ApsaraDB for SelectDB supports multiple data import methods, including native interfaces and ecosystem tools, to meet the requirements of different scenarios such as real-time stream processing and batch processing. This topic describes the core interfaces and tools that you can use to import data into a SelectDB instance.

Import method selection recommendations

  • Source data of Alibaba Cloud ecosystems: Use DTS or DataWorks.

  • Source data of non-Alibaba Cloud ecosystems: Select an import method based on the scenario.

    • Large amounts of data:

      • Data import interfaces: Broker Load or OSS Load

      • Data import tools: Spark or DataX

  • MySQL data sources:

    • MySQL data sources of Alibaba Cloud ecosystems: DTS (preferred)

    • MySQL data sources of non-Alibaba Cloud ecosystems: Flink (preferred)

For more information about the interfaces and tools, see Data import interfaces and Data import tools.

Data import interfaces

Stream Load (Recommended)

  • Description: Data is transmitted over HTTP. Stream Load is a synchronous interface: the import result is returned immediately after the request completes. A sample request is shown below.

  • Supported data formats: CSV, JSON, PARQUET, and ORC

  • Scenario: You want to import local files or data streams into a SelectDB instance in real time or in batches.

  • Reference: Import data by using Stream Load
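
The following command shows the general shape of a Stream Load request. It is a minimal sketch that assumes a local CSV file and uses placeholder values for the endpoint, database, table, and credentials; see the Stream Load reference for the exact parameters of your instance.

    # Submit a CSV file over HTTP; the result is returned synchronously
    # in the response body. The label makes retried requests idempotent.
    curl --location-trusted -u <user>:<password> \
        -H "label:example_stream_load_001" \
        -H "column_separator:," \
        -T /path/to/data.csv \
        http://<host>:<http_port>/api/<db_name>/<table_name>/_stream_load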

Routine Load

  • Description: Routine Load processes data streams in real time. A sample job is shown below.

  • Supported data formats: CSV and JSON

  • Scenario: You want to continuously import data from the data sources specified in long-running jobs into a SelectDB instance. Note: Only Kafka data sources are supported.

  • Reference: Import data by using Routine Load
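
The following statement sketches a Routine Load job that continuously consumes a Kafka topic. The database, table, topic, and broker addresses are placeholder assumptions; see the Routine Load reference for the full list of supported properties.

    -- Long-running job that continuously consumes a Kafka topic into example_tbl.
    CREATE ROUTINE LOAD example_db.example_job ON example_tbl
    COLUMNS TERMINATED BY ","
    PROPERTIES
    (
        "format" = "csv",               -- JSON is also supported
        "max_batch_interval" = "20"     -- scheduling interval, in seconds
    )
    FROM KAFKA
    (
        "kafka_broker_list" = "broker1:9092,broker2:9092",
        "kafka_topic" = "example_topic",
        "property.kafka_default_offsets" = "OFFSET_BEGINNING"
    );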

Broker Load

  • Description: You can import hundreds of GB of data into an instance at a time. Broker Load is an asynchronous interface.

  • Supported data formats: CSV, PARQUET, and ORC

  • Scenario: You want to read and import data from remote storage systems, such as Object Storage Service (OSS), Hadoop Distributed File System (HDFS), and Amazon Simple Storage Service (Amazon S3), into a SelectDB instance.

  • Reference: Import data by using Broker Load

OSS Load

  • Description: Data is transmitted over the internal network, which reduces Internet bandwidth consumption. You can import hundreds of GB of data into an instance at a time. A sample LOAD statement is shown below.

  • Supported data formats: CSV, PARQUET, and ORC

  • Scenario: You want to import data stored in Alibaba Cloud OSS into a SelectDB instance.

  • Reference: Import data by using OSS Load
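
Both Broker Load and OSS Load are submitted as asynchronous LOAD LABEL statements. The following sketch assumes S3-compatible access to an OSS bucket and uses placeholder endpoint, credential, and path values; property names can vary by version, so check the references above for your instance.

    -- Asynchronous job: submit it, then poll its status by label.
    LOAD LABEL example_db.example_label_001
    (
        DATA INFILE("s3://example-bucket/path/to/data.csv")
        INTO TABLE example_tbl
        COLUMNS TERMINATED BY ","
    )
    WITH S3
    (
        "AWS_ENDPOINT" = "oss-cn-hangzhou-internal.aliyuncs.com",  -- internal endpoint for OSS Load
        "AWS_ACCESS_KEY" = "<access_key_id>",
        "AWS_SECRET_KEY" = "<access_key_secret>",
        "AWS_REGION" = "cn-hangzhou"
    )
    PROPERTIES
    (
        "timeout" = "3600"
    );

    -- Check the progress of the asynchronous job.
    SHOW LOAD FROM example_db WHERE LABEL = "example_label_001";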

INSERT INTO

  • Description: The performance of INSERT INTO VALUES is poor. We recommend that you do not use INSERT INTO VALUES in the production environment. Examples of both variants are shown below.

  • Supported data formats: Data is read from databases and tables, so no file format is involved.

  • Scenario:

    • INSERT INTO VALUES is suitable for scenarios in which you want to import a small amount of data into a SelectDB instance at a frequency of less than once every five minutes.

    • INSERT INTO SELECT is suitable for scenarios in which you want to compute and process data inside a SelectDB instance or external data in the data lakehouse and import the results into a new table of the SelectDB instance.

  • Reference: Import data by using INSERT INTO
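
The following statements illustrate the two variants; the table and column names are placeholders.

    -- INSERT INTO VALUES: small amounts of data at low frequency only.
    INSERT INTO example_tbl (id, name) VALUES (1, 'alice'), (2, 'bob');

    -- INSERT INTO SELECT: compute over existing data and load the result
    -- into another table.
    INSERT INTO target_tbl
    SELECT id, COUNT(*) AS cnt
    FROM example_tbl
    GROUP BY id;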

Data import tools

DataWorks

  • Benefit: End-to-end management. DataWorks integrates task scheduling, data monitoring, and lineage analysis, and is seamlessly integrated with the Alibaba Cloud ecosystem.

  • Supported data sources: MySQL, ApsaraDB for ClickHouse, and StarRocks

  • Incremental data: Not supported

  • Historical data: Supported

  • Scenario: Complex data synchronization scenarios in which enterprise-level data needs to be integrated and tasks need to be orchestrated and monitored.

  • Reference: Use DataWorks to import data

DTS

  • Benefit: Real-time data synchronization. Data migration can be completed with second-level latency, and the resumable transfer and data verification features ensure migration reliability.

  • Supported data sources: MySQL and PostgreSQL

  • Incremental data: Supported

  • Historical data: Supported

  • Scenario: Highly reliable data migration scenarios in which cross-cloud or hybrid-cloud databases need to be synchronized in real time.

  • Reference: Use DTS to import data

Flink

  • Benefit: Unified stream-batch processing. Exactly-once semantics are supported for real-time data stream processing, and the data compute and import features are integrated to adapt to complex extract, transform, load (ETL) scenarios. A sample Flink SQL job is shown below.

  • Supported data sources: MySQL, Kafka, Oracle, PostgreSQL, and SQL Server

  • Incremental data: Supported

  • Historical data: Supported

  • Scenario: Scenarios in which real-time data warehouses are built and stream computing and data import need to be integrated.

  • Reference: Use Flink to import data
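
As a sketch, a Flink SQL job can write to a SelectDB table through the Doris-compatible connector. The connection values and the upstream kafka_source table are assumptions; see the Flink reference for the connector options supported by your instance.

    -- Sink table backed by the SelectDB (Doris-compatible) table.
    CREATE TABLE selectdb_sink (
        id   INT,
        name STRING
    ) WITH (
        'connector' = 'doris',
        'fenodes' = '<host>:<http_port>',
        'table.identifier' = 'example_db.example_tbl',
        'username' = 'admin',
        'password' = '<password>',
        'sink.label-prefix' = 'flink_example'
    );

    -- Continuously write rows from an upstream source (defined elsewhere).
    INSERT INTO selectdb_sink
    SELECT id, name FROM kafka_source;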

Kafka

  • Benefit: High-throughput pipeline. Terabyte-level data buffering is supported, and the persistence and multi-replica storage mechanisms prevent data loss.

  • Supported data sources: Kafka

  • Incremental data: Supported

  • Historical data: Supported

  • Scenario: Scenarios in which asynchronous data pipelines are used and producers and consumers need to be decoupled to achieve high-concurrency data buffering.

  • Reference: Use Doris Kafka Connector to import data

Spark

  • Benefit: Distributed computing. The Spark engine can process massive amounts of data in parallel, and flexible conversions between DataFrames and SQL queries are supported. A sample Spark SQL job is shown below.

  • Supported data sources: MySQL, PostgreSQL, HDFS, and S3

  • Incremental data: Supported

  • Historical data: Supported

  • Scenario: Batch import scenarios in which computing logic, such as SQL queries and DataFrames, needs to be combined to achieve large-scale ETL processing.

  • Reference: Import data by using Spark
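
In Spark SQL, the Doris-compatible connector can expose a SelectDB table as a temporary view for batch writes. This is a minimal sketch with placeholder connection values and an assumed source_view; see the Spark reference for details.

    -- Register the SelectDB table as a writable temporary view.
    CREATE TEMPORARY VIEW selectdb_sink
    USING doris
    OPTIONS (
        "table.identifier" = "example_db.example_tbl",
        "fenodes" = "<host>:<http_port>",
        "user" = "admin",
        "password" = "<password>"
    );

    -- Batch-write the result of an ETL query (source_view defined elsewhere).
    INSERT INTO selectdb_sink
    SELECT id, name FROM source_view;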

DataX

  • Benefit: Plug-in-based architecture. More than 20 data source plug-ins are supported, data is synchronized in batches, and enterprise-level heterogeneous data migration is supported.

  • Supported data sources: MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP

  • Incremental data: Not supported

  • Historical data: Supported

  • Scenario: Scenarios in which highly scalable plug-ins are required to synchronize multi-source heterogeneous data in batches.

  • Reference: Import data by using DataX

SeaTunnel

  • Benefit: Lightweight ETL. Configuration-driven development simplifies job setup, the Change Data Capture (CDC) feature captures data changes in real time, and the tool is compatible with the Flink and Spark engines.

  • Supported data sources: MySQL, Hive, and Kafka

  • Incremental data: Supported

  • Historical data: Supported

  • Scenario: Scenarios in which CDC jobs need to be configured in a simple, configuration-driven way and lightweight real-time data synchronization needs to be achieved.

  • Reference: Use SeaTunnel to import data

BitSail

  • Benefit: Multi-engine adaptation. Multiple computing frameworks, such as MapReduce and Flink, are supported, and a data sharding strategy is provided to improve data import efficiency.

  • Supported data sources: MySQL, Hive, and Kafka

  • Incremental data: Supported

  • Historical data: Supported

  • Scenario: Data migration scenarios in which compute frameworks, such as Flink and MapReduce (MR), need to be switched.

  • Reference: Use BitSail to import data