ApsaraDB for SelectDB supports a variety of data import methods, including native interfaces and ecosystem tools, to meet the requirements of different scenarios such as real-time stream processing and batch processing. This topic describes the core interfaces and tools that you can use to import data into a SelectDB instance.
Import method selection recommendations
- Source data of non-Alibaba Cloud ecosystems:
  - Data import interfaces:
    - Kafka data source: Routine Load (preferred)
    - Non-Kafka data source: Stream Load (preferred)
  - Data import tool: Flink
- Large amounts of data:
  - Data import interfaces:
    - Kafka data source: Routine Load (preferred)
    - Non-Kafka data source: Stream Load (preferred)
  - Data import tools: see the Data import tools section below.
For more information about the interfaces and tools, see Data import interfaces and Data import tools.
Data import interfaces
| Interface | Description | Supported data formats | Scenario | Reference |
| --- | --- | --- | --- | --- |
| Stream Load (Recommended) | | CSV, JSON, PARQUET, and ORC | You want to import local files or data streams into a SelectDB instance in real time or in batches. | |
| Routine Load | You can process data streams in real time. | CSV and JSON | You want to continuously import data from sources specified in long-running jobs into a SelectDB instance. Note: Only Kafka data sources are supported. | |
| Broker Load | | CSV, PARQUET, and ORC | You want to read data from remote storage systems, such as Object Storage Service (OSS), Hadoop Distributed File System (HDFS), and Amazon Simple Storage Service (Amazon S3), and import it into a SelectDB instance. | |
| OSS Load | | CSV, PARQUET, and ORC | You want to import data stored in Alibaba Cloud OSS into a SelectDB instance. | |
| INSERT INTO | The performance of | Data is read from databases and tables; no file format is involved. | | |
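To give a concrete sense of the recommended Stream Load interface, the following minimal Python sketch pushes a local CSV file over HTTP. It assumes the Doris-compatible Stream Load endpoint (`/api/{database}/{table}/_stream_load`) that SelectDB exposes; the host, port, credentials, database, table, and file name are placeholders that you must replace with the values of your own instance.

```python
import requests

# Placeholder connection details: replace with the connection address,
# HTTP port, and credentials of your SelectDB instance.
HOST = "selectdb-xxxx.example.com"
PORT = 8080
USER = "admin"
PASSWORD = "your_password"
DB = "demo_db"
TABLE = "demo_table"

url = f"http://{HOST}:{PORT}/api/{DB}/{TABLE}/_stream_load"
headers = {
    "label": "demo_load_001",   # a unique label makes retried loads idempotent
    "format": "csv",            # csv, json, parquet, and orc are supported
    "column_separator": ",",
    "Expect": "100-continue",
}

# Read the file into memory so the body can be resent if the server redirects.
with open("data.csv", "rb") as f:
    payload = f.read()

resp = requests.put(url, data=payload, headers=headers, auth=(USER, PASSWORD))
resp.raise_for_status()
print(resp.json())              # the returned JSON reports the load status and row counts
```

If the request succeeds, the response body is a JSON document whose `Status` field should be `Success`; submitting the same label again lets the server deduplicate retried loads instead of importing the data twice.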
Data import tools
| Tool | Benefit | Supported data sources | Incremental data | Historical data | Scenario | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| DataWorks | End-to-end management: The task scheduling, data monitoring, and lineage analysis features are integrated, and seamless integration with the Alibaba Cloud ecosystem is supported. | | Not supported | Supported | Complex data synchronization scenarios in which enterprise-level data needs to be integrated and tasks need to be orchestrated and monitored. | |
| DTS | Real-time data synchronization: Data can be migrated with second-level latency, and the resumable upload and data verification features ensure migration reliability. | | Supported | Supported | Highly reliable data migration scenarios in which cross-cloud or hybrid-cloud databases need to be synchronized in real time. | |
| Flink | Unified stream-batch processing: Exactly-once semantics are supported for real-time data stream processing, and the data compute and import features are integrated to adapt to complex extract, transform, load (ETL) scenarios. | | Supported | Supported | Scenarios in which real-time data warehouses need to be built and stream computing needs to be integrated with data import. | |
| Kafka | High-throughput pipeline: Terabyte-level data buffering is supported, and persistence and multi-replica storage mechanisms prevent data loss. | | Supported | Supported | Scenarios in which asynchronous data pipelines are used and producers and consumers need to be decoupled to achieve high-concurrency data buffering. | |
| Spark | Distributed computing: The Spark engine can process massive amounts of data in parallel, and flexible conversion between DataFrames and SQL queries is supported. | | Supported | Supported | Batch import scenarios in which computing logic, such as SQL queries and DataFrames, needs to be combined to achieve large-scale ETL processing. | |
| DataX | Plug-in-based architecture: More than 20 data source extensions and batch synchronization are supported, which enables enterprise-level heterogeneous data migration. | | Not supported | Supported | Scenarios in which highly scalable plug-ins are required to synchronize multi-source heterogeneous data in batches. | |
| SeaTunnel | Lightweight ETL: Configuration-driven development simplifies job setup, the Change Data Capture (CDC) feature captures data changes in real time, and the Flink and Spark engines are supported. | | Supported | Supported | Scenarios in which CDC needs to be configured in a lightweight, configuration-driven way to achieve real-time data synchronization. | |
| BitSail | Multi-engine adaptation: Multiple computing frameworks, such as MapReduce and Flink, are supported, and a data sharding strategy improves data import efficiency. | | Supported | Supported | Data migration scenarios in which compute frameworks, such as Flink and MapReduce (MR), need to be switched. | |
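Because Flink is the preferred tool in the selection recommendations above, the following PyFlink sketch shows one way a streaming job can write into SelectDB through the open source Flink Doris connector, with which SelectDB is compatible. This is a minimal example under several assumptions: the connector JAR is on the Flink classpath, and all addresses, credentials, and table identifiers are placeholders. In a production job, also enable checkpointing to get the exactly-once semantics described in the table.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming-mode table environment; assumes the Flink Doris connector JAR
# is already available on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A demo source that generates one row per second.
t_env.execute_sql("""
    CREATE TABLE demo_source (
        id INT,
        name STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '1'
    )
""")

# Sink into SelectDB; the address, credentials, and identifiers are placeholders.
t_env.execute_sql("""
    CREATE TABLE selectdb_sink (
        id INT,
        name STRING
    ) WITH (
        'connector' = 'doris',
        'fenodes' = 'selectdb-xxxx.example.com:8080',
        'table.identifier' = 'demo_db.demo_table',
        'username' = 'admin',
        'password' = 'your_password',
        'sink.label-prefix' = 'flink_demo'
    )
""")

# Submit the continuous INSERT job; for an unbounded source it runs until cancelled.
t_env.execute_sql(
    "INSERT INTO selectdb_sink SELECT id, name FROM demo_source"
).wait()
```

The same sink definition can be reused from Flink SQL or a DataStream job; only the source side changes when you replace the demo generator with a real data source such as Kafka.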