
AnalyticDB for MySQL: Data import methods

Last Updated: Jun 26, 2024

To meet a variety of data import requirements, AnalyticDB for MySQL allows you to import data by using external tables, DataWorks, Data Transmission Service (DTS), a Java Database Connectivity (JDBC)-based program, or the built-in data synchronization features. This topic describes the features and use scenarios of these import methods to help you select a suitable one.

Use external tables to import data

AnalyticDB for MySQL allows you to access different data sources and create external tables to map external data sources. This way, you can concurrently read external data and import the data to AnalyticDB for MySQL. This data import method maximizes the use of cluster resources to improve import performance.
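For example, you can map CSV objects stored in OSS to an external table and then import them by using the INSERT OVERWRITE INTO statement. The following is a minimal sketch; the table definitions, OSS path, and credentials are placeholders, and the exact TABLE_PROPERTIES fields vary by data source:

    -- Placeholder external table that maps CSV objects in an OSS directory.
    CREATE TABLE oss_orders_ext (
      order_id   BIGINT,
      amount     DECIMAL(15, 2),
      order_date DATE
    ) ENGINE = 'OSS'
    TABLE_PROPERTIES = '{
      "endpoint": "oss-cn-hangzhou-internal.aliyuncs.com",
      "url": "oss://my-bucket/orders/",
      "accessid": "yourAccessKeyId",
      "accesskey": "yourAccessKeySecret",
      "delimiter": ","
    }';

    -- Concurrently read the mapped OSS data and import it into an internal table.
    INSERT OVERWRITE INTO orders
    SELECT * FROM oss_orders_ext;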

Basic characteristics

  • This method is suitable for importing a large amount of data in a single job.

  • A data import job consumes large amounts of cluster resources to ensure high performance. We recommend that you use this method during off-peak hours.

  • The imported data is invisible when the data import job is in progress, and becomes visible only after the job is complete.

  • If you import data into a partition by using this method, the data of an existing partition that uses the same name is overwritten.

  • An import job that uses an external table also creates table indexes to improve query performance.

Common scenarios

  • Data initialization in data warehouses

    If you want to initialize and import terabytes of data to AnalyticDB for MySQL for analysis, we recommend that you store the data in OSS or HDFS and then use external tables to import the data.

  • Accelerated analysis of batch data

    Batch data is stored in batch data warehouses such as MaxCompute. Each day, gigabytes or terabytes of data are imported from batch data warehouses to AnalyticDB for MySQL for accelerated analysis.

Methods

You can import data by using the regular import or elastic import method. By default, the regular import method is used. Regular import reads data on compute nodes and creates indexes on storage nodes, which consumes both computing and storage resources. Elastic import reads data and creates indexes by using Serverless Spark jobs, which consumes the resources of job resource groups and generates fees. The following comparison describes the differences between the two methods.

Comparison between the regular import method and the elastic import method

Supported minor versions

  • Regular import: no limits.

  • Elastic import: V3.1.10.0 and later.

Use scenarios

  • Regular import:

    • You want to import a small amount of data.

    • No requirements are imposed on the speed and concurrency of import jobs.

  • Elastic import:

    • You want to consume fewer storage node resources to accelerate data import.

    • An import job involves multiple tables.

    • You want to import a large amount of data to partitioned tables.

Supported data sources

How to enable

  • Regular import: enabled by default.

  • Elastic import: enabled by default for AnalyticDB for MySQL clusters of V3.1.10 and later.

Limits

  • Regular import: no limits.

  • Elastic import:

    • You can use elastic import to import data to AnalyticDB for MySQL Data Lakehouse Edition (V3.0) only based on external tables and only by using the INSERT OVERWRITE INTO statement.

    • Take note of the following limits on SQL statements:

      • You can use elastic import to import data from a single table, but not from multiple joined tables.

      • Elastic import supports only the WHERE and LIMIT clauses. Other clauses such as ORDER BY are not supported.

      • In the SELECT statement of an elastic import job, you can use asterisks (*), column names, default values, or constants for the column and value fields. Other syntax such as SQL functions is not supported.

    • You can use elastic import to import data only to fact tables.

    • Elastic import supports the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, DATE, TIME, DATETIME, and TIMESTAMP.

Concurrent import jobs

  • Regular import: supports up to two concurrent jobs.

  • Elastic import: supports up to 32 concurrent jobs.

Note

To increase import concurrency, you can specify the adb.load.job.max.acu parameter to raise the maximum amount of resources that an elastic import job can use. The default value of adb.load.job.max.acu is the number of shards plus 1. You can set the parameter to K × the default value (K ≥ 1). We recommend that you set K to a value that is less than or equal to the number of partitions of the partitioned table in the current import job. For example, if a cluster has 16 shards, the default value is 17 ACUs, and setting K to 2 raises the limit to 34 ACUs. Use the following methods to query the number of shards and partitions:

  • Execute the following statement to query the number of shards in an AnalyticDB for MySQL cluster:

    SELECT count(1) FROM information_schema.kepler_meta_shards;
  • For information about how to view the number of partitions, see the "View the data size of a table" section of the Storage analysis topic.

Consumed resources

  • Regular import: reads source data by using interactive resource groups (resident compute nodes) and creates indexes on storage nodes. This process consumes a large amount of interactive resource group (resident compute node) and storage node resources.

    Note

    When a regular import job is running, storage node metrics such as CPU utilization and I/O usage remain high, which affects read and write performance.

  • Elastic import: reads source data and creates indexes by using job resource groups (dynamically scaled compute nodes). This process consumes almost no storage node resources: only a small amount is consumed to download the built hot partition data, including data files and indexes, to the storage nodes.

    Note

    An elastic import job consumes a small amount of storage node resources but does not consume the resources of interactive resource groups (resident compute nodes). On the monitoring page, storage node metrics such as CPU utilization and I/O usage remain low, and the transactions per second (TPS) metric is not displayed.

Import speed

  • Regular import: the import speed varies based on the import configuration, import method, and table schema. For information about how to optimize data import in different scenarios, see Optimize data import performance.

  • Elastic import: if the resources of the job resource group are sufficient, elastic import jobs can run at high concurrency and high speed. To improve the import speed, you can run multiple elastic import jobs in parallel and increase the maximum amount of resources that each elastic import job can use.

    Recommendations:

    • An elastic import job requires at least 2 to 3 minutes to complete and is not suitable for small amounts of data. If an import job must be completed within 3 minutes, we recommend that you use the regular import method.

    • An elastic import job takes longer to complete than a regular import job that uses the same amount of resources. If you want to accelerate an import job, we recommend that you increase the maximum amount of resources that the elastic import job can use.
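For reference, an elastic import job is submitted as an INSERT OVERWRITE statement with a hint. The following is a minimal sketch that assumes the elastic_load hint syntax and uses placeholder database, table, and ACU values; verify the syntax against your cluster's minor version:

    -- Placeholder names and resource limit; elastic_load_configs raises
    -- adb.load.job.max.acu for this job only.
    /*+ elastic_load=true, elastic_load_configs=[adb.load.job.max.acu=16ACU]*/
    INSERT OVERWRITE INTO adb_demo.target_table
    SELECT * FROM adb_demo.oss_external_table;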

Optimization of data import performance

For information about how to improve the performance of the external table-based data import method, see the "Optimize the performance when you use external tables to import data" section of the Optimize data import performance topic.

Use DataWorks to import data

DataWorks provides a visualized data import method that allows you to import data from a variety of data sources to AnalyticDB for MySQL. Compared with external tables, DataWorks is more suitable when you want to import only a small amount of data.

Note

To import several hundred gigabytes of data, we recommend that you use external tables instead of DataWorks. For more information, see the "Use external tables to import data" section of this topic.

Common scenarios

  • Data import every minute or hour

    You want to import small amounts of data to AnalyticDB for MySQL every minute or hour for analysis.

  • Data import from multiple external data sources

    You want to import data from multiple data sources such as Tablestore, Redis, and PostgreSQL to AnalyticDB for MySQL.

Methods

To use DataWorks to import data, perform the following steps:

  1. Configure the data source. The following data sources are supported: RDS for MySQL, Oracle, RDS for SQL Server, OSS, MaxCompute, and HDFS. For more information, see the configuration topic for the corresponding data source.

  2. Add an AnalyticDB for MySQL V3.0 data source.

  3. Configure the source and destination connections for a synchronization job.

Optimization of data import performance

For information about how to improve the performance of the DataWorks-based data import method, see the "Optimize the performance when you use DataWorks to import data" section of the Optimize data import performance topic.

Use DTS to import data

Data Transmission Service (DTS) is a real-time data streaming service. DTS supports data transmission between data sources such as relational database management system (RDBMS) databases, NoSQL databases, and online analytical processing (OLAP) databases. DTS has the following advantages over traditional data migration and synchronization tools: diverse features, high performance, security, reliability, and ease of use. DTS helps simplify data transmission and allows you to focus on business development. You can use DTS to import data to AnalyticDB for MySQL from a variety of data sources for real-time analysis.

Common scenarios

  • Real-time data synchronization within seconds

    You want to synchronize data in real time from RDS for MySQL or PolarDB for MySQL to AnalyticDB for MySQL for analysis.

  • Multi-source aggregation

    You want to synchronize data from multiple RDS for MySQL instances or PolarDB for MySQL clusters to an AnalyticDB for MySQL cluster for analysis. You can use the multi-table merging feature of DTS to synchronize multiple source tables that use the same schema to a single table of an AnalyticDB for MySQL cluster.

Methods

Use a JDBC-based program to import data

If external tables or DataWorks cannot meet your business requirements, you can write a JDBC-based program to import data. This method is suitable for data cleansing and other complex scenarios that involve unstructured data.
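In practice, a JDBC-based program typically writes parsed records by using batched INSERT statements to reduce network round trips. The following is a minimal sketch; the user_event_log table and its columns are hypothetical:

    -- Hypothetical table and columns. Batching multiple rows per statement
    -- improves write throughput over JDBC.
    INSERT INTO user_event_log (user_id, event_type, event_time) VALUES
      (1001, 'click', '2024-06-01 10:00:00'),
      (1002, 'view',  '2024-06-01 10:00:01'),
      (1003, 'click', '2024-06-01 10:00:02');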

Common scenarios

  • Import after data preprocessing

    You want to parse log files as they are generated and import them to AnalyticDB for MySQL in real time.

  • On-premises data import

    You want to import on-premises data that cannot be uploaded to OSS, HDFS, or MaxCompute to AnalyticDB for MySQL.

Usage notes

Optimization of data import performance

For information about how to improve the performance of the data import method that uses a JDBC-based program, see the "Optimize the performance when you use a JDBC-based program to import data" section of the Optimize data import performance topic.

Use data synchronization to import data

AnalyticDB for MySQL provides a variety of data synchronization features, such as AnalyticDB Pipeline Service (APS), metadata discovery, and Hive data migration. The data synchronization features allow you to discover and organize OSS metadata, synchronize data in real time from Simple Log Service (SLS) and ApsaraMQ for Kafka to AnalyticDB for MySQL, and migrate Hive data to OSS for subsequent data analysis.

Common scenarios

  • Low-cost storage and analysis

    If you want to store and analyze a large amount of logs such as SLS data and messages such as Kafka data in a cost-effective manner, you can use the APS feature to synchronize data in real time from SLS and Kafka to AnalyticDB for MySQL.

  • Metadata discovery

    A large amount of OSS data is difficult to manage and analyze due to the absence of metadata. AnalyticDB for MySQL uses the metadata discovery feature to build metadata layers, such as databases, tables, and partitions, and to identify data formats and field information to generate table schemas. The metadata that is built by the metadata discovery feature can be identified by other analysis engines for subsequent analysis.

  • Hive data migration

    If you want AnalyticDB for MySQL to perform operations on the data stored in Hive, including metadata management and subsequent analysis, you can use the Hive data migration feature to migrate Hive data to OSS. Then, AnalyticDB for MySQL automatically organizes the metadata and performs subsequent analysis.

Methods