Although AnalyticDB for MySQL provides several data import methods to meet requirements in different scenarios, import performance can be affected by a variety of factors. For example, inappropriate table modeling may lead to data skew, and inappropriate configurations for data import may lead to ineffective resource utilization. This topic describes how to optimize the data import performance in different scenarios.
Optimize the performance when you use external tables to import data
Check the distribution key
The distribution key determines how table data is concurrently imported by shard. If data is unevenly distributed, shards that receive large amounts of data take longer to import, which causes data skew issues and degrades the performance of the entire import task. To prevent these issues, make sure that data is evenly distributed before you import it. For information about how to select a distribution key, see the "Select a distribution key" section of the Schema design topic.
- Before data is imported, determine whether the distribution key is appropriate based on the meaning of the distribution key. For example, assume that the l_discount column is selected as the distribution key in the Lineitem table, and only 11 distinct values exist for the l_discount column. Data with the same value in the l_discount column is distributed to the same shard, which causes severe data skew issues and affects the data import performance. In this case, the l_orderkey column is a more appropriate distribution key. Data can be evenly distributed because the order ID values are all unique.
- After data is imported, determine whether the distribution key is appropriate based on the diagnostics on distribution key reasonability. For information about how to view the distribution key diagnostics, see the "Diagnostics on distribution field skew" section of the Data modeling diagnostics topic.
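As an illustration of the guidance above, the following sketch declares the Lineitem table with the high-cardinality l_orderkey column as the distribution key instead of the low-cardinality l_discount column. The column list is abbreviated, and the exact DDL options may vary by AnalyticDB for MySQL version:

```sql
-- Sketch: distribute by the high-cardinality order ID (l_orderkey)
-- rather than l_discount, which has only 11 distinct values, to avoid
-- data skew across shards.
CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_discount DECIMAL(15, 2) NOT NULL,
  l_shipdate DATE NOT NULL
  -- ... other columns omitted for brevity
)
DISTRIBUTED BY HASH(l_orderkey);
```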
Check the partition key
If a partition is written to AnalyticDB for MySQL by using INSERT OVERWRITE INTO SELECT, data of the existing partition that has the same name is overwritten. Data in each shard is imported to the corresponding partitions. We recommend that you do not import a large number of partitions at a time, because this triggers external sorting and reduces data import performance. For information about how to select a partition key, see the "Select a partition key" section of the Schema design topic.
- Before data is imported, determine whether the partition key is appropriate based on the business data requirements and data distribution. For example, assume that the l_shipdate column is used to partition the Lineitem table, and the table data spans seven years. If data is partitioned by year, the table contains only seven partitions. If data is partitioned by date, the table contains more than 2,000 partitions, and each partition contains only a small amount of data. In this case, partitioning the table by month is more appropriate.
- After data is imported, determine whether the partition key is appropriate based on the diagnostics on partition key reasonability. For information about how to view the partition key diagnostics, see the "Diagnostics on partition field reasonability" section of the Data modeling diagnostics topic.
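For example, a monthly partitioning scheme for the seven-year Lineitem data described above might be declared as follows. This is a sketch; the column list is abbreviated, and the LIFECYCLE clause (the number of partitions to retain) is optional and shown here only as an assumption:

```sql
-- Sketch: partition by month so that seven years of data map to roughly
-- 84 partitions, instead of the 2,000+ partitions produced by daily
-- partitioning.
CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_shipdate DATE NOT NULL
  -- ... other columns omitted for brevity
)
DISTRIBUTED BY HASH(l_orderkey)
PARTITION BY VALUE(DATE_FORMAT(l_shipdate, '%Y%m')) LIFECYCLE 84;
```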
Check the index
By default, AnalyticDB for MySQL creates an index on all columns when a table is created, and creating an index on every column of a wide table consumes additional resources. When you import data to a wide table, we recommend that you use a primary key index. The primary key index can be used for deduplication, but a primary key that contains a large number of columns reduces performance. For information about how to select a primary key index, see the "Select a primary key" section of the Schema design topic.
- Typically, you do not need to specify a primary key index in offline import scenarios because data has been deduplicated during offline computing.
- On the Monitoring Information page of the AnalyticDB for MySQL console, click the Table Information Statistics tab to view the amounts of table data, index data, and primary key index data. If the amount of index data exceeds the amount of table data, check whether the table contains columns of long character strings. Indexes on such columns may take a long time to create and occupy a large amount of storage space. We recommend that you delete such indexes. For more information, see the "Delete an index" section of the ALTER TABLE topic.
Note Primary key indexes cannot be deleted. If you want to delete a primary key index, you must create another table.
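A deletion of this kind might look like the following sketch. The index name is hypothetical; look up the actual name of the index on the long-string column in the console or system tables before you run the statement:

```sql
-- Sketch: drop the automatically created index on a long-string column
-- to reduce storage usage and speed up imports. The index name
-- l_comment_idx is a placeholder, not a real generated name.
ALTER TABLE lineitem DROP KEY l_comment_idx;
```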
Add a hint to accelerate import
Add a hint (/* direct_batch_load=true*/) before a data write statement to accelerate the write task. Sample statement:
submit job /* direct_batch_load=true*/ insert overwrite adb_table select * from adb_external_table;
Optimize the performance when you use DataWorks to import data
Optimize task configurations
- Optimize the value of Data Records Per Write
This parameter indicates the number of data records imported per batch. The default value is 2048. We recommend that you use the default value in most cases.
However, if the size of a single data record reaches hundreds of KB, we recommend that you modify the value of this parameter. For example, if the size of a single data record is 512 KB, you can set the parameter to 16. This ensures that the amount of data imported per batch does not exceed 8 MB and prevents high memory usage on frontend nodes.
- Optimize the channel control configurations
- The data synchronization performance of a data import task is proportional to the value of Expected Maximum Concurrency. We recommend that you increase the value of this parameter.
Notice The greater the value of Expected Maximum Concurrency, the more DataWorks resources are occupied. Set this parameter to an appropriate value.
- We recommend that you turn on Distributed Execution for better synchronization performance.
Common issues and solutions
- When a small amount of data is imported from the client, the database server can process the data in a timely manner, but the CPU and disk I/O resources are underutilized. As a result, the write response time and write transactions per second (TPS) may not meet your expectations.
Solution: Increase the values of Data Records Per Write and Expected Maximum Concurrency to increase the amount of imported data. The data import performance increases linearly with the amount of imported data.
- If the imported table has data skew issues, specific cluster nodes are overloaded and import performance is reduced. In this case, the overall CPU and disk I/O resources are underutilized, and the write response time is prolonged. On the Data Modeling Diagnostics tab of the Diagnostics and Optimization page, you can view the data skew diagnostic results of the imported table.
Solution: Redesign the table schema and import the data again. For more information, see Schema design.
Optimize the performance when you use a program and JDBC to import data
Optimize the client
- Batch data import from the client
- When you use a program and Java Database Connectivity (JDBC) to import data, we recommend that you batch import data to reduce network and connection overheads. Do not import a single record of data unless special requirements apply.
- We recommend that you import 2,048 data records per batch. If the size of a single data record reaches hundreds of KB, we recommend that you import no more than 8 MB of data per batch. You can calculate the optimal number of data records per batch by dividing 8 MB by the size of a single data record. A large amount of data imported per batch may cause high memory usage on frontend nodes and reduce the import performance.
- Concurrency configurations of the client
- When you import data from the client, we recommend that you import data concurrently, because a single process cannot fully utilize system resources and the batching operations on the client slow down the import. You can accelerate data import by using a concurrency policy.
- The appropriate degree of concurrency depends on the batch size, the data source, and the workloads of the host where the client is located, and is best determined through testing. For example, if the import performance does not meet your expectations, you can double the number of concurrent processes. If the import speed then decreases, gradually reduce the number of concurrent processes.
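The batching recommendation above can be sketched as a multi-value INSERT statement, which imports many records per round trip instead of one. The table and values are illustrative; in a JDBC program, each such statement (or each addBatch/executeBatch batch) would carry about 2,048 records, or fewer if individual records are large:

```sql
-- Sketch: import multiple records per statement to reduce network and
-- connection overheads. A real batch would contain about 2,048 rows,
-- capped at roughly 8 MB of data per batch.
INSERT INTO lineitem (l_orderkey, l_discount, l_shipdate)
VALUES
  (1, 0.04, '2023-01-01'),
  (2, 0.10, '2023-01-02'),
  (3, 0.07, '2023-01-03');
```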
Common issues and solutions
- Make sure that the data source can produce data at a high speed. If the data comes from other systems or files, check the client for output bottlenecks.
- Make sure that data can be processed at a high speed. Check whether data can be produced and consumed synchronously, and make sure that a sufficient amount of data is waiting to be imported to AnalyticDB for MySQL.
- While ensuring that the workloads of the host where the client is located are appropriate, check the CPU utilization and disk I/O usage to determine whether system resources are sufficient.