Although AnalyticDB for MySQL provides several data import methods to meet requirements in different scenarios, import performance can be affected by a variety of factors. For example, inappropriate table modeling may lead to data skew, and inappropriate configurations for data import may lead to ineffective resource utilization. This topic describes how to optimize the data import performance in different scenarios.
Optimize the performance when you use external tables to import data
Check the distribution key
The distribution key determines how table data is concurrently imported by shard. If data is unevenly distributed, shards that receive large amounts of data take longer to import, which causes data skew issues and degrades the performance of the entire import task. To prevent these issues, make sure that data is evenly distributed before you import it. For information about how to select a distribution key, see the "Select a distribution key" section of the Schema design topic.
- Before data is imported, determine whether the distribution key is appropriate based on the meaning of the distribution key. For example, assume that the l_discount column is selected as the distribution key in the Lineitem table, and only 11 distinct values exist for the l_discount column. Data with the same value in the l_discount column is distributed to the same shard, which causes severe data skew issues and affects the data import performance. In this case, the l_orderkey column is a more appropriate distribution key. Data can be evenly distributed because the order ID values are all unique.
- After data is imported, determine whether the distribution key is appropriate based on the diagnostics on distribution key reasonability. For information about how to view the distribution key diagnostics, see the "Diagnostics on distribution field skew" section of the Data modeling diagnostics topic.
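As an illustration of the guidance above, the following sketch declares the Lineitem table with the high-cardinality l_orderkey column as the distribution key instead of the low-cardinality l_discount column. The column list is abbreviated, and the exact DDL options may vary by AnalyticDB for MySQL version:

```sql
-- Sketch: distribute by the high-cardinality order ID (l_orderkey)
-- rather than l_discount, which has only 11 distinct values, to avoid
-- data skew across shards.
CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_discount DECIMAL(15, 2) NOT NULL,
  l_shipdate DATE NOT NULL
  -- ... other columns omitted for brevity
)
DISTRIBUTED BY HASH(l_orderkey);
```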
Check the partition key
If a partition is written to AnalyticDB for MySQL by using INSERT OVERWRITE INTO SELECT, data of the existing partition that has the same name is overwritten. Data in each shard is imported to the corresponding partitions. We recommend that you do not import a large number of partitions at a time, because this triggers external sorting and reduces data import performance. For information about how to select a partition key, see the "Select a partition key" section of the Schema design topic.
- Before data is imported, determine whether the partition key is appropriate based on the business data requirements and data distribution. For example, assume that the l_shipdate column is used to partition the Lineitem table, and the table data spans seven years. If data is partitioned by year, the table contains only seven partitions. If data is partitioned by date, the table contains more than 2,000 partitions, and each partition contains only a small amount of data. In this case, partitioning the table by month is more appropriate.
- After data is imported, determine whether the partition key is appropriate based on the diagnostics on partition key reasonability. For information about how to view the partition key diagnostics, see the "Diagnostics on partition field reasonability" section of the Data modeling diagnostics topic.
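For example, a monthly partitioning scheme for the seven-year Lineitem data described above might be declared as follows. This is a sketch; the column list is abbreviated, and the LIFECYCLE clause (the number of partitions to retain) is optional and shown here only as an assumption:

```sql
-- Sketch: partition by month so that seven years of data map to roughly
-- 84 partitions, instead of the 2,000+ partitions produced by daily
-- partitioning.
CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_shipdate DATE NOT NULL
  -- ... other columns omitted for brevity
)
DISTRIBUTED BY HASH(l_orderkey)
PARTITION BY VALUE(DATE_FORMAT(l_shipdate, '%Y%m')) LIFECYCLE 84;
```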
Check the index
By default, AnalyticDB for MySQL creates an index on all columns when a table is created, and creating an index on every column of a wide table consumes additional resources. When you import data to a wide table, we recommend that you use a primary key index. The primary key index can be used for deduplication, but a primary key that contains a large number of columns reduces performance. For information about how to select a primary key index, see the "Select a primary key" section of the Schema design topic.
- Typically, you do not need to specify a primary key index in offline import scenarios because data has been deduplicated during offline computing.
- On the Monitoring Information page of the AnalyticDB for MySQL console, click the Table Information Statistics tab to view the amounts of table data, index data, and primary key index data. If the amount of index data exceeds the amount of table data, check whether the table contains columns of long character strings. Indexes on such columns may take a long time to create and occupy a large amount of storage space. We recommend that you delete such indexes. For more information, see the "Delete an index" section of the ALTER TABLE topic.
Note Primary key indexes cannot be deleted. If you want to delete a primary key index, you must create another table.
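A deletion of this kind might look like the following sketch. The index name is hypothetical; look up the actual name of the index on the long-string column in the console or system tables before you run the statement:

```sql
-- Sketch: drop the automatically created index on a long-string column
-- to reduce storage usage and speed up imports. The index name
-- l_comment_idx is a placeholder, not a real generated name.
ALTER TABLE lineitem DROP KEY l_comment_idx;
```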
Add a hint to accelerate import
Add a hint (/* direct_batch_load=true*/) before a data write statement to accelerate the write task. Sample statement:
submit job /* direct_batch_load=true*/ insert overwrite adb_table select * from adb_external_table;
Optimize the performance when you use DataWorks to import data
Optimize task configurations
- Optimize the value of Data Records Per Write
This parameter indicates the number of data records imported per batch. The default value is 2048. We recommend that you use the default value in most cases.
However, if the size of a single data record reaches hundreds of KB, we recommend that you modify the value of this parameter. For example, if the size of a single data record is 512 KB, you can set the parameter to 16. This ensures that the amount of data imported per batch does not exceed 8 MB and prevents high memory usage on frontend nodes.
- Optimize the channel control configurations
- The data synchronization performance of a data import task is proportional to the value of Expected Maximum Concurrency. We recommend that you increase the value of this parameter.
Notice The greater the value of Expected Maximum Concurrency, the more DataWorks resources are occupied. Set this parameter to an appropriate value.
- We recommend that you turn on Distributed Execution for better synchronization performance.
Common issues and solutions
- When a small amount of data is imported from the client, the database server can process the data in a timely manner, but the CPU and disk I/O resources are underutilized. As a result, the write response time and write transactions per second (TPS) may not meet your expectations.
Solution: Increase the values of Data Records Per Write and Expected Maximum Concurrency to increase the amount of imported data. The data import performance increases linearly with the amount of imported data.
- If the imported table has data skew issues, specific cluster nodes are overloaded and import performance is reduced. In this case, the overall CPU and disk I/O resources are underutilized, and the write response time is prolonged. On the Data Modeling Diagnostics tab of the Diagnostics and Optimization page, you can view the data skew diagnostic results of the imported table.
Solution: Redesign the table schema and import the data again. For more information, see Schema design.
Optimize the performance when you use a program and JDBC to import data
Optimize the client
- Batch data import from the client
- When you use a program and Java Database Connectivity (JDBC) to import data, we recommend that you batch import data to reduce network and connection overheads. Do not import a single record of data unless special requirements apply.
- We recommend that you import 2,048 data records per batch. If the size of a single data record reaches hundreds of KB, we recommend that you import no more than 8 MB of data per batch. You can calculate the optimal number of data records per batch by dividing 8 MB by the size of a single data record. A large amount of data imported per batch may cause high memory usage on frontend nodes and reduce the import performance.
- Concurrency configurations of the client
- When you import data from the client, we recommend that you import data concurrently, because a single process cannot fully utilize system resources and the batching operations on the client slow down the import. You can accelerate data import by using a concurrency policy.
- The appropriate degree of concurrency depends on the batch size, the data source, and the workloads of the host where the client is located, and is best determined through testing. For example, if the import performance does not meet your expectations, you can double the number of concurrent processes. If the import speed then decreases, gradually reduce the number of concurrent processes.
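The batching recommendation above can be sketched as a multi-value INSERT statement, which imports many records per round trip instead of one. The table and values are illustrative; in a JDBC program, each such statement (or each addBatch/executeBatch batch) would carry about 2,048 records, or fewer if individual records are large:

```sql
-- Sketch: import multiple records per statement to reduce network and
-- connection overheads. A real batch would contain about 2,048 rows,
-- capped at roughly 8 MB of data per batch.
INSERT INTO lineitem (l_orderkey, l_discount, l_shipdate)
VALUES
  (1, 0.04, '2023-01-01'),
  (2, 0.10, '2023-01-02'),
  (3, 0.07, '2023-01-03');
```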
Common issues and solutions
- Make sure that the data source can produce data at a high speed. If the data comes from other systems or files, check the client for output bottlenecks.
- Make sure that data can be processed at a high speed. Check whether data can be produced and consumed synchronously, and make sure that a sufficient amount of data is waiting to be imported to AnalyticDB for MySQL.
- While ensuring that the workloads of the host where the client is located are appropriate, check the CPU utilization and disk I/O usage to determine whether system resources are sufficient.