
AnalyticDB: Optimize data import performance

Last Updated: Mar 28, 2026

AnalyticDB for MySQL supports multiple data import methods. Two root causes account for most performance problems: a poorly designed table schema that creates long-tail tasks, and under-configured import settings that leave cluster resources underutilized. This topic covers both.

Identify the bottleneck first

Before tuning, check these three metrics in the console to narrow down the cause:

  • Low client load: low CPU utilization, low disk I/O utilization, low write response time. Most likely cause: the client is not sending enough data; increase the batch size or concurrency.

  • Data skew: low CPU utilization, low disk I/O utilization, high write response time. Most likely cause: some nodes are overloaded; redesign the table schema.

Optimize data import from external tables

This section uses an example of importing 1 TB of TPC-H Lineitem data (about 6 billion records) from an external table to demonstrate how to tune the import process for optimal performance. For more information about data import, see INSERT OVERWRITE SELECT.
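The import pattern tuned throughout this section reduces to a single asynchronous job; as a minimal sketch, with `adb_table` and `adb_external_table` as placeholder names for the internal target table and the mapped external source table:

```sql
-- Submit the import as an asynchronous job (placeholder table names).
SUBMIT JOB
INSERT OVERWRITE adb_table
SELECT * FROM adb_external_table;
```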

Check the distribution key

The distribution key controls how data maps to hash partitions. Import runs concurrently across hash partitions, so if one partition receives significantly more data than the others, it becomes a long-tail node and slows the entire import.

When this applies: Run this check if data modeling diagnostics flag skew on the distribution field after import, or if you suspect low-cardinality columns are being used as the key.

Choose a distribution key with high cardinality. For example, in the Lineitem table, l_discount has only 11 distinct values — all rows sharing the same discount value go to the same partition, causing severe data skew. l_orderkey, where each order ID is unique, distributes data evenly across partitions.

  • Before import: Verify that the distribution key has enough distinct values to spread data evenly. For more information, see Select a distribution key.

  • After import: Go to Storage diagnostics to confirm data is distributed evenly. If the distribution field shows skew, the key needs to be changed.
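As a sketch of the Lineitem example, assuming AnalyticDB for MySQL's `DISTRIBUTE BY HASH` clause and an abbreviated column list:

```sql
CREATE TABLE lineitem (
  l_orderkey  BIGINT NOT NULL,        -- unique per order: high cardinality
  l_discount  DECIMAL(15, 2) NOT NULL, -- only 11 distinct values
  l_shipdate  DATE NOT NULL
  -- ... remaining Lineitem columns omitted for brevity
)
DISTRIBUTE BY HASH(l_orderkey);
-- DISTRIBUTE BY HASH(l_discount) would funnel all rows sharing one of
-- only 11 discount values into the same partition, causing severe skew.
```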

Check the partition key

INSERT OVERWRITE SELECT replaces existing subpartitions with the ones being imported. Within each hash partition, data loads into its corresponding subpartition. Importing many subpartitions at once triggers an external sort process, which degrades performance.

When this applies: Run this check if your data spans a wide time range and you are partitioning by a date or timestamp column.

Keep the number of subpartitions small. For example, if the Lineitem table has seven years of data in l_shipdate:

  • Partitioning by day creates more than 2,000 subpartitions (~30 million records each) — too many.

  • Partitioning by month or year (7 subpartitions) avoids triggering the external sort.

  • Before import: Choose a partition granularity that keeps the total subpartition count manageable. For more information, see Select a partition key.

  • After import: Go to Partitioned table diagnostics to verify subpartition counts are reasonable.
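Continuing the Lineitem sketch, a year-granularity partition key keeps the count at 7 subpartitions; this assumes the `PARTITION BY VALUE(DATE_FORMAT(...))` syntax:

```sql
CREATE TABLE lineitem (
  l_orderkey  BIGINT NOT NULL,
  l_shipdate  DATE NOT NULL
  -- ... remaining columns omitted for brevity
)
DISTRIBUTE BY HASH(l_orderkey)
-- Year granularity: 7 subpartitions for 7 years of data.
PARTITION BY VALUE(DATE_FORMAT(l_shipdate, '%Y'));
-- Day granularity, DATE_FORMAT(l_shipdate, '%Y%m%d'), would create
-- 2,000+ subpartitions and trigger the external sort.
```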

Check the index

AnalyticDB for MySQL indexes all columns by default. On wide tables, building indexes for every column consumes substantial resources and slows import significantly.

When this applies: Run this check if you are importing into a wide table, especially one with long string columns.

For wide tables, consider using only a primary key index. The primary key index is used for deduplication — using too many primary key columns degrades deduplication performance.

To evaluate your current index overhead:

  1. Go to Monitoring Information > Table Information Statistics.

  2. Compare the index data size to the table data size.

  3. If index data exceeds table data, look for long string columns. Building indexes on these columns is expensive. Delete unnecessary indexes using ALTER TABLE.

Note

You cannot delete a primary key index. To change it, recreate the table.

In offline import scenarios, data is typically deduplicated upstream by an offline computing job. In this case, a primary key index is not needed.
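A hypothetical cleanup following the steps above; `idx_l_comment` is an assumed index name on a long string column (list the actual index names on your table first, for example with `SHOW INDEX FROM lineitem`):

```sql
-- Drop a non-essential index on a long string column to cut index
-- build cost during import (index name is illustrative).
ALTER TABLE lineitem DROP KEY idx_l_comment;
```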

Accelerate import with hints

direct_batch_load hint

Add the direct_batch_load=true hint before the import statement to accelerate the import:

SUBMIT JOB /*+ direct_batch_load=true */
INSERT OVERWRITE adb_table
SELECT * FROM adb_external_table;
Note

This hint is supported only on Data Warehouse Edition clusters in elastic mode running version 3.1.5. If performance does not improve significantly after adding the hint, Submit a ticket.

Elastic import

Elastic import supports running multiple import tasks concurrently and lets you accelerate a single task by allocating more resources to it.

/*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
SUBMIT JOB
INSERT OVERWRITE adb_table
SELECT * FROM adb_external_table;

For parameter details, see Hint parameters and Data import methods.

Note

Elastic import has the following requirements:
  • Kernel version 3.1.10.0 or later

  • An Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster with a Job resource group created

  • Source data in MaxCompute, or OSS data in CSV, Parquet, or ORC format

  • Make sure the Job resource group has enough resources before starting. Insufficient resources can cause long wait times, longer task durations, or task failures.

Optimize data import from DataWorks

Tune task configuration

Number of records per batch insert

This parameter controls how many records are sent per batch. The default value is 2048 and works well for most cases.

If individual records are large (several hundred KB, up to 512 KB each), reduce this value; for 512 KB records, a value of 16 keeps each batch at or below 8 MB and prevents high memory usage on frontend nodes.

Channel Control

  • Maximum Concurrency: Import throughput scales linearly with this value. Set it as high as your DataWorks resource quota allows.

    Important

    Higher concurrency consumes more DataWorks resources. Increase this value incrementally and monitor resource usage.

  • Distributed Execution:

Troubleshooting

Low throughput despite no errors

If CPU utilization, disk I/O utilization, and write response time are all low, the cluster is idle — the client is not sending data fast enough. Increase Number of records per batch insert and Maximum Concurrency. Throughput scales linearly with import load.

Low CPU and I/O but high write response time

This pattern points to data skew: a subset of nodes is overloaded while others sit idle. Identify the affected table in the skew diagnostics on the Diagnostics And Optimization > Data Modeling Diagnostics page, then redesign the table schema and re-import. For schema guidance, see Table schema design.

Optimize data import from a JDBC program

Use batch import

Send data in batches rather than one record at a time to reduce network round-trips and connection overhead.

  • Target a batch size of 2,048 records.

  • If individual records are large (several hundred KB), calculate the batch size as: 8 MB ÷ size of one record. Batches larger than 8 MB cause high memory usage on frontend nodes and degrade performance.
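On the wire, a batch typically arrives as one multi-row INSERT rather than N single-row statements; with the MySQL JDBC driver, this rewrite requires the `rewriteBatchedStatements=true` connection parameter. An illustrative shape, with a placeholder table and columns:

```sql
-- One round-trip carrying three records instead of three round-trips.
INSERT INTO adb_table (id, name)
VALUES
  (1, 'alice'),
  (2, 'bob'),
  (3, 'carol');
```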

Use multiple concurrent threads

A single thread cannot saturate the cluster's write capacity. The client also needs time to fetch and batch data, which is typically slower than the database's import speed. Use multiple threads to keep data flowing continuously.

There is no universal optimal concurrency value — it depends on batch size, data source throughput, and client machine load. To find the right level:

  1. Start with a baseline concurrency.

  2. If throughput is below target, double the concurrency.

  3. If speed drops, reduce concurrency gradually until you find the peak.

Troubleshooting

Before assuming the cluster is the bottleneck, check the client:

  • Data source output speed: If data comes from another system or file, check for output bottlenecks there.

  • Data processing speed: Verify that data production and consumption are in sync — enough data must be ready to import at all times.

  • Client machine load: Check CPU utilization and disk I/O on the machine running the JDBC program.