Import performance in AnalyticDB for MySQL depends on three factors: table schema design, import job configuration, and client-side setup. Poor choices in any of these areas—mismatched distribution keys, over-partitioned tables, or insufficient client concurrency—can create bottlenecks that cap throughput regardless of cluster size. This topic explains how to diagnose and resolve bottlenecks for each import path.
Optimize data import from external tables
This section uses an example—importing 1 TB of TPC-H lineitem data (approximately 6 billion records) from an external table—to demonstrate how to tune import jobs for optimal performance. For more information about data import, see INSERT OVERWRITE SELECT.
Optimize in order: fix schema design first, then apply import acceleration.
Step 1: Choose a distribution key that avoids data skew
The distribution key determines how data is split across shards. AnalyticDB for MySQL imports data concurrently across shards, so an uneven distribution makes the most-loaded shard a bottleneck for the entire job.
Before importing: Evaluate the cardinality of your candidate column. A column with few distinct values causes severe skew. For example, the l_discount column has only 11 distinct values — all rows with the same discount land on the same shard. The l_orderkey column is a better choice because order IDs are unique, distributing rows evenly.
After importing: Check Diagnostics and Optimization > Data Modeling Diagnostics for skew warnings. Skew indicates that the chosen distribution key needs to be revised. For guidance on choosing a distribution key, see Select a distribution key.
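As a sketch of the schema choice described above, a lineitem definition that distributes on the high-cardinality l_orderkey column might look like the following (column list abbreviated; the DISTRIBUTE BY HASH clause is how AnalyticDB for MySQL declares the distribution key):

```sql
-- Sketch: distribute lineitem on l_orderkey (unique order IDs spread rows evenly).
-- Distributing on l_discount (only 11 distinct values) would funnel all rows
-- with the same discount value onto the same shard.
CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_partkey  BIGINT NOT NULL,
  l_discount DECIMAL(12, 2) NOT NULL,
  l_shipdate DATE NOT NULL
  -- ... remaining TPC-H columns omitted
)
DISTRIBUTE BY HASH(l_orderkey);
```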
Step 2: Choose a partition key with the right granularity
INSERT OVERWRITE SELECT writes new data partition by partition, overwriting existing data with the same partition name. Importing too many partitions at once causes external sorting overhead that degrades performance.
Before importing: Match partition granularity to your data volume and query patterns. For example, if the lineitem table spans seven years of l_shipdate data:
Partitioning by year creates 7 partitions.
Partitioning by day creates over 2,000 partitions, with roughly 3 million records each.
Partitioning by month or year is more appropriate in this case — fewer partitions per import means lower sorting overhead.
After importing: Check Diagnostics and Optimization > Data Modeling Diagnostics for partitioning issues. These indicate that the partition granularity needs adjustment. For guidance, see Select a partition key.
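A month-granularity partition key for this example could be declared as follows (a sketch with the column list abbreviated; PARTITION BY VALUE with DATE_FORMAT is the AnalyticDB for MySQL pattern for date-based partitions):

```sql
-- Sketch: month granularity yields roughly 84 partitions over 7 years,
-- instead of 2,000+ partitions at day granularity.
CREATE TABLE lineitem (
  l_orderkey BIGINT NOT NULL,
  l_shipdate DATE NOT NULL
  -- ... remaining TPC-H columns omitted
)
DISTRIBUTE BY HASH(l_orderkey)
PARTITION BY VALUE(DATE_FORMAT(l_shipdate, '%Y%m'));
```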
Step 3: Trim index configuration for wide tables
AnalyticDB for MySQL creates an index on every column by default. For wide tables, building a full-column index consumes significant compute and storage during import.
In batch import scenarios, skip the primary key index. Data is typically deduplicated during batch computing, so the index is unnecessary.
On the Monitoring Information > Table Information Statistics tab, compare the index data size to the table data size. If index data is larger, check whether the table has columns with long string values. Indexing such columns is slow and storage-intensive — drop the index using ALTER TABLE.
Keep the primary key lean. Using too many columns in the primary key degrades deduplication performance. For guidance, see Select a primary key.
A primary key index cannot be dropped. To change it, recreate the table.
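As a sketch, dropping the index on a long-string column looks like the following. This assumes the MySQL-compatible SHOW INDEX statement to list index names; l_comment_idx is a hypothetical name used for illustration:

```sql
-- List the table's indexes to find the one on the long-string column.
SHOW INDEX FROM adb_table;

-- Drop the index on a long VARCHAR column to cut index build cost during import.
-- 'l_comment_idx' is a hypothetical index name; use the one reported above.
ALTER TABLE adb_table DROP KEY l_comment_idx;
```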
Step 4: Apply import acceleration hints
After optimizing schema, apply hints to accelerate the import job.
Option 1 — direct_batch_load
Add the direct_batch_load=true hint to an import statement to accelerate the import:
SUBMIT JOB /*+ direct_batch_load=true*/
INSERT OVERWRITE adb_table
SELECT * FROM adb_external_table;

This hint is supported only on elastic clusters running Data Warehouse Edition (V3.0) with kernel version 3.1.5 or later. If performance does not improve noticeably after applying the hint, submit a ticket.
Option 2 — Elastic import
Elastic import lets you run multiple import jobs concurrently and allocate more resources to a single job. Add the elastic_load hint with a target resource group:
/*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
SUBMIT JOB INSERT OVERWRITE adb_table SELECT * FROM adb_external_table;

Elastic import requires:
Kernel version 3.1.10.0 or later
An Enterprise Edition, Basic Edition, or Data Lakehouse Edition (V3.0) cluster with a job resource group configured
Source data from MaxCompute or OSS in CSV, Parquet, or ORC format
Make sure the job resource group has sufficient resources before starting the job. Insufficient resources cause long wait times, slow execution, or job failures.
For a full list of hint parameters, see Hint parameters. For an overview of all import methods, see Data import methods.
Optimize data import from DataWorks
Tune job configuration
Data Records Per Write
This parameter controls the batch size for each write operation. Keep the default value unless records are large.
If a single record exceeds 512 KB, set Data Records Per Write to 16. This caps each batch at roughly 8 MB and prevents high memory pressure on frontend nodes.
Channel control
Increase Expected Maximum Concurrency as much as your workload allows — data synchronization performance scales directly with this value.
Enable Distributed Execution for better throughput.
Higher Expected Maximum Concurrency values consume more DataWorks resources. Set a value appropriate for your workload.
Troubleshoot DataWorks import issues
Symptom: Write TPS is low, and CPU utilization and disk I/O are also low
The client is not producing data fast enough. The database can handle more, but there is not enough data coming in.
Fix: Increase both Data Records Per Write and Expected Maximum Concurrency. Write TPS improves as client-side throughput increases.
Symptom: CPU utilization and disk I/O are low, but write response time is high
The target table has data skew. Some nodes are overloaded while others are idle, causing high latency despite low overall resource usage. Confirm by checking Diagnostics and Optimization > Data Modeling Diagnostics for skew indicators.
Fix: Redesign the table schema with a better distribution key, then re-import the data. See Design a table schema.
Optimize JDBC data import
Configure the client for bulk inserts
Batch size
Batch records on the client before sending them to AnalyticDB for MySQL. Single-row inserts add significant network and connection overhead.
Default target: 2,048 records per batch.
If individual records are large (hundreds of kilobytes), calculate the batch size as 8 MB ÷ size of a single record. This keeps each batch under the 8 MB limit and avoids memory pressure on frontend nodes.
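In SQL terms, client-side batching means sending one multi-row INSERT statement instead of one statement per record. A sketch, with column names abbreviated and values illustrative:

```sql
-- One batched statement carrying many rows: far less network and
-- connection overhead than 2,048 single-row INSERTs.
INSERT INTO adb_table (l_orderkey, l_partkey, l_discount, l_shipdate)
VALUES
  (1, 155190, 0.04, '1996-03-13'),
  (2, 106170, 0.00, '1997-01-28'),
  (3, 4297,   0.06, '1994-02-02');
  -- ... up to ~2,048 rows per statement, or 8 MB ÷ record size for large records
```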
Concurrency
Use multiple concurrent threads. A single thread rarely saturates the database ingestion capacity — client-side data processing and batching introduce delays that one thread cannot overcome.
The optimal thread count depends on batch size, data source characteristics, and client machine load. Find it through testing: if import speed is unsatisfactory, double the thread count. If speed drops, reduce concurrency incrementally until you find the optimal value.
Troubleshoot JDBC import issues
Before investigating the database side, rule out client-side bottlenecks:
Check data source throughput. If the source system or files have output limits, they cap the overall import rate regardless of database tuning.
Check data processing pace. Confirm that data production and consumption are synchronized and that enough data is ready to send at any given moment.
Check client machine load. If CPU utilization or disk I/O on the client machine is saturated, the client itself is the bottleneck — adding more threads will not help until resources free up.