Import a large volume of data to DRDS from a text file using programmatic batch insertion - PolarDB

When you need to migrate millions of rows from a MySQL database into a Distributed Relational Database Service (DRDS) database, loading data directly through a JDBC program gives you full control over throughput and batch size. This guide walks you through exporting the source data to a CSV file, creating the target table with a shard key, and running a Java batch-insert program to load the data.

Reference benchmark: 8,143,801 rows inserted in ~916 seconds (~9,000 TPS) on a client with an i5 CPU, 8 GB RAM, and SSD storage, against a 4-core 4 GB DRDS instance.

This approach is suited for one-time offline imports. For ongoing incremental replication, use Data Transmission Service (DTS).

Prerequisites

Before you begin, ensure that you have:

Access to the source MySQL database with SELECT ... INTO OUTFILE privilege
A running DRDS instance with credentials (host, port, database name, username, and password)
Java and the MySQL Connector/J JDBC driver configured in your project

How it works

Export the source table to a CSV file using SELECT INTO OUTFILE.
Create the target table in DRDS with a shard key that matches your access patterns.
Run a Java program that reads the CSV and inserts rows in batches using PreparedStatement.executeBatch().

Export source data

Export the source table to a CSV file on the MySQL server's local filesystem. Run the following on the MySQL client:

SELECT postingType, id, acceptedAnswer, parentId, score, tags
INTO OUTFILE '/tmp/stackoverflow.csv'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM post;

The output file does not include the table schema. Create the target table in DRDS separately in the next step.

Create the target table in DRDS

Create the table in DRDS and choose a shard key. The shard key determines how DRDS distributes rows across shards. Choose a column with high cardinality and even distribution — id is a good default for this schema.

CREATE TABLE `post` (
  `postingType` int NOT NULL,
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `acceptedAnswer` bigint(20) DEFAULT NULL,
  `parentId` bigint(20) DEFAULT NULL,
  `score` int DEFAULT NULL,
  `tags` varchar(128) DEFAULT NULL,
  PRIMARY KEY (`id`)
) DBPARTITION BY hash(id) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DBPARTITION BY hash(id) partitions the data across database shards by hashing the id column.

Import data with a Java program

Understand JDBC batch insert behavior

By default, MySQL Connector/J sends each addBatch() call as a separate INSERT statement, even when you call executeBatch(). Setting rewriteBatchedStatements=true rewrites those into a single multi-row INSERT, which significantly reduces network round trips.

Without rewriteBatchedStatements=true, the driver sends one statement per row:

INSERT INTO post (postingType, id, ...) VALUES (1, 101, ...);
INSERT INTO post (postingType, id, ...) VALUES (2, 102, ...);
INSERT INTO post (postingType, id, ...) VALUES (3, 103, ...);

With rewriteBatchedStatements=true, the driver combines them:

INSERT INTO post (postingType, id, ...) VALUES (1, 101, ...), (2, 102, ...), (3, 103, ...);

The following table describes the JDBC connection parameters used in this example:

Parameter	Description	Value
`autoReconnect`	Reconnects automatically after a dropped connection	`true`
`socketTimeout`	Prevents the connection from hanging indefinitely (milliseconds)	`600000` (10 minutes)
`rewriteBatchedStatements`	Rewrites `addBatch()` calls into a single multi-row `INSERT`	`true`

Run the import program

Add stackoverflow.csv to your project's classpath resources directory, then run the following Java program:

import java.io.*;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.sql.*;

public class Main {

    public static void main(String[] args)
            throws IOException, URISyntaxException, ClassNotFoundException, SQLException {

        // Load the CSV file from the classpath
        URL url = Main.class.getClassLoader().getResource("stackoverflow.csv");
        File dataFile = new File(url.toURI());

        String sql = "INSERT INTO post (postingType, id, acceptedAnswer, parentId, score, tags) "
                   + "VALUES (?, ?, ?, ?, ?, ?)";
        int batchSize = 10000;

        try (
            Connection connection = getConnection(
                "<drds-host>",       // Replace with your DRDS endpoint
                3306,
                "<database-name>",   // Replace with your database name
                "<username>",        // Replace with your username
                "<password>"         // Replace with your password
            );
            BufferedReader br = new BufferedReader(new FileReader(dataFile))
        ) {
            PreparedStatement st = connection.prepareStatement(sql);
            long startTime = System.currentTimeMillis();
            int batchCount = 0;
            String line;

            while ((line = br.readLine()) != null) {
                String[] data = line.split(",");

                st.setInt(1, Integer.valueOf(data[0]));
                st.setInt(2, Integer.valueOf(data[1]));
                st.setObject(3, "".equals(data[2]) ? null : Integer.valueOf(data[2]));
                st.setObject(4, "".equals(data[3]) ? null : Integer.valueOf(data[3]));
                st.setObject(5, "".equals(data[4]) ? null : Integer.valueOf(data[4]));
                if (data.length >= 6) {
                    st.setObject(6, data[5]);
                }

                st.addBatch();

                // Flush every batchSize rows to keep memory usage bounded
                if (++batchCount % batchSize == 0) {
                    st.executeBatch();
                    System.out.printf("Inserted %d rows%n", batchCount);
                }
            }

            // Flush remaining rows
            if (batchCount % batchSize != 0) {
                st.executeBatch();
            }

            long elapsed = System.currentTimeMillis() - startTime;
            System.out.printf(
                "Done: %d rows in %d seconds (~%d TPS)%n",
                batchCount, elapsed / 1000, batchCount / (elapsed / 1000)
            );
        }
    }

    /**
     * Creates a JDBC connection to the DRDS database.
     *
     * @param host     DRDS endpoint (for example, xxxxx.drds.aliyuncs.com)
     * @param port     Port number (default: 3306)
     * @param database Database name
     * @param username Username
     * @param password Password
     */
    private static Connection getConnection(
            String host, int port, String database, String username, String password)
            throws ClassNotFoundException, SQLException {

        Class.forName("com.mysql.jdbc.Driver");
        String url = String.format(
            "jdbc:mysql://%s:%d/%s?autoReconnect=true&socketTimeout=600000&rewriteBatchedStatements=true",
            host, port, database
        );
        return DriverManager.getConnection(url, username, password);
    }
}

Replace the following placeholders with your actual values:

Placeholder	Description	Example
`<drds-host>`	DRDS endpoint	`xxxxx.drds.aliyuncs.com`
`<database-name>`	Target database name	`mydb`
`<username>`	Database username	`root`
`<password>`	Database password	—

Tune batch size for your workload

The example uses a batch size of 10,000 rows. A larger batch size reduces the number of round trips but increases memory usage. Start with 5,000–10,000 rows per batch and adjust based on observed TPS and heap usage.

Verify the import

After the program finishes, confirm the row count on the DRDS database:

SELECT COUNT(*) FROM post;

Compare the result with the row count from the source table to confirm a complete import.

What's next

For ongoing incremental replication instead of a one-time bulk import, use Data Transmission Service (DTS).