All Products
Search
Document Center

Import big data through a program

Last Updated: Oct 09, 2020

This topic describes how to compile code to import a large volume of data to a Distributed Relational Database Service (DRDS) database offline.

Assume that a table in the current database needs to be imported to a DRDS database. The number of data records to be imported is about 8.14 million. See the following for the table schema:

  1. CREATE TABLE `post` (
  2. `postingType` int NOT NULL,
  3. `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  4. `acceptedAnswer` bigint(20) DEFAULT NULL,
  5. `parentId` bigint(20) DEFAULT NULL,
  6. `score` int DEFAULT NULL
  7. `tags` varchar(128) DEFAULT NULL,
  8. PRIMARY KEY (`id`)
  9. );

Export source data

When you migrate a large volume of data between databases, we recommend that you export the source data as a text file, and then import the file to the target database through a program or commands.

For the preceding post table, you can use SELECT INTO to export data from the MySQL database to the file named as stackoverflow.csv. Execute the following command on the MySQL client:

  1. SELECT postingType,id,acceptedAnswer,parentId,score,tags
  2. INTO OUTFILE '/tmp/stackoverflow.csv'
  3. FIELDS TERMINATED BY ','
  4. OPTIONALLY ENCLOSED BY '"'
  5. LINES TERMINATED BY '\n'
  6. FROM test_table;

Create a table in the DRDS database

The exported data file does not include the table schema, therefore you must create a table in the target DRDS database manually and set a shard key according to the actual situation.

For example, the following command partitions the data of the post table to database shards based on ID:

  1. CREATE TABLE `post` (
  2. `postingType` int NOT NULL,
  3. `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  4. `acceptedAnswer` bigint(20) DEFAULT NULL,
  5. `parentId` bigint(20) DEFAULT NULL,
  6. `score` int DEFAULT NULL,
  7. `tags` varchar(128) DEFAULT NULL,
  8. PRIMARY KEY (`id`)
  9. ) DBPARTITION BY hash(id) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Import data to the DRDS database

After you export the data file, you can read the file content through code, and then import the content to the DRDS database. To improve efficiency, we recommend that you insert data in batches.

The following is the demo of code compiled in Java.

Test case: 8,143,801 data records are inserted. It takes about 916 seconds. The TPS is about 9,000.

Configuration of the test client: i5, 8 GB, and SSD

Configuration of the test DRDS database: 4-core 4 GB

  1. public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException,
  2. SQLException {
  3. URL url = Main.class.getClassLoader().getResource("stackoverflow.csv");
  4. File dataFile = new File(url.toURI());
  5. String sql = "insert into post(postingType,id,acceptedAnswer,parentId,score,tags) values(?,?,?,?,?,?)";
  6. int batchSize = 10000;
  7. try (
  8. Connection connection = getConnection("XXXXX.drds.aliyuncs.com", 3306, "XXXXX",
  9. "XXXX",
  10. "XXXX");
  11. BufferedReader br = new BufferedReader(new FileReader(dataFile))) {
  12. String line;
  13. PreparedStatement st = connection.prepareStatement(sql);
  14. long startTime = System.currentTimeMillis();
  15. int batchCount = 0;
  16. while ((line = br.readLine()) != null) {
  17. String[] data = line.split(",");
  18. st.setInt(1, Integer.valueOf(data[0]));
  19. st.setInt(2, Integer.valueOf(data[1]));
  20. st.setObject(3, "".equals(data[2]) ? null : Integer.valueOf(data[2]));
  21. st.setObject(4, "".equals(data[3]) ? null : Integer.valueOf(data[3]));
  22. st.setObject(5, "".equals(data[4]) ? null : Integer.valueOf(data[4]));
  23. if (data.length >= 6) {
  24. st.setObject(6, data[5]);
  25. }
  26. st.addBatch();
  27. if (++batchCount % batchSize == 0) {
  28. st.executeBatch();
  29. System.out.println(String.format("insert %d record", batchCount));
  30. }
  31. }
  32. if (batchCount % batchSize != 0) {
  33. st.executeBatch();
  34. }
  35. long cost = System.currentTimeMillis() - startTime;
  36. System.out.println(String.format("Take %d second,insert %d record, tps %d", cost/1000,batchCount, batchCount/(cost/1000) ));
  37. }
  38. }
  39. /**
  40. * Obtain the database connection
  41. *
  42. * @param host Database address
  43. * @param port Port number
  44. * @param database Database name
  45. * @param username Username
  46. * @param password Password
  47. * @return
  48. * @throws ClassNotFoundException
  49. * @throws SQLException
  50. */
  51. private static Connection getConnection(String host, int port, String database, String username, String password)
  52. throws ClassNotFoundException, SQLException {
  53. Class.forName("com.mysql.jdbc.Driver");
  54. String url = String.format(
  55. "jdbc:mysql://%s:%d/%s?autoReconnect=true&socketTimeout=600000&rewriteBatchedStatements=true", host, port,
  56. database);
  57. Connection con = DriverManager.getConnection(url, username, password);
  58. return con;
  59. }