ApsaraDB for SelectDB: Use INSERT INTO to import data

Last Updated: Apr 27, 2025

ApsaraDB for SelectDB is compatible with the standard SQL syntax. You can use standard INSERT INTO statements to import data.

Background information

INSERT INTO statements are commonly used to import data into databases such as MySQL. ApsaraDB for SelectDB is compatible with the standard SQL syntax and allows you to execute INSERT INTO statements to import data. INSERT INTO statements are divided into the following two types:

  • INSERT INTO tbl SELECT ...

  • INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1, 3, ...);

    Important

    The write performance of the INSERT INTO VALUES statement is low. We recommend that you do not use this statement in a production environment. If you need to write small batches of data to an ApsaraDB for SelectDB instance, we recommend that you use Stream Load instead, which can provide up to a hundredfold improvement in write performance.

Usage notes

If you frequently write small amounts of data to an ApsaraDB for SelectDB instance, the performance of the instance significantly decreases and deadlocks may occur on tables. We recommend that you merge small amounts of data and write them in a single batch, and that you keep the interval between writes to a table greater than 10 seconds.

INSERT INTO SELECT statement

The INSERT INTO SELECT statement allows you to use the rich set of SQL functions and the federated query capability of SelectDB to efficiently process data that is stored in SelectDB or in external data lakes, and then import the processed data into a new table in SelectDB for further analysis.

Perform ETL operations on data in an internal table

If the data is already stored in a table in SelectDB, you can execute the INSERT INTO SELECT statement to perform extract, transform, and load (ETL) operations on the data and import the results into a new table in ApsaraDB for SelectDB. For example, you can execute the following statement to import the rows whose region column is bj from the store_sales table to a new table named bj_store_sales:

INSERT INTO bj_store_sales
SELECT id, total, user_id, sale_timestamp FROM store_sales WHERE region = "bj";

Synchronize data from a data lake

If data is stored in an external system such as a data lake, you can create a catalog in SelectDB, map the catalog to the data in the external system, and then execute the INSERT INTO SELECT statement to import the data to a table in SelectDB.

You can use a catalog to integrate data sources, such as Hive, Iceberg, Hudi, Elasticsearch, and Java Database Connectivity (JDBC) sources, with ApsaraDB for SelectDB. The following example shows how to use a catalog to integrate a Hive data source with an ApsaraDB for SelectDB instance and synchronize data from a data lake to a table in the instance. For more information about other data sources, see Data lakehouse.

  1. Connect to an ApsaraDB for SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance by using a MySQL client.

  2. Create a catalog and use the catalog to integrate a Hive data source with the ApsaraDB for SelectDB instance. For more information, see Hive data source.
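
    The following statement is a minimal sketch of how such a catalog can be created. The catalog name hive_catalog matches the name used in the later steps; the metastore address is a placeholder that you must replace with the actual value of your Hive data source.

    CREATE CATALOG hive_catalog PROPERTIES (
        "type" = "hms",
        "hive.metastore.uris" = "thrift://<Hive metastore address>:9083"
    );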

  3. Optional. Create a database named hive_db.

    If you have created the destination database, skip this step.

    create database hive_db;
  4. Switch to the destination database.

    use hive_db;
  5. Create a table.

    If you have created the destination table, check whether the data types of the columns in the destination table are mapped to those in the source table of the Hive data source. For more information, see Column data type mappings.

    If you have not created the destination table, create the destination table and make sure that the data types of the columns in the destination table are mapped to those in the source table of the Hive data source. For more information, see Column data type mappings.

    CREATE TABLE test_Hive2SelectDB
    (
        id int,
        name varchar(50),
        age int
    )
    DISTRIBUTED BY HASH(id) BUCKETS 4
    PROPERTIES("replication_num" = "1");
  6. Optional. View table data.

    select * from test_Hive2SelectDB;

  7. Migrate data.

    Execute the INSERT INTO SELECT statement to synchronize data from the Hive data source to the table in ApsaraDB for SelectDB and specify a unique label for the data import job.

    INSERT INTO test_Hive2SelectDB WITH LABEL test_label SELECT * FROM hive_catalog.testdb.hive_t;
  8. Query data.

    Query the destination table and compare the returned data with the data in the source table to verify that the data is synchronized.

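    To verify the synchronization, compare the destination table with the source table. The following queries reuse the table and catalog names from the preceding steps.

    SELECT COUNT(*) FROM test_Hive2SelectDB;
    SELECT COUNT(*) FROM hive_catalog.testdb.hive_t;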

INSERT INTO VALUES statement

The INSERT INTO VALUES statement is commonly used to write data to databases such as MySQL. We recommend that you use this statement only in a test environment. Typically, you send data write requests by using a SQL client or a JDBC program.

The following sample statement creates the destination table that is used in the following examples.

CREATE TABLE test_table
(
    id int,
    name varchar(50),
    age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");

Import data by using a SQL client

BEGIN;
INSERT INTO test_table VALUES (1, 'Zhang San', 32),(2, 'Li Si', 45),(3, 'Zhao Liu', 23);
INSERT INTO test_table VALUES (4, 'Wang Yi', 32),(5, 'Zhao Er', 45),(6, 'Li Er', 23);
INSERT INTO test_table VALUES (7, 'Li Yi', 32),(8, 'Wang San', 45),(9, 'Zhao Si', 23);
COMMIT;

Import data by using a JDBC program

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InsertBatchDemo {
    public static void main(String[] args) throws Exception {
        // The number of INSERT statements that are executed in a batch. 
        int insertNum = 10;
        // The number of data rows that each INSERT statement writes. 
        int batchSize = 10000;
        // You can log on to the ApsaraDB for SelectDB console to view the virtual private cloud (VPC) endpoint or the public endpoint in the Network Information section of the Instance Details page. 
        String url = "jdbc:mysql://<host IP address>:<MySQL protocol port>/test_db?useLocalSessionState=true";
        // The username and password of the account that is used to log on to the ApsaraDB for SelectDB instance. 
        Connection connection = DriverManager.getConnection(url, "admin", "password");
        Statement statement = connection.createStatement();
        // Start a transaction so that all INSERT statements are committed together. 
        statement.execute("begin;");
        // Append multiple INSERT statements. 
        for (int num = 0; num < insertNum; num++) {
            StringBuilder sql = new StringBuilder();
            sql.append("INSERT INTO test_table VALUES ");
            for (int i = 0; i < batchSize; i++) {
                if (i > 0) {
                    sql.append(",");
                }
                // Append a row of data that contains fields such as name and age. You can modify the fields based on your business requirements. 
                sql.append("(1, 'Zhang San', 32)");
            }
            // Add the statement to the batch: INSERT INTO tbl VALUES (...),(...),(...)
            statement.addBatch(sql.toString());
        }
        statement.addBatch("commit;");
        statement.executeBatch();
        // Close the resources. 
        statement.close();
        connection.close();
    }
}

Best practices

  • Returned results

    An INSERT INTO operation is a synchronous operation. If a result is returned, the operation is complete. Handle the operation based on the returned result, as described in the following cases:

    1. The INSERT INTO operation is successful and the result set is empty.

      If the result set of the SELECT clause in the INSERT INTO statement is empty, a result similar to the following output is returned:

      INSERT INTO tbl1 SELECT * FROM empty_tbl;
      Query OK, 0 rows affected (0.02 sec)

      Query OK indicates that the operation is successful. 0 rows affected indicates that no data is imported.

    2. The INSERT INTO operation is successful and the result set is not empty.

      If the result set of the SELECT clause in the INSERT INTO statement is not empty, a result similar to one of the following outputs is returned:

      INSERT INTO tbl1 SELECT * FROM tbl2;
      Query OK, 4 rows affected (0.38 sec)
      {'label':'insert_8510c568-9eda-****-9e36-6adc7d35291c', 'status':'visible', 'txnId':'4005'}
       
      INSERT INTO tbl1 with label my_label1 SELECT * FROM tbl2;
      Query OK, 4 rows affected (0.38 sec)
      {'label':'my_label1', 'status':'visible', 'txnId':'4005'}
       
      INSERT INTO tbl1 SELECT * FROM tbl2;
      Query OK, 2 rows affected, 2 warnings (0.31 sec)
      {'label':'insert_f0747f0e-7a35-****-affa-13a235f4020d', 'status':'visible', 'txnId':'4005'}
       
      INSERT INTO tbl1 SELECT * FROM tbl2;
      Query OK, 2 rows affected, 2 warnings (0.31 sec)
      {'label':'insert_f0747f0e-7a35-****-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'}

      Query OK indicates that the operation is successful. 4 rows affected indicates that four rows of data are imported. 2 warnings indicates that two rows of data are filtered out. A JSON string is also returned.

      {'label':'my_label1', 'status':'visible', 'txnId':'4005'}
      {'label':'insert_f0747f0e-7a35-****-affa-13a235f4020d', 'status':'committed', 'txnId':'4005'}
      {'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other error'}

      label indicates the label that you specified or that was automatically generated. A label identifies the import job of an INSERT INTO statement and is unique within a database. status indicates whether the imported data is visible: visible is returned if the data is visible, and committed is returned if the data is not yet visible. txnId indicates the ID of the import transaction that corresponds to the INSERT operation. err indicates unexpected errors.

      To query the rows that are filtered out, execute the following statement:

      SHOW LOAD WHERE label="xxx";

      The URL in the returned results can be used to query the error data. For more information, see the error row summary in this topic. The imported data may be temporarily invisible, but the batch of data eventually becomes visible. You can execute the following statement to check whether the batch of data is visible:

      SHOW TRANSACTION WHERE id=4005;

      If visible is displayed in the TransactionStatus column in the returned results, the data is visible.

    3. The INSERT INTO operation fails.

      If the INSERT INTO operation fails, no data is imported and a result similar to the following output is returned:

      INSERT INTO tbl1 SELECT * FROM tbl2 WHERE k1 = "a";
      ERROR 1064 (HY000): all partitions have no load data. url: http://10.74.167.16:8042/api/_load_error_log?file=__shard_2/error_log_insert_stmt_ba8bb9e158e4879-ae8de8507c0bf8a2_ba8bb9e158e4879_ae8de8507c0bf8a2

      ERROR 1064 (HY000): all partitions have no load data indicates the cause of the failure. The URL in the returned result can be used to query the error data:

      SHOW LOAD WARNINGS ON "url";
  • Timeout period

    The timeout period of an INSERT INTO operation is specified by the query_timeout session variable. The default timeout period is 5 minutes. If an import job is not complete within the timeout period, the system cancels the job. You can query or change the timeout period for the current session, as shown in the following example.
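
    For example, the following statements query the current timeout period and extend it to 10 minutes for the current session. The value of 600 seconds is only an illustrative choice.

    SHOW VARIABLES LIKE 'query_timeout';
    SET query_timeout = 600;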

  • Label and atomicity

    An INSERT INTO operation ensures the atomicity of an import job: either all data in the job is imported, or no data is imported. If you use common table expressions (CTEs) to define the query part of an INSERT INTO statement, you must specify the WITH LABEL clause and the column list, as shown in the following example.
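
    In the following sketch, the table names tbl1 and tbl2, the label my_cte_label, and the columns k1 and k2 are placeholders.

    INSERT INTO tbl1 WITH LABEL my_cte_label (k1, k2)
    WITH cte AS (SELECT k1, k2 FROM tbl2)
    SELECT k1, k2 FROM cte;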

  • Filtering threshold

    If you import data by using an INSERT INTO statement, you cannot specify a filtering threshold by using the max_filter_ratio parameter. The default filtering threshold is 1, which indicates that all error rows can be ignored.

    In business scenarios in which error rows cannot be tolerated, set the enable_insert_strict session variable to true. This way, the INSERT INTO statement fails if any data is filtered out, as shown in the following sketch.
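
    The following sketch enables the strict mode for the current session before an import. The table names tbl1 and tbl2 are the same placeholders that are used earlier in this topic.

    SET enable_insert_strict = true;
    -- This statement fails if any row is filtered out.
    INSERT INTO tbl1 SELECT * FROM tbl2;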

  • Performance issues

    We recommend that you do not use the INSERT INTO VALUES statement to import data, especially in production environments that process large amounts of data. If you must use the INSERT INTO VALUES statement, merge multiple rows of data into a single INSERT INTO statement to import the data in batches. We recommend that you import 1,000 to 1,000,000 rows in a single batch.

  • Update of specific columns

    By default, an INSERT INTO statement writes data by rows. For tables that use the Unique data model with the merge-on-write (MOW) implementation, you can configure the following session variable to update only specific columns based on your business requirements.

    set enable_unique_key_partial_update=true

    For more information about variables, see Variable management.

    Important
    • This variable takes effect only for tables that use the Unique data model with the merge-on-write (MOW) mode.

    • If you set both the enable_unique_key_partial_update and enable_insert_strict variables to true, you can execute the INSERT INTO statement only to update existing rows. The default value of the enable_insert_strict variable is true, which enables the strict mode for INSERT statements. In the strict mode, if the INSERT INTO statement contains keys that do not exist in the table, an error is reported.

    • If you want the INSERT INTO statement to both update specific columns of existing rows and insert new rows, set the enable_unique_key_partial_update session variable to true and the enable_insert_strict session variable to false. For more information about variable settings, see Configure variables.
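
    The following statements are a minimal sketch of a partial-column update. The table name partial_demo, its schema, and the property that enables the MOW mode are illustrative assumptions; adapt them to your own table.

    CREATE TABLE partial_demo
    (
        id int,
        name varchar(50),
        age int
    )
    UNIQUE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 4
    PROPERTIES("replication_num" = "1", "enable_unique_key_merge_on_write" = "true");

    SET enable_unique_key_partial_update = true;
    SET enable_insert_strict = false;
    -- Update only the age column of the row whose id is 1. Other columns keep their current values.
    INSERT INTO partial_demo (id, age) VALUES (1, 33);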

FAQ

What do I do if the "get table cloud commit lock timeout" error message is displayed during data import?

This error occurs because data is written to the ApsaraDB for SelectDB instance too frequently, which causes deadlocks. We recommend that you merge small amounts of data and write them in a single batch, and keep the interval between writes to a table greater than 5 seconds.