All Products
Search
Document Center

E-MapReduce:Insert Into

Last Updated:Aug 03, 2023

The Insert Into import method is based on INSERT INTO statements. You can execute an INSERT INTO statement in Doris in a way similar to that in databases such as MySQL. However, data is inserted into Doris by using independent import jobs. Therefore, INSERT INTO is also considered as an import method in Doris. This topic describes how to use INSERT INTO statements to import data and the best practices.

Background information

Note

Some information in this topic is from Apache Doris. For more information, see Introduction to Apache Doris.

Doris supports the following syntax for INSERT INTO statements:

  • NSERT INTO tbl SELECT ...

  • INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...);

    Important

    This syntax can be used only in demos. Do not use the syntax in the test or production environment.

Import operations and returned results

You must submit an INSERT INTO statement by using the MySQL protocol. The result is synchronously returned when you execute an INSERT INTO statement.

Import operations

Sample INSERT INTO statements:

INSERT INTO tbl2 WITH LABEL label1 SELECT * FROM tbl3;
INSERT INTO tbl1 VALUES ("qweasdzxcqweasdzxc"), ("a");
Important

If you want to use common table expressions (CTEs) to define subqueries in an INSERT INTO statement, you must specify WITH LABEL and the column list or wrap the CTEs. The following sample statements are for your reference.

INSERT INTO tbl1 WITH LABEL label1
WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1;

INSERT INTO tbl1 (k1)
WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1;

INSERT INTO tbl1 (k1)
select * from (
WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1) as ret

For more information about the parameters in the preceding sample statements, see INSERT INTO. You can also run the HELP INSERT command to view the parameter description.

Returned results

An INSERT INTO statement is an SQL statement. The returned result varies in the following cases:

  • The result set is empty.

    If the result set of the SELECT clause in the INSERT INTO statement is empty, a result similar to the following output is returned:

    mysql> insert into tbl1 select * from empty_tbl;
    Query OK, 0 rows affected (0.02 sec)

    Query OK indicates that the statement is successfully executed. 0 rows affected indicates that no data is imported.

  • The result set is not empty.

    If the result set is not empty, the returned result varies based on whether the INSERT statement is successfully executed.

    • The INSERT INTO statement is successfully executed.

      mysql> insert into tbl1 select * from tbl2;
      Query OK, 4 rows affected (0.38 sec)
      {'label':'insert_8510c568-9eda-4173-9e36-6adc7d35****', 'status':'visible', 'txnId':'4005'}
      
      mysql> insert into tbl1 with label my_label1 select * from tbl2;
      Query OK, 4 rows affected (0.38 sec)
      {'label':'my_label1', 'status':'visible', 'txnId':'4005'}
      
      mysql> insert into tbl1 select * from tbl2;
      Query OK, 2 rows affected, 2 warnings (0.31 sec)
      {'label':'insert_f0747f0e-7a35-46e2-affa-13a235f4****', 'status':'visible', 'txnId':'4005'}
      
      mysql> insert into tbl1 select * from tbl2;
      Query OK, 2 rows affected, 2 warnings (0.31 sec)
      {'label':'insert_f0747f0e-7a35-46e2-affa-13a235f4****', 'status':'committed', 'txnId':'4005'}

      Query OK indicates that the statement is successfully executed. 4 rows affected indicates that four rows of data are imported. 2 warnings indicates that two rows of data are filtered out.

      A JSON string is also returned. Examples:

      {'label':'my_label1', 'status':'visible', 'txnId':'4005'}
      {'label':'insert_f0747f0e-7a35-46e2-affa-13a235f4****', 'status':'committed', 'txnId':'4005'}
      {'label':'my_label1', 'status':'visible', 'txnId':'4005', 'err':'some other error'}
      • label: the label that you specified or the label that is automatically generated by the system. A label is used to identify an INSERT INTO job. Each import job has a unique label within a single database.

      • status: indicates whether the imported data is visible. If a value of visible is returned, the imported data is visible. If a value of committed is returned, the imported data is invisible.

      • txnId: the ID of the transaction for the INSERT INTO statement.

      • err: the unexpected error.

      You can execute the SHOW LOAD statement to view the data rows that are filtered out. The following sample statement provides an example. The URL in the returned result can be used to query the error data.

      show load where label="xxx";

      The committed state is a temporary state. Data will eventually be in the visible state. You can execute the SHOW TRANSACTION statement to check whether the imported data is visible. The following sample statement provides an example. If the value of the TransactionStatus parameter is visible in the returned result, the data is visible.

      show transaction where id=4005;
    • The INSERT INTO statement fails to be executed.

      If the INSERT INTO statement fails to be executed, no data is imported. Sample returned result:

      mysql> insert into tbl1 select * from tbl2 where k1 = "a";
      ERROR 1064 (HY000): all partitions have no load data. url: http://10.74.xx.xx:8042/api/_load_error_log?file=__shard_2/error_log_insert_stmt_ba8bb9e158e4879-ae8de8507c0b****_ba8bb9e158e4879_ae8de8507c0b****

      The ERROR 1064 (HY000): all partitions have no load data message in the returned result shows the cause of the failure. The URL can be used to query the error data.

To sum up, you can interpret the returned result of an INSERT INTO statement and perform subsequent operations based on the following logic:

  • If ERROR 1064 (HY000) is returned, the import fails.

  • If Query OK is returned, the statement is successfully executed.

    • If the number of rows affected is 0, the result set is empty and no data is imported.

    • If the number of rows affected is greater than 0, further check other parameters in the returned result:

      • If the value of the status parameter is committed, the data is invisible. In this case, you can execute the SHOW TRANSACTION statement to check the data status until the data becomes visible.

      • If the value of the status parameter is visible, the import is successful.

      • If the number of warnings is greater than 0, one or more rows of data are filtered out. In this case, you can execute the SHOW LOAD statement to obtain the URL to view the data rows that are filtered out.

SHOW LAST INSERT

The preceding section describes how to perform subsequent operations based on the returned result of an INSERT INTO statement. However, it is difficult to obtain the JSON string of the returned result by using the MySQL class libraries in specific languages. Therefore, Doris allows you to run the SHOW LAST INSERT command to explicitly obtain the returned result of the most recent INSERT INTO statement. After you execute an INSERT INTO statement, you can run the SHOW LAST INSERT command in the same session to obtain the returned result of the most recent INSERT INTO statement. Example:

mysql> show last insert\G
*************************** 1. row ***************************
    TransactionId: 640**
            Label: insert_ba8f33aea9544866-8ed77e2844d0****
         Database: default_cluster:db1
            Table: t1
TransactionStatus: VISIBLE
       LoadedRows: 2
     FilteredRows: 0

This command returns the result of the most recent INSERT INTO statement and the details of the transaction. You can run the SHOW LAST INSERT command each time after you execute an INSERT INTO statement to obtain the returned result of the INSERT INTO statement.

Important

You can run this command to obtain only the returned result of the most recent INSERT INTO statement during the connection of the same session. If the connection is closed or replaced, an empty result set is returned.

Related system parameters

FE configurations

timeout: the timeout period of the import job. Unit: seconds. If an import job is not complete within the specified timeout period, the system cancels the job. In this case, the job enters the CANCELLED state. You cannot specify a custom timeout period separately for an INSERT INTO job. The same timeout period applies to all INSERT INTO jobs. The default timeout period is 1 hour. If your data cannot be imported within the default timeout period, you can modify the insert_load_default_timeout_second parameter of a frontend (FE). In addition, an INSERT INTO statement is also affected by the timeout period specified by the query_timeout session variable. You can increase the timeout period by running the SET query_timeout = xxx; command. The timeout period is measured in seconds.

Session variables

  • enable_insert_strict: An INSERT INTO statement does not allow you to specify a tolerable failure rate. You can set the enable_insert_strict session variable to control the tolerable failure rate for INSERT INTO statements. A value of false specifies that a success result is returned if at least one data row is imported. If some data rows fail to be imported, a label is also returned. A value of true specifies that the import fails if a data row fails to be imported. The default value is false. You can run the SET enable_insert_strict = true; command to set the variable to true.

  • query_timeout: An INSERT INTO statement is an SQL statement. Therefore, an INSERT INTO statement is affected by the timeout period specified by the query_timeout session variable. You can increase the timeout period by running the SET query_timeout = xxx; command. The timeout period is measured in seconds.

Best practices

Scenarios

  • Import a few data rows to test the features of Doris. In this scenario, you can use the INSERT INTO VALUES syntax in the same way as in MySQL databases.

  • Perform extract, transform, and load (ETL) processing on data in a Doris table and import the processed data into another Doris table. In this case, the INSERT INTO SELECT syntax is more suitable.

  • Create a foreign table and map the table to a data source. For example, you create a MySQL foreign table that maps a MySQL table, or create a broker foreign table that maps a data file in Hadoop Distributed File System (HDFS). Then, import data in the data source into a Doris table by executing an INSERT INTO SELECT statement.

Data size

Doris does not limit the size of data that can be imported by an INSERT INTO statement. You can execute an INSERT INTO statement to import a large amount of data. However, a default timeout period is specified for an INSERT INTO job. If the size of data to be imported is large and you estimate that the data cannot be imported within the default timeout period, you must increase the timeout period for the import job.

For example, if the size of the data to be imported is 36 GB, you can calculate the time required to import the data based on the following formula: 36 GB/10 MB/s = 3600s. In this formula, 10 MB/s is the maximum import speed. You must calculate the average import speed based on your cluster status and replace 10 MB/s in the formula.

Complete example

The sales database has a table named store_sales. A table named bj_store_sales is created. You want to import data for the bj region from the store_sales table to the bj_store_sales table. The size of the data to be imported is approximately 10 GB, and the average import speed of the current cluster is approximately 5 MB/s.

store_sales schema:
(id, total, user_id, sale_timestamp, region)

bj_store_sales schema:
(id, total, user_id, sale_timestamp)
  1. Determine whether to modify the default timeout period for an INSERT INTO job.

    Calculate the approximate time required to import the data
    10 GB / 5 M/s = 2000s
    
    Modify the timeout period on the FE
    insert_load_default_timeout_second = 2000
  2. Create an import job.

    You want to perform ETL processing on data in the source table and import the processed data into the destination table. An INSERT INTO statement with a subquery is suitable for this scenario.

    INSERT INTO bj_store_sales WITH LABEL `label` SELECT id, total, user_id, sale_timestamp FROM store_sales where region = "bj";