MaxCompute allows you to execute the DELETE or UPDATE statement to delete or update data of specific rows in transactional tables.

You can execute the statements that are described in this topic on the following platforms:

Prerequisites

You are granted the SELECT and UPDATE permissions to read data from and update data in a transactional table. For more information, see MaxCompute permissions.

Description

Similar to traditional SQL statements, the DELETE and UPDATE statements in MaxCompute can be used to delete or update the data of specific rows in tables.

Each time you execute the DELETE or UPDATE statement, a delta file is automatically generated to store information about the delete or update operation. This file is invisible to users. The following section describes how the delta files are generated.
  • DELETE: A delta file contains the txnid and rowid fields, both of which have a value of the BIGINT type. The rowid field indicates the deleted row in the base files of the transactional table. The txnid field indicates the delete operation that is performed on the row.

    For example, a base file of the t1 table is f1 and the content of the base file is a, b, c, a, b. When you execute the delete from t1 where c1='a'; statement, a delta file named f1.delta is generated. If the value of the txnid field is t0, the content of the f1.delta file is ((0, t0), (3, t0)). This indicates that the rows whose IDs are 0 and 3 are deleted from the t0 transaction. If you execute another DELETE statement on the t1 table again, another delta file named f2.delta is generated. The file name is generated based on the f1 base file. When you query the data in the t1 table, the system filters the deleted data based on the f1, f1.delta, and f2.delta files and returns the data that is not deleted.

  • UPDATE: The logic of an UPDATE statement is converted into the logic of executing the DELETE and INSERT INTO statements.
The DELETE and UPDATE statements have the following benefits:
  • Reduce the amount of data to be written

    Before the DELETE and UPDATE statements are provided, MaxCompute allows you to execute only the INSERT INTO or INSERT OVERWRITE statement to update or delete data in tables. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE). If you want to execute an INSERT statement when you update a small amount of data in a table or a partition of the table, you must first execute a SELECT statement to read all data from the table and update the data. Then, you can execute the INSERT statement to insert all data into the table. This method is inefficient. However, if you use the DELETE or UPDATE statement in the preceding scenario, the system does not need to write all data in the table. This reduces the amount of data to be written.

    Note
    • If you use the pay-as-you-go billing method, you are not charged for the write operations when you execute the DELETE, UPDATE, or INSERT OVERWRITE statement. However, when you execute the DELETE or UPDATE statement, MaxCompute must filter data by partition and read the data that you want to delete or update. You are charged for the read operations based on the pay-as-you-go billing method of SQL jobs. Therefore, compared with the INSERT OVERWRITE statement, the DELETE or UPDATE statement does not help you reduce costs.
    • If you use the subscription billing method, fewer resources are consumed to write data when you execute the DELETE or UPDATE statement. Compared with the INSERT OVERWRITE statement, the DELETE or UPDATE statement allows you to run more jobs when the same amount of resources is used.
  • Read the table with the latest data

    Before the DELETE and UPDATE statements are provided, MaxCompute allows you to use history tables to update multiple data entries in a table. If you use a history table, you must add auxiliary columns such as start_date and end_date in the table. These columns indicate the lifecycle of a data entry. To query the latest data of a table, the system must identify the latest data from large amounts of data based on the timestamps. This is a time-consuming process. However, you can execute the DELETE or UPDATE statement to delete or update data. When you query the data in a table, the system reads the latest data of the table based on the base files and all delta files.

Important After you execute the DELETE and UPDATE statements multiple times on a transactional table, the transactional table occupies a larger storage space. In this case, the costs of the storage and subsequent queries on the table increase. In addition, the efficiency of subsequent queries is reduced. To resolve these issues, we recommend that you execute the ALTER TABLE COMPACT statement to merge the base files with all delta files on a regular basis. For more information, see ALTER TABLE COMPACT.

If multiple jobs are run in parallel on a table, conflicts may occur. For more information, see ACID semantics.

Scenarios

You can execute the DELETE or UPDATE statement to delete or update a small amount of data in tables or partitions of the tables at a low frequency. For example, delete or update less than 5% of the data in a table or a partition of the table on the next day after the data is generated.

The DELETE or UPDATE statement is not applicable if you want to delete or update data at a high frequency, or if you want to write data to tables in real time.

Limits

The DELETE statement, UPDATE statement, and transactional tables on which the DELETE or UPDATE statement is executed have the following limits:
  • You can execute the DELETE and UPDATE statements only on transactional tables. For more information about transactional tables, see Table operations.
  • When you create a clustered table or an external table, you cannot set the clustered table or external table to a transactional table.
  • You cannot convert between transactional tables and MaxCompute internal tables, external tables, or clustered tables.
  • Jobs from other systems, such as MaxCompute Spark, Machine Learning Platform for AI, and Graph, cannot access transactional tables.
  • CLONE TABLE and MERGE PARTITION operations are not supported.
  • Before you execute the UPDATE, DELETE, or INSERT OVERWRITE statement for important data in transactional tables, you must execute the SELECT and INSERT statements to back up the data to other tables.

Usage notes

When you execute the DELETE or UPDATE statement to delete or update data in tables or partitions of the tables, take note of the following items:
  • In specific scenarios, you may want to execute the DELETE or UPDATE statement for a small amount of data in a table and infrequently perform read and other operations in subsequent procedures. To reduce the storage space that is occupied by the table, we recommend that you merge the base files with all delta files after you execute the DELETE or UPDATE statement for the table several times. For more information, see ALTER TABLE COMPACT.
  • In specific scenarios, you may want to delete or update more than 5% of the data in a table or a partition of the table at a low frequency and perform frequent subsequent queries. We recommend that you execute the INSERT OVERWRITE or INSERT INTO statement in such scenarios. For more information, see Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE).

    For example, you want to perform delete or update operations 10 times each day and delete or update 10% of the data in a table each time. In this case, we recommend that you estimate the total cost and the consumption of the subsequent read performance if you execute the DELETE or UPDATE statement on the table. Then, compare the estimated result with that of executing the INSERT OVERWRITE or INSERT INTO statement. This helps you choose an efficient method.

  • Each time you execute the DELETE statement on a table, a delta file is automatically generated. As a result, the occupied storage space may not be reduced. If you want to execute the DELETE statement to delete data to reduce storage usage, you can merge the base files with all delta files. For more information, see ALTER TABLE COMPACT.
  • MaxCompute executes multiple DELETE and UPDATE statements in jobs at a time. Each statement consumes resources and incurs fees. We recommend that you delete or update a batch of data at a time. For example, if you run a Python script to generate and submit a large number of row-level update jobs, and each statement executes on only one row or a small number of rows of data, each statement incurs fees that correspond to the amount of input data scanned by the SQL statement and consumes the related computing resources. When multiple statements are accumulated, the costs are significantly increased and the system efficiency is reduced. Examples:
    -- We recommend that you execute the following statement: 
    update table1 set col1= (select value1 from table2 where table1.id = table2.id and table1.region = table2.region);
    
    -- We recommend that you do not execute the following statements: 
    update table1 set col1=1 where id='2021063001'and region='beijing';                  
    update table1 set col1=2 where id='2021063002'and region='beijing';

DELETE

You can execute the DELETE statement to delete one or more rows that meet the specified conditions in partitioned transactional tables or non-partitioned transactional tables.

  • Syntax
    delete from <table_name> [where <where_condition>];
  • Parameters
    • table_name: required. The name of the transactional table on which you want to execute the DELETE statement.
    • where_condition: optional. A WHERE clause that is used to filter data based on conditions. For more information, see WHERE clause (where_condition). If you execute the DELETE statement on a table without a WHERE clause, all data in the table is deleted.
  • Examples
    • Example 1: Create a non-partitioned transactional table named acid_delete and insert data into the table. Then, execute the DELETE statement to delete the rows that meet the specified conditions from the table. The following statements show an example:
      -- Create a non-partitioned transactional table named acid_delete. 
      create table if not exists acid_delete(id bigint) tblproperties ("transactional"="true"); 
      
      -- Insert data into the table. 
      insert overwrite table acid_delete values(1),(2),(3),(2); 
      
      -- Query the table to check whether data is inserted. 
      select * from acid_delete; 
      
      +------------+
      | id         |
      +------------+
      | 1          |
      | 2          |
      | 3          |
      | 2          |
      +------------+
      
      -- Delete the rows whose value of the id column is 2. If you want to execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. 
      delete from acid_delete where id = 2; 
      
      -- Query the table to check whether the table contains only the rows whose values of the id column are 1 and 3. 
      select * from acid_delete; 
      
      +------------+
      | id         |
      +------------+
      | 1          |
      | 3          |
      +------------+
    • Example 2: Create a partitioned transactional table named acid_delete_pt and insert data into the table. Then, execute the DELETE statement to delete the rows that meet the specified conditions from the table. The following statements show an example:
      -- Create a partitioned transactional table named acid_delete_pt.  
      create table if not exists acid_delete_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true");
      
      -- Add partitions to the table. 
      alter table acid_delete_pt add if not exists partition (ds= '2019');
      alter table acid_delete_pt add if not exists partition (ds= '2018');
      
      -- Insert data into the table. 
      insert overwrite table acid_delete_pt partition (ds='2019') values(1),(2),(3);
      insert overwrite table acid_delete_pt partition (ds='2018') values(1),(2),(3);
      
      -- Query the table to check whether data is inserted. 
      select * from acid_delete_pt;
      
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 1          | 2018       |
      | 2          | 2018       |
      | 3          | 2018       |
      | 1          | 2019       |
      | 2          | 2019       |
      | 3          | 2019       |
      +------------+------------+
      
      -- Delete the rows whose values of the id and ds columns are 2 and 2019. If you want to execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. 
      delete from acid_delete_pt where ds='2019' and id = 2;
      
      -- Query the table to check whether the rows whose values of the id and ds columns are 2 and 2019 are deleted. 
      select * from acid_delete_pt;
      
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 1          | 2018       |
      | 2          | 2018       |
      | 3          | 2018       |
      | 1          | 2019       |
      | 3          | 2019       |
      +------------+------------+
    • Example 3: Create a destination table named acid_delete_t and an associated table named acid_delete_s. Then, delete the rows that meet the specified conditions from the destination table based on the associated table. The following statements show an example:
      -- Create a destination table named acid_delete_t and an associated table named acid_delete_s. 
      create table if not exists acid_delete_t(id int,value1 int,value2 int) tblproperties ("transactional"="true");
      create table if not exists acid_delete_s(id int,value1 int,value2 int);
      
      -- Insert data into the tables. 
      insert overwrite table acid_delete_t values(2,20,21),(3,30,31),(4,40,41);
      insert overwrite table acid_delete_s values(1,100,101),(2,200,201),(3,300,301);
      
      -- Delete the rows in the acid_delete_t table whose value of the id column does not match that of the rows in the acid_delete_s table. If you want to execute the statement on the MaxCompute client (odpscmd), you must enter yes or no to confirm the deletion. 
      delete from acid_delete_t where not exists (select * from acid_delete_s where acid_delete_t.id=acid_delete_s.id);
      
      -- Query the acid_delete_t table to check whether the table contains only the rows whose values of the id column are 2 and 3. 
      select * from acid_delete_t;
      
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      +------------+------------+------------+

UPDATE

You can execute the UPDATE statement to update the values of one or more columns of the rows that meet the specified conditions in partitioned transactional tables or non-partitioned transactional tables.

  • Syntax
    -- Method 1
    update <table_name> set <col1_name> = <value1> [, <col2_name> = <value2> ...] [WHERE <where_condition>];
    
    -- Method 2
    update <table_name> set (<col1_name> [, <col2_name> ...]) = (<value1> [, <value2> ...])[WHERE <where_condition>];
    
    -- Method 3
    UPDATE <table_name>
           SET <col1_name> = <value1> [ , <col2_name> = <value2> , ... ]
            [ FROM <additional_tables> ]
            [ WHERE <where_condition> ]
  • Parameters
    • table_name: required. The name of the transactional table on which you want to execute the UPDATE statement.
    • col1_name and col2_name: the columns that you want to update. You must specify at least one column.
    • value1 and value2: the new values that you want to assign to the specified columns. You must update the value of at least one column.
    • where_condition: optional. A WHERE clause that is used to filter data based on conditions. For more information, see WHERE clause (where_condition). If you execute the UPDATE statement on a table without a WHERE clause, all data in the table is updated.
    • additional_tables: optional. A from clause.
      The UPDATE statement can be used with the from clause. The from clause makes the use of the UPDATE statement more convenient. The following table describes the UPDATE statements that are used with and without the from clause.
      Scenario Sample code
      Without the from clause
      update target set v =
        (select min(v) from src group by k where target.k = src.key)
        where target.k in (select k from src);                                 
      With the from clause
      update target set v = b.v
        from (select k, min(v) v from src group by k) b
        where target.k = b.k;
      The following conclusions are obtained based on the preceding sample code:
      • If you use multiple rows of data in the source table to update one row of data in the destination table, you must write aggregate operations to ensure the uniqueness of the data source. The code of aggregate operations is simpler and easier to understand when you use the from clause than that when you do not use the from clause.
      • If you need to only update the intersection data during join operations, you must use a from clause or a where clause. The where clause is more complex than the from clause.
  • Examples
    • Example 1: Create a non-partitioned table named acid_update and insert data into the table. Then, execute the UPDATE statement to update the columns of the rows that meet the specified conditions in the table. The following statements show an example:
      -- Create a non-partitioned table named acid_update.   
      create table if not exists acid_update(id bigint) tblproperties ("transactional"="true");
      
      -- Insert data into the table. 
      insert overwrite table acid_update values(1),(2),(3),(2);
      
      -- Query the table to check whether data is inserted. 
      select * from acid_update; 
      
      +------------+
      | id         |
      +------------+
      | 1          |
      | 2          |
      | 3          |
      | 2          |
      +------------+
      
      -- Update the value 2 in the id column to 4. 
      update acid_update set id = 4 where id = 2; 
      
      -- Query the table to check whether the value 2 is changed to 4 in the id column. 
      select * from acid_update; 
      
      +------------+
      | id         |
      +------------+
      | 1          |
      | 3          |
      | 4          |
      | 4          |
      +------------+
    • Example 2: Create a partitioned table named acid_update and insert data into the table. Then, execute the UPDATE statement to update the columns of the rows that meet the specified conditions in the table. The following statements show an example:
      -- Create a partitioned table named acid_update_pt.  
      create table if not exists acid_update_pt(id bigint) partitioned by(ds string) tblproperties ("transactional"="true");
      
      -- Add a partition to the table. 
      alter table acid_update_pt add if not exists partition (ds= '2019');
      
      -- Insert data into the added partition. 
      insert overwrite table acid_update_pt partition (ds='2019') values(1),(2),(3);
      -- Query the partition to check whether data is inserted.
      select * from acid_update_pt where ds = '2019';
      
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 1          | 2019       |
      | 2          | 2019       |
      | 3          | 2019       |
      +------------+------------+
      
      -- Update the value 2 of the id column in the 2019 partition to 4. 
      update acid_update_pt set id = 4 where ds = '2019' and id = 2; 
      
      -- Query the partition to check whether the value 2 is changed to 4 in the id column. 
      select * from acid_update_pt where ds = '2019';
      
      +------------+------------+
      | id         | ds         |
      +------------+------------+
      | 4          | 2019       |
      | 1          | 2019       |
      | 3          | 2019       |
      +------------+------------+
    • Example 3: Create a transactional table named acid_update_t that you want to update and an associated table named acid_update_s. Then, update the values of multiple columns at a time in the acid_update_t table. The following statements show an example:
      -- Create a transactional table named acid_update_t that you want to update and an associated table named acid_update_s. 
      create table if not exists acid_update_t(id int,value1 int,value2 int) tblproperties ("transactional"="true");
      create table if not exists acid_update_s(id int,value1 int,value2 int);
      
      -- Insert data into the tables. 
      insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41);
      insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301);
      --Method 1: Update the values of specific columns with constants. 
      update acid_update_t set (value1, value2) = (60,61);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 60         | 61         |
      | 3          | 60         | 61         |
      | 4          | 60         | 61         |
      +------------+------------+------------+
      
      --Method 2: Use the data in the acid_update_s table to update all rows in the acid_update_t table. If specific rows in the acid_update_t table cannot be matched, null values are assigned to the rows. 
      update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      | 4          | NULL       | NULL       |
      +------------+------------+------------+
      
      -- Method 3: Use the data in the acid_update_s table to update the acid_update_t table. Add filter conditions for the acid_update_t table to update only the rows that intersect with those in the acid_update_s table. 
      update acid_update_t set (value1, value2) = (select value1, value2 from acid_update_s where acid_update_t.id = acid_update_s.id) where acid_update_t.id in (select id from acid_update_s);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 4          | 40         | 41         |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      +------------+------------+------------+
      
      -- Method 4: Use the aggregate results of the acid_update_t and acid_update_s tables to update the acid_update_t table. 
      update acid_update_t set (id, value1, value2) = (select id, max(value1),max(value2) from acid_update_s where acid_update_t.id = acid_update_s.id group by acid_update_s.id) where acid_update_t.id in (select id from acid_update_s);
      -- Query the acid_update_t table to check whether data is updated as expected. 
      select * from acid_update_t;
      
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 4          | 40         | 41         |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      +------------+------------+------------+
    • Example 4: Perform join operations on two tables. Sample statements:
      -- Create a destination table named acid_update_t and create a table named acid_update_s for joins.
      create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true");
      create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint);
      -- Insert data into these tables.
      insert overwrite table acid_update_t values(2,20,21),(3,30,31),(4,40,41);
      insert overwrite table acid_update_s values(1,100,101),(2,200,201),(3,300,301);
      -- Query data from these tables.
      select * from acid_update_t;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      | 4          | 40         | 41         |
      +------------+------------+------------+
      select * from acid_update_s;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 1          | 100        | 101        |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      +------------+------------+------------+
      
      -- Use the data in the acid_update_s table to update the acid_update_t table. Add filter conditions for the acid_update_t table to update only the intersection of the two tables.
      update acid_update_t set value1 = b.value1, value2 = b.value2
      from acid_update_s b where acid_update_t.id = b.id;
      
      -- Value 20 is updated to 200, value 21 is updated to 201, value 30 is updated to 300, and value 31 is updated to 301.
      select * from acid_update_t;
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 4          | 40         | 41         |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      +------------+------------+------------+
                                  
    • Example 5: Perform complex join operations on more than two tables. Sample statements:
      -- Create a destination table named acid_update_t and create a table named acid_update_s for joins.
      create table if not exists acid_update_t(id bigint,value1 bigint,value2 bigint) tblproperties ("transactional"="true");
      create table if not exists acid_update_s(id bigint,value1 bigint,value2 bigint);
      create table if not exists acid_update_m(id bigint,value1 bigint,value2 bigint);
      -- Insert data into the tables.
      insert overwrite table acid_update_t
      values(2,20,21),(3,30,31),(4,40,41),(5,50,51);
      insert overwrite table acid_update_s
      values (1,100,101),(2,200,201),(3,300,301),(4,400,401),(5,500,501);
      insert overwrite table acid_update_m
      values(3,30,101),(4,400,201),(5,300,301);
      
      -- Query data from the table.
      select * from acid_update_t;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      | 4          | 40         | 41         |
      | 5          | 50         | 51         |
      +------------+------------+------------+
      select * from acid_update_s;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 1          | 100        | 101        |
      | 2          | 200        | 201        |
      | 3          | 300        | 301        |
      | 4          | 400        | 401        |
      | 5          | 500        | 501        |
      +------------+------------+------------+
      select * from acid_update_m;
      -- The following result is returned:
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 3          | 30         | 101        |
      | 4          | 400        | 201        |
      | 5          | 300        | 301        |
      +------------+------------+------------+
      -- Use the data in the acid_update_s table to update the acid_update_t table. Use the where clause to filter values in the acid_update_s, acid_update_t, and acid_update_m tables.
      update acid_update_t
      set value1 = acid_update_s.value1, value2 = acid_update_s.value2
      from acid_update_s
      where acid_update_t.id = acid_update_s.id
      and acid_update_s.id > 2
      and acid_update_t.value1 not in
      (select value1 from acid_update_m where id = acid_update_t.id)
      and acid_update_s.value1 not in
      (select value1 from acid_update_m where id = acid_update_s.id);
      
      -- Only the row of data that contains the value of 5 in the id column of the acid_update_t table meets the condition. 
      The value in the value1 column is updated to 500, and the value in the value2 column is updated to 501.
      select * from acid_update_t;
      +------------+------------+------------+
      | id         | value1     | value2     |
      +------------+------------+------------+
      | 5          | 500        | 501        |
      | 2          | 20         | 21         |
      | 3          | 30         | 31         |
      | 4          | 40         | 41         |
      +------------+------------+------------+

ALTER TABLE COMPACT

The base files and delta files of a transactional table occupy the physical storage. You cannot directly read such files. If you execute the UPDATE or DELETE statement on a transactional table, the data in the base files is not updated, but a delta file is generated for each operation. In this case, the more delete or update operations are performed, the more storage space the table occupies. The number of delta files increases. As a result, you are charged more for storage usage and subsequent queries.

If you execute the UPDATE or DELETE statement on a table or a partition of the table multiple times, a large number of delta files are generated. When the system reads data from the table, the system loads the delta files to identify the deleted and updated rows. A large number of delta files reduce the efficiency of reading data. In this case, you can merge the base files with the delta files to reduce storage usage and improve the read efficiency.

  • Syntax
    alter table <table_name> [partition (<partition_key> = '<partition_value>' [, ...])] compact {minor|major};
  • Parameters
    • table_name: required. The name of the transactional table for which you want to merge the base files and delta files.
    • partition_key: optional. The name of a partition key column in the partitioned transactional table.
    • partition_value: optional. The value of a partition key column in the partitioned transactional table.
    • major|minor: One of them must be specified. Differences between minor compaction and major compaction:
      • minor: merges each base file with all delta files that are generated based on the base file and deletes the delta files.
      • major: merges each base file with all delta files that are generated based on the base file, deletes the delta files, and merges small base files. If the size of a base file is less than 32 MB or a delta file is generated, the effect of merging files is equivalent to the effect of executing the INSERT OVERWRITE statement. However, if the size of a base file is greater than or equal to 32 MB and no delta files are generated, the data of the table is not overwritten.
  • Examples
    • Example 1: Merge files of the acid_delete table. The following statement shows an example:
      alter table acid_delete compact minor;
      The following information is returned:
      Summary:
      Nothing found to merge, set odps.merge.cross.paths=true if cross path merge is permitted.
      OK
    • Example 2: Merge files of the acid_update_pt table. The following statement shows an example:
      alter table acid_update_pt partition (ds = '2019') compact major;
      The following information is returned:
      Summary:
      table name: acid_update_pt /ds=2019  instance count: 2  run time: 6
        before merge, file count:        8  file size: 2613  file physical size: 7839
         after merge, file count:        2  file size: 679  file physical size: 2037
      
      OK

FAQ

  • Issue 1:
    • Problem description: When I execute the UPDATE statement, the following error message is returned: ODPS-0010000:System internal error - fuxi job failed, caused by: Data Set should contain exactly one row.
    • Cause: The rows that you want to update do not match the rows that are queried by subqueries. In this case, the system cannot determine which rows need to be updated. The following statement shows an example:
      update store set (s_county, s_manager) = (select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk) where s_store_sk in (select s_store_sk from store_delta);
      The select d_country, d_manager from store_delta sd where sd.s_store_sk = store.s_store_sk subquery is used to filter the data in the store_delta table. Then, the data that meets the specified condition is used to update the data of the store table. For example, three rows that have the s_store_sk column in the store table are [1, 2, 3]. If the rows that have the s_store_sk column in the store_delta table are [1, 1] and do not match the rows in the store table, the preceding error message is returned.
    • Solution: Make sure that the rows that you want to update exactly match the rows that are queried by subqueries.
  • Issue 2:
    • Problem description: When I run the compact command in DataWorks DataStudio, the following error message is returned: ODPS-0130161:[1,39] Parse exception - invalid token 'minor', expect one of 'StringLiteral','DoubleQuoteStringLiteral'.
    • Cause: The MaxCompute client version that corresponds to the exclusive resource group of DataWorks does not support the compact command.
    • Solution: Join the DataWorks DingTalk group and contact the technical support team to update the MaxCompute client version that corresponds to the exclusive resource group.