DML SQL

Last Updated: Dec 08, 2017

Statement Format:

  1. insert overwrite|into table tablename [partition (partcol1=val1, partcol2=val2 ...)]
  2. select_statement
  3. from from_statement;

Notes:

  • Insert syntax of MaxCompute is different from MySQL or Oracle Insert syntax. The keyword ‘table’ must be added following ‘insert overwrite|into’. The table name cannot be added following it directly.

‘Insert overwrite/into’ is used to save calculation results into a destination table. The difference between ‘insert into’ and ‘insert overwrite’ is that ‘insert into’ will insert added data into the table or partition, while ‘insert overwrite’ will clear source data from the table or partition before inserting data in it. In the course of processing data through MaxCompute SQL, ‘insert overwrite/into’ is the most common statement, which can save the calculation result into a table, provided for next calculation. For example, use following statements to calculate the sale detail of different regions from the table sale_detail:

  1. create table sale_detail_insert like sale_detail;
  2. alter table sale_detail_insert add partition(sale_date='2013', region='china');
  3. insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
  4. select shop_name, customer_id, total_price from sale_detail;

It is worth to Notes that the correspondence between source table and destination table depends on the column sequence in ‘select’ clause, nor the column name correspondence between two tables. The following statement is still legal:

  1. insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
  2. select customer_id, shop_name, total_price from sale_detail;
  3. -- After creating the table sale_detail_insert, the column sequence is: shop_name string, customer_id string, total_price bigint.
  4. -- While inserting data from sale_detail to sale_detail_insert, the sequence is customer_id, shop_name, total_price.
  5. -- Now the data of sale_detail.customer_id will be inserted into sale_detail_insert.shop_name. The data of sale_detail.shop_name will be inserted into sale_detail_insert.customer_id.

To insert data into a partition, the partition column cannot appear in Select list:

  1. insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
  2. select shop_name, customer_id, total_price, sale_date, region from sale_detail;
  3. -- Report error. sale_date and region are partition columns, which can not appear in Insert sentence of static partition.

MULTI INSERT

MaxCompute SQL supports inserting different result tables or partitions in a single SQL statement.

Statement Format:

  1. from from_statement
  2. insert overwrite | into table tablename1 [partition (partcol1=val1, partcol2=val2 ...)]
  3. select_statement1
  4. [insert overwrite | into table tablename2 [partition ...]
  5. select_statement2]

Description:

  • Generally, up to 128 ways of output can be written in a single SQL statement. Once exceeding 128 ways of output, report sytax error.
  • In a ‘multi insert’ statement, for a partition table, a target partition cannot appear for multiple times; for a no-partition table, this table cannot appear for multiple times.
  • Different partitions within a partition table cannot have both an ‘insert overwrite’ operation and an ‘insert into’ operation, otherwise, report an error.

As follows:

  1. create table sale_detail_multi like sale_detail;
  2. from sale_detail
  3. insert overwrite table sale_detail_multi partition (sale_date='2010', region='china' )
  4. select shop_name, customer_id, total_price
  5. insert overwrite table sale_detail_multi partition (sale_date='2011', region='china' )
  6. select shop_name, customer_id, total_price;
  7. -- Return result successfully. Insert the data of sale_detail into the 2010 sales records and 2011 sales records in China region.
  8. from sale_detail
  9. insert overwrite table sale_detail_multi partition (sale_date='2010', region='china' )
  10. select shop_name, customer_id, total_price
  11. insert overwrite table sale_detail_multi partition (sale_date='2010', region='china' )
  12. select shop_name, customer_id, total_price;
  13. -- An error is thrown. The same partition appears for multiple times.
  14. from sale_detail
  15. insert overwrite table sale_detail_multi partition (sale_date='2010', region='china' )
  16. select shop_name, customer_id, total_price
  17. insert into table sale_detail_multi partition (sale_date='2011', region='china' )
  18. select shop_name, customer_id, total_price;
  19. -- An error is thrown. Different partitions within a partition table cannot have both an insert overwrite operation and an insert into operation.

DYNAMIC PARTITION

To ‘insert overwrite’ into a partition table, you can specify the partition value in the statement. It can also be realized in a more flexible way, to specify a partition column in a partition table but not give the value. Correspondingly, the columns in Select clause are used to specify these partition values.

Statement Format:

  1. insert overwrite table tablename partition (partcol1, partcol2 ...) select_statement from from_statement;

Description:

  • Now, a single worker can only output up to 512 dynamic partitions in a distributed environment, otherwise it will lead to abnormality.
  • Currently, any dynamic partition SQL cannot generate more than 2,000 dynamic partitions; otherwise it will cause abnormality.
  • The value of dynamic partition cannot be NULL, otherwise expection will be thrown.
  • If the destination table has multiple partitions, it is allowed to specify parts of partitions to be static partitions through ‘Insert’ statement, but the static partitions must be advanced partitions.

Next is a simple example to explain dynamic partition:

  1. create table total_revenues (revenue bigint) partitioned by (region string);
  2. insert overwrite table total_revenues partition(region)
  3. select total_price as revenue, region
  4. from sale_detail;

As mentioned above, user is unable to know which partitions will be generated before running SQL. Only after the Select statement running ends, user can confirm which partitions have been generated through the value of ‘region’. This is why the partition is called ‘Dynamic Partition’.

Other Examples:

  1. create table sale_detail_dypart like sale_detail;
  2. insert overwrite table sale_detail_dypart partition (sale_date, region)
  3. select * from sale_detail;
  4. -- Return successfully.
  5. insert overwrite table sale_detail_dypart partition (sale_date='2013', region)
  6. select shop_name,customer_id,total_price,region from sale_detail;
  7. -- Return successfully; multiple partitions; specify a secondary partition.
  8. insert overwrite table sale_detail_dypart partition (sale_date='2013', region)
  9. select shop_name,customer_id,total_price from sale_detail;
  10. -- Return failure information. When inserting a dynamic partition, the dynamic partition column must appear in Select list.
  11. insert overwrite table sales partition (region='china', sale_date)
  12. select shop_name,customer_id,total_price,region from sale_detail;
  13. -- Return failure information. User can not specify the lowsubpartition only, but needs to insert advanced partition dynamically.

SELECT

Statement Format:

  1. select [all | distinct] select_expr, select_expr, ...
  2. from table_reference
  3. [where where_condition]
  4. [group by col_list]
  5. [order by order_condition]
  6. [distribute by distribute_condition [sort by sort_condition] ]
  7. [limit number]

Pay attention to the following aspects when using Select statement:

  • ‘Select’ operation can read columns from table by specifying the column name. Or use ‘*’ to read all columns. A simple Select statement is shown as follows:
  1. select * from sale_detail;

Or just read one column ‘shop_name’ in sale_detail:

  1. select shop_name from sale_detail;

In ‘where’ clause, you can specify the filter condition, such as:

  1. select * from sale_detail where shop_name like 'hang%';

Notes:

  • If you need screen display after executing Select statement, only 1000 rows of results can be displayed. If ’select’ is a clause, the ‘select’ clause will return all results to upper query without this restriction.
  • The filter conditions supported by ‘where’ clause include:
Filter Condition Description
> , < , =, >=, <=, <>
like, rlike
in, not in If adding subquery behind of ‘in/not in’, only return one column result and the quantity of return value cannot exceed 1000.

You can specify partition range in ‘Where’ clauses of ‘Select’ statement to avoid a full table scan. As follows:

  1. select sale_detail.*
  2. from sale_detail
  3. where sale_detail.sale_date >= '2008' and sale_detail.sale_date <= '2014';

The ‘Where’ clause of MaxCompute does not support querying through ‘between’ condition.

  • Nested query in table_reference is supported, such as:
  1. select * from (select region from sale_detail) t where region = 'shanghai';
  • Distinct: If there are repeated rows, using ‘distinct’ in front of the field will remove the duplicated value and only return one value, while using ‘all’ will return all repeated values. If ‘distinct’ is not specified, the default result is the same as ‘all’. Use ‘distinct’ to return a row of record, as follows:
  1. select distinct region from sale_detail;
  2. select distinct region, sale_date from sale_detail;
  3. -- If distinct is used for multiple columns; the scope of distinct is a column set of select statement, not a single column.
  • Group by: a grouping query caluse, usually used with aggregation functions. When ‘Select’ statement contains aggregation functions, next items must be noted:
  1. The key of ‘group by’ can be the column name of input table.
  2. It can also be an expression consisting of the input table columns, but cannot be the output column alias of ‘select’ statement.
  3. The priority of rule 1 is high than rule 2. If rules 1 and rule 2 have conflict, that is to say, the ‘group by’ key is the column or expression of input table also is the output column of ‘Select’ statement, follow rule 1.

If like:

  1. select region from sale_detail group by region;
  2. -- Its OK to use input table column directly as the group by key.
  3. select sum(total_price) from sale_detail group by region;
  4. -- its OK to group by region, return total price of each group.
  5. select region, sum(total_price) from sale_detail group by region;
  6. -- It OK to group by region, return the region value of each group (unique in group) and total price.
  7. select region as r from sale_detail group by r;
  8. -- Use the alias of select column and report error.
  9. select 'China-' + region as r from sale_detail group by 'China-' + region;
  10. -- You must use the whole expression of column.
  11. select region, total_price from sale_detail group by region;
  12. -- Report error. All columns of select statement, which do not contain aggretion function, must appear in group by clause.
  13. select region, total_price from sale_detail group by region, total_price;
  14. -- Its OK.

These restrictions depend on that ‘group by’ operation precedes ‘select operation’ in SQL analysis. So the ‘group by’ key can only be input table column or expression.

Notes:

  • Order by: Do globe sorting in accordance with a few columns for all data. If user wants to sort records in descending order, he can use DESC as the keyword. Because it is a global sorting, ‘order by’ must be used together with ‘limit’. For the sorting through ‘order by’ clause, NULL would be considered smaller than any value. This behavior is consistent with Mysql, but not in conformity with Oracle. Differently from ‘group by’, ‘order by’ must be followed with the column alias of Select statement. To select a column, if the alias is not specified, the column name will be lised as the column alias.
  1. select * from sale_detail order by region;
  2. -- Return error. Order by is not used together with limit’.
  3. select * from sale_detail order by region limit 100;
  4. select region as r from sale_detail order by region;
  5. -- Report error. Order by must be followed by column alias.
  6. select region as r from sale_detail order by r;
  • The number in [limit number] is a constant, to limit the output rows quantity. If the ‘limit’ option is not specified and using ‘select’ statement to view results from screen directly, only output 5000 rows mostly. This screen display limit of each project may be different and user can control it by using console panel.

  • Distribute by: do hash slice for data according to the values of several columns.The output column alias of ‘Select’ must be used.

  1. select region from sale_detail distribute by region;
  2. -- It OK because the column name is alias.
  3. select region as r from sale_detail distribute by region;
  4. -- Report error. The end must be added the column alias.
  5. select region as r from sale_detail distribute by r;
  • Sort by: for partial ordering, ‘distribute by’ must be added in fornt of it. In fact, ‘sort by’ is to do partial ordering for the result of ‘distribute by’. The output column alias of Select statement must be used.
  1. select region from sale_detail distribute by region sort by region;
  2. select region as r from sale_detail sort by region;
  3. -- -no distribute by’, report error.
  • ‘Order by’ cannot be used together with ‘distribute by/sort by’ clause. At the same time, ‘group by’ is also not allowed using together with ‘distribute by/sort by’ clause.

Notes:

  • The key of ‘order by/sort by/distribute by’ must be output column of Select statement. That is the column alias. In MaxCompute SQL analysis, ‘order by/sort by/distribute by’ operations are later that ‘Select’ operation. So they only accept the output column of Select statement as a key.

Subquery

The general ‘Select’ statement is to read data from a table, such as ‘select column_1, column_2 … from table_name’. But the object to be queried may be another ‘Select’ operation result, as follows:

  1. select * from (select shop_name from sale_detail) a;

Notes:

  • Alias must be used in subquery.

In ‘from’ clause, the subquery can be used as a table, which supports join operation with other tables or subquery. As follows:

  1. create table shop as select * from sale_detail;
  2. select a.shop_name, a.customer_id, a.total_price from
  3. (select * from shop) a join sale_detail on a.shop_name = sale_detail.shop_name;

UNION ALL

Statement Format:

  1. select_statement union all select_statement

Combine two or multiple data sets gotten from ‘Select’ operations to be a data set. If repeated rows exist in the result, all rows meeting the condition will be returned, not to remove duplicated data. Please note that MaxCompute SQL does not support combining two top query results and it must be rewritten to be a subquery format, as follows:

  1. select * from sale_detail where region = 'hangzhou'
  2. union all
  3. select * from sale_detail where region = 'shanghai';

Needs to be changed to:

  1. select * from (
  2. select * from sale_detail where region = 'hangzhou'
  3. union all
  4. select * from sale_detail where region = 'shanghai')
  5. t;

Notes:

  • For ‘union all’ operation, the column number, column name and type of each subquery must be consistent. If the column names are not consistent, use column alias to solve this problem.
  • Generally, MaxCompute allows 128 ways ‘union all’ at most. If exceeding this limit, report syntax error.

JOIN Operations

MaxCompute JOIN supports multiple joints, but does not support Cartesian product, which is a link without ‘on’ condition.

  1. join_table:
  2. table_reference join table_factor [join_condition]
  3. | table_reference {left outer|right outer|full outer|inner} join table_reference join_condition
  4. table_reference:
  5. table_factor
  6. | join_table
  7. table_factor:
  8. tbl_name [alias]
  9. | table_subquery alias
  10. | ( table_references )
  11. join_condition:
  12. on equality_expression ( and equality_expression )*

Notes:

  • equality_expression is an equality expression.

‘Left join’ will return all records from left table (shop), even if there is no matching records in right table (sale_detail).

  1. select a.shop_name as ashop, b.shop_name as bshop from shop a
  2. left outer join sale_detail b on a.shop_name=b.shop_name;
  3. -- Since the column shop_name exists both in table shop and table sale_detail’, the alias should be used in select clause to distinguish them.

‘Right outer join’ will return all records from right table, even if there is no matching records in left table, such as:

  1. select a.shop_name as ashop, b.shop_name as bshop from shop a
  2. right outer join sale_detail b on a.shop_name=b.shop_name;

‘Full outer join’ will return all records from left table and right table. Such as:

  1. select a.shop_name as ashop, b.shop_name as bshop from shop a
  2. full outer join sale_detail b on a.shop_name=b.shop_name;

If there is at least one matching record in the table, ‘inner join’ will return record. The keyword ‘inner’ can be ignored.

  1. select a.shop_name from shop a inner join sale_detail b on a.shop_name=b.shop_name;
  2. select a.shop_name from shop a join sale_detail b on a.shop_name=b.shop_name;

It’s only allowed to use ‘and’ to connect equal conditions . Only in MAPJOIN, unequal connections and using ‘or’ to connect multiple conditions are allowed.

  1. select a.* from shop a full outer join sale_detail b on a.shop_name=b.shop_name
  2. full outer join sale_detail c on a.shop_name=c.shop_name;
  3. -- support multiple joints
  4. select a.* from shop a join sale_detail b on a.shop_name != b.shop_name;
  5. -- Report error because unequal connections connected by join are not supported.

MAPJOIN HINT

When a big table needs to do ‘join’ operation with one or more small tables, ‘mapjoin’ can be used, which performance is much faster than ordinary ‘join’ operation. The basic principle of mapjoin is: SQL will load all specified small tables into the memory of a program which executes ‘join’ operation, so as to speed up the execution speed of ‘join’. The following items must be noted to use ‘mapjoin’:

  • The left table of ‘left outer join’ must be a big table.
  • The right table of ‘right outer join’ must be a big tale.
  • The left table or right table of ‘inner join’ can be considered as a big table.
  • ‘Full outer join’ cannot be used with ‘mapjoin’.
  • The ‘mapjoin’ statement supports a small table for the subquery.
  • When using ‘mapjoin’ and needing to quote small tables or subquery, an alias needs to be quoted.
  • In mapjoin, unequal connections and using ‘or’ to connect multiple conditions are allowed.
  • Now, MaxCompute mapjoin supports up to six small tables, otherwise sytax error will be reported.
  • If using ‘mapjoin’, then the memory size occupied by all small tables cannot exceed 512MB.
  • For multiple tables join, the left two tables cannot be the mapjoin tables at the same time.

Next is a simple example:

  1. select /* + mapjoin(a) */
  2. a.shop_name,
  3. b.customer_id,
  4. b.total_price
  5. from shop a join sale_detail b
  6. on a.shop_name = b.shop_name;

MaxCompute SQL does not support using unequal expressions as ‘on’ condition in general ‘join’ clause and ‘or’ join condition. But in mapjoin, these operations are supported. For example:

  1. select /*+ mapjoin(a) */
  2. a.total_price,
  3. b.total_price
  4. from shop a join sale_detail b
  5. on a.total_price < b.total_price or a.total_price + b.total_price < 500;

HAVING Clause

As WHERE in MaxCompute SQL cannot be used with aggregation function, you can use having clause.Statement Format:

  1. SELECT column_name, aggregate_function(column_name)
  2. FROM table_name
  3. WHERE column_name operator value
  4. GROUP BY column_name
  5. HAVING aggregate_function(column_name) operator value

Use scenarios: for example, there is an order table named “Orders”, including four fields: the name of the customer (Customer,), the amount of order (OrderPrice), order date (Order_date), order ID (Order_id). Now we need to find the customers whose total order amount is less than 2000.Now we can use the following sentence:

  1. SELECT Customer,SUM(OrderPrice) FROM Orders
  2. GROUP BY Customer
  3. HAVING SUM(OrderPrice)<2000
Thank you! We've received your feedback.