Run SQL

Last Updated: May 07, 2018

MaxCompute SQL does not support transactions, indexes, or UPDATE/DELETE operations. MaxCompute SQL syntax also differs from Oracle and MySQL, so you cannot seamlessly migrate SQL statements from other databases to MaxCompute. After you submit a MaxCompute job, it is queued and scheduled for execution; MaxCompute SQL can complete most queries within minutes or even seconds, but it is designed for processing massive data in batch jobs rather than for millisecond-level responses.

Select statements

  • The key of a GROUP BY clause can be a column name of the input table, or an expression composed of input table columns, but it cannot be an output column of the SELECT statement.

    select substr(col2, 2) from tbl group by substr(col2, 2); -- OK: the group by key can be an expression composed of input table columns.
    select col2 from tbl group by substr(col2, 2); -- Error: the selected column col2 is not a group by key.
    select substr(col2, 2) as c from tbl group by c; -- Error: the group by key cannot be a column alias, that is, an output column of the select statement.

During SQL parsing, the GROUP BY operation is performed before the SELECT operation, so GROUP BY can only use columns or expressions of the input table as its key.
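If you want to group by an aliased expression anyway, a common workaround is to compute the expression in a subquery first, so that the alias becomes an input column of the outer query. A sketch using the hypothetical table tbl from the examples above:

```sql
-- The inner query turns the expression into a real column named c;
-- the outer query can then legally group by c.
select c, count(*) as cnt
from (select substr(col2, 2) as c from tbl) t
group by c;
```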

  • ORDER BY must be used in combination with LIMIT.
  • DISTRIBUTE BY must be added in front of SORT BY.
  • The key of ORDER BY/SORT BY/DISTRIBUTE BY must be an output column of the SELECT statement, that is, a column alias.
    select col2 as c from tbl order by col2 limit 100; -- Error: the order by key is not an output column (column alias) of the select statement.
    select col2 from tbl order by col2 limit 100; -- OK: when an output column of the select statement has no alias, its column name serves as the alias.

During SQL parsing, the ORDER BY/SORT BY/DISTRIBUTE BY operations are performed after the SELECT operation, so they can only use output columns of the SELECT statement as their keys.
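Taken together, the rules above allow a statement like the following sketch (again using the hypothetical table tbl), where DISTRIBUTE BY hashes rows to reducers and SORT BY orders rows within each reducer:

```sql
-- col1 and col2 are output columns of the select list, so they are valid keys.
-- sort by without the preceding distribute by would be rejected.
select col1, col2
from tbl
distribute by col1
sort by col2;
```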

Insert statements

  • When you insert data into a specified (static) partition, the partition columns are not allowed in the SELECT list:

    insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
    select shop_name, customer_id, total_price, sale_date, region from sale_detail;
    -- Error: sale_date and region are partition columns, which are not allowed in the select list of a static partition insert.
  • When you insert data into a dynamic partition, the dynamic partition column must be in the SELECT list:

    insert overwrite table sale_detail_dypart partition (sale_date='2013', region)
    select shop_name, customer_id, total_price from sale_detail;
    -- Error: to insert into a dynamic partition, the dynamic partition column (region) must be in the select list.
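For reference, the two failing statements above can be corrected as follows, assuming the same sale_detail, sale_detail_insert, and sale_detail_dypart tables:

```sql
-- Static partition: omit the partition columns from the select list.
insert overwrite table sale_detail_insert partition (sale_date='2013', region='china')
select shop_name, customer_id, total_price from sale_detail;

-- Dynamic partition: the dynamic partition column (region) appears last in the select list.
insert overwrite table sale_detail_dypart partition (sale_date='2013', region)
select shop_name, customer_id, total_price, region from sale_detail;
```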

Join

  • MaxCompute SQL supports the following JOIN operation types: {LEFT OUTER|RIGHT OUTER|FULL OUTER|INNER} JOIN.

  • MaxCompute SQL supports up to 16 concurrent JOIN operations.

  • MaxCompute supports MAPJOIN with up to six small tables.
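A MAPJOIN loads the small tables into memory on the map side and avoids the shuffle of an ordinary join. The following sketch assumes a large fact table fact_table and two small dimension tables dim_a and dim_b (all hypothetical names):

```sql
-- List the small tables (at most six) inside the MAPJOIN hint;
-- the large table fact_table is streamed through the mappers.
select /*+ MAPJOIN(dim_a, dim_b) */ f.id, a.name, b.category
from fact_table f
join dim_a a on f.a_id = a.id
join dim_b b on f.b_id = b.id;
```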

Union All

Union All combines the results returned by multiple Select operations into one data set and returns all rows without deduplication. MaxCompute does not support applying Union All to two top-level queries, but you can apply it to two subqueries.

NOTE:

The two Select queries connected by Union All must have the same number of columns, the same column names, and the same column types. If the original column names are inconsistent, you can unify them with aliases.
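For example, assuming two hypothetical tables t1(a, b) and t2(x, y) whose column types match, you can unify the column names with aliases and apply Union All to the two subqueries:

```sql
-- Each side is a subquery; aliases make the column names and types identical.
select * from (
    select a as col1, b as col2 from t1
    union all
    select x as col1, y as col2 from t2
) t;
```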

Additional information

  • MaxCompute SQL supports up to 128 concurrent union operations.

  • MaxCompute supports up to 128 concurrent insert overwrite/into operations.

SQL optimization example

Where condition in Join statement

When you join two tables, the WHERE condition on the primary table can be written at the end of the statement, but the partition restriction on the secondary table must not be written in the WHERE clause. We recommend writing it in the ON condition or in a subquery. The partition restriction on the primary table can be written in the WHERE clause, although it is better to filter it with a subquery first.

Several SQL examples are as follows:

    select * from A join (select * from B where dt=20150301) B on B.id=A.id where A.dt=20150301; -- OK: the secondary table's partition filter is in a subquery.
    select * from A join B on B.id=A.id where B.dt=20150301; -- Not allowed: the secondary table's partition filter is in the where clause.
    select * from (select * from A where dt=20150301) A join (select * from B where dt=20150301) B on B.id=A.id; -- Recommended: both tables are filtered with subqueries first.

In the second statement, the Join operation runs before the partition filter, so the join handles a larger volume of data and performance decreases. Therefore, the second form must be avoided.

Data skew

The root cause of data skew is that some Workers process far more data than others; those Workers run much longer than average, which delays the whole job.

For more information about data skew optimization, see Optimize long tail computing.

Data skew caused by Join

Data skew in a Join operation occurs when the distribution of the keys in the ON condition is uneven. For example, to join a large table A with a small table B, run the following statement:

    select * from A join B on A.value=B.value;

Copy the Logview link to open the web console page, and double-click the Fuxi job that runs the Join operation. A long tail in the Long-Tails tab indicates that the data is skewed, as shown in the following figure:

You can optimize the statement by the following methods:

  • Because table B is a small table (not exceeding 512 MB), you can rewrite the preceding statement as a MAPJOIN statement:

    select /*+ MAPJOIN(B) */ * from A join B on A.value=B.value;
  • Handle the skewed keys with separate logic. For example, a large number of null key values in either table commonly causes data skew. Filter out the null rows, or replace a null key with a random value before the Join so that the rows spread across different Workers:

    select * from A join B
    on case when A.value is null then concat('value', rand()) else A.value end = B.value;
    If you know that the data is skewed but cannot identify which keys cause it, check the key distribution to locate the skew. For example, the following join leads to data skew:

    select * from a join b on a.key=b.key; -- This leads to data skew.

    Run the following statement to inspect the key distribution:

    select left.key, left.cnt * right.cnt from
    (select key, count(*) as cnt from a group by key) left
    join
    (select key, count(*) as cnt from b group by key) right
    on left.key=right.key;

Checking the distribution of the keys in this way shows whether data skew happens when a joins b.

Group by skew

Group by skew occurs when the distribution of the group by keys is uneven.

Suppose that table A has two fields, key and value, that the table holds a large volume of data, and that the distribution of key values is uneven. Run the following statement:

    select key, count(value) from A group by key;

You can see the long tail on the web console page. To solve this problem, set the anti-skew flag before running the SQL statement: add set odps.sql.groupby.skewindata=true in front of the statement.
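A sketch of the full submission, assuming the same table A as above:

```sql
-- The flag must be submitted together with the query it applies to.
set odps.sql.groupby.skewindata=true;
select key, count(value) from A group by key;
```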

Data skew caused by incorrect use of dynamic partitions

By default, a dynamic-partition SQL statement in MaxCompute adds a Reduce task that merges data belonging to the same partition. The benefits are as follows:

  • It reduces the number of small files generated by MaxCompute and improves processing efficiency.
  • It avoids a Worker consuming a large amount of memory when it outputs many files.

When the partition data is skewed, however, this Reduce task produces long tails: the data of one partition can be processed by at most 10 Workers, so a partition with a large volume of data results in a long tail. For example:

    insert overwrite table A2 partition(dt)
    select
    split_part(value,'\t',1) as field1,
    split_part(value,'\t',2) as field2,
    dt
    from A
    where dt='20151010';

In this case, the where clause fixes dt to a single value, so we recommend that you do not use a dynamic partition and instead modify the statement as follows:

    insert overwrite table A2 partition(dt='20151010')
    select
    split_part(value,'\t',1) as field1,
    split_part(value,'\t',2) as field2
    from A
    where dt='20151010';

Window function optimization

If you use window functions in your SQL statement, each window function typically forms a separate Reduce job. Too many window functions consume excessive resources, but in certain scenarios you can optimize them:

  • The content after the over keyword must be identical: the same partitioning and ordering conditions.
  • The window functions must be invoked on the same level of the SQL statement.

Window functions that meet both conditions are merged into a single Reduce implementation. An SQL example is as follows:

    select
    rank() over (partition by A order by B desc) as rank,
    row_number() over (partition by A order by B desc) as row_num
    from MyTable;

Convert the subquery to Join

A subquery is shown as follows:

    SELECT * FROM table_a a WHERE a.col1 IN (SELECT col1 FROM table_b b WHERE xxx);

If the number of col1 values returned by the table_b subquery exceeds 1,000, the system reports an error: records returned from subquery exceeded limit of 1,000. In this case, you can use a Join statement instead:

    SELECT a.* FROM table_a a JOIN (SELECT DISTINCT col1 FROM table_b b WHERE xxx) c ON (a.col1 = c.col1);

NOTE:

  • If the statement contains no Distinct keyword and subquery c returns duplicate col1 values, the number of rows returned for table_a may be multiplied.
  • The Distinct keyword forces the whole subquery to fall into one Worker. If the subquery data is large, the whole query may become slower.
  • If the business logic already guarantees that the col1 values in the subquery are distinct, for example when querying by a primary key field, you can remove the Distinct keyword to improve performance.