This topic provides answers to some frequently asked questions about data query language (DQL) operations in MaxCompute.

This topic covers FAQ in the following categories:

  • GROUP BY
  • ORDER BY
  • Subquery
  • INTERSECT, UNION, EXCEPT, and MINUS
  • JOIN
  • MAPJOIN
  • Others

What do I do if the "Repeated key in GROUP BY" error message appears when I execute a MaxCompute SQL statement?

  • Problem description

    When I execute a MaxCompute SQL statement, the following error message appears:

    FAILED: ODPS-0130071:Semantic analysis exception - Repeated key in GROUP BY. 
  • Cause

    SELECT DISTINCT cannot be followed by a constant.

  • Solution

    Split the execution logic of the SQL statement into two layers. This way, the DISTINCT logic without constants is processed at the inner layer, and constant data is added to the outer layer.
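The two-layer rewrite can be illustrated with a minimal sketch. The sketch runs against an in-memory SQLite database purely to show the pattern; the table name `t`, the column `shop_name`, and the constant `'2023'` are hypothetical, and SQLite itself does not raise the MaxCompute error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (shop_name text)")
conn.executemany("insert into t values (?)", [("a",), ("a",), ("b",)])

# Inner layer: DISTINCT without any constant.
# Outer layer: attach the constant column to the deduplicated rows.
rows = conn.execute(
    "select '2023' as ds, shop_name "
    "from (select distinct shop_name from t) x "
    "order by shop_name"
).fetchall()
print(rows)  # [('2023', 'a'), ('2023', 'b')]
```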

What do I do if the "Expression not in GROUP BY key" error message appears when I execute a MaxCompute SQL statement?

  • Problem description
    When I execute a MaxCompute SQL statement, the following error message appears:
    FAILED: ODPS-0130071:Semantic analysis exception - Expression not in GROUP BY key : line 1:xx 'xxx'
  • Cause

    Columns that are not specified in the GROUP BY clause cannot be directly referenced. For more information, see GROUP BY (col_list).

  • Solution

    Modify the SQL statement to ensure that the columns that are queried by using SELECT are the columns that are specified in the GROUP BY clause or the columns that are processed by using aggregate functions, such as SUM or COUNT.
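A minimal sketch of a statement that satisfies this rule, using an in-memory SQLite database with hypothetical table and column names (note that SQLite is laxer than MaxCompute and does not reject bare non-grouped columns, so only the valid form is shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (region text, amount int)")
conn.executemany("insert into orders values (?, ?)",
                 [("east", 10), ("east", 20), ("west", 5)])

# Every selected column is either listed in GROUP BY (region)
# or wrapped in an aggregate function (sum, count).
rows = conn.execute(
    "select region, sum(amount), count(*) "
    "from orders group by region order by region"
).fetchall()
print(rows)  # [('east', 30, 2), ('west', 5, 1)]
```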

Table B is generated after I execute GROUP BY on Table A. The number of rows in Table B is less than the number of rows in Table A, but the physical storage capacity of Table B is ten times the physical storage capacity of Table A. Why?

In MaxCompute, data is stored in columnar compression mode. If the data in adjacent columns is similar, the data compression ratio is high. If odps.sql.groupby.skewindata is set to true, data is scattered when you execute an SQL statement to write data. In this case, the data compression ratio is low. If you want to obtain a high data compression ratio, sort the data by columns whose values are similar when you execute the SQL statement that writes the data.

Is query performance affected when I use GROUP BY to query 10 billion data records? Is the data amount limited when I use GROUP BY to query data?

No, query performance is not affected when you use GROUP BY to query 10 billion data records. The data amount is not limited when you use GROUP BY to query data. For more information about GROUP BY, see GROUP BY (col_list).

After I query data in MaxCompute, how are the query results sorted?

The data of MaxCompute tables is arranged in random order. If you do not configure order settings, the data is also returned in random order.

If you want to obtain sorted data, configure order settings for the data. For example, you can specify order by xx limit n in the SQL statement to sort data.

If you want to sort full data, set n after limit to the value of the total number of data records + 1.

Note If you sort large amounts of full data, query performance is significantly affected and memory overflow may occur. We recommend that you do not perform this operation.

Does MaxCompute support ORDER BY FIELD NULLS LAST?

No, MaxCompute does not support ORDER BY FIELD NULLS LAST. For more information about the syntax supported by MaxCompute, see Differences in the support for SQL statements.

What do I do if the "ORDER BY must be used with a LIMIT clause" error message appears when I execute a MaxCompute SQL statement?

  • Problem description
    When I execute a MaxCompute SQL statement, the following error message appears:
    FAILED: ODPS-0130071:[1,27] Semantic analysis exception - ORDER BY must be used with a LIMIT clause, please set odps.sql.validate.orderby.limit=false to use it.
  • Cause

    The ORDER BY clause needs to sort all data of a single node. By default, the ORDER BY clause is used with the LIMIT clause to prevent a single node from processing large amounts of data.

  • Solution
    You can remove the limit on the simultaneous execution of the ORDER BY and LIMIT clauses for a project or session.
    • To remove the limit for a project, run the setproject odps.sql.validate.orderby.limit=false; command.
    • To remove the limit for a session, commit and run the set odps.sql.validate.orderby.limit=false; command with the SQL statement that you want to execute.
      Note If a single node has large amounts of data to sort after you remove the limit, the sorting consumes a large number of resources and a long period of time.

For more information about ORDER BY, see ORDER BY (order_condition).

When I execute a MaxCompute SQL statement with NOT IN that is followed by a subquery, the subquery is expected to return tens of thousands of data records. However, if a subquery that follows IN or NOT IN returns partition data, the maximum number of data records that can be returned is 1,000. How do I ensure that the subquery returns the expected number of data records and the logic of NOT IN is implemented?

You can use LEFT OUTER JOIN to query data.
select * from a where a.ds not in (select ds from b);
Replace the preceding statement with the following statement: 
select a.* from a left outer join (select distinct ds from b) bb on a.ds=bb.ds where bb.ds is null;              
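The equivalence of the two statements can be checked with a small sketch against an in-memory SQLite database (the data is hypothetical). One caveat worth noting: if the subquery column contains NULL values, NOT IN returns no rows at all, whereas the join rewrite still returns the unmatched rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table a (ds text)")
conn.execute("create table b (ds text)")
conn.executemany("insert into a values (?)", [("1",), ("2",), ("3",)])
conn.executemany("insert into b values (?)", [("2",), ("2",)])

# Original NOT IN form.
not_in = conn.execute(
    "select ds from a where ds not in (select ds from b) order by ds"
).fetchall()

# LEFT OUTER JOIN rewrite: keep only the rows with no match in b.
rewrite = conn.execute(
    "select a.ds from a "
    "left outer join (select distinct ds from b) bb on a.ds = bb.ds "
    "where bb.ds is null order by a.ds"
).fetchall()

print(not_in, rewrite)  # both [('1',), ('3',)]
```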

How do I merge two tables that are not associated with each other?

You can perform UNION ALL operations to complete vertical merging. If you want to implement horizontal merging, you can use the row_number function to add an ID column to both tables. Then, associate the tables by using the ID columns and read the fields of the tables. For more information, see UNION or ROW_NUMBER.
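The horizontal-merge idea can be sketched as follows against an in-memory SQLite database (the tables and the ordering columns are hypothetical; SQLite 3.25 or later is assumed for window-function support). Each table gets a synthetic rn column from row_number, and the two tables are then joined on rn:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (name text)")
conn.execute("create table t2 (score int)")
conn.executemany("insert into t1 values (?)", [("a",), ("b",)])
conn.executemany("insert into t2 values (?)", [(90,), (80,)])

# Number the rows of each table, then join the two numberings.
rows = conn.execute(
    "select x.name, y.score from "
    "(select name, row_number() over (order by name) rn from t1) x "
    "join "
    "(select score, row_number() over (order by score desc) rn from t2) y "
    "on x.rn = y.rn order by x.rn"
).fetchall()
print(rows)  # [('a', 90), ('b', 80)]
```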

What do I do if the "ValidateJsonSize error" error message appears when I perform UNION ALL operations?

  • Problem description
    When I execute the SQL statement select count(1) as co from client_table union all ... that contains 200 UNION ALL operations, the following error message appears:
    FAILED: build/release64/task/fuxiWrapper.cpp(344): ExceptionBase: Submit fuxi Job failed, {
        "ErrCode": "RPC_FAILED_REPLY",
        "ErrMsg": "exception: ExceptionBase:build/release64/fuxi/fuximaster/fuxi_master.cpp(1018): ExceptionBase: StartAppFail: ExceptionBase:build/release64/fuxi/fuximaster/app_master_mgr.cpp(706): ExceptionBase: ValidateJsonSize error: the size of compressed plan is larger than 1024KB\nStack      
  • Causes
    • Cause 1: After an SQL statement is converted into an execution plan, the length of the execution plan exceeds 1024 KB, which is the maximum size allowed by the underlying architecture. As a result, an SQL execution error is returned. The length of the execution plan is not directly related to the length of the SQL statement. Therefore, the length of the execution plan cannot be estimated.
    • Cause 2: The length of the execution plan exceeds the limit due to a large number of partitions.
    • Cause 3: The SQL statement fails to run due to excessive small files.
  • Solutions
    • Solution to Cause 1: If an SQL statement is excessively long, we recommend that you split the statement into multiple statements. This prevents the generated execution plan from exceeding the maximum length.
    • Solution to Cause 2: If a large number of partitions exist, adjust the number of partitions. For more information, see Partition.
    • Solution to Cause 3: If excessive small files exist, merge the small files.

What do I do if the "Both left and right aliases encountered in JOIN" error message appears when I perform a JOIN operation?

  • Problem description

    When I execute a MaxCompute SQL statement, the following error message appears:

    FAILED: ODPS-0130071:Semantic analysis exception - Both left and right aliases encountered in JOIN : line 3:3 'xx'. If you really want to perform this join, try mapjoin
  • Causes
    • Cause 1: The ON condition in the SQL statement includes a non-equi join, such as table1.c1>table2.c3.
    • Cause 2: The data on one side of the JOIN condition in the SQL statement comes from two tables, such as table1.col1 = concat(table1.col2,table2.col3).
  • Solutions
    • Solution to Cause 1: Change the non-equi join in the ON condition in the SQL statement to an equi-join.
      Note If you must use a non-equi join, you can add a MAPJOIN hint to the SQL statement. For more information, see ODPS-0130071.
    • Solution to Cause 2: If one of the tables is small, use the MAPJOIN method.

What do I do if the "Maximum 16 join inputs allowed" error message appears when I perform a JOIN operation?

  • Problem description

    When I execute a MaxCompute SQL statement, the following error message appears:

    FAILED: ODPS-0123065:Join exception - Maximum 16 join inputs allowed
  • Cause

    A MaxCompute SQL statement can perform MAPJOIN on a maximum of six small tables and can consecutively join a maximum of 16 tables.

  • Solution

    Join several small tables into a temporary table as an input table to reduce the number of input tables.

What do I do if the number of returned data records is greater than the number of data records in one of the source tables after I perform a JOIN operation?

  • Problem description

    After I execute the following MaxCompute SQL statement, the number of returned data records is greater than the number of data records in the table1 table.

    select count(*) from table1 a left outer join table2 b on a.ID = b.ID;
  • Cause
    In the preceding SQL statement, a left outer join is performed on the ID fields of table1 and table2. Therefore, the following situations may occur:
    • If the data that you want to join cannot be found in table2, table1 still returns a data record.
    • If the data that you want to join cannot be found in table1 but can be found in table2, no data is returned.
    • If a row can be matched in both table1 and table2, the join logic is the same as that of an inner join. If an ID value appears in multiple rows of both table1 and table2, the returned result is the Cartesian product of the matching rows.
    The following table provides sample data in table1.
    +----+--------+
    | id | values |
    +----+--------+
    | 1  | a      |
    | 1  | b      |
    | 2  | c      |
    +----+--------+
    The following table provides sample data in table2.
    +----+--------+
    | id | values |
    +----+--------+
    | 1  | A      |
    | 1  | B      |
    | 3  | D      |
    +----+--------+
    The following table lists the returned results after select count(*) from table1 a left outer join table2 b on a.ID = b.ID; is executed.
    +-----+---------+------+---------+
    | id1 | values1 | id2  | values2 |
    +-----+---------+------+---------+
    | 1   | b       | 1    | B       |
    | 1   | b       | 1    | A       |
    | 1   | a       | 1    | B       |
    | 1   | a       | 1    | A       |
    | 2   | c       | NULL | NULL    |
    +-----+---------+------+---------+
    • Both tables have data whose value of the ID field is 1. Therefore, the Cartesian product operation is performed and four data records are returned.
    • Only table1 has data whose value of the ID field is 2. Therefore, one data record is returned.
    • Only table2 has data whose value of the ID field is 3. Therefore, no data is returned.
  • Solution

    Check whether the increase in the number of data records is caused by the data in table2. The following statement shows an example. In this statement, limit 10 is added to prevent data from flooding your screen if table2 has a large amount of data. You need only to check the first few data records to determine the reason for the increase in the number of data records.

    select id, count(*) as cnt from table2 group by id having cnt>1 limit 10;
    If you do not want to perform the Cartesian product operation when duplicate data exists and you want to achieve the effect that is equivalent to IN in SQL, you can use the following statement:
    select * from table1 a left outer join (select distinct id from table2) b on a.id = b.id;
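Using the sample tables above, the row multiplication and the DISTINCT fix can be reproduced with an in-memory SQLite database (SQLite is used here only to illustrate standard join semantics):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (id int, v text)")
conn.execute("create table t2 (id int, v text)")
conn.executemany("insert into t1 values (?, ?)", [(1, "a"), (1, "b"), (2, "c")])
conn.executemany("insert into t2 values (?, ?)", [(1, "A"), (1, "B"), (3, "D")])

# Duplicate keys on both sides multiply: 2 x 2 rows for id=1,
# plus the unmatched id=2 row -> 5 rows, more than t1's 3 rows.
plain = conn.execute(
    "select count(*) from t1 a left outer join t2 b on a.id = b.id"
).fetchone()[0]

# Deduplicating the right side restores one output row per t1 row.
dedup = conn.execute(
    "select count(*) from t1 a "
    "left outer join (select distinct id from t2) b on a.id = b.id"
).fetchone()[0]

print(plain, dedup)  # 5 3
```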

I specify the partition condition when I perform a JOIN operation, but the system prompts that full table scan is prohibited. Why?

  • Problem description
    When I execute the following statement in two projects, the statement is successfully executed only in one of the projects.
    select t.stat_date 
    from fddev.tmp_001 t  
    left outer join (select '20180830' as ds from fddev.dual ) t1 
    on t.ds = 20180830
    group by t.stat_date; 
    The following error message appears:
    Table(fddev,tmp_001) is full scan with all partitions,please specify partitions predicates.
  • Cause

    To specify a partition in a SELECT statement, you must use a WHERE clause. The ON condition in your SELECT statement does not comply with the SQL standard.

    The set odps.sql.outerjoin.supports.filters=false command is run in the project in which the statement is successfully executed. This configuration converts the conditions in the ON clause into filter conditions to allow non-standard SQL statements. It is compatible with the Hive syntax but does not comply with the SQL standard.

  • Solution

    We recommend that you place the partition filter conditions in a WHERE clause.

When I perform a JOIN operation, does partition pruning take effect if the partition pruning condition is specified in an ON clause or a WHERE clause?

Partition pruning takes effect if the partition pruning condition is specified in a WHERE clause. If the partition pruning condition is specified in an ON clause, partition pruning takes effect on the secondary table. Partition pruning does not take effect on the primary table. Therefore, a full table scan is triggered. For more information about partition pruning, see Check whether partition pruning is effective.

How do I use MAPJOIN to cache multiple small tables?

You can specify the aliases of the tables that you want to cache in the MAPJOIN statement.

For example, a table named iris exists in a project. The table has the following data:
+--------------+--------+-------+---------+
| Field        | Type   | Label | Comment |
+--------------+--------+-------+---------+
| sepal_length | double |       |         |
| sepal_width  | double |       |         |
| petal_length | double |       |         |
| petal_width  | double |       |         |
| category     | string |       |         |
+--------------+--------+-------+---------+
The following sample code shows how to use MAPJOIN to cache small tables.
select 
  /*+ mapjoin(b,c) */
  a.category,
  b.cnt as cnt_category,
  c.cnt as cnt_all
from iris a
join
(
  select count(*) as cnt,category from iris group by category
) b
on a.category = b.category
join 
(
  select count(*) as cnt from iris
) c;              

Can I exchange the large table and small tables that are specified in the MAPJOIN statement?

The large table and small tables in the MAPJOIN statement are distinguished based on the size of the space that is used by each table.

The system loads all data in the specified small tables to the memory of the program that performs the JOIN operation. This helps accelerate the execution of the JOIN operation. If you exchange the large table and small tables in the MAPJOIN statement, no error is returned but the processing performance decreases.

After I configure filter conditions in a MaxCompute SQL statement, an error message indicating that the size of the input data exceeds 100 GB appears. What do I do?

Place the partition filter conditions so that partitions are pruned before the data is read, and filter the non-partition fields after the data is read. The size of the input data is measured after the partitions are filtered but before the non-partition fields are filtered.

Does the WHERE condition of a fuzzy query in MaxCompute SQL support regular expressions?

Yes, the WHERE condition of a fuzzy query in MaxCompute SQL supports regular expressions. For example, select * from user_info where address rlike '[0-9]{9}'; returns the rows whose address values contain nine consecutive digits.

If I want to synchronize only 100 data records, how do I use LIMIT to specify the number of data records that I want to synchronize in a WHERE clause?

LIMIT cannot be used in a WHERE clause. You can execute an SQL statement to read 100 data records before you synchronize data.

How can I improve the query efficiency? Can I adjust the partition settings?

If you use partition fields to partition a table, a full table scan is not triggered when partitions are added or when partition data is updated or read. This improves the efficiency of data processing. For more information, see Table operations and MaxCompute partition configuration and usage.

Does MaxCompute SQL support the WITH AS statement?

Yes, MaxCompute SQL supports the WITH AS statement. MaxCompute supports SQL-compliant common table expressions (CTEs) to improve the readability and execution efficiency of SQL statements. For more information, see Common table expressions.
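A minimal CTE sketch, run against an in-memory SQLite database with hypothetical table and column names, shows the pattern: the expression is defined once after WITH and then referenced by name in the main query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table sales (region text, amount int)")
conn.executemany("insert into sales values (?, ?)",
                 [("east", 10), ("east", 20), ("west", 5)])

# The CTE `totals` is defined once and referenced by name below.
rows = conn.execute(
    "with totals as ("
    "  select region, sum(amount) as total from sales group by region"
    ") "
    "select region, total from totals where total > 10 order by region"
).fetchall()
print(rows)  # [('east', 30)]
```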

How do I split one row of data into multiple rows?

You can use Lateral View with table generation functions, such as SPLIT and EXPLODE, to split one row of data into multiple rows of data and aggregate the split data.
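What the SPLIT-then-EXPLODE combination computes can be sketched in plain Python (the data is hypothetical; this is an illustration of the semantics, not MaxCompute itself): each input row that carries a delimited list produces one output row per element, paired with the original key.

```python
# One row per (key, comma-separated list) pair.
rows = [("user1", "a,b,c"), ("user2", "d")]

# "Explode" the list column: one output row per element.
exploded = [
    (key, item)
    for key, values in rows
    for item in values.split(",")
]
print(exploded)
# [('user1', 'a'), ('user1', 'b'), ('user1', 'c'), ('user2', 'd')]
```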

After I specify use_instance_tunnel=false and instance_tunnel_max_record=10 in the odps_config.ini file of the client, the SELECT statement still generates a large number of output records. Why?

To use instance_tunnel_max_record to control the number of output records, you must change use_instance_tunnel=false to use_instance_tunnel=true.

How do I use a regular expression to determine whether the values of a field are in Chinese?

The following statement shows an example.
select 'Field name' rlike '[\\x{4e00}-\\x{9fa5}]+';
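The same check can be reproduced in Python: the \u4e00-\u9fa5 range covers the common CJK unified ideographs, which is the range that the rlike pattern above uses.

```python
import re

# Matches any run of common CJK unified ideographs.
pattern = re.compile(r"[\u4e00-\u9fa5]+")

print(bool(pattern.search("数据")))   # True: contains Chinese characters
print(bool(pattern.search("data")))  # False: ASCII only
```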