Write and optimize SQL statements - AnalyticDB - Alibaba Cloud Documentation Center

When you write and optimize SQL statements in AnalyticDB for MySQL, you must consider the distribution characteristics of data. This topic describes how to write and optimize SQL statements.

SQL development rules

When you write SQL statements in AnalyticDB for MySQL, the following rules apply:

Write simple SQL statements
In most cases, the performance of databases declines when the complexity of SQL queries increases. For example, databases that use single-table queries (redundant data) have better performance than databases that use table join queries.
Reduce the number of I/O operations
You can reduce the number of column scans to return less data and reduce the number of I/O operations and memory overheads.
Use distributed computing
In big data computing scenarios, local computing fully uses distributed computing resources to prevent data from being transmitted across nodes.
Use partition pruning
In business systems that require high queries per second (QPS) performance and a response time of milliseconds, partition pruning must be used for tables and SQL statements.

SQL optimization rules

Remove redundant columns
AnalyticDB for MySQL provides hybrid row-column storage. The number of columns returned affects the SQL performance. When you write SQL statements, we recommend that you specify only columns that are required for your business. We recommend that you do not use an asterisk (*) to execute statements on all columns.
Examples
- Syntax not recommended
```
select * from tab1 where c1>100 and c1<1000;
```
- Recommended syntax
```
select col1, col2 from table_name where c1>100 and c1<1000;
```
Use indexes and scans
When an SQL statement contains multiple query conditions, you can use indexes for advanced filter conditions and use scans for the remaining conditions.
AnalyticDB for MySQL provides hybrid row-column storage. You can use a single column to filter data in an efficient manner. Then, you can use internal record pointers to scan other column values. This can reduce the overheads of index-based queries on other columns.
AnalyticDB for MySQL allows you to disable the use of indexes for specific columns when you perform queries on a cluster. For more information, see Filter conditions without pushdown.
Examples
Note
The following examples are applicable to engine versions earlier than 3.14. For engine version 3.14 or later, use the filter_not_pushdown_columns hint.
- Use internal scan for the time condition
  In the following SQL statement, the c1 and time fields are used to filter data. If c1=3 is used, a small number of entries such as 10,000 entries are returned. However, if time>'2010-01-01 00:00:00' is used, a large number of entries are returned.
```
select c1,c2 from tab1 where c1=3 and time >='2010-01-01 00:00:00';
```
  To improve query efficiency and to return more valid entries, you can use only c1 to index data and use the internal scan method for the time condition. Example:
```
/*+ no_index_columns=[tab1.time] */
select c1,c2 from tab1
where c1=3 and time>='2010-01-01 00:00:00';
```
  The hint forces a scan for the time>='2010-01-01 00:00:00' condition.
  The computing engine first retrieves the index of the c1 column to obtain the row set that meets the c1=3 condition. Then, the engine reads data from the time column of each row. If the time in the time column of a row meets the time >='2010-01-01 00:00:00' condition, the engine adds this row to the result.
- Use internal scan for the <> condition
  Queries that include the <> condition such as c2<>100 cannot filter out invalid entries when index scan is used. Example:
```
select c1,c2 from tab1 where c1=3 and c2<>100;
```
  Add the no_index_columns hint to perform internal scan for the <> condition. Example:
```
/*+ no_index_columns=[tab1.c2] */
select c1,c2 from tab1 where c1=3 and c2<>100;
```
- Use internal scan for the <> condition
  The like condition is used in queries with a suffix or infix, such as like '%abc' or like '%abc%'.
  Add the no_index_columns hint to perform an internal scan for the like condition to speed up the retrieval of valid entries. Example:
```
/*+ no_index_columns=[tab1.c3] */
select c1,c2 from tab1 where c1=3 and c3 like '%abc%';
```
Index failures
If the index used for a query fails, the SQL statement scans all entries in the table. If the table contains a large number of entries, the query performance degrades.
For example, an index failure may occur in the following scenarios:
- Function conversion (on columns)
- Type conversion
- Like condition such as like '%abc%'
Examples
- In the following SQL statement, function conversion causes the index to fail. The time column is of the TIMESTAMP type and contains the value 2017-12-10 10:00:23.
```
select c1,c2 from tab1 where substr(cast(time as varchar),1,10)='2017-12-10';
```
- Recommended syntax
```
select c1,c2 from tab1 where time>='2017-12-10 00:00:00' and time<='2017-12-10 23:59:59';
```
Remove unnecessary is not null filter conditions
Examples
- Syntax not recommended
```
Select c1, c2 from tab1 where c1>100 and c1<1000 and c1 is not null;
```
- In the preceding SQL statement, c1 is not null is a redundant filter condition. The following SQL statement is an optimized one:
```
Select c1,c2 from tab1 where c1>100 and c1<1000;
```
Multi-table join
Different SQL optimization rules are used in different multi-table join scenarios.
- Include JOIN conditions of partition columns in the fact table that is joined with another fact table. Otherwise, filter out unnecessary data by using the WHERE condition.
- No limits are imposed when replicated tables are joined with fact tables.
In multi-table join queries, you must specify the filter conditions for each table in the WHERE condition. In traditional databases, the index field is used to join tables to search for data. Example:
```
Select count(*) 
from t1 C join 
t2 O on C.t1_id= O.t1_id
where O.t2_time between'2018-07-20 10:00:11' 
and '2018-09-30 10:00:11' 
and O.t2_amount=100;
```
If the t2 and t1 tables have the same time and type filters, we recommend that you modify the SQL statement to the following one:
```
Select count(*)
from t1 join t2 on t1.id=t2.id
where t1.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
and t1.type= 100
and t2.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
and t2.type=100
```