AnalyticDB for MySQL is a distributed database. When you write and optimize SQL statements, you must consider the distribution characteristics of data.
When you write and optimize SQL statements for AnalyticDB for MySQL, take note of the following concepts:
- Simple SQL statements: In most cases, database performance declines as the complexity of SQL queries increases. For example, single-table queries on denormalized (redundant) data perform better than multi-table join queries.
- Reduced I/O as the core method of SQL optimization: Reduce the number of columns that are scanned and use index scans to return the minimum amount of data. This reduces I/O operations and memory overheads.
- Distributed computing, local computing, and parallel computing: In big data computing scenarios, local computing makes full use of distributed computing resources and avoids transmitting data across nodes.
- High QPS and partition pruning: Business systems require high queries per second (QPS) and millisecond-level response time (RT). Therefore, design tables and SQL statements so that partition pruning takes effect.
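As a sketch of partition pruning (the orders table, the sale_date partition column, and the values are hypothetical), a query that filters on the partition column lets the engine skip all other partitions and keep response time low:

```sql
-- Hypothetical table orders, partitioned by sale_date.
-- The filter on the partition column prunes all partitions
-- outside the requested day, so only one partition is scanned.
select order_id, order_amount
from orders
where sale_date = '2018-07-20'  -- partition column: enables pruning
  and customer_id = 1001;       -- evaluated only within that partition
```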
Common SQL optimizations
Remove unnecessary columns
AnalyticDB for MySQL provides hybrid row-column storage. The number of columns that are returned affects the SQL performance. When you write SQL statements, we recommend that you specify only the columns that are required by your business. We recommend that you do not use an asterisk (*) to query all columns.
Incorrect syntax in which an asterisk is used and all columns are specified
select * from tab1 where c1>100 and c1<1000;
Correct syntax in which only the required columns are specified
select col1, col2 from tab1 where c1>100 and c1<1000;
Index and scan
When an SQL statement contains multiple filter conditions, you can use indexes for the highly selective conditions and use scans for the remaining conditions.
- AnalyticDB for MySQL provides hybrid row-column storage. You can use a single column to filter data in an efficient manner. Then, you can use internal record pointers to scan other column values. This way, the indexing and query overheads produced by other columns are reduced.
Use internal scans for the time condition
In the following example, the c1 and time columns are used to filter data. The c1=3 condition returns a small number of records, such as 10,000, whereas the time>='2010-01-01 00:00:00' condition returns a large number of records.
select c1,c2 from tab1 where c1=3 and time>='2010-01-01 00:00:00';
To improve query efficiency and return more valid records, you can index data only by c1 and use the internal scan method for the time condition. SQL statement example:
/* +no-index=[tab1.time] */ select c1,c2 from tab1 where c1=3 and time>='2010-01-01 00:00:00';
The no-index hint indicates a forcible scan for the time condition. The computing engine first retrieves the index of the c1 column to obtain the set of rows that meet the c1=3 condition. Then, the engine reads the time column of each of these rows. If the value meets the time>='2010-01-01 00:00:00' condition, the engine adds the row to the result.
Use internal scans for the <> condition
Queries that include a <> condition, such as c2<>100, do not benefit from an index scan because the index cannot filter records for this condition. Example:
select c1,c2 from tab1 where c1=3 and c2<>100;
You can use the no-index hint to perform an internal scan for the <> condition. Example:
/* +no-index=[tab1.c2] */ select c1,c2 from tab1 where c1=3 and c2<>100;
Use internal scans for the like condition
Queries that use an infix or suffix match, such as like '%abc%' or like '%abc', cannot be accelerated by an index scan. You can use the no-index hint to perform an internal scan for the like condition and quickly query valid records. Example:
/* +no-index=[tab1.c3] */ select c1,c2 from tab1 where c1=3 and c3 like '%abc%';
Index failure
If the index used for a query fails, the SQL statement scans all records in the table. If the table contains a large number of records, the query performance degrades.
For example, index failures may occur in the following scenarios:
- A function is applied to a column, such as function(column)
- Type conversion
- A like condition with a leading wildcard, such as like '%abc'
In the following SQL statement, the function conversion causes the index to fail. time is of the TIMESTAMP type, for example, 2017-10-12 10:00:23.
select c1,c2 from tab1 where substr(cast(time as varchar),1,10)='2017-10-12';
Correct SQL statement:
select c1,c2 from tab1 where time>='2017-10-12 00:00:00' and time<='2017-10-12 23:59:59';
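A type conversion can cause a similar index failure. The following is a sketch under the assumption that c1 is a VARCHAR column in tab1:

```sql
-- Index may fail: c1 is assumed to be VARCHAR, but the literal is
-- numeric, so the column values are converted before the comparison.
select c1,c2 from tab1 where c1 = 100;

-- Index can be used: the literal matches the column type.
select c1,c2 from tab1 where c1 = '100';
```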
is not null
You can remove an unnecessary is not null filter condition. Example:
select c1, c2 from tab1 where c1>100 and c1<1000 and c1 is not null;
In this SQL statement, c1 is not null is an unnecessary filter condition because the c1>100 condition already excludes NULL values. Optimized SQL statement:
select c1,c2 from tab1 where c1>100 and c1<1000;
Multi-table joins
- When fact tables are joined with fact tables, include the partition columns in the JOIN conditions. If the JOIN conditions cannot include the partition columns, filter out unnecessary data by using WHERE conditions.
- No limits are imposed when dimension tables are joined with fact tables.
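As a sketch of the fact-to-fact rule (fact_a, fact_b, and the dt partition column are hypothetical names), including the partition column in the JOIN condition lets each node join its local partitions instead of transmitting data across nodes:

```sql
-- Hypothetical fact tables fact_a and fact_b, both partitioned by dt.
select count(*)
from fact_a a
join fact_b b
  on a.id = b.id
 and a.dt = b.dt            -- partition column in the JOIN condition
where a.dt = '2018-07-20';  -- WHERE filter prunes partitions further
```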
In multi-table join queries, specify the filter conditions for each table in the WHERE clause. In traditional databases, indexed fields are used to join tables to quickly search for data. SQL statement:
select count(*) from customer_table C join order_table O on C.customer_id=O.customer_id where O.order_time between '2018-07-20 10:00:11' and '2018-09-30 10:00:11' and O.order_amount=100;
If you confirm that the t1 and t2 tables have the same time and type filter conditions, we recommend that you specify the filter conditions for both tables. Modified SQL statement:
select count(*) from t1 join t2 on t1.id=t2.id where t1.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59' and t1.type=100 and t2.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59' and t2.type=100;