AnalyticDB for MySQL is a distributed database. When you write and optimize SQL statements, you must consider the distribution characteristics of data.

When you write and optimize SQL statements for AnalyticDB for MySQL, take note of the following concepts:

Simple SQL statements
In most cases, database performance declines as the complexity of SQL queries increases. For example, single-table queries on denormalized (redundant) data perform better than queries that join multiple tables.
Reduced I/O as a core method to optimize SQL statements
Reduce the number of columns that are scanned and use index scans to return the minimum amount of data. This reduces I/O operations and memory overheads.
Distributed computing, local computing, and parallel computing
In big data computing scenarios, local computing makes full use of distributed computing resources and avoids transmitting data across nodes.
High QPS and partition pruning
Business systems require high queries per second (QPS) and millisecond-level response time (RT). To meet these requirements, partition pruning must be used in table design and SQL statements.
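The partition pruning concept can be illustrated with a sketch. The following example assumes a hypothetical orders table that uses sale_date as its partition column; only queries that filter on the partition column allow the optimizer to skip irrelevant partitions:

```sql
-- Hypothetical table: orders, partitioned by sale_date.
-- This query filters on the partition column, so the optimizer
-- prunes partitions outside the date range and scans only the
-- partitions that can contain matching rows.
select order_id, order_amount
from orders
where sale_date between '2018-09-01' and '2018-09-07';

-- This query does not filter on the partition column, so every
-- partition must be scanned, which increases I/O and lowers QPS.
select order_id, order_amount
from orders
where order_amount > 100;
```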

Common SQL optimizations

Remove unnecessary columns

AnalyticDB for MySQL provides hybrid row-column storage. The number of columns that are returned affects the SQL performance. When you write SQL statements, we recommend that you specify only the columns that are required by your business. We recommend that you do not use an asterisk (*) to execute statements on all columns.

Not recommended: an asterisk (*) is used to query all columns

  select * from tab1 where c1>100 and c1<1000;

Recommended: only the required columns are specified

  select c1, c2 from tab1 where c1>100 and c1<1000;

Index and scan

When an SQL statement contains multiple query conditions, you can use indexes for the highly selective filter conditions and use scans for the remaining conditions.

Concept
AnalyticDB for MySQL provides hybrid row-column storage. You can use a single column to filter data in an efficient manner. Then, you can use internal record pointers to scan other column values. This way, the indexing and query overheads produced by other columns are reduced.
Examples
  • Use internal scans for the time condition

    In the following SQL statement, the c1 and time columns are used to filter data. The c1=3 condition matches only a small number of records, such as 10,000, whereas the time>='2010-01-01 00:00:00' condition matches a large number of records.

    select c1,c2 from tab1 where c1=3 and time >='2010-01-01 00:00:00';

    To improve query efficiency, you can use the index only on the c1 column and apply the internal scan method to the time condition. SQL statement example:

    /* +no-index=[tab1.time] */
    select c1,c2 from tab1
    where c1=3 and time>='2010-01-01 00:00:00';

    The hint forces a scan for the time>='2010-01-01 00:00:00' condition.

    The computing engine first retrieves the index of the c1 column to obtain the set of rows that meet the c1=3 condition. Then, the engine reads the time column of each row. If the value meets the time>='2010-01-01 00:00:00' condition, the engine adds the row to the result set.

  • Use internal scans for the <> condition

    Index scans are inefficient for queries that include a not-equal (<>) condition, such as c2<>100, because the condition excludes only a small number of records. Example:

    select c1,c2 from tab1 where c1=3 and c2<>100;

    Add the no-index hint to perform an internal scan for the <> condition. Example:

    /* +no-index=[tab1.c2] */
    select c1,c2 from tab1 where c1=3 and c2<>100;
  • Use internal scans for the like condition

    Index scans do not work for queries that use a like condition with a leading wildcard, such as like '%abc' or like '%abc%'.

    Add the no-index hint to perform an internal scan for the like condition to quickly query valid records. Example:

    /* +no-index=[tab1.c3] */
    select c1,c2 from tab1 where c1=3 and c3 like '%abc%';

Index failure

If the index used for a query fails, the SQL statement scans all records in the table. If the table contains a large number of records, the query performance degrades.

For example, index failures may occur in the following scenarios:

  • A function is applied to a column
  • Implicit type conversion
  • A like condition with a leading wildcard, such as like '%abc%'

In the following SQL statement, the function conversion causes the index to fail. Assume that time is of the TIMESTAMP type and stores values such as 2017-10-12 10:00:23.

select c1,c2 from tab1 where substr(cast(time as varchar),1,10)='2017-10-12';

Correct SQL statement:

select c1,c2 from tab1 where time>='2017-10-12 00:00:00' and time<='2017-10-12 23:59:59';
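Implicit type conversion can cause a similar index failure. The following sketch assumes a hypothetical case in which the c2 column is of the VARCHAR type; comparing it with a numeric literal forces a conversion on the column values and prevents the index from being used:

```sql
-- Assume c2 is of the VARCHAR type. Comparing it with a numeric
-- literal implicitly converts c2, so the index on c2 is not used.
select c1,c2 from tab1 where c2=123;

-- Compare c2 with a string literal of the matching type instead,
-- so that the index on c2 can be used.
select c1,c2 from tab1 where c2='123';
```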

is not null

You can remove the unnecessary is not null filter condition.

Example
select c1, c2 from tab1 where c1>100 and c1<1000 and c1 is not null;

In this SQL statement, c1 is not null is an unnecessary filter condition because c1>100 already excludes NULL values. Optimized SQL statement:

select c1, c2 from tab1 where c1>100 and c1<1000;

Multi-table join

  • When fact tables are joined with fact tables, include the partition columns in the JOIN conditions. If the partition columns cannot be included in the JOIN conditions, use WHERE conditions to filter out unnecessary data.
  • No limits are imposed when dimension tables are joined with fact tables.

In multi-table join queries, specify the filter conditions for each table in the WHERE clause whenever possible. In traditional databases, indexed columns are used to join tables and quickly locate data. SQL statement:

select count(*)
from customer_table C
join order_table O on C.customer_id=O.customer_id
where O.order_time between '2018-07-20 10:00:11' and '2018-09-30 10:00:11'
and O.order_amount=100;

In another example, if you confirm that the t1 and t2 tables have the same time and type filter conditions, we recommend that you apply the filter conditions to both tables:

select count(*)
from t1 join t2 on t1.id=t2.id
where t1.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
and t1.type=100
and t2.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
and t2.type=100;