When you write and optimize SQL statements in AnalyticDB for MySQL, you must consider the distribution characteristics of data. This topic describes how to write and optimize SQL statements.

SQL development rules

When you write SQL statements in AnalyticDB for MySQL, the following rules apply:

  • Write simple SQL statements

    In most cases, database performance declines as the complexity of SQL queries increases. For example, single-table queries against denormalized (redundant) data perform better than queries that join multiple tables.

  • Reduce the number of I/O operations

    Reduce the number of columns that are scanned so that less data is returned. This reduces I/O operations and memory overhead.

  • Use distributed computing

    In big data computing scenarios, perform computations locally whenever possible. Local computing makes full use of distributed computing resources and prevents data from being transmitted across nodes, as shown in the first example after this list.

  • Use partition pruning

    In business systems that require high queries per second (QPS) and millisecond-level response times, design tables and SQL statements so that partition pruning takes effect, as shown in the second example after this list.
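
Example of local (colocated) computing: the following sketch uses illustrative table and column names that are not part of this topic. It assumes that the orders and payments tables were both created with the same distribution key, for example DISTRIBUTED BY HASH(customer_id). Because the join condition uses the distribution key, each node computes its share of the join locally and no data is transmitted across nodes.

  -- Illustrative schema: assumes orders and payments are both distributed by hash on customer_id.
  -- Joining on the distribution key keeps the join local to each node.
  select o.customer_id, count(*) as order_cnt
  from orders o join payments p on o.customer_id = p.customer_id
  group by o.customer_id;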
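
Example of partition pruning: the following sketch also uses an illustrative schema and assumes that tab1 is partitioned by day on a dt column. Because the query filters on the partition column with an equality condition, the engine scans only the matching partition instead of the whole table.

  -- Assumes tab1 is partitioned by day on the dt column (illustrative schema).
  -- The equality filter on dt prunes the scan to a single partition.
  select c1, c2 from tab1 where dt = '2018-07-20' and c1 = 3;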

SQL optimization rules

  • Remove redundant columns

    AnalyticDB for MySQL uses hybrid row-column storage, and the number of returned columns affects SQL performance. When you write SQL statements, specify only the columns that are required for your business. We recommend that you do not use an asterisk (*) to query all columns.

    Examples
    • Syntax not recommended
      select * from tab1 where c1>100 and c1<1000;
    • Recommended syntax
      select c1, c2 from tab1 where c1>100 and c1<1000;
  • Use indexes and scans

    When an SQL statement contains multiple query conditions, use indexes for the highly selective filter conditions and use scans for the remaining conditions.

    AnalyticDB for MySQL uses hybrid row-column storage. You can use the index of a single column to filter data efficiently, and then use internal record pointers to scan the values of the other columns. This reduces the index-based query overhead that the other columns would otherwise produce.

    AnalyticDB for MySQL allows you to disable the use of indexes for specific columns when you perform queries on a cluster. For more information, see Filter conditions without pushdown.

    Examples
    Note The following examples are applicable to engine versions earlier than 3.14. For engine version 3.14 or later, use the filter_not_pushdown_columns hint.
    • Use internal scan for the time condition
      In the following SQL statement, the c1 and time columns are used to filter data. The c1=3 condition matches only a small number of entries (for example, 10,000), whereas the time>='2010-01-01 00:00:00' condition matches a large number of entries.
      select c1,c2 from tab1 where c1=3 and time >='2010-01-01 00:00:00';
      To improve query efficiency and filter out invalid entries earlier, use only the index on the c1 column and use the internal scan method for the time condition. Example:
      /*+ no_index_columns=[tab1.time] */
      select c1,c2 from tab1
      where c1=3 and time>='2010-01-01 00:00:00';

      The hint forces a scan for the time>='2010-01-01 00:00:00' condition.

      The computing engine first retrieves the index of the c1 column to obtain the set of rows that meet the c1=3 condition. Then, the engine reads the time column of each of these rows. If the value of the time column meets the time>='2010-01-01 00:00:00' condition, the engine adds the row to the result.

    • Use internal scan for the <> condition
      Index scans cannot efficiently filter out entries for not-equal (<>) conditions such as c2<>100. Example:
      select c1,c2 from tab1 where c1=3 and c2<>100;
      Add the no_index_columns hint to perform an internal scan for the <> condition. Example:
      /*+ no_index_columns=[tab1.c2] */
      select c1,c2 from tab1 where c1=3 and c2<>100;
    • Use internal scan for the like condition

      This applies to LIKE conditions that match a suffix or an infix, such as like '%abc' or like '%abc%'. Index scans cannot filter data efficiently for these conditions.

      Add the no_index_columns hint to perform an internal scan for the like condition and speed up the retrieval of valid entries. Example:
      /*+ no_index_columns=[tab1.c3] */
      select c1,c2 from tab1 where c1=3 and c3 like '%abc%';
  • Index failures

    If an index cannot be used for a query (index failure), the SQL statement scans all entries in the table. If the table contains a large number of entries, query performance degrades.

    For example, an index failure may occur in the following scenarios:
    • A function is applied to an indexed column, such as function(column)
    • An implicit or explicit type conversion is applied to a column (see the type conversion example below)
    • A LIKE condition with a leading wildcard, such as like '%abc%'
    Examples
    • In the following SQL statement, the function conversion on the time column causes the index to fail. The time column is of the TIMESTAMP type, and the stored value is 2017-12-10 10:00:23.
      select c1,c2 from tab1 where substr(cast(time as varchar),1,10)='2017-12-10';
    • Recommended syntax
      select c1,c2 from tab1 where time>='2017-12-10 00:00:00' and time<='2017-12-10 23:59:59';
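    • Type conversion example (a sketch that assumes c1 is a VARCHAR column, which is not stated in the original schema). Comparing c1 with a numeric literal forces an implicit type conversion on every row, so the index on c1 cannot be used:
      select c1,c2 from tab1 where c1=123;
      Compare against a literal of the matching type instead so that the index remains usable:
      select c1,c2 from tab1 where c1='123';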
  • Remove unnecessary is not null filter conditions
    Examples
    • Syntax not recommended
      select c1, c2 from tab1 where c1>100 and c1<1000 and c1 is not null;
    • In the preceding SQL statement, c1 is not null is a redundant filter condition. The following SQL statement is an optimized one:
      select c1, c2 from tab1 where c1>100 and c1<1000;
  • Multi-table join

    Different SQL optimization rules are used in different multi-table join scenarios.

    • When a fact table is joined with another fact table, include the partition columns of both tables in the JOIN conditions, as shown in the last example of this topic. If that is not possible, filter out unnecessary data by using WHERE conditions.
    • No limits are imposed when dimension tables are joined with fact tables.
    In multi-table join queries, specify the filter conditions of each table in the WHERE clause. In traditional databases, indexed fields are used to join tables and search for data. Example:
    select count(*)
    from t1 C join t2 O on C.t1_id = O.t1_id
    where O.t2_time between '2018-07-20 10:00:11' and '2018-09-30 10:00:11'
    and O.t2_amount = 100;

    If the t1 and t2 tables have the same time and type filter conditions, we recommend that you specify the conditions for both tables, as shown in the following statement:

    select count(*)
    from t1 join t2 on t1.id = t2.id
    where t1.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
    and t1.type = 100
    and t2.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
    and t2.type = 100;
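
    Example of a join condition on partition columns: the following sketch uses assumed table and column names that do not appear in this topic. It assumes that fact_a and fact_b are fact tables that are both partitioned by day on a dt column. Including dt in the JOIN condition allows the engine to join only the matching partitions instead of the full tables.

    select count(*)
    from fact_a a join fact_b b
    on a.id = b.id and a.dt = b.dt  -- join on the partition column of both fact tables
    where a.dt between '2018-07-20' and '2018-09-30';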