This topic describes the principles and methods commonly used to optimize SQL statements in AnalyticDB for MySQL.

SQL optimization principles

  • Simplify SQL statements

    In most cases, performance declines as the complexity of SQL queries increases. For example, single-table queries (redundant design) have a better performance than table join queries.

  • Reduce I/O operations

    You can use index scans to reduce column scans. Index scans can return the minimum amount of data and reduce the I/O operations and memory overheads.

  • Use local computing

    In big data computing scenarios, local computing avoids the transmission of data across nodes and makes full use of multiple distributed computing resources.

Common optimization methods

  • Remove unnecessary columns

    AnalyticDB for MySQL provides hybrid row-column storage. The number of columns returned affects the SQL performance. When you write SQL statements, we recommend that you specify only the columns required by your business and that you not use an asterisk (*) to execute statements on all columns.

    Incorrect syntax in which an asterisk is used and all columns are specified:

    select * from tab1 where c1>100 and c1<1000;

    Correct syntax in which only the required columns are specified:

    select col1, col2,... from tab1 where c1>100 and c1<1000;
  • Filter data as early as possible

    You can filter data at the early stage to minimize the data size. This can help accelerate the execution of SQL statements.

    select count(*)
    from t1 join t2 on t1.id=t2.id
    where t1.time 
    between '2017-12-10 00:00:00' and '2017-12-10 23:59:59' and t1.type= 100;

    In the preceding SQL statement, the t1 and t2 tables are filtered using the same time and type conditions. We recommend that you modify the SQL statement to the following one:

    select count(*)
    from t1 join t2 on t1.id=t2.id
    where t1.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
    and t1.type= 100
    and t2.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59'
    and t2.type=100
  • Select indexes

    By default, AnalyticDB for MySQL creates indexes for all columns and uses these indexes to scan data. However, in some cases it is more efficient not to use indexes. You can choose whether to use indexes based on data selection rates and computing capabilities.

    You can choose not to use indexes to improve query performance in the following scenarios:
    • Columns with a low selection rate
      When an SQL statement contains multiple query conditions, you can use indexes for advanced filter conditions and use table scans for the remaining conditions. For example, the c1 and time fields are used in the following SQL statement to filter data. When c1=3 is used, a small number of records are returned. However, when time>'2010-01-01 00:00:00' is used, a large number of records are returned. The selection rate based on c1 is higher than that based on time.
      select c1,c2 from tab1 where c1=3 and time >='2010-01-01 00:00:00';

      The following SQL statement uses only the index of the c1 column and uses internal scans for the time condition such as time >='2010-01-01 00:00:00'. The computing engine first retrieves the index of the c1 column to obtain the row set that meets the c1=3 condition. Then, the engine reads data from the time column corresponding to each row. If the time of a row satisfies the time >='2010-01-01 00:00:00' condition, the engine adds this row to the return result.

      /*+ no_index_columns=[tab1.time]*/ select c1,c2 from 
      tab1 where c1=3 and time >='2010-01-01 00:00:00';
    • Not Equal To comparison
      For example, select c1,c2 from tab1 where c1=3 and c2<>100; contains the Not Equal To condition c2<>100 and cannot effectively filter records. You can improve query performance by using internal scans instead of index scans.
      /*+ no_index_columns=[tab1.c2]*/ select c1,c2 from tab1 where c1=3 and c2<>100;
    • Non-prefixed indexes

      If queries contain infixes or suffixes, we recommend that you use internal scans instead of index scans.

      For example, for the select c1,c2 from tab1 where c1=3 and c3 like '%abc'%; statement, you do not need to use index scans. You can add no-index hint to the statement. The final statment is /*+ no_index_columns=[tab1.c3]*/ select c1,c2 from tab1 where c1=3 and c3 like '%abc%';.

  • Indexes and NULL

    NULL is available in AnalyticDB for MySQL databases, which is different from other types of databases. If is null or is not null is used in WHERE conditions, you can use indexes.

  • Prune partitions

    AnalyticDB for MySQL uses a distributed system where data is distributed across multiple shards and supports list partitions based on shards. In the schema design phase, after you determine the rules for sharding and partitioning, we commend that you increase the use of shard and partition fields in SQL statements to filter data and reduce the search scope.

    CREATE TABLE t_fact_mail_status (
    mail_id varchar COMMENT '',
    scan_timestamp timestamp COMMENT '',
    biz_date bigint COMMENT '',
    org_code varchar COMMENT '',
    org_name varchar COMMENT '',
    dlv_person_name varchar COMMENT '',
    receiver_name varchar COMMENT '',
    receiver_phone varchar COMMENT '',
    receiver_addr varchar COMMENT '',
    product_no varchar COMMENT '',
    mag_no varchar COMMENT '',
    op_1_timestamp bigint COMMENT '',
    op_2_timestamp bigint COMMENT '',
    op_3_timestamp bigint COMMENT '',
    op_4_timestamp bigint COMMENT '',
    op_5_timestamp bigint COMMENT '',
    PRIMARY KEY (mail_id,org_code,biz_date)
    )
    DISTRIBUTED BY (org_code) 
    PARTITIONED BY VALUE(biz_date)
    COMMENT '';

    You can use the following SQL statement to enable partition pruning for the preceding table:

    select count(*) as cn from t_fact_mail_status t
    where
    t.org_code = '21111101'and t.biz_date = 20171128
  • Redistribute data

    When multiple fact tables across shards and partitions are joined, you must redistribute data.

    This operation consumes a large amount of resources. Make sure that data can be associated within the same shard or partition. This requires that when fact tables are joined, the ON condition contain a partition column and be an equivalent condition to reduce data redistribution due to misaligned partitions.

FAQ

What will lead to index failures?

  • Function (column)

    Function conversion in SQL statements may lead to index failures. Example: select c1,c2 from tab1 where substr(cast(time as varchar),1,10)='2017-10-12';.

  • Fuzzy match

    Fuzzy match conditions in SQL statements may lead to index failures.

    select c1,c2 
    from tab1
    where time like '2017-10-12%';

    You can modify the preceding SQL statement to the following statement to avoid index failures:

    select c1,c2
    from tab1
    where time>='2017-10-12 00:00:00' and time<='2017-10-12 23:59:59';