This topic describes the principles and methods commonly used to optimize SQL statements in AnalyticDB for MySQL.
SQL optimization principles
- Simplify SQL statements
In most cases, performance declines as the complexity of SQL queries increases. For example, single-table queries (redundant design) have a better performance than table join queries.
- Reduce I/O operations
You can use index scans to reduce column scans. Index scans can return the minimum amount of data and reduce the I/O operations and memory overheads.
- Use local computing
In big data computing scenarios, local computing avoids the transmission of data across nodes and makes full use of multiple distributed computing resources.
Common optimization methods
- Remove unnecessary columns
AnalyticDB for MySQL provides hybrid row-column storage. The number of columns returned affects the SQL performance. When you write SQL statements, we recommend that you specify only the columns required by your business and that you not use an asterisk (*) to execute statements on all columns.
Incorrect syntax in which an asterisk is used and all columns are specified:
select * from tab1 where c1>100 and c1<1000;
Correct syntax in which only the required columns are specified:
select col1, col2,... from tab1 where c1>100 and c1<1000;
- Filter data as early as possible
You can filter data at the early stage to minimize the data size. This can help accelerate the execution of SQL statements.
select count(*) from t1 join t2 on t1.id=t2.id where t1.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59' and t1.type= 100;
In the preceding SQL statement, the t1 and t2 tables are filtered using the same time and type conditions. We recommend that you modify the SQL statement to the following one:
select count(*) from t1 join t2 on t1.id=t2.id where t1.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59' and t1.type= 100 and t2.time between '2017-12-10 00:00:00' and '2017-12-10 23:59:59' and t2.type=100
- Select indexes
By default, AnalyticDB for MySQL creates indexes for all columns and uses these indexes to scan data. However, in some cases it is more efficient not to use indexes. You can choose whether to use indexes based on data selection rates and computing capabilities.You can choose not to use indexes to improve query performance in the following scenarios:
- Columns with a low selection rate
When an SQL statement contains multiple query conditions, you can use indexes for advanced filter conditions and use table scans for the remaining conditions. For example, the c1 and time fields are used in the following SQL statement to filter data. When
c1=3is used, a small number of records are returned. However, when
time>'2010-01-01 00:00:00'is used, a large number of records are returned. The selection rate based on c1 is higher than that based on time.
select c1,c2 from tab1 where c1=3 and time >='2010-01-01 00:00:00';
The following SQL statement uses only the index of the c1 column and uses internal scans for the time condition such as
time >='2010-01-01 00:00:00'. The computing engine first retrieves the index of the c1 column to obtain the row set that meets the
c1=3condition. Then, the engine reads data from the time column corresponding to each row. If the time of a row satisfies the
time >='2010-01-01 00:00:00'condition, the engine adds this row to the return result.
/*+ no_index_columns=[tab1.time]*/ select c1,c2 from tab1 where c1=3 and time >='2010-01-01 00:00:00';
- Not Equal To comparison
select c1,c2 from tab1 where c1=3 and c2<>100;contains the Not Equal To condition
c2<>100and cannot effectively filter records. You can improve query performance by using internal scans instead of index scans.
/*+ no_index_columns=[tab1.c2]*/ select c1,c2 from tab1 where c1=3 and c2<>100;
- Non-prefixed indexes
If queries contain infixes or suffixes, we recommend that you use internal scans instead of index scans.
For example, for the
select c1,c2 from tab1 where c1=3 and c3 like '%abc'%;statement, you do not need to use index scans. You can add
no-index hintto the statement. The final statment is
/*+ no_index_columns=[tab1.c3]*/ select c1,c2 from tab1 where c1=3 and c3 like '%abc%';.
- Columns with a low selection rate
- Indexes and NULL
NULL is available in AnalyticDB for MySQL databases, which is different from other types of databases. If
is not nullis used in WHERE conditions, you can use indexes.
- Prune partitions
AnalyticDB for MySQL uses a distributed system where data is distributed across multiple shards and supports list partitions based on shards. In the schema design phase, after you determine the rules for sharding and partitioning, we commend that you increase the use of shard and partition fields in SQL statements to filter data and reduce the search scope.
CREATE TABLE t_fact_mail_status ( mail_id varchar COMMENT '', scan_timestamp timestamp COMMENT '', biz_date bigint COMMENT '', org_code varchar COMMENT '', org_name varchar COMMENT '', dlv_person_name varchar COMMENT '', receiver_name varchar COMMENT '', receiver_phone varchar COMMENT '', receiver_addr varchar COMMENT '', product_no varchar COMMENT '', mag_no varchar COMMENT '', op_1_timestamp bigint COMMENT '', op_2_timestamp bigint COMMENT '', op_3_timestamp bigint COMMENT '', op_4_timestamp bigint COMMENT '', op_5_timestamp bigint COMMENT '', PRIMARY KEY (mail_id,org_code,biz_date) ) DISTRIBUTED BY (org_code) PARTITIONED BY VALUE(biz_date) COMMENT '';
You can use the following SQL statement to enable partition pruning for the preceding table:
select count(*) as cn from t_fact_mail_status t where t.org_code = '21111101'and t.biz_date = 20171128
- Redistribute data
When multiple fact tables across shards and partitions are joined, you must redistribute data.
This operation consumes a large amount of resources. Make sure that data can be associated within the same shard or partition. This requires that when fact tables are joined, the ON condition contain a partition column and be an equivalent condition to reduce data redistribution due to misaligned partitions.
What will lead to index failures?
- Function (column)
Function conversion in SQL statements may lead to index failures. Example:
select c1,c2 from tab1 where substr(cast(time as varchar),1,10)='2017-10-12';.
- Fuzzy match
Fuzzy match conditions in SQL statements may lead to index failures.
select c1,c2 from tab1 where time like '2017-10-12%';
You can modify the preceding SQL statement to the following statement to avoid index failures:
select c1,c2 from tab1 where time>='2017-10-12 00:00:00' and time<='2017-10-12 23:59:59';