This topic describes how to optimize queries to improve their efficiency.

Increase the number of shards

Shards are computing resources. The calculation speed increases with the number of shards. Make sure that the system scans no more than 50 million data entries per shard. To increase the number of shards, you can split a shard. For more information, see Split a shard.
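For example, if a query needs to scan approximately 200 million data entries within the specified time range (an illustrative figure), the data must be spread across at least four shards, because 200 million / 50 million = 4.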
Notice You are charged for the additional shards that are generated when you split a shard. After you split a shard, only queries on new data are accelerated. Existing data remains in the original shard.

Shorten the time range and reduce the data volume of a query

  • A long time range slows down a query.

    To accelerate calculation, shorten the time range of the query.

  • A large data volume slows down a query.

    We recommend that you reduce the data volume that a query needs to process, as shown in the following example.
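    For example, both of the following statements count the log entries whose status field is 404. The first statement filters the data in the search statement, so the analytic statement processes fewer data entries than in the second statement, which filters the data in the analytic statement. The status field and its value are hypothetical and are used only for illustration:
    • Efficient query statement
      status: 404 | select count(1) as pv
    • Inefficient query statement
      * | select count(1) as pv where status = '404'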

Query data repeatedly

If the result of a query is inaccurate, you can run the same query multiple times. Each time you run the query, the underlying acceleration mechanism continues the analysis based on the existing query results. Therefore, repeated queries can return more accurate results.

Optimize analytic statements

A time-consuming query statement has the following characteristics:
  • The GROUP BY clause is used to group analysis results based on one or more columns of the string type.
  • The GROUP BY clause is used to group analysis results based on more than five columns.
  • Operations that generate strings are included in the analytic statement.
You can use the following methods to optimize analytic statements:
  • Do not include operations that generate strings.
    For example, if you use the date_format function to generate a timestamp in a specified format, the query is inefficient because the function generates custom strings. We recommend that you use the date_trunc function or the time_series function to generate a timestamp instead. Inefficient query statement:
    * | select date_format(from_unixtime(__time__), '%H_%i') as t, count(1) group by t
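    For comparison, a statement of the following form uses the date_trunc function as recommended above. This is an illustrative sketch only; grouping by numeric timestamp values, as described later in this topic, is faster still.
    * | select date_trunc('hour', __time__) as t, count(1) group by t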
  • Do not group analysis results based on one or more columns of the string type.

    If you use the GROUP BY clause to group analysis results based on one or more columns of the string type, a large amount of hash calculation is required. The hash calculation can account for more than 50% of the overall calculation workload. Examples:

    • Efficient query statement
      * | select count(1) as pv, from_unixtime(__time__-__time__%3600) as time group by __time__-__time__%3600
    • Inefficient query statement
      * | select count(1) as pv, date_trunc('hour', __time__) as time group by time

    Both query statements calculate the number of logs per hour. The second statement converts each timestamp to a string, such as 2021-12-12 00:00:00, and then groups the analysis results based on these strings. The first statement truncates each timestamp to the hour as a numeric value, groups the analysis results based on the numeric values, and then converts the values to strings for display.

  • If you want to group analysis results based on multiple columns, we recommend that you place a field that has a larger number of distinct values before a field that has a smaller number of distinct values.
    For example, if the province field has 13 distinct values and the uid field has 100 million distinct values, we recommend that you place the uid field before the province field in the GROUP BY clause. Examples:
    • Efficient query statement
      * | select province,uid,count(1) group by uid,province
    • Inefficient query statement
      * | select province,uid,count(1) group by province,uid
  • Use approximate functions.
    Approximate functions deliver better performance than exact functions because they sacrifice some precision for speed. Examples:
    • Efficient query statement
      * | select approx_distinct(ip)
    • Inefficient query statement
      * | select count(distinct(ip))
  • Specify only the columns that you want to query in an analytic statement.
    In an analytic statement, we recommend that you query only the columns that are required for calculation. If you want to query all columns, use the search syntax instead. Examples:
    • Efficient query statement
      * | select a, b, c
    • Inefficient query statement
      * | select *
  • Place the columns that are not used for grouping in an aggregate function.
    For example, if the values of the userid and username columns correspond to each other one to one, you need to group the analysis results only by the userid column and can obtain the username values by using the arbitrary function. Examples:
    • Efficient query statement
      * | select userid, arbitrary(username), count(1) group by userid
    • Inefficient query statement
      * | select userid, username, count(1) group by userid,username
  • Do not use the IN clause.
    We recommend that you do not use the IN clause in an analytic statement. Instead, use the OR operator in the search statement. Examples:
    • Efficient query statement
      key: a or key: b or key: c | select count(1)
    • Inefficient query statement
      * | select count(1) where key in ('a','b','c')