Analysis efficiency varies from query to query. The following common optimization methods are provided for your reference:
Avoid running Group By on string columns if possible
Running Group By on strings requires a large number of hash calculations, which usually account for more than 50% of the total calculation.
Query 1: * | select count(1) as pv , date_trunc('hour',__time__) as time group by time
Query 2: * | select count(1) as pv , from_unixtime(__time__-__time__%3600) as time group by __time__-__time__%3600
Both Query 1 and Query 2 calculate the log count for every hour. However, Query 1 converts the time into a string, for example,
2017-12-12 00:00:00, and then runs Group By on that string. Query 2 first calculates the on-the-hour numeric time value, runs Group By on that number, and only then converts the result into a string. Query 1 is less efficient than Query 2 because the former has to hash strings.
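The same pattern extends to other granularities. For example, a minimal sketch at day granularity (86400 seconds), assuming the arithmetic approach of Query 2 carries over unchanged:
* | select count(1) as pv , from_unixtime(__time__-__time__%86400) as time group by __time__-__time__%86400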
List fields with relatively large dictionary values first when running Group By on multiple columns
For example, assume there are 13 provinces and 100 million users.
Fast: * | select province,uid,count(1) group by province,uid
Slow: * | select province,uid,count(1) group by uid,province
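As a hypothetical extension (city is an assumed field that does not appear in the original example, with a cardinality between province and uid), a third column keeps the same relative ordering as the Fast query above:
* | select province, city, uid, count(1) group by province, city, uid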
Use estimating functions when possible
Estimating functions provide much better performance than accurate calculations. Estimation sacrifices some acceptable accuracy in exchange for faster calculation.
Fast: * | select approx_distinct(ip)
Slow: * | select count(distinct(ip))
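As an illustrative sketch that combines the two techniques above (assuming the province and ip fields from the earlier examples), the estimating function also works inside a Group By query:
* | select province, approx_distinct(ip) group by province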
Retrieve only the required columns in SQL and avoid reading all columns if possible
To retrieve the content of all columns, use the query syntax instead. When running SQL, retrieve only the required columns to speed up calculation.
Fast: * | select a,b,c
Slow: * | select *
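A hedged variant of the same idea (method and latency are assumed field names; the word before the pipe is an ordinary full-text query): only the two needed columns are read from the logs that match the filter.
error | select method, latency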
Wrap non-Group By columns in aggregate functions whenever possible
For example, userid and username correspond one to one, so you only need to group by userid.
Fast: * | select userid, arbitrary(username), count(1) group by userid
Slow: * | select userid, username, count(1) group by userid, username
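A minimal sketch extending the Fast query (useremail is an assumed field that, like username, corresponds one to one with userid): every non-Group By column is wrapped in arbitrary() instead of being appended to the Group By list.
* | select userid, arbitrary(username), arbitrary(useremail), count(1) group by userid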