You can use the APPROX_COUNT_DISTINCT function to improve job performance. Unlike COUNT(DISTINCT), which returns an exact count, this function returns an approximate count. It is suitable when the following conditions are met:
- The input data does not contain retraction messages.
- The data contains a large number of distinct keys, such as unique visitors (UVs). The APPROX_COUNT_DISTINCT function provides little benefit when only a small number of distinct keys exist.
APPROX_COUNT_DISTINCT(col [, accuracy])
- col indicates the name of a field, which can be of any type.
- accuracy specifies the accuracy of the calculation. A larger value gives higher accuracy but also increases state overhead and reduces performance. This parameter is optional. Valid values: greater than 0.0 and less than 1.0. Default value: 0.99.
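The trade-off described for accuracy (higher precision costs more state) can be illustrated outside of SQL. The following Python sketch uses a k-minimum-values estimator, chosen here only for illustration. The source does not state which algorithm APPROX_COUNT_DISTINCT uses internally, and the parameter `k` is a hypothetical stand-in for state size, not the accuracy argument.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a deterministic pseudo-random float in [0, 1]."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values while keeping at most k hashes.

    Illustrative k-minimum-values sketch (an assumption, not the engine's
    implementation). A larger k means more state and a tighter estimate.
    """
    heap = []     # max-heap (negated hashes) holding the k smallest hashes
    kept = set()  # same hashes, for fast duplicate checks
    for v in values:
        h = _hash01(v)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes seen: the count is exact.
        return len(heap)
    # The k-th smallest hash estimates the density of distinct values.
    return int(round((k - 1) / -heap[0]))


if __name__ == "__main__":
    print(approx_count_distinct(range(100_000), k=1024))
    print(approx_count_distinct(range(100_000), k=64))
```

With a larger `k` the sketch retains more hashes and the estimate tightens; with a smaller `k` it uses less memory, and the relative error grows roughly in proportion to 1/sqrt(k). This also shows why an approximate count pays off mainly when the number of distinct keys is large: below `k` distinct values, the sketch stores every hash and saves nothing.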
- Test data
| a (VARCHAR) | b (BIGINT) |
| --- | --- |
| Hi | 1 |
| Hi | 2 |
| Hi | 3 |
| Hi | 4 |
| Hi | 5 |
| Hi | 6 |
- Test statement
SELECT a, APPROX_COUNT_DISTINCT(b) AS b, APPROX_COUNT_DISTINCT(b, 0.9) AS c FROM MyTable GROUP BY a;
- Test results
| a (VARCHAR) | b (BIGINT) | c (BIGINT) |
| --- | --- | --- |
| Hi | 5 | 5 |