You can optimize your job performance by using the APPROX_COUNT_DISTINCT function. Compared with COUNT(DISTINCT), this function returns an approximate count.

Background information

When the COUNT(DISTINCT) function is used, distinct key information is saved in state data of the aggregate node. If a large number of distinct keys exist, the read/write overhead of state data is large. This causes a bottleneck in job performance optimization. In many cases, accurate computation is not necessary. If you are willing to achieve high job performance at the expense of accuracy, you can use the APPROX_COUNT_DISTINCT function. APPROX_COUNT_DISTINCT supports miniBatch and local-global optimization on the aggregate node. When you use this function, make sure that the following requirements are met:
  • The input data does not contain retracted messages.
  • A large number of distinct keys, such as unique visits (UVs), exist. The APPROX_COUNT_DISTINCT function cannot bring obvious benefits if only a small number of distinct keys exist.

Optimization method

Use APPROX_COUNT_DISTINCT(user) to replace COUNT(DISTINCT user) in the SQL. The syntax of APPROX_COUNT_DISTINCT(user) is:
APPROX_COUNT_DISTINCT(col [, accuracy])
where:
  • col indicates the name of a field, which can be of any type.
  • accuracy specifies the calculation accuracy. A larger value indicates higher accuracy, higher state overhead, and lower performance. This field is optional. Valid values: (0.0, 1.0). Default value: 0.99.

Sample code

  • Test data
    a (VARCHAR) c (BIGINT)
    Hi 1
    Hi 2
    Hi 3
    Hi 4
    Hi 5
    Hi 6
  • Test statement
    SELECT 
      a,
      APPROX_COUNT_DISTINCT(b) as b,
      APPROX_COUNT_DISTINCT(b, 0.9) as c
    FROM MyTable
    GROUP BY a;
  • Test results
    a (VARCHAR) b (BIGINT) c (BIGINT)
    Hi 5 5