APPROX_COUNT_DISTINCT returns an approximate count of distinct values in a column. It trades a small, configurable portion of accuracy for significantly lower state overhead on the aggregate node — making it well-suited for high-cardinality use cases like unique visitor (UV) counting where exact counts are not required.
Version requirement: Ververica Runtime (VVR) 3.0.0 or later.
When to use it
Use APPROX_COUNT_DISTINCT instead of COUNT DISTINCT when both of the following are true:
The column has a large number of distinct keys (e.g., user IDs, session tokens, URLs). The function provides no meaningful performance benefit for low-cardinality columns.
The input stream does not contain retract messages.
If either condition is not met, use COUNT DISTINCT instead.
How it works
COUNT DISTINCT must store every distinct key in state data to produce an exact result. For high-cardinality columns, this causes significant read/write overhead and becomes a performance bottleneck.
APPROX_COUNT_DISTINCT estimates the distinct count without storing every key. This eliminates the state bottleneck and enables miniBatch and local-global optimization on the aggregate node.
The accuracy parameter controls the computation accuracy. A larger value indicates a higher accuracy, which leads to a higher state overhead and weakens the performance advantage over COUNT DISTINCT.
Syntax
APPROX_COUNT_DISTINCT(col [, accuracy])Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
col | All data types | Yes | The column to count distinct values in. |
accuracy | FLOAT | No | The computation accuracy. Valid values: (0.0, 1.0). Default: 0.99. A higher value indicates higher accuracy at the cost of higher state overhead, which reduces the performance advantage over COUNT DISTINCT. |
Example
The following example compares results at the default accuracy (0.99) and a lower accuracy (0.9).
Input table T1
| a (VARCHAR) | b (BIGINT) |
|---|---|
| Hi | 1 |
| Hi | 2 |
| Hi | 3 |
| Hi | 4 |
| Hi | 5 |
| Hi | 6 |
| Hello | 1 |
| Hello | 2 |
| Hello | 3 |
| Hello | 4 |
Query
SELECT
a,
APPROX_COUNT_DISTINCT(b) AS b,
APPROX_COUNT_DISTINCT(b, 0.9) AS c
FROM T1
GROUP BY a;Results
| a (VARCHAR) | b (BIGINT) | c (BIGINT) |
|---|---|---|
| Hi | 6 | 6 |
| Hello | 4 | 4 |
Column b uses the default accuracy of 0.99; column c uses 0.9.