APPX_COUNT_DISTINCT Query Option (Impala 2.0 or higher only)
Allows multiple COUNT(DISTINCT)
operations within a single query, by internally rewriting
each COUNT(DISTINCT)
to use the NDV()
function. The resulting count is
approximate rather than precise.
Type: Boolean; recognized values are 1 and 0, or true
and false
;
any other value interpreted as false
Default: false
(shown as 0 in output of SET
statement)
Examples:
The following examples show how the APPX_COUNT_DISTINCT
lets you work around the restriction
where a query can only evaluate COUNT(DISTINCT col_name)
for a single
column. By default, you can count the distinct values of one column or another, but not both in a single
query:
[localhost:21000] > select count(distinct x) from int_t;
+-------------------+
| count(distinct x) |
+-------------------+
| 10 |
+-------------------+
[localhost:21000] > select count(distinct property) from int_t;
+--------------------------+
| count(distinct property) |
+--------------------------+
| 7 |
+--------------------------+
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
as count(DISTINCT x); deviating function: count(DISTINCT property)
When you enable the APPX_COUNT_DISTINCT
query option, now the query with multiple
COUNT(DISTINCT)
works. The reason this behavior requires a query option is that each
COUNT(DISTINCT)
is rewritten internally to use the NDV()
function instead,
which provides an approximate result rather than a precise count.
[localhost:21000] > set APPX_COUNT_DISTINCT=true;
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
+-------------------+--------------------------+
| count(distinct x) | count(distinct property) |
+-------------------+--------------------------+
| 10 | 7 |
+-------------------+--------------------------+
Related information: