DISTINCT Operator
The DISTINCT
operator in a SELECT
statement filters the result set to
remove duplicates:
-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from customer;
-- Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from customer;
You can use DISTINCT
in combination with an aggregation function, typically
COUNT()
, to find how many different values a column contains:
-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
select count(distinct c_birth_country) from customer;
-- Counts the unique combinations of values from multiple columns.
select count(distinct c_salutation, c_last_name) from customer;
One construct that Impala SQL does not support is using DISTINCT
in more than one
aggregation function in the same query. For example, you could not have a single query with both
COUNT(DISTINCT c_first_name)
and COUNT(DISTINCT c_last_name)
in the
SELECT
list.
Zero-length strings: For purposes of clauses such as DISTINCT
and GROUP
BY
, Impala considers zero-length strings (""
), NULL
, and space
to all be different values.
By default, Impala only allows a single COUNT(DISTINCT columns)
expression in each query.
If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by
specifying NDV(column)
; a query can contain multiple instances of
NDV(column)
. To make Impala automatically rewrite
COUNT(DISTINCT)
expressions to NDV()
, enable the
APPX_COUNT_DISTINCT
query option.
To produce the same result as multiple COUNT(DISTINCT)
expressions, you can use the
following technique for queries involving a single table:
select v1.c1 result1, v2.c1 result2 from
(select count(distinct col1) as c1 from t1) v1
cross join
(select count(distinct col2) as c1 from t1) v2;
Because CROSS JOIN
is an expensive operation, prefer to use the NDV()
technique wherever practical.
In contrast with some database systems that always return DISTINCT
values in sorted order,
Impala does not do any ordering of DISTINCT
values. Always include an ORDER
BY
clause if you need the values in alphabetical or numeric sorted order.