NDV Function
An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT
col)
, the "number of distinct values". It is much faster than the
combination of COUNT
and DISTINCT
, and uses a constant amount of memory and
thus is less memory-intensive for columns with high cardinality.
Syntax:
NDV([DISTINCT | ALL] expression)
Usage notes:
This is the mechanism used internally by the COMPUTE STATS
statement for computing the
number of distinct values in a column.
Because this number is an estimate, it might not reflect the precise number of different values in the column, especially if the cardinality is very low or very high. If the estimated number is higher than the number of rows in the table, Impala adjusts the value internally during query planning.
Return type: DOUBLE
in Impala 2.0 and higher; STRING
in earlier
releases
Complex type considerations:
To access a column with a complex type (ARRAY
, STRUCT
, or MAP
)
in an aggregation function, you unpack the individual elements using join notation in the query,
and then apply the function to the final scalar item, field, key, or value at the bottom of any nested type hierarchy in the column.
See Complex Types (Impala 2.3 or higher only) for details about using complex types in Impala.
ARRAY
of STRUCT
items).
The array is unpacked inside the query using join notation.
The array elements are referenced using the ITEM
pseudocolumn, and the structure fields inside the array elements
are referenced using dot notation.
Numeric values such as SUM()
and AVG()
are computed using the numeric R_NATIONKEY
field, and
the general-purpose MAX()
and MIN()
values are computed from the string N_NAME
field.
describe region;
+-------------+-------------------------+---------+
| name | type | comment |
+-------------+-------------------------+---------+
| r_regionkey | smallint | |
| r_name | string | |
| r_comment | string | |
| r_nations | array<struct< | |
| | n_nationkey:smallint, | |
| | n_name:string, | |
| | n_comment:string | |
| | >> | |
+-------------+-------------------------+---------+
select r_name, r_nations.item.n_nationkey
from region, region.r_nations as r_nations
order by r_name, r_nations.item.n_nationkey;
+-------------+------------------+
| r_name | item.n_nationkey |
+-------------+------------------+
| AFRICA | 0 |
| AFRICA | 5 |
| AFRICA | 14 |
| AFRICA | 15 |
| AFRICA | 16 |
| AMERICA | 1 |
| AMERICA | 2 |
| AMERICA | 3 |
| AMERICA | 17 |
| AMERICA | 24 |
| ASIA | 8 |
| ASIA | 9 |
| ASIA | 12 |
| ASIA | 18 |
| ASIA | 21 |
| EUROPE | 6 |
| EUROPE | 7 |
| EUROPE | 19 |
| EUROPE | 22 |
| EUROPE | 23 |
| MIDDLE EAST | 4 |
| MIDDLE EAST | 10 |
| MIDDLE EAST | 11 |
| MIDDLE EAST | 13 |
| MIDDLE EAST | 20 |
+-------------+------------------+
select
r_name,
count(r_nations.item.n_nationkey) as count,
sum(r_nations.item.n_nationkey) as sum,
avg(r_nations.item.n_nationkey) as avg,
min(r_nations.item.n_name) as minimum,
max(r_nations.item.n_name) as maximum,
ndv(r_nations.item.n_nationkey) as distinct_vals
from
region, region.r_nations as r_nations
group by r_name
order by r_name;
+-------------+-------+-----+------+-----------+----------------+---------------+
| r_name | count | sum | avg | minimum | maximum | distinct_vals |
+-------------+-------+-----+------+-----------+----------------+---------------+
| AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 |
| AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 |
| ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 |
| EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 |
| MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 |
+-------------+-------+-----+------+-----------+----------------+---------------+
Restrictions:
This function cannot be used in an analytic context. That is, the OVER()
clause is not allowed at all with this function.
Examples:
The following example queries a billion-row table to illustrate the relative performance of
COUNT(DISTINCT)
and NDV()
. It shows how COUNT(DISTINCT)
gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient.
The NDV()
function gives an approximate result but is much faster.
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select cast(ndv(col1) as bigint) as col1 from sample_data;
+----------+
| col1 |
+----------+
| 139017 |
+----------+
Fetched 1 row(s) in 8.91s
The following example shows how you can code multiple NDV()
calls in a single query, to
easily learn which columns have substantially more or fewer distinct values. This technique is faster than
running a sequence of queries with COUNT(DISTINCT)
calls.
select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2,
cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4
from sample_data;
+----------+-----------+------------+-----------+
| col1 | col2 | col3 | col4 |
+----------+-----------+------------+-----------+
| 139017 | 282 | 46 | 145636240 |
+----------+-----------+------------+-----------+
Fetched 1 row(s) in 34.97s
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select count(distinct col2) from sample_data;
+----------------------+
| count(distinct col2) |
+----------------------+
| 278 |
+----------------------+
Fetched 1 row(s) in 20.09s
select count(distinct col3) from sample_data;
+-----------------------+
| count(distinct col3) |
+-----------------------+
| 46 |
+-----------------------+
Fetched 1 row(s) in 19.12s
select count(distinct col4) from sample_data;
+----------------------+
| count(distinct col4) |
+----------------------+
| 147135880 |
+----------------------+
Fetched 1 row(s) in 266.95s