# NDV Function

An aggregate function that returns an approximate value similar to the result of ```
COUNT(DISTINCT
```

, the "number of distinct values". It is much faster than the
combination of `col`)`COUNT`

and `DISTINCT`

, and uses a constant amount of memory and
thus is less memory-intensive for columns with high cardinality.

**Syntax:**

`NDV([DISTINCT | ALL] ``expression` [,scale])

`scale`

must be an integer and can be in the range
from 1 to 10 and maps to a precision used by the HyperLogLog (HLL) algorithm with the
following mapping formula:`precision = scale + 8`

Therefore a scale of 1 is mapped to a precision of 9 and a scale of 10 is mapped to a precision of 18.

Without the optional argument, the precision which determines the total number of different estimators in the HLL algorithm will be still 10.

A large precision value generally produces a better estimation with less error than a small precision value. This is due to the extra number of estimators involved. The expense is at the need of extra memory. For a given precision p, the amount of memory used by the HLL algorithm is in the order of 2^p bytes.

When provided a scale of 10 against a total of 22 distinct data sets loaded into external Impala tables, the error will be computed as abs(<true_unique_value> - <estimated_unique_value>) / <true_unique_value>

The scale of 10, mapped to the precision of 18, yielded the worst estimation error at 0.42% (for one set of 10 million integers), and average error no more than 0.17%. This was at the cost of 256Kb of memory for the internal data structure per evaluation of the HLL algorithm.

**Usage notes:**

This is the mechanism used internally by the `COMPUTE STATS`

statement for computing the
number of distinct values in a column.

Because this number is an estimate, it might not reflect the precise number of different values in the column, especially if the cardinality is very low or very high. If the estimated number is higher than the number of rows in the table, Impala adjusts the value internally during query planning.

**Return type:** `DOUBLE`

in Impala 2.0 and higher;
`STRING`

in earlier releases

**Complex type considerations:**

To access a column with a complex type (`ARRAY`

, `STRUCT`

,
or `MAP`

) in an aggregation function, you unpack the individual elements
using join notation in the query, and then apply the function to the final scalar item,
field, key, or value at the bottom of any nested type hierarchy in the column. See
Complex Types (Impala 2.3 or higher only) for details about using
complex types in Impala.

`ARRAY`

of
`STRUCT`

items). The array is unpacked inside the query using join
notation. The array elements are referenced using the `ITEM`

pseudocolumn, and the structure fields inside the array elements are referenced using
dot notation. Numeric values such as `SUM()`

and `AVG()`

are computed using the numeric `R_NATIONKEY`

field, and the
general-purpose `MAX()`

and `MIN()`

values are computed
from the string `N_NAME`

field.
```
describe region;
+-------------+-------------------------+---------+
| name | type | comment |
+-------------+-------------------------+---------+
| r_regionkey | smallint | |
| r_name | string | |
| r_comment | string | |
| r_nations | array<struct< | |
| | n_nationkey:smallint, | |
| | n_name:string, | |
| | n_comment:string | |
| | >> | |
+-------------+-------------------------+---------+
select r_name, r_nations.item.n_nationkey
from region, region.r_nations as r_nations
order by r_name, r_nations.item.n_nationkey;
+-------------+------------------+
| r_name | item.n_nationkey |
+-------------+------------------+
| AFRICA | 0 |
| AFRICA | 5 |
| AFRICA | 14 |
| AFRICA | 15 |
| AFRICA | 16 |
| AMERICA | 1 |
| AMERICA | 2 |
| AMERICA | 3 |
| AMERICA | 17 |
| AMERICA | 24 |
| ASIA | 8 |
| ASIA | 9 |
| ASIA | 12 |
| ASIA | 18 |
| ASIA | 21 |
| EUROPE | 6 |
| EUROPE | 7 |
| EUROPE | 19 |
| EUROPE | 22 |
| EUROPE | 23 |
| MIDDLE EAST | 4 |
| MIDDLE EAST | 10 |
| MIDDLE EAST | 11 |
| MIDDLE EAST | 13 |
| MIDDLE EAST | 20 |
+-------------+------------------+
select
r_name,
count(r_nations.item.n_nationkey) as count,
sum(r_nations.item.n_nationkey) as sum,
avg(r_nations.item.n_nationkey) as avg,
min(r_nations.item.n_name) as minimum,
max(r_nations.item.n_name) as maximum,
ndv(r_nations.item.n_nationkey) as distinct_vals
from
region, region.r_nations as r_nations
group by r_name
order by r_name;
+-------------+-------+-----+------+-----------+----------------+---------------+
| r_name | count | sum | avg | minimum | maximum | distinct_vals |
+-------------+-------+-----+------+-----------+----------------+---------------+
| AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 |
| AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 |
| ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 |
| EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 |
| MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 |
+-------------+-------+-----+------+-----------+----------------+---------------+
```

**Restrictions:**

This function cannot be used in an analytic context. That is, the
`OVER()`

clause is not allowed at all with this function.

**Examples:**

The following example queries a billion-row table to illustrate the relative performance of
`COUNT(DISTINCT)`

and `NDV()`

. It shows how `COUNT(DISTINCT)`

gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient.
The `NDV()`

function gives an approximate result but is much faster.

```
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select cast(ndv(col1) as bigint) as col1 from sample_data;
+----------+
| col1 |
+----------+
| 139017 |
+----------+
Fetched 1 row(s) in 8.91s
```

The following example shows how you can code multiple `NDV()`

calls in a single query, to
easily learn which columns have substantially more or fewer distinct values. This technique is faster than
running a sequence of queries with `COUNT(DISTINCT)`

calls.

```
select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2,
cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4
from sample_data;
+----------+-----------+------------+-----------+
| col1 | col2 | col3 | col4 |
+----------+-----------+------------+-----------+
| 139017 | 282 | 46 | 145636240 |
+----------+-----------+------------+-----------+
Fetched 1 row(s) in 34.97s
select count(distinct col1) from sample_data;
+---------------------+
| count(distinct col1)|
+---------------------+
| 100000 |
+---------------------+
Fetched 1 row(s) in 20.13s
select count(distinct col2) from sample_data;
+----------------------+
| count(distinct col2) |
+----------------------+
| 278 |
+----------------------+
Fetched 1 row(s) in 20.09s
select count(distinct col3) from sample_data;
+-----------------------+
| count(distinct col3) |
+-----------------------+
| 46 |
+-----------------------+
Fetched 1 row(s) in 19.12s
select count(distinct col4) from sample_data;
+----------------------+
| count(distinct col4) |
+----------------------+
| 147135880 |
+----------------------+
Fetched 1 row(s) in 266.95s
```