OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)

Enables a fast code path for queries that apply simple aggregate functions to partition key columns: MIN(key_column), MAX(key_column), or COUNT(DISTINCT key_column).

Type: Boolean; recognized values are 1 and 0, or true and false; any other value interpreted as false

Default: false (shown as 0 in output of SET statement)

Note: In Impala 2.5.0, only the value 1 enables the option, and the value true is not recognized. This limitation is tracked by the issue IMPALA-3334, which shows the releases where the problem is fixed.

Added in: Impala 2.5.0

Usage notes:

This optimization speeds up common "introspection" operations over partition key columns, for example determining the distinct values of partition keys.

This optimization does not apply to SELECT statements that reference columns that are not partition keys. It also only applies when all the partition key columns in the SELECT statement are referenced in one of the following contexts:

This optimization is enabled by a query option because it skips some consistency checks and therefore can return slightly different partition values if partitions are in the process of being added, dropped, or loaded outside of Impala. Queries might exhibit different behavior depending on the setting of this option in the following cases:

Examples:

The following example shows initial schema setup and the default behavior of queries that return just the partition key column for a table:


-- Make a partitioned table with 3 partitions.
create table t1 (s string) partitioned by (year int);
insert into t1 partition (year=2015) values ('last year');
insert into t1 partition (year=2016) values ('this year');
insert into t1 partition (year=2017) values ('next year');

-- Regardless of the option setting, this query must read the
-- data files to know how many rows to return for each year value.
explain select year from t1;
+-----------------------------------------------------+
| Explain String                                      |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
|                                                     |
| F00:PLAN FRAGMENT [UNPARTITIONED]                   |
|   00:SCAN HDFS [key_cols.t1]                        |
|      partitions=3/3 files=4 size=40B                |
|      table stats: 3 rows total                      |
|      column stats: all                              |
|      hosts=3 per-host-mem=unavailable               |
|      tuple-ids=0 row-size=4B cardinality=3          |
+-----------------------------------------------------+

-- The aggregation operation means the query does not need to read
-- the data within each partition: the result set contains exactly 1 row
-- per partition, derived from the partition key column value.
-- By default, Impala still includes a 'scan' operation in the query.
explain select distinct year from t1;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0                                |
|                                                                                    |
| 01:AGGREGATE [FINALIZE]                                                            |
| |  group by: year                                                                  |
| |                                                                                  |
| 00:SCAN HDFS [key_cols.t1]                                                         |
|    partitions=0/0 files=0 size=0B                                                  |
+------------------------------------------------------------------------------------+

The following examples show how the plan is made more efficient when the OPTIMIZE_PARTITION_KEY_SCANS option is enabled:


set optimize_partition_key_scans=1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1

-- The aggregation operation is turned into a UNION internally,
-- with constant values known in advance based on the metadata
-- for the partitioned table.
explain select distinct year from t1;
+-----------------------------------------------------+
| Explain String                                      |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
|                                                     |
| F00:PLAN FRAGMENT [UNPARTITIONED]                   |
|   01:AGGREGATE [FINALIZE]                           |
|   |  group by: year                                 |
|   |  hosts=1 per-host-mem=unavailable               |
|   |  tuple-ids=1 row-size=4B cardinality=3          |
|   |                                                 |
|   00:UNION                                          |
|      constant-operands=3                            |
|      hosts=1 per-host-mem=unavailable               |
|      tuple-ids=0 row-size=4B cardinality=3          |
+-----------------------------------------------------+

-- The same optimization applies to other aggregation queries
-- that only return values based on partition key columns:
-- MIN, MAX, COUNT(DISTINCT), and so on.
explain select min(year) from t1;
+-----------------------------------------------------+
| Explain String                                      |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
|                                                     |
| F00:PLAN FRAGMENT [UNPARTITIONED]                   |
|   01:AGGREGATE [FINALIZE]                           |
|   |  output: min(year)                              |
|   |  hosts=1 per-host-mem=unavailable               |
|   |  tuple-ids=1 row-size=4B cardinality=1          |
|   |                                                 |
|   00:UNION                                          |
|      constant-operands=3                            |
|      hosts=1 per-host-mem=unavailable               |
|      tuple-ids=0 row-size=4B cardinality=3          |
+-----------------------------------------------------+