Impala can do better optimization for complex or multi-table queries when it has access to statistics about the volume of data and how the values are distributed. Impala uses this information to help parallelize and distribute the work for a query. For example, optimizing join queries requires a way of determining if one table is "bigger" than another, which is a function of the number of rows and the average row size for each table. The following sections describe the categories of statistics Impala can work with, and how to produce them and keep them up to date.
The Impala query planner can make use of statistics about entire tables and partitions. This information includes physical characteristics such as the number of rows, number of data files, the total size of the data files, and the file format. For partitioned tables, the numbers are calculated per partition, and as totals for the whole table. This metadata is stored in the metastore database, and can be updated by either Impala or Hive. If a number is not available, the value -1 is used as a placeholder. Some numbers, such as number and total sizes of data files, are always kept up to date because they can be calculated cheaply, as part of gathering HDFS block metadata.
The following example shows table stats for an unpartitioned Parquet table. The values
for the number and sizes of files are always available. Initially, the number of rows is
not known, because it requires a potentially expensive scan through the entire table,
and so that value is displayed as -1. The COMPUTE STATS
statement fills
in any unknown table stats values.
show table stats parquet_snappy;
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
compute stats parquet_snappy;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+
show table stats parquet_snappy;
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
Impala performs some optimizations using this metadata on its own, and other optimizations by using a combination of table and column statistics.
To check that table statistics are available for a table, and see the details of those
statistics, use the statement SHOW TABLE STATS
table_name
. See SHOW Statement for
details.
If you use the Hive-based methods of gathering statistics, see
the
Hive wiki for information about the required configuration on the Hive side.
Where practical, use the Impala COMPUTE STATS
statement to avoid
potential configuration and scalability issues with the statistics-gathering process.
If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR
COLUMNS
, Impala can only use the resulting column statistics if the table is
unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned
table.
The Impala query planner can make use of statistics about individual columns when that
metadata is available in the metastore database. This technique is most valuable for
columns compared across tables in join
queries, to help estimate how many rows the query will retrieve from each table.
These statistics are also important for correlated subqueries using the
EXISTS()
or IN()
operators, which are processed
internally the same way as join queries.
The following example shows column stats for an unpartitioned Parquet table. The values
for the maximum and average sizes of some types are always available, because those
figures are constant for numeric and other fixed-size types. Initially, the number of
distinct values is not known, because it requires a potentially expensive scan through
the entire table, and so that value is displayed as -1. The same applies to maximum and
average sizes of variable-sized types, such as STRING
. The
COMPUTE STATS
statement fills in most unknown column stats values. (It
does not record the number of NULL
values, because currently Impala
does not use that figure for query optimization.)
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+----------+
| id | BIGINT | -1 | -1 | 8 | 8 |
| val | INT | -1 | -1 | 4 | 4 |
| zerofill | STRING | -1 | -1 | -1 | -1 |
| name | STRING | -1 | -1 | -1 | -1 |
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
| location_id | SMALLINT | -1 | -1 | 2 | 2 |
+-------------+----------+------------------+--------+----------+----------+
compute stats parquet_snappy;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+-------------------+
| id | BIGINT | 183861280 | -1 | 8 | 8 |
| val | INT | 139017 | -1 | 4 | 4 |
| zerofill | STRING | 101761 | -1 | 6 | 6 |
| name | STRING | 145636240 | -1 | 22 | 13.00020027160645 |
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
| location_id | SMALLINT | 339 | -1 | 2 | 2 |
+-------------+----------+------------------+--------+----------+-------------------+
For column statistics to be effective in Impala, you also need to have table
statistics for the applicable tables, as described in
Overview of Table Statistics. When you use the Impala
COMPUTE STATS
statement, both table and column statistics are
automatically gathered at the same time, for all columns in the table.
COMPUTE STATS
counted the number of
NULL
values in each column and recorded that figure in the metastore
database. Because Impala does not currently use the NULL
count during
query planning, Impala 1.4.0 and higher speeds up the COMPUTE STATS
statement by skipping this NULL
counting.
To check whether column statistics are available for a particular set of columns, use
the SHOW COLUMN STATS table_name
statement, or check
the extended EXPLAIN
output for a query against that table that refers
to those columns. See SHOW Statement and
EXPLAIN Statement for details.
If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR
COLUMNS
, Impala can only use the resulting column statistics if the table is
unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned
table.
When you use Impala for "big data", you are highly likely to use partitioning for your biggest tables, the ones representing data that can be logically divided based on dates, geographic regions, or similar criteria. The table and column statistics are especially useful for optimizing queries on such tables. For example, a query involving one year might involve substantially more or less data than a query involving a different year, or a range of several years. Each query might be optimized differently as a result.
The following examples show how table and column stats work with a partitioned table.
The table for this example is partitioned by year, month, and day. For simplicity, the
sample data consists of 5 partitions, all from the same year and month. Table stats are
collected independently for each partition. (In fact, the SHOW
PARTITIONS
statement displays exactly the same information as SHOW
TABLE STATS
for a partitioned table.) Column stats apply to the entire table,
not to individual partitions. Because the partition key column values are represented as
HDFS directories, their characteristics are typically known in advance, even when the
values for non-key columns are shown as -1.
show partitions year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| Total | | | -1 | 5 | 12.58MB | 0B | | | | |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
show table stats year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| Total | | | -1 | 5 | 12.58MB | 0B | | | | |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
show column stats year_month_day;
+-----------+---------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+----------+
| id | INT | -1 | -1 | 4 | 4 |
| val | INT | -1 | -1 | 4 | 4 |
| zfill | STRING | -1 | -1 | -1 | -1 |
| name | STRING | -1 | -1 | -1 | -1 |
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
| year | INT | 1 | 0 | 4 | 4 |
| month | INT | 1 | 0 | 4 | 4 |
| day | INT | 5 | 0 | 4 | 4 |
+-----------+---------+------------------+--------+----------+----------+
compute stats year_month_day;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 5 partition(s) and 5 column(s). |
+-----------------------------------------+
show table stats year_month_day;
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
| Total | | | 469290 | 5 | 12.58MB | 0B | | | | |
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
show column stats year_month_day;
+-----------+---------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+-------------------+
| id | INT | 511129 | -1 | 4 | 4 |
| val | INT | 364853 | -1 | 4 | 4 |
| zfill | STRING | 311430 | -1 | 6 | 6 |
| name | STRING | 471975 | -1 | 22 | 13.00160026550293 |
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
| year | INT | 1 | 0 | 4 | 4 |
| month | INT | 1 | 0 | 4 | 4 |
| day | INT | 5 | 0 | 4 | 4 |
+-----------+---------+------------------+--------+----------+-------------------+
If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR
COLUMNS
, Impala can only use the resulting column statistics if the table is
unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned
table.
Use the COMPUTE STATS
family of commands to collect table and
column statistics. The COMPUTE STATS
variants offer
different tradeoffs between computation cost, staleness, and maintenance
workflows which are explained below.
For a particular table, use either COMPUTE STATS
or COMPUTE
INCREMENTAL STATS
, but never combine the two or alternate between them. If you
switch from COMPUTE STATS
to COMPUTE INCREMENTAL STATS
during the lifetime of a table, or vice versa, drop all statistics by running
DROP STATS
before making the switch.
The COMPUTE STATS
command collects and sets the table-level
and partition-level row counts as well as all column statistics for a given
table. The collection process is CPU-intensive and can take a long time to
complete for very large tables.
COMPUTE STATS
consider the following options
which can be combined.
Limit the number of columns for which statistics are collected to increase the efficiency of COMPUTE STATS. Queries benefit from statistics for those columns involved in filters, join conditions, group by or partition by clauses. Other columns are good candidates to exclude from COMPUTE STATS. This feature is available since Impala 2.12.
Set the MT_DOP query option to use more threads within each participating impalad to compute the statistics faster - but not more efficiently. Note that computing stats on a large table with a high MT_DOP value can negatively affect other queries running at the same time if the COMPUTE STATS claims most CPU cycles. This feature is available since Impala 2.8.
Consider the experimental extrapolation and sampling features (see below) to further increase the efficiency of computing stats.
COMPUTE STATS
is intended to be run periodically,
e.g. weekly, or on-demand when the contents of a table have changed
significantly. Due to the high resource utilization and long repsonse
time of tCOMPUTE STATS
, it is most practical to run it
in a scheduled maintnance window where the Impala cluster is idle
enough to accommodate the expensive operation. The degree of change that
qualifies as "significant" depends on the query workload, but typically,
if 30% of the rows have changed then it is recommended to recompute
statistics.
If you reload a complete new set of data for a table, but the number of rows and number of distinct values for each column is relatively unchanged from before, you do not need to recompute stats for the table.
Newly added partitions do not have row count statistics. Table scans that only access those new partitions are treated as not having stats. Similarly, table scans that access both new and old partitions estimate the scan cardinality based on those old partitions that have stats, and the new partitions without stats are treated as having 0 rows.
The row counts of existing partitions become stale when data is added or dropped.
Computing stats for tables with a 100,000 or more partitions might fail or be very slow due to the high cost of updating the partition metadata in the Hive Metastore.
With transient compute resources it is important to minimize the time from starting a new cluster to successfully running queries. Since the cluster might be relatively short-lived, users might prefer to quickly collect stats that are "good enough" as opposed to spending a lot of time and resouces on computing full-fidelity stats.
The sampling feature makes COMPUTE STATS more efficient by processing a fraction of the table data, and the extrapolation feature aims to reduce the frequency at which COMPUTE STATS needs to be re-run by estimating the row count of new and modified partitions.
The sampling and extrapolation features are disabled by default. They can be enabled globally or for specific tables, as follows. Set the impalad start-up configuration "--enable_stats_extrapolation" to enable the features globally. To enable them only for a specific table, set the "impala.enable.stats.extrapolation" table property to "true" for the desired table. The table-level property overrides the global setting, so it is also possible to enable sampling and extrapolation globally, but disable it for specific tables by setting the table property to "false". Example: ALTER TABLE mytable test_table SET TBLPROPERTIES("impala.enable.stats.extrapolation"="true")
The main idea of stats extrapolation is to estimate the row count of new and modified partitions based on the result of the last COMPUTE STATS. Enabling stats extrapolation changes the behavior of COMPUTE STATS, as well as the cardinality estimation of table scans. COMPUTE STATS no longer computes and stores per-partition row counts, and instead, only computes a table-level row count together with the total number of file bytes in the table at that time. No partition metadata is modified. The input cardinality of a table scan is estimated by converting the data volume of relevant partitions to a row count, based on the table-level row count and file bytes statistics. It is assumed that within the same table, different sets of files with the same data volume correspond to the similar number of rows on average. With extrapolation enabled, the scan cardinality estimation ignores per-partition row counts. It only relies on the table-level statistics and the scanned data volume.
The SHOW TABLE STATS and EXPLAIN commands distinguish between row counts stored in the Hive Metastore, and the row counts extrapolated based on the above process. Consult the SHOW TABLE STATS and EXPLAIN documentation for more details.
A TABLESAMPLE clause may be added to COMPUTE STATS to limit the percentage of data to be processed. The final statistics are obtained by extrapolating the statistics from the data sample over the entire table. The extrapolated statistics are stored in the Hive Metastore, just as if no sampling was used. The following example runs COMPUTE STATS over a 10 percent data sample: COMPUTE STATS test_table TABLESAMPLE SYSTEM(10)
We have found that a 10 percent sampling rate typically offers a good tradeoff between statistics accuracy and execution cost. A sampling rate well below 10 percent has shown poor results and is not recommended.
In Impala 2.1.0 and higher, you can use the
COMPUTE INCREMENTAL STATS
and
DROP INCREMENTAL STATS
commands.
The INCREMENTAL
clauses work with incremental statistics,
a specialized feature for partitioned tables.
When you compute incremental statistics for a partitioned table, by default Impala only processes those partitions that do not yet have incremental statistics. By processing only newly added partitions, you can keep statistics up to date without incurring the overhead of reprocessing the entire table each time.
You can also compute or drop statistics for a specified subset of partitions by
including a PARTITION
clause in the
COMPUTE INCREMENTAL STATS
or DROP INCREMENTAL STATS
statement.
In Impala 3.0 and lower, approximately 400 bytes of metadata per column per partition are needed for caching. Tables with a big number of partitions and many columns can add up to a significant memory overhead as the metadata must be cached on the catalogd host and on every impalad host that is eligible to be a coordinator. If this metadata for all tables exceeds 2 GB, you might experience service downtime. In Impala 3.1 and higher, the issue was alleviated with an improved handling of incremental stats.
When you run COMPUTE INCREMENTAL STATS
on a table for the first time,
the statistics are computed again from scratch regardless of whether the table already
has statistics. Therefore, expect a one-time resource-intensive operation for scanning
the entire table when running COMPUTE INCREMENTAL STATS
for the first
time on a given table.
The metadata for incremental statistics is handled differently from the original style of statistics:
Issuing a COMPUTE INCREMENTAL STATS
without a partition
clause causes Impala to compute incremental stats for all partitions that
do not already have incremental stats. This might be the entire table when
running the command for the first time, but subsequent runs should only
update new partitions. You can force updating a partition that already has
incremental stats by issuing a DROP INCREMENTAL STATS
before running COMPUTE INCREMENTAL STATS
.
The SHOW TABLE STATS
and SHOW PARTITIONS
statements now include an additional column showing whether incremental statistics
are available for each column. A partition could already be covered by the original
type of statistics based on a prior COMPUTE STATS
statement, as
indicated by a value other than -1
under the #Rows
column. Impala query planning uses either kind of statistics when available.
COMPUTE INCREMENTAL STATS
takes more time than COMPUTE
STATS
for the same volume of data. Therefore it is most suitable for tables
with large data volume where new partitions are added frequently, making it
impractical to run a full COMPUTE STATS
operation for each new
partition. For unpartitioned tables, or partitioned tables that are loaded once and
not updated with new partitions, use the original COMPUTE STATS
syntax.
COMPUTE INCREMENTAL STATS
uses some memory in the
catalogd process, proportional to the number
of partitions and number of columns in the applicable table. The
memory overhead is approximately 400 bytes for each column in each
partition. This memory is reserved in the
catalogd daemon, the
statestored daemon, and in each instance of
the impalad daemon.
In cases where new files are added to an existing partition, issue a
REFRESH
statement for the table, followed by a DROP
INCREMENTAL STATS
and COMPUTE INCREMENTAL STATS
sequence
for the changed partition.
The DROP INCREMENTAL STATS
statement operates only on a single
partition at a time. To remove statistics (whether incremental or not) from all
partitions of a table, issue a DROP STATS
statement with no
INCREMENTAL
or PARTITION
clauses.
The following considerations apply to incremental statistics when the structure of an existing table is changed (known as schema evolution):
If you use an ALTER TABLE
statement to drop a column, the existing
statistics remain valid and COMPUTE INCREMENTAL STATS
does not
rescan any partitions.
If you use an ALTER TABLE
statement to add a column, Impala rescans
all partitions and fills in the appropriate column-level values the next time you
run COMPUTE INCREMENTAL STATS
.
If you use an ALTER TABLE
statement to change the data type of a
column, Impala rescans all partitions and fills in the appropriate column-level
values the next time you run COMPUTE INCREMENTAL STATS
.
If you use an ALTER TABLE
statement to change the file format of a
table, the existing statistics remain valid and a subsequent COMPUTE
INCREMENTAL STATS
does not rescan any partitions.
See COMPUTE STATS Statement and DROP STATS Statement for syntax details.
In Impala 3.0 and lower, when executing COMPUTE INCREMENTAL
STATS
on very large tables, use the configuration setting
--inc_stats_size_limit_bytes
to prevent Impala
from running out of memory while updating table metadata. If this
limit is reached, Impala will stop loading the table and return an
error. The error serves as an indication that COMPUTE
INCREMENTAL STATS
should not be used on the particular
table. Consider spitting the table and using regular COMPUTE
STATS
]if possible.
The --inc_stats_size_limit_bytes
limit is set as
a safety check, to prevent Impala from hitting the maximum limit for
the table metadata. Note that this limit is only one part of the
entire table's metadata all of which together must be below 2 GB.
The default value for
--inc_stats_size_limit_bytes
is 209715200, 200
MB.
To change the --inc_stats_size_limit_bytes
value,
restart impalad and catalogd with the new value specified in bytes,
for example, 1048576000 for 1 GB. See Modifying Impala Startup Options for the steps to
change the option and restart Impala daemons.
--inc_stats_size_limit_bytes
setting should be
increased with care. A big value for the setting, such as 1 GB or
more, can result in a spike in heap usage as well as a crash of
Impala. In Impala 3.1 and higher, Impala improved how metadata is updated
when executing COMPUTE INCREMENTAL STATS
,
significantly reducing the need for
--inc_stats_size_limit_bytes
.
You can check whether a specific table has statistics using the SHOW TABLE
STATS
statement (for any table) or the SHOW PARTITIONS
statement (for a partitioned table). Both statements display the same information. If a
table or a partition does not have any statistics, the #Rows
field
contains -1
. Once you compute statistics for the table or partition,
the #Rows
field changes to an accurate value.
The following example shows a table that initially does not have any statistics. The
SHOW TABLE STATS
statement displays different values for
#Rows
before and after the COMPUTE STATS
operation.
[localhost:21000] > create table no_stats (x int);
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| -1 | 0 | 0B | NOT CACHED | TEXT | false |
+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > compute stats no_stats;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 0 | 0 | 0B | NOT CACHED | TEXT | false |
+-------+--------+------+--------------+--------+-------------------+
The following example shows a similar progression with a partitioned table. Initially,
#Rows
is -1
. After a COMPUTE STATS
operation, #Rows
changes to an accurate value. Any newly added
partition starts with no statistics, meaning that you must collect statistics after
adding a new partition.
[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint);
[localhost:21000] > show table stats no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1 | 0 | 0B | 0B | | |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1 | 0 | 0B | 0B | | |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2013);
[localhost:21000] > compute stats no_stats_partitioned;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2014);
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| 2013 | 0 | 0 | 0B | NOT CACHED | TEXT | false |
| 2014 | -1 | 0 | 0B | NOT CACHED | TEXT | false |
| Total | 0 | 0 | 0B | 0B | | |
+-------+-------+--------+------+--------------+--------+-------------------+
COMPUTE STATS
statement creates and updates
statistics for all partitions in a table, if you expect to frequently add new
partitions, use the COMPUTE INCREMENTAL STATS
syntax instead, which
lets you compute stats for a single specified partition, or only for those partitions
that do not already have incremental stats.
If checking each individual table is impractical, due to a large number of tables or
views that hide the underlying base tables, you can also check for missing statistics
for a particular query. Use the EXPLAIN
statement to preview query
efficiency before actually running the query. Use the query profile output available
through the PROFILE
command in impala-shell or the
web UI to verify query execution and timing after running the query. Both the
EXPLAIN
plan and the PROFILE
output display a warning
if any tables or partitions involved in the query do not have statistics.
[localhost:21000] > create table no_stats (x int);
[localhost:21000] > explain select count(*) from no_stats;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| incremental_stats.no_stats |
| |
| 03:AGGREGATE [FINALIZE] |
| | output: count:merge(*) |
| | |
| 02:EXCHANGE [UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HDFS [incremental_stats.no_stats] |
| partitions=1/1 files=0 size=0B |
+------------------------------------------------------------------------------------+
Because Impala uses the partition pruning technique when possible to only
evaluate certain partitions, if you have a partitioned table with statistics for some
partitions and not others, whether or not the EXPLAIN
statement shows
the warning depends on the actual partitions used by the query. For example, you might
see warnings or not for different queries against the same table:
-- No warning because all the partitions for the year 2012 have stats.
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;
-- Missing stats warning because one or more partitions in this range
-- do not have stats.
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;
To confirm if any partitions at all in the table are missing statistics, you might
explain a query that scans the entire table, such as SELECT COUNT(*) FROM
table_name
.
The most crucial piece of data in all the statistics is the number of rows in the
table (for an unpartitioned or partitioned table) and for each partition (for a
partitioned table). The COMPUTE STATS
statement always gathers
statistics about all columns, as well as overall table statistics. If it is not
practical to do a full COMPUTE STATS
or COMPUTE INCREMENTAL
STATS
operation after adding a partition or inserting data, or if you can see
that Impala would produce a more efficient plan if the number of rows was different,
you can manually set the number of rows through an ALTER TABLE
statement:
-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
alter table table_name set tblproperties('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK'='true');
-- Set total number of rows for a specific partition. Applies to partitioned tables only.
-- You must specify all the partition key columns in the PARTITION clause.
alter table table_name partition (keycol1=val1,keycol2=val2...) set tblproperties('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK'='true');
This statement avoids re-scanning any data files. (The requirement to include the
STATS_GENERATED_VIA_STATS_TASK
property is relatively new, as a
result of the issue
HIVE-8648
for the Hive metastore.)
create table analysis_data stored as parquet as select * from raw_data;
Inserted 1000000000 rows in 181.98s
compute stats analysis_data;
insert into analysis_data select * from smaller_table_we_forgot_before;
Inserted 1000000 rows in 15.32s
-- Now there are 1001000000 rows. We can update this single data point in the stats.
alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
For a partitioned table, update both the per-partition number of rows and the number of rows for the whole table:
-- If the table originally contained 1 million rows, and we add another partition with 30 thousand rows,
-- change the numRows property for the partition and the overall table.
alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
alter table partitioned_data set tblproperties ('numRows'='1030000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
In practice, the COMPUTE STATS
statement, or COMPUTE
INCREMENTAL STATS
for a partitioned table, should be fast and convenient
enough that this technique is only useful for the very largest partitioned tables.
Because the column statistics might be left in a stale state, do not use this
technique as a replacement for COMPUTE STATS
. Only use this technique
if all other means of collecting statistics are impractical, or as a low-overhead
operation that you run in between periodic COMPUTE STATS
or
COMPUTE INCREMENTAL STATS
operations.
In Impala 2.6 and higher, you can also use the SET
COLUMN STATS
clause of ALTER TABLE
to manually set or change
column statistics. Only use this technique in cases where it is impractical to run
COMPUTE STATS
or COMPUTE INCREMENTAL STATS
frequently enough to keep up with data changes for a huge table.
numDVs
, numNulls
, avgSize
,
maxSize
. The key names and values are both quoted. This operation
applies to an entire table, not a specific partition. For example:
create table t1 (x int, s string);
insert into t1 values (1, 'one'), (2, 'two'), (2, 'deux');
show column stats t1;
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| x | INT | -1 | -1 | 4 | 4 |
| s | STRING | -1 | -1 | -1 | -1 |
+--------+--------+------------------+--------+----------+----------+
alter table t1 set column stats x ('numDVs'='2','numNulls'='0');
alter table t1 set column stats s ('numdvs'='3','maxsize'='4');
show column stats t1;
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| x | INT | 2 | 0 | 4 | 4 |
| s | STRING | 3 | -1 | 4 | -1 |
+--------+--------+------------------+--------+----------+----------+
The following examples walk through a sequence of SHOW TABLE STATS
,
SHOW COLUMN STATS
, ALTER TABLE
, and
SELECT
and INSERT
statements to illustrate various
aspects of how Impala uses statistics to help optimize queries.
This example shows table and column statistics for the STORE
column
used in the TPC-DS
benchmarks for decision support systems. It is a tiny table holding data for 12
stores. Initially, before any statistics are gathered by a COMPUTE
STATS
statement, most of the numeric fields show placeholder values of -1,
indicating that the figures are unknown. The figures that are filled in are values that
are easily countable or deducible at the physical level, such as the number of files,
total data size of the files, and the maximum and average sizes for data types that have
a constant size such as INT
, FLOAT
, and
TIMESTAMP
.
[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+--------+--------+
| -1 | 1 | 3.08KB | TEXT |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+----------+
| s_store_sk | INT | -1 | -1 | 4 | 4 |
| s_store_id | STRING | -1 | -1 | -1 | -1 |
| s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| s_closed_date_sk | INT | -1 | -1 | 4 | 4 |
| s_store_name | STRING | -1 | -1 | -1 | -1 |
| s_number_employees | INT | -1 | -1 | 4 | 4 |
| s_floor_space | INT | -1 | -1 | 4 | 4 |
| s_hours | STRING | -1 | -1 | -1 | -1 |
| s_manager | STRING | -1 | -1 | -1 | -1 |
| s_market_id | INT | -1 | -1 | 4 | 4 |
| s_geography_class | STRING | -1 | -1 | -1 | -1 |
| s_market_desc | STRING | -1 | -1 | -1 | -1 |
| s_market_manager | STRING | -1 | -1 | -1 | -1 |
| s_division_id | INT | -1 | -1 | 4 | 4 |
| s_division_name | STRING | -1 | -1 | -1 | -1 |
| s_company_id | INT | -1 | -1 | 4 | 4 |
| s_company_name | STRING | -1 | -1 | -1 | -1 |
| s_street_number | STRING | -1 | -1 | -1 | -1 |
| s_street_name | STRING | -1 | -1 | -1 | -1 |
| s_street_type | STRING | -1 | -1 | -1 | -1 |
| s_suite_number | STRING | -1 | -1 | -1 | -1 |
| s_city | STRING | -1 | -1 | -1 | -1 |
| s_county | STRING | -1 | -1 | -1 | -1 |
| s_state | STRING | -1 | -1 | -1 | -1 |
| s_zip | STRING | -1 | -1 | -1 | -1 |
| s_country | STRING | -1 | -1 | -1 | -1 |
| s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 |
| s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 |
+--------------------+-----------+------------------+--------+----------+----------+
Returned 29 row(s) in 0.04s
With the Hive ANALYZE TABLE
statement for column statistics, you had to
specify each column for which to gather statistics. The Impala COMPUTE
STATS
statement automatically gathers statistics for all columns, because it
reads through the entire table relatively quickly and can efficiently compute the values
for all the columns. This example shows how after running the COMPUTE
STATS
statement, statistics are filled in for both the table and all its
columns:
[localhost:21000] > compute stats store;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 1 partition(s) and 29 column(s). |
+------------------------------------------+
Returned 1 row(s) in 1.88s
[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+--------+--------+
| 12 | 1 | 3.08KB | TEXT |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+-------------------+
| s_store_sk | INT | 12 | -1 | 4 | 4 |
| s_store_id | STRING | 6 | -1 | 16 | 16 |
| s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 |
| s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 |
| s_closed_date_sk | INT | 3 | -1 | 4 | 4 |
| s_store_name | STRING | 8 | -1 | 5 | 4.25 |
| s_number_employees | INT | 9 | -1 | 4 | 4 |
| s_floor_space | INT | 10 | -1 | 4 | 4 |
| s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 |
| s_manager | STRING | 7 | -1 | 15 | 12 |
| s_market_id | INT | 7 | -1 | 4 | 4 |
| s_geography_class | STRING | 1 | -1 | 7 | 7 |
| s_market_desc | STRING | 10 | -1 | 94 | 55.5 |
| s_market_manager | STRING | 7 | -1 | 16 | 14 |
| s_division_id | INT | 1 | -1 | 4 | 4 |
| s_division_name | STRING | 1 | -1 | 7 | 7 |
| s_company_id | INT | 1 | -1 | 4 | 4 |
| s_company_name | STRING | 1 | -1 | 7 | 7 |
| s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 |
| s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 |
| s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 |
| s_suite_number | STRING | 11 | -1 | 9 | 8.25 |
| s_city | STRING | 2 | -1 | 8 | 6.5 |
| s_county | STRING | 1 | -1 | 17 | 17 |
| s_state | STRING | 1 | -1 | 2 | 2 |
| s_zip | STRING | 2 | -1 | 5 | 5 |
| s_country | STRING | 1 | -1 | 13 | 13 |
| s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 |
| s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 |
+--------------------+-----------+------------------+--------+----------+-------------------+
Returned 29 row(s) in 0.04s
The following example shows how statistics are represented for a partitioned table. In
this case, we have set up a table to hold the world's most trivial census data, a single
STRING
field, partitioned by a YEAR
column. The table
statistics include a separate entry for each partition, plus final totals for the
numeric fields. The column statistics include some easily deducible facts for the
partitioning column, such as the number of distinct values (the number of partition
subdirectories).
localhost:21000] > describe census;
+------+----------+---------+
| name | type | comment |
+------+----------+---------+
| name | string | |
| year | smallint | |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year | #Rows | #Files | Size | Format |
+-------+-------+--------+------+---------+
| 2000 | -1 | 0 | 0B | TEXT |
| 2004 | -1 | 0 | 0B | TEXT |
| 2008 | -1 | 0 | 0B | TEXT |
| 2010 | -1 | 0 | 0B | TEXT |
| 2011 | 0 | 1 | 22B | TEXT |
| 2012 | -1 | 1 | 22B | TEXT |
| 2013 | -1 | 1 | 231B | PARQUET |
| Total | 0 | 3 | 275B | |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name | STRING | -1 | -1 | -1 | -1 |
| year | SMALLINT | 7 | -1 | 2 | 2 |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s
The following example shows how the statistics are filled in by a COMPUTE
STATS
statement in Impala.
[localhost:21000] > compute stats census;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 2.16s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year | #Rows | #Files | Size | Format |
+-------+-------+--------+------+---------+
| 2000 | -1 | 0 | 0B | TEXT |
| 2004 | -1 | 0 | 0B | TEXT |
| 2008 | -1 | 0 | 0B | TEXT |
| 2010 | -1 | 0 | 0B | TEXT |
| 2011 | 4 | 1 | 22B | TEXT |
| 2012 | 4 | 1 | 22B | TEXT |
| 2013 | 1 | 1 | 231B | PARQUET |
| Total | 9 | 3 | 275B | |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name | STRING | 4 | -1 | 5 | 4.5 |
| year | SMALLINT | 7 | -1 | 2 | 2 |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s
For examples showing how some queries work differently when statistics are available,
see Examples of Join Order Optimization. You can see how Impala
executes a query differently in each case by observing the EXPLAIN
output before and after collecting statistics. Measure the before and after query times,
and examine the throughput numbers in before and after SUMMARY
or
PROFILE
output, to verify how much the improved plan speeds up
performance.