REFRESH Statement
To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are connected through impala-shell, JDBC, or ODBC) must have current metadata about those databases and tables that are referenced in Impala queries. If you are not familiar with the way Impala uses metadata and how it shares the same metastore database as Hive, see Overview of Impala Metadata and the Metastore for background information.
Syntax:
REFRESH [db_name.]table_name [PARTITION (key_col1=val1 [, key_col2=val2...])]
REFRESH FUNCTIONS db_name
Usage notes:
Use the REFRESH statement to load the latest metastore metadata and block location data for a particular table in these scenarios:
- After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why metadata needs to be refreshed.)
- After issuing ALTER TABLE, INSERT, LOAD DATA, or other table-modifying SQL statements in Hive. In Impala 2.3 and higher, the syntax ALTER TABLE table_name RECOVER PARTITIONS is a faster alternative to REFRESH when the only change to the table data is the addition of new partition directories through Hive or manual HDFS operations; see ALTER TABLE Statement for details, and the sketch following this list.
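For example, a minimal sketch of both approaches, assuming a hypothetical partitioned table named sales:
impala> refresh sales;                        -- new files were added to existing partition directories outside Impala
impala> alter table sales recover partitions; -- new partition directories were added; faster than REFRESH in Impala 2.3+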
You only need to issue the REFRESH statement on the node to which you connect to issue queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read requests for the correct HDFS blocks without relying on the metadata on the other nodes.
REFRESH reloads the metadata for the table from the metastore database, and does an incremental reload of the low-level block location data to account for any new data files added to the HDFS data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common scenario where new data files are added to HDFS.
Only the metadata for the specified table is flushed. The table must already exist and be known to Impala, either because the CREATE TABLE statement was run in Impala rather than Hive, or because a previous INVALIDATE METADATA statement caused Impala to reload its entire metadata catalog.
The catalog service broadcasts any changed metadata as a result of Impala ALTER TABLE, INSERT, and LOAD DATA statements to all Impala nodes. Thus, the REFRESH statement is only required if you load data through Hive or by manipulating data files in HDFS directly. See The Impala Catalog Service for more information on the catalog service.
Another way to avoid inconsistency across nodes is to enable the SYNC_DDL query option before performing a DDL statement or an INSERT or LOAD DATA statement.
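For example, a minimal sketch (the table names are hypothetical):
impala> set sync_ddl=1;
impala> insert into t1 select * from staging_t1;  -- does not return until all coordinators have the new metadata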
The table name is a required parameter. To flush the metadata for all tables, use the INVALIDATE METADATA command.
Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.
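A minimal sketch of that sequence, using the prompt conventions from the examples later in this section:
beeline> create table new_table (x int);   -- table created in Hive; Impala does not know about it yet
impala> invalidate metadata new_table;     -- make the new table visible to Impala
-- ... data files for new_table are added outside Impala ...
impala> refresh new_table;                 -- load metadata and block locations for the new files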
INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table; thus, the table name argument is now required.
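To illustrate the tradeoff (the table name is hypothetical):
impala> refresh sales;                 -- immediate, incremental: loads block locations only for newly added files
impala> invalidate metadata sales;     -- marks all metadata stale; the full, expensive reload happens on the next query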
A metadata update for an impalad instance is required if:
- a metadata change occurs,
- and the change is made through Hive,
- and the change is made to a metastore database to which clients such as the Impala shell or ODBC directly connect.
A metadata update for an Impala node is not required after you run ALTER TABLE, INSERT, or other table-modifying statements in Impala rather than Hive. Impala handles the metadata synchronization automatically through the catalog service.
Database and table metadata is typically modified by:
- Hive, through ALTER, CREATE, DROP, or INSERT operations.
- Impalad, through CREATE TABLE, ALTER TABLE, and INSERT operations. Such changes are propagated to all Impala nodes by the Impala catalog service.
REFRESH causes the metadata for that table to be immediately reloaded. For a huge table, that process could take a noticeable amount of time; but doing the refresh up front avoids an unpredictable delay later, for example if the next reference to the table is during a benchmark test.
Refreshing a single partition:
In Impala 2.7 and higher, the REFRESH statement can apply to a single partition at a time, rather than the whole table. Include the optional PARTITION (partition_spec) clause and specify values for each of the partition key columns.
The following examples show how to make Impala aware of data added to a single partition, after data is loaded into a partition's data directory using some mechanism outside Impala, such as Hive or Spark. The partition can be one that Impala created and is already aware of, or a new partition created through Hive.
impala> create table p (x int) partitioned by (y int);
impala> insert into p (x,y) values (1,2), (2,2), (2,1);
impala> show partitions p;
+-------+-------+--------+------+...
| y | #Rows | #Files | Size |...
+-------+-------+--------+------+...
| 1 | -1 | 1 | 2B |...
| 2 | -1 | 1 | 4B |...
| Total | -1 | 2 | 6B |...
+-------+-------+--------+------+...
-- ... Data is inserted into one of the partitions by some external mechanism ...
beeline> insert into p partition (y = 1) values(1000);
impala> refresh p partition (y=1);
impala> select x from p where y=1;
+------+
| x |
+------+
| 2 | <- Original data created by Impala
| 1000 | <- Additional data inserted through Beeline
+------+
The same applies for tables with more than one partition key column. The PARTITION clause of the REFRESH statement must include all the partition key columns.
impala> create table p2 (x int) partitioned by (y int, z int);
impala> insert into p2 (x,y,z) values (0,0,0), (1,2,3), (2,2,3);
impala> show partitions p2;
+-------+---+-------+--------+------+...
| y | z | #Rows | #Files | Size |...
+-------+---+-------+--------+------+...
| 0 | 0 | -1 | 1 | 2B |...
| 2 | 3 | -1 | 1 | 4B |...
| Total | | -1 | 2 | 6B |...
+-------+---+-------+--------+------+...
-- ... Data is inserted into one of the partitions by some external mechanism ...
beeline> insert into p2 partition (y = 2, z = 3) values(1000);
impala> refresh p2 partition (y=2, z=3);
impala> select x from p2 where y=2 and z=3;
+------+
| x |
+------+
| 1 | <- Original data created by Impala
| 2 | <- Original data created by Impala
| 1000 | <- Additional data inserted through Beeline
+------+
The following examples show how specifying a nonexistent partition does not cause any error, and the order of the partition key columns does not have to match the column order in the table. The partition spec must include all the partition key columns; specifying an incomplete set of columns does cause an error.
-- Partition doesn't exist.
refresh p2 partition (y=0, z=3);
refresh p2 partition (y=0, z=-1);
-- Key columns specified in a different order than the table definition.
refresh p2 partition (z=1, y=0);
-- Incomplete partition spec causes an error.
refresh p2 partition (y=0);
ERROR: AnalysisException: Items in partition spec must exactly match the partition columns in the table definition: default.p2 (1 vs 2)
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed metadata has been received by all the Impala nodes. See SYNC_DDL Query Option for details.
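For example, a minimal sketch within an impala-shell session (the hostname is hypothetical):
impala> connect coordinator2.example.com;
impala> set sync_ddl=1;
impala> create table t3 (x int);   -- returns only after all coordinators have received the new metadata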
Examples:
The following example shows how you might use the REFRESH statement after manually adding new HDFS data files to the Impala data directory for a table:
[impalad-host:21000] > refresh t1;
[impalad-host:21000] > refresh t2;
[impalad-host:21000] > select * from t1;
...
[impalad-host:21000] > select * from t2;
...
For more examples of using REFRESH and INVALIDATE METADATA with a combination of Impala and Hive operations, see Switching Back and Forth Between Impala and Hive.
Related impala-shell options:
The impala-shell option -r issues an INVALIDATE METADATA statement when starting up the shell, effectively performing a REFRESH of all tables. Due to the expense of reloading the metadata for all tables, the impala-shell -r option is not recommended for day-to-day use in a production environment. (This option was mainly intended as a workaround for synchronization issues in very old Impala versions.)
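For reference, a sketch of the invocation:
$ impala-shell -r    # runs INVALIDATE METADATA at startup; expensive on clusters with many tables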
HDFS permissions:
The user ID that the impalad daemon runs under, typically the impala user, must have execute permissions for all the relevant directories holding table data. (A table could have data spread across multiple directories, or in unexpected paths, if it uses partitioning or specifies a LOCATION attribute for individual partitions or the entire table.) Issues with permissions might not cause an immediate error for this statement, but subsequent statements such as SELECT or SHOW TABLE STATS could fail.
All HDFS and Sentry permissions and privileges are the same whether you refresh the entire table or a single partition.
HDFS considerations:
The REFRESH command checks HDFS permissions of the underlying data files and directories, caching this information so that a statement can be cancelled immediately if, for example, the impala user does not have permission to write to the data directory for the table. Impala reports any lack of write permissions as an INFO message in the log file, in case that represents an oversight. If you change HDFS permissions to make data readable or writeable by the Impala user, issue another REFRESH to make Impala aware of the change.
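For example, a minimal sketch assuming a hypothetical data directory path:
$ hdfs dfs -chmod -R 755 /user/hive/warehouse/t1    # grant read and execute permissions on the table's data directory
impala> refresh t1;                                 -- make Impala re-check permissions along with file metadata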
Important: After adding or replacing data in a table used in performance-critical queries, issue a COMPUTE STATS statement to make sure all statistics are up-to-date. Consider updating statistics for a table after any INSERT, LOAD DATA, or CREATE TABLE AS SELECT statement in Impala, or after loading data through Hive and doing a REFRESH table_name in Impala. This technique is especially important for tables that are very large, used in join queries, or both.
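For example, after loading data through Hive (the table name is hypothetical):
impala> refresh sales;
impala> compute stats sales;   -- update statistics so the planner sees current row counts and sizes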
Amazon S3 considerations:
The REFRESH and INVALIDATE METADATA statements also cache metadata for tables where the data resides in the Amazon Simple Storage Service (S3). In particular, issue a REFRESH for a table after adding or removing files in the associated S3 data directory. See Using Impala with the Amazon S3 Filesystem for details about working with S3 tables.
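For example, a minimal sketch with a hypothetical S3-backed table and bucket:
impala> create external table s3_sales (x int) location 's3a://example-bucket/sales/';
-- ... files are added to or removed from s3a://example-bucket/sales/ outside Impala ...
impala> refresh s3_sales;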
Cancellation: Cannot be cancelled.
Kudu considerations:
Much of the metadata for Kudu tables is handled by the underlying storage layer. Kudu tables have less reliance on the metastore database, and require less metadata caching on the Impala side. For example, information about partitions in Kudu tables is managed by Kudu, and Impala does not cache any block locality metadata for Kudu tables.
The REFRESH and INVALIDATE METADATA statements are needed less frequently for Kudu tables than for HDFS-backed tables. Neither statement is needed when data is added to, removed, or updated in a Kudu table, even if the changes are made directly to Kudu through a client program using the Kudu API. Run REFRESH table_name or INVALIDATE METADATA table_name for a Kudu table only after making a change to the Kudu table schema, such as adding or dropping a column, by a mechanism other than Impala.
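For example (the table name is hypothetical):
-- A column was added to the underlying Kudu table through the Kudu API, outside Impala:
impala> refresh kudu_metrics;
-- No REFRESH is needed for ordinary inserts, updates, or deletes made through Kudu.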
UDF considerations:
In Impala 2.9 and higher, you can refresh the user-defined functions (UDFs) created through Hive by running the REFRESH FUNCTIONS statement with the database name as an argument. Java-based UDFs can be added to the metastore database through Hive CREATE FUNCTION statements, and made visible to Impala by subsequently running REFRESH FUNCTIONS. For example:
CREATE DATABASE shared_udfs;
USE shared_udfs;
...use CREATE FUNCTION statements in Hive to create some Java-based UDFs
that Impala is not initially aware of...
REFRESH FUNCTIONS shared_udfs;
SELECT udf_created_by_hive(c1) FROM ...
Related information:
Overview of Impala Metadata and the Metastore, INVALIDATE METADATA Statement