This release of Impala contains the following changes and enhancements from previous releases.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the release notes or changelog for Impala 4.0.
The following sections describe the noteworthy improvements made in Impala 3.4.
For the full list of issues closed in this release, see the changelog for Impala 3.4.
Impala added support for truncating insert-only transactional tables.
By default, Impala creates an insert-only transactional table when you issue the CREATE TABLE statement.
Use Hive compaction to compact small files, which improves the performance and scalability of metadata in transactional tables.
See Impala Transactions for more information.
You can use the SPOOL_QUERY_RESULTS query option to control how query results are returned to the client.
By default, Impala produces query results in batches as the client fetches them, until all the result rows have been returned. If a client issues a query without fetching all the results, the query fragments continue to hold on to their resources until the query is canceled and unregistered, potentially tying up resources and causing other queries to wait in admission control.
When the query result spooling feature is enabled, the result sets of queries are eagerly fetched and buffered until they are read by the client, and resources are freed up for other queries.
See Spooling Impala Query Results for the new feature and the query options.
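For example, a minimal sketch of enabling result spooling for a session (the table name big_table is illustrative):

  -- Buffer results on the coordinator until the client reads them,
  -- so query fragments can release their resources early.
  SET SPOOL_QUERY_RESULTS=TRUE;
  SELECT * FROM big_table WHERE col1 > 100;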
Starting in this version, Impala supports cookies for authentication when clients connect via HiveServer2 over HTTP.
You can use the --max_cookie_lifetime_s startup flag to control how long generated cookies remain valid, or set it to 0 to disable cookie authentication.
See Impala Client Access for more information.
Object ownership for tables, views, and databases is enabled by default in Impala. When you create a database, a table, or a view, as the owner of that object you implicitly have privileges on the object. The privileges that owners have are specified in Ranger on the special user, {OWNER}. The {OWNER} user must be defined in Ranger for the object ownership privileges to work in Impala.
See Impala Authorization for details.
Use the new Jaro and Jaro-Winkler functions to perform fuzzy matches on relatively short strings, for example, to scrub user-entered names against the records in the database.
JARO_DISTANCE, JARO_DST
JARO_SIMILARITY, JARO_SIM
JARO_WINKLER_DISTANCE, JW_DST
JARO_WINKLER_SIMILARITY, JW_SIM
See Impala String Functions for details.
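As a quick sketch of calling these functions (the sample strings are illustrative):

  -- Similarity is 1.0 for identical strings and 0.0 for completely
  -- dissimilar strings; distance is the complement.
  SELECT JARO_WINKLER_SIMILARITY('MARTHA', 'MARHTA');
  SELECT JARO_DISTANCE('gretzky', 'gretski');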
When configuring scratch space for intermediate files used in large sorts, joins, aggregations, or analytic function operations, use the --scratch_dirs startup flag to optionally specify a capacity quota per scratch directory, e.g., --scratch_dirs=/dir1:5MB,/dir2.
See How Impala Works with Hadoop File Formats for details.
During query plan generation, Impala samples underlying HBase tables to estimate row count and row size, but the sampling process can negatively impact planning time. To alleviate the issue, when the HBase table stats do not change much in a short time, disable the sampling with the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option so that the Impala planner falls back to using Hive Metastore (HMS) table stats instead.
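A minimal sketch of disabling the sampling for a session:

  -- Skip HBase row-count sampling during planning; the planner falls
  -- back to the HMS table stats instead.
  SET DISABLE_HBASE_NUM_ROWS_ESTIMATE=TRUE;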
To optimize query performance, the Impala planner formerly used the value of the fs.s3a.block.size startup flag when calculating the split size on non-block-based stores, such as S3 and ADLS. Starting in this release, the planner uses the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to get the Parquet file format specific split size; for Parquet files, the fs.s3a.block.size startup flag is no longer used. The default value of the PARQUET_OBJECT_STORE_SPLIT_SIZE query option is 256 MB.
See Using Impala with Amazon S3 Object Store for tuning Impala query performance for S3.
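For example, a hedged sketch of lowering the split size to increase scan parallelism for an S3-backed Parquet table (the 128 MB value is illustrative; the option takes a size in bytes):

  -- 134217728 bytes = 128 MB.
  SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;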
On the Query Details page of Impala Daemon Web UI, you have a new option, in addition to the existing Thrift and Text formats, to export the query profile output in the JSON format.
See Impala Web User Interface for Debugging for generating JSON query profile outputs in Web UI.
You can now use the DATE data type to query date values from Avro tables.
See Using the Avro File Format with Impala Tables for details.
This release adds support for primary and foreign key constraints, but in this release the constraints are advisory and intended for estimating cardinality during query planning in a future release. There is no attempt to enforce constraints. See CREATE TABLE Statement for details.
By default, HMS implicitly translates internal Kudu tables to external Kudu tables with the 'external.table.purge' property set to true. These tables behave similarly to internal tables. You can explicitly create such external Kudu tables. See CREATE TABLE Statement for details.
This release supports Ranger column masking, which hides sensitive columnar data in Impala query output. For example, you can define a policy that reveals only the first or last four characters of column data. Column masking is enabled by default. See Ranger Column Masking for details.
You can set a default limit on the size of the broadcast input to prevent performance problems caused by excessively large broadcast joins.
In this release, you can use Read Optimized Queries on Hudi tables. See Using the Hudi File Format for details.
Impala stability and performance have been improved. Consequently, ORC reads are now enabled in Impala by default. To disable, set --enable_orc_scanner to false when starting the cluster. See Using the ORC File Format with Impala Tables for details.
This release supports ZSTD and DEFLATE compression codecs for text files. See Using bzip2, deflate, gzip, Snappy, or zstd Text Files for details.
The following sections describe the noteworthy improvements made in Impala 3.3.
For the full list of issues closed in this release, see the changelog for Impala 3.3.
Apache Ranger: Use Apache Ranger to manage authorization in Impala. See Impala Authorization for details.
Apache Atlas: Use Apache Atlas to manage data governance in Impala.
Hive 3: This release of Impala works with Hive version 3.
To improve performance when using Parquet files, Impala can now write page indexes in Parquet files and use those indexes to skip pages for faster scans.
See Query Performance for Impala Parquet Tables for details.
Impala can now cache remote HDFS file handles for tables that store their data in Amazon S3 cloud storage. See Scalability Considerations for File Handle Caching for information on the remote file handle cache.
In Impala 3.3 and Kudu 1.10, Kudu is integrated with Hive Metastore (HMS), and from Impala, you can create, update, delete, and query the tables in the Kudu services integrated with HMS.
See Using Kudu with Impala for information on using Kudu tables in Impala.
Zstandard (Zstd) is a real-time compression algorithm offering a tradeoff between compression speed and compression ratio. Compression levels from 1 up to 22 are supported; the lower the level, the faster the speed, at the cost of compression ratio.
LZ4 is a lossless compression algorithm providing extremely fast and scalable compression and decompression.
To improve performance on multi-cluster HDFS environments as well as on object store environments, Impala now caches data for non-local reads (e.g. S3, ABFS, ADLS) on local storage.
The data cache is enabled with the --data_cache startup flag.
See Impala Remote Data Cache for the information and steps to enable remote data cache.
The following features to improve metadata performance are enabled by default in this release:
Incremental stats are now compressed in memory in catalogd, reducing the memory footprint in catalogd.
impalad coordinators fetch incremental stats from catalogd on demand, reducing the memory footprint and the network requirements for broadcasting metadata.
Time-based and memory-based automatic invalidation of metadata keeps the size of metadata bounded and reduces the chances of the catalogd cache running out of memory.
Automatic invalidation of metadata: with automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH statements in a number of conditions.
In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:
INSERT into tables and partitions from Impala or from Spark, on the same cluster or in a multi-cluster configuration
See Metadata Management for the information on the above features.
To offer more dynamic and flexible resource management, Impala supports the new configuration parameters that scale with the number of hosts in the resource pool. You can use the parameters to control the number of running queries, queued queries, and maximum amount of memory allocated for Impala resource pools. See Admission Control and Query Queuing for the information about the new parameters and using them for admission control.
The following information was added to the Query Profile output for better monitoring and troubleshooting of query performance.
Network I/O throughput
System disk I/O throughput
See Impala Query Profile for generating and reading query profile.
You can use the new DATE data type to describe a particular year/month/day, in the form YYYY-MM-DD.
This initial DATE type supports the TEXT, Parquet, and HBase file formats.
Support for the DATE data type includes the following features:
DATE type columns as a partitioning key column
DATE literals
Implicit casting between DATE and other types: STRING and TIMESTAMP
Functions that take TIMESTAMP arguments now allow DATE arguments, as well.
See DATE Data Type and Impala Date and Time Functions for using the DATE type.
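A brief sketch of these features together (the table and column names are illustrative):

  -- DATE as a partitioning key, a DATE literal, and implicit casting
  -- from STRING to DATE in the WHERE clause.
  CREATE TABLE events (id INT) PARTITIONED BY (event_date DATE);
  INSERT INTO events PARTITION (event_date = DATE '2019-12-31') VALUES (1);
  SELECT id FROM events WHERE event_date = '2019-12-31';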
Impala added support for creating, dropping, querying, and inserting into insert-only transactional tables.
See Impala Transactions for details.
Client applications can now connect to Impala over HTTP via HiveServer2, with the option to use Kerberos SPNEGO or LDAP for authentication. See Impala Clients for details.
When you create a table, the default file format for that table's data is now Parquet. For backward compatibility, you can use the DEFAULT_FILE_FORMAT query option to set the default file format to the previous default, text, or to other formats.
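For example, a sketch of restoring the previous default for a session:

  -- Tables created in this session default to text format again,
  -- without needing a STORED AS clause.
  SET DEFAULT_FILE_FORMAT=TEXT;
  CREATE TABLE t1 (c1 INT);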
The GET_JSON_OBJECT() function extracts a JSON object from a string based on the path specified and returns the extracted JSON object. See Impala Miscellaneous Functions for details.
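A small sketch of the function (the JSON string is illustrative):

  -- Returns 1, the value at path $.a.
  SELECT GET_JSON_OBJECT('{"a": 1, "b": [2, 3]}', '$.a');
  -- Returns 3, the second element of the array b.
  SELECT GET_JSON_OBJECT('{"a": 1, "b": [2, 3]}', '$.b[1]');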
This version of Impala is certified to run on Ubuntu 18.04.
The following sections describe the noteworthy improvements made in Impala 3.2.
For the full list of issues closed in this release, see the changelog for Impala 3.2.
Impala can now cache remote HDFS file handles when the cache_remote_file_handles impalad flag is set to true. This feature does not apply to non-HDFS tables, such as Kudu or HBase tables, and does not apply to tables that store their data on cloud services, such as S3 or ADLS. See Scalability Considerations for details about file handle caching in Impala.
A new admission control page, /admission, was added to the Impala daemon web UI and provides information about Impala resource pools.
Impala will cancel a query if the query produces more rows than the limit specified by the NUM_ROWS_PRODUCED_LIMIT query option. The limit applies only when the results are returned to a client, e.g. for a SELECT query, but not for an INSERT query. This query option is a guardrail against users accidentally submitting queries that return a large number of rows.
When enabled, the catalogd polls Hive Metastore (HMS) notification events at a configurable interval and syncs with HMS. You can use the new web UI pages of the catalogd to check the state of the automatic invalidate event processor.
Note: This is a preview feature in Impala 3.2.
Impala now supports the TIMESTAMP_MILLIS and TIMESTAMP_MICROS Parquet types. See Using Parquet File Format for Impala Tables for the Parquet support in Impala.
The new LEVENSHTEIN (LE_DST) string function returns the Levenshtein distance between two input strings, the minimum number of single-character edits required to transform one string into the other.
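A quick sketch of the new function:

  -- Returns 3: kitten -> sitten -> sittin -> sitting.
  SELECT LEVENSHTEIN('kitten', 'sitting');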
The IF NOT EXISTS clause is now supported in the ALTER TABLE statement.
The DEFAULT_FILE_FORMAT query option allows you to set the default table file format, removing the need for the STORED AS <format> clause. Set this option if you prefer a default other than TEXT. The supported formats are: TEXT, RC_FILE, SEQUENCE_FILE, AVRO, PARQUET, KUDU, and ORC.
The EXPLAIN output now includes additional information for queries.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 3.1.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 3.0.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.12.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.11.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.10.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.9.
The following are some of the most significant new features in this release:
A new function, replace(), which is faster than regexp_replace() for simple string substitutions.
See Impala String Functions for details.
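A short sketch of the function:

  -- Returns 'hello Impala'; unlike regexp_replace(), the pattern is a
  -- literal string rather than a regular expression.
  SELECT REPLACE('hello world', 'world', 'Impala');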
Startup flags for the impalad daemon, is_executor
and is_coordinator
, let you divide the work on a large, busy cluster
between a small number of hosts acting as query coordinators, and a larger number of
hosts acting as query executors. By default, each host can act in both roles,
potentially introducing bottlenecks during heavily concurrent workloads.
See How to Configure Impala with Dedicated Coordinators for details.
Performance and scalability improvements:
The COMPUTE STATS statement can take advantage of multithreading.
Improved scalability for highly concurrent loads by reducing the possibility of TCP/IP timeouts.
A configuration setting, accepted_cnxn_queue_depth
, can be adjusted upwards to
avoid this type of timeout on large clusters.
Several performance improvements were made to the mechanism for generating native code:
Some queries involving analytic functions can take better advantage of native code generation.
Modules produced during intermediate code generation are organized to be easier to cache and reuse during the lifetime of a long-running or complicated query.
The COMPUTE STATS statement is more efficient (less time for the codegen phase) for tables with a large number of columns, especially for tables containing TIMESTAMP columns.
The logic for determining whether or not to use a runtime filter is more reliable, and the evaluation process itself is faster because of native code generation.
The MT_DOP query option enables multithreading for a number of Impala operations. COMPUTE STATS statements for Parquet tables use a default of MT_DOP=4 to improve the intra-node parallelism and CPU efficiency of this data-intensive operation.
The COMPUTE STATS statement is more efficient (less time for the codegen phase) for tables with a large number of columns.
A new hint, CLUSTERED
,
allows Impala INSERT
operations on a Parquet table
that use dynamic partitioning to process a high number of
partitions in a single statement. The data is ordered based on the
partition key columns, and each partition is only written
by a single host, reducing the amount of memory needed to buffer
Parquet data while the data blocks are being constructed.
The new configuration setting inc_stats_size_limit_bytes
lets you reduce the load on the catalog server when running the
COMPUTE INCREMENTAL STATS
statement for very large tables.
Impala folds many constant expressions within query statements,
rather than evaluating them for each row. This optimization
is especially useful when using functions to manipulate and
format TIMESTAMP
values, such as the result
of an expression such as to_date(now() - interval 1 day)
.
Parsing of complicated expressions is faster. This speedup is
especially useful for queries containing large CASE
expressions.
Evaluation is faster for IN
operators with many constant
arguments. The same performance improvement applies to other functions
with many constant arguments.
Impala optimizes identical comparison operators within multiple OR blocks.
The reporting for wall-clock times and total CPU time in profile output is more accurate.
A new query option, SCRATCH_LIMIT
, lets you restrict the amount of
space used when a query exceeds the memory limit and activates the "spill to disk" mechanism.
This option helps to avoid runaway queries or make queries "fail fast" if they require more
memory than anticipated. You can prevent runaway queries from using excessive amounts of spill space,
without restarting the cluster to turn the spilling feature off entirely.
See SCRATCH_LIMIT Query Option for details.
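A hedged sketch of capping spill space for a session (the value is illustrative; the option takes a number of bytes):

  -- Limit "spill to disk" usage to roughly 10 GB for queries in this session.
  SET SCRATCH_LIMIT=10000000000;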
Integration with Apache Kudu:
The experimental Impala support for the Kudu storage layer has been folded into the main Impala development branch. Impala can now directly access Kudu tables, opening up new capabilities such as enhanced DML operations and continuous ingestion.
The DELETE
statement is a flexible way to remove data from a Kudu table. Previously,
removing data from an Impala table involved removing or rewriting the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
The UPDATE
statement is a flexible way to modify data within a Kudu table. Previously,
updating data in an Impala table involved replacing the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
The UPSERT statement is a flexible way to ingest data, modify data, or both, within a Kudu table. Previously, ingesting data that might contain duplicates involved an inefficient multi-stage operation, and there was no built-in protection against duplicate data. The UPSERT statement, in combination with the primary key designation for Kudu tables, lets you add or replace rows in a single operation, and automatically avoids creating any duplicate data.
The CREATE TABLE statement gains some new clauses that are specific to Kudu tables: PARTITION BY, PARTITIONS, STORED AS KUDU, and the column attributes PRIMARY KEY, NULL and NOT NULL, ENCODING, COMPRESSION, DEFAULT, and BLOCK_SIZE. These clauses replace the explicit TBLPROPERTIES settings that were required in the early experimental phases of integration between Impala and Kudu.
The ALTER TABLE
statement can change certain attributes of Kudu tables.
You can add, drop, or rename columns.
You can add or drop range partitions.
You can change the TBLPROPERTIES value to rename or point to a different underlying Kudu table, independently from the Impala table name in the metastore database.
You cannot change the data type of an existing column in a Kudu table.
The SHOW PARTITIONS statement displays information about the distribution of data between partitions in Kudu tables. A new variation, SHOW RANGE PARTITIONS, displays information about the Kudu-specific partitions that apply across ranges of key values.
Not all Impala data types are supported in Kudu tables. In particular, currently the Impala TIMESTAMP type is not allowed in a Kudu table. Impala does not recognize the UNIXTIME_MICROS Kudu type when it is present in a Kudu table. (These two representations of date/time data use different units and are not directly compatible.) You cannot create columns of type TIMESTAMP, DECIMAL, VARCHAR, or CHAR within a Kudu table. Within a query, you can cast values in a result set to these types. Certain types, such as BOOLEAN, cannot be used as primary key columns.
Currently, Kudu tables are not interchangeable between Impala and Hive the way other kinds of Impala tables are. Although the metadata for Kudu tables is stored in the metastore database, currently Hive cannot access Kudu tables.
The INSERT
statement works for Kudu tables. The organization
of the Kudu data makes it more efficient than with HDFS-backed tables to insert
data in small batches, such as with the INSERT ... VALUES
syntax.
Some audit data is recorded for data governance purposes.
All UPDATE, DELETE, and UPSERT statements are characterized as INSERT operations in the audit log. Currently, lineage metadata is not generated for UPDATE and DELETE operations on Kudu tables.
Access to Kudu tables must be granted to roles as usual.
Currently, access to a Kudu table through Sentry is "all or nothing".
You cannot enforce finer-grained permissions such as at the column level,
or permissions on certain operations such as INSERT
.
Only users with ALL privileges on SERVER can create external Kudu tables.
Equality and IN
predicates in Impala queries are pushed to
Kudu and evaluated efficiently by the Kudu storage layer.
Security:
Impala can take advantage of the S3 encrypted credential store, to avoid exposing the secret key when accessing data stored on S3.
[IMPALA-1654] Several kinds of DDL operations can now work on a range of partitions. The partitions can be specified using operators such as <, >=, and != rather than just an equality predicate applying to a single partition.
This new feature extends the syntax of several clauses of the ALTER TABLE statement (DROP PARTITION, SET [UN]CACHED, SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES), the SHOW FILES statement, and the COMPUTE INCREMENTAL STATS statement. It does not apply to statements that are defined to only apply to a single partition, such as LOAD DATA, ALTER TABLE ... ADD PARTITION, SET LOCATION, and INSERT with a static partitioning clause.
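As a sketch of the extended syntax (the table sales and its year partition key are illustrative):

  -- Drop every partition for years before 2010 in one statement.
  ALTER TABLE sales DROP PARTITION (year < 2010);
  -- Compute incremental stats across a range of partitions.
  COMPUTE INCREMENTAL STATS sales PARTITION (year >= 2015);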
The instr() function has optional second and third arguments, representing the character position at which to begin searching for the substring, and the Nth occurrence of the substring to find.
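A small sketch of the extended signature:

  -- Returns 9: starting at position 1, find the 2nd occurrence of 'bar'.
  SELECT INSTR('foo bar bar', 'bar', 1, 2);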
Improved error handling for malformed Avro data. In particular, incorrect precision or scale for DECIMAL types is now handled.
Impala debug web UI:
In addition to "inflight" and "finished" queries, the web UI now also includes a section for "queued" queries.
The /sessions tab now clarifies how many of the displayed sessions are active, and lets you sort by Expired status to distinguish active sessions from expired ones.
Improved stability when DDL operations such as CREATE DATABASE or DROP DATABASE are run in Hive at the same time as an Impala INVALIDATE METADATA statement.
The "out of memory" error report was made more user-friendly, with additional diagnostic information to help identify the spot where the memory limit was exceeded.
Improved disk space usage for Java-based UDFs. Temporary copies of the associated JAR
files are removed when no longer needed, so that they do not accumulate across restarts
of the catalogd daemon and potentially cause an out-of-space condition.
These temporary files are also created in the directory specified by the local_library_dir
configuration setting, so that the storage for these temporary files can be independent
from any capacity limits on the /tmp filesystem.
Performance improvements:
[IMPALA-3206] Speedup for queries against DECIMAL columns in Avro tables. The code that parses DECIMAL values from Avro now uses native code generation.
[IMPALA-3674] Improved efficiency in LLVM code generation can reduce codegen time, especially for short queries.
[IMPALA-2979] Improvements to scheduling on worker nodes, enabled by the REPLICA_PREFERENCE query option. See REPLICA_PREFERENCE Query Option (Impala 2.7 or higher only) for details.
[IMPALA-1683] The REFRESH statement can be applied to a single partition, rather than the entire table. See REFRESH Statement and Refreshing a Single Partition for details.
Improvements to the Impala web user interface:
[IMPALA-2767] You can now force a session to expire by clicking a link in the web UI, on the /sessions tab.
[IMPALA-3715] The /memz tab includes more information about Impala memory usage.
[IMPALA-3716] The Details page for a query now includes a Memory tab.
[IMPALA-3499] Scalability improvements to the catalog server. Impala handles internal communication more efficiently for tables with large numbers of columns and partitions, where the size of the metadata exceeds 2 GiB.
[IMPALA-3677] You can send a SIGUSR1 signal to any Impala-related daemon to write a Breakpad minidump. For advanced troubleshooting, you can now produce a minidump without triggering a crash. See Breakpad Minidumps for Impala (Impala 2.6 or higher only) for details about the Breakpad minidump feature.
[IMPALA-3687] The schema reconciliation rules for Avro tables have changed slightly for CHAR and VARCHAR columns. Now, if the definition of such a column is changed in the Avro schema file, the column retains its CHAR or VARCHAR type as specified in the SQL definition, but the column name and comment from the Avro schema file take precedence. See Creating Avro Tables for details about column definitions in Avro tables.
[IMPALA-3575] Some network operations now have additional timeout and retry settings. The extra configuration helps avoid failed queries for transient network problems, to avoid hangs when a sender or receiver fails in the middle of a network transmission, and to make cancellation requests more reliable despite network issues.
Improvements to Impala support for the Amazon S3 filesystem:
Impala can now write to S3 tables through the INSERT
or LOAD DATA
statements.
See Using Impala with Amazon S3 Object Store for general information about
using Impala with S3.
A new query option, S3_SKIP_INSERT_STAGING
, lets you
trade off between fast INSERT
performance and
slower INSERT
s that are more consistent if a
problem occurs during the statement. The new behavior is enabled by default.
See S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only) for details
about this option.
Performance improvements for the runtime filtering feature:
The default for the RUNTIME_FILTER_MODE
query option is changed to GLOBAL
(the highest setting).
See RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only) for
details about this option.
The RUNTIME_BLOOM_FILTER_SIZE
setting is now only used
as a fallback if statistics are not available; otherwise, Impala
uses the statistics to estimate the appropriate size to use for each filter.
See RUNTIME_BLOOM_FILTER_SIZE Query Option (Impala 2.5 or higher only) for
details about this option.
New query options RUNTIME_FILTER_MIN_SIZE and RUNTIME_FILTER_MAX_SIZE let you fine-tune the sizes of the Bloom filter structures used for runtime filtering. If the filter size derived from Impala internal estimates or from the RUNTIME_BLOOM_FILTER_SIZE setting falls outside the size range specified by these options, any too-small filter size is adjusted to the minimum, and any too-large filter size is adjusted to the maximum. See RUNTIME_FILTER_MIN_SIZE Query Option (Impala 2.6 or higher only) and RUNTIME_FILTER_MAX_SIZE Query Option (Impala 2.6 or higher only) for details about these options.
Runtime filter propagation now applies to all the operands of UNION and UNION ALL operators.
Runtime filters can now be produced during join queries even when the join processing activates the spill-to-disk mechanism.
Admission control and dynamic resource pools are enabled by default. See Admission Control and Query Queuing for details about admission control.
You can now set column statistics manually in Impala, using the ALTER TABLE statement with a SET COLUMN STATS clause. See Table and Column Statistics for details.
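A hedged sketch of the clause (the numDVs and numNulls keys shown here are the commonly documented statistics; the values are illustrative):

  -- Manually record 1000 distinct values and zero NULLs for column c1.
  ALTER TABLE t1 SET COLUMN STATS c1 ('numDVs'='1000', 'numNulls'='0');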
Impala can now write lightweight "minidump" files, rather than large core files, to save diagnostic information when any of the Impala-related daemons crash. This feature uses the open source breakpad framework. See Breakpad Minidumps for Impala (Impala 2.6 or higher only) for details.
The PARQUET_FALLBACK_SCHEMA_RESOLUTION
query option
lets Impala locate columns within Parquet files based on
column name rather than ordinal position.
This enhancement improves interoperability with applications
that write Parquet files with a different order or subset of
columns than are used in the Impala table.
See PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)
for details.
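For example, a sketch of switching to name-based resolution for a session:

  -- Match Parquet file columns to table columns by name rather than
  -- by ordinal position.
  SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME;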
The PARQUET_ANNOTATE_STRINGS_UTF8
query option
makes Impala include the UTF-8
annotation
metadata for STRING
, CHAR
,
and VARCHAR
columns in Parquet files created
by INSERT
or CREATE TABLE AS SELECT
statements.
See PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only)
for details.
Improvements to security and reduction in overhead for secure clusters:
Overall performance improvements for secure clusters. (TPC-H queries on a secure cluster were benchmarked at roughly 3x as fast as the previous release.)
Impala now recognizes the auth_to_local
setting,
specified through the HDFS configuration setting
hadoop.security.auth_to_local
.
This feature is disabled by default; to enable it,
specify --load_auth_to_local_rules=true
in the impalad configuration settings.
See Mapping Kerberos Principals to Short Names for Impala for details.
Timing improvements in the mechanism for the impalad daemon to acquire Kerberos tickets. This feature spreads out the overhead on the KDC during Impala startup, especially for large clusters.
For Kerberized clusters, the Catalog service now uses the Kerberos principal instead of the operating system user that runs the catalogd daemon.
This eliminates the requirement to configure a hadoop.user.group.static.mapping.overrides
setting to put the OS user into the Sentry administrative group, on clusters where the principal
and the OS user name for this user are different.
Overall performance improvements for join queries, by using a prefetching mechanism while building the in-memory hash table to evaluate join predicates. See PREFETCH_MODE Query Option (Impala 2.6 or higher only) for the query option to control this optimization.
The impala-shell interpreter has a new command,
SOURCE
, that lets you run a set of SQL statements
or other impala-shell commands stored in a file.
You can run additional SOURCE
commands from inside
a file, to set up flexible sequences of statements for use cases
such as schema setup, ETL, or reporting.
See impala-shell Command Reference for details
and Running Commands and SQL Statements in impala-shell
for examples.
The millisecond()
built-in function lets you extract
the fractional seconds part of a TIMESTAMP
value.
See Impala Date and Time Functions for details.
If an Avro table is created without column definitions in the
CREATE TABLE
statement, and columns are later
added through ALTER TABLE
, the resulting
table is now queryable. Missing values from the newly added
columns now default to NULL
.
See Using the Avro File Format with Impala Tables for general details about
working with Avro files.
The handling of DECIMAL literals is improved, no longer going through an intermediate conversion step to DOUBLE:
Casting a DECIMAL value to TIMESTAMP now produces a more precise value for the TIMESTAMP than formerly.
Certain function calls involving DECIMAL literals now succeed, when formerly they failed due to lack of a function signature with a DOUBLE argument.
Faster runtime performance for DECIMAL constant values, through improved native code generation for all combinations of precision and scale of the DECIMAL type.
Improved type accuracy for CASE return values. If all WHEN clauses of the CASE expression are of CHAR type, the final result is also CHAR instead of being converted to STRING. See Impala Conditional Functions for details about the CASE function.
Uncorrelated queries using the NOT EXISTS
operator
are now supported. Formerly, the NOT EXISTS
operator was only available for correlated subqueries.
Improved performance for reading Parquet files.
Improved performance for top-N queries, that is,
those including both ORDER BY
and
LIMIT
clauses.
Impala optionally skips an arbitrary number of header lines from text input
files on HDFS based on the skip.header.line.count
value
in the TBLPROPERTIES
field of the table metadata.
See Data Files for Text Tables for details.
Trailing comments are now allowed in queries processed by
the impala-shell options -q
and -f
.
Impala can run COUNT
queries for RCFile tables
that include complex type columns.
See Complex Types (Impala 2.3 or higher only) for
general information about working with complex types,
and ARRAY Complex Type (Impala 2.3 or higher only),
MAP Complex Type (Impala 2.3 or higher only), and STRUCT Complex Type (Impala 2.3 or higher only)
for syntax details of each type.
Dynamic partition pruning. When a query refers to a partition key column in a WHERE clause, and the exact set of column values is not known until the query is executed, Impala evaluates the predicate and skips the I/O for entire partitions that are not needed. For example, if a table was partitioned by year, Impala would apply this technique to a query such as SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table). See Dynamic Partition Pruning for details.
The dynamic partition pruning optimization technique lets Impala avoid reading
data files from partitions that are not part of the result set, even when
that determination cannot be made in advance. This technique is especially valuable
when performing join queries involving partitioned tables. For example, if a join
query includes an ON
clause and a WHERE
clause
that refer to the same columns, the query can find the set of column values that
match the WHERE
clause, and only scan the associated partitions
when evaluating the ON
clause.
Dynamic partition pruning is controlled by the same settings as the runtime filtering feature.
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL
.
Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries. Using the same technique as with dynamic partition pruning, Impala uses the predicates from WHERE and ON clauses to determine the subset of column values from one of the joined tables that could possibly be part of the result set. Impala sends a compact representation of the filter condition to the hosts in the cluster, instead of the full set of values or the entire table.
See Runtime Filtering for Impala Queries (Impala 2.5 or higher only) for details.
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL
.
See RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only) for details.
This feature involves some new query options: RUNTIME_FILTER_MODE, MAX_NUM_RUNTIME_FILTERS, RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_WAIT_TIME_MS, and DISABLE_ROW_RUNTIME_FILTERING. See the documentation for each of these query options for details.
More efficient use of the HDFS caching feature, to avoid
hotspots and bottlenecks that could occur if heavily used
cached data blocks were always processed by the same host.
By default, Impala now randomizes which host processes each cached
HDFS data block, when cached replicas are available on multiple hosts.
(Remember to use the WITH REPLICATION
clause with the
CREATE TABLE
or ALTER TABLE
statement
when enabling HDFS caching for a table or partition, to cache the same
data blocks across multiple hosts.)
The new query option SCHEDULE_RANDOM_REPLICA
lets you fine-tune the interaction with HDFS caching even more.
See Using HDFS Caching with Impala (Impala 2.1 or higher only) for details.
The TRUNCATE TABLE statement now accepts an IF EXISTS clause, making TRUNCATE TABLE easier to use in setup or ETL scripts where the table might or might not exist.
See TRUNCATE TABLE Statement (Impala 2.3 or higher only) for details.
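A short sketch of the clause (the table name is illustrative):

  -- Succeeds whether or not staging_table currently exists.
  TRUNCATE TABLE IF EXISTS staging_table;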
Performance improvements for the DECIMAL data type:
Using DECIMAL values in a GROUP BY clause now triggers the native code generation optimization, speeding up queries that group by values such as prices.
Checking for overflow in DECIMAL multiplication is now substantially faster, making DECIMAL a more practical data type in some use cases where formerly DECIMAL was much slower than FLOAT or DOUBLE.
Multiplying a mixture of DECIMAL and FLOAT or DOUBLE values now returns DOUBLE rather than DECIMAL. This change avoids some cases where an intermediate value would underflow or overflow and become NULL unexpectedly.
For UDFs written in Java, or Hive UDFs reused for Impala, Impala now allows parameters and return values to be primitive types. Formerly, these things were required to be one of the "Writable" object types. See Using Hive UDFs with Impala for details.
Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the overhead of repeatedly opening the same file.
Performance improvements for queries involving nested complex types. Certain basic query types, such as counting the elements of a complex column, now use an optimized code path.
Improvements to the memory reservation mechanism for the Impala admission control feature. You can specify more settings, such as the timeout period and maximum aggregate memory used, for each resource pool instead of globally for the Impala instance. The default limit for concurrent queries (the max requests setting) is now unlimited instead of 200.
Performance improvements related to code generation.
Even in queries where code generation is not performed
for some phases of execution (such as reading data from
Parquet tables), Impala can still use code generation in
other parts of the query, such as evaluating
functions in the WHERE
clause.
Performance improvements for queries using aggregation functions
on high-cardinality columns.
Formerly, Impala could do unnecessary extra work to produce intermediate
results for operations such as DISTINCT
or GROUP BY
on columns that were unique or had few duplicate values.
Now, Impala decides at run time whether it is more efficient to do an initial aggregation phase and pass along a smaller set of intermediate data, or to pass raw intermediate data back to the next phase of query processing to be aggregated there.
This feature is known as streaming pre-aggregation.
In case of performance regression, this feature can be turned off
using the DISABLE_STREAMING_PREAGGREGATIONS
query option.
See DISABLE_STREAMING_PREAGGREGATIONS Query Option (Impala 2.5 or higher only) for details.
Spill-to-disk feature now always recommended. In earlier releases, the spill-to-disk feature
could be turned off using a pair of configuration settings,
enable_partitioned_aggregation=false
and
enable_partitioned_hash_join=false
.
The latest improvements in the spill-to-disk mechanism, and related features that
interact with it, make this feature robust enough that disabling it is now
no longer needed or supported. In particular, some new features in Impala 2.5
and higher do not work when the spill-to-disk feature is disabled.
Improvements to scripting capability for the impala-shell command, through user-specified substitution variables that can appear in statements processed by impala-shell:
The --var
command-line option lets you pass key-value pairs to
impala-shell. The shell can substitute the values
into queries before executing them, where the query text contains the notation
${var:varname}
. For example, you might prepare a SQL file
containing a set of DDL statements and queries containing variables for
database and table names, and then pass the applicable names as part of the
impala-shell -f filename
command.
See Running Commands and SQL Statements in impala-shell for details.
The SET
and UNSET
commands within the
impala-shell interpreter now work with user-specified
substitution variables, as well as the built-in query options.
The two kinds of variables are divided in the SET
output.
As with variables defined by the --var
command-line option,
you refer to the user-specified substitution variables in queries by using
the notation ${var:varname}
in the query text. Because the substitution variables are processed by
impala-shell instead of the impalad
backend, you cannot define your own substitution variables through the
SET
statement in a JDBC or ODBC application.
See SET Statement for details.
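As a sketch, within an interactive impala-shell session (the variable name and value are illustrative):

  -- Define a substitution variable, then reference it in a query.
  SET VAR:tname=sales_2016;
  SELECT COUNT(*) FROM ${var:tname};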
Performance improvements for query startup. Impala better parallelizes certain work when coordinating plan distribution between impalad instances, which improves startup time for queries involving tables with many partitions on large clusters, or complicated queries with many plan fragments.
Performance and scalability improvements for tables with many partitions. The memory requirements on the coordinator node are reduced, making it substantially faster and less resource-intensive to do joins involving several tables with thousands of partitions each.
Whitelisting for access to internal APIs. For applications that need direct access
to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can
specify a list of Kerberos users who are allowed to call those APIs. By default, the
impala
and hdfs
users are the only ones authorized
for this kind of access.
Any users not explicitly authorized through the internal_principals_whitelist
configuration setting are blocked from accessing the APIs. This setting applies to all the
Impala-related daemons, although currently it is primarily used for HDFS to control the
behavior of the catalog server.
Improvements to Impala integration and usability for Hue. (The code changes are actually on the Hue side.)
The list of tables now refreshes dynamically.
Usability improvements for case-insensitive queries.
You can now use the operators ILIKE
and IREGEXP
to perform case-insensitive wildcard matches or regular expression matches,
rather than explicitly converting column values with UPPER
or LOWER
.
See ILIKE Operator and IREGEXP Operator for details.
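A brief sketch of the operators (the customers table is illustrative):

  -- Matches 'Jones', 'JONES', 'jones', and so on.
  SELECT name FROM customers WHERE name ILIKE 'jones%';
  -- Case-insensitive regular expression match.
  SELECT name FROM customers WHERE name IREGEXP '^jo(e|hn)';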
Performance and reliability improvements for DDL and insert operations on partitioned tables with a large number of partitions. Impala only re-evaluates metadata for partitions that are affected by a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress, other Impala statements that attempt to modify metadata for the same table wait until the first one finishes.
Reliability improvements for the LOAD DATA
statement.
Previously, this statement would fail if the source HDFS directory
contained any subdirectories at all. Now, the statement ignores
any hidden subdirectories, for example _impala_insert_staging.
A new operator, IS [NOT] DISTINCT FROM, lets you compare values and always get a true or false result, even if one or both of the values are NULL. The IS NOT DISTINCT FROM operator, or its equivalent <=> notation, improves the efficiency of join queries that treat key values that are NULL in both tables as equal. See IS DISTINCT FROM Operator for details.
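A quick sketch of the NULL-safe comparison:

  SELECT NULL = NULL;                    -- NULL
  SELECT NULL IS NOT DISTINCT FROM NULL; -- true
  SELECT NULL <=> NULL;                  -- true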
Security enhancements for the impala-shell command.
A new option, --ldap_password_cmd
, lets you specify
a command to retrieve the LDAP password. The resulting password is
then used to authenticate the impala-shell command
with the LDAP server.
See impala-shell Configuration Options for details.
The CREATE TABLE AS SELECT
statement now accepts a
PARTITIONED BY
clause, which lets you create a
partitioned table and insert data into it with a single statement.
See CREATE TABLE Statement for details.
User-defined functions (UDFs and UDAFs) written in C++ now persist automatically
when the catalogd daemon is restarted. You no longer
have to run the CREATE FUNCTION
statements again after a restart.
User-defined functions (UDFs) written in Java can now persist
when the catalogd daemon is restarted, and can be shared
transparently between Impala and Hive. You must do a one-time operation to recreate these
UDFs using new CREATE FUNCTION
syntax, without a signature for arguments
or the return value. Afterwards, you no longer have to run the CREATE FUNCTION
statements again after a restart.
Although Impala does not have visibility into the UDFs that implement the
Hive built-in functions, user-created Hive UDFs are now automatically available
for calling through Impala.
See CREATE FUNCTION Statement for details.
Reliability enhancements for memory management. Some aggregation and join queries that formerly might have failed with an out-of-memory error due to memory contention, now can succeed using the spill-to-disk mechanism.
The SHOW DATABASES
statement now returns two columns rather than one.
The second column includes the associated comment string, if any, for each database.
Adjust any application code that examines the list of databases and assumes the
result set contains only a single column.
See SHOW DATABASES for details.
A new optimization speeds up aggregation operations that involve only the partition key
columns of partitioned tables. For example, a query such as SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1
can avoid reading any data files if T1
is a partitioned table and K
is one of the partition key columns. Because this technique can produce different results in cases
where HDFS files in a partition are manually deleted or are empty, you must enable the optimization
by setting the query option OPTIMIZE_PARTITION_KEY_SCANS
.
See OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only) for details.
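As a sketch, using the t1 and k names from the example above:

  SET OPTIMIZE_PARTITION_KEY_SCANS=1;
  -- Can now be answered from partition metadata alone, without
  -- reading the data files.
  SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1;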
The DESCRIBE
statement can now display metadata about a database, using the
syntax DESCRIBE DATABASE db_name
.
See DESCRIBE Statement for details.
The uuid()
built-in function generates an
alphanumeric value that you can use as a guaranteed unique identifier.
The uniqueness applies even across tables, for cases where an ascending
numeric sequence is not suitable.
See Impala Miscellaneous Functions for details.
Impala can be used on the DSSD D5 Storage Appliance. From a user perspective, the Impala features are the same as in Impala 2.3.
The following are the major new features in Impala 2.3.x. This major release contains improvements to SQL syntax (particularly new support for complex types), performance, manageability, and security.
Complex data types: STRUCT
, ARRAY
, and MAP
. These
types can encode multiple named fields, positional items, or key-value pairs within a single column.
You can combine these types to produce nested types with arbitrarily deep nesting,
such as an ARRAY
of STRUCT
values,
a MAP
where each key-value pair is an ARRAY
of other MAP
values,
and so on. Currently, complex data types are only supported for the Parquet file format.
See Complex Types (Impala 2.3 or higher only) for usage details and ARRAY Complex Type (Impala 2.3 or higher only), STRUCT Complex Type (Impala 2.3 or higher only), and MAP Complex Type (Impala 2.3 or higher only) for syntax.
Column-level authorization lets you define access to particular columns within a table, rather than the entire table. This feature lets you reduce the reliance on creating views to set up authorization schemes for subsets of information. See the documentation for Apache Sentry for background details, and GRANT Statement (Impala 2.0 or higher only) and REVOKE Statement (Impala 2.0 or higher only) for Impala-specific syntax.
The TRUNCATE TABLE
statement removes all the data from a table without removing the table itself.
See TRUNCATE TABLE Statement (Impala 2.3 or higher only) for details.
Nested loop join queries. Some join queries that formerly required equality comparisons can now use
operators such as <
or >=
. This same join mechanism is used
internally to optimize queries that retrieve values from complex type columns.
See Joins in Impala SELECT Statements for details about Impala join queries.
Reduced memory usage and improved performance and robustness for spill-to-disk feature. See SQL Operations that Spill to Disk for details about this feature.
Performance improvements for querying Parquet data files containing multiple row groups and multiple data blocks:
For files written by Hive, SparkSQL, and other Parquet MR writers and spanning multiple HDFS blocks, Impala now scans the extra data blocks locally when possible, rather than using remote reads.
Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet
files written by Hive, MapReduce, and other components. (Impala itself never writes
multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.)
These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks.
The parquet.writer.max-padding
setting specifies the maximum number of bytes, by default
8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block
so that the next row group starts at the beginning of the next block.
If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space.
Include this setting in the hive-site configuration file to influence Parquet files written by Hive,
or the hdfs-site configuration file to influence Parquet files written by all non-Impala components.
See Using the Parquet File Format with Impala Tables for instructions about using Parquet data files with Impala.
Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions.
Math functions (see Impala Mathematical Functions for details):
ATAN2
COSH
COT
DCEIL
DEXP
DFLOOR
DLOG10
DPOW
DROUND
DSQRT
DTRUNC
FACTORIAL, and the corresponding ! operator
FPOW
RADIANS
RANDOM
SINH
TANH
String functions (see Impala String Functions for details):
BTRIM
CHR
REGEXP_LIKE
SPLIT_PART
Date and time functions (see Impala Date and Time Functions for details):
INT_MONTHS_BETWEEN
MONTHS_BETWEEN
TIMEOFDAY
TIMESTAMP_CMP
Bit manipulation functions (see Impala Bit Functions for details):
BITAND
BITNOT
BITOR
BITXOR
COUNTSET
GETBIT
ROTATELEFT
ROTATERIGHT
SETBIT
SHIFTLEFT
SHIFTRIGHT
Type conversion functions (see Impala Type Conversion Functions for details):
TYPEOF
The effective_user() function (see Impala Miscellaneous Functions for details).
New built-in analytic functions: PERCENT_RANK, NTILE, CUME_DIST. See Impala Analytic Functions for details.
The DROP DATABASE
statement now works for a non-empty database.
When you specify the optional CASCADE
clause, any tables in the
database are dropped before the database itself is removed.
See DROP DATABASE Statement for details.
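A short sketch of the clause (the database name is illustrative):

  -- Drops all tables and views in temp_db, then removes the database.
  DROP DATABASE temp_db CASCADE;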
The DROP TABLE
and ALTER TABLE DROP PARTITION
statements have a new optional keyword, PURGE
.
This keyword causes Impala to immediately remove the relevant HDFS data files rather than sending them to the HDFS trashcan.
This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of
a problem with the HDFS trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone
than the data files.
See DROP TABLE Statement and ALTER TABLE Statement for syntax.
The impala-shell command has a new feature for live progress reporting. This feature
is enabled through the --live_progress
and --live_summary
command-line options, or during a session through the LIVE_SUMMARY
and
LIVE_PROGRESS
query options.
See LIVE_PROGRESS Query Option (Impala 2.3 or higher only) and LIVE_SUMMARY Query Option (Impala 2.3 or higher only) for details.
The impala-shell command also now displays a random "tip of the day" when it starts.
The impala-shell option -f
now recognizes a special filename
-
to accept input from stdin.
See impala-shell Configuration Options for details about the options for running impala-shell in non-interactive mode.
Format strings for the unix_timestamp()
function can now include numeric timezone offsets.
See Impala Date and Time Functions for details.
Impala can now run a specified command to obtain the password to decrypt a private-key PEM file, rather than having the private-key file be unencrypted on disk. See Configuring TLS/SSL for Impala for details.
Impala components now can use SSL for more of their internal communication. SSL is used for
communication between all three Impala-related daemons when the configuration option
ssl_server_certificate
is enabled. SSL is used for communication with client
applications when the configuration option ssl_client_ca_certificate
is enabled.
See Configuring TLS/SSL for Impala for details.
Currently, you can only use one of server-to-server TLS/SSL encryption or Kerberos authentication. This limitation is tracked by the issue IMPALA-2598.
Improved flexibility for intermediate data types in user-defined aggregate functions (UDAFs). See Writing User-Defined Aggregate Functions (UDAFs) for details.
In Impala 2.3.2, the bug fix for IMPALA-2598 removes the restriction on using both Kerberos and SSL for internal communication between Impala components.
The following are the major new features in Impala 2.2. This release contains improvements to performance, manageability, security, and SQL syntax.
Several improvements to date and time features enable higher interoperability with Hive and other
database systems, provide more flexibility for handling time zones, and future-proof the handling of
TIMESTAMP
values:
The WITH REPLICATION
clause for the CREATE TABLE
and
ALTER TABLE
statements lets you control the replication factor for
HDFS caching for a specific table or partition. By default, each cached block is
only present on a single host, which can lead to CPU contention if the same host
processes each cached block. Increasing the replication factor lets Impala choose
different hosts to process different cached blocks, to better distribute the CPU load.
Startup flags for the impalad daemon enable a higher level of compatibility with
TIMESTAMP
values written by Hive, and more flexibility for working with date and
time data using the local time zone instead of UTC. To enable these features, set the
impalad startup flags
-use_local_tz_for_unix_timestamp_conversions=true
and
-convert_legacy_hive_parquet_utc_timestamps=true
.
The -use_local_tz_for_unix_timestamp_conversions
setting controls how the
unix_timestamp()
, from_unixtime()
, and now()
functions handle time zones. By default (when this setting is turned off), Impala considers all
TIMESTAMP
values to be in the UTC time zone when converting to or from Unix time
values. When this setting is enabled, Impala treats TIMESTAMP
values passed to or
returned from these functions to be in the local time zone. When this setting is enabled, take
particular care that all hosts in the cluster have the same timezone settings, to avoid
inconsistent results depending on which host reads or writes TIMESTAMP
data.
The -convert_legacy_hive_parquet_utc_timestamps
setting causes Impala to convert
TIMESTAMP
values to the local time zone when it reads them from Parquet files
written by Hive. This setting only applies to data using the Parquet file format, where Impala can
use metadata in the files to reliably determine that the files were written by Hive. If in the
future Hive changes the way it writes TIMESTAMP
data in Parquet, Impala will
automatically handle that new TIMESTAMP
encoding.
See TIMESTAMP Data Type for details about time zone handling and the configuration options for Impala / Hive compatibility with Parquet format.
In Impala 2.2.0 and higher, built-in functions that accept or return integers
representing TIMESTAMP
values use the BIGINT
type for
parameters and return values, rather than INT
. This change lets the
date and time functions avoid an overflow error that would otherwise occur on January
19th, 2038 (known as the
"Year
2038 problem" or "Y2K38 problem"). This change affects the
FROM_UNIXTIME()
and UNIX_TIMESTAMP()
functions. You
might need to change application code that interacts with these functions, change the
types of columns that store the return values, or add CAST()
calls to
SQL statements that call these functions.
See Impala Date and Time Functions for the current function signatures.
The SHOW FILES
statement lets you view the names and sizes of the files that make up
an entire table or a specific partition. See SHOW FILES Statement for details.
Impala can now run queries against Parquet data containing columns with complex or nested types, as long as the query only refers to columns with scalar types.
Performance improvements for queries that include IN()
operators and involve
partitioned tables.
The new -max_log_files
configuration option specifies how many log files to keep at
each severity level. The default value is 10, meaning that Impala preserves the latest 10 log files for
each severity level (INFO
, WARNING
, and ERROR
) for
each Impala-related daemon (impalad, statestored, and
catalogd). Impala checks to see if any old logs need to be removed based on the
interval specified in the logbufsecs
setting, every 5 seconds by default. See
Rotating Impala Logs for details.
Redaction of sensitive data from Impala log files. This feature protects details such as credit card numbers or tax IDs from administrators who see the text of SQL statements in the course of monitoring and troubleshooting a Hadoop cluster. See Redacting Sensitive Information from Impala Log Files for background information for Impala users, and the documentation for your Apache Hadoop distribution for usage details.
Lineage information is available for data created or queried by Impala. This feature lets you track who has accessed data through Impala SQL statements, down to the level of specific columns, and how data has been propagated between tables. See Viewing Lineage Information for Impala Data for background information for Impala users, the documentation for your Apache Hadoop distribution for usage details and how to interpret the lineage information.
Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem,
for convenience in cases where data is already located in S3 and you prefer to query it in-place.
Queries might have lower performance than when the data files reside on HDFS, because Impala uses some
HDFS-specific optimizations. Impala can query data in S3, but cannot write to S3. Therefore, statements
such as INSERT
and LOAD DATA
are not available when the destination
table or partition is in S3. See Using Impala with Amazon S3 Object Store for details.
Impala query support for Amazon S3 is included in Impala 2.2, but is not supported or recommended for production use in this version.
Improved support for HDFS encryption. The LOAD DATA
statement now works when the
source directory and destination table are in different encryption zones. See
the documentation for your Apache Hadoop distribution for details about using HDFS encryption with
Impala.
Additional arithmetic function mod()
. See
Impala Mathematical Functions for details.
Flexibility to interpret TIMESTAMP
values using the UTC time zone (the traditional
Impala behavior) or using the local time zone (for compatibility with TIMESTAMP
values
produced by Hive).
Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced
by these tools (filenames with suffixes .copying
and .tmp
).
The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now been relaxed.
The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward, Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer required. This relaxed requirement simplifies the upgrade planning from Impala 1.x releases, which also worked on SSSE3-enabled processors.
Enhanced support for CHAR
and VARCHAR
types in the COMPUTE
STATS
statement.
The amount of memory required during setup for "spill to disk" operations is greatly reduced. This enhancement reduces the chance of a memory-intensive join or aggregation query failing with an out-of-memory error.
Several new conditional functions provide enhanced compatibility when porting code that uses industry
extensions. The new functions are: isfalse()
, isnotfalse()
,
isnottrue()
, istrue()
, nonnullvalue()
, and
nullvalue()
. See Impala Conditional Functions
for details.
The Impala debug web UI can now display a visual representation of the query plan. On the /queries tab, select Details for a particular query. The Details page includes a Plan tab with a plan diagram that you can zoom in or out using mouse-wheel or trackpad scroll gestures.
This release contains the following enhancements to query performance and system scalability:
Impala can now collect statistics for individual partitions in a partitioned table, rather than
processing the entire table for each COMPUTE STATS
statement. This feature is known as
incremental statistics, and is controlled by the COMPUTE INCREMENTAL STATS
syntax.
(You can still use the original COMPUTE STATS
statement for nonpartitioned tables or
partitioned tables that are unchanging or whose contents are entirely replaced all at once.) See
COMPUTE STATS Statement and
Table and Column Statistics for details.
Optimization for small queries lets Impala process queries that process very few rows without the
unnecessary overhead of parallelizing and generating native code. Reducing this overhead lets Impala
clear small queries quickly, keeping YARN resources and admission control slots available for
data-intensive queries. The number of rows considered to be a "small" query is controlled by the
EXEC_SINGLE_NODE_ROWS_THRESHOLD
query option. See
EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (Impala 2.1 or higher only) for details.
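For example (the threshold value shown is arbitrary):
  -- Queries estimated to return at most 500 rows run on a single node,
  -- skipping the parallelization and code-generation overhead.
  SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;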
An enhancement to the statestore component lets it transmit heartbeat information independently of broadcasting metadata updates. This optimization improves reliability of health checking on large clusters with many tables and partitions.
The memory requirement for querying gzip-compressed text is reduced. Now Impala decompresses the data as it is read, rather than reading the entire gzipped file and decompressing it in memory.
The following are the major new features in Impala 2.0. This major release contains improvements to performance, scalability, security, and SQL syntax.
Queries with joins or aggregation functions involving high volumes of data can now use temporary work
areas on disk, reducing the chance of failure due to out-of-memory errors. When the required memory for
the intermediate result set exceeds the amount available on a particular node, the query automatically
uses a temporary work area on disk. This "spill to disk" mechanism is similar to the ORDER
BY
improvement from Impala 1.4. For details, see
SQL Operations that Spill to Disk.
Subquery enhancements:
Subqueries are now allowed in the WHERE clause, for example with the IN operator.
The EXISTS and NOT EXISTS operators are available. They are always used in conjunction with subqueries.
The IN and NOT IN operators can now operate on the result set from a subquery, not just a hardcoded list of values.
Uncorrelated subqueries let you compare against one or more values for equality, IN, and EXISTS comparisons. For example, you might use WHERE clauses such as WHERE column = (SELECT MAX(some_other_column) FROM table) or WHERE column IN (SELECT some_other_column FROM table WHERE conditions).
Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(), COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.
For details about subqueries, see Subqueries in Impala SELECT Statements. For information about new and improved operators, see EXISTS Operator and IN Operator.
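A minimal sketch of the new subquery forms (table and column names are hypothetical):
  -- EXISTS with a correlated subquery:
  SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
  -- IN operating on a subquery result set instead of a literal list:
  SELECT name FROM products
    WHERE id IN (SELECT product_id FROM order_items);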
Analytic functions such as RANK()
, LAG()
, LEAD()
,
and FIRST_VALUE()
let you analyze sequences of rows with flexible ordering and
grouping. Existing aggregate functions such as MAX()
, SUM()
, and
COUNT()
can also be used in an analytic context. See
Impala Analytic Functions for details. See
Impala Aggregate Functions for enhancements to existing
aggregate functions.
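A short sketch of an analytic query (the schema is hypothetical):
  -- Rank employees by salary within each department:
  SELECT name, dept, salary,
         RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
    FROM employees;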
New data types provide greater compatibility with source code from traditional database systems:
VARCHAR
is like the STRING
data type, but with a maximum length.
See VARCHAR Data Type (Impala 2.0 or higher only) for details.
CHAR
is like the STRING
data type, but with a precise length. Short
values are padded with spaces on the right. See CHAR Data Type (Impala 2.0 or higher only) for details.
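For example (hypothetical table):
  -- CHAR values shorter than 2 characters are padded with trailing spaces;
  -- VARCHAR enforces a maximum length of 50.
  CREATE TABLE contacts (country_code CHAR(2), name VARCHAR(50));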
GRANT
statement. See GRANT Statement (Impala 2.0 or higher only) for details.
REVOKE
statement. See REVOKE Statement (Impala 2.0 or higher only) for details.
CREATE ROLE
statement. See CREATE ROLE Statement (Impala 2.0 or higher only) for
details.
DROP ROLE
statement. See DROP ROLE Statement (Impala 2.0 or higher only) for
details.
SHOW ROLES
and SHOW ROLE GRANT
statements. See
SHOW Statement for details.
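A sketch of the new statements used together (the role and group names are hypothetical):
  CREATE ROLE analyst;
  GRANT ROLE analyst TO GROUP analysts;
  GRANT SELECT ON DATABASE sales TO ROLE analyst;
  SHOW ROLE GRANT GROUP analysts;
  REVOKE SELECT ON DATABASE sales FROM ROLE analyst;
  DROP ROLE analyst;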
To complement the HDFS encryption feature, a new Impala configuration option,
--disk_spill_encryption
secures sensitive data from being observed or tampered
with when temporarily stored on disk.
The new security-related SQL statements work along with the Sentry authorization framework. See Enabling Sentry Authorization for Impala for details.
Impala can now read compressed text files compressed by gzip, bzip, or Snappy. These files do not
require any special table settings to work in an Impala text table. Impala recognizes the compression
type automatically based on file extensions of .gz
, .bz2
, and
.snappy
respectively. These types of compressed text files are intended for
convenience with existing ETL pipelines. Their non-splittable nature means they are not optimal for
high-performance parallel queries. See Using bzip2, deflate, gzip, Snappy, or zstd Text Files for details.
Query hints can now use comment notation, /* +hint_name */
or
-- +hint_name
, at the same places in the query where the hints
enclosed by [ ]
are recognized. This enhancement makes it easier to reuse Impala
queries on other database systems. See Optimizer Hints for details.
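For instance, the same hint in the three accepted notations (table names hypothetical):
  SELECT * FROM t1 JOIN [BROADCAST] t2 ON t1.id = t2.id;
  SELECT * FROM t1 JOIN /* +BROADCAST */ t2 ON t1.id = t2.id;
  SELECT * FROM t1 JOIN -- +BROADCAST
    t2 ON t1.id = t2.id;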
A new query option, QUERY_TIMEOUT_S
, lets you specify a timeout period in seconds for
individual queries.
The working of the --idle_query_timeout configuration option is extended. If no QUERY_TIMEOUT_S query option is in effect, --idle_query_timeout works the same as before, setting the timeout interval. When the QUERY_TIMEOUT_S query option is specified, its maximum value is capped by the value of the --idle_query_timeout option.
That is, the system administrator sets the default and maximum timeout through the
--idle_query_timeout
startup option, and then individual users or applications can set
a lower timeout value if desired through the QUERY_TIMEOUT_S
query option. See
Setting Timeout Periods for Daemons, Queries, and Sessions and
QUERY_TIMEOUT_S Query Option (Impala 2.0 or higher only) for details.
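For example (the value shown is arbitrary):
  -- Cancel queries in this session after they are idle for 60 seconds,
  -- subject to any ceiling set through --idle_query_timeout.
  SET QUERY_TIMEOUT_S=60;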
New functions VAR_SAMP()
and VAR_POP()
are aliases for the existing
VARIANCE_SAMP()
and VARIANCE_POP()
functions.
A new date and time function, DATE_PART()
, provides similar functionality to
EXTRACT()
. You can also call the EXTRACT()
function using the SQL-99
syntax, EXTRACT(unit FROM timestamp)
. These
enhancements simplify the porting process for date-related code from other systems. See
Impala Date and Time Functions for details.
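The equivalent forms, side by side:
  SELECT extract(now(), 'year');      -- original Impala syntax
  SELECT extract(YEAR FROM now());    -- SQL-99 syntax
  SELECT date_part('year', now());    -- DATE_PART() equivalent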
New approximation features provide a fast way to get results when absolute precision is not required:
The APPX_COUNT_DISTINCT
query option lets Impala rewrite
COUNT(DISTINCT)
calls to use NDV()
instead, which speeds up the
operation and allows multiple COUNT(DISTINCT)
operations in a single query. See
APPX_COUNT_DISTINCT Query Option (Impala 2.0 or higher only) for details.
The APPX_MEDIAN()
aggregate function produces an estimate for the median value of a
column by using sampling. See APPX_MEDIAN Function for details.
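A brief sketch (the table and columns are hypothetical):
  SET APPX_COUNT_DISTINCT=true;
  -- Rewritten internally to NDV(), so several DISTINCT aggregates
  -- can appear in one query:
  SELECT COUNT(DISTINCT customer_id), COUNT(DISTINCT product_id) FROM orders;
  SELECT APPX_MEDIAN(total) FROM orders;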
Impala now supports a DECODE()
function. This function works as a shorthand for a
CASE()
expression, and improves compatibility with SQL code containing vendor
extensions. See Impala Conditional Functions for details.
The STDDEV()
, STDDEV_POP()
, STDDEV_SAMP()
,
VARIANCE()
, VARIANCE_POP()
, VARIANCE_SAMP()
, and
NDV()
aggregate functions now all return DOUBLE
results rather than
STRING
. Formerly, you were required to CAST()
the result to a numeric
type before using it in arithmetic operations.
The default settings for Parquet block size, and the associated PARQUET_FILE_SIZE
query option, are changed. Now, Impala writes Parquet files with a size of 256 MB and an HDFS block
size of 256 MB. Previously, Impala attempted to write Parquet files with a size of 1 GB and an HDFS
block size of 1 GB. In practice, Impala used a conservative estimate of the disk space needed for each
Parquet block, leading to files that were typically 512 MB anyway. Thus, this change will make the file
size more accurate if you specify a value for the PARQUET_FILE_SIZE
query option. It
also reduces the amount of memory reserved during INSERT
into Parquet tables,
potentially avoiding out-of-memory errors and improving scalability when inserting data into Parquet
tables.
Anti-joins are now supported, expressed using the LEFT ANTI JOIN
and RIGHT
ANTI JOIN
clauses.
These clauses return results from one table that have no match in the other table. You might use this
type of join in the same sorts of use cases as the NOT EXISTS
and NOT
IN
operators. See Joins in Impala SELECT Statements for details.
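For example (hypothetical schema), finding customers with no orders:
  SELECT c.id, c.name
    FROM customers c LEFT ANTI JOIN orders o ON c.id = o.customer_id;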
The SET
command in impala-shell has been promoted to a real SQL
statement. You can now set query options such as PARQUET_FILE_SIZE
,
MEM_LIMIT
, and SYNC_DDL
within JDBC, ODBC, or any other kind of
application that submits SQL without going through the impala-shell interpreter. See
SET Statement for details.
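For instance, through any SQL channel (the values shown are arbitrary):
  SET PARQUET_FILE_SIZE=256m;
  SET MEM_LIMIT=2gb;
  SET SYNC_DDL=true;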
The impala-shell interpreter now reads settings from an optional configuration file, named $HOME/.impalarc by default. See impala-shell Configuration File for details.
The library used for regular expression parsing has changed from Boost to Google RE2. This
implementation change adds support for non-greedy matches using the .*?
notation. This
and other changes in the way regular expressions are interpreted means you might need to re-test
queries that use functions such as regexp_extract()
or
regexp_replace()
, or operators such as REGEXP
or
RLIKE
. See Incompatible Changes and Limitations in Apache Impala for
those details.
The following are the major new features in Impala 1.4:
The DECIMAL
data type lets you store fixed-precision values, for working with currency
or other fractional values where it is important to represent values exactly and avoid rounding errors.
This feature includes enhancements to built-in functions, numeric literals, and arithmetic expressions.
See DECIMAL Data Type (Impala 3.0 or higher only) for details.
Where the underlying HDFS support exists, Impala can take advantage of the HDFS caching feature to "pin" entire tables or individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU overhead of memory-to-memory copying. When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory. Other Hadoop components that read the same data files also experience a performance benefit.
For background information about HDFS caching, see
the documentation for your Apache Hadoop distribution. For performance information about using this feature with Impala, see
Using HDFS Caching with Impala (Impala 2.1 or higher only). For the SET CACHED
and
SET UNCACHED
clauses that let you control cached table data through DDL statements,
see CREATE TABLE Statement and
ALTER TABLE Statement.
Impala can now use Sentry-based authorization based either on the original policy file, or on rules
defined by GRANT
and REVOKE
statements issued through Hive.
See Enabling Sentry Authorization for Impala for details.
For interoperability with Parquet files created through other Hadoop components, such as Pig or MapReduce jobs, you can create an Impala table that automatically sets up the column definitions based on the layout of an existing Parquet data file. See CREATE TABLE Statement for the syntax, and Creating Parquet Tables in Impala for usage information.
ORDER BY
queries no longer require a LIMIT
clause. If the size of the
result set to be sorted exceeds the memory available to Impala, Impala uses a temporary work space on
disk to perform the sort operation. See ORDER BY Clause
for details.
LDAP connections can be secured through either SSL or TLS. See Enabling LDAP Authentication for Impala for details.
The following new built-in scalar and aggregate functions are available:
A new built-in function, EXTRACT()
, returns one date or time field from a
TIMESTAMP
value. See
Impala Date and Time Functions for details.
A new built-in function, TRUNC()
, truncates date/time values to a particular
granularity, such as year, month, day, hour, and so on. See
Impala Date and Time Functions for details.
A new built-in function, ADD_MONTHS(), is an alias for the existing
MONTHS_ADD()
function. See
Impala Date and Time Functions for details.
A new built-in function, ROUND()
, rounds DECIMAL
values to a
specified number of fractional digits. See
Impala Mathematical Functions for details.
Several built-in aggregate functions for computing properties for statistical distributions:
STDDEV()
, STDDEV_SAMP()
, STDDEV_POP()
,
VARIANCE()
, VARIANCE_SAMP()
, and VARIANCE_POP()
.
See STDDEV, STDDEV_SAMP, STDDEV_POP Functions and
VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP Functions for details.
Several new built-in functions, such as MAX_INT()
,
MIN_SMALLINT()
, and so on, let you conveniently check whether data values are in
an expected range. You might be able to switch a column to a smaller type, saving memory during
processing. See Impala Mathematical Functions for
details.
New built-in functions, IS_INF()
and IS_NAN()
, check for the
special values infinity and "not a number". These values could be specified as
inf
or nan
in text data files, or be produced by certain
arithmetic expressions. See
Impala Mathematical Functions for details.
The SHOW PARTITIONS
statement displays information about the structure of a
partitioned table. See SHOW Statement for details.
New configuration options for the impalad daemon let you specify initial memory usage for all queries. The initial resource requests handled by Llama and YARN can be expanded later if needed, avoiding unnecessary over-allocation and reducing the chance of out-of-memory conditions. See Resource Management for details.
The CREATE TABLE
statement now has a STORED AS AVRO
clause,
allowing you to create Avro tables through Impala. See
Using the Avro File Format with Impala Tables for details and examples.
New impalad configuration options let you fine-tune the calculations Impala makes to estimate resource requirements for each query. These options can help avoid problems due to overconsumption due to too-low estimates, or underutilization due to too-high estimates. See Resource Management for details.
A new SUMMARY
command in the impala-shell interpreter provides a
high-level summary of the work performed at each stage of the explain plan. The summary is also
included in output from the PROFILE
command. See
impala-shell Command Reference and
Using the SUMMARY Report for Performance Tuning for details.
Performance improvements for the COMPUTE STATS
statement:
The NDV function is sped up through native code generation.
Because the NULL count is not currently used by the Impala query planner, in Impala 1.4.0 and higher, COMPUTE STATS does not count the NULL values for each column. (The #Nulls field of the stats table is left as -1, signifying that the value is unknown.)
See COMPUTE STATS Statement for general details about the COMPUTE
STATS
statement, and Table and Column Statistics for how to use the
statistics to improve query performance.
Performance improvements for partition pruning. This feature reduces the time spent in query planning, for partitioned tables with thousands of partitions. Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, now Impala can comfortably handle tables with tens of thousands of partitions. See Partition Pruning for Queries for information about partition pruning.
The documentation provides additional guidance for planning tasks. See Planning for Impala Deployment.
The impala-shell interpreter now supports UTF-8 characters for input and output. You
can control whether impala-shell ignores invalid Unicode code points through the
--strict_unicode
option. (This option was removed in Impala 2.0.)
No new features. This point release is exclusively a bug fix release for the IMPALA-1019 issue related to HDFS caching.
This point release is primarily a vehicle to deliver bug fixes. Any new features are minor changes resulting from fixes for performance, reliability, or usability issues.
A new impalad startup option, --insert_inherit_permissions
, causes
Impala INSERT
statements to create each new partition with the same HDFS permissions
as its parent directory. By default, INSERT
statements create directories for new
partitions using default HDFS permissions. See INSERT Statement for examples of
INSERT
statements for partitioned tables.
The SHOW FUNCTIONS
statement now displays the return type of each function, in
addition to the types of its arguments. See SHOW Statement for examples.
You can now specify the clause FIELDS TERMINATED BY '\0'
with a CREATE
TABLE
statement to use text data files that use ASCII 0 (nul
) characters as a
delimiter. See Using Text Data Files with Impala Tables for details.
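A minimal sketch (the table is hypothetical):
  CREATE TABLE nul_delimited (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
    STORED AS TEXTFILE;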
In Impala 1.3.1 and higher, the REGEXP
and RLIKE
operators now match a regular expression string that occurs anywhere inside the target
string, the same as if the regular expression was enclosed on each side by
.*
. See REGEXP Operator for
examples. Previously, these operators only succeeded when the regular expression matched
the entire target string. This change improves compatibility with the regular expression
support for popular database systems. There is no change to the behavior of the
regexp_extract()
and regexp_replace()
built-in
functions.
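For example:
  SELECT 'English' REGEXP 'glis';        -- now true: substring match
  SELECT 'English' REGEXP '^English$';   -- anchors restore whole-string matching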
The admission control feature lets you control and prioritize the volume and resource consumption of concurrent queries. This mechanism reduces spikes in resource usage, helping Impala to run alongside other kinds of workloads on a busy cluster. It also provides more user-friendly conflict resolution when multiple memory-intensive queries are submitted concurrently, avoiding resource contention that formerly resulted in out-of-memory errors. See Admission Control and Query Queuing for details.
Enhanced EXPLAIN
plans provide more detail in an easier-to-read format. Now there are
four levels of verbosity: the EXPLAIN_LEVEL
option can be set from 0 (most concise) to
3 (most verbose). See EXPLAIN Statement for syntax and
Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles for usage information.
The TIMESTAMP
data type accepts more kinds of input string formats through the
UNIX_TIMESTAMP
function, and produces more varieties of string formats through the
FROM_UNIXTIME
function. The documentation now also lists more functions for date
arithmetic, used for adding and subtracting INTERVAL
expressions from
TIMESTAMP
values. See Impala Date and Time Functions
for details.
New conditional functions, NULLIF()
, NULLIFZERO()
, and
ZEROIFNULL()
, simplify porting SQL containing vendor extensions to Impala. See
Impala Conditional Functions for details.
New utility function, CURRENT_DATABASE()
. See
Impala Miscellaneous Functions for details.
Integration with the YARN resource management framework. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for full details.
On the Impala side, this feature involves some new startup options for the impalad daemon:
-enable_rm
-llama_host
-llama_port
-llama_callback_port
-cgroup_hierarchy_path
For details of these startup options, see Modifying Impala Startup Options.
This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:
MEM_LIMIT
: the function of this existing option changes when Impala resource
management is enabled.
REQUEST_POOL
: a new option. (Renamed to RESOURCE_POOL
in Impala
1.3.0.)
V_CPU_CORES
: a new option.
RESERVATION_REQUEST_TIMEOUT
: a new option.
For details of these query options, see Resource Management.
On Impala startup, the metadata loading and synchronization mechanism has been improved and optimized, to give more responsiveness when starting Impala on a system with a large number of databases, tables, or partitions. The initial metadata loading happens in the background, allowing queries to be run before the entire process is finished. When a query refers to a table whose metadata is not yet loaded, the query waits until the metadata for that table is loaded, and the load operation for that table is prioritized to happen first.
Formerly, if you created a new table in Hive, you had to issue the INVALIDATE METADATA
statement (with no table name) which was an expensive operation that reloaded metadata for all tables.
Impala did not recognize the name of the Hive-created table, so you could not do INVALIDATE
METADATA new_table
to get the metadata for just that one table. Now, when
you issue INVALIDATE METADATA table_name
, Impala checks to see if
that name represents a table created in Hive, and if so recognizes the new table and loads the metadata
for it. Additionally, if the new table is in a database that was newly created in Hive, Impala also
recognizes the new database.
If you issue INVALIDATE METADATA table_name
and the table has been
dropped through Hive, Impala will recognize that the table no longer exists.
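For example, after creating a table through the Hive shell (table name hypothetical):
  INVALIDATE METADATA new_hive_table;
  SELECT COUNT(*) FROM new_hive_table;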
New startup options let you control the parallelism of the metadata loading during startup for the catalogd daemon:
--load_catalog_in_background
makes Impala load and cache metadata using background
threads after startup. It is true
by default. Previously, a system with a large
number of databases, tables, or partitions could be unresponsive or even time out during startup.
--num_metadata_loading_threads
determines how much parallelism Impala devotes to
loading metadata in the background. The default is 16. You might increase this value for systems
with huge numbers of databases, tables, or partitions. You might lower this value for busy systems
that are CPU-constrained due to jobs from components other than Impala.
Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional fix for compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or MapReduce. If you are upgrading from Impala 1.2.1 or earlier, see New Features in Impala 1.2.2 for the latest added features.
Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over 1.2.1 are performance related, primarily for join queries.
New user-visible features include:
Join order optimizations. This highly valuable feature automatically distributes and parallelizes the
work for a join query to minimize disk I/O and network traffic. The automatic optimization reduces the
need to use query hints or to rewrite join queries with the tables in a specific order based on size or
cardinality. The new COMPUTE STATS
statement gathers statistical information about
each table that is crucial for enabling the join optimizations. See
Performance Considerations for Join Queries for details.
A new COMPUTE STATS statement collects both table statistics and column statistics with a single statement. It is intended to be more comprehensive, efficient, and reliable than the corresponding
Hive ANALYZE TABLE
statement, which collects statistics in multiple phases through
MapReduce jobs. These statistics are important for query planning for join queries, queries on
partitioned tables, and other types of data-intensive operations. For optimal planning of join queries,
you need to collect statistics for each table involved in the join. See
COMPUTE STATS Statement for details.
Reordering of tables in a join query can be overridden by the STRAIGHT_JOIN
operator,
allowing you to fine-tune the planning of the join query if necessary, by using the original technique
of ordering the joined tables in descending order of size. See
Overriding Join Reordering with STRAIGHT_JOIN for details.
The new CROSS JOIN clause in the SELECT statement allows Cartesian products in queries, that is, joins without an equality comparison between columns in both tables.
Because such queries must be carefully checked to avoid accidental overconsumption of memory, you must
use the CROSS JOIN
operator to explicitly select this kind of join. See
Cross Joins and Cartesian Products with the CROSS JOIN Operator for examples.
The ALTER TABLE
statement has new clauses that let you fine-tune table statistics. You
can use this technique as a less-expensive way to update specific statistics, in case the statistics
become stale, or to experiment with the effects of different data distributions on query planning.
LDAP username/password authentication in JDBC/ODBC. See Enabling LDAP Authentication for Impala for details.
A new GROUP_CONCAT() aggregate function concatenates column values across all rows of a result set.
The INSERT
statement now accepts hints, [SHUFFLE]
and
[NOSHUFFLE]
, to influence the way work is redistributed during
INSERT...SELECT
operations. The hints are primarily useful for inserting into
partitioned Parquet tables, where using the [SHUFFLE]
hint can avoid problems due to
memory consumption and simultaneous open files in HDFS, by collecting all the new data for each
partition on a specific node.
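A sketch of the hint placement (tables hypothetical; the partition key column comes last in the SELECT list):
  INSERT INTO sales_parquet PARTITION (year) [SHUFFLE]
    SELECT id, amount, year FROM sales_staging;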
Several built-in functions and operators are now overloaded for more numeric data types, to reduce the
requirement to use CAST()
for type coercion in INSERT
statements. For
example, the expression 2+2
in an INSERT
statement formerly produced
a BIGINT
result, requiring a CAST()
to be stored in an
INT
variable. Now, addition, subtraction, and multiplication only produce a result
that is one step "bigger" than their arguments, and numeric and conditional functions can return
SMALLINT
, FLOAT
, and other smaller types rather than always
BIGINT
or DOUBLE
.
New fnv_hash()
built-in function for constructing hashed values. See
Impala Mathematical Functions for details.
The clause STORED AS PARQUET
is accepted as an equivalent for STORED AS
PARQUETFILE
. This more concise form is recommended for new code.
Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older
1.1.x release straight to 1.2.2, also review New Features in Impala 1.2.1 to see
features such as the SHOW TABLE STATS
and SHOW COLUMN STATS
statements,
and user-defined functions (UDFs).
Impala 1.2.1 includes new features for security, performance, and flexibility.
New user-visible features include:
SHOW TABLE STATS table_name
and SHOW COLUMN STATS
table_name
statements, to verify that statistics are available and to see
the values used during query planning.
CREATE TABLE AS SELECT
syntax, to create a new table and transfer data into it in a
single operation.
OFFSET
clause, for use with the ORDER BY
and LIMIT
clauses to produce "paged" result sets such as items 1-10, then 11-20, and so on.
NULLS FIRST
and NULLS LAST
clauses to ensure consistent placement of
NULL
values in ORDER BY
queries.
New built-in functions: least()
,
greatest()
, initcap()
.
New aggregate function: ndv()
, a fast alternative to COUNT(DISTINCT
col)
returning an approximate result.
The LIMIT
clause can now accept a numeric expression as an argument, rather than only
a literal constant.
The SHOW CREATE TABLE
statement displays the end result of all the CREATE
TABLE
and ALTER TABLE
statements for a particular table. You can use the
output to produce a simplified setup script for a schema.
The --idle_query_timeout
and --idle_session_timeout
options for
impalad control the time intervals after which idle queries are cancelled, and idle
sessions expire. See Setting Timeout Periods for Daemons, Queries, and Sessions for details.
User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
You create UDFs through the CREATE FUNCTION
statement and drop them through the
DROP FUNCTION
statement. See User-Defined Functions (UDFs) for instructions about
coding, building, and deploying UDFs, and CREATE FUNCTION Statement and
DROP FUNCTION Statement for related SQL syntax.
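A hedged sketch of the lifecycle for a C++ scalar UDF (the library path, symbol, and table name are hypothetical):
  CREATE FUNCTION my_lower(STRING) RETURNS STRING
    LOCATION '/user/impala/udfs/libudfsamples.so' SYMBOL='MyLower';
  SELECT my_lower(name) FROM t1;
  DROP FUNCTION my_lower(STRING);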
A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the INVALIDATE METADATA
and
REFRESH
statements after issuing Impala statements such as CREATE
TABLE
, ALTER TABLE
, DROP TABLE
, INSERT
, and
LOAD DATA
.
For even more precise synchronization, you can enable the
SYNC_DDL
query option before issuing
a DDL, INSERT
, or LOAD DATA
statement. This option causes the
statement to wait, returning only after the catalog service has broadcast the applicable changes to all
Impala nodes in the cluster.
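For example (table name hypothetical):
  SET SYNC_DDL=true;
  -- Returns only after every Impala node has received the new metadata:
  CREATE TABLE t2 (id INT);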
Because the catalog service only monitors operations performed through Impala, INVALIDATE
METADATA
and REFRESH
are still needed on the Impala side after creating new
tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because
the catalog service broadcasts the result of the REFRESH
and INVALIDATE
METADATA
statements to all Impala nodes, when you do need to use those statements, you can
do so a single time rather than on every Impala node.
This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
The CREATE TABLE
and ALTER TABLE
statements have new clauses
TBLPROPERTIES
and WITH SERDEPROPERTIES
. The
TBLPROPERTIES
clause lets you associate arbitrary items of metadata with a particular
table as key-value pairs. The WITH SERDEPROPERTIES
clause lets you specify the
serializer/deserializer (SerDes) classes that read and write data for a table; although Impala does not
make use of these properties, sometimes particular values are needed for Hive compatibility. See
CREATE TABLE Statement and
ALTER TABLE Statement for details.
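A minimal sketch (the keys and values are hypothetical; Impala stores them but does not interpret them):
  CREATE TABLE events (id INT, payload STRING)
    TBLPROPERTIES ('created_by'='nightly_etl');
  ALTER TABLE events SET TBLPROPERTIES ('validated'='true');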
Delegation support lets you authorize certain OS users associated with applications (for example,
hue
), to submit requests using the credentials of other users.
See Configuring Impala Delegation for Clients for details.
Enhancements to EXPLAIN
output. In particular, when you enable the new
EXPLAIN_LEVEL
query option, the EXPLAIN
and PROFILE
statements produce more verbose output showing estimated resource requirements and whether table and
column statistics are available for the applicable tables and columns. See
EXPLAIN Statement for details.
SHOW CREATE TABLE
summarizes the effects of the original CREATE TABLE
statement and any subsequent ALTER TABLE
statements, giving you a CREATE
TABLE
statement that will re-create the current structure and layout for a table.
The LIMIT
clause for queries now accepts an arithmetic expression, in addition to
numeric literals.
The Impala 1.2.0 beta includes new features for security, performance, and flexibility.
New user-visible features include:
User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
You create UDFs through the CREATE FUNCTION
statement and drop them through the
DROP FUNCTION
statement. See User-Defined Functions (UDFs) for instructions about
coding, building, and deploying UDFs, and CREATE FUNCTION Statement and
DROP FUNCTION Statement for related SQL syntax.
A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the INVALIDATE METADATA
and
REFRESH
statements after issuing Impala statements such as CREATE
TABLE
, ALTER TABLE
, DROP TABLE
, INSERT
, and
LOAD DATA
.
Because this service only monitors operations performed through Impala, INVALIDATE
METADATA
and REFRESH
are still needed on the Impala side after creating new
tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because
the catalog service broadcasts the result of the REFRESH
and INVALIDATE
METADATA
statements to all Impala nodes, when you do need to use those statements, you can
do so a single time rather than on every Impala node.
This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
Integration with the YARN resource management framework. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for full details.
On the Impala side, this feature involves some new startup options for the impalad daemon:
-enable_rm
-llama_host
-llama_port
-llama_callback_port
-cgroup_hierarchy_path
For details of these startup options, see Modifying Impala Startup Options.
This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:
MEM_LIMIT
: the function of this existing option changes when Impala resource
management is enabled.
YARN_POOL
: a new option. (Renamed to RESOURCE_POOL
in Impala
1.3.0.)
V_CPU_CORES
: a new option.
RESERVATION_REQUEST_TIMEOUT
: a new option.
For details of these query options, see Resource Management.
CREATE TABLE ... AS SELECT
syntax, to create a table and copy data into it in a single
operation. See CREATE TABLE Statement for details.
The CREATE TABLE
and ALTER TABLE
statements have a new
TBLPROPERTIES
clause that lets you associate arbitrary items of metadata with a
particular table as key-value pairs. See CREATE TABLE Statement and
ALTER TABLE Statement for details.
Delegation support lets you authorize certain OS users associated with applications (for example,
hue
), to submit requests using the credentials of other users.
See Configuring Impala Delegation for Clients for details.
Enhancements to EXPLAIN
output. In particular, when you enable the new
EXPLAIN_LEVEL
query option, the EXPLAIN
and PROFILE
statements produce more verbose output showing estimated resource requirements and whether table and
column statistics are available for the applicable tables and columns. See
EXPLAIN Statement for details.
Impala 1.1.1 includes new features for security and stability.
New user-visible features include:
Impala 1.1 includes new features for security, performance, and usability.
New user-visible features include:
A new WITH
clause for SELECT
statements lets you simplify complicated
queries in a way similar to creating a view. The effects of the WITH
clause only last
for the duration of one query, unlike views, which are persistent schema objects that can be used by
multiple sessions or applications. See WITH Clause.
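For example (hypothetical tables):
  WITH big_orders AS (SELECT * FROM orders WHERE total > 1000)
  SELECT customer_id, COUNT(*) FROM big_orders GROUP BY customer_id;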
A new variation of the DESCRIBE
statement, DESCRIBE FORMATTED
table_name
, displays more detailed information about the table. This
information includes the file format, location, delimiter, ownership, external or internal, creation and
access times, and partitions. The information is returned as a result set that can be interpreted and
used by a management or monitoring application. See DESCRIBE Statement.
The INSERT statement can now specify a subset of the columns in the destination table, with the remaining columns set to NULL values. Or you can specify the columns in any order in the destination table, rather than having to match the order of the corresponding columns in the source table or VALUES clause. This feature is known as "column permutation". See INSERT Statement.
The new LOAD DATA
statement lets you load data into a table directly from an HDFS data
file. This technique lets you minimize the number of steps in your ETL process, and provides more
flexibility. For example, you can bring data into an Impala table in one step. Formerly, you might have
created an external table where the data files are not entirely under your control, or copied the data
files to Impala data directories manually, or loaded the original data into one table and then used the
INSERT
statement to copy it to a new table with a different file format, partitioning
scheme, and so on. See LOAD DATA Statement.
New query options for queries against HBase tables: HBASE_CACHE_BLOCKS and HBASE_CACHING.
You can now issue REFRESH as a SQL statement through any of the programming interfaces that
Impala supports. REFRESH
formerly had to be issued as a command through the
impala-shell interpreter, and was not available through a JDBC or ODBC API call. As
part of this change, the functionality of the REFRESH
statement is divided between two
statements. In Impala 1.1, REFRESH
requires a table name argument and immediately
reloads the metadata; the new INVALIDATE METADATA
statement works the same as the Impala
1.0 REFRESH
did: the table name argument is optional, and the metadata for one or all
tables is marked as stale, but not actually reloaded until the table is queried. When you create a new
table in the Hive shell or through a different Impala node, you must enter INVALIDATE
METADATA
with no table parameter before you can see the new table in
impala-shell. See REFRESH Statement and
INVALIDATE METADATA Statement.
New user-visible features include:
A new VALUES
clause lets you INSERT
one or more rows using literals,
function return values, or other expressions. For performance and scalability, you should still use
INSERT ... SELECT
for bringing large quantities of data into an Impala table. The
VALUES
clause is a convenient way to set up small tables, particularly for initial
testing of SQL features that do not require large amounts of data. See
VALUES Clause for details.
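For example (the table is hypothetical):
  CREATE TABLE test_data (id INT, s STRING);
  -- Rows can mix literals and function return values:
  INSERT INTO test_data VALUES (1, 'a'), (2, 'b'), (3, upper('c'));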
The -B and -o options of the impala-shell command can
turn query results into delimited text files and store them in an output file. The plain text results are
useful for using with other Hadoop components or Unix tools. In benchmark tests, it is also faster to
produce plain rather than pretty-printed results, and write to a file rather than to the screen, giving a
more accurate picture of the actual query time.
This version has multiple performance improvements and adds the following functionality:
Support for the ALTER TABLE statement.
Support for issuing REFRESH for a single table.
This version has multiple performance improvements and adds the following functionality:
JDBC support. To connect through JDBC, you set the CLASSPATH on the client to include the JDBC driver JARs.
The debug web server's document root is now ${IMPALA_HOME}/www. This can be disabled by setting --enable_webserver_doc_root=false on the command line. As a result, Impala now uses the Twitter Bootstrap library to style its debug webpages, and the /queries page now tracks the last 25 queries run by each Impala daemon.
The state-store-service binary has been renamed statestored.
The Impala configuration files have moved from the /usr/lib/impala/conf directory to the /etc/impala/conf directory.
Default query options can now be specified as a startup argument to impalad. The format is:
-default_query_options='key=value;key=value'