This release of Impala contains the following changes and enhancements from previous releases.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the release notes or changelog for Impala 4.0.
The following sections describe the noteworthy improvements made in Impala 3.4.
For the full list of issues closed in this release, see the changelog for Impala 3.4.
Impala added support for truncating insert-only transactional tables.
By default, Impala creates an insert-only transactional table when you issue the CREATE TABLE statement.
Use Hive compaction to compact small files, which improves the performance and scalability of metadata in transactional tables.
See Impala Transactions for more information.
You can use the SPOOL_QUERY_RESULTS query option to control how query results are returned to the client.
By default, Impala produces query results in batches as the client fetches them, until all the result rows have been returned. If a client issues a query without fetching all the results, the query fragments continue to hold on to their resources until the query is canceled and unregistered, potentially tying up resources and causing other queries to wait in admission control.
When the query result spooling feature is enabled, the result sets of queries are eagerly fetched and buffered until they are read by the client, and resources are freed up for other queries.
See Spooling Impala Query Results for the new feature and the query options.
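For example, a minimal sketch of enabling result spooling for a session (the table name big_table is illustrative):

  -- Buffer results on the coordinator until the client reads them,
  -- so query fragments can release their resources early.
  SET SPOOL_QUERY_RESULTS=TRUE;
  SELECT * FROM big_table WHERE col1 > 100;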
Starting in this version, Impala supports cookies for authentication when clients connect via HiveServer2 over HTTP.
You can use the --max_cookie_lifetime_s startup flag to control how long generated cookies remain valid, or set it to 0 to disable cookie authentication.
See Impala Client Access for more information.
Object ownership for tables, views, and databases is enabled by default in Impala. When you create a database, a table, or a view, as the owner of that object you implicitly have privileges on the object. The privileges that owners have are specified in Ranger on the special user, {OWNER}. The {OWNER} user must be defined in Ranger for the object ownership privileges to work in Impala.
See Impala Authorization for details.
Use the new Jaro and Jaro-Winkler functions to perform fuzzy matches on relatively short strings, for example, to scrub user-entered names against the records in the database.
JARO_DISTANCE, JARO_DST
JARO_SIMILARITY, JARO_SIM
JARO_WINKLER_DISTANCE, JW_DST
JARO_WINKLER_SIMILARITY, JW_SIM
See Impala String Functions for details.
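As a quick sketch of calling these functions (the sample strings are illustrative):

  -- Similarity is 1.0 for identical strings and 0.0 for completely
  -- dissimilar strings; distance is the complement.
  SELECT JARO_WINKLER_SIMILARITY('MARTHA', 'MARHTA');
  SELECT JARO_DISTANCE('gretzky', 'gretski');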
When configuring scratch space for intermediate files used in large sorts, joins, aggregations, or analytic function operations, use the --scratch_dirs startup flag to optionally specify a capacity quota per scratch directory, e.g., --scratch_dirs=/dir1:5MB,/dir2.
See How Impala Works with Hadoop File Formats for details.
During query plan generation, Impala samples underlying HBase tables to estimate row count and row size, but the sampling process can negatively impact planning time. To alleviate the issue, when the HBase table stats do not change much in a short time, disable the sampling with the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option so that the Impala planner falls back to using Hive Metastore (HMS) table stats instead.
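A minimal sketch of disabling the sampling for a session:

  -- Skip HBase row-count sampling during planning; the planner falls
  -- back to the HMS table stats instead.
  SET DISABLE_HBASE_NUM_ROWS_ESTIMATE=TRUE;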
To optimize query performance, the Impala planner formerly used the value of the fs.s3a.block.size startup flag when calculating the split size on non-block-based stores, such as S3 and ADLS. Starting in this release, the planner uses the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to get the Parquet file format specific split size; for Parquet files, the fs.s3a.block.size startup flag is no longer used. The default value of the PARQUET_OBJECT_STORE_SPLIT_SIZE query option is 256 MB.
See Using Impala with Amazon S3 Object Store for tuning Impala query performance for S3.
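For example, a hedged sketch of lowering the split size to increase scan parallelism for an S3-backed Parquet table (the 128 MB value is illustrative; the option takes a size in bytes):

  -- 134217728 bytes = 128 MB.
  SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;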
On the Query Details page of Impala Daemon Web UI, you have a new option, in addition to the existing Thrift and Text formats, to export the query profile output in the JSON format.
See Impala Web User Interface for Debugging for generating JSON query profile outputs in Web UI.
You can now use the DATE data type to query date values from Avro tables.
See Using the Avro File Format with Impala Tables for details.
This release adds support for primary and foreign key constraints, but in this release the constraints are advisory and intended for estimating cardinality during query planning in a future release. There is no attempt to enforce constraints. See CREATE TABLE Statement for details.
By default, HMS implicitly translates internal Kudu tables to external Kudu tables with the 'external.table.purge' property set to true. These tables behave similarly to internal tables. You can explicitly create such external Kudu tables. See CREATE TABLE Statement for details.
This release supports Ranger column masking, which hides sensitive columnar data in Impala query output. For example, you can define a policy that reveals only the first or last four characters of column data. Column masking is enabled by default. See Ranger Column Masking for details.
You can set a default limit on the size of the broadcast input to prevent performance problems caused by excessively large broadcast joins.
In this release, you can use Read Optimized Queries on Hudi tables. See Using the Hudi File Format for details.
Impala stability and performance have been improved. Consequently, ORC reads are now enabled in Impala by default. To disable, set --enable_orc_scanner to false when starting the cluster. See Using the ORC File Format with Impala Tables for details.
This release supports ZSTD and DEFLATE compression codecs for text files. See Using bzip2, deflate, gzip, Snappy, or zstd Text Files for details.
The following sections describe the noteworthy improvements made in Impala 3.3.
For the full list of issues closed in this release, see the changelog for Impala 3.3.
Apache Ranger: Use Apache Ranger to manage authorization in Impala. See Impala Authorization for details.
Apache Atlas: Use Apache Atlas to manage data governance in Impala.
Hive 3: This release of Impala works with Hive version 3.
To improve performance when using Parquet files, Impala can now write page indexes in Parquet files and use those indexes to skip pages for faster scans.
See Query Performance for Impala Parquet Tables for details.
Impala can now cache remote HDFS file handles for tables that store their data in Amazon S3 cloud storage. See Scalability Considerations for File Handle Caching for information on the remote file handle cache.
In Impala 3.3 and Kudu 1.10, Kudu is integrated with Hive Metastore (HMS), and from Impala, you can create, update, delete, and query the tables in the Kudu services integrated with HMS.
See Using Kudu with Impala for information on using Kudu tables in Impala.
Zstandard (Zstd) is a real-time compression algorithm offering a tradeoff between compression speed and compression ratio. Compression levels from 1 up to 22 are supported; the lower the level, the faster the speed, at the cost of compression ratio.
LZ4 is a lossless compression algorithm providing extremely fast and scalable compression and decompression.
To improve performance on multi-cluster HDFS environments as well as on object store environments, Impala now caches data for non-local reads (e.g. S3, ABFS, ADLS) on local storage.
The data cache is enabled with the --data_cache startup flag.
See Impala Remote Data Cache for the information and steps to enable remote data cache.
The following features to improve metadata performance are enabled by default in this release:
Incremental stats are now compressed in memory in catalogd, reducing the memory footprint in catalogd.
impalad coordinators fetch incremental stats from catalogd on demand, reducing the memory footprint and the network requirements for broadcasting metadata.
Time-based and memory-based automatic invalidation of metadata keeps the size of metadata bounded and reduces the chances of the catalogd cache running out of memory.
Automatic invalidation of metadata: with automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH statements in a number of conditions.
In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:
INSERT into tables and partitions from Impala or from Spark, on the same cluster or in a multi-cluster configuration
See Metadata Management for the information on the above features.
To offer more dynamic and flexible resource management, Impala supports the new configuration parameters that scale with the number of hosts in the resource pool. You can use the parameters to control the number of running queries, queued queries, and maximum amount of memory allocated for Impala resource pools. See Admission Control and Query Queuing for the information about the new parameters and using them for admission control.
The following information was added to the Query Profile output for better monitoring and troubleshooting of query performance.
Network I/O throughput
System disk I/O throughput
See Impala Query Profile for generating and reading query profile.
You can use the new DATE data type to describe a particular year/month/day, in the form YYYY-MM-DD.
This initial DATE type supports the TEXT, Parquet, and HBase file formats.
Support for the DATE data type includes the following features:
DATE type columns as a partitioning key column
DATE literals
Implicit casting between DATE and other types: STRING and TIMESTAMP
Functions that take TIMESTAMP arguments now allow DATE arguments, as well.
See DATE Data Type and Impala Date and Time Functions for using the DATE type.
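A brief sketch of these features together (the table and column names are illustrative):

  -- DATE as a partitioning key, a DATE literal, and implicit casting
  -- from STRING to DATE in the WHERE clause.
  CREATE TABLE events (id INT) PARTITIONED BY (event_date DATE);
  INSERT INTO events PARTITION (event_date = DATE '2019-12-31') VALUES (1);
  SELECT id FROM events WHERE event_date = '2019-12-31';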
Impala added support for creating, dropping, querying, and inserting into insert-only transactional tables.
See Impala Transactions for details.
Client applications can now connect to Impala over HTTP via HiveServer2, with the option to use Kerberos SPNEGO or LDAP for authentication. See Impala Clients for details.
When you create a table, the default file format for that table's data is now Parquet. For backward compatibility, you can use the DEFAULT_FILE_FORMAT query option to set the default file format to the previous default, text, or to other formats.
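For example, a sketch of restoring the previous default for a session:

  -- Tables created in this session default to text format again,
  -- without needing a STORED AS clause.
  SET DEFAULT_FILE_FORMAT=TEXT;
  CREATE TABLE t1 (c1 INT);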
The GET_JSON_OBJECT() function extracts a JSON object from a string based on the path specified and returns the extracted JSON object. See Impala Miscellaneous Functions for details.
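A small sketch of the function (the JSON string is illustrative):

  -- Returns 1, the value at path $.a.
  SELECT GET_JSON_OBJECT('{"a": 1, "b": [2, 3]}', '$.a');
  -- Returns 3, the second element of the array b.
  SELECT GET_JSON_OBJECT('{"a": 1, "b": [2, 3]}', '$.b[1]');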
This version of Impala is certified to run on Ubuntu 18.04.
The following sections describe the noteworthy improvements made in Impala 3.2.
For the full list of issues closed in this release, see the changelog for Impala 3.2.
Impala can now cache remote HDFS file handles when the cache_remote_file_handles impalad flag is set to true. This feature does not apply to non-HDFS tables, such as Kudu or HBase tables, and does not apply to tables that store their data on cloud services, such as S3 or ADLS. See Scalability Considerations for details about file handle caching in Impala.
A new admission control page, /admission, was added to the Impala daemon web UI and provides information about Impala resource pools.
Impala will cancel a query if the query produces more rows than the limit specified by the NUM_ROWS_PRODUCED_LIMIT query option. The limit applies only when the results are returned to a client, e.g. for a SELECT query, but not for an INSERT query. This query option is a guardrail against users accidentally submitting queries that return a large number of rows.
When enabled, the catalogd polls Hive Metastore (HMS) notification events at a configurable interval and syncs with HMS. You can use the new web UI pages of the catalogd to check the state of the automatic invalidate event processor.
Note: This is a preview feature in Impala 3.2.
Impala now supports the TIMESTAMP_MILLIS and TIMESTAMP_MICROS Parquet types. See Using Parquet File Format for Impala Tables for the Parquet support in Impala.
The new LEVENSHTEIN (LE_DST) string function returns the Levenshtein distance between two input strings, the minimum number of single-character edits required to transform one string into the other.
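A quick sketch of the new function:

  -- Returns 3: kitten -> sitten -> sittin -> sitting.
  SELECT LEVENSHTEIN('kitten', 'sitting');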
The IF NOT EXISTS clause is now supported in the ALTER TABLE statement.
The DEFAULT_FILE_FORMAT query option allows you to set the default table file format, removing the need for the STORED AS <format> clause. Set this option if you prefer a default other than TEXT. The supported formats are: TEXT, RC_FILE, SEQUENCE_FILE, AVRO, PARQUET, KUDU, and ORC.
The EXPLAIN output now includes additional information for queries.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 3.1.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 3.0.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.12.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.11.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.10.
For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.9.
The following are some of the most significant new features in this release:
A new function, replace(), which is faster than regexp_replace() for simple string substitutions.
See Impala String Functions for details.
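A short sketch of the function:

  -- Returns 'hello Impala'; unlike regexp_replace(), the pattern is a
  -- literal string rather than a regular expression.
  SELECT REPLACE('hello world', 'world', 'Impala');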
Startup flags for the impalad daemon, is_executor
and is_coordinator
, let you divide the work on a large, busy cluster
between a small number of hosts acting as query coordinators, and a larger number of
hosts acting as query executors. By default, each host can act in both roles,
potentially introducing bottlenecks during heavily concurrent workloads.
See How to Configure Impala with Dedicated Coordinators for details.
Performance and scalability improvements:
The COMPUTE STATS statement can take advantage of multithreading.
Improved scalability for highly concurrent loads by reducing the possibility of TCP/IP timeouts.
A configuration setting, accepted_cnxn_queue_depth
, can be adjusted upwards to
avoid this type of timeout on large clusters.
Several performance improvements were made to the mechanism for generating native code:
Some queries involving analytic functions can take better advantage of native code generation.
Modules produced during intermediate code generation are organized to be easier to cache and reuse during the lifetime of a long-running or complicated query.
The COMPUTE STATS statement is more efficient (less time for the codegen phase) for tables with a large number of columns, especially for tables containing TIMESTAMP columns.
The logic for determining whether or not to use a runtime filter is more reliable, and the evaluation process itself is faster because of native code generation.
The MT_DOP query option enables multithreading for a number of Impala operations. COMPUTE STATS statements for Parquet tables use a default of MT_DOP=4 to improve the intra-node parallelism and CPU efficiency of this data-intensive operation.
The COMPUTE STATS statement is more efficient (less time for the codegen phase) for tables with a large number of columns.
A new hint, CLUSTERED
,
allows Impala INSERT
operations on a Parquet table
that use dynamic partitioning to process a high number of
partitions in a single statement. The data is ordered based on the
partition key columns, and each partition is only written
by a single host, reducing the amount of memory needed to buffer
Parquet data while the data blocks are being constructed.
The new configuration setting inc_stats_size_limit_bytes
lets you reduce the load on the catalog server when running the
COMPUTE INCREMENTAL STATS
statement for very large tables.
Impala folds many constant expressions within query statements,
rather than evaluating them for each row. This optimization
is especially useful when using functions to manipulate and
format TIMESTAMP
values, such as the result
of an expression such as to_date(now() - interval 1 day)
.
Parsing of complicated expressions is faster. This speedup is
especially useful for queries containing large CASE
expressions.
Evaluation is faster for IN
operators with many constant
arguments. The same performance improvement applies to other functions
with many constant arguments.
Impala optimizes identical comparison operators within multiple OR blocks.
The reporting for wall-clock times and total CPU time in profile output is more accurate.
A new query option, SCRATCH_LIMIT
, lets you restrict the amount of
space used when a query exceeds the memory limit and activates the "spill to disk" mechanism.
This option helps to avoid runaway queries or make queries "fail fast" if they require more
memory than anticipated. You can prevent runaway queries from using excessive amounts of spill space,
without restarting the cluster to turn the spilling feature off entirely.
See SCRATCH_LIMIT Query Option for details.
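A hedged sketch of capping spill space for a session (the value is illustrative; the option takes a number of bytes):

  -- Limit "spill to disk" usage to roughly 10 GB for queries in this session.
  SET SCRATCH_LIMIT=10000000000;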
Integration with Apache Kudu:
The experimental Impala support for the Kudu storage layer has been folded into the main Impala development branch. Impala can now directly access Kudu tables, opening up new capabilities such as enhanced DML operations and continuous ingestion.
The DELETE
statement is a flexible way to remove data from a Kudu table. Previously,
removing data from an Impala table involved removing or rewriting the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
The UPDATE
statement is a flexible way to modify data within a Kudu table. Previously,
updating data in an Impala table involved replacing the underlying data files, dropping entire partitions,
or rewriting the entire table. This Impala statement only works for Kudu tables.
The UPSERT statement is a flexible way to ingest data, modify data, or both, within a Kudu table. Previously, ingesting data that might contain duplicates involved an inefficient multi-stage operation, and there was no built-in protection against duplicate data. The UPSERT statement, in combination with the primary key designation for Kudu tables, lets you add or replace rows in a single operation, and automatically avoids creating any duplicate data.
The CREATE TABLE statement gains some new clauses that are specific to Kudu tables: PARTITION BY, PARTITIONS, STORED AS KUDU, and the column attributes PRIMARY KEY, NULL and NOT NULL, ENCODING, COMPRESSION, DEFAULT, and BLOCK_SIZE. These clauses replace the explicit TBLPROPERTIES settings that were required in the early experimental phases of integration between Impala and Kudu.
The ALTER TABLE
statement can change certain attributes of Kudu tables.
You can add, drop, or rename columns.
You can add or drop range partitions.
You can change the TBLPROPERTIES value to rename or point to a different underlying Kudu table, independently from the Impala table name in the metastore database.
You cannot change the data type of an existing column in a Kudu table.
The SHOW PARTITIONS statement displays information about the distribution of data between partitions in Kudu tables. A new variation, SHOW RANGE PARTITIONS, displays information about the Kudu-specific partitions that apply across ranges of key values.
Not all Impala data types are supported in Kudu tables. In particular, currently the Impala TIMESTAMP type is not allowed in a Kudu table. Impala does not recognize the UNIXTIME_MICROS Kudu type when it is present in a Kudu table. (These two representations of date/time data use different units and are not directly compatible.) You cannot create columns of type TIMESTAMP, DECIMAL, VARCHAR, or CHAR within a Kudu table. Within a query, you can cast values in a result set to these types. Certain types, such as BOOLEAN, cannot be used as primary key columns.
Currently, Kudu tables are not interchangeable between Impala and Hive the way other kinds of Impala tables are. Although the metadata for Kudu tables is stored in the metastore database, currently Hive cannot access Kudu tables.
The INSERT
statement works for Kudu tables. The organization
of the Kudu data makes it more efficient than with HDFS-backed tables to insert
data in small batches, such as with the INSERT ... VALUES
syntax.
Some audit data is recorded for data governance purposes.
All UPDATE, DELETE, and UPSERT statements are characterized as INSERT operations in the audit log. Currently, lineage metadata is not generated for UPDATE and DELETE operations on Kudu tables.
Access to Kudu tables must be granted to roles as usual.
Currently, access to a Kudu table through Sentry is "all or nothing".
You cannot enforce finer-grained permissions such as at the column level,
or permissions on certain operations such as INSERT
.
Only users with ALL privileges on SERVER can create external Kudu tables.
Equality and IN
predicates in Impala queries are pushed to
Kudu and evaluated efficiently by the Kudu storage layer.
Security:
Impala can take advantage of the S3 encrypted credential store, to avoid exposing the secret key when accessing data stored on S3.
[IMPALA-1654] Several kinds of DDL operations can now work on a range of partitions. The partitions can be specified using operators such as <, >=, and != rather than just an equality predicate applying to a single partition.
This new feature extends the syntax of several clauses of the ALTER TABLE statement (DROP PARTITION, SET [UN]CACHED, SET FILEFORMAT | SERDEPROPERTIES | TBLPROPERTIES), the SHOW FILES statement, and the COMPUTE INCREMENTAL STATS statement. It does not apply to statements that are defined to only apply to a single partition, such as LOAD DATA, ALTER TABLE ... ADD PARTITION, SET LOCATION, and INSERT with a static partitioning clause.
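As a sketch of the extended syntax (the table sales and its year partition key are illustrative):

  -- Drop every partition for years before 2010 in one statement.
  ALTER TABLE sales DROP PARTITION (year < 2010);
  -- Compute incremental stats across a range of partitions.
  COMPUTE INCREMENTAL STATS sales PARTITION (year >= 2015);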
The instr() function has optional second and third arguments, representing the character position at which to begin searching for the substring, and the Nth occurrence of the substring to find.
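A small sketch of the extended signature:

  -- Returns 9: starting at position 1, find the 2nd occurrence of 'bar'.
  SELECT INSTR('foo bar bar', 'bar', 1, 2);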
Improved error handling for malformed Avro data. In particular, incorrect precision or scale for DECIMAL types is now handled.
Impala debug web UI:
In addition to "inflight" and "finished" queries, the web UI now also includes a section for "queued" queries.
The /sessions tab now clarifies how many of the displayed sessions are active, and lets you sort by Expired status to distinguish active sessions from expired ones.
Improved stability when DDL operations such as CREATE DATABASE or DROP DATABASE are run in Hive at the same time as an Impala INVALIDATE METADATA statement.
The "out of memory" error report was made more user-friendly, with additional diagnostic information to help identify the spot where the memory limit was exceeded.
Improved disk space usage for Java-based UDFs. Temporary copies of the associated JAR
files are removed when no longer needed, so that they do not accumulate across restarts
of the catalogd daemon and potentially cause an out-of-space condition.
These temporary files are also created in the directory specified by the local_library_dir
configuration setting, so that the storage for these temporary files can be independent
from any capacity limits on the /tmp filesystem.
Performance improvements:
[IMPALA-3206] Speedup for queries against DECIMAL columns in Avro tables. The code that parses DECIMAL values from Avro now uses native code generation.
[IMPALA-3674] Improved efficiency in LLVM code generation can reduce codegen time, especially for short queries.
[IMPALA-2979] Improvements to scheduling on worker nodes, enabled by the REPLICA_PREFERENCE query option. See REPLICA_PREFERENCE Query Option (Impala 2.7 or higher only) for details.
[IMPALA-1683] The REFRESH statement can be applied to a single partition, rather than the entire table. See REFRESH Statement and Refreshing a Single Partition for details.
Improvements to the Impala web user interface:
[IMPALA-2767] You can now force a session to expire by clicking a link in the web UI, on the /sessions tab.
[IMPALA-3715] The /memz tab includes more information about Impala memory usage.
[IMPALA-3716] The Details page for a query now includes a Memory tab.
[IMPALA-3499] Scalability improvements to the catalog server. Impala handles internal communication more efficiently for tables with large numbers of columns and partitions, where the size of the metadata exceeds 2 GiB.
[IMPALA-3677] You can send a SIGUSR1 signal to any Impala-related daemon to write a Breakpad minidump. For advanced troubleshooting, you can now produce a minidump without triggering a crash. See Breakpad Minidumps for Impala (Impala 2.6 or higher only) for details about the Breakpad minidump feature.
[IMPALA-3687] The schema reconciliation rules for Avro tables have changed slightly for CHAR and VARCHAR columns. Now, if the definition of such a column is changed in the Avro schema file, the column retains its CHAR or VARCHAR type as specified in the SQL definition, but the column name and comment from the Avro schema file take precedence. See Creating Avro Tables for details about column definitions in Avro tables.
[IMPALA-3575] Some network operations now have additional timeout and retry settings. The extra configuration helps avoid failed queries for transient network problems, to avoid hangs when a sender or receiver fails in the middle of a network transmission, and to make cancellation requests more reliable despite network issues.
Improvements to Impala support for the Amazon S3 filesystem:
Impala can now write to S3 tables through the INSERT
or LOAD DATA
statements.
See Using Impala with Amazon S3 Object Store for general information about
using Impala with S3.
A new query option, S3_SKIP_INSERT_STAGING
, lets you
trade off between fast INSERT
performance and
slower INSERT
s that are more consistent if a
problem occurs during the statement. The new behavior is enabled by default.
See S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only) for details
about this option.
Performance improvements for the runtime filtering feature:
The default for the RUNTIME_FILTER_MODE
query option is changed to GLOBAL
(the highest setting).
See RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only) for
details about this option.
The RUNTIME_BLOOM_FILTER_SIZE
setting is now only used
as a fallback if statistics are not available; otherwise, Impala
uses the statistics to estimate the appropriate size to use for each filter.
See RUNTIME_BLOOM_FILTER_SIZE Query Option (Impala 2.5 or higher only) for
details about this option.
New query options RUNTIME_FILTER_MIN_SIZE and RUNTIME_FILTER_MAX_SIZE let you fine-tune the sizes of the Bloom filter structures used for runtime filtering. If the filter size derived from Impala internal estimates or from the RUNTIME_BLOOM_FILTER_SIZE setting falls outside the size range specified by these options, any too-small filter size is adjusted to the minimum, and any too-large filter size is adjusted to the maximum. See RUNTIME_FILTER_MIN_SIZE Query Option (Impala 2.6 or higher only) and RUNTIME_FILTER_MAX_SIZE Query Option (Impala 2.6 or higher only) for details about these options.
Runtime filter propagation now applies to all the operands of UNION and UNION ALL operators.
Runtime filters can now be produced during join queries even when the join processing activates the spill-to-disk mechanism.
Admission control and dynamic resource pools are enabled by default. See Admission Control and Query Queuing for details about admission control.
You can now set column statistics manually in Impala, using the ALTER TABLE statement with a SET COLUMN STATS clause. See Table and Column Statistics for details.
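A hedged sketch of the clause (the numDVs and numNulls keys shown here are the commonly documented statistics; the values are illustrative):

  -- Manually record 1000 distinct values and zero NULLs for column c1.
  ALTER TABLE t1 SET COLUMN STATS c1 ('numDVs'='1000', 'numNulls'='0');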
Impala can now write lightweight "minidump" files, rather than large core files, to save diagnostic information when any of the Impala-related daemons crash. This feature uses the open source breakpad framework. See Breakpad Minidumps for Impala (Impala 2.6 or higher only) for details.
The PARQUET_FALLBACK_SCHEMA_RESOLUTION
query option
lets Impala locate columns within Parquet files based on
column name rather than ordinal position.
This enhancement improves interoperability with applications
that write Parquet files with a different order or subset of
columns than are used in the Impala table.
See PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)
for details.
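For example, a sketch of switching to name-based resolution for a session:

  -- Match Parquet file columns to table columns by name rather than
  -- by ordinal position.
  SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME;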
The PARQUET_ANNOTATE_STRINGS_UTF8
query option
makes Impala include the UTF-8
annotation
metadata for STRING
, CHAR
,
and VARCHAR
columns in Parquet files created
by INSERT
or CREATE TABLE AS SELECT
statements.
See PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only)
for details.
Improvements to security and reduction in overhead for secure clusters:
Overall performance improvements for secure clusters. (TPC-H queries on a secure cluster were benchmarked at roughly 3x as fast as the previous release.)
Impala now recognizes the auth_to_local
setting,
specified through the HDFS configuration setting
hadoop.security.auth_to_local
.
This feature is disabled by default; to enable it,
specify --load_auth_to_local_rules=true
in the impalad configuration settings.
See Mapping Kerberos Principals to Short Names for Impala for details.
Timing improvements in the mechanism for the impalad daemon to acquire Kerberos tickets. This feature spreads out the overhead on the KDC during Impala startup, especially for large clusters.
For Kerberized clusters, the Catalog service now uses the Kerberos principal instead of the operating system user that runs the catalogd daemon.
This eliminates the requirement to configure a hadoop.user.group.static.mapping.overrides
setting to put the OS user into the Sentry administrative group, on clusters where the principal
and the OS user name for this user are different.
Overall performance improvements for join queries, by using a prefetching mechanism while building the in-memory hash table to evaluate join predicates. See PREFETCH_MODE Query Option (Impala 2.6 or higher only) for the query option to control this optimization.
The impala-shell interpreter has a new command,
SOURCE
, that lets you run a set of SQL statements
or other impala-shell commands stored in a file.
You can run additional SOURCE
commands from inside
a file, to set up flexible sequences of statements for use cases
such as schema setup, ETL, or reporting.
See impala-shell Command Reference for details
and Running Commands and SQL Statements in impala-shell
for examples.
The millisecond()
built-in function lets you extract
the fractional seconds part of a TIMESTAMP
value.
See Impala Date and Time Functions for details.
If an Avro table is created without column definitions in the
CREATE TABLE
statement, and columns are later
added through ALTER TABLE
, the resulting
table is now queryable. Missing values from the newly added
columns now default to NULL
.
See Using the Avro File Format with Impala Tables for general details about
working with Avro files.
The handling of DECIMAL literals is improved, no longer going through an intermediate conversion step to DOUBLE:
Casting a DECIMAL value to TIMESTAMP now produces a more precise value for the TIMESTAMP than formerly.
Certain function calls involving DECIMAL literals now succeed, when formerly they failed due to lack of a function signature with a DOUBLE argument.
Faster runtime performance for DECIMAL constant values, through improved native code generation for all combinations of precision and scale of the DECIMAL type.
Improved type accuracy for CASE return values. If all WHEN clauses of the CASE expression are of CHAR type, the final result is also CHAR instead of being converted to STRING. See Impala Conditional Functions for details about the CASE function.
Uncorrelated queries using the NOT EXISTS
operator
are now supported. Formerly, the NOT EXISTS
operator was only available for correlated subqueries.
Improved performance for reading Parquet files.
Improved performance for top-N queries, that is,
those including both ORDER BY
and
LIMIT
clauses.
Impala optionally skips an arbitrary number of header lines from text input
files on HDFS based on the skip.header.line.count
value
in the TBLPROPERTIES
field of the table metadata.
See Data Files for Text Tables for details.
Trailing comments are now allowed in queries processed by
the impala-shell options -q
and -f
.
Impala can run COUNT
queries for RCFile tables
that include complex type columns.
See Complex Types (Impala 2.3 or higher only) for
general information about working with complex types,
and ARRAY Complex Type (Impala 2.3 or higher only),
MAP Complex Type (Impala 2.3 or higher only), and STRUCT Complex Type (Impala 2.3 or higher only)
for syntax details of each type.
Dynamic partition pruning. When a query refers to a partition key column in a WHERE clause, and the exact set of column values is not known until the query is executed, Impala evaluates the predicate and skips the I/O for entire partitions that are not needed. For example, if a table was partitioned by year, Impala would apply this technique to a query such as SELECT c1 FROM partitioned_table WHERE year = (SELECT MAX(year) FROM other_table). See Dynamic Partition Pruning for details.
The dynamic partition pruning optimization technique lets Impala avoid reading
data files from partitions that are not part of the result set, even when
that determination cannot be made in advance. This technique is especially valuable
when performing join queries involving partitioned tables. For example, if a join
query includes an ON
clause and a WHERE
clause
that refer to the same columns, the query can find the set of column values that
match the WHERE
clause, and only scan the associated partitions
when evaluating the ON
clause.
Dynamic partition pruning is controlled by the same settings as the runtime filtering feature.
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL
.
Runtime filtering. This is a wide-ranging set of optimizations that are especially valuable for join queries. Using the same technique as with dynamic partition pruning, Impala uses the predicates from WHERE and ON clauses to determine the subset of column values from one of the joined tables that could possibly be part of the result set. Impala sends a compact representation of the filter condition to the hosts in the cluster, instead of the full set of values or the entire table.
See Runtime Filtering for Impala Queries (Impala 2.5 or higher only) for details.
By default, this feature is enabled at a medium level, because the maximum setting can use
slightly more memory for queries than in previous releases.
To fully enable this feature, set the query option RUNTIME_FILTER_MODE=GLOBAL
.
See RUNTIME_FILTER_MODE Query Option (Impala 2.5 or higher only) for details.
This feature involves some new query options: RUNTIME_FILTER_MODE, MAX_NUM_RUNTIME_FILTERS, RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_WAIT_TIME_MS, and DISABLE_ROW_RUNTIME_FILTERING. See the documentation for each of these query options for details.
More efficient use of the HDFS caching feature, to avoid
hotspots and bottlenecks that could occur if heavily used
cached data blocks were always processed by the same host.
By default, Impala now randomizes which host processes each cached
HDFS data block, when cached replicas are available on multiple hosts.
(Remember to use the WITH REPLICATION
clause with the
CREATE TABLE
or ALTER TABLE
statement
when enabling HDFS caching for a table or partition, to cache the same
data blocks across multiple hosts.)
The new query option SCHEDULE_RANDOM_REPLICA
lets you fine-tune the interaction with HDFS caching even more.
See Using HDFS Caching with Impala (Impala 2.1 or higher only) for details.
The TRUNCATE TABLE statement now accepts an IF EXISTS clause, making TRUNCATE TABLE easier to use in setup or ETL scripts where the table might or might not exist.
See TRUNCATE TABLE Statement (Impala 2.3 or higher only) for details.
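A short sketch of the clause (the table name is illustrative):

  -- Succeeds whether or not staging_table currently exists.
  TRUNCATE TABLE IF EXISTS staging_table;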
Performance improvements for the DECIMAL data type:
Using DECIMAL values in a GROUP BY clause now triggers the native code generation optimization, speeding up queries that group by values such as prices.
Checking for overflow in DECIMAL multiplication is now substantially faster, making DECIMAL a more practical data type in some use cases where formerly DECIMAL was much slower than FLOAT or DOUBLE.
Multiplying a mixture of DECIMAL and FLOAT or DOUBLE values now returns DOUBLE rather than DECIMAL. This change avoids some cases where an intermediate value would underflow or overflow and become NULL unexpectedly.
For UDFs written in Java, or Hive UDFs reused for Impala, Impala now allows parameters and return values to be primitive types. Formerly, these things were required to be one of the "Writable" object types. See Using Hive UDFs with Impala for details.
Performance improvements for HDFS I/O. Impala now caches HDFS file handles to avoid the overhead of repeatedly opening the same file.
Performance improvements for queries involving nested complex types. Certain basic query types, such as counting the elements of a complex column, now use an optimized code path.
Improvements to the memory reservation mechanism for the Impala admission control feature. You can specify more settings, such as the timeout period and maximum aggregate memory used, for each resource pool instead of globally for the Impala instance. The default limit for concurrent queries (the max requests setting) is now unlimited instead of 200.
Performance improvements related to code generation.
Even in queries where code generation is not performed
for some phases of execution (such as reading data from
Parquet tables), Impala can still use code generation in
other parts of the query, such as evaluating
functions in the WHERE
clause.
Performance improvements for queries using aggregation functions
on high-cardinality columns.
Formerly, Impala could do unnecessary extra work to produce intermediate
results for operations such as DISTINCT
or GROUP BY
on columns that were unique or had few duplicate values.
Now, Impala decides at run time whether it is more efficient to do an initial aggregation phase and pass along a smaller set of intermediate data, or to pass raw intermediate data back to the next phase of query processing to be aggregated there.
This feature is known as streaming pre-aggregation.
In case of performance regression, this feature can be turned off
using the DISABLE_STREAMING_PREAGGREGATIONS
query option.
See DISABLE_STREAMING_PREAGGREGATIONS Query Option (Impala 2.5 or higher only) for details.
Spill-to-disk feature now always recommended. In earlier releases, the spill-to-disk feature
could be turned off using a pair of configuration settings,
enable_partitioned_aggregation=false
and
enable_partitioned_hash_join=false
.
The latest improvements in the spill-to-disk mechanism, and related features that
interact with it, make this feature robust enough that disabling it is now
no longer needed or supported. In particular, some new features in Impala 2.5
and higher do not work when the spill-to-disk feature is disabled.
Improvements to scripting capability for the impala-shell command, through user-specified substitution variables that can appear in statements processed by impala-shell:
The --var
command-line option lets you pass key-value pairs to
impala-shell. The shell can substitute the values
into queries before executing them, where the query text contains the notation
${var:varname}
. For example, you might prepare a SQL file
containing a set of DDL statements and queries containing variables for
database and table names, and then pass the applicable names as part of the
impala-shell -f filename
command.
See Running Commands and SQL Statements in impala-shell for details.
The SET
and UNSET
commands within the
impala-shell interpreter now work with user-specified
substitution variables, as well as the built-in query options.
The two kinds of variables are divided in the SET
output.
As with variables defined by the --var
command-line option,
you refer to the user-specified substitution variables in queries by using
the notation ${var:varname}
in the query text. Because the substitution variables are processed by
impala-shell instead of the impalad
backend, you cannot define your own substitution variables through the
SET
statement in a JDBC or ODBC application.
See SET Statement for details.
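As a sketch, within an interactive impala-shell session (the variable name and value are illustrative):

  -- Define a substitution variable, then reference it in a query.
  SET VAR:tname=sales_2016;
  SELECT COUNT(*) FROM ${var:tname};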
Performance improvements for query startup. Impala better parallelizes certain work when coordinating plan distribution between impalad instances, which improves startup time for queries involving tables with many partitions on large clusters, or complicated queries with many plan fragments.
Performance and scalability improvements for tables with many partitions. The memory requirements on the coordinator node are reduced, making it substantially faster and less resource-intensive to do joins involving several tables with thousands of partitions each.
Whitelisting for access to internal APIs. For applications that need direct access
to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can
specify a list of Kerberos users who are allowed to call those APIs. By default, the
impala
and hdfs
users are the only ones authorized
for this kind of access.
Any users not explicitly authorized through the internal_principals_whitelist
configuration setting are blocked from accessing the APIs. This setting applies to all the
Impala-related daemons, although currently it is primarily used for HDFS to control the
behavior of the catalog server.
Improvements to Impala integration and usability for Hue. (The code changes are actually on the Hue side.)
The list of tables now refreshes dynamically.
Usability improvements for case-insensitive queries.
You can now use the operators ILIKE
and IREGEXP
to perform case-insensitive wildcard matches or regular expression matches,
rather than explicitly converting column values with UPPER
or LOWER
.
See ILIKE Operator and IREGEXP Operator for details.
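A brief sketch of the operators (the customers table is illustrative):

  -- Matches 'Jones', 'JONES', 'jones', and so on.
  SELECT name FROM customers WHERE name ILIKE 'jones%';
  -- Case-insensitive regular expression match.
  SELECT name FROM customers WHERE name IREGEXP '^jo(e|hn)';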
Performance and reliability improvements for DDL and insert operations on partitioned tables with a large number of partitions. Impala only re-evaluates metadata for partitions that are affected by a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress, other Impala statements that attempt to modify metadata for the same table wait until the first one finishes.
Reliability improvements for the LOAD DATA
statement.
Previously, this statement would fail if the source HDFS directory
contained any subdirectories at all. Now, the statement ignores
any hidden subdirectories, for example _impala_insert_staging.
A new operator, IS [NOT] DISTINCT FROM, lets you compare values and always get a true or false result, even if one or both of the values are NULL. The IS NOT DISTINCT FROM operator, or its equivalent <=> notation, improves the efficiency of join queries that treat key values that are NULL in both tables as equal. See IS DISTINCT FROM Operator for details.
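A quick sketch of the NULL-safe comparison:

  SELECT NULL = NULL;                    -- NULL
  SELECT NULL IS NOT DISTINCT FROM NULL; -- true
  SELECT NULL <=> NULL;                  -- true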
Security enhancements for the impala-shell command.
A new option, --ldap_password_cmd
, lets you specify
a command to retrieve the LDAP password. The resulting password is
then used to authenticate the impala-shell command
with the LDAP server.
See impala-shell Configuration Options for details.
The CREATE TABLE AS SELECT
statement now accepts a
PARTITIONED BY
clause, which lets you create a
partitioned table and insert data into it with a single statement.
See CREATE TABLE Statement for details.
User-defined functions (UDFs and UDAFs) written in C++ now persist automatically
when the catalogd daemon is restarted. You no longer
have to run the CREATE FUNCTION
statements again after a restart.
User-defined functions (UDFs) written in Java can now persist
when the catalogd daemon is restarted, and can be shared
transparently between Impala and Hive. You must do a one-time operation to recreate these
UDFs using new CREATE FUNCTION
syntax, without a signature for arguments
or the return value. Afterwards, you no longer have to run the CREATE FUNCTION
statements again after a restart.
Although Impala does not have visibility into the UDFs that implement the
Hive built-in functions, user-created Hive UDFs are now automatically available
for calling through Impala.
See CREATE FUNCTION Statement for details.
Reliability enhancements for memory management. Some aggregation and join queries that formerly might have failed with an out-of-memory error due to memory contention, now can succeed using the spill-to-disk mechanism.
The SHOW DATABASES
statement now returns two columns rather than one.
The second column includes the associated comment string, if any, for each database.
Adjust any application code that examines the list of databases and assumes the
result set contains only a single column.
See SHOW DATABASES for details.
A new optimization speeds up aggregation operations that involve only the partition key
columns of partitioned tables. For example, a query such as SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1
can avoid reading any data files if T1
is a partitioned table and K
is one of the partition key columns. Because this technique can produce different results in cases
where HDFS files in a partition are manually deleted or are empty, you must enable the optimization
by setting the query option OPTIMIZE_PARTITION_KEY_SCANS
.
See OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only) for details.
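As a sketch, using the t1 and k names from the example above:

  SET OPTIMIZE_PARTITION_KEY_SCANS=1;
  -- Can now be answered from partition metadata alone, without
  -- reading the data files.
  SELECT COUNT(DISTINCT k), MIN(k), MAX(k) FROM t1;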
The DESCRIBE
statement can now display metadata about a database, using the
syntax DESCRIBE DATABASE db_name
.
See DESCRIBE Statement for details.
The uuid()
built-in function generates an
alphanumeric value that you can use as a guaranteed unique identifier.
The uniqueness applies even across tables, for cases where an ascending
numeric sequence is not suitable.
See Impala Miscellaneous Functions for details.
Impala can be used on the DSSD D5 Storage Appliance. From a user perspective, the Impala features are the same as in Impala 2.3.
The following are the major new features in Impala 2.3.x. This major release contains improvements to SQL syntax (particularly new support for complex types), performance, manageability, and security.
Complex data types: STRUCT
, ARRAY
, and MAP
. These
types can encode multiple named fields, positional items, or key-value pairs within a single column.
You can combine these types to produce nested types with arbitrarily deep nesting,
such as an ARRAY
of STRUCT
values,
a MAP
where each key-value pair is an ARRAY
of other MAP
values,
and so on. Currently, complex data types are only supported for the Parquet file format.
See Complex Types (Impala 2.3 or higher only) for usage details and ARRAY Complex Type (Impala 2.3 or higher only), STRUCT Complex Type (Impala 2.3 or higher only), and MAP Complex Type (Impala 2.3 or higher only) for syntax.
Column-level authorization lets you define access to particular columns within a table, rather than the entire table. This feature lets you reduce the reliance on creating views to set up authorization schemes for subsets of information. See the documentation for Apache Sentry for background details, and GRANT Statement (Impala 2.0 or higher only) and REVOKE Statement (Impala 2.0 or higher only) for Impala-specific syntax.
The TRUNCATE TABLE
statement removes all the data from a table without removing the table itself.
See TRUNCATE TABLE Statement (Impala 2.3 or higher only) for details.
Nested loop join queries. Some join queries that formerly required equality comparisons can now use
operators such as <
or >=
. This same join mechanism is used
internally to optimize queries that retrieve values from complex type columns.
See Joins in Impala SELECT Statements for details about Impala join queries.
Reduced memory usage and improved performance and robustness for spill-to-disk feature. See SQL Operations that Spill to Disk for details about this feature.
Performance improvements for querying Parquet data files containing multiple row groups and multiple data blocks:
For files written by Hive, SparkSQL, and other Parquet MR writers and spanning multiple HDFS blocks, Impala now scans the extra data blocks locally when possible, rather than using remote reads.
Impala queries benefit from the improved alignment of row groups with HDFS blocks for Parquet
files written by Hive, MapReduce, and other components. (Impala itself never writes
multiblock Parquet files, so the alignment change does not apply to Parquet files produced by Impala.)
These Parquet writers now add padding to Parquet files that they write to align row groups with HDFS blocks.
The parquet.writer.max-padding
setting specifies the maximum number of bytes, by default
8 megabytes, that can be added to the file between row groups to fill the gap at the end of one block
so that the next row group starts at the beginning of the next block.
If the gap is larger than this size, the writer attempts to fit another entire row group in the remaining space.
Include this setting in the hive-site configuration file to influence Parquet files written by Hive,
or the hdfs-site configuration file to influence Parquet files written by all non-Impala components.
See Using the Parquet File Format with Impala Tables for instructions about using Parquet data files with Impala.
Many new built-in scalar functions, for convenience and enhanced portability of SQL that uses common industry extensions.
Math functions (see Impala Mathematical Functions for details):
ATAN2
COSH
COT
DCEIL
DEXP
DFLOOR
DLOG10
DPOW
DROUND
DSQRT
DTRUNC
FACTORIAL, and the corresponding ! operator
FPOW
RADIANS
RANDOM
SINH
TANH
String functions (see Impala String Functions for details):
BTRIM
CHR
REGEXP_LIKE
SPLIT_PART
Date and time functions (see Impala Date and Time Functions for details):
INT_MONTHS_BETWEEN
MONTHS_BETWEEN
TIMEOFDAY
TIMESTAMP_CMP
Bit manipulation functions (see Impala Bit Functions for details):
BITAND
BITNOT
BITOR
BITXOR
COUNTSET
GETBIT
ROTATELEFT
ROTATERIGHT
SETBIT
SHIFTLEFT
SHIFTRIGHT
Type conversion functions (see Impala Type Conversion Functions for details):
TYPEOF
The effective_user() function (see Impala Miscellaneous Functions for details).
New built-in analytic functions: PERCENT_RANK, NTILE, CUME_DIST. See Impala Analytic Functions for details.
The DROP DATABASE
statement now works for a non-empty database.
When you specify the optional CASCADE
clause, any tables in the
database are dropped before the database itself is removed.
See DROP DATABASE Statement for details.
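A short sketch of the clause (the database name is illustrative):

  -- Drops all tables and views in temp_db, then removes the database.
  DROP DATABASE temp_db CASCADE;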
The DROP TABLE
and ALTER TABLE DROP PARTITION
statements have a new optional keyword, PURGE
.
This keyword causes Impala to immediately remove the relevant HDFS data files rather than sending them to the HDFS trashcan.
This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of
a problem with the HDFS trashcan, such as the trashcan not being configured or being in a different HDFS encryption zone
than the data files.
See DROP TABLE Statement and ALTER TABLE Statement for syntax.
The impala-shell command has a new feature for live progress reporting. This feature
is enabled through the --live_progress
and --live_summary
command-line options, or during a session through the LIVE_SUMMARY
and
LIVE_PROGRESS
query options.
See LIVE_PROGRESS Query Option (Impala 2.3 or higher only) and LIVE_SUMMARY Query Option (Impala 2.3 or higher only) for details.
The impala-shell command also now displays a random "tip of the day" when it starts.
The impala-shell option -f
now recognizes a special filename
-
to accept input from stdin.
See impala-shell Configuration Options for details about the options for running impala-shell in non-interactive mode.
Format strings for the unix_timestamp()
function can now include numeric timezone offsets.
See Impala Date and Time Functions for details.
Impala can now run a specified command to obtain the password to decrypt a private-key PEM file, rather than having the private-key file be unencrypted on disk. See Configuring TLS/SSL for Impala for details.
Impala components now can use SSL for more of their internal communication. SSL is used for
communication between all three Impala-related daemons when the configuration option
ssl_server_certificate
is enabled. SSL is used for communication with client
applications when the configuration option ssl_client_ca_certificate
is enabled.
See Configuring TLS/SSL for Impala for details.
Currently, you can only use one of server-to-server TLS/SSL encryption or Kerberos authentication. This limitation is tracked by the issue IMPALA-2598.
Improved flexibility for intermediate data types in user-defined aggregate functions (UDAFs). See Writing User-Defined Aggregate Functions (UDAFs) for details.
In Impala 2.3.2, the bug fix for IMPALA-2598 removes the restriction on using both Kerberos and SSL for internal communication between Impala components.
The following are the major new features in Impala 2.2. This release contains improvements to performance, manageability, security, and SQL syntax.
Several improvements to date and time features enable higher interoperability with Hive and other
database systems, provide more flexibility for handling time zones, and future-proof the handling of
TIMESTAMP
values:
The WITH REPLICATION
clause for the CREATE TABLE
and
ALTER TABLE
statements lets you control the replication factor for
HDFS caching for a specific table or partition. By default, each cached block is
only present on a single host, which can lead to CPU contention if the same host
processes each cached block. Increasing the replication factor lets Impala choose
different hosts to process different cached blocks, to better distribute the CPU load.
Startup flags for the impalad daemon enable a higher level of compatibility with
TIMESTAMP
values written by Hive, and more flexibility for working with date and
time data using the local time zone instead of UTC. To enable these features, set the
impalad startup flags
-use_local_tz_for_unix_timestamp_conversions=true
and
-convert_legacy_hive_parquet_utc_timestamps=true
.
The -use_local_tz_for_unix_timestamp_conversions
setting controls how the
unix_timestamp()
, from_unixtime()
, and now()
functions handle time zones. By default (when this setting is turned off), Impala considers all
TIMESTAMP
values to be in the UTC time zone when converting to or from Unix time
values. When this setting is enabled, Impala treats TIMESTAMP
values passed to or
returned from these functions to be in the local time zone. When this setting is enabled, take
particular care that all hosts in the cluster have the same timezone settings, to avoid
inconsistent results depending on which host reads or writes TIMESTAMP
data.
The -convert_legacy_hive_parquet_utc_timestamps
setting causes Impala to convert
TIMESTAMP
values to the local time zone when it reads them from Parquet files
written by Hive. This setting only applies to data using the Parquet file format, where Impala can
use metadata in the files to reliably determine that the files were written by Hive. If in the
future Hive changes the way it writes TIMESTAMP
data in Parquet, Impala will
automatically handle that new TIMESTAMP
encoding.
See TIMESTAMP Data Type for details about time zone handling and the configuration options for Impala / Hive compatibility with Parquet format.
In Impala 2.2.0 and higher, built-in functions that accept or return integers
representing TIMESTAMP
values use the BIGINT
type for
parameters and return values, rather than INT
. This change lets the
date and time functions avoid an overflow error that would otherwise occur on January
19th, 2038 (known as the
"Year
2038 problem" or "Y2K38 problem"). This change affects the
FROM_UNIXTIME()
and UNIX_TIMESTAMP()
functions. You
might need to change application code that interacts with these functions, change the
types of columns that store the return values, or add CAST()
calls to
SQL statements that call these functions.
See Impala Date and Time Functions for the current function signatures.
The SHOW FILES
statement lets you view the names and sizes of the files that make up
an entire table or a specific partition. See SHOW FILES Statement for details.
Impala can now run queries against Parquet data containing columns with complex or nested types, as long as the query only refers to columns with scalar types.
Performance improvements for queries that include IN()
operators and involve
partitioned tables.
The new -max_log_files
configuration option specifies how many log files to keep at
each severity level. The default value is 10, meaning that Impala preserves the latest 10 log files for
each severity level (INFO
, WARNING
, and ERROR
) for
each Impala-related daemon (impalad, statestored, and
catalogd). Impala checks to see if any old logs need to be removed based on the
interval specified in the logbufsecs
setting, every 5 seconds by default. See
Rotating Impala Logs for details.
Redaction of sensitive data from Impala log files. This feature protects details such as credit card numbers or tax IDs from administrators who see the text of SQL statements in the course of monitoring and troubleshooting a Hadoop cluster. See Redacting Sensitive Information from Impala Log Files for background information for Impala users, and the documentation for your Apache Hadoop distribution for usage details.
Lineage information is available for data created or queried by Impala. This feature lets you track who has accessed data through Impala SQL statements, down to the level of specific columns, and how data has been propagated between tables. See Viewing Lineage Information for Impala Data for background information for Impala users, the documentation for your Apache Hadoop distribution for usage details and how to interpret the lineage information.
Impala tables and partitions can now be located on the Amazon Simple Storage Service (S3) filesystem,
for convenience in cases where data is already located in S3 and you prefer to query it in-place.
Queries might have lower performance than when the data files reside on HDFS, because Impala uses some
HDFS-specific optimizations. Impala can query data in S3, but cannot write to S3. Therefore, statements
such as INSERT
and LOAD DATA
are not available when the destination
table or partition is in S3. See Using Impala with Amazon S3 Object Store for details.
Impala query support for Amazon S3 is included in Impala 2.2, but is not supported or recommended for production use in this version.
Improved support for HDFS encryption. The LOAD DATA
statement now works when the
source directory and destination table are in different encryption zones. See
the documentation for your Apache Hadoop distribution for details about using HDFS encryption with
Impala.
Additional arithmetic function mod()
. See
Impala Mathematical Functions for details.
Flexibility to interpret TIMESTAMP
values using the UTC time zone (the traditional
Impala behavior) or using the local time zone (for compatibility with TIMESTAMP
values
produced by Hive).
Enhanced support for ETL using tools such as Flume. Impala ignores temporary files typically produced
by these tools (filenames with suffixes .copying
and .tmp
).
The CPU requirement for Impala, which had become more restrictive in Impala 2.0.x and 2.1.x, has now been relaxed.
The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward, Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer required. This relaxed requirement simplifies the upgrade planning from Impala 1.x releases, which also worked on SSSE3-enabled processors.
Enhanced support for CHAR
and VARCHAR
types in the COMPUTE
STATS
statement.
The amount of memory required during setup for "spill to disk" operations is greatly reduced. This enhancement reduces the chance of a memory-intensive join or aggregation query failing with an out-of-memory error.
Several new conditional functions provide enhanced compatibility when porting code that uses industry
extensions. The new functions are: isfalse()
, isnotfalse()
,
isnottrue()
, istrue()
, nonnullvalue()
, and
nullvalue()
. See Impala Conditional Functions
for details.
The Impala debug web UI can now display a visual representation of the query plan. On the /queries tab, select Details for a particular query. The Details page includes a Plan tab with a plan diagram that you can zoom in or out using mouse-wheel or trackpad scroll gestures.
This release contains the following enhancements to query performance and system scalability:
Impala can now collect statistics for individual partitions in a partitioned table, rather than
processing the entire table for each COMPUTE STATS
statement. This feature is known as
incremental statistics, and is controlled by the COMPUTE INCREMENTAL STATS
syntax.
(You can still use the original COMPUTE STATS
statement for nonpartitioned tables or
partitioned tables that are unchanging or whose contents are entirely replaced all at once.) See
COMPUTE STATS Statement and
Table and Column Statistics for details.
Optimization for small queries lets Impala process queries that process very few rows without the
unnecessary overhead of parallelizing and generating native code. Reducing this overhead lets Impala
clear small queries quickly, keeping YARN resources and admission control slots available for
data-intensive queries. The number of rows considered to be a "small" query is controlled by the
EXEC_SINGLE_NODE_ROWS_THRESHOLD
query option. See
EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (Impala 2.1 or higher only) for details.
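For example (the threshold value shown is arbitrary):
  -- Queries estimated to return at most 500 rows run on a single node,
  -- skipping the parallelization and code-generation overhead.
  SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;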
An enhancement to the statestore component lets it transmit heartbeat information independently of broadcasting metadata updates. This optimization improves reliability of health checking on large clusters with many tables and partitions.
The memory requirement for querying gzip-compressed text is reduced. Now Impala decompresses the data as it is read, rather than reading the entire gzipped file and decompressing it in memory.
The following are the major new features in Impala 2.0. This major release contains improvements to performance, scalability, security, and SQL syntax.
Queries with joins or aggregation functions involving high volumes of data can now use temporary work
areas on disk, reducing the chance of failure due to out-of-memory errors. When the required memory for
the intermediate result set exceeds the amount available on a particular node, the query automatically
uses a temporary work area on disk. This "spill to disk" mechanism is similar to the ORDER
BY
improvement from Impala 1.4. For details, see
SQL Operations that Spill to Disk.
Subquery enhancements:
Subqueries are now allowed in the WHERE clause, for example with the IN operator.
The EXISTS and NOT EXISTS operators are available. They are always used in conjunction with subqueries.
The IN and NOT IN operators can now operate on the result set from a subquery, not just a hardcoded list of values.
Uncorrelated subqueries let you compare against one or more values for equality, IN, and EXISTS comparisons. For example, you might use WHERE clauses such as WHERE column = (SELECT MAX(some_other_column) FROM table) or WHERE column IN (SELECT some_other_column FROM table WHERE conditions).
Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(), COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.
For details about subqueries, see Subqueries in Impala SELECT Statements. For information about new and improved operators, see EXISTS Operator and IN Operator.
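A minimal sketch of the new subquery forms (table and column names are hypothetical):
  -- EXISTS with a correlated subquery:
  SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
  -- IN operating on a subquery result set instead of a literal list:
  SELECT name FROM products
    WHERE id IN (SELECT product_id FROM order_items);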
Analytic functions such as RANK()
, LAG()
, LEAD()
,
and FIRST_VALUE()
let you analyze sequences of rows with flexible ordering and
grouping. Existing aggregate functions such as MAX()
, SUM()
, and
COUNT()
can also be used in an analytic context. See
Impala Analytic Functions for details. See
Impala Aggregate Functions for enhancements to existing
aggregate functions.
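A short sketch of an analytic query (the schema is hypothetical):
  -- Rank employees by salary within each department:
  SELECT name, dept, salary,
         RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
    FROM employees;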
New data types provide greater compatibility with source code from traditional database systems:
VARCHAR
is like the STRING
data type, but with a maximum length.
See VARCHAR Data Type (Impala 2.0 or higher only) for details.
CHAR
is like the STRING
data type, but with a precise length. Short
values are padded with spaces on the right. See CHAR Data Type (Impala 2.0 or higher only) for details.
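For example (hypothetical table):
  -- CHAR values shorter than 2 characters are padded with trailing spaces;
  -- VARCHAR enforces a maximum length of 50.
  CREATE TABLE contacts (country_code CHAR(2), name VARCHAR(50));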
GRANT
statement. See GRANT Statement (Impala 2.0 or higher only) for details.
REVOKE
statement. See REVOKE Statement (Impala 2.0 or higher only) for details.
CREATE ROLE
statement. See CREATE ROLE Statement (Impala 2.0 or higher only) for
details.
DROP ROLE
statement. See DROP ROLE Statement (Impala 2.0 or higher only) for
details.
SHOW ROLES
and SHOW ROLE GRANT
statements. See
SHOW Statement for details.
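A sketch of the new statements used together (the role and group names are hypothetical):
  CREATE ROLE analyst;
  GRANT ROLE analyst TO GROUP analysts;
  GRANT SELECT ON DATABASE sales TO ROLE analyst;
  SHOW ROLE GRANT GROUP analysts;
  REVOKE SELECT ON DATABASE sales FROM ROLE analyst;
  DROP ROLE analyst;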
To complement the HDFS encryption feature, a new Impala configuration option,
--disk_spill_encryption
secures sensitive data from being observed or tampered
with when temporarily stored on disk.
The new security-related SQL statements work along with the Sentry authorization framework. See Enabling Sentry Authorization for Impala for details.
Impala can now read compressed text files compressed by gzip, bzip, or Snappy. These files do not
require any special table settings to work in an Impala text table. Impala recognizes the compression
type automatically based on file extensions of .gz
, .bz2
, and
.snappy
respectively. These types of compressed text files are intended for
convenience with existing ETL pipelines. Their non-splittable nature means they are not optimal for
high-performance parallel queries. See Using bzip2, deflate, gzip, Snappy, or zstd Text Files for details.
Query hints can now use comment notation, /* +hint_name */
or
-- +hint_name
, at the same places in the query where the hints
enclosed by [ ]
are recognized. This enhancement makes it easier to reuse Impala
queries on other database systems. See Optimizer Hints for details.
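For instance, the same hint in the three accepted notations (table names hypothetical):
  SELECT * FROM t1 JOIN [BROADCAST] t2 ON t1.id = t2.id;
  SELECT * FROM t1 JOIN /* +BROADCAST */ t2 ON t1.id = t2.id;
  SELECT * FROM t1 JOIN -- +BROADCAST
    t2 ON t1.id = t2.id;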
A new query option, QUERY_TIMEOUT_S
, lets you specify a timeout period in seconds for
individual queries.
The working of the --idle_query_timeout configuration option is extended. If no QUERY_TIMEOUT_S query option is in effect, --idle_query_timeout works the same as before, setting the timeout interval. When the QUERY_TIMEOUT_S query option is specified, its maximum value is capped by the value of the --idle_query_timeout option.
That is, the system administrator sets the default and maximum timeout through the
--idle_query_timeout
startup option, and then individual users or applications can set
a lower timeout value if desired through the QUERY_TIMEOUT_S
query option. See
Setting Timeout Periods for Daemons, Queries, and Sessions and
QUERY_TIMEOUT_S Query Option (Impala 2.0 or higher only) for details.
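For example (the value shown is arbitrary):
  -- Cancel queries in this session after they are idle for 60 seconds,
  -- subject to any ceiling set through --idle_query_timeout.
  SET QUERY_TIMEOUT_S=60;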
New functions VAR_SAMP()
and VAR_POP()
are aliases for the existing
VARIANCE_SAMP()
and VARIANCE_POP()
functions.
A new date and time function, DATE_PART()
, provides similar functionality to
EXTRACT()
. You can also call the EXTRACT()
function using the SQL-99
syntax, EXTRACT(unit FROM timestamp)
. These
enhancements simplify the porting process for date-related code from other systems. See
Impala Date and Time Functions for details.
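The equivalent forms, side by side:
  SELECT extract(now(), 'year');      -- original Impala syntax
  SELECT extract(YEAR FROM now());    -- SQL-99 syntax
  SELECT date_part('year', now());    -- DATE_PART() equivalent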
New approximation features provide a fast way to get results when absolute precision is not required:
The APPX_COUNT_DISTINCT
query option lets Impala rewrite
COUNT(DISTINCT)
calls to use NDV()
instead, which speeds up the
operation and allows multiple COUNT(DISTINCT)
operations in a single query. See
APPX_COUNT_DISTINCT Query Option (Impala 2.0 or higher only) for details.
The APPX_MEDIAN()
aggregate function produces an estimate for the median value of a
column by using sampling. See APPX_MEDIAN Function for details.
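A brief sketch (the table and columns are hypothetical):
  SET APPX_COUNT_DISTINCT=true;
  -- Rewritten internally to NDV(), so several DISTINCT aggregates
  -- can appear in one query:
  SELECT COUNT(DISTINCT customer_id), COUNT(DISTINCT product_id) FROM orders;
  SELECT APPX_MEDIAN(total) FROM orders;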
Impala now supports a DECODE()
function. This function works as a shorthand for a
CASE()
expression, and improves compatibility with SQL code containing vendor
extensions. See Impala Conditional Functions for details.
The STDDEV()
, STDDEV_POP()
, STDDEV_SAMP()
,
VARIANCE()
, VARIANCE_POP()
, VARIANCE_SAMP()
, and
NDV()
aggregate functions now all return DOUBLE
results rather than
STRING
. Formerly, you were required to CAST()
the result to a numeric
type before using it in arithmetic operations.
The default settings for Parquet block size, and the associated PARQUET_FILE_SIZE
query option, are changed. Now, Impala writes Parquet files with a size of 256 MB and an HDFS block
size of 256 MB. Previously, Impala attempted to write Parquet files with a size of 1 GB and an HDFS
block size of 1 GB. In practice, Impala used a conservative estimate of the disk space needed for each
Parquet block, leading to files that were typically 512 MB anyway. Thus, this change will make the file
size more accurate if you specify a value for the PARQUET_FILE_SIZE
query option. It
also reduces the amount of memory reserved during INSERT
into Parquet tables,
potentially avoiding out-of-memory errors and improving scalability when inserting data into Parquet
tables.
Anti-joins are now supported, expressed using the LEFT ANTI JOIN
and RIGHT
ANTI JOIN
clauses.
These clauses return results from one table that have no match in the other table. You might use this
type of join in the same sorts of use cases as the NOT EXISTS
and NOT
IN
operators. See Joins in Impala SELECT Statements for details.
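For example (hypothetical schema), finding customers with no orders:
  SELECT c.id, c.name
    FROM customers c LEFT ANTI JOIN orders o ON c.id = o.customer_id;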
The SET
command in impala-shell has been promoted to a real SQL
statement. You can now set query options such as PARQUET_FILE_SIZE
,
MEM_LIMIT
, and SYNC_DDL
within JDBC, ODBC, or any other kind of
application that submits SQL without going through the impala-shell interpreter. See
SET Statement for details.
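For instance, through any SQL channel (the values shown are arbitrary):
  SET PARQUET_FILE_SIZE=256m;
  SET MEM_LIMIT=2gb;
  SET SYNC_DDL=true;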
The impala-shell interpreter now reads settings from an optional configuration file, named $HOME/.impalarc by default. See impala-shell Configuration File for details.
The library used for regular expression parsing has changed from Boost to Google RE2. This
implementation change adds support for non-greedy matches using the .*?
notation. This
and other changes in the way regular expressions are interpreted means you might need to re-test
queries that use functions such as regexp_extract()
or
regexp_replace()
, or operators such as REGEXP
or
RLIKE
. See Incompatible Changes and Limitations in Apache Impala for
those details.
The following are the major new features in Impala 1.4:
The DECIMAL
data type lets you store fixed-precision values, for working with currency
or other fractional values where it is important to represent values exactly and avoid rounding errors.
This feature includes enhancements to built-in functions, numeric literals, and arithmetic expressions.
See DECIMAL Data Type (Impala 3.0 or higher only) for details.
Where the underlying HDFS support exists, Impala can take advantage of the HDFS caching feature to "pin" entire tables or individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU overhead of memory-to-memory copying. When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory. Other Hadoop components that read the same data files also experience a performance benefit.
For background information about HDFS caching, see
the documentation for your Apache Hadoop distribution. For performance information about using this feature with Impala, see
Using HDFS Caching with Impala (Impala 2.1 or higher only). For the SET CACHED
and
SET UNCACHED
clauses that let you control cached table data through DDL statements,
see CREATE TABLE Statement and
ALTER TABLE Statement.
Impala can now use Sentry-based authorization based either on the original policy file, or on rules
defined by GRANT
and REVOKE
statements issued through Hive.
See Enabling Sentry Authorization for Impala for details.
For interoperability with Parquet files created through other Hadoop components, such as Pig or MapReduce jobs, you can create an Impala table that automatically sets up the column definitions based on the layout of an existing Parquet data file. See CREATE TABLE Statement for the syntax, and Creating Parquet Tables in Impala for usage information.
ORDER BY
queries no longer require a LIMIT
clause. If the size of the
result set to be sorted exceeds the memory available to Impala, Impala uses a temporary work space on
disk to perform the sort operation. See ORDER BY Clause
for details.
LDAP connections can be secured through either SSL or TLS. See Enabling LDAP Authentication for Impala for details.
The following new built-in scalar and aggregate functions are available:
A new built-in function, EXTRACT()
, returns one date or time field from a
TIMESTAMP
value. See
Impala Date and Time Functions for details.
A new built-in function, TRUNC()
, truncates date/time values to a particular
granularity, such as year, month, day, hour, and so on. See
Impala Date and Time Functions for details.
A new built-in function, ADD_MONTHS(), is an alias for the existing
MONTHS_ADD()
function. See
Impala Date and Time Functions for details.
A new built-in function, ROUND()
, rounds DECIMAL
values to a
specified number of fractional digits. See
Impala Mathematical Functions for details.
Several built-in aggregate functions for computing properties for statistical distributions:
STDDEV()
, STDDEV_SAMP()
, STDDEV_POP()
,
VARIANCE()
, VARIANCE_SAMP()
, and VARIANCE_POP()
.
See STDDEV, STDDEV_SAMP, STDDEV_POP Functions and
VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP Functions for details.
Several new built-in functions, such as MAX_INT()
,
MIN_SMALLINT()
, and so on, let you conveniently check whether data values are in
an expected range. You might be able to switch a column to a smaller type, saving memory during
processing. See Impala Mathematical Functions for
details.
New built-in functions, IS_INF()
and IS_NAN()
, check for the
special values infinity and "not a number". These values could be specified as
inf
or nan
in text data files, or be produced by certain
arithmetic expressions. See
Impala Mathematical Functions for details.
The SHOW PARTITIONS
statement displays information about the structure of a
partitioned table. See SHOW Statement for details.
New configuration options for the impalad daemon let you specify initial memory usage for all queries. The initial resource requests handled by Llama and YARN can be expanded later if needed, avoiding unnecessary over-allocation and reducing the chance of out-of-memory conditions. See Resource Management for details.
The CREATE TABLE
statement now has a STORED AS AVRO
clause,
allowing you to create Avro tables through Impala. See
Using the Avro File Format with Impala Tables for details and examples.
New impalad configuration options let you fine-tune the calculations Impala makes to estimate resource requirements for each query. These options can help avoid problems due to overconsumption due to too-low estimates, or underutilization due to too-high estimates. See Resource Management for details.
A new SUMMARY
command in the impala-shell interpreter provides a
high-level summary of the work performed at each stage of the explain plan. The summary is also
included in output from the PROFILE
command. See
impala-shell Command Reference and
Using the SUMMARY Report for Performance Tuning for details.
Performance improvements for the COMPUTE STATS
statement:
The NDV function is sped up through native code generation.
Because the NULL count is not currently used by the Impala query planner, in Impala 1.4.0 and higher, COMPUTE STATS does not count the NULL values for each column. (The #Nulls field of the stats table is left as -1, signifying that the value is unknown.)
See COMPUTE STATS Statement for general details about the COMPUTE
STATS
statement, and Table and Column Statistics for how to use the
statistics to improve query performance.
Performance improvements for partition pruning. This feature reduces the time spent in query planning, for partitioned tables with thousands of partitions. Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, now Impala can comfortably handle tables with tens of thousands of partitions. See Partition Pruning for Queries for information about partition pruning.
The documentation provides additional guidance for planning tasks. See Planning for Impala Deployment.
The impala-shell interpreter now supports UTF-8 characters for input and output. You
can control whether impala-shell ignores invalid Unicode code points through the
--strict_unicode
option. (This option was removed in Impala 2.0.)
No new features. This point release is exclusively a bug fix release for the IMPALA-1019 issue related to HDFS caching.
This point release is primarily a vehicle to deliver bug fixes. Any new features are minor changes resulting from fixes for performance, reliability, or usability issues.
A new impalad startup option, --insert_inherit_permissions
, causes
Impala INSERT
statements to create each new partition with the same HDFS permissions
as its parent directory. By default, INSERT
statements create directories for new
partitions using default HDFS permissions. See INSERT Statement for examples of
INSERT
statements for partitioned tables.
The SHOW FUNCTIONS
statement now displays the return type of each function, in
addition to the types of its arguments. See SHOW Statement for examples.
You can now specify the clause FIELDS TERMINATED BY '\0'
with a CREATE
TABLE
statement to use text data files that use ASCII 0 (nul
) characters as a
delimiter. See Using Text Data Files with Impala Tables for details.
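A minimal sketch (the table is hypothetical):
  CREATE TABLE nul_delimited (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
    STORED AS TEXTFILE;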
In Impala 1.3.1 and higher, the REGEXP
and RLIKE
operators now match a regular expression string that occurs anywhere inside the target
string, the same as if the regular expression was enclosed on each side by
.*
. See REGEXP Operator for
examples. Previously, these operators only succeeded when the regular expression matched
the entire target string. This change improves compatibility with the regular expression
support for popular database systems. There is no change to the behavior of the
regexp_extract()
and regexp_replace()
built-in
functions.
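For example:
  SELECT 'English' REGEXP 'glis';        -- now true: substring match
  SELECT 'English' REGEXP '^English$';   -- anchors restore whole-string matching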
The admission control feature lets you control and prioritize the volume and resource consumption of concurrent queries. This mechanism reduces spikes in resource usage, helping Impala to run alongside other kinds of workloads on a busy cluster. It also provides more user-friendly conflict resolution when multiple memory-intensive queries are submitted concurrently, avoiding resource contention that formerly resulted in out-of-memory errors. See Admission Control and Query Queuing for details.
Enhanced EXPLAIN
plans provide more detail in an easier-to-read format. Now there are
four levels of verbosity: the EXPLAIN_LEVEL
option can be set from 0 (most concise) to
3 (most verbose). See EXPLAIN Statement for syntax and
Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles for usage information.
The TIMESTAMP
data type accepts more kinds of input string formats through the
UNIX_TIMESTAMP
function, and produces more varieties of string formats through the
FROM_UNIXTIME
function. The documentation now also lists more functions for date
arithmetic, used for adding and subtracting INTERVAL
expressions from
TIMESTAMP
values. See Impala Date and Time Functions
for details.
New conditional functions, NULLIF()
, NULLIFZERO()
, and
ZEROIFNULL()
, simplify porting SQL containing vendor extensions to Impala. See
Impala Conditional Functions for details.
New utility function, CURRENT_DATABASE()
. See
Impala Miscellaneous Functions for details.
Integration with the YARN resource management framework. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for full details.
On the Impala side, this feature involves some new startup options for the impalad daemon:
-enable_rm
-llama_host
-llama_port
-llama_callback_port
-cgroup_hierarchy_path
For details of these startup options, see Modifying Impala Startup Options.
This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:
MEM_LIMIT
: the function of this existing option changes when Impala resource
management is enabled.
REQUEST_POOL
: a new option. (Renamed to RESOURCE_POOL
in Impala
1.3.0.)
V_CPU_CORES
: a new option.
RESERVATION_REQUEST_TIMEOUT
: a new option.
For details of these query options, see Resource Management.
On Impala startup, the metadata loading and synchronization mechanism has been improved and optimized, to give more responsiveness when starting Impala on a system with a large number of databases, tables, or partitions. The initial metadata loading happens in the background, allowing queries to be run before the entire process is finished. When a query refers to a table whose metadata is not yet loaded, the query waits until the metadata for that table is loaded, and the load operation for that table is prioritized to happen first.
Formerly, if you created a new table in Hive, you had to issue the INVALIDATE METADATA
statement (with no table name) which was an expensive operation that reloaded metadata for all tables.
Impala did not recognize the name of the Hive-created table, so you could not do INVALIDATE
METADATA new_table
to get the metadata for just that one table. Now, when
you issue INVALIDATE METADATA table_name
, Impala checks to see if
that name represents a table created in Hive, and if so recognizes the new table and loads the metadata
for it. Additionally, if the new table is in a database that was newly created in Hive, Impala also
recognizes the new database.
If you issue INVALIDATE METADATA table_name
and the table has been
dropped through Hive, Impala will recognize that the table no longer exists.
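For example, after creating a table through the Hive shell (table name hypothetical):
  INVALIDATE METADATA new_hive_table;
  SELECT COUNT(*) FROM new_hive_table;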
New startup options let you control the parallelism of the metadata loading during startup for the catalogd daemon:
--load_catalog_in_background
makes Impala load and cache metadata using background
threads after startup. It is true
by default. Previously, a system with a large
number of databases, tables, or partitions could be unresponsive or even time out during startup.
--num_metadata_loading_threads
determines how much parallelism Impala devotes to
loading metadata in the background. The default is 16. You might increase this value for systems
with huge numbers of databases, tables, or partitions. You might lower this value for busy systems
that are CPU-constrained due to jobs from components other than Impala.
Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional fix for compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or MapReduce. If you are upgrading from Impala 1.2.1 or earlier, see New Features in Impala 1.2.2 for the latest added features.
Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over 1.2.1 are performance related, primarily for join queries.
New user-visible features include:
Join order optimizations. This highly valuable feature automatically distributes and parallelizes the
work for a join query to minimize disk I/O and network traffic. The automatic optimization reduces the
need to use query hints or to rewrite join queries with the tables in a specific order based on size or
cardinality. The new COMPUTE STATS
statement gathers statistical information about
each table that is crucial for enabling the join optimizations. See
Performance Considerations for Join Queries for details.
A new COMPUTE STATS statement collects both table statistics and column statistics with a single statement. It is intended to be more comprehensive, efficient, and reliable than the corresponding
Hive ANALYZE TABLE
statement, which collects statistics in multiple phases through
MapReduce jobs. These statistics are important for query planning for join queries, queries on
partitioned tables, and other types of data-intensive operations. For optimal planning of join queries,
you need to collect statistics for each table involved in the join. See
COMPUTE STATS Statement for details.
Reordering of tables in a join query can be overridden by the STRAIGHT_JOIN
operator,
allowing you to fine-tune the planning of the join query if necessary, by using the original technique
of ordering the joined tables in descending order of size. See
Overriding Join Reordering with STRAIGHT_JOIN for details.
The new CROSS JOIN clause in the SELECT statement allows Cartesian products in queries, that is, joins without an equality comparison between columns in both tables.
Because such queries must be carefully checked to avoid accidental overconsumption of memory, you must
use the CROSS JOIN
operator to explicitly select this kind of join. See
Cross Joins and Cartesian Products with the CROSS JOIN Operator for examples.
The ALTER TABLE
statement has new clauses that let you fine-tune table statistics. You
can use this technique as a less-expensive way to update specific statistics, in case the statistics
become stale, or to experiment with the effects of different data distributions on query planning.
LDAP username/password authentication in JDBC/ODBC. See Enabling LDAP Authentication for Impala for details.
A new GROUP_CONCAT() aggregate function concatenates column values across all rows of a result set.
The INSERT
statement now accepts hints, [SHUFFLE]
and
[NOSHUFFLE]
, to influence the way work is redistributed during
INSERT...SELECT
operations. The hints are primarily useful for inserting into
partitioned Parquet tables, where using the [SHUFFLE]
hint can avoid problems due to
memory consumption and simultaneous open files in HDFS, by collecting all the new data for each
partition on a specific node.
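A sketch of the hint placement (tables hypothetical; the partition key column comes last in the SELECT list):
  INSERT INTO sales_parquet PARTITION (year) [SHUFFLE]
    SELECT id, amount, year FROM sales_staging;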
Several built-in functions and operators are now overloaded for more numeric data types, to reduce the
requirement to use CAST()
for type coercion in INSERT
statements. For
example, the expression 2+2
in an INSERT
statement formerly produced
a BIGINT
result, requiring a CAST()
to be stored in an
INT
variable. Now, addition, subtraction, and multiplication only produce a result
that is one step "bigger" than their arguments, and numeric and conditional functions can return
SMALLINT
, FLOAT
, and other smaller types rather than always
BIGINT
or DOUBLE
.
New fnv_hash()
built-in function for constructing hashed values. See
Impala Mathematical Functions for details.
The clause STORED AS PARQUET
is accepted as an equivalent for STORED AS
PARQUETFILE
. This more concise form is recommended for new code.
Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older
1.1.x release straight to 1.2.2, also review New Features in Impala 1.2.1 to see
features such as the SHOW TABLE STATS
and SHOW COLUMN STATS
statements,
and user-defined functions (UDFs).
Impala 1.2.1 includes new features for security, performance, and flexibility.
New user-visible features include:
SHOW TABLE STATS table_name
and SHOW COLUMN STATS
table_name
statements, to verify that statistics are available and to see
the values used during query planning.
CREATE TABLE AS SELECT
syntax, to create a new table and transfer data into it in a
single operation.
OFFSET
clause, for use with the ORDER BY
and LIMIT
clauses to produce "paged" result sets such as items 1-10, then 11-20, and so on.
NULLS FIRST
and NULLS LAST
clauses to ensure consistent placement of
NULL
values in ORDER BY
queries.
New built-in functions: least()
,
greatest()
, initcap()
.
New aggregate function: ndv()
, a fast alternative to COUNT(DISTINCT
col)
returning an approximate result.
The LIMIT
clause can now accept a numeric expression as an argument, rather than only
a literal constant.
The SHOW CREATE TABLE
statement displays the end result of all the CREATE
TABLE
and ALTER TABLE
statements for a particular table. You can use the
output to produce a simplified setup script for a schema.
The --idle_query_timeout
and --idle_session_timeout
options for
impalad control the time intervals after which idle queries are cancelled, and idle
sessions expire. See Setting Timeout Periods for Daemons, Queries, and Sessions for details.
User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
You create UDFs through the CREATE FUNCTION
statement and drop them through the
DROP FUNCTION
statement. See User-Defined Functions (UDFs) for instructions about
coding, building, and deploying UDFs, and CREATE FUNCTION Statement and
DROP FUNCTION Statement for related SQL syntax.
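A hedged sketch of the lifecycle for a C++ scalar UDF (the library path, symbol, and table name are hypothetical):
  CREATE FUNCTION my_lower(STRING) RETURNS STRING
    LOCATION '/user/impala/udfs/libudfsamples.so' SYMBOL='MyLower';
  SELECT my_lower(name) FROM t1;
  DROP FUNCTION my_lower(STRING);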
A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the INVALIDATE METADATA
and
REFRESH
statements after issuing Impala statements such as CREATE
TABLE
, ALTER TABLE
, DROP TABLE
, INSERT
, and
LOAD DATA
.
For even more precise synchronization, you can enable the
SYNC_DDL
query option before issuing
a DDL, INSERT
, or LOAD DATA
statement. This option causes the
statement to wait, returning only after the catalog service has broadcast the applicable changes to all
Impala nodes in the cluster.
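For example (table name hypothetical):
  SET SYNC_DDL=true;
  -- Returns only after every Impala node has received the new metadata:
  CREATE TABLE t2 (id INT);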
Because the catalog service only monitors operations performed through Impala, INVALIDATE
METADATA
and REFRESH
are still needed on the Impala side after creating new
tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because
the catalog service broadcasts the result of the REFRESH
and INVALIDATE
METADATA
statements to all Impala nodes, when you do need to use those statements, you can
do so a single time rather than on every Impala node.
This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
The CREATE TABLE
and ALTER TABLE
statements have new clauses
TBLPROPERTIES
and WITH SERDEPROPERTIES
. The
TBLPROPERTIES
clause lets you associate arbitrary items of metadata with a particular
table as key-value pairs. The WITH SERDEPROPERTIES
clause lets you specify the
serializer/deserializer (SerDes) classes that read and write data for a table; although Impala does not
make use of these properties, sometimes particular values are needed for Hive compatibility. See
CREATE TABLE Statement and
ALTER TABLE Statement for details.
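A minimal sketch (the keys and values are hypothetical; Impala stores them but does not interpret them):
  CREATE TABLE events (id INT, payload STRING)
    TBLPROPERTIES ('created_by'='nightly_etl');
  ALTER TABLE events SET TBLPROPERTIES ('validated'='true');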
Delegation support lets you authorize certain OS users associated with applications (for example,
hue
), to submit requests using the credentials of other users.
See Configuring Impala Delegation for Clients for details.
Enhancements to EXPLAIN
output. In particular, when you enable the new
EXPLAIN_LEVEL
query option, the EXPLAIN
and PROFILE
statements produce more verbose output showing estimated resource requirements and whether table and
column statistics are available for the applicable tables and columns. See
EXPLAIN Statement for details.
SHOW CREATE TABLE
summarizes the effects of the original CREATE TABLE
statement and any subsequent ALTER TABLE
statements, giving you a CREATE
TABLE
statement that will re-create the current structure and layout for a table.
The LIMIT
clause for queries now accepts an arithmetic expression, in addition to
numeric literals.
The Impala 1.2.0 beta includes new features for security, performance, and flexibility.
New user-visible features include:
User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.
You create UDFs through the CREATE FUNCTION
statement and drop them through the
DROP FUNCTION
statement. See User-Defined Functions (UDFs) for instructions about
coding, building, and deploying UDFs, and CREATE FUNCTION Statement and
DROP FUNCTION Statement for related SQL syntax.
A new service automatically propagates changes to table data and metadata made by one Impala node,
sending the new or updated metadata to all the other Impala nodes. The automatic synchronization
mechanism eliminates the need to use the INVALIDATE METADATA
and
REFRESH
statements after issuing Impala statements such as CREATE
TABLE
, ALTER TABLE
, DROP TABLE
, INSERT
, and
LOAD DATA
.
Because this service only monitors operations performed through Impala, INVALIDATE
METADATA
and REFRESH
are still needed on the Impala side after creating new
tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because
the catalog service broadcasts the result of the REFRESH
and INVALIDATE
METADATA
statements to all Impala nodes, when you do need to use those statements, you can
do so a single time rather than on every Impala node.
This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.
Integration with the YARN resource management framework. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Resource Management for full details.
On the Impala side, this feature involves some new startup options for the impalad daemon:
-enable_rm
-llama_host
-llama_port
-llama_callback_port
-cgroup_hierarchy_path
For details of these startup options, see Modifying Impala Startup Options.
This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:
MEM_LIMIT
: the function of this existing option changes when Impala resource
management is enabled.
YARN_POOL
: a new option. (Renamed to RESOURCE_POOL
in Impala
1.3.0.)
V_CPU_CORES
: a new option.
RESERVATION_REQUEST_TIMEOUT
: a new option.
For details of these query options, see Resource Management.
CREATE TABLE ... AS SELECT
syntax, to create a table and copy data into it in a single
operation. See CREATE TABLE Statement for details.
The CREATE TABLE
and ALTER TABLE
statements have a new
TBLPROPERTIES
clause that lets you associate arbitrary items of metadata with a
particular table as key-value pairs. See CREATE TABLE Statement and
ALTER TABLE Statement for details.
Delegation support lets you authorize certain OS users associated with applications (for example,
hue
), to submit requests using the credentials of other users.
See Configuring Impala Delegation for Clients for details.
Enhancements to EXPLAIN
output. In particular, when you enable the new
EXPLAIN_LEVEL
query option, the EXPLAIN
and PROFILE
statements produce more verbose output showing estimated resource requirements and whether table and
column statistics are available for the applicable tables and columns. See
EXPLAIN Statement for details.
Impala 1.1.1 includes new features for security and stability.
New user-visible features include:
Impala 1.1 includes new features for security, performance, and usability.
New user-visible features include:
A new WITH
clause for SELECT
statements lets you simplify complicated
queries in a way similar to creating a view. The effects of the WITH
clause only last
for the duration of one query, unlike views, which are persistent schema objects that can be used by
multiple sessions or applications. See WITH Clause.
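For example (hypothetical tables):
  WITH big_orders AS (SELECT * FROM orders WHERE total > 1000)
  SELECT customer_id, COUNT(*) FROM big_orders GROUP BY customer_id;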
A new variation of the DESCRIBE
statement, DESCRIBE FORMATTED
table_name
, displays more detailed information about the table. This
information includes the file format, location, delimiter, ownership, external or internal, creation and
access times, and partitions. The information is returned as a result set that can be interpreted and
used by a management or monitoring application. See DESCRIBE Statement.
The INSERT statement can now specify a subset of the columns in the destination table, with the remaining columns set to NULL values. Or you can specify the columns in any order in the destination table, rather than having to match the order of the corresponding columns in the source table or VALUES clause. This feature is known as "column permutation". See INSERT Statement.
The new LOAD DATA
statement lets you load data into a table directly from an HDFS data
file. This technique lets you minimize the number of steps in your ETL process, and provides more
flexibility. For example, you can bring data into an Impala table in one step. Formerly, you might have
created an external table where the data files are not entirely under your control, or copied the data
files to Impala data directories manually, or loaded the original data into one table and then used the
INSERT
statement to copy it to a new table with a different file format, partitioning
scheme, and so on. See LOAD DATA Statement.
New query options for queries against HBase tables: HBASE_CACHE_BLOCKS and HBASE_CACHING.
You can now issue REFRESH as a SQL statement through any of the programming interfaces that
Impala supports. REFRESH
formerly had to be issued as a command through the
impala-shell interpreter, and was not available through a JDBC or ODBC API call. As
part of this change, the functionality of the REFRESH
statement is divided between two
statements. In Impala 1.1, REFRESH
requires a table name argument and immediately
reloads the metadata; the new INVALIDATE METADATA
statement works the same as the Impala
1.0 REFRESH
did: the table name argument is optional, and the metadata for one or all
tables is marked as stale, but not actually reloaded until the table is queried. When you create a new
table in the Hive shell or through a different Impala node, you must enter INVALIDATE
METADATA
with no table parameter before you can see the new table in
impala-shell. See REFRESH Statement and
INVALIDATE METADATA Statement.
New user-visible features include:
A new VALUES
clause lets you INSERT
one or more rows using literals,
function return values, or other expressions. For performance and scalability, you should still use
INSERT ... SELECT
for bringing large quantities of data into an Impala table. The
VALUES
clause is a convenient way to set up small tables, particularly for initial
testing of SQL features that do not require large amounts of data. See
VALUES Clause for details.
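For example (the table is hypothetical):
  CREATE TABLE test_data (id INT, s STRING);
  -- Rows can mix literals and function return values:
  INSERT INTO test_data VALUES (1, 'a'), (2, 'b'), (3, upper('c'));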
The -B and -o options of the impala-shell command can
turn query results into delimited text files and store them in an output file. The plain text results are
useful for using with other Hadoop components or Unix tools. In benchmark tests, it is also faster to
produce plain rather than pretty-printed results, and write to a file rather than to the screen, giving a
more accurate picture of the actual query time.
This version has multiple performance improvements and adds the following functionality:
Support for the ALTER TABLE statement.
Support for issuing REFRESH for a single table.
This version has multiple performance improvements and adds the following functionality:
JDBC support. To connect through JDBC, you set the CLASSPATH on the client to include the JDBC driver JARs.
The debug web server's document root is now ${IMPALA_HOME}/www. This can be disabled by setting --enable_webserver_doc_root=false on the command line. As a result, Impala now uses the Twitter Bootstrap library to style its debug webpages, and the /queries page now tracks the last 25 queries run by each Impala daemon.
The state-store-service binary has been renamed statestored.
The Impala configuration files have moved from the /usr/lib/impala/conf directory to the /etc/impala/conf directory.
Default query options can now be specified as a startup argument to impalad. The format is:
-default_query_options='key=value;key=value'