New Features in Apache Impala

This release of Impala contains the following changes and enhancements from previous releases.

New Features in Impala 3.4

The following sections describe the noteworthy improvements made in Impala 3.4.

For the full list of issues closed in this release, see the changelog for Impala 3.4.

Support for Hive Insert-Only Transactional Tables

Impala now supports truncating insert-only transactional tables.

By default, Impala creates an insert-only transactional table when you issue the CREATE TABLE statement.

Use Hive compaction to merge small files, which improves the performance and scalability of metadata in transactional tables.

See Impala Transactions for more information.

Server-side Spooling of Query Results

You can use the SPOOL_QUERY_RESULTS query option to control how query results are returned to the client.

By default, when a client fetches a set of query results, the remaining results are fetched in batches until all the result rows are produced. If a client issues a query without fetching all the results, the query fragments continue to hold on to their resources until the query is canceled and unregistered, potentially tying up resources and causing other queries to wait in admission control.

When the query result spooling feature is enabled, the result sets of queries are eagerly fetched and buffered until they are read by the client, and resources are freed up for other queries.

See Spooling Impala Query Results for the new feature and the query options.

Cookie-based Authentication

Starting in this version, Impala supports cookies for authentication when clients connect via HiveServer2 over HTTP.

You can use the --max_cookie_lifetime_s startup flag to:

  • Disable the use of cookies
  • Control how long generated cookies are valid for

See Impala Client Access for more information.

Object Ownership Support

Object ownership for tables, views, and databases is enabled by default in Impala. When you create a database, a table, or a view, as the owner of that object, you implicitly have the privileges on the object. The privileges that owners have are specified in Ranger on the special user, {OWNER}.

The {OWNER} user must be defined in Ranger for the object ownership privileges to work in Impala.

See Impala Authorization for details.

New Built-in Functions for Fuzzy Matching of Strings

Use the new Jaro or Jaro-Winkler functions to perform fuzzy matches on relatively short strings, e.g. to scrub user inputs of names against the records in the database.

  • JARO_DISTANCE, JARO_DST
  • JARO_SIMILARITY, JARO_SIM
  • JARO_WINKLER_DISTANCE, JW_DST
  • JARO_WINKLER_SIMILARITY, JW_SIM

See Impala String Functions for details.
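To give a sense of what these functions measure, the following is a minimal Python sketch of the textbook Jaro and Jaro-Winkler similarity. It is not Impala's implementation: the function names are hypothetical, the scaling factor of 0.1 and the prefix cap of 4 are the customary textbook defaults assumed here, and any boost threshold or optional arguments that the built-in functions may accept are omitted. The corresponding distance is conventionally 1 minus the similarity.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0.0, 1.0]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and close together.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_similarity(s1: str, s2: str,
                            scaling: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * scaling * (1.0 - sim)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    """Distance form: 0.0 for identical strings."""
    return 1.0 - jaro_winkler_similarity(s1, s2)
```

The prefix boost is what makes the Winkler variant a reasonable fit for names, where typing errors tend to occur later in the string.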

Capacity Quota for Scratch Disks

When configuring scratch space for intermediate files used in large sorts, joins, aggregations, or analytic function operations, use the --scratch_dirs startup flag to optionally specify a capacity quota per scratch directory, e.g., --scratch_dirs=/dir1:5MB,/dir2.

See How Impala Works with Hadoop File Formats for details.
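The shape of the flag value can be illustrated with a small Python sketch that splits a value such as /dir1:5MB,/dir2 into directories and optional quotas. This is a hypothetical helper, not Impala's own parser: the function names are invented for illustration, and the unit table (binary multiples, with a bare number read as bytes) is an assumption.

```python
from typing import List, Optional, Tuple

# Assumed unit multipliers; the units Impala actually accepts may differ.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def _parse_bytes(text: str) -> int:
    """Convert a size such as '5MB' to a number of bytes."""
    text = text.strip().upper()
    # Try longer suffixes first so that 'MB' is not mistaken for 'B'.
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return int(text[:-len(unit)]) * _UNITS[unit]
    return int(text)


def parse_scratch_dirs(value: str) -> List[Tuple[str, Optional[int]]]:
    """Return (directory, quota_in_bytes) pairs; quota is None when omitted."""
    result = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        path, sep, quota = entry.partition(":")
        result.append((path, _parse_bytes(quota) if sep else None))
    return result
```

With the example from the text, parse_scratch_dirs("/dir1:5MB,/dir2") gives a quota for /dir1 and no quota for /dir2.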

Query Option for Disabling HBase Row Estimation

During query plan generation, Impala samples underlying HBase tables to estimate row count and row size, but the sampling process can negatively impact the planning time. To alleviate the issue, when the HBase table stats do not change much in a short time, disable the sampling with the DISABLE_HBASE_NUM_ROWS_ESTIMATE query option so that the Impala planner falls back to using Hive Metastore (HMS) table stats instead.

See DISABLE_HBASE_NUM_ROWS_ESTIMATE Query Option.

Query Option for Controlling Size of Parquet Splits on Non-block Stores

To optimize query performance, the Impala planner uses the value of the fs.s3a.block.size startup flag when calculating the split size on non-block-based stores, e.g. S3 and ADLS. Starting in this release, the Impala planner uses the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to get the Parquet-specific split size.

For Parquet files, the fs.s3a.block.size startup flag is no longer used.

The default value of the PARQUET_OBJECT_STORE_SPLIT_SIZE query option is 256 MB.

See Using Impala with Amazon S3 Object Store for tuning Impala query performance for S3.

Query Profile Exported to JSON

On the Query Details page of Impala Daemon Web UI, you have a new option, in addition to the existing Thrift and Text formats, to export the query profile output in the JSON format.

See Impala Web User Interface for Debugging for generating JSON query profile outputs in Web UI.

DATE Data Type Supported in Avro Tables

You can now use the DATE data type to query date values from Avro tables.

See Using the Avro File Format with Impala Tables for details.

Primary Key and Foreign Key Constraints

This release adds support for primary and foreign key constraints. The constraints are advisory and are intended for estimating cardinality during query planning in a future release. Impala does not attempt to enforce them. See CREATE TABLE Statement for details.

Enhanced External Kudu Table

By default, HMS implicitly translates internal Kudu tables to external Kudu tables with the 'external.table.purge' property set to true. These tables behave similarly to internal tables. You can also create such external Kudu tables explicitly. See CREATE TABLE Statement for details.

Ranger Column Masking

This release supports Ranger column masking, which hides sensitive columnar data in Impala query output. For example, you can define a policy that reveals only the first or last four characters of column data. Column masking is enabled by default. See Ranger Column Masking for details.

BROADCAST_BYTES_LIMIT Query Option

Use the BROADCAST_BYTES_LIMIT query option to set the default limit for the size of the broadcast input. Such a limit can prevent possible performance problems.

Experimental Support for Apache Hudi

In this release, you can use Read Optimized Queries on Hudi tables. See Using the Hudi File Format for details.

ORC Reads Enabled by Default

Impala stability and performance have been improved. Consequently, ORC reads are now enabled in Impala by default. To disable ORC reads, set --enable_orc_scanner to false when starting the cluster. See Using the ORC File Format with Impala Tables for details.

Support for ZSTD and DEFLATE

This release supports ZSTD and DEFLATE compression codecs for text files. See Using bzip2, deflate, gzip, Snappy, or zstd Text Files for details.

New Features in Impala 3.3

The following sections describe the noteworthy improvements made in Impala 3.3.

For the full list of issues closed in this release, see the changelog for Impala 3.3.

Increased Compatibility with Apache Projects

Impala is integrated with the following components:
  • Apache Ranger: Use Apache Ranger to manage authorization in Impala. See Impala Authorization for details.

  • Apache Atlas: Use Apache Atlas to manage data governance in Impala.

  • Hive 3

Parquet Page Index

To improve performance when using Parquet files, Impala can now write page indexes in Parquet files and use those indexes to skip pages for faster scans.

See Query Performance for Impala Parquet Tables for details.
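The idea behind page skipping can be sketched in a few lines of Python. The sketch assumes that an index records the minimum and maximum value of each page, so a scan with a range predicate only needs to read pages whose range overlaps the predicate. The PageStats type and pages_to_scan function are hypothetical names used for illustration, not part of Impala or of the Parquet format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PageStats:
    """Hypothetical per-page statistics of the kind a page index could hold."""
    first_row: int
    min_value: int
    max_value: int


def pages_to_scan(page_index: Sequence[PageStats], lower: int, upper: int) -> List[int]:
    """Return the positions of pages that may contain values in [lower, upper].

    A page is skipped when its [min_value, max_value] range cannot overlap
    the requested range, so its data never has to be read or decoded.
    """
    return [
        position
        for position, page in enumerate(page_index)
        if page.max_value >= lower and page.min_value <= upper
    ]


# Three pages of a column sorted by value; a filter on 150..180 touches one page.
example_index = [
    PageStats(first_row=0, min_value=1, max_value=99),
    PageStats(first_row=1000, min_value=100, max_value=199),
    PageStats(first_row=2000, min_value=200, max_value=299),
]
selected = pages_to_scan(example_index, 150, 180)
```

Skipping is most effective when the data is sorted or clustered on the filtered column, because page ranges then overlap less.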

The Remote File Handle Cache Supports S3

Impala can now cache remote file handles for tables that store their data in Amazon S3 cloud storage.

See Scalability Considerations for File Handle Caching for information on the remote file handle cache.

Support for Kudu Integrated with Hive Metastore

In Impala 3.3 and Kudu 1.10, Kudu is integrated with Hive Metastore (HMS), and from Impala, you can create, update, delete, and query the tables in the Kudu services integrated with HMS.

See Using Kudu with Impala for information on using Kudu tables in Impala.

Zstd Compression for Parquet Files

Zstandard (Zstd) is a real-time compression algorithm offering a tradeoff between speed and ratio of compression. Compression levels from 1 up to 22 are supported. The lower the level, the faster the speed at the cost of compression ratio.

Lz4 Compression for Parquet Files

Lz4 is a lossless compression algorithm providing extremely fast and scalable compression and decompression.

Data Cache for Remote Reads

To improve performance on multi-cluster HDFS environments as well as on object store environments, Impala now caches data for non-local reads (e.g. S3, ABFS, ADLS) on local storage.

The data cache is enabled with the --data_cache startup flag.

See Impala Remote Data Cache for information about the remote data cache and the steps to enable it.

Metadata Performance Improvements

The following features to improve metadata performance are enabled by default in this release:

  • Incremental stats are now compressed in memory in catalogd, reducing memory footprint in catalogd.

  • impalad coordinators fetch incremental stats from catalogd on-demand, reducing the memory footprint and the network requirements for broadcasting metadata.

  • Time-based and memory-based automatic invalidation of metadata to keep the size of metadata bounded and to reduce the chances of the catalogd cache running out of memory.

  • Automatic invalidation of metadata

    With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.

    In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:

    • INSERT into tables and partitions from Impala or from Spark, in single-cluster or multi-cluster configurations

See Metadata Management for information on the above features.

Scalable Pool Configuration in Admission Controller

To offer more dynamic and flexible resource management, Impala supports new configuration parameters that scale with the number of hosts in the resource pool. You can use the parameters to control the number of running queries, the number of queued queries, and the maximum amount of memory allocated for Impala resource pools. See Admission Control and Query Queuing for information about the new parameters and how to use them for admission control.

Query Profile

The following information was added to the Query Profile output for better monitoring and troubleshooting of query performance.

  • Network I/O throughput

  • System disk I/O throughput

See Impala Query Profile for information on generating and reading query profiles.

DATE Data Type and Functions

You can use the new DATE type to describe a particular year/month/day, in the form YYYY-MM-DD.

This initial DATE type supports the TEXT, Parquet, and HBASE file formats.

The support of DATE data type includes the following features:

  • DATE type column as a partitioning key column
  • DATE literal
  • Implicit casting between DATE and other types: STRING and TIMESTAMP
  • Most of the built-in functions for TIMESTAMP now allow the DATE type arguments, as well.

See DATE Data Type and Impala Date and Time Functions for using the DATE type.
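As a loose analogy only, the following Python sketch shows the YYYY-MM-DD form and conversions between a date, a string, and a timestamp using the standard datetime module. It does not call Impala, the helper names are hypothetical, and the choice of midnight when widening a date to a timestamp is an assumption made for the sketch; the referenced topics describe how Impala itself resolves these casts.

```python
from datetime import date, datetime, time


def string_to_date(text: str) -> date:
    """Parse a YYYY-MM-DD string into a calendar date."""
    return date.fromisoformat(text)


def date_to_string(value: date) -> str:
    """Render a date back in the YYYY-MM-DD form."""
    return value.isoformat()


def date_to_timestamp(value: date) -> datetime:
    """Widen a date to a timestamp; midnight is assumed for the time part."""
    return datetime.combine(value, time.min)


def timestamp_to_date(value: datetime) -> date:
    """Narrow a timestamp to a date by dropping the time of day."""
    return value.date()


independence_day = string_to_date("2019-07-04")
```

A date carries no time-of-day or time zone information, which is why narrowing a timestamp to a date loses information while the reverse direction has to assume a time.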

Support for Hive Insert-Only Transactional Tables

Impala now supports creating, dropping, querying, and inserting into insert-only transactional tables.

See Impala Transactions for details.

HiveServer2 HTTP Connection for Clients

Client applications can now connect to Impala over HTTP via HiveServer2, with the option to use Kerberos SPNEGO and LDAP for authentication. See Impala Clients for details.

Default File Format Changed to Parquet

When you create a table, the default format for that table data is now Parquet.

For backward compatibility, you can use the DEFAULT_FILE_FORMAT query option to set the default file format to the previous default, text, or other formats.

Built-in Function to Process JSON Objects

The GET_JSON_OBJECT() function extracts a JSON object from a string based on the specified path and returns the extracted JSON object.

See Impala Miscellaneous Functions for details.
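The general idea of path-based extraction can be sketched in Python as follows. This is an illustration under stated assumptions, not the behavior of the built-in function: the helper name mirrors the SQL function for readability only, the path syntax is limited to a leading $, dotted keys, and [n] array indexes, and the choices to return None for a missing path and to return string values without quotes are assumptions.

```python
import json
import re
from typing import Optional

# One path step: either .key or [index]. Simplified syntax assumed for the sketch.
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object(json_text: str, path: str) -> Optional[str]:
    """Return the value at `path` as a string, or None if it cannot be resolved."""
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except ValueError:
        return None

    position = 1  # skip the leading '$'
    while position < len(path):
        step = _STEP.match(path, position)
        if step is None:
            return None  # unsupported or malformed path
        key, index = step.group(1), step.group(2)
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            offset = int(index)
            if not isinstance(node, list) or offset >= len(node):
                return None
            node = node[offset]
        position = step.end()

    # Strings are returned as-is; other values are serialized back to JSON.
    return node if isinstance(node, str) else json.dumps(node)


document = '{"a": {"b": [10, 20], "name": "impala"}}'
second_item = get_json_object(document, "$.a.b[1]")
```

Returning a serialized string for nested values keeps the result type uniform, at the cost of requiring a second parse if the caller needs the structure.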

Ubuntu 18.04

This version of Impala is certified to run on Ubuntu 18.04.

New Features in Impala 3.2

The following sections describe the noteworthy improvements made in Impala 3.2.

For the full list of issues closed in this release, see the changelog for Impala 3.2.

Multi-cluster Support

  • Remote File Handle Cache

    Impala can now cache remote HDFS file handles when the cache_remote_file_handles impalad flag is set to true. This feature does not apply to non-HDFS tables, such as Kudu or HBase tables, and does not apply to tables that store their data on cloud services, such as S3 or ADLS. See Scalability Considerations for file handle caching in Impala.

Enhancements in Resource Management and Admission Control

  • The Admission Debug page is available in the Impala Daemon (impalad) web UI at /admission and provides the following information about Impala resource pools:
    • Pool configuration
    • Relevant pool stats
    • Queued queries in order of being queued (local to this coordinator)
    • Running queries (local to this coordinator)
    • Histogram of the distribution of peak memory usage by admitted queries
  • A new query option, NUM_ROWS_PRODUCED_LIMIT, was added to limit the number of rows returned from queries.

    Impala will cancel a query if the query produces more rows than the limit specified by this query option. The limit applies only when the results are returned to a client, e.g. for a SELECT query, but not an INSERT query. This query option is a guardrail against users accidentally submitting queries that return a large number of rows.
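The guardrail described above can be pictured with a short Python sketch that wraps a stream of rows and aborts once more rows than the limit have been produced. It is a conceptual illustration, not Impala code: the RowLimitExceeded exception and limit_rows_produced function are hypothetical, and treating a limit of 0 as "no limit" is an assumption made for the sketch.

```python
from typing import Iterable, Iterator, TypeVar

Row = TypeVar("Row")


class RowLimitExceeded(RuntimeError):
    """Raised when a result stream produces more rows than allowed."""


def limit_rows_produced(rows: Iterable[Row], limit: int) -> Iterator[Row]:
    """Yield rows until more than `limit` have been produced, then abort.

    A limit of 0 is treated as 'no limit' (an assumption of this sketch).
    """
    produced = 0
    for row in rows:
        produced += 1
        if limit > 0 and produced > limit:
            raise RowLimitExceeded(
                f"query produced more than {limit} rows and was cancelled"
            )
        yield row


within_limit = list(limit_rows_produced(range(3), limit=3))
```

Unlike a LIMIT clause, which quietly truncates the result, this kind of guardrail fails the whole request, so the caller knows the result would have been larger than intended.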

Metadata Performance Improvements

  • Automatic Metadata Sync using Hive Metastore Notification Events

    When enabled, the catalogd polls Hive Metastore (HMS) notification events at a configurable interval and syncs with HMS. You can use the new web UI pages of the catalogd to check the state of the automatic invalidate event processor.

    Note: This is a preview feature in Impala 3.2.

Compatibility and Usability Enhancements

  • Impala can now read the TIMESTAMP_MILLIS and TIMESTAMP_MICROS Parquet types. See Using Parquet File Format for Impala Tables for the Parquet support in Impala.
  • Impala can now read the complex types in ORC such as ARRAY, STRUCT, and MAP. See Using ORC File Format for Impala Tables for the ORC support in Impala.
  • The LEVENSHTEIN string function is supported.

    The function returns the Levenshtein distance between two input strings, that is, the minimum number of single-character edits required to transform one string into the other.

  • The IF NOT EXISTS clause is supported in the ALTER TABLE statement.
  • The new DEFAULT_FILE_FORMAT query option allows you to set the default table file format. This removes the need for the STORED AS <format> clause. Set this option if you prefer a value that is not TEXT. The supported formats are:
    • TEXT
    • RC_FILE
    • SEQUENCE_FILE
    • AVRO
    • PARQUET
    • KUDU
    • ORC
  • The extended or verbose EXPLAIN output includes the following new information for queries:
    • The text of the analyzed query that may have been rewritten to include various optimizations and implicit casts.
    • The implicit casts and literals shown with the actual types.
  • CPU resource utilization (user, system, iowait) metrics were added to the Impala profile output.
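For the LEVENSHTEIN function mentioned in the list above, the following Python sketch shows the standard dynamic-programming way to compute the distance. It illustrates the definition only and is not Impala's implementation; the function name is chosen for readability.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string `a` into string `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete from a
                current[j - 1] + 1,                   # insert into a
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


kitten_to_sitting = levenshtein("kitten", "sitting")
```

Only two rows of the table are kept at a time, so memory use grows with the length of the shorter string rather than with the product of both lengths.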

Security Enhancement

New Features in Impala 3.1

For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 3.1.

New Features in Impala 3.0

For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 3.0.

New Features in Impala 2.12

For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.12.

New Features in Impala 2.11

For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.11.

New Features in Impala 2.10

For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.10.

New Features in Impala 2.9

For the full list of issues closed in this release, including the issues marked as "new features" or "improvements", see the changelog for Impala 2.9.

The following are some of the most significant new features in this release:

New Features in Impala 2.8

New Features in Impala 2.7

New Features in Impala 2.6

New Features in Impala 2.5

New Features in Impala 2.4

New Features in Impala 2.3

The following are the major new features in Impala 2.3.x. This major release contains improvements to SQL syntax (particularly new support for complex types), performance, manageability, and security.

In Impala 2.3.2, the bug fix for IMPALA-2598 removes the restriction on using both Kerberos and SSL for internal communication between Impala components.

New Features in Impala 2.2

The following are the major new features in Impala 2.2. This release contains improvements to performance, manageability, security, and SQL syntax.

New Features in Impala 2.1

This release contains the following enhancements to query performance and system scalability:

New Features in Impala 2.0

The following are the major new features in Impala 2.0. This major release contains improvements to performance, scalability, security, and SQL syntax.

New Features in Impala 1.4

The following are the major new features in Impala 1.4:

New Features in Impala 1.3.2

No new features. This point release is exclusively a bug fix release for the IMPALA-1019 issue related to HDFS caching.

New Features in Impala 1.3.1

This point release is primarily a vehicle to deliver bug fixes. Any new features are minor changes resulting from fixes for performance, reliability, or usability issues.

New Features in Impala 1.3

New Features in Impala 1.2.4

Note: Impala 1.2.4 is primarily a bug fix release for Impala 1.2.3, plus some performance enhancements for the catalog server to minimize startup and DDL wait times for Impala deployments with large numbers of databases, tables, and partitions.

New Features in Impala 1.2.3

Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional fix for compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or MapReduce. If you are upgrading from Impala 1.2.1 or earlier, see New Features in Impala 1.2.2 for the latest added features.

New Features in Impala 1.2.2

Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over 1.2.1 are performance related, primarily for join queries.

New user-visible features include:

Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older 1.1.x release straight to 1.2.2, also review New Features in Impala 1.2.1 to see features such as the SHOW TABLE STATS and SHOW COLUMN STATS statements, and user-defined functions (UDFs).

New Features in Impala 1.2.1

Note: The Impala 1.2.1 feature set is a superset of features in the Impala 1.2.0 beta, with the exception of resource management, which relies on resource management infrastructure in the underlying Hadoop distribution.

Impala 1.2.1 includes new features for security, performance, and flexibility.

New user-visible features include:

New Features in Impala 1.2.0 (Beta)

The Impala 1.2.0 beta includes new features for security, performance, and flexibility.

New user-visible features include:

New Features in Impala 1.1.1

Impala 1.1.1 includes new features for security and stability.

New user-visible features include:

New Features in Impala 1.1

Impala 1.1 includes new features for security, performance, and usability.

New user-visible features include:

New Features in Impala 1.0.1

New user-visible features include:

New Features in Impala 1.0

This version has multiple performance improvements and adds the following functionality:

New Features in Version 0.7 of the Impala Beta Release

This version has multiple performance improvements and adds the following functionality:

New Features in Version 0.6 of the Impala Beta Release

New Features in Version 0.5 of the Impala Beta Release

New Features in Version 0.4 of the Impala Beta Release

New Features in Version 0.3 of the Impala Beta Release

New Features in Version 0.2 of the Impala Beta Release