Known Issues and Workarounds in Impala

Impala Known Issues: Startup

These issues can prevent one or more Impala-related daemons from starting properly.

Impala requires FQDN from hostname command on kerberized clusters

The method Impala uses to retrieve the host name while constructing the Kerberos principal is the gethostname() system call. This function might not always return the fully qualified domain name, depending on the network configuration. If the daemons cannot determine the FQDN, Impala does not start on a kerberized cluster.

Workaround: Test if a host is affected by checking whether the output of the hostname command includes the FQDN. On hosts where hostname returns only the short name, pass the command-line flag --hostname=fully_qualified_domain_name in the startup options of all Impala-related daemons.
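
The check in the workaround can be scripted. The following is a minimal Python sketch, not part of Impala. It assumes that a name with at least two dot-separated labels is fully qualified, which is only a heuristic, and the helper name looks_fully_qualified is hypothetical.

```python
import socket


def looks_fully_qualified(name: str) -> bool:
    """Heuristic: a name with two or more non-empty dot-separated labels."""
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


if __name__ == "__main__":
    # socket.gethostname() reports the host name as the operating system sees it.
    name = socket.gethostname()
    if looks_fully_qualified(name):
        print(f"{name}: looks fully qualified")
    else:
        print(f"{name}: short name only; consider passing --hostname=<FQDN>")
```

A host that prints the second message is a candidate for the --hostname flag described above.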

Apache Issue: IMPALA-4978

Impala Known Issues: Performance

These issues involve the performance of operations such as queries or DDL statements.

Metadata operations block read-only operations on unrelated tables

Metadata operations that change the state of a table, such as COMPUTE STATS or ALTER TABLE RECOVER PARTITIONS, can delay metadata propagation for unrelated tables whose metadata is not yet loaded, when that propagation is triggered by statements such as DESCRIBE or SELECT.

Bug: IMPALA-6671

Slow queries for Parquet tables with convert_legacy_hive_parquet_utc_timestamps=true

The configuration setting convert_legacy_hive_parquet_utc_timestamps=true uses an underlying function that can be a bottleneck on high volume, highly concurrent queries due to the use of a global lock while loading time zone information. This bottleneck can cause slowness when querying Parquet tables, up to 30x for scan-heavy queries. The amount of slowdown depends on factors such as the number of cores and number of threads involved in the query.

Note:

The slowdown only occurs when accessing TIMESTAMP columns within Parquet files that were generated by Hive, and therefore require the on-the-fly timezone conversion processing.

Bug: IMPALA-3316

Severity: High

Workaround: If the TIMESTAMP values stored in the table represent dates only, with no time portion, consider storing them as strings in yyyy-MM-dd format. Impala implicitly converts such string values to TIMESTAMP in calls to date/time functions.

Interaction of File Handle Cache with HDFS Appends and Short-Circuit Reads

If a data file used by Impala is being continuously appended or overwritten in place by an HDFS mechanism, such as hdfs dfs -appendToFile, interaction with the file handle caching feature in Impala 2.10 and higher could cause short-circuit reads to sometimes be disabled on some DataNodes. When a mismatch is detected between the cached file handle and a data block that was rewritten because of an append, short-circuit reads are turned off on the affected host for a 10-minute period.

The possibility of encountering such an issue is the reason why the file handle caching feature is currently turned off by default. See Scalability Considerations for Impala for information about this feature and how to enable it.

Bug: HDFS-12528

Severity: High

Workaround: Verify whether your ETL process is susceptible to this issue before enabling the file handle caching feature. You can set the impalad configuration option unused_file_handle_timeout_sec to a time period that is shorter than the HDFS setting dfs.client.read.shortcircuit.streams.cache.expiry.ms. (Keep in mind that the HDFS setting is in milliseconds while the Impala setting is in seconds.)
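
Because the two settings use different units, it is easy to compare them incorrectly. The following Python sketch shows the comparison with the unit conversion made explicit. The function name and the sample values are illustrative assumptions, not defaults of either system.

```python
def impala_timeout_is_shorter(unused_file_handle_timeout_sec: int,
                              hdfs_cache_expiry_ms: int) -> bool:
    """True if the Impala timeout (seconds) is shorter than the HDFS expiry (ms)."""
    return unused_file_handle_timeout_sec * 1000 < hdfs_cache_expiry_ms


if __name__ == "__main__":
    # Hypothetical values: 270 seconds on the Impala side, 300000 ms on the HDFS side.
    print(impala_timeout_is_shorter(270, 300000))
```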

Resolution: Fixed in HDFS 2.10 and higher. Use the new HDFS parameter dfs.domain.socket.disable.interval.seconds to specify how long short-circuit reads are disabled after an error is encountered. The default value is 10 minutes (600 seconds). When using the file handle cache, set dfs.domain.socket.disable.interval.seconds to a small value, such as 1 second. Setting it to 0 is not recommended, because a non-zero interval protects the system if there is a persistent problem with short-circuit reads.

Impala Known Issues: JDBC and ODBC Drivers

These issues affect applications that use the JDBC or ODBC APIs, such as business intelligence tools or custom-written applications in languages such as Java or C++.

ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)

If the ODBC SQLGetData function is called on a series of columns, the calls must follow the same order as the columns. For example, if data is fetched from column 2 and then column 1, the SQLGetData call for column 1 returns NULL.

Bug: IMPALA-1792

Workaround: Fetch columns in the same order they are defined in the table.
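
An application that needs the values in a different order can read them in column order first and rearrange them afterwards. The Python sketch below illustrates the idea. ForwardOnlyRow is a hypothetical stand-in that only simulates the behavior described above; it is not the ODBC API, and fetch_in_column_order is an illustrative helper name.

```python
class ForwardOnlyRow:
    """Simulated row: a column at or before the last one read yields None."""

    def __init__(self, values):
        self._values = list(values)
        self._last_col = 0

    def get_data(self, col):
        # Columns are numbered from 1, as in ODBC.
        if col <= self._last_col:
            return None
        self._last_col = col
        return self._values[col - 1]


def fetch_in_column_order(get_data, wanted_cols):
    """Read columns in ascending order, then return them in the caller's order."""
    by_col = {col: get_data(col) for col in sorted(set(wanted_cols))}
    return [by_col[col] for col in wanted_cols]
```

With this helper, a request for columns 3 and 1 reads column 1 first, so neither value is lost.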

Impala Known Issues: Resources

These issues involve memory or disk usage, including out-of-memory conditions, the spill-to-disk feature, and resource management features.

Handling large rows during upgrade to Impala 2.10 or higher

After an upgrade to Impala 2.10 or higher, users who process very large column values (long strings), or have increased the --read_size configuration setting from its default of 8 MB, might encounter capacity errors for some queries that previously worked.

Resolution: After the upgrade, follow the upgrade instructions to check whether your queries are affected by these changes, and modify your configuration settings if so.

Apache Issue: IMPALA-6028

Configuration to prevent crashes caused by thread resource limits

Impala could encounter a serious error due to resource usage under very high concurrency. The error message is similar to:


F0629 08:20:02.956413 29088 llvm-codegen.cc:111] LLVM hit fatal error: Unable to allocate section memory!
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >'

Bug: IMPALA-5605

Severity: High

Workaround: To prevent such errors, configure each host running an impalad daemon with the following settings:


echo 2000000 > /proc/sys/kernel/threads-max
echo 2000000 > /proc/sys/kernel/pid_max
echo 8000000 > /proc/sys/vm/max_map_count

Add the following lines in /etc/security/limits.conf:


impala soft nproc 262144
impala hard nproc 262144
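
To see whether a host already meets the kernel settings above, the current values can be compared against the recommended ones. The following Python sketch assumes a Linux host where the /proc files are readable. The function names are hypothetical, and the sketch does not check /etc/security/limits.conf.

```python
from pathlib import Path

# Recommended values, taken from the workaround above.
RECOMMENDED = {
    "/proc/sys/kernel/threads-max": 2000000,
    "/proc/sys/kernel/pid_max": 2000000,
    "/proc/sys/vm/max_map_count": 8000000,
}


def settings_below_recommended(current: dict) -> dict:
    """Return {path: (current, recommended)} for each setting that is too low."""
    return {
        path: (current[path], wanted)
        for path, wanted in RECOMMENDED.items()
        if path in current and current[path] < wanted
    }


def read_current() -> dict:
    """Read the current values from /proc; files that cannot be read are skipped."""
    values = {}
    for path in RECOMMENDED:
        try:
            values[path] = int(Path(path).read_text().strip())
        except (OSError, ValueError):
            pass
    return values


if __name__ == "__main__":
    for path, (have, want) in settings_below_recommended(read_current()).items():
        print(f"{path}: {have} is below the recommended {want}")
```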

Breakpad minidumps can be very large when the thread count is high

The size of the breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads.

Workaround: Add --minidump_size_limit_hint_kb=size to set a soft upper limit on the size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump file can still grow larger than the "hinted" size. For example, if you have 10,000 threads, the minidump file can be more than 20 MB.
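
The arithmetic behind that example can be sketched in Python as follows. The estimate covers only the per-thread data described above and ignores everything else a minidump contains, so it is a lower bound. The constant and function names are illustrative.

```python
FULL_KB_PER_THREAD = 8      # default per-thread size
REDUCED_KB_PER_THREAD = 2   # per-thread size once the size hint is exceeded
FULL_DETAIL_THREADS = 20    # threads that keep full information


def thread_data_kb(threads: int, reduced: bool) -> int:
    """Estimate the per-thread portion of a minidump, in KB."""
    if not reduced:
        return threads * FULL_KB_PER_THREAD
    full = min(threads, FULL_DETAIL_THREADS)
    return full * FULL_KB_PER_THREAD + (threads - full) * REDUCED_KB_PER_THREAD


if __name__ == "__main__":
    print(thread_data_kb(10_000, reduced=False))  # 80000 KB at the default size
    print(thread_data_kb(10_000, reduced=True))   # 20120 KB after reduction
```

For 10,000 threads, the reduced figure is 20 x 8 KB + 9,980 x 2 KB = 20,120 KB, which is why the file can still exceed a smaller hinted size.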

Apache Issue: IMPALA-3509

Process mem limit does not account for the JVM's memory usage

Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the impalad daemon.

Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the Impala web UI /memz tab to JVM memory usage shown on the /metrics tab.

Apache Issue: IMPALA-691

Impala Known Issues: Correctness

These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances.

Incorrect result due to constant evaluation in query with outer join

An OUTER JOIN query could omit some expected result rows due to a constant such as FALSE in another join clause. For example:


explain SELECT 1 FROM alltypestiny a1
  INNER JOIN alltypesagg a2 ON a1.smallint_col = a2.year AND false
  RIGHT JOIN alltypes a3 ON a1.year = a1.bigint_col;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.00KB VCores=1 |
|                                                         |
| 00:EMPTYSET                                             |
+---------------------------------------------------------+

Bug: IMPALA-3094

Severity: High

Impala may use incorrect bit order with BIT_PACKED encoding

Parquet BIT_PACKED encoding as implemented by Impala is LSB first. The Parquet standard specifies MSB first.

Bug: IMPALA-3006

Severity: High, but rare in practice because BIT_PACKED is infrequently used, is not written by Impala, and is deprecated in Parquet 2.0.

BST between 1972 and 1995

The calculation of start and end times for the BST (British Summer Time) time zone could be incorrect between 1972 and 1995. Between 1972 and 1995, BST began and ended at 02:00 GMT on the third Sunday in March (or second Sunday when Easter fell on the third) and fourth Sunday in October. For example, both function calls should return 13, but actually return 12, in a query such as:


select
  extract(from_utc_timestamp(cast('1970-01-01 12:00:00' as timestamp), 'Europe/London'), "hour") summer70start,
  extract(from_utc_timestamp(cast('1970-12-31 12:00:00' as timestamp), 'Europe/London'), "hour") summer70end;

Bug: IMPALA-3082

Severity: High

% escaping does not work correctly when it occurs at the end of a LIKE clause

If the final character in the RHS argument of a LIKE operator is an escaped \% character, it does not match a % final character of the LHS argument.
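
To make the expected behavior concrete, the following Python sketch models LIKE matching with a backslash escape character. It illustrates the intended semantics only; it is not how Impala implements LIKE, and the function names are hypothetical.

```python
import re


def like_to_regex(pattern: str, escape: str = "\\"):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # An escaped character always matches itself literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value: str, pattern: str) -> bool:
    """True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

Under these semantics, a value ending in a literal % matches a pattern ending in an escaped \%, which is the case affected by this issue.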

Bug: IMPALA-2422

Crash: impala::Coordinator::ValidateCollectionSlots

A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving subqueries.

Bug: IMPALA-2603

Impala Known Issues: Interoperability

These issues affect the ability to interchange data between Impala and other database systems. They cover areas such as data types and file formats.

DESCRIBE FORMATTED gives error on Avro table

This issue can occur either on old Avro tables (created prior to Hive 1.1) or when changing the Avro schema file by adding or removing columns. Columns added to the schema file will not show up in the output of the DESCRIBE FORMATTED command. Removing columns from the schema file will trigger a NullPointerException.

As a workaround, you can use the output of SHOW CREATE TABLE to drop and recreate the table. This will populate the Hive metastore database with the correct column definitions.

Warning:

Only use this for external tables, or Impala will remove the data files. In case of an internal table, set it to external first:


ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');

(The part in parentheses is case sensitive.) Make sure to pick the right choice between internal and external when recreating the table. See Overview of Impala Tables for the differences between internal and external tables.

Severity: High

Deviation from Hive behavior: Out-of-range float/double values are returned as the maximum allowed value of the type (Hive returns NULL)

Impala behavior differs from Hive with respect to out-of-range float/double values. Impala returns such values as the maximum allowed value of the type, whereas Hive returns NULL.

Workaround: None

Configuration needed for Flume to be compatible with Impala

For compatibility with Impala, the value for the Flume HDFS Sink hdfs.writeFormat must be set to Text, rather than its default value of Writable. The hdfs.writeFormat setting must be changed to Text before creating data files with Flume; otherwise, those files cannot be read by either Impala or Hive.

Resolution: This information has been requested to be added to the upstream Flume documentation.

Avro Scanner fails to parse some schemas

Querying certain Avro tables could cause a crash or return no rows, even though Impala could DESCRIBE the table.

Bug: IMPALA-635

Workaround: Swap the order of the fields in the schema specification. For example, ["null", "string"] instead of ["string", "null"].
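
For schemas with many fields, the swap can be scripted. The Python sketch below moves "null" to the front of each union in the top-level fields of an Avro record schema that has already been parsed from JSON. The function names are hypothetical, nested records are not handled, and any field defaults should be reviewed after reordering because they may depend on the order of the union.

```python
def null_first(union: list) -> list:
    """Return the union with "null" first; other branches keep their order."""
    if "null" not in union:
        return list(union)
    return ["null"] + [branch for branch in union if branch != "null"]


def reorder_field_unions(schema: dict) -> dict:
    """Return a copy of a record schema with each top-level union reordered."""
    fields = []
    for field in schema.get("fields", []):
        field = dict(field)
        if isinstance(field.get("type"), list):
            field["type"] = null_first(field["type"])
        fields.append(field)
    return {**schema, "fields": fields}
```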

Resolution: Not allowing this syntax agrees with the Avro specification, so it may still cause an error even when the crashing issue is resolved.

Impala BE cannot parse Avro schema that contains a trailing semi-colon

If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried.

Bug: IMPALA-1024

Workaround: Remove the trailing semicolon from the Avro schema.
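
Removing the trailing semicolon can be done by hand or with a small script. The following Python sketch strips trailing semicolons and whitespace from the schema text and then confirms that what remains is valid JSON. The function name is hypothetical, and the sketch checks JSON syntax only, not Avro validity.

```python
import json


def strip_trailing_semicolon(schema_text: str) -> str:
    """Remove trailing semicolons and whitespace, then verify the JSON parses."""
    cleaned = schema_text.rstrip()
    while cleaned.endswith(";"):
        cleaned = cleaned[:-1].rstrip()
    json.loads(cleaned)  # raises ValueError if the text is still not valid JSON
    return cleaned
```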

Incorrect results with basic predicate on CHAR typed column

When comparing a CHAR column value to a string literal, the literal value is not blank-padded and so the comparison might fail when it should match.

Bug: IMPALA-1652

Workaround: Use the RPAD() function to blank-pad literals compared with CHAR columns to the expected length.
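
The effect of blank-padding can be illustrated outside of SQL. In the Python sketch below, the stored value stands in for a CHAR(5) column value, and blank_pad is a hypothetical helper that pads a literal with spaces on the right, as the workaround does with RPAD().

```python
def blank_pad(literal: str, length: int) -> str:
    """Pad a literal with trailing spaces up to the given length."""
    return literal.ljust(length)


if __name__ == "__main__":
    stored = "abc  "  # how a CHAR(5) column holds the value 'abc'
    print(stored == "abc")                # False: the literal is not padded
    print(stored == blank_pad("abc", 5))  # True: the padded literal matches
```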

Impala Known Issues: Limitations

These issues are current limitations of Impala that require evaluation as you plan how to integrate Impala into your data management workflow.

Set limits on size of expression trees

Very deeply nested expressions within queries can exceed internal Impala limits, leading to excessive memory usage.

Bug: IMPALA-4551

Severity: High

Workaround: Avoid queries with extremely large expression trees. Setting the query option disable_codegen=true may reduce the impact, at a cost of longer query runtime.

Impala does not support running on clusters with federated namespaces

Impala does not support running on clusters with federated namespaces. The impalad process will not start on a node running such a filesystem based on the org.apache.hadoop.fs.viewfs.ViewFs class.

Bug: IMPALA-77

Anticipated Resolution: Limitation

Workaround: Use standard HDFS on all Impala nodes.

Impala Known Issues: Miscellaneous

These issues do not fall into one of the above categories or have not been categorized yet.

A failed CTAS does not drop the table if the insert fails

If a CREATE TABLE AS SELECT operation successfully creates the target table but an error occurs while querying the source table or copying the data, the new table is left behind rather than being dropped.

Bug: IMPALA-2005

Workaround: Drop the new table manually after a failed CREATE TABLE AS SELECT.

Casting scenarios with invalid/inconsistent results

Using a CAST() function to convert large literal values to smaller types, or to convert special values such as NaN or Inf, produces values not consistent with other database systems. This could lead to unexpected results from queries.

Bug: IMPALA-1821

Impala Parser issue when using fully qualified table names that start with a number

A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market, the decimal point followed by digits is interpreted as a floating-point number.

Bug: IMPALA-941

Workaround: Surround each part of the fully qualified name with backticks (``).
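
Applications that build statements from table names can apply the quoting automatically. The following Python sketch wraps each dot-separated part of a name in backticks. The function name is hypothetical, and the sketch assumes that no part of the name itself contains a dot or a backtick.

```python
def quote_qualified_name(name: str) -> str:
    """Wrap each dot-separated part of a table name in backticks."""
    parts = name.split(".")
    # Strip any existing backticks so already-quoted parts are not quoted twice.
    return ".".join(f"`{part.strip('`')}`" for part in parts)


if __name__ == "__main__":
    print(quote_qualified_name("db.571_market"))
```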

Impala should tolerate bad locale settings

If the LC_* environment variables specify an unsupported locale, Impala does not start.

Bug: IMPALA-532

Workaround: Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore daemon. See Modifying Impala Startup Options for details about modifying these environment settings.

Resolution: Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution.

Log Level 3 Not Recommended for Impala

The extensive logging produced by log level 3 can cause serious performance overhead and capacity issues.

Workaround: Reduce the log level to its default value of 1, that is, GLOG_v=1. See Setting Logging Levels for details about the effects of setting different logging levels.

Impala Known Issues: Crashes and Hangs

These issues can cause Impala to quit or become unresponsive.

Unable to view large catalog objects in catalogd Web UI

In the catalogd Web UI, you can list metadata objects and view their details. These details are accessed through a link and printed to a string formatted using Thrift's DebugProtocol. Printing large objects (> 1 GB) in the Web UI can crash catalogd.

Bug: IMPALA-6841