How Impala Works with Hadoop File Formats

Impala supports several familiar file formats used in Apache Hadoop. Impala can load and query data files produced by other Hadoop components such as Pig or MapReduce, and data files produced by Impala can also be used by other components. The following sections discuss the procedures, limitations, and performance considerations for using each file format with Impala.

The file format used for an Impala table has significant performance consequences. Some file formats include compression support that affects the size of data on disk and, consequently, the amount of I/O and CPU resources required to deserialize data. These I/O and CPU costs can be a limiting factor in query performance, because a query typically begins by moving data from disk and decompressing it. Compressing data reduces the I/O cost: fewer bytes are transferred from disk to memory, which shortens the transfer time. The tradeoff is the extra CPU time spent decompressing the content.
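
One way to see this tradeoff on your own data is to write the same rows with two codecs and compare the resulting sizes. The sketch below assumes a Parquet destination and uses the COMPRESSION_CODEC query option in impala-shell; the table names (raw_events, events_snappy, events_gzip) are hypothetical.

```sql
-- Hypothetical tables: raw_events is an existing table to copy from.
SET COMPRESSION_CODEC=snappy;  -- typically lighter compression, cheaper to decompress
CREATE TABLE events_snappy STORED AS PARQUET AS SELECT * FROM raw_events;

SET COMPRESSION_CODEC=gzip;    -- typically smaller files, more CPU to decompress
CREATE TABLE events_gzip STORED AS PARQUET AS SELECT * FROM raw_events;

-- Compare the Size column reported for each table.
SHOW TABLE STATS events_snappy;
SHOW TABLE STATS events_gzip;
```

Running the same query against both tables then indicates whether the I/O savings outweigh the extra decompression work for that workload.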

Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop. Impala can create and insert data into tables that use some file formats but not others; for file formats that Impala cannot write to, create the table in Hive, issue the INVALIDATE METADATA table_name statement in impala-shell, and query the table through Impala. File formats can be structured, in which case they may include metadata and built-in compression. Supported formats include:

Table 1. File Format Support in Impala

| File Type | Format | Compression Codecs | Impala Can CREATE? | Impala Can INSERT? |
|---|---|---|---|---|
| Parquet | Structured | Snappy, gzip; currently Snappy by default | Yes. | Yes: CREATE TABLE, INSERT, LOAD DATA, and query. |
| ORC | Structured | gzip, Snappy, LZO, LZ4; currently gzip by default | Yes, in Impala 2.12.0 and higher. | No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala. |
| Text | Unstructured | LZO, gzip, bzip2, Snappy | Yes. For CREATE TABLE with no STORED AS clause, the default file format is uncompressed text, with values separated by ASCII 0x01 characters (typically represented as Ctrl-A). | Yes: CREATE TABLE, INSERT, LOAD DATA, and query. If LZO compression is used, you must create the table and load data in Hive. If other kinds of compression are used, you must load data through LOAD DATA, Hive, or manually in HDFS. |
| Avro | Structured | Snappy, gzip, deflate, bzip2 | Yes, in Impala 1.4.0 and higher. In lower versions, create the table using Hive. | No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala. |
| RCFile | Structured | Snappy, gzip, deflate, bzip2 | Yes. | No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala. |
| SequenceFile | Structured | Snappy, gzip, deflate, bzip2 | Yes. | No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala. |

Impala can only query the file formats listed in the preceding table. ORC support has been an experimental feature since Impala 2.12. To disable it, set --enable_orc_scanner to false when starting the cluster.
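
The statements named in the table can be sketched as follows. This is an illustrative outline, not a complete procedure: all table names, column definitions, and the HDFS path are hypothetical, and the Hive step is run in Hive rather than in impala-shell.

```sql
-- In impala-shell. No STORED AS clause: uncompressed text, Ctrl-A delimited.
CREATE TABLE visits_text (id BIGINT, url STRING);

-- Parquet: Impala can both create the table and insert into it.
CREATE TABLE visits_parquet (id BIGINT, url STRING) STORED AS PARQUET;
INSERT INTO visits_parquet SELECT id, url FROM visits_text;

-- RCFile: Impala can create the table but not insert into it.
CREATE TABLE visits_rc (id BIGINT, url STRING) STORED AS RCFILE;

-- Option 1: move data files already in RCFile format into the table.
-- (Hypothetical HDFS path.)
LOAD DATA INPATH '/user/example/staging/visits_rc' INTO TABLE visits_rc;

-- Option 2: in Hive (not impala-shell), write the data with INSERT:
--   INSERT INTO TABLE visits_rc SELECT id, url FROM visits_text;
-- then, back in impala-shell, pick up the new data files:
REFRESH visits_rc;

-- For a table that was created in Hive (hypothetical name), make it
-- visible to Impala before querying it:
INVALIDATE METADATA visits_hive_created;
SELECT COUNT(*) FROM visits_hive_created;
```

REFRESH is enough when only the data files of a table Impala already knows about have changed; INVALIDATE METADATA is the statement the text above prescribes for a table that was created in Hive.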

Impala supports the following compression codecs:

Choosing the File Format for a Table

Different file formats and compression codecs work better for different data sets. While Impala typically provides performance gains regardless of file format, choosing the proper format for your data can yield further performance improvements. Use the following considerations to decide which combination of file format and compression to use for a particular table: