Using the ORC File Format with Impala Tables
Impala supports using ORC data files. By default, ORC reads are enabled in Impala 3.4.0 and higher.
|File Type||Format||Compression Codecs||Impala Can CREATE?||Impala Can INSERT?|
|ORC||Structured||gzip, Snappy, LZO, LZ4; currently gzip by default||Yes, in Impala 2.12.0 and higher. By default, ORC reads are enabled in Impala 3.4.0 and higher.||No. Import data by using|
Creating ORC Tables and Loading Data
If you do not have an existing data file to use, begin by creating one in the appropriate format.
To create an ORC table:
In the impala-shell interpreter, issue a command similar to:
CREATE TABLE orc_table (column_specs) STORED AS ORC;
Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of
certain file formats, you might use the Hive shell to load the data. See
How Impala Works with Hadoop File Formats for details. After loading data into a table through
Hive or another mechanism outside of Impala, issue a REFRESH table_name
statement the next time you connect to the Impala node, before querying the table, to make Impala recognize
the new data.
For example, here is how you might create some ORC tables in Impala (by specifying the columns explicitly, or cloning the structure of another table), load data through Hive, and query them through Impala:
$ impala-shell -i localhost
[localhost:21000] default> CREATE TABLE orc_table (x INT) STORED AS ORC;
[localhost:21000] default> CREATE TABLE orc_clone LIKE some_other_table STORED AS ORC;
[localhost:21000] default> quit;

$ hive
hive> INSERT INTO TABLE orc_table SELECT x FROM some_other_table;
3 Rows loaded to orc_table
Time taken: 4.169 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] default> SELECT * FROM orc_table;
Fetched 0 row(s) in 0.11s
[localhost:21000] default> -- Make Impala recognize the data loaded through Hive;
[localhost:21000] default> REFRESH orc_table;
[localhost:21000] default> SELECT * FROM orc_table;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Fetched 3 row(s) in 0.11s
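One way to double-check the result is to inspect the table from impala-shell. The following is a minimal sketch that assumes the orc_table from the example above. SHOW TABLE STATS reports the file format of the table, and SHOW FILES lists the data files that Impala currently knows about.

```sql
-- Assumes the orc_table created in the example above.
-- The Format column of the output should read ORC.
SHOW TABLE STATS orc_table;

-- Lists the data files Impala currently sees for the table.
-- If files loaded through Hive are missing, run REFRESH orc_table first.
SHOW FILES IN orc_table;
```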
Enabling Compression for ORC Tables
By default, ORC tables use zlib compression (called Deflate in Impala). You might want to use Snappy or LZO compression on existing tables for a different balance between compression ratio and decompression speed. In Hive 1.1.0, the supported compression codecs for ORC tables are NONE, ZLIB, SNAPPY, and LZO. For example, to enable Snappy compression, specify the following additional settings when loading data through the Hive shell:
hive> SET hive.exec.compress.output=true;
hive> SET orc.compress=SNAPPY;
hive> INSERT OVERWRITE TABLE new_table SELECT * FROM old_table;
If you are converting partitioned tables, you must complete additional steps. In such a case, specify additional settings similar to the following:
hive> CREATE TABLE new_table (your_cols) PARTITIONED BY (partition_cols) STORED AS new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE new_table PARTITION(comma_separated_partition_cols) SELECT * FROM old_table;
Remember that Hive does not require that you specify a source format for it. Consider the case of
converting a table with two partition columns called
year and month to a
Snappy compressed ORC table. Combining the components outlined previously to complete this table conversion,
you would specify settings similar to the following:
hive> CREATE TABLE tbl_orc (int_col INT, string_col STRING) STORED AS ORC;
hive> SET hive.exec.compress.output=true;
hive> SET orc.compress=SNAPPY;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_orc SELECT * FROM tbl;
To complete a similar process for a table that includes partitions, you would specify settings similar to the following:
hive> CREATE TABLE tbl_orc (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS ORC;
hive> SET hive.exec.compress.output=true;
hive> SET orc.compress=SNAPPY;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_orc PARTITION(year) SELECT * FROM tbl;
The compression type is specified in the following command:
SET orc.compress=SNAPPY;
You could elect to specify alternative codecs such as
NONE, ZLIB, or LZO here.
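As a sketch of switching codecs, the following Hive statements write the same data with zlib instead of Snappy. The table name tbl_orc_zlib is hypothetical. The source table tbl and the column definitions are taken from the examples above, and ZLIB is one of the codecs listed earlier as supported in Hive 1.1.0.

```sql
-- Hypothetical target table; same columns as tbl_orc in the earlier example.
CREATE TABLE tbl_orc_zlib (int_col INT, string_col STRING) STORED AS ORC;

-- Choose the codec before loading; NONE or LZO could be substituted here.
SET hive.exec.compress.output=true;
SET orc.compress=ZLIB;

-- Rewrite the data from the source table using the selected codec.
INSERT OVERWRITE TABLE tbl_orc_zlib SELECT * FROM tbl;
```

After the load completes, issue REFRESH tbl_orc_zlib in Impala before querying the table.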
Query Performance for Impala ORC Tables
In general, expect query performance with ORC tables to be faster than with tables using text data, but slower than with Parquet tables, because Impala includes a number of optimizations specific to Parquet. See Using the Parquet File Format with Impala Tables for information about using the Parquet file format for high-performance analytic queries.
In Impala 2.6 and higher, Impala queries are optimized for files
stored in Amazon S3. For Impala tables that use the file formats Parquet, ORC, RCFile,
SequenceFile, Avro, and uncompressed text, the setting
fs.s3a.block.size in the core-site.xml
configuration file determines how Impala divides the I/O work of reading the data files.
This configuration setting is specified in bytes. By default, this value is 33554432 (32
MB), meaning that Impala parallelizes S3 read operations on the files as if they were
made up of 32 MB blocks. For example, if your S3 queries primarily access Parquet files
written by MapReduce or Hive, increase
fs.s3a.block.size to 134217728
(128 MB) to match the row group size of those files. If most S3 queries involve Parquet
files written by Impala, increase
fs.s3a.block.size to 268435456 (256
MB) to match the row group size produced by Impala.
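The byte values above are powers of two, and their effect on parallelism can be estimated with simple division. The following worked arithmetic is an illustration only: the 1 GiB file size and the symbols N_32, N_128, and N_256 (the approximate number of block-sized ranges per file at each setting) are assumptions introduced here, and the number of read operations Impala actually schedules may differ.

```latex
\begin{align*}
32\ \text{MB}  &= 32  \times 1024 \times 1024 = 33554432\ \text{bytes} \\
128\ \text{MB} &= 128 \times 1024 \times 1024 = 134217728\ \text{bytes} \\
256\ \text{MB} &= 256 \times 1024 \times 1024 = 268435456\ \text{bytes} \\[6pt]
N_{32}  &= \frac{1073741824}{33554432}  = 32 \\
N_{128} &= \frac{1073741824}{134217728} = 8 \\
N_{256} &= \frac{1073741824}{268435456} = 4
\end{align*}
```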
Data Type Considerations for ORC Tables
The ORC format defines a set of data types whose names differ from the names of the corresponding Impala data types. If you are preparing ORC files using other Hadoop components such as Pig or MapReduce, you might need to work with the type names defined by ORC. The following figure lists the ORC-defined types and the equivalent types in Impala.
BINARY -> STRING
BOOLEAN -> BOOLEAN
DOUBLE -> DOUBLE
FLOAT -> FLOAT
TINYINT -> TINYINT
SMALLINT -> SMALLINT
INT -> INT
BIGINT -> BIGINT
TIMESTAMP -> TIMESTAMP
DATE (not supported)
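To illustrate the mapping, the following sketch declares an ORC table in Impala with one column for each supported type in the list. The table name orc_types_demo and the column names are hypothetical. DATE is left out because the list above marks it as not supported.

```sql
-- Hypothetical table with one column per supported Impala type.
CREATE TABLE orc_types_demo (
  c_string    STRING,     -- maps from ORC BINARY per the list above
  c_boolean   BOOLEAN,
  c_double    DOUBLE,
  c_float     FLOAT,
  c_tinyint   TINYINT,
  c_smallint  SMALLINT,
  c_int       INT,
  c_bigint    BIGINT,
  c_timestamp TIMESTAMP
) STORED AS ORC;

-- Show the column types Impala recorded for the table.
DESCRIBE orc_types_demo;
```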
In Impala 2.3 and higher, Impala supports the complex types
STRUCT, ARRAY, and MAP. In Impala 3.2 and higher, Impala also supports these
complex types in ORC. See
Complex Types (Impala 2.3 or higher only) for details.
These complex types are currently supported only for the Parquet or ORC file formats.
Because Impala has better performance on Parquet than ORC, if you plan to use complex
types, become familiar with the performance and storage aspects of Parquet first.
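A minimal sketch of querying complex types in an ORC table follows. The table orc_customers and its columns are hypothetical, and the data is assumed to be loaded through Hive, because Impala cannot insert into ORC tables.

```sql
-- Hypothetical table; complex types in ORC require Impala 3.2 or higher.
CREATE TABLE orc_customers (
  id      BIGINT,
  address STRUCT<city: STRING, zip: STRING>,
  phones  ARRAY<STRING>
) STORED AS ORC;

-- After loading data through Hive, make Impala aware of the new files.
REFRESH orc_customers;

-- STRUCT fields are referenced with dot notation.
SELECT id, address.city FROM orc_customers;

-- ARRAY elements are unnested by joining the table with its array column;
-- ITEM is the pseudocolumn that holds each array element.
SELECT c.id, p.item AS phone
FROM orc_customers c, c.phones p;
```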