Using Impala with Apache Ozone Storage

You can use Impala to query data files that reside on Apache Ozone distributed storage, rather than in HDFS. The combination of the Impala query engine and Apache Ozone storage is certified on Impala 4.2 or higher.

For more information on Ozone, see the Apache Ozone site.

The typical use case for Impala and Ozone together is to use Ozone as the default filesystem, replacing HDFS entirely. In this configuration, when you create a database, table, or partition, the data always resides on Ozone storage and you do not need to specify a LOCATION attribute. If you do specify a LOCATION attribute, its value refers to a path within the Ozone filesystem. For example:

-- If the default filesystem is Ozone, all Impala data resides there
-- and all Impala databases and tables are located there.

-- You can specify LOCATION for database, table, or partition,
-- using values from the Ozone filesystem.
CREATE DATABASE d1 LOCATION '/some/path/on/ozone/server/d1.db';

Impala can write to, delete, and rename data files and database, table, and partition directories on Ozone storage. Therefore, Impala statements such as CREATE TABLE, DROP TABLE, CREATE DATABASE, DROP DATABASE, ALTER TABLE, and INSERT work the same with Ozone storage as with HDFS.
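As a minimal sketch of that equivalence, the statements below are ordinary Impala SQL with nothing Ozone-specific in them. The sketch assumes Ozone is the default filesystem; the database, table, and column names, and the partition path, are hypothetical and chosen only for illustration.

```sql
-- Assumption: Ozone is the default filesystem, so no LOCATION is required.
-- All object names here (d1, t1, id, name, year) are hypothetical.
CREATE DATABASE IF NOT EXISTS d1;

CREATE TABLE d1.t1 (id INT, name STRING)
  PARTITIONED BY (year INT)
  STORED AS PARQUET;

-- Writes a data file under the partition directory on Ozone.
INSERT INTO d1.t1 PARTITION (year=2024) VALUES (1, 'example');

-- Optionally place a partition at an explicit path in the Ozone filesystem.
ALTER TABLE d1.t1 ADD PARTITION (year=2025)
  LOCATION '/some/path/on/ozone/server/d1.db/t1/year=2025';

-- Removes the table and, for a managed table, its directories on Ozone.
DROP TABLE d1.t1;
```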

Ozone supports multiple protocols: ofs, o3fs, and s3a. Impala can read data through ofs and o3fs, and also through s3a (see Using Impala with Amazon S3 Object Store). However, ofs is the newer Ozone protocol and the only one Impala supports as a default filesystem. We recommend ofs for DDL statements to avoid access limitations, and for DML statements and SELECT statements for better performance.
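When Ozone is not the default filesystem, a table can still point at Ozone data by giving a full ofs URI in the LOCATION clause. The following is a sketch only: host:port, volume, and bucket are placeholders to be replaced with values from your own Ozone deployment, and the table name, columns, and the sales_data key are hypothetical.

```sql
-- Assumption: host:port/volume/bucket are placeholders, not real values.
-- The table name, columns, and 'sales_data' key are hypothetical.
CREATE EXTERNAL TABLE sales_on_ozone (
  sale_id BIGINT,
  amount  DOUBLE
)
STORED AS PARQUET
LOCATION 'ofs://host:port/volume/bucket/sales_data';

-- Queries read the files through the ofs protocol.
SELECT COUNT(*) FROM sales_on_ozone;
```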

Because Apache Ozone storage buckets use a global value for the block size rather than a configurable value for each file, the PARQUET_FILE_SIZE query option has no effect when Impala inserts data into a table or partition residing on Ozone storage.
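To illustrate, the following sketch sets the option and then inserts into a table stored on Ozone. The table names ozone_tbl and staging_tbl are hypothetical. Per the behavior described above, the setting is accepted but does not change how the resulting files are laid out on Ozone.

```sql
-- Hypothetical tables: ozone_tbl resides on Ozone; staging_tbl is any source.
SET PARQUET_FILE_SIZE=256m;

-- The option is accepted, but has no effect for a table or partition
-- on Ozone, because the block size is a global value for the bucket.
INSERT INTO ozone_tbl SELECT * FROM staging_tbl;
```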

Impala's spill-to-disk feature may be configured to use Ozone storage by specifying a full URI (e.g. ofs://host:port/volume/bucket/key) for the spill location. See Managing Disk Space for Impala Data for details on configuring remote spill-to-disk.