Using Impala with Isilon Storage
You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS. This capability allows convenient query access to a storage system where you might already be managing large volumes of data. The combination of the Impala query engine and Isilon storage is certified on Impala 2.2.4 or higher.
The PARQUET_FILE_SIZE query option has no effect when Impala inserts data into a table or partition residing on Isilon storage. Use the isi command to set the default block size globally on the Isilon device. For example, to set the Isilon default block size to 256 MB, the recommended size for Parquet data files for Impala, issue the following command:
isi hdfs settings modify --default-block-size=256MB
The typical use case for Impala and Isilon together is to use Isilon for the default filesystem, replacing HDFS entirely. In this configuration, when you create a database, table, or partition, the data always resides on Isilon storage and you do not need to specify any special LOCATION attribute. If you do specify a LOCATION attribute, its value refers to a path within the Isilon filesystem. For example:
-- If the default filesystem is Isilon, all Impala data resides there
-- and all Impala databases and tables are located there.
CREATE TABLE t1 (x INT, s STRING);
-- You can specify LOCATION for database, table, or partition,
-- using values from the Isilon filesystem.
CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
Impala can write to, delete, and rename data files and database, table, and partition directories on Isilon storage. Therefore, Impala statements such as CREATE TABLE, DROP TABLE, CREATE DATABASE, DROP DATABASE, ALTER TABLE, and INSERT work the same with Isilon storage as with HDFS.
When the Impala spill-to-disk feature is activated by a query that approaches the memory limit, Impala writes all the temporary data to a local (not Isilon) storage device. Because the I/O bandwidth for the temporary data depends on the number of local disks, and clusters using Isilon storage might not have as many local disks attached, pay special attention on Isilon-enabled clusters to any queries that use the spill-to-disk feature. Where practical, tune the queries or allocate extra memory for Impala to avoid spilling. Although you can specify an Isilon storage device as the destination for the temporary data for the spill-to-disk feature, that configuration is not recommended due to the need to transfer the data both ways using remote I/O.
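As a minimal sketch of keeping spill data on local disks, the snippet below builds a value for the impalad --scratch_dirs startup option, which accepts a comma-separated list of directories. The two paths are hypothetical local mount points, not values from any real cluster. How the resulting flag reaches impalad depends on how the cluster is managed.

```shell
# Sketch only: these scratch paths are hypothetical local-disk mount points.
# Listing one directory per local disk spreads the temporary-data I/O across disks.
SCRATCH_DIRS="/data1/impala/scratch,/data2/impala/scratch"
IMPALAD_SCRATCH_ARGS="--scratch_dirs=${SCRATCH_DIRS}"

# Print the flag so it can be added to the impalad startup configuration.
echo "${IMPALAD_SCRATCH_ARGS}"
```

Pointing every entry at local (not Isilon) paths keeps the spilled data off the remote I/O path described above.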
When tuning Impala queries on HDFS, you typically try to avoid any remote reads. When the
data resides on Isilon storage, all the I/O consists of remote reads. Do not be alarmed
when you see non-zero numbers for remote read measurements in query profile output. The
benefit of the Impala and Isilon integration is primarily the convenience of not having to
move or copy large volumes of data to HDFS, rather than raw query performance. You can
increase the performance of Impala I/O for Isilon systems by increasing the value for the
--num_remote_hdfs_io_threads startup option for the impalad daemon.
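For illustration, the remote I/O thread setting might be assembled as shown below. The value 16 is a hypothetical placeholder, not a recommendation. The appropriate number depends on the cluster and should be validated by testing.

```shell
# Hypothetical value for illustration; choose a number based on your own testing.
REMOTE_IO_THREADS=16
IMPALAD_IO_ARGS="--num_remote_hdfs_io_threads=${REMOTE_IO_THREADS}"

# Print the flag so it can be added to the impalad startup configuration.
echo "${IMPALAD_IO_ARGS}"
```

The flag takes effect only after the impalad daemon is restarted with the new startup options.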