Data Cache for Remote Reads

When Impala compute nodes and their storage are not co-located, the network bandwidth requirement goes up because the network carries both the data fetched from storage and the shuffle (exchange) traffic of intermediate results.

To reduce the pressure on the network, you can enable the compute nodes to cache the working set read from remote filesystems, such as remote HDFS data nodes, S3, ABFS, and ADLS.

To enable the remote data cache, set the --data_cache Impala Daemon start-up flag as follows:

--data_cache=dir1,dir2,dir3,...:quota

The flag takes a comma-separated list of directories, followed by a colon (:) and a capacity quota that applies to each directory.

If set to an empty string, data caching is disabled.

Cached data is stored in the specified directories.

The specified directories must exist in the local filesystem of each Impala Daemon, or Impala will fail to start.

In addition, the filesystem on which each directory resides must support hole punching.

The cache can consume up to the quota, in bytes, in each of the specified directories.

The default setting for --data_cache is an empty string.
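The hole-punching requirement can be checked before starting the daemon. The following is a minimal sketch, not part of Impala: the helper name supports_hole_punching is hypothetical, and it assumes a 64-bit Linux host where the C library exposes fallocate(). It writes a temporary file in the candidate directory and tries to punch a hole in it.

```python
import ctypes
import ctypes.util
import tempfile

# Mode bits from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def supports_hole_punching(directory):
    """Return True if a hole can be punched in a temp file under `directory`.

    Hypothetical helper for illustration; assumes 64-bit Linux (off_t is 64 bits).
    """
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        fallocate = libc.fallocate
    except (OSError, AttributeError, TypeError):
        return False  # no usable fallocate() on this platform

    # int fallocate(int fd, int mode, off_t offset, off_t len)
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int

    with tempfile.TemporaryFile(dir=directory) as f:
        f.write(b"\0" * 8192)
        f.flush()
        # Punch a 4 KB hole at offset 0 without changing the file size.
        rc = fallocate(f.fileno(), FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 4096)
    return rc == 0


result = supports_hole_punching(tempfile.gettempdir())
print("hole punching supported:", result)
```

A False result means either that the filesystem rejected the request or that the platform has no fallocate() call, so treat it as a prompt for further checking rather than a definitive answer.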

For example, with the following setting, the data cache may use up to 1 TB in total, with at most 500 GB in each of /data/0 and /data/1.

--data_cache=/data/0,/data/1:500GB
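To make the flag format concrete, here is a minimal Python sketch that splits a --data_cache value into its directory list and per-directory quota and then computes the total capacity. It is an illustration only, not Impala's parser: the function name parse_data_cache is hypothetical, and the sketch assumes binary units (1 GB = 1024^3 bytes).

```python
import re

# Assumption: size suffixes are binary multiples (1 GB = 1024**3 bytes).
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_data_cache(value):
    """Split a --data_cache value into (directories, per-directory quota in bytes).

    Hypothetical helper for illustration; an empty string means caching is disabled.
    """
    if value == "":
        return [], 0
    # The quota follows the last colon; everything before it is the directory list.
    dirs_part, sep, quota_part = value.rpartition(":")
    if not sep or not dirs_part:
        raise ValueError("expected dir1,dir2,...:quota")
    match = re.fullmatch(r"(\d+)\s*([KMGT]?B?)", quota_part.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized quota: {quota_part!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit and not unit.endswith("B"):
        unit += "B"  # accept "500G" as well as "500GB"
    directories = [d for d in dirs_part.split(",") if d]
    return directories, int(number) * _UNITS[unit]


dirs, quota = parse_data_cache("/data/0,/data/1:500GB")
total_gb = len(dirs) * quota // 1024**3
print(dirs, total_gb)
```

Because the quota applies per directory, the total capacity is the quota multiplied by the number of directories, which is 1000 GB for the setting above.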

In Impala 3.4 and higher, you can configure one of the following cache eviction policies for the data cache:

LRU (Least Recently Used), the default
LIRS (Low Inter-reference Recency Set)

LIRS is a scan-resistant policy with low performance overhead. You configure the cache eviction policy using the --data_cache_eviction_policy Impala Daemon start-up flag:

--data_cache_eviction_policy=policy
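To see why scan resistance matters, consider the following toy LRU cache. It is a generic sketch and not Impala's implementation: the LRUCache class is hypothetical and its capacity is counted in entries rather than bytes. A single large scan pushes a frequently used entry out of the cache, which is the behavior a scan-resistant policy such as LIRS is designed to avoid.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache (illustration only); capacity is a number of entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=3)
cache.put("hot", b"frequently read block")
cache.get("hot")

# A one-off scan touches as many new blocks as the cache can hold.
for i in range(3):
    cache.put(f"scan-{i}", b"block read once")

hot_survived = cache.get("hot") is not None
print("hot entry survived the scan:", hot_survived)
```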

Note: A cached item does not expire as long as queries use the same file metadata. This is because the cache key consists of the filename, the mtime (last modified time of the file), and the file offset. If the mtime in the file metadata remains unchanged, scan requests continue to be served from the cache (provided that there is enough capacity).
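The effect of the cache key can be sketched with a plain dictionary. This is an illustration under stated assumptions, not Impala's data structure: the CacheKey type, the file path, and the mtime values are all hypothetical. A lookup with an unchanged mtime finds the cached entry, while a new mtime produces a different key and therefore a miss.

```python
from typing import NamedTuple


class CacheKey(NamedTuple):
    """Hypothetical key mirroring the three components named in the note."""
    filename: str
    mtime: int   # last modified time of the file
    offset: int  # byte offset of the cached range within the file


cache = {}

# Hypothetical file path and timestamps, for illustration only.
path = "s3a://example-bucket/sales/part-0.parq"
cache[CacheKey(path, 1700000000, 0)] = b"cached bytes"

# Same filename, mtime, and offset: the lookup hits the cache.
hit = CacheKey(path, 1700000000, 0) in cache

# The file was rewritten, so its mtime changed: the old entry no longer matches.
miss = CacheKey(path, 1700000500, 0) not in cache

print(hit, miss)
```

In this sketch the stale entry is never looked up again, so it simply occupies space until the eviction policy removes it.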