Data Cache for Remote Reads
When Impala compute nodes and its storage are not co-located, the network bandwidth requirement goes up as the network traffic includes the data fetch as well as the shuffling exchange traffic of intermediate results.
To mitigate the pressure on the network, you can enable the compute nodes to cache the working set read from remote filesystems, such as, remote HDFS data node, S3, ABFS, ADLS.
To enable remote data cache, set the
--data_cache Impala Daemon start-up
flag as below:
The flag is set to a list of directories, separated by
,, followed by a
:, and a capacity
If set to an empty string, data caching is disabled.
Cached data is stored in the specified directories.
The specified directories must exist in the local filesystem of each Impala Daemon, or Impala will fail to start.
In addition, the filesystem which the directory resides in must support hole punching.
The cache can consume up to the
quota bytes for each of the directories
The default setting for
--data_cache is an empty string.
For example, with the following setting, the data cache may use up to 1 TB, with 500 GB
- LRU (Least Recently Used--the default)
- LIRS (Inter-referenece Recency Set)
--data_cache_eviction_policyImpala Daemon start-up flag: