When Impala compute nodes and its storage are not co-located, the network bandwidth requirement goes up as the network traffic includes the data fetch as well as the shuffling exchange traffic of intermediate results.
To mitigate the pressure on the network, you can enable the compute nodes to cache the working set read from remote filesystems, such as, remote HDFS data node, S3, ABFS, ADLS.
To enable remote data cache, set the --data_cache
Impala Daemon start-up
flag as below:
--data_cache=dir1,dir2,dir3,...:quota
The flag is set to a list of directories, separated by ,
, followed by a
:
, and a capacity quota
per
directory.
If set to an empty string, data caching is disabled.
Cached data is stored in the specified directories.
The specified directories must exist in the local filesystem of each Impala Daemon, or Impala will fail to start.
In addition, the filesystem which the directory resides in must support hole punching.
The cache can consume up to the quota
bytes for each of the directories
specified.
The default setting for --data_cache
is an empty string.
For example, with the following setting, the data cache may use up to 1 TB, with 500 GB
max in /data/0
and /data/1
respectively.
--data_cache=/data/0,/data/1:500GB
--data_cache_eviction_policy
Impala Daemon start-up
flag: --data_cache_eviction_policy=policy