Managing Disk Space for Impala Data

Although Impala typically works with many large files in an HDFS storage system with plenty of capacity, there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques to minimize space consumption and file duplication.

Configuring Scratch Space for Spilling to Disk

Impala uses intermediate files during large sort, join, aggregation, or analytic function operations. The files are removed when the operation finishes. You can specify the locations of the intermediate files by starting the impalad daemon with the --scratch_dirs="path_to_directory" configuration option. By default, intermediate files are stored in the directory /tmp/impala-scratch.
  • You can specify a single directory or a comma-separated list of directories.
  • You can specify an optional capacity quota per scratch directory, using a colon (:) as the delimiter.

    A capacity quota of -1 or 0 is the same as no quota for the directory.

  • The scratch directories must be on the local filesystem, not in HDFS.
  • You might specify different directory paths for different hosts, depending on the capacity and speed of the available storage devices.

If there is less than 1 GB free on the filesystem where a scratch directory resides, Impala still runs, but writes a warning message to its log.

Impala also starts successfully, with a warning written to the log, if it cannot create, read, or write files in one of the scratch directories.
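
The two startup conditions above can be checked ahead of time with a small script. The following is a minimal sketch, not Impala's own check: the function name check_scratch_dir is hypothetical, and "1 GB" is assumed to mean 2**30 bytes.

```python
import os
import shutil
import tempfile

# Assumption: "1 GB" is interpreted here as 2**30 bytes.
MIN_FREE_BYTES = 1 << 30


def check_scratch_dir(path, min_free_bytes=MIN_FREE_BYTES):
    """Return a list of warning strings for one scratch directory (empty if none)."""
    warnings = []
    try:
        os.makedirs(path, exist_ok=True)
        # Create, write, and read back a small temporary file in the directory.
        with tempfile.TemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.seek(0)
            probe.read()
    except OSError as exc:
        warnings.append(f"{path}: cannot create or read/write files ({exc})")
        return warnings
    # Free space on the filesystem that holds the directory.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        warnings.append(f"{path}: only {free} bytes free, below {min_free_bytes}")
    return warnings
```

Calling check_scratch_dir("/tmp/impala-scratch") on each host before starting impalad would surface the same conditions that Impala only reports as log warnings.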

The following are examples for specifying scratch directories.

Config option                    Description
--scratch_dirs=/dir1,/dir2       Use /dir1 and /dir2 as scratch directories with no capacity quota.
--scratch_dirs=/dir1,/dir2:25G   Use /dir1 and /dir2 as scratch directories with no capacity quota on /dir1 and the 25GB quota on /dir2.
--scratch_dirs=/dir1:5MB,/dir2   Use /dir1 and /dir2 as scratch directories with the capacity quota of 5MB on /dir1 and no quota on /dir2.
--scratch_dirs=/dir1:-1,/dir2:0  Use /dir1 and /dir2 as scratch directories with no capacity quota.
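
To make the quota syntax concrete, here is a rough sketch of how a --scratch_dirs value without priorities could be split into directories and quotas. It only illustrates the rules stated above. The function parse_scratch_dirs is hypothetical, and the quota is kept as a raw string because the accepted unit suffixes are not spelled out here.

```python
def parse_scratch_dirs(value):
    """Split a --scratch_dirs value into (path, quota) pairs.

    quota is the raw quota string (for example "25G"), or None for no quota.
    """
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # The colon separates the directory path from its optional quota.
        path, _, quota = item.partition(":")
        # Per the rules above, a quota of -1 or 0 means "no quota".
        if quota in ("", "-1", "0"):
            quota = None
        entries.append((path, quota))
    return entries
```

For instance, parse_scratch_dirs("/dir1,/dir2:25G") yields [("/dir1", None), ("/dir2", "25G")].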

Allocation from a scratch directory will fail if the specified limit for the directory is exceeded.

If Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error, and the query fails.

Priority Based Scratch Directory Selection

The locations of the intermediate files are configured by starting the impalad daemon with the flag --scratch_dirs="path_to_directory". By default, Impala uses the configured scratch directories in a round-robin fashion. Round-robin selection is not always ideal, because the directories can reside on different classes of storage volumes with different performance characteristics (SSD vs. HDD, local storage vs. network-attached storage, and so on). To optimize your workload, you can configure the priority of each scratch directory based on your storage system configuration.

Scratch directories are selected for spilling according to the priorities you configure. If you give the same priority to multiple directories, those directories are selected in a round-robin fashion.

The valid formats for specifying the priority of a directory are as shown here:

dir-path:limit:priority
dir-path::priority

Example:

/dir1:200GB:0
/dir1::0

The following formats use the default priority:

/dir1
/dir1:200GB
/dir1:200GB:

In the example below, /dir1 is used for spilling until it is full. After that, /dir2, /dir3, and /dir4 are used in a round-robin fashion.

--scratch_dirs="/dir1:200GB:0, /dir2:1024GB:1, /dir3:1024GB:1, /dir4:1024GB:1"
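
The ordering in this example can be reproduced with a short sketch. This is an illustration, not Impala's selection code. It assumes that a lower priority number is used first, which is consistent with the example above. DEFAULT_PRIORITY is a hypothetical stand-in, because the default priority value is not stated here.

```python
from itertools import groupby

# Hypothetical stand-in: the default priority value is not stated in the text.
# Here, entries without a priority sort after all explicit priorities.
DEFAULT_PRIORITY = float("inf")


def parse_entry(entry):
    """Parse 'dir-path[:limit[:priority]]' into (path, limit, priority)."""
    parts = entry.strip().split(":")
    path = parts[0]
    limit = parts[1] if len(parts) > 1 and parts[1] else None
    priority = int(parts[2]) if len(parts) > 2 and parts[2] else DEFAULT_PRIORITY
    return path, limit, priority


def priority_groups(scratch_dirs):
    """Group directory paths by priority, lowest number first (assumed order)."""
    # sorted() is stable, so directories keep their configured order in a group.
    entries = sorted((parse_entry(e) for e in scratch_dirs.split(",")),
                     key=lambda e: e[2])
    return [[path for path, _, _ in group]
            for _, group in groupby(entries, key=lambda e: e[2])]
```

For the flag value shown above, priority_groups returns [["/dir1"], ["/dir2", "/dir3", "/dir4"]]: one group per priority, with round-robin selection applying inside each group.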

Increasing Scratch Capacity

You can compress the data spilled to disk to increase the effective scratch capacity. Compression typically more than doubles the effective capacity and reduces the volume of data spilled to disk. Use the --disk_spill_compression_codec and --disk_spill_punch_holes startup options. The --disk_spill_compression_codec option takes any value supported by the COMPRESSION_CODEC query option. The value is not case-sensitive. A value of ZSTD or LZ4 is recommended (the default is NONE).

For example:

--disk_spill_compression_codec=LZ4
--disk_spill_punch_holes=true

If you set --disk_spill_compression_codec to a value other than NONE, you must set --disk_spill_punch_holes to true.

The hole punching feature supported by many filesystems is used to reclaim space in scratch files during execution of a query that spills to disk. This results in lower scratch space requirements in many cases, especially when combined with disk spill compression. When this option is not enabled, scratch space is still recycled by a query, but less effectively in many cases.

You can specify a compression level for ZSTD only. For example:

--disk_spill_compression_codec=ZSTD:10
--disk_spill_punch_holes=true

Compression levels from 1 to 22 (default 3) are supported for ZSTD. Lower compression levels are faster but achieve a lower compression ratio.
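
The constraints on these two options can be captured in a small helper that builds the flags and rejects invalid combinations. This is a sketch for use in deployment scripts. The function spill_compression_flags is hypothetical and does not check the codec name against the values that COMPRESSION_CODEC accepts.

```python
def spill_compression_flags(codec="NONE", level=None, punch_holes=False):
    """Build the two startup flags, enforcing the constraints described above."""
    codec = codec.upper()  # the codec value is not case-sensitive
    if level is not None:
        if codec != "ZSTD":
            raise ValueError("a compression level can be given for ZSTD only")
        if not 1 <= level <= 22:
            raise ValueError("ZSTD level must be between 1 and 22")
    # Any codec other than NONE requires hole punching to be enabled.
    if codec != "NONE" and not punch_holes:
        raise ValueError("compression requires --disk_spill_punch_holes=true")
    value = codec if level is None else f"{codec}:{level}"
    return [
        f"--disk_spill_compression_codec={value}",
        f"--disk_spill_punch_holes={'true' if punch_holes else 'false'}",
    ]
```

Calling spill_compression_flags("zstd", level=10, punch_holes=True) returns the two flags from the ZSTD example above, while spill_compression_flags("LZ4") raises an error because hole punching is not enabled.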

Configure Impala Daemon to spill to S3

Impala occasionally needs to use persistent storage for writing intermediate files during large sorts, joins, aggregations, or analytic function operations. If your workload results in large volumes of intermediate data being written, it is recommended to configure the heavy spilling queries to use a remote storage location rather than the local one. The advantage of using remote storage for scratch space is that it is elastic and can handle any amount of spilling.

Before you begin

Identify the URL of the S3 bucket to which you want Impala to write the temporary data. If you use the S3 bucket that is associated with the environment, navigate to the S3 bucket and copy the URL. If you want to use an external S3 bucket, you must first configure your environment to use the external S3 bucket with the correct read/write permissions.

Configuring the Start-up Option in Impala daemon

You can use the Impalad start option scratch_dirs to specify the locations of the intermediate files. The format of the option is:

--scratch_dirs="remote_dir, local_buffer_dir (,local_dir…)"

where local_buffer_dir and local_dir conform to the earlier descriptions for scratch directories.

With the option specified above:

Examples

Configure Impala Daemon to spill to HDFS

Impala occasionally needs to use persistent storage for writing intermediate files during large sorts, joins, aggregations, or analytic function operations. If your workload results in large volumes of intermediate data being written, it is recommended to configure the heavy spilling queries to use a remote storage location rather than the local one. The advantage of using remote storage for scratch space is that it is elastic and can handle any amount of spilling.

Before you begin

Configuring the Start-up Option in Impala daemon

You can use the Impalad start option scratch_dirs to specify the locations of the intermediate files.

Use the following format for this start up option:

--scratch_dirs="hdfs://authority/path(:max_bytes), local_buffer_dir (,local_dir…)"

Using the above format:

Examples

Even though max_bytes is optional, setting it is highly recommended when spilling to HDFS, because HDFS cluster space is limited.
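
As an illustration of the format above, the following sketch assembles a --scratch_dirs value with a remote location first, then the local buffer directory, then any further local directories. The function remote_scratch_dirs and all paths and sizes passed to it are hypothetical placeholders, not recommended values.

```python
def remote_scratch_dirs(remote_dir, local_buffer_dir, local_dirs=(), max_bytes=None):
    """Assemble 'remote_dir(:max_bytes), local_buffer_dir (,local_dir...)'."""
    # Append the optional max_bytes limit to the remote location.
    remote = remote_dir if max_bytes is None else f"{remote_dir}:{max_bytes}"
    return ", ".join([remote, local_buffer_dir, *local_dirs])


# Placeholder values only: authority, paths, and the limit are hypothetical.
flag = '--scratch_dirs="' + remote_scratch_dirs(
    "hdfs://authority/path", "/local_buffer_dir", max_bytes="1024GB") + '"'
```

Passing max_bytes explicitly, as in this call, follows the recommendation to cap the space that spilling can consume on the HDFS cluster.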

Configure Impala Daemon to spill to Ozone

Before you begin

Configuring the Start-up Option in Impala daemon

You can use the Impalad start option scratch_dirs to specify the locations of the intermediate files.

--scratch_dirs="ofs://authority/path(:max_bytes), local_buffer_dir (,local_dir…)"

Using the above format:

Examples

Even though max_bytes is optional, setting it is highly recommended when spilling to Ozone, because Ozone cluster space is limited.