Although Impala typically works with many large files in an HDFS storage system with plenty of capacity, there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques to minimize space consumption and file duplication.
Use compact binary file formats where practical. Numeric and time-based data in particular can be stored in more compact form in binary data files. Depending on the file format, various compression and encoding features can reduce file size even further. You can specify the STORED AS clause as part of the CREATE TABLE statement, or use ALTER TABLE with the SET FILEFORMAT clause for an existing table or partition within a partitioned table. See How Impala Works with Hadoop File Formats for details about file formats, especially Using the Parquet File Format with Impala Tables. See CREATE TABLE Statement and ALTER TABLE Statement for syntax details.
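As an illustration of both approaches, the following sketch writes the two statements to a SQL file; the table, partition, and file names here are hypothetical, not taken from the original text.

```shell
# Sketch only: table and partition names are hypothetical examples.
cat <<'EOF' > /tmp/compact_formats.sql
-- Store a new table in the compact Parquet format at creation time
CREATE TABLE sales_parquet STORED AS PARQUET
  AS SELECT * FROM sales_staging;

-- Switch the file format of one partition of an existing table
ALTER TABLE sales PARTITION (year = 2024)
  SET FILEFORMAT PARQUET;
EOF
# On a live cluster you could run: impala-shell -f /tmp/compact_formats.sql
cat /tmp/compact_formats.sql
```

Newly inserted data uses the new format; existing data files in the partition keep their original format until rewritten.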
You manage underlying data files differently depending on whether the corresponding Impala table is defined as an internal or external table:
- Use the DESCRIBE FORMATTED statement to check if a particular table is internal (managed by Impala) or external, and to see the physical location of the data files in HDFS. See DESCRIBE Statement for details.
- For Impala-managed ("internal") tables, use DROP TABLE statements to remove data files. See DROP TABLE Statement for details.
- For tables not managed by Impala ("external" tables), use appropriate HDFS-related commands such as hadoop fs, hdfs dfs, or distcp, to create, move, copy, or delete files within HDFS directories that are accessible by the impala user. Issue a REFRESH table_name statement after adding or removing any files from the data directory of an external table. See REFRESH Statement for details.
- Use the LOAD DATA statement to move HDFS files into the data directory for an Impala table from inside Impala, without the need to specify the HDFS path of the destination directory. This technique works for both internal and external tables. See LOAD DATA Statement for details.
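The external-table workflow above can be sketched as a short script; the HDFS paths and table name below are hypothetical, and the script is only written out here rather than run against a cluster.

```shell
# Sketch only: paths and table name are hypothetical examples.
cat <<'EOF' > /tmp/refresh_external.sh
#!/bin/sh
# Copy a new data file into the external table's data directory
hdfs dfs -put /local/data/part-004.parq /user/impala/external/sales/
# Tell Impala to pick up the newly added file
impala-shell -q 'REFRESH sales_external'
EOF
cat /tmp/refresh_external.sh
```

Without the REFRESH, Impala continues to plan queries against its cached list of data files and does not see the new file.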
Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space might not be reclaimed for use by other files until sometime later, when the trashcan is emptied. See DROP TABLE Statement for details. See User Account Requirements for permissions needed for the HDFS trashcan to operate correctly.
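The trashcan behavior can be seen with the standard hdfs dfs commands; the sketch below (paths are hypothetical) notes which deletions go through the trash, and is written to a file here rather than run against a cluster.

```shell
# Sketch only: paths are hypothetical examples.
cat <<'EOF' > /tmp/trash_notes.sh
#!/bin/sh
# Moves the file into the user's .Trash directory; space is
# reclaimed only later, when the trash is emptied.
hdfs dfs -rm /user/impala/warehouse/old_table/part-000.parq
# Bypasses the trash and frees the space immediately.
hdfs dfs -rm -skipTrash /user/impala/warehouse/old_table/part-001.parq
# Forces an immediate emptying of the trash.
hdfs dfs -expunge
EOF
cat /tmp/trash_notes.sh
```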
Drop all tables in a database before dropping the database itself. See DROP DATABASE Statement for details.
Clean up temporary files after failed INSERT statements. If an INSERT statement encounters an error, and you see a directory named .impala_insert_staging or _impala_insert_staging left behind in the data directory for the table, it might contain temporary data files taking up space in HDFS. You might be able to salvage these data files, for example if they are complete but could not be moved into place due to a permission error. Or, you might delete those files through commands such as hadoop fs or hdfs dfs to reclaim space before retrying the INSERT. Issue DESCRIBE FORMATTED table_name to see the HDFS path where you can check for temporary files.
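A cleanup pass for such leftovers might look like the following sketch; the table name and warehouse path are hypothetical, and the commands are written to a file here rather than run against a cluster.

```shell
# Sketch only: table name and HDFS path are hypothetical examples.
cat <<'EOF' > /tmp/cleanup_staging.sh
#!/bin/sh
# Find the table's data directory (look for the Location: field)
impala-shell -q 'DESCRIBE FORMATTED sales'
# Check whether a failed INSERT left a staging directory behind
hdfs dfs -ls /user/impala/warehouse/sales/_impala_insert_staging
# Reclaim the space before retrying the INSERT
hdfs dfs -rm -r /user/impala/warehouse/sales/_impala_insert_staging
EOF
cat /tmp/cleanup_staging.sh
```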
If you use the Amazon Simple Storage Service (S3) as a place to offload data to reduce the volume of local storage, Impala 2.2.0 and higher can query the data directly from S3. See Using Impala with Amazon S3 Object Store for details.
You can specify the locations of these intermediate files with the --scratch_dirs="path_to_directory" configuration option. By default, intermediate files are stored in the directory /tmp/impala-scratch. A capacity quota of -1 or 0 is the same as no quota for the directory.
If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its log.
Impala successfully starts (with a warning written to the log) if it cannot create or read and write files in one of the scratch directories.
| Config option | Description |
|---|---|
| --scratch_dirs=/dir1,/dir2 | Use /dir1 and /dir2 as scratch directories with no capacity quota. |
| --scratch_dirs=/dir1,/dir2:25G | Use /dir1 and /dir2 as scratch directories, with no capacity quota on /dir1 and a 25 GB quota on /dir2. |
| --scratch_dirs=/dir1:5MB,/dir2 | Use /dir1 and /dir2 as scratch directories, with a capacity quota of 5 MB on /dir1 and no quota on /dir2. |
| --scratch_dirs=/dir1:-1,/dir2:0 | Use /dir1 and /dir2 as scratch directories with no capacity quota. |
Allocation from a scratch directory will fail if the specified limit for the directory is exceeded.
If Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error, and the query fails.
The location of the intermediate files is configured by starting the impalad daemon with the flag --scratch_dirs="path_to_directory". Currently this startup flag uses the configured scratch directories in a round-robin fashion. Automatic round-robin selection of scratch directories may not be ideal in every situation, since these directories could come from different classes of storage volumes with different performance characteristics (SSD vs. HDD, local storage vs. network-attached storage, and so on). To optimize your workload, you can configure the priority of the scratch directories based on your storage system configuration.
Scratch directories are selected for spilling according to the priorities you configure; if multiple directories have the same priority, they are selected in a round-robin fashion.
A scratch directory with an explicit priority uses one of the following formats:

dir-path:limit:priority
dir-path::priority

Example:

/dir1:200GB:0
/dir1::0

A directory specified in any of the following formats is assigned the default priority:

/dir1
/dir1:200GB
/dir1:200GB:
In the example below, dir1 will be used as a spill victim until it is full and then dir2, dir3, and dir4 will be used in a round robin fashion.
--scratch_dirs="/dir1:200GB:0, /dir2:1024GB:1, /dir3:1024GB:1, /dir4:1024GB:1"
You can compress the data spilled to disk to increase the effective scratch capacity. Compression typically more than doubles the capacity and reduces spilling to disk. Use the --disk_spill_compression_codec and --disk_spill_punch_holes startup options. The --disk_spill_compression_codec option takes any value supported by the COMPRESSION_CODEC query option. The value is not case-sensitive. A value of ZSTD or LZ4 is recommended (the default is NONE).
For example:
--disk_spill_compression_codec=LZ4
--disk_spill_punch_holes=true
If you set --disk_spill_compression_codec to a value other than NONE, you must set --disk_spill_punch_holes to true.
The hole punching feature supported by many filesystems is used to reclaim space in scratch files during execution of a query that spills to disk. This results in lower scratch space requirements in many cases, especially when combined with disk spill compression. When this option is not enabled, scratch space is still recycled by a query, but less effectively in many cases.
You can specify a compression level for ZSTD only. For example:

--disk_spill_compression_codec=ZSTD:10
--disk_spill_punch_holes=true

Compression levels from 1 up to 22 (default 3) are supported for ZSTD. The lower the compression level, the faster the speed, at the cost of compression ratio.
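Putting the spill options together, a flags file for impalad might combine prioritized scratch directories with compressed spilling; the directory paths below are hypothetical examples.

```shell
# Sketch: hypothetical impalad flags combining prioritized scratch
# directories (SSD first, then two HDDs round-robin) with compressed
# spill-to-disk. Paths are illustrative only.
cat <<'EOF' > /tmp/impalad_flags
--scratch_dirs=/ssd/scratch:100GB:0,/hdd/scratch1:1024GB:1,/hdd/scratch2:1024GB:1
--disk_spill_compression_codec=ZSTD:3
--disk_spill_punch_holes=true
EOF
cat /tmp/impalad_flags
```

Hole punching must be enabled because the compression codec is not NONE, per the rule above.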
Impala occasionally needs to use persistent storage for writing intermediate files during large sorts, joins, aggregations, or analytic function operations. If your workload results in large volumes of intermediate data being written, it is recommended to configure the heavy spilling queries to use a remote storage location rather than the local one. The advantage of using remote storage for scratch space is that it is elastic and can handle any amount of spilling.
Before you begin
Identify the URL for an S3 bucket to which you want your new Impala to write the temporary data. If you use the S3 bucket that is associated with the environment, navigate to the S3 bucket and copy the URL. If you want to use an external S3 bucket, you must first configure your environment to use the external S3 bucket with the correct read/write permissions.
Configuring the Start-up Option in Impala daemon
You can use the Impalad start option scratch_dirs to specify the locations of the intermediate files. The format of the option is:
--scratch_dirs="remote_dir, local_buffer_dir (,local_dir…)"
where local_buffer_dir and local_dir conform to the earlier descriptions for scratch directories.
With the option specified above, you can set the size of a remote intermediate file with --remote_tmp_file_size=size in the start-up option. The default size of a remote intermediate file is 16MB, and the maximum is 512MB.

Examples
--scratch_dirs=s3a://remote_dir,/local_buffer_dir --remote_tmp_file_size=64M
--scratch_dirs=s3a://remote_dir,/local_buffer_dir:256M,/local_dir:10G
--scratch_dirs=s3a://remote_dir,/local_buffer_dir,/local_dir_1:5G:1,/local_dir_2:5G:2
You can also configure Impala to use a remote HDFS location, rather than local storage, as scratch space for heavy spilling queries.
Before you begin
Configuring the Start-up Option in Impala daemon
You can use the Impalad start option scratch_dirs to specify the locations of the intermediate files. Use the following format for this start-up option:
--scratch_dirs="hdfs://authority/path(:max_bytes), local_buffer_dir (,local_dir…)"
hdfs://authority/path is the remote directory, where authority may include ip_address or hostname and port, or a service_id.

Using the above format, you can set the size of a remote intermediate file with --remote_tmp_file_size=size in the start-up option. The default size of a remote intermediate file is 16MB, and the maximum is 512MB.

Examples
--scratch_dirs=hdfs://10.0.0.49:20500/tmp:300G,/local_buffer_dir --remote_tmp_file_size=64M
--scratch_dirs=hdfs://hdfsnn/tmp:300G,/local_buffer_dir:512M,/local_dir:10G
The following example specifies the remote directory using an HDFS service id, hdfs1.
--scratch_dirs=hdfs://hdfs1/tmp,/local_buffer_dir,/local_dir_1:5G:1,/local_dir_2:5G:2
Even though max_bytes is optional, it is highly recommended to configure it when spilling to HDFS because the HDFS cluster space is limited.
Before you begin
Configuring the Start-up Option in Impala daemon
You can use the Impalad start option scratch_dirs to specify the locations of the intermediate files. Use the following format for this start-up option:
--scratch_dirs="ofs://authority/path(:max_bytes), local_buffer_dir (,local_dir…)"
ofs://authority/path is the remote directory, where authority may include ip_address or hostname and port, or a service_id. max_bytes is optional.

Using the above format, you can set the size of a remote intermediate file with --remote_tmp_file_size=size in the start-up option. The default size of a remote intermediate file is 16MB, and the maximum is 512MB.

Examples
--scratch_dirs=ofs://10.0.0.49:29000/tmp:300G,/local_buffer_dir --remote_tmp_file_size=64M
--scratch_dirs=ofs://ozonemgr/tmp:300G,/local_buffer_dir:512M,/local_dir:10G
The following example specifies the remote directory using an Ozone service id, ozone1.
--scratch_dirs=ofs://ozone1/tmp,/local_buffer_dir,/local_dir_1:5G:1,/local_dir_2:5G:2
Even though max_bytes is optional, it is highly recommended to configure it when spilling to Ozone because the Ozone cluster space is limited.