PARQUET_FILE_SIZE Query Option
Specifies the maximum size of each Parquet data file produced by Impala INSERT
statements.
Syntax:
Specify the size in bytes, or with a trailing m or g character to indicate megabytes or gigabytes. For example:
-- 128 megabytes.
set PARQUET_FILE_SIZE=134217728;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;
-- 512 megabytes.
set PARQUET_FILE_SIZE=512m;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;
-- 1 gigabyte.
set PARQUET_FILE_SIZE=1g;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;
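The byte count in the first example is 128 multiplied by 1024 twice, so the m and g suffixes in these examples are read as binary units. As a quick check, the following sketch computes the byte equivalents of the three sizes used above; the column aliases are only illustrative.

```sql
-- Byte equivalents of the sizes used in the examples above.
SELECT 128 * 1024 * 1024  AS bytes_128m,  -- 134217728
       512 * 1024 * 1024  AS bytes_512m,  -- 536870912
       1024 * 1024 * 1024 AS bytes_1g;    -- 1073741824
```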
Usage notes:
With tables that are small or finely partitioned, the default Parquet block size (formerly 1 GB, now 256 MB in Impala 2.0 and later) could be much larger than needed for each data file. For INSERT operations into such tables, you can increase parallelism by specifying a smaller PARQUET_FILE_SIZE value, resulting in more HDFS blocks that can be processed by different nodes.
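As a minimal sketch of this technique, the following statements lower the file size before a dynamic partition insert. The table names (sales_staging, sales_parquet), the column names, and the 64 MB value are hypothetical and chosen only for illustration. Whether the insert actually produces more files depends on how much data each node writes per partition.

```sql
-- Hypothetical tables: sales_staging is the source, and
-- sales_parquet is a Parquet table partitioned by year.
-- 64 MB is an illustrative value, not a recommendation.
set PARQUET_FILE_SIZE=64m;
INSERT INTO sales_parquet PARTITION (year)
  SELECT id, amount, year FROM sales_staging;
```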
Type: numeric, with optional unit specifier
Currently, the maximum value for this setting is 1 gigabyte (1g). Setting a value higher than 1 gigabyte could result in errors during an INSERT operation.
Default: 0 (produces files with a target size of 256 MB; files might be larger for very wide tables)
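Because the setting stays in effect for the rest of the session, it can help to return to the default once a load with a custom size has finished. A minimal sketch, reusing the table names from the earlier examples:

```sql
-- Use a custom size for one load.
set PARQUET_FILE_SIZE=512m;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;
-- Return to the default (0), so later INSERT statements
-- target 256 MB files again.
set PARQUET_FILE_SIZE=0;
```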
Because ADLS does not expose the block sizes of data files the way HDFS does, any Impala INSERT or CREATE TABLE AS SELECT statements use the PARQUET_FILE_SIZE query option setting to define the size of Parquet data files. (Using a large block size is more important for Parquet tables than for tables that use other file formats.)
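The following sketch shows how the option might be set explicitly before a CREATE TABLE AS SELECT statement that writes to ADLS. The new table name and the LOCATION value, including the account name and path, are placeholders to replace with values for your own store.

```sql
-- Placeholder table name and location; substitute your own
-- ADLS account and path.
set PARQUET_FILE_SIZE=256m;
CREATE TABLE adls_parquet_copy
  STORED AS PARQUET
  LOCATION 'adl://youraccount.azuredatalakestore.net/path/to/adls_parquet_copy'
  AS SELECT * FROM text_table;
```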
Isilon considerations:
The PARQUET_FILE_SIZE query option has no effect when Impala inserts data into a table or partition residing on Isilon storage. Use the isi command to set the default block size globally on the Isilon device. For example, to set the Isilon default block size to 256 MB, the recommended size for Parquet data files for Impala, issue the following command:
isi hdfs settings modify --default-block-size=256MB
Ozone considerations:
Because Apache Ozone storage buckets use a global value for the block size rather than a configurable value for each file, the PARQUET_FILE_SIZE query option has no effect when Impala inserts data into a table or partition residing on Ozone storage.
Related information:
For information about the Parquet file format, and how the number and size of data files affects query performance, see Using the Parquet File Format with Impala Tables.