CHAR Data Type (Impala 2.0 or higher only)
A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If values are longer than the specified length, Impala truncates any trailing characters.
Syntax:
In the column definition of a CREATE TABLE
statement:
column_name CHAR(length)
The maximum length you can specify is 255.
Semantics of trailing spaces:
-
When you store a
CHAR
value shorter than the specified length in a table, queries return the value padded with trailing spaces if necessary; the resulting value has the same length as specified in the column definition. -
If you store a
CHAR
value containing trailing spaces in a table, those trailing spaces are not stored in the data file. When the value is retrieved by a query, the result could have a different number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the specified length of the column. -
If you compare two
CHAR
values that differ only in the number of trailing spaces, those values are considered identical.
Partitioning: This type can be used for partition key columns. Because of the efficiency advantage
of numeric values over character-based values, if the partition key is a string representation of a number,
prefer to use an integer type with sufficient range (INT
, BIGINT
, and so
on) where practical.
HBase considerations: This data type cannot be used with HBase tables.
Parquet considerations:
- This type can be read from and written to Parquet files.
- There is no requirement for a particular level of Parquet.
- Parquet files generated by Impala and containing this type can be freely interchanged with other components such as Hive and MapReduce.
- Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
-
Parquet data files might contain values that are longer than allowed by the
CHAR(n)
length limit. Impala ignores any extra trailing characters when it processes those values during a query.
Text table considerations:
Text data files might contain values that are longer than allowed for a particular
CHAR(n)
column. Any extra trailing characters are ignored when Impala
processes those values during a query. Text data files can also contain values that are shorter than the
defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
files produced by Impala INSERT
statements do not include any trailing blanks for
CHAR
columns.
Avro considerations:
The Avro specification allows string values up to 2**64 bytes in length.
Impala queries for Avro tables use 32-bit integers to hold string lengths.
In Impala 2.5 and higher, Impala truncates CHAR
and VARCHAR
values in Avro tables to (2**31)-1 bytes.
If a query encounters a STRING
value longer than (2**31)-1
bytes in an Avro table, the query fails. In earlier releases,
encountering such long values in an Avro table could cause a crash.
Compatibility:
This type is available using Impala 2.0 or higher.
Some other database systems make the length specification optional. For Impala, the length is required.
Internal details: Represented in memory as a byte array with the same size as the length specification. Values that are shorter than the specified length are padded on the right with trailing spaces.
Added in: Impala 2.0.0
Column statistics considerations: Because this type has a fixed size, the maximum and average size
fields are always filled in for column statistics, even before you run the COMPUTE STATS
statement.
UDF considerations: This type cannot be used for the argument or return type of a user-defined function (UDF) or user-defined aggregate function (UDA).
Examples:
These examples show how trailing spaces are not considered significant when comparing or processing
CHAR
values. CAST()
truncates any longer string to fit within the defined
length. If a CHAR
value is shorter than the specified length, it is padded on the right with
spaces until it matches the specified length. Therefore, LENGTH()
represents the length
including any trailing spaces, and CONCAT()
also treats the column value as if it has
trailing spaces.
select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
+--------------------------+
| unpadded equal to padded |
+--------------------------+
| true |
+--------------------------+
create table char_length(c char(3));
insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
select concat("[",c,"]") as c, length(c) from char_length;
+-------+-----------+
| c | length(c) |
+-------+-----------+
| [1 ] | 3 |
| [12 ] | 3 |
| [123] | 3 |
| [123] | 3 |
+-------+-----------+
This example shows a case where data values are known to have a specific length, where CHAR
is a logical data type to use.
create table addresses
(id bigint,
street_name string,
state_abbreviation char(2),
country_abbreviation char(2));
The following example shows how values written by Impala do not physically include the trailing spaces. It
creates a table using text format, with CHAR
values much shorter than the declared length,
and then prints the resulting data file to show that the delimited values are not separated by spaces. The
same behavior applies to binary-format Parquet data files.
create table char_in_text (a char(20), b char(30), c char(40))
row format delimited fields terminated by ',';
insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
-- Running this Linux command inside impala-shell using the ! shortcut.
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
foo,bar,baz
hello,goodbye,aloha
The following example further illustrates the treatment of spaces. It replaces the contents of the previous table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
trailing, leading and trailing, leading
select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
+------------------------+----------------------------------+--------------------------------------------+
| a | b | c |
+------------------------+----------------------------------+--------------------------------------------+
| [trailing ] | [ leading and trailing ] | [ leading ] |
+------------------------+----------------------------------+--------------------------------------------+
Kudu considerations:
Currently, the data types CHAR
, VARCHAR
,
ARRAY
, MAP
, and STRUCT
cannot be used with Kudu tables.
Restrictions:
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
scalability reasons avoid declaring CHAR
columns that are much longer than typical values in
that column.
All data in CHAR
and VARCHAR
columns must be in a character encoding that
is compatible with UTF-8. If you have binary data from another database system (that is, a BLOB type), use
a STRING
column to hold it.
When an expression compares a CHAR
with a STRING
or
VARCHAR
, the CHAR
value is implicitly converted to STRING
first, with trailing spaces preserved.
select cast("foo " as char(5)) = 'foo' as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| false |
+----------------------+
This behavior differs from other popular database systems. To get the expected result of
TRUE
, cast the expressions on both sides to CHAR
values of the appropriate
length:
select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| true |
+----------------------+
This behavior is subject to change in future releases.
Related information:
STRING Data Type, VARCHAR Data Type (Impala 2.0 or higher only), String Literals, Impala String Functions