Impala has traditionally treated the STRING data type as a single-byte binary character set, with character data assumed to be ASCII-encoded. Prior to this release, some Impala string functions were incompatible with Hive when applied to non-ASCII strings. For example, length() in Impala returned the number of bytes in the string, while length() in Hive returns the number of UTF-8 characters. UTF-8 characters (code points) are encoded as variable-length byte sequences of 1 to 4 bytes, so the results differ whenever the string contains non-ASCII characters. This release adds UTF-8 aware behavior for the Impala STRING type, enabled through a query option, so that results on UTF-8 strings are consistent with Hive.
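The difference between counting bytes and counting code points can be sketched outside of Impala. The following Python snippet is only an illustration of the two counting rules, not Impala's or Hive's implementation; the helper names byte_length and code_point_length are hypothetical and exist only for this example.

```python
def byte_length(s: str) -> int:
    """Length in bytes of the UTF-8 encoding of s (byte-oriented counting)."""
    return len(s.encode("utf-8"))


def code_point_length(s: str) -> int:
    """Length in Unicode code points of s (UTF-8 aware counting)."""
    return len(s)


# Samples use escape sequences so the exact code points are unambiguous:
# ASCII only, one 2-byte character, two 3-byte characters, one 4-byte character.
samples = ["hello", "h\u00e9llo", "\u4f60\u597d", "\U0001F600"]

for text in samples:
    print(repr(text), "bytes:", byte_length(text), "code points:", code_point_length(text))
```

For the pure ASCII sample the two counts agree. For every other sample the byte count is larger, which is the kind of mismatch described above for length().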
UTF-8 support allows you to read and write UTF-8 data in standard formats such as Parquet and ORC, which improves interoperability with other engines that support those formats.
You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The query option can be set globally or at the session level. Only queries run with UTF8_MODE=true have UTF-8 aware behavior.
Setting UTF8_MODE=true turns on the UTF-8 aware behavior of the following string functions: