UTF-8 Support

Impala has traditionally treated STRING data as a single-byte binary character set, with character data encoded in ASCII. Prior to this release, some Impala functions were incompatible with Hive when applied to non-ASCII strings. For example, length() in Impala used to return the number of bytes in the string, while length() in Hive returns the number of UTF-8 characters in the string. UTF-8 characters (code points) are encoded as variable-length sequences of 1 to 4 bytes, so the results differ whenever the string contains non-ASCII characters. This release adds UTF-8 aware behavior for the Impala STRING type, enabled through a query option, so that results on UTF-8 strings are consistent with Hive.
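The difference between the two length() semantics can be sketched outside Impala. The following Python snippet is only an illustration of byte counting versus code point counting; it is not Impala code, and the variable names are chosen for this example.

```python
# Illustration only: Python stands in for the two length() semantics.
s = "最快的SQL引擎"  # 5 CJK characters (3 bytes each in UTF-8) plus 3 ASCII letters

byte_length = len(s.encode("utf-8"))  # counts bytes, like the original binary behavior
char_length = len(s)                  # counts code points, like Hive's length()

# Width in bytes of individual code points: 1 to 4 bytes each.
widths = {ch: len(ch.encode("utf-8")) for ch in ["A", "é", "最", "😀"]}

print(byte_length)  # 18
print(char_length)  # 8
print(widths)       # {'A': 1, 'é': 2, '最': 3, '😀': 4}
```

For a pure ASCII string both counts are equal, which is why the two behaviors only diverge on non-ASCII data.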

UTF-8 support allows you to read and write UTF-8 from standard formats like Parquet and ORC, thus improving interoperability with other engines that also support those standard formats.

Turning ON the UTF-8 behavior

You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The query option can be set globally or at the session level. Only queries that run with UTF8_MODE=true have UTF-8 aware behavior.

Note:

If the query option UTF8_MODE is turned on globally, existing queries that depend on the original binary behavior need to explicitly set UTF8_MODE=false.
Impala daemons should be deployed on nodes that use the same Glibc version, because different Glibc versions support different versions of the Unicode standard. Also ensure that the en_US.UTF-8 locale is installed on every node. Using different Glibc versions might result in inconsistent UTF-8 behavior when UTF8_MODE is set to true.

List of STRING Functions

The UTF8_MODE query option turns on the UTF-8 aware behavior of the following string functions:

LENGTH(STRING a)
- returns the number of UTF-8 characters instead of bytes
SUBSTR(STRING a, INT start [, INT len])
SUBSTRING(STRING a, INT start [, INT len])
- the substring start position and length are counted in UTF-8 characters instead of bytes
REVERSE(STRING a)
- the unit of the operation is a UTF-8 character, i.e., it does not reverse the bytes inside a UTF-8 character.
  
  Note: The result of reverse("最快的SQL引擎") used to be "��敼�LQS��竿倜�" and is now "擎引LQS的快最".
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
- These functions have an optional position argument. The return values are also positions in the string. In UTF-8 mode, these positions are counted in UTF-8 characters instead of bytes.
mask functions
- The unit of the operation is a UTF-8 character, i.e., they do not mask the string byte by byte.
UPPER/LOWER/INITCAP
- These functions recognize non-ASCII characters and transform them based on the current locale used by the Impala process.
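The byte-based and character-based semantics of the functions listed above can be sketched in Python. This is only an illustration, not Impala code: the helper names (substr_chars, substr_bytes, instr_chars, instr_bytes) are hypothetical and use 1-based positions to mirror the SQL functions.

```python
# Illustration only: contrasts character units with byte units.
s = "最快的SQL引擎"
raw = s.encode("utf-8")  # the same string as UTF-8 bytes


def substr_chars(text, start, length):
    """1-based substring counted in code points."""
    return text[start - 1:start - 1 + length]


def substr_bytes(data, start, length):
    """1-based substring counted in bytes."""
    return data[start - 1:start - 1 + length]


def instr_chars(text, sub):
    """1-based position in code points, 0 if not found."""
    return text.find(sub) + 1


def instr_bytes(data, sub):
    """1-based position in bytes, 0 if not found."""
    return data.find(sub) + 1


char_sub = substr_chars(s, 4, 3)    # characters 4..6
byte_sub = substr_bytes(raw, 4, 3)  # bytes 4..6, which hold the second character

char_pos = instr_chars(s, "SQL")    # position counted in characters
byte_pos = instr_bytes(raw, b"SQL") # position counted in bytes

char_reversed = s[::-1]             # reverses whole characters
# Reversing bytes breaks multi-byte sequences; invalid bytes decode as U+FFFD.
byte_reversed = raw[::-1].decode("utf-8", errors="replace")

print(char_sub, byte_sub.decode("utf-8"))  # SQL 快
print(char_pos, byte_pos)                  # 4 10
print(char_reversed)                       # 擎引LQS的快最
```

The same arguments select different parts of the string depending on the unit, which is why queries that rely on byte offsets need UTF8_MODE=false.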

Limitations

Use the UTF8_MODE option only when needed, because UTF-8 mode is not yet optimized for performance. It is an experimental feature.
UTF-8 support for the CHAR and VARCHAR types is not implemented yet, so VARCHAR(N) still returns N bytes instead of N UTF-8 characters.
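The practical effect of a byte-based length limit can be sketched in Python. This only illustrates the arithmetic of N bytes versus N characters; it does not reproduce how Impala itself stores or truncates VARCHAR values, and the limit N = 4 is an arbitrary value chosen for the example.

```python
# Illustration only: an N-byte limit versus an N-character limit.
s = "引擎SQL"
n = 4  # hypothetical limit, as in VARCHAR(4)

first_n_chars = s[:n]                  # 4 characters
first_n_bytes = s.encode("utf-8")[:n]  # 4 bytes: one full character plus a partial one

# Count how many complete characters fit in the byte-limited prefix.
complete_chars = first_n_bytes.decode("utf-8", errors="ignore")

print(first_n_chars)        # 引擎SQ
print(len(first_n_bytes))   # 4
print(complete_chars)       # 引
```

With multi-byte characters, an N-byte limit holds fewer than N characters, so column widths sized in characters may be too small for non-ASCII data.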