STRING
, VARCHAR
, or CHAR
data types,
for example to measure the length of a string or concatenate two strings together.
STRING
arguments also accept the
VARCHAR
and CHAR
types introduced in Impala 2.0.
VARCHAR
or CHAR
values are passed to a
function that returns a string value, the return type is normalized to
STRING
. For example, a call to CONCAT()
with a mix
of STRING
, VARCHAR
, and CHAR
arguments produces a STRING
result.
Related information:
The string functions operate mainly on these data types: STRING Data Type, VARCHAR Data Type (Impala 2.0 or higher only), and CHAR Data Type (Impala 2.0 or higher only).
Function reference:
Impala supports the following string functions:
Return type: INT
Return type: STRING
Usage notes:
For general information about Base64 encoding, see Base64 article on Wikipedia.
The functions BASE64ENCODE()
and BASE64DECODE()
are
typically used in combination, to store in an Impala table string data that is
problematic to store or transmit. For example, you could use these functions to store
string data that uses an encoding other than UTF-8, or to transform the values in
contexts that require ASCII values, such as for partition key columns. Keep in mind that
base64-encoded values produce different results for string functions such as
LENGTH()
, MAX()
, and MIN()
than when
those functions are called with the unencoded string values.
The set of characters that can be generated as output from
BASE64ENCODE()
, or specified in the argument string to
BASE64DECODE()
, are the ASCII uppercase and lowercase letters (A-Z,
a-z), digits (0-9), and the punctuation characters +
,
/
, and =
.
All return values produced by BASE64ENCODE()
are a multiple of 4 bytes
in length. All argument values supplied to BASE64DECODE()
must also be
a multiple of 4 bytes in length. If a base64-encoded value would otherwise have a
different length, it can be padded with trailing =
characters to reach
a length that is a multiple of 4 bytes.
If the argument string to BASE64DECODE()
does not represent a valid
base64-encoded value, subject to the constraints of the Impala implementation such as
the allowed character set, the function returns NULL
.
Examples:
BASE64ENCODE()
and
BASE64DECODE()
together to store and retrieve string values:
-- An arbitrary string can be encoded in base 64.
-- The length of the output is a multiple of 4 bytes,
-- padded with trailing = characters if necessary.
select base64encode('hello world') as encoded,
length(base64encode('hello world')) as length;
+------------------+--------+
| encoded | length |
+------------------+--------+
| aGVsbG8gd29ybGQ= | 16 |
+------------------+--------+
-- Passing an encoded value to base64decode() produces
-- the original value.
select base64decode('aGVsbG8gd29ybGQ=') as decoded;
+-------------+
| decoded |
+-------------+
| hello world |
+-------------+
These examples demonstrate incorrect encoded values that produce NULL
return values when decoded:
-- The input value to base64decode() must be a multiple of 4 bytes.
-- In this case, leaving off the trailing = padding character
-- produces a NULL return value.
select base64decode('aGVsbG8gd29ybGQ') as decoded;
+---------+
| decoded |
+---------+
| NULL |
+---------+
WARNINGS: UDF WARNING: Invalid base64 string; input length is 15,
which is not a multiple of 4.
-- The input to base64decode() can only contain certain characters.
-- The $ character in this case causes a NULL return value.
select base64decode('abc$');
+----------------------+
| base64decode('abc$') |
+----------------------+
| NULL |
+----------------------+
WARNINGS: UDF WARNING: Could not base64 decode input in space 4; actual output length 0
These examples demonstrate "round-tripping" of an original string to an encoded
string, and back again. This technique is applicable if the original source is in an
unknown encoding, or if some intermediate processing stage might cause national
characters to be misrepresented:
select 'circumflex accents: â, ê, î, ô, û' as original,
base64encode('circumflex accents: â, ê, î, ô, û') as encoded;
+-----------------------------------+------------------------------------------------------+
| original | encoded |
+-----------------------------------+------------------------------------------------------+
| circumflex accents: â, ê, î, ô, û | Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= |
+-----------------------------------+------------------------------------------------------+
select base64encode('circumflex accents: â, ê, î, ô, û') as encoded,
base64decode(base64encode('circumflex accents: â, ê, î, ô, û')) as decoded;
+------------------------------------------------------+-----------------------------------+
| encoded | decoded |
+------------------------------------------------------+-----------------------------------+
| Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= | circumflex accents: â, ê, î, ô, û |
+------------------------------------------------------+-----------------------------------+
Return type: STRING
Usage notes:
For general information about Base64 encoding, see Base64 article on Wikipedia.
The functions BASE64ENCODE()
and BASE64DECODE()
are
typically used in combination, to store in an Impala table string data that is
problematic to store or transmit. For example, you could use these functions to store
string data that uses an encoding other than UTF-8, or to transform the values in
contexts that require ASCII values, such as for partition key columns. Keep in mind that
base64-encoded values produce different results for string functions such as
LENGTH()
, MAX()
, and MIN()
than when
those functions are called with the unencoded string values.
The set of characters that can be generated as output from
BASE64ENCODE()
, or specified in the argument string to
BASE64DECODE()
, are the ASCII uppercase and lowercase letters (A-Z,
a-z), digits (0-9), and the punctuation characters +
,
/
, and =
.
All return values produced by BASE64ENCODE()
are a multiple of 4 bytes
in length. All argument values supplied to BASE64DECODE()
must also be
a multiple of 4 bytes in length. If a base64-encoded value would otherwise have a
different length, it can be padded with trailing =
characters to reach
a length that is a multiple of 4 bytes.
Examples:
BASE64ENCODE()
and
BASE64DECODE()
together to store and retrieve string values:
-- An arbitrary string can be encoded in base 64.
-- The length of the output is a multiple of 4 bytes,
-- padded with trailing = characters if necessary.
select base64encode('hello world') as encoded,
length(base64encode('hello world')) as length;
+------------------+--------+
| encoded | length |
+------------------+--------+
| aGVsbG8gd29ybGQ= | 16 |
+------------------+--------+
-- Passing an encoded value to base64decode() produces
-- the original value.
select base64decode('aGVsbG8gd29ybGQ=') as decoded;
+-------------+
| decoded |
+-------------+
| hello world |
+-------------+
These examples demonstrate incorrect encoded values that produce NULL
return values when decoded:
-- The input value to base64decode() must be a multiple of 4 bytes.
-- In this case, leaving off the trailing = padding character
-- produces a NULL return value.
select base64decode('aGVsbG8gd29ybGQ') as decoded;
+---------+
| decoded |
+---------+
| NULL |
+---------+
WARNINGS: UDF WARNING: Invalid base64 string; input length is 15,
which is not a multiple of 4.
-- The input to base64decode() can only contain certain characters.
-- The $ character in this case causes a NULL return value.
select base64decode('abc$');
+----------------------+
| base64decode('abc$') |
+----------------------+
| NULL |
+----------------------+
WARNINGS: UDF WARNING: Could not base64 decode input in space 4; actual output length 0
These examples demonstrate "round-tripping" of an original string to an encoded
string, and back again. This technique is applicable if the original source is in an
unknown encoding, or if some intermediate processing stage might cause national
characters to be misrepresented:
select 'circumflex accents: â, ê, î, ô, û' as original,
base64encode('circumflex accents: â, ê, î, ô, û') as encoded;
+-----------------------------------+------------------------------------------------------+
| original | encoded |
+-----------------------------------+------------------------------------------------------+
| circumflex accents: â, ê, î, ô, û | Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= |
+-----------------------------------+------------------------------------------------------+
select base64encode('circumflex accents: â, ê, î, ô, û') as encoded,
base64decode(base64encode('circumflex accents: â, ê, î, ô, û')) as decoded;
+------------------------------------------------------+-----------------------------------+
| encoded | decoded |
+------------------------------------------------------+-----------------------------------+
| Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= | circumflex accents: â, ê, î, ô, û |
+------------------------------------------------------+-----------------------------------+
STRING
value. By default, removes only spaces. If a
non-NULL
optional second argument is specified, the function removes
all occurrences of characters in that second argument from the beginning and end of
the string.
Return type: STRING
Added in: Impala 2.3.0
Examples:
The following examples show the default BTRIM()
behavior, and what
changes when you specify the optional second argument. All the examples bracket the
output value with [ ]
so that you can see any leading or trailing
spaces in the BTRIM()
result. By default, the function removes and
number of both leading and trailing spaces. When the second argument is specified,
any number of occurrences of any character in the second argument are removed from
the start and end of the input string; in this case, spaces are not removed (unless
they are part of the second argument) and any instances of the characters are not
removed if they do not come right at the beginning or end of the string.
-- Remove multiple spaces before and one space after.
select concat('[',btrim(' hello '),']');
+---------------------------------------+
| concat('[', btrim(' hello '), ']') |
+---------------------------------------+
| [hello] |
+---------------------------------------+
-- Remove any instances of x or y or z at beginning or end. Leave spaces alone.
select concat('[',btrim('xy hello zyzzxx','xyz'),']');
+------------------------------------------------------+
| concat('[', btrim('xy hello zyzzxx', 'xyz'), ']') |
+------------------------------------------------------+
| [ hello ] |
+------------------------------------------------------+
-- Remove any instances of x or y or z at beginning or end.
-- Leave x, y, z alone in the middle of the string.
select concat('[',btrim('xyhelxyzlozyzzxx','xyz'),']');
+----------------------------------------------------+
| concat('[', btrim('xyhelxyzlozyzzxx', 'xyz'), ']') |
+----------------------------------------------------+
| [helxyzlo] |
+----------------------------------------------------+
Syntax: BYTES (byte_expression)
Where:
byte_expression is the byte string for which the number of bytes is to be returned.
Usage notes:
The BYTES function is similar to the LENGTH() function except that it always returns the number of bytes regardless of the status of UTF-8 mode whether it is turned ON or OFF.
The following is the list of supported string data types to be used in byte_expression:
Examples:
The following example obtains the number of bytes from “cloudera” by applying the BYTES function to the column “cloudera”, which is type VARCHAR.
SELECT bytes(cast('cloudera' as varchar(20))) FROM xyz;
+----------------------------------------+
| bytes(cast('cloudera' as varchar(20))) |
+----------------------------------------+
| 8 |
+----------------------------------------+
CHAR
value.
Return type: INT
Usage notes:
When applied to a STRING
value, it returns the same result as the
length()
function. When applied to a CHAR
value,
it might return a larger value than length()
does, to account for
trailing spaces in the CHAR
.
Examples:
length()
and
char_length()
sometimes produce the same result, and sometimes produce
different results depending on the type of the argument and the presence of trailing
spaces for CHAR
values. The S
and C
values are displayed with enclosing quotation marks to show any trailing spaces.
create table length_demo (s string, c char(5));
insert into length_demo values
('a',cast('a' as char(5))),
('abc',cast('abc' as char(5))),
('hello',cast('hello' as char(5)));
select concat('"',s,'"') as s, concat('"',c,'"') as c,
length(s), length(c),
char_length(s), char_length(c)
from length_demo;
+---------+---------+-----------+-----------+----------------+----------------+
| s | c | length(s) | length(c) | char_length(s) | char_length(c) |
+---------+---------+-----------+-----------+----------------+----------------+
| "a" | "a " | 1 | 1 | 1 | 5 |
| "abc" | "abc " | 3 | 3 | 3 | 5 |
| "hello" | "hello" | 5 | 5 | 5 | 5 |
+---------+---------+-----------+-----------+----------------+----------------+
Return type: STRING
Usage notes: Can be used as the inverse of the ascii()
function, which converts a character to its numeric ASCII code.
Added in: Impala 2.3.0
Examples:
SELECT chr(65);
+---------+
| chr(65) |
+---------+
| A |
+---------+
SELECT chr(97);
+---------+
| chr(97) |
+---------+
| a |
+---------+
NULL
, it returns NULL
.
Return type: STRING
Usage notes: concat()
and concat_ws()
are
appropriate for concatenating the values of multiple columns within the same row, while
group_concat()
joins together values from different rows.
Return type: STRING
Usage notes: concat()
and concat_ws()
are
appropriate for concatenating the values of multiple columns within the same row, while
group_concat()
joins together values from different rows.
NULL
if
either argument is NULL
, 0 if the search string is not found, or 0 if
the search string contains a comma.
Return type: INT
Return type: STRING
Usage notes: concat()
and concat_ws()
are
appropriate for concatenating the values of multiple columns within the same row, while
group_concat()
joins together values from different rows.
By default, returns a single string covering the whole result set. To include other
columns or values in the result set, or to produce multiple concatenated strings for
subsets of rows, include a GROUP BY
clause in the query.
Strictly speaking, GROUP_CONCAT()
is an aggregate function, not a
scalar function like the others in this list. For additional details and examples,
see GROUP_CONCAT Function.
Return type: STRING
Example:
INITCAP("i gOt mY ChiCkeNs in tHe yard.")
returns " I Got
My Chickens In The Yard."
.
Return type:
INT
Usage notes:
If the substr is not present in str, the function returns 0.
The optional third and fourth arguments let you find instances of the substr other than the first instance starting from the left.
-- Restricting the search to positions 7..end,
-- the first occurrence of 'b' is at position 9.
select instr('foo bar bletch', 'b', 7);
+---------------------------------+
| instr('foo bar bletch', 'b', 7) |
+---------------------------------+
| 9 |
+---------------------------------+
-- Scanning right to left, the first occurrence of 'o'
-- is at position 8. (8th character from the left.)
select instr('hello world','o',-1);
+-------------------------------+
| instr('hello world', 'o', -1) |
+-------------------------------+
| 8 |
+-------------------------------+
-- 2nd occurrence of 'b' is at position 9.
select instr('foo bar bletch', 'b', 1, 2);
+------------------------------------+
| instr('foo bar bletch', 'b', 1, 2) |
+------------------------------------+
| 9 |
+------------------------------------+
NULL
, the function also returns
NULL
.JARO_SIMILARITY()
, i.e. (1 - JARO_SIMILARITY()
).
Return type: DOUBLE
Usage notes:
If the two input strings are identical, the function returns 0.0.
If there is no matching character between the input strings, the function returns 1.0.
If either input strings is NULL
, the function returns
NULL
.
If the length of either input string is bigger than 255 characters, the function returns an error.
Return type: DOUBLE
Usage notes:
If the two input strings are identical, the function returns 1.0.
If there is no matching character between the input strings, the function returns 0.0.
If either input strings is NULL
, the function returns
NULL
.
If the length of either input string is bigger than 255 characters, the function returns an error.
JARO_WINKLER_SIMILARITY()
, i.e. 1 -
JARO_WINKLER_SIMILARITY()
.
Return type: DOUBLE
Usage notes:
If the two input strings are identical, the function returns 0.0.
If there is no matching character between the input strings, the function returns 1.0.
If either input strings is NULL
, the function returns
NULL
.
The default scaling factor is 0.1.
The prefix weight will only be applied if the Jaro-distance exceeds the optional boost_threshold. By default, the boost_threshold value is 0.7.
Use Jaro or Jaro-Winkler functions to perform fuzzy matches on relatively short strings, e.g. to scrub user inputs of names against the records in the database.
Return type: DOUBLE
Usage notes:
If the two input strings are identical, the function returns 1.0.
If there is no matching character between the input strings, the function returns 0.0.
If either input strings is NULL
, the function returns
NULL
.
The default scaling factor is 0.1.
The prefix weight will only be applied if the Jaro-similarity exceeds the optional boost_threshold. By default, the boost_threshold value is 0.7.
STRLEFT()
function.
CHAR
values.
Return type: INT
Usage notes:
When applied to a STRING
value, it returns the same result as the
CHAR_LENGTH()
function. When applied to a CHAR
value, it might return a smaller value than CHAR_LENGTH()
does,
because LENGTH()
ignores any trailing spaces in the
CHAR
.
LENGTH()
with CHAR
values
containing trailing spaces is not standardized across the industry, when porting
code from other database systems, evaluate the behavior of LENGTH()
on the source system and switch to CHAR_LENGTH()
for Impala if
necessary.
Examples:
length()
and
char_length()
sometimes produce the same result, and sometimes produce
different results depending on the type of the argument and the presence of trailing
spaces for CHAR
values. The S
and C
values are displayed with enclosing quotation marks to show any trailing spaces.
create table length_demo (s string, c char(5));
insert into length_demo values
('a',cast('a' as char(5))),
('abc',cast('abc' as char(5))),
('hello',cast('hello' as char(5)));
select concat('"',s,'"') as s, concat('"',c,'"') as c,
length(s), length(c),
char_length(s), char_length(c)
from length_demo;
+---------+---------+-----------+-----------+----------------+----------------+
| s | c | length(s) | length(c) | char_length(s) | char_length(c) |
+---------+---------+-----------+-----------+----------------+----------------+
| "a" | "a " | 1 | 1 | 1 | 5 |
| "abc" | "abc " | 3 | 3 | 3 | 5 |
| "hello" | "hello" | 5 | 5 | 5 | 5 |
+---------+---------+-----------+-----------+----------------+----------------+
Return type: INT
If input strings are equal, the function returns 0.
If either input exceeds 255 characters, the function returns an error.
If either input string is NULL
, the function returns
NULL
.
If the length of one input string is zero, the function returns the length of the other string.
Example:
LEVENSHTEIN ('welcome', 'We come')
returns 2, first change to
replace 'w
' to 'W
', and then to replace
'l
' to a space character.
Return type: INT
Return type: STRING
Usage notes:
In Impala 2.5 and higher, you can simplify queries that use many
UPPER()
and LOWER()
calls to do case-insensitive
comparisons, by using the ILIKE
or IREGEXP
operators
instead. See ILIKE Operator and
IREGEXP Operator for details.
Return type: STRING
Return type: STRING
'PROTOCOL'
, 'HOST'
,
'PATH'
, 'REF'
, 'AUTHORITY'
,
'FILE'
, 'USERINFO'
, or 'QUERY'
.
Uppercase is required for these literal values. When requesting the
QUERY
portion of the URL, you can optionally specify a key to
retrieve just the associated value from the key-value pairs in the query string.
Return type: STRING
Usage notes: This function is important for the traditional Hadoop use case
of interpreting web logs. For example, if the web traffic data features raw URLs not
divided into separate table columns, you can count visitors to a particular page by
extracting the 'PATH'
or 'FILE'
field, or analyze
search terms by extracting the corresponding key from the 'QUERY'
field.
Return type: STRING
Examples:
select prettyprint_bytes(1), prettyprint_bytes(12345678), prettyprint_bytes(1234567890), prettyprint_bytes(1234567890123);
+-----------------------+------------------------------+--------------------------------+-------------------------------+
| prettyprint_bytes(1) | prettyprint_bytes(12345678) | prettyprint_bytes(1234567890) | prettyprint_bytes(1234567890123) |
+----------------------+-----------------------------+-------------------------------+----------------------------------+
| 1.00 B | 11.77 MB | 1.15 GB | 1149.78 GB |
+----------------------+-----------------------------+-------------------------------+----------------------------------+
REGEXP_ESCAPE()
function returns a string escaped
for the special character in RE2 library so that the special characters are
interpreted literally rather than as special characters. The following special
characters are escaped by the function:
.\+*?[^]$(){}=!<>|:-
Return type: STRING
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX
Extended Regular Expression syntax used by the Google RE2 library. For details, see
the RE2
documentation. It has most idioms familiar from regular expressions in Perl,
Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression
escape character in any regular expressions that you submit through
impala-shell . You might prefer to use the equivalent character class
names, such as [[:digit:]]
instead of \d
which you
would have to escape as \\d
.
Examples:
This example shows escaping one of special characters in RE2.
+------------------------------------------------------+
| regexp_escape('Hello.world') |
+------------------------------------------------------+
| Hello\.world |
+------------------------------------------------------+
This example shows escaping all the special characters in RE2.
+------------------------------------------------------------+
| regexp_escape('a.b\\c+d*e?f[g]h$i(j)k{l}m=n!o<p>q|r:s-t') |
+------------------------------------------------------------+
| a\.b\\c\+d\*e\?f\[g\]h\$i\(j\)k\{l\}m\=n\!o\<p\>q\|r\:s\-t |
+------------------------------------------------------------+
(...)
portion.
Return type: STRING
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX
Extended Regular Expression syntax used by the Google RE2 library. For details, see
the RE2
documentation. It has most idioms familiar from regular expressions in Perl,
Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression
escape character in any regular expressions that you submit through
impala-shell . You might prefer to use the equivalent character class
names, such as [[:digit:]]
instead of \d
which you
would have to escape as \\d
.
Examples:
This example shows how group 0 matches the full pattern string, including the
portion outside any ()
group:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',0);
+------------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 0) |
+------------------------------------------------------+
| abcdef123ghi456 |
+------------------------------------------------------+
Returned 1 row(s) in 0.11s
This example shows how group 1 matches just the contents inside the first
()
group in the pattern string:
[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',1);
+------------------------------------------------------+
| regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 1) |
+------------------------------------------------------+
| 456 |
+------------------------------------------------------+
Returned 1 row(s) in 0.11s
Unlike in earlier Impala releases, the regular expression library used in Impala 2.0
and later supports the .*?
idiom for non-greedy matches. This
example shows how a pattern string starting with .*?
matches the
shortest possible portion of the source string, returning the rightmost set of
lowercase letters. A pattern string both starting and ending with
.*?
finds two potential matches of equal length, and returns the
first one found (the leftmost set of lowercase letters).
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+)',1);
+--------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+)', 1) |
+--------------------------------------------------------+
| def |
+--------------------------------------------------------+
[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+).*?',1);
+-----------------------------------------------------------+
| regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+).*?', 1) |
+-----------------------------------------------------------+
| bcd |
+-----------------------------------------------------------+
true
or false
to indicate
whether the source string contains anywhere inside it the regular expression given by
the pattern. The optional third argument consists of letter flags that change how the
match is performed, such as i
for case-insensitive matching.
Syntax:
The flags that you can include in the optional third argument are:
c
: Case-sensitive matching (the default).
i
: Case-insensitive matching. If multiple instances of
c
and i
are included in the third argument, the
last such option takes precedence.
m
: Multi-line matching. The ^
and
$
operators match the start or end of any line within the source
string, not the start and end of the entire string.
n
: Newline matching. The .
operator can match
the newline character. A repetition operator such as .*
can match
a portion of the source string that spans multiple lines.
Return type: BOOLEAN
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX
Extended Regular Expression syntax used by the Google RE2 library. For details, see
the RE2
documentation. It has most idioms familiar from regular expressions in Perl,
Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression
escape character in any regular expressions that you submit through
impala-shell . You might prefer to use the equivalent character class
names, such as [[:digit:]]
instead of \d
which you
would have to escape as \\d
.
Examples:
This example shows how regexp_like()
can test for the existence of
various kinds of regular expression patterns within a source string:
-- Matches because the 'f' appears somewhere in 'foo'.
select regexp_like('foo','f');
+-------------------------+
| regexp_like('foo', 'f') |
+-------------------------+
| true |
+-------------------------+
-- Does not match because the comparison is case-sensitive by default.
select regexp_like('foo','F');
+-------------------------+
| regexp_like('foo', 'f') |
+-------------------------+
| false |
+-------------------------+
-- The 3rd argument can change the matching logic, such as 'i' meaning case-insensitive.
select regexp_like('foo','F','i');
+------------------------------+
| regexp_like('foo', 'f', 'i') |
+------------------------------+
| true |
+------------------------------+
-- The familiar regular expression notations work, such as ^ and $ anchors...
select regexp_like('foo','f$');
+--------------------------+
| regexp_like('foo', 'f$') |
+--------------------------+
| false |
+--------------------------+
select regexp_like('foo','o$');
+--------------------------+
| regexp_like('foo', 'o$') |
+--------------------------+
| true |
+--------------------------+
-- ...and repetition operators such as * and +
select regexp_like('foooooobar','fo+b');
+-----------------------------------+
| regexp_like('foooooobar', 'fo+b') |
+-----------------------------------+
| true |
+-----------------------------------+
select regexp_like('foooooobar','fx*y*o*b');
+---------------------------------------+
| regexp_like('foooooobar', 'fx*y*o*b') |
+---------------------------------------+
| true |
+---------------------------------------+
Return type: STRING
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX
Extended Regular Expression syntax used by the Google RE2 library. For details, see
the RE2
documentation. It has most idioms familiar from regular expressions in Perl,
Python, and so on, including .*?
for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See Incompatible Changes Introduced in Impala 2.0.0 for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\
to represent the regular expression
escape character in any regular expressions that you submit through
impala-shell . You might prefer to use the equivalent character class
names, such as [[:digit:]]
instead of \d
which you
would have to escape as \\d
.
Examples:
These examples show how you can replace parts of a string matching a pattern with
replacement text, which can include backreferences to any ()
groups
in the pattern string. The backreference numbers start at 1, and any
\
characters must be escaped as \\
.
Replace a character pattern with new text:
[localhost:21000] > select regexp_replace('aaabbbaaa','b+','xyz');
+------------------------------------------+
| regexp_replace('aaabbbaaa', 'b+', 'xyz') |
+------------------------------------------+
| aaaxyzaaa |
+------------------------------------------+
Returned 1 row(s) in 0.11s
Replace a character pattern with substitution text that includes the original matching text:
[localhost:21000] > select regexp_replace('aaabbbaaa','(b+)','<\\1>');
+----------------------------------------------+
| regexp_replace('aaabbbaaa', '(b+)', '<\\1>') |
+----------------------------------------------+
| aaa<bbb>aaa |
+----------------------------------------------+
Returned 1 row(s) in 0.11s
Remove all characters that are not digits:
[localhost:21000] > select regexp_replace('123-456-789','[^[:digit:]]','');
+---------------------------------------------------+
| regexp_replace('123-456-789', '[^[:digit:]]', '') |
+---------------------------------------------------+
| 123456789 |
+---------------------------------------------------+
Returned 1 row(s) in 0.12s
Return type: STRING
Return type: STRING
Usage notes:
Because this function does not use any regular expression patterns, it is typically
faster than regexp_replace()
for simple string substitutions.
If any argument is NULL
, the return value is NULL
.
Matching is case-sensitive.
If the replacement string contains another instance of the target string, the expansion is only performed once, instead of applying again to the newly constructed string.
Added in: Impala 2.9.0
Examples:
-- Replace one string with another.
select replace('hello world','world','earth');
+------------------------------------------+
| replace('hello world', 'world', 'earth') |
+------------------------------------------+
| hello earth |
+------------------------------------------+
-- All occurrences of the target string are replaced.
select replace('hello world','o','0');
+----------------------------------+
| replace('hello world', 'o', '0') |
+----------------------------------+
| hell0 w0rld |
+----------------------------------+
-- If no match is found, the original string is returned unchanged.
select replace('hello world','xyz','abc');
+--------------------------------------+
| replace('hello world', 'xyz', 'abc') |
+--------------------------------------+
| hello world |
+--------------------------------------+
Return type: STRING
STRRIGHT()
function.
Return type: STRING
Return type: STRING
REPEAT(' ',n)
.
Return type: STRING
index
th part of the input
source string split by the delimiter.
The delimiter can consist of multiple characters, not just a single character.
All matching of the delimiter is done exactly, not using any regular expression patterns.
Return type: STRING
Examples:
SPLIT_PART('x,y,z',',',2)
returns 'y'
.
SPLIT_PART('one***two***three','***',2)
returns
'two'
.
SPLIT_PART('abc@@def@@ghi', '@@', 3)
returns
'ghi'
.
SPLIT_PART('abc@@def@@ghi', '@@', -3)
returns
'abc'
.
SUBSTR()
with 2 arguments.
Return type: STRING
SUBSTR()
with 2 arguments.
Return type: STRING
Return type: STRING
input
string with each character in the
from
argument replaced with the corresponding character in the
to
argument. The characters are matched in the order they appear in
from
and to
.
For example: translate ('hello world','world','earth')
returns
'hetta earth'
.
Return type: STRING
Usage notes:
If from
contains more characters than to
, the
from
characters that are beyond the length of to
are removed in the result.
For example:
translate('abcdedg', 'bcd', '1')
returns 'a1eg'
.
translate('Unit Number#2', '# ', '_')
returns
'UnitNumber_2'
.
If from
is NULL
, the function returns
NULL
.
If to
contains more characters than from
, the
extra characters in to
are ignored.
If from
contains duplicate characters, the duplicate character is
replaced with the first matching character in to
.
For example: translate ('hello','ll','67')
returns
'he66o'
.
LTRIM()
and
RTRIM()
.
Usage notes: Often used during data cleansing operations during the ETL
cycle, if input values might still have surrounding spaces. For a more
general-purpose function that can remove other leading and trailing characters
besides spaces, see BTRIM()
.
Return type: STRING
Return type: STRING
Usage notes:
In Impala 2.5 and higher, you can simplify queries that use many
UPPER()
and LOWER()
calls to do case-insensitive
comparisons, by using the ILIKE
or IREGEXP
operators
instead. See ILIKE Operator and
IREGEXP Operator for details.