User-defined functions (frequently abbreviated as UDFs) let you code your own application logic for processing column values during an Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
You can use UDFs to simplify query logic when producing reports, or to transform data in flexible ways when
copying from one table to another with the INSERT ... SELECT
syntax.
You might be familiar with this feature from other database products, under names such as stored functions or stored routines.
Impala support for UDFs is available in Impala 1.2 and higher:
Depending on your use case, you might write all-new functions, reuse Java UDFs that you have already written for Hive, or port Hive Java UDF code to higher-performance native Impala UDFs in C++. You can code either scalar functions for producing results one row at a time, or more complex aggregate functions for doing analysis across. The following sections discuss these different aspects of working with UDFs.
Depending on your use case, the user-defined functions (UDFs) you write might accept or produce different numbers of input and output values:
select customer_name, is_frequent_customer(customer_id) from customers;
select obfuscate(sensitive_column) from sensitive_data;
COUNT
,
MAX()
, SUM()
, and AVG()
functions. When called in a
query that uses the GROUP BY
clause, the function is called once for each combination
of GROUP BY
values. For example:
-- Evaluates multiple rows but returns a single value.
select closest_restaurant(latitude, longitude) from places;
-- Evaluates batches of rows and returns a separate value for each batch.
select most_profitable_location(store_id, sales, expenses, tax_rate, depreciation) from franchise_data group by year;
Impala supports UDFs written in C++, in addition to supporting existing Hive UDFs written in Java. Where practical, use C++ UDFs because the compiled native code can yield higher performance, with UDF execution time often 10x faster for a C++ UDF than the equivalent Java UDF.
Impala can run Java-based user-defined functions (UDFs), originally written for Hive, with no changes, subject to the following conditions:
org.apache.hadoop.hive.ql.exec.UDF
class.
TIMESTAMP
type are not supported.
Text
or
IntWritable
, rather than a Java primitive type such as String
or
int
. Otherwise, the UDF returns NULL
.
In Impala 2.5 and higher, this restriction is lifted, and both
UDF arguments and return values can be Java primitive types.
CREATE FUNCTION
syntax, without any signature for arguments or the return value.
To take full advantage of the Impala architecture and performance features, you can also write Impala-specific UDFs in C++.
For background about Java-based Hive UDFs, see the Hive documentation for UDFs. For examples or tutorials for writing such UDFs, search the web for related blog posts.
The ideal way to understand how to reuse Java-based UDFs (originally written for Hive) with Impala is to take some of the Hive built-in functions (implemented as Java UDFs) and take the applicable JAR files through the UDF deployment process for Impala, creating new UDFs with different names:
jar tf jar_file
to see a list of the classes inside the JAR.
You will see names like org/apache/hadoop/hive/ql/udf/UDFLower.class
and
org/apache/hadoop/hive/ql/udf/UDFOPNegative.class
. Make a note of the names of the
functions you want to experiment with. When you specify the entry points for the Impala CREATE
FUNCTION
statement, change the slash characters to dots and strip off the
.class
suffix, for example org.apache.hadoop.hive.ql.udf.UDFLower
and
org.apache.hadoop.hive.ql.udf.UDFOPNegative
.
CREATE FUNCTION
statement, with a LOCATION
clause containing the full HDFS path of the JAR file, and a
SYMBOL
clause with the fully qualified name of the class, using dots as separators and
without the .class
extension. Remember that user-defined functions are associated with
a particular database, so issue a USE
statement for the appropriate database first, or
specify the SQL function name as
db_name.function_name
. Use completely new names
for the SQL functions, because Impala UDFs cannot have the same name as Impala built-in functions.
CAST
functions to ensure correct data types, and so on.
REFRESH FUNCTIONS
statement with the database name as an argument.
Java-based UDFs can be added to the metastore database through Hive CREATE
FUNCTION
statements, and made visible to Impala by subsequently running
REFRESH FUNCTIONS
. For example:
CREATE DATABASE shared_udfs;
USE shared_udfs;
...use CREATE FUNCTION statements in Hive to create some Java-based UDFs
that Impala is not initially aware of...
REFRESH FUNCTIONS shared_udfs;
SELECT udf_created_by_hive(c1) FROM ...
For example, the following impala-shell session creates an Impala UDF
my_lower()
that reuses the Java code for the Hive lower()
: built-in
function. We cannot call it lower()
because Impala does not allow UDFs to have the
same name as built-in functions. From SQL, we call the function in a basic way (in a query with no
WHERE
clause), directly on a column, and on the results of a string expression:
[localhost:21000] > create database udfs;
[localhost:21000] > use udfs;
localhost:21000] > create function lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';
ERROR: AnalysisException: Function cannot have the same name as a builtin: lower
[localhost:21000] > create function my_lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';
[localhost:21000] > select my_lower('Some String NOT ALREADY LOWERCASE');
+----------------------------------------------------+
| udfs.my_lower('some string not already lowercase') |
+----------------------------------------------------+
| some string not already lowercase |
+----------------------------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > create table t2 (s string);
[localhost:21000] > insert into t2 values ('lower'),('UPPER'),('Init cap'),('CamelCase');
Inserted 4 rows in 2.28s
[localhost:21000] > select * from t2;
+-----------+
| s |
+-----------+
| lower |
| UPPER |
| Init cap |
| CamelCase |
+-----------+
Returned 4 row(s) in 0.47s
[localhost:21000] > select my_lower(s) from t2;
+------------------+
| udfs.my_lower(s) |
+------------------+
| lower |
| upper |
| init cap |
| camelcase |
+------------------+
Returned 4 row(s) in 0.54s
[localhost:21000] > select my_lower(concat('ABC ',s,' XYZ')) from t2;
+------------------------------------------+
| udfs.my_lower(concat('abc ', s, ' xyz')) |
+------------------------------------------+
| abc lower xyz |
| abc upper xyz |
| abc init cap xyz |
| abc camelcase xyz |
+------------------------------------------+
Returned 4 row(s) in 0.22s
Here is an example that reuses the Hive Java code for the negative()
built-in
function. This example demonstrates how the data types of the arguments must match precisely with the
function signature. At first, we create an Impala SQL function that can only accept an integer
argument. Impala cannot find a matching function when the query passes a floating-point argument,
although we can call the integer version of the function by casting the argument. Then we overload the
same function name to also accept a floating-point argument.
[localhost:21000] > create table t (x int);
[localhost:21000] > insert into t values (1), (2), (4), (100);
Inserted 4 rows in 1.43s
[localhost:21000] > create function my_neg(bigint) returns bigint location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFOPNegative';
[localhost:21000] > select my_neg(4);
+----------------+
| udfs.my_neg(4) |
+----------------+
| -4 |
+----------------+
[localhost:21000] > select my_neg(x) from t;
+----------------+
| udfs.my_neg(x) |
+----------------+
| -2 |
| -4 |
| -100 |
+----------------+
Returned 3 row(s) in 0.60s
[localhost:21000] > select my_neg(4.0);
ERROR: AnalysisException: No matching function with signature: udfs.my_neg(FLOAT).
[localhost:21000] > select my_neg(cast(4.0 as int));
+-------------------------------+
| udfs.my_neg(cast(4.0 as int)) |
+-------------------------------+
| -4 |
+-------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > create function my_neg(double) returns double location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFOPNegative';
[localhost:21000] > select my_neg(4.0);
+------------------+
| udfs.my_neg(4.0) |
+------------------+
| -4 |
+------------------+
Returned 1 row(s) in 0.11s
By default, Impala copies UDFs into /tmp,
and you can configure this location through the --local_library_dir
startup flag for the impalad daemon.
To develop UDFs for Impala, download and install the impala-udf-devel
package (RHEL-based
distributions) or impala-udf-dev
(Ubuntu and Debian). This package contains
header files, sample source, and build configuration files.
.repo
or list file for your operating system version.
yum
, zypper
, or apt-get
commands
depending on your operating system. For the package name, specify impala-udf-devel
(RHEL-based distributions) or impala-udf-dev
(Ubuntu and Debian).
When you are ready to start writing your own UDFs, download the sample code and build scripts from the Impala sample UDF github. Then see Writing User-Defined Functions (UDFs) for how to code UDFs, and Examples of Creating and Using UDFs for how to build and run UDFs.
Before starting UDF development, make sure to install the development package and download the UDF code samples, as described in Installing the UDF Development Package.
When writing UDFs:
To understand the layout and member variables and functions of the predefined UDF data types, examine the header file /usr/include/impala_udf/udf.h:
// This is the only Impala header required to develop UDFs and UDAs. This header
// contains the types that need to be used and the FunctionContext object. The context
// object serves as the interface object between the UDF/UDA and the impala process.
For the basic declarations needed to write a scalar UDF, see the header file
udf-sample.h
within the sample build environment, which defines a simple function
named AddUdf()
:
#ifndef IMPALA_UDF_SAMPLE_UDF_H
#define IMPALA_UDF_SAMPLE_UDF_H
#include <impala_udf/udf.h>
using namespace impala_udf;
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);
#endif
For sample C++ code for a simple function named AddUdf()
, see the source file
udf-sample.cc within the sample build environment:
#include "udf-sample.h"
// In this sample we are declaring a UDF that adds two ints and returns an int.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
if (arg1.is_null || arg2.is_null) return IntVal::null();
return IntVal(arg1.val + arg2.val);
}
// Multiple UDFs can be defined in the same file
Each value that a user-defined function can accept as an argument or return as a result value must map to a SQL data type that you could specify for a table column.
Currently, Impala UDFs cannot accept arguments or return values of the Impala complex
types (STRUCT
, ARRAY
, or MAP
).
Each data type has a corresponding structure defined in the C++ and Java header files, with two member fields and some predefined comparison operators and constructors:
is_null
indicates whether the value is NULL
or not.
val
holds the actual argument or return value when it is non-NULL
.
Each struct also defines a null()
member function that constructs an instance of the
struct with the is_null
flag set.
The built-in SQL comparison operators and clauses such as <
,
>=
, BETWEEN
, and ORDER BY
all work
automatically based on the SQL return type of each UDF. For example, Impala knows how to evaluate
BETWEEN 1 AND udf_returning_int(col1)
or ORDER BY
udf_returning_string(col2)
without you declaring any comparison operators within the UDF
itself.
For convenience within your UDF code, each struct defines ==
and !=
operators for comparisons with other structs of the same type. These are for typical C++ comparisons
within your own code, not necessarily reproducing SQL semantics. For example, if the
is_null
flag is set in both structs, they compare as equal. That behavior of
null
comparisons is different from SQL (where NULL == NULL
is
NULL
rather than true
), but more in line with typical C++ behavior.
Each kind of struct has one or more constructors that define a filled-in instance of the struct, optionally with default values.
Impala cannot process UDFs that accept composite or nested types as arguments or return them as result values. This limitation applies both to Impala UDFs written in C++ and Java-based Hive UDFs.
You can overload functions by creating multiple functions with the same SQL name but different argument types. For overloaded functions, you must use different C++ or Java entry point names in the underlying functions.
The data types defined on the C++ side (in /usr/include/impala_udf/udf.h) are:
IntVal
represents an INT
column.
BigIntVal
represents a BIGINT
column. Even if you do not need the
full range of a BIGINT
value, it can be useful to code your function arguments as
BigIntVal
to make it convenient to call the function with different kinds of integer
columns and expressions as arguments. Impala automatically casts smaller integer types to larger ones
when appropriate, but does not implicitly cast large integer types to smaller ones.
SmallIntVal
represents a SMALLINT
column.
TinyIntVal
represents a TINYINT
column.
StringVal
represents a STRING
column. It has a len
field representing the length of the string, and a ptr
field pointing to the string
data. It has constructors that create a new StringVal
struct based on a
null-terminated C-style string, or a pointer plus a length; these new structs still refer to the
original string data rather than allocating a new buffer for the data. It also has a constructor that
takes a pointer to a FunctionContext
struct and a length, that does allocate space
for a new copy of the string data, for use in UDFs that return string values.
BooleanVal
represents a BOOLEAN
column.
FloatVal
represents a FLOAT
column.
DoubleVal
represents a DOUBLE
column.
TimestampVal
represents a TIMESTAMP
column. It has a
date
field, a 32-bit integer representing the Gregorian date, that is, the days past
the epoch date. It also has a time_of_day
field, a 64-bit integer representing the
current time of day in nanoseconds.
UDFs typically take a fixed number of arguments, with each one named explicitly in the signature of your C++ function. Your function can also accept additional optional arguments, all of the same type. For example, you can concatenate two strings, three strings, four strings, and so on. Or you can compare two numbers, three numbers, four numbers, and so on.
To accept a variable-length argument list, code the signature of your function like this:
StringVal Concat(FunctionContext* context, const StringVal& separator,
int num_var_args, const StringVal* args);
In the CREATE FUNCTION
statement, after the type of the first optional argument, include
...
to indicate it could be followed by more arguments of the same type. For example,
the following function accepts a STRING
argument, followed by one or more additional
STRING
arguments:
[localhost:21000] > create function my_concat(string, string ...) returns string location '/user/test_user/udfs/sample.so' symbol='Concat';
The call from the SQL query must pass at least one argument to the variable-length portion of the argument list.
When Impala calls the function, it fills in the initial set of required arguments, then passes the number of extra arguments and a pointer to the first of those optional arguments.
For correctness, performance, and reliability, it is important for each UDF to handle all situations
where any NULL
values are passed to your function. For example, when passed a
NULL
, UDFs typically also return NULL
. In an aggregate function, which
could be passed a combination of real and NULL
values, you might make the final value
into a NULL
(as in CONCAT()
), ignore the NULL
value
(as in AVG()
), or treat it the same as a numeric zero or empty string.
Each parameter type, such as IntVal
or StringVal
, has an
is_null
Boolean member.
Test this flag immediately for each argument to your function, and if it is set, do not refer to the
val
field of the argument structure. The val
field is undefined when
the argument is NULL
, so your function could go into an infinite loop or produce
incorrect results if you skip the special handling for NULL
.
If your function returns NULL
when passed a NULL
value, or in other
cases such as when a search string is not found, you can construct a null instance of the return type by
using its null()
member function.
By default, memory allocated within a UDF is deallocated when the function exits, which could be before
the query is finished. The input arguments remain allocated for the lifetime of the function, so you can
refer to them in the expressions for your return values. If you use temporary variables to construct
all-new string values, use the StringVal()
constructor that takes an initial
FunctionContext*
argument followed by a length, and copy the data into the newly
allocated memory buffer.
One way to improve performance of UDFs is to specify the optional PREPARE_FN
and
CLOSE_FN
clauses on the CREATE FUNCTION
statement. The "prepare"
function sets up a thread-safe data structure in memory that you can use as a work area. The "close"
function deallocates that memory. Each subsequent call to the UDF within the same thread can access that
same memory area. There might be several such memory areas allocated on the same host, as UDFs are
parallelized using multiple threads.
Within this work area, you can set up predefined lookup tables, or record the results of complex
operations on data types such as STRING
or TIMESTAMP
. Saving the
results of previous computations rather than repeating the computation each time is an optimization known
as http://en.wikipedia.org/wiki/Memoization. For example,
if your UDF performs a regular expression match or date manipulation on a column that repeats the same
value over and over, you could store the last-computed value or a hash table of already-computed values,
and do a fast lookup to find the result for subsequent iterations of the UDF.
Each such function must have the signature:
void function_name(impala_udf::FunctionContext*, impala_udf::FunctionContext::FunctionScope)
Currently, only THREAD_SCOPE
is implemented, not FRAGMENT_SCOPE
. See
udf.h for details about the scope values.
To handle errors in UDFs, you call functions that are members of the initial
FunctionContext*
argument passed to your function.
A UDF can record one or more warnings, for conditions that indicate minor, recoverable problems that do not cause the query to stop. The signature for this function is:
bool AddWarning(const char* warning_msg);
For a serious problem that requires cancelling the query, a UDF can set an error flag that prevents the query from returning any results. The signature for this function is:
void SetError(const char* error_msg);
User-defined aggregate functions (UDAFs or UDAs) are a powerful and flexible category of user-defined
functions. If a query processes N rows, calling a UDAF during the query condenses the result set, anywhere
from a single value (such as with the SUM
or MAX
functions), or some
number less than or equal to N (as in queries using the GROUP BY
or
HAVING
clause).
A UDAF must maintain a state value across subsequent calls, so that it can accumulate a result across a set of calls, rather than derive it purely from one set of arguments. For that reason, a UDAF is represented by multiple underlying functions:
Intermediate values returned by the init, update and merge functions that referred to allocations
must be allocated using FunctionContext::Allocate()
and freed using FunctionContext::Free()
.
Both serialize and finalize functions are responsible for cleaning up the intermediate value and freeing such allocations.
StringVals returned to Impala directly by Serialize(), Finalize() or GetValue() functions should be backed by
temporary results memory allocated using the StringVal(FunctionContext*, int) constructor
,
StringVal::CopyFrom(FunctionContext*, const uint8_t*, size_t)
, or StringVal::Resize()
.
In the SQL syntax, you create a UDAF by using the statement CREATE AGGREGATE FUNCTION
.
You specify the entry points of the underlying C++ functions using the clauses INIT_FN
,
UPDATE_FN
, MERGE_FN
, SERIALIZE_FN
, and
FINALIZE_FN
.
For convenience, you can use a naming convention for the underlying functions and Impala automatically
recognizes those entry points. Specify the UPDATE_FN
clause, using an entry point name
containing the string update
or Update
. When you omit the other
_FN
clauses from the SQL statement, Impala looks for entry points with names formed by
substituting the update
or Update
portion of the specified name.
A user-defined aggregate function might produce and combine intermediate results during some phases of
processing, using a different data type than the final return value. For example, if you implement a
function similar to the built-in AVG()
function, it must keep track of two values, the
number of values counted and the sum of those values. Or, you might accumulate a string value over the
course of a UDA, then in the end return a numeric or Boolean result.
In such a case, specify the data type of the intermediate results using the optional INTERMEDIATE
type_name
clause of the CREATE AGGREGATE FUNCTION
statement.
If the intermediate data is a typeless byte array (for example, to represent a C++ struct or array),
specify the type name as CHAR(n)
, with n
representing the number of bytes in the intermediate result buffer.
For an example of this technique, see the trunc_sum()
aggregate function, which accumulates
intermediate results of type DOUBLE
and returns BIGINT
at the end.
View the CREATE FUNCTION
statement
and the implementation of the underlying TruncSum*() functions
on Github.
This section explains the steps to compile Impala UDFs from C++ source code, and deploy the resulting libraries for use in Impala queries.
Impala ships with a sample build environment for UDFs, that you can study, experiment with, and adapt for your own use. This sample build environment starts with the cmake configuration command, which reads the file CMakeLists.txt and generates a Makefile customized for your particular directory paths. Then the make command runs the actual build steps based on the rules in the Makefile.
Impala loads the shared library from an HDFS location. After building a shared library containing one or
more UDFs, use hdfs dfs
or hadoop fs
commands to copy the binary file to
an HDFS location readable by Impala.
The final step in deployment is to issue a CREATE FUNCTION
statement in the
impala-shell interpreter to make Impala aware of the new function. See
CREATE FUNCTION Statement for syntax details. Because each function is
associated with a particular database, always issue a USE
statement to the appropriate
database before creating a function, or specify a fully qualified name, that is, CREATE FUNCTION
db_name.function_name
.
As you update the UDF code and redeploy updated versions of a shared library, use DROP
FUNCTION
and CREATE FUNCTION
to let Impala pick up the latest version of the
code.
In Impala 2.5 and higher, Impala UDFs and UDAs written in C++ are
persisted in the metastore database. Java UDFs are also persisted, if they were created
with the new CREATE FUNCTION
syntax for Java UDFs, where the Java
function argument and return types are omitted. Java-based UDFs created with the old
CREATE FUNCTION
syntax do not persist across restarts because they are
held in the memory of the catalogd daemon. Until you re-create such
Java UDFs using the new CREATE FUNCTION
syntax, you must reload those
Java-based UDFs by running the original CREATE FUNCTION
statements
again each time you restart the catalogd daemon. Prior to
Impala 2.5 the requirement to reload functions after a restart
applied to both C++ and Java functions.
See CREATE FUNCTION Statement and DROP FUNCTION Statement for the new syntax for the persistent Java UDFs.
Prerequisites for the build environment are:
# Use the appropriate package installation command for your Linux distribution.
sudo yum install gcc-c++ cmake boost-devel
sudo yum install impala-udf-devel
# The package name on Ubuntu and Debian is impala-udf-dev.
Then, unpack the sample code in udf_samples.tar.gz and use that as a template to set up your build environment.
To build the original samples:
# Process CMakeLists.txt and set up appropriate Makefiles.
cmake .
# Generate shared libraries from UDF and UDAF sample code,
# udf_samples/libudfsample.so and udf_samples/libudasample.so
make
The sample code to examine, experiment with, and adapt is in these files:
AddUDF
).
COUNT
, AVG
, and
STRINGCONCAT
. Because aggregate functions require more elaborate coding to handle the
processing for multiple phases, there are several underlying C++ functions such as
CountInit
, AvgUpdate
, and StringConcatFinalize
.
COUNT
function keeps track of a single incrementing
number; the merge functions combine the intermediate count values from each Impala node, and the
combined number is returned verbatim by the finalize function.
AVG
function keeps track of two numbers, a count of rows
processed and the sum of values for a column. These numbers are updated and merged as with
COUNT
, then the finalize function divides them to produce and return the final
average value.
Because a UDF typically processes each row of a table, potentially being called billions of times, the performance of each UDF is a critical factor in the speed of the overall ETL or ELT pipeline. Tiny optimizations you can make within the function body can pay off in a big way when the function is called over and over when processing a huge result set.
This section demonstrates how to create and use all kinds of user-defined functions (UDFs).
This example shows 3 separate UDFs that operate on strings and return different data types. In the C++
code, the functions are HasVowels()
(checks if a string contains any vowels),
CountVowels()
(returns the number of vowels in a string), and
StripVowels()
(returns a new string with vowels removed).
First, we add the signatures for these functions to udf-sample.h in the demo build environment:
BooleanVal HasVowels(FunctionContext* context, const StringVal& input);
IntVal CountVowels(FunctionContext* context, const StringVal& arg1);
StringVal StripVowels(FunctionContext* context, const StringVal& arg1);
Then, we add the bodies of these functions to udf-sample.cc:
BooleanVal HasVowels(FunctionContext* context, const StringVal& input)
{
if (input.is_null) return BooleanVal::null();
int index;
uint8_t *ptr;
for (ptr = input.ptr, index = 0; index <= input.len; index++, ptr++)
{
uint8_t c = tolower(*ptr);
if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
{
return BooleanVal(true);
}
}
return BooleanVal(false);
}
IntVal CountVowels(FunctionContext* context, const StringVal& arg1)
{
if (arg1.is_null) return IntVal::null();
int count;
int index;
uint8_t *ptr;
for (ptr = arg1.ptr, count = 0, index = 0; index <= arg1.len; index++, ptr++)
{
uint8_t c = tolower(*ptr);
if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
{
count++;
}
}
return IntVal(count);
}
StringVal StripVowels(FunctionContext* context, const StringVal& arg1)
{
if (arg1.is_null) return StringVal::null();
int index;
std::string original((const char *)arg1.ptr,arg1.len);
std::string shorter("");
for (index = 0; index < original.length(); index++)
{
uint8_t c = original[index];
uint8_t l = tolower(c);
if (l == 'a' || l == 'e' || l == 'i' || l == 'o' || l == 'u')
{
;
}
else
{
shorter.append(1, (char)c);
}
}
// The modified string is stored in 'shorter', which is destroyed when this function ends. We need to make a string val
// and copy the contents.
StringVal result(context, shorter.size()); // Only the version of the ctor that takes a context object allocates new memory
memcpy(result.ptr, shorter.c_str(), shorter.size());
return result;
}
We build a shared library, libudfsample.so, and put the library file into HDFS where Impala can read it:
$ make
[ 0%] Generating udf_samples/uda-sample.ll
[ 16%] Built target uda-sample-ir
[ 33%] Built target udasample
[ 50%] Built target uda-sample-test
[ 50%] Generating udf_samples/udf-sample.ll
[ 66%] Built target udf-sample-ir
Scanning dependencies of target udfsample
[ 83%] Building CXX object CMakeFiles/udfsample.dir/udf-sample.o
Linking CXX shared library udf_samples/libudfsample.so
[ 83%] Built target udfsample
Linking CXX executable udf_samples/udf-sample-test
[100%] Built target udf-sample-test
$ hdfs dfs -put ./udf_samples/libudfsample.so /user/hive/udfs/libudfsample.so
Finally, we go into the impala-shell interpreter where we set up some sample data,
issue CREATE FUNCTION
statements to set up the SQL function names, and call the
functions in some queries:
[localhost:21000] > create database udf_testing;
[localhost:21000] > use udf_testing;
[localhost:21000] > create function has_vowels (string) returns boolean location '/user/hive/udfs/libudfsample.so' symbol='HasVowels';
[localhost:21000] > select has_vowels('abc');
+------------------------+
| udfs.has_vowels('abc') |
+------------------------+
| true |
+------------------------+
Returned 1 row(s) in 0.13s
[localhost:21000] > select has_vowels('zxcvbnm');
+----------------------------+
| udfs.has_vowels('zxcvbnm') |
+----------------------------+
| false |
+----------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select has_vowels(null);
+-----------------------+
| udfs.has_vowels(null) |
+-----------------------+
| NULL |
+-----------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > select s, has_vowels(s) from t2;
+-----------+--------------------+
| s | udfs.has_vowels(s) |
+-----------+--------------------+
| lower | true |
| UPPER | true |
| Init cap | true |
| CamelCase | true |
+-----------+--------------------+
Returned 4 row(s) in 0.24s
[localhost:21000] > create function count_vowels (string) returns int location '/user/hive/udfs/libudfsample.so' symbol='CountVowels';
[localhost:21000] > select count_vowels('cat in the hat');
+-------------------------------------+
| udfs.count_vowels('cat in the hat') |
+-------------------------------------+
| 4 |
+-------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select s, count_vowels(s) from t2;
+-----------+----------------------+
| s | udfs.count_vowels(s) |
+-----------+----------------------+
| lower | 2 |
| UPPER | 2 |
| Init cap | 3 |
| CamelCase | 4 |
+-----------+----------------------+
Returned 4 row(s) in 0.23s
[localhost:21000] > select count_vowels(null);
+-------------------------+
| udfs.count_vowels(null) |
+-------------------------+
| NULL |
+-------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > create function strip_vowels (string) returns string location '/user/hive/udfs/libudfsample.so' symbol='StripVowels';
[localhost:21000] > select strip_vowels('abcdefg');
+------------------------------+
| udfs.strip_vowels('abcdefg') |
+------------------------------+
| bcdfg |
+------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > select strip_vowels('ABCDEFG');
+------------------------------+
| udfs.strip_vowels('abcdefg') |
+------------------------------+
| BCDFG |
+------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select strip_vowels(null);
+-------------------------+
| udfs.strip_vowels(null) |
+-------------------------+
| NULL |
+-------------------------+
Returned 1 row(s) in 0.16s
[localhost:21000] > select s, strip_vowels(s) from t2;
+-----------+----------------------+
| s | udfs.strip_vowels(s) |
+-----------+----------------------+
| lower | lwr |
| UPPER | PPR |
| Init cap | nt cp |
| CamelCase | CmlCs |
+-----------+----------------------+
Returned 4 row(s) in 0.24s
This example demonstrates a user-defined aggregate function (UDA) that produces the sum of the squares of its input values.
The coding for a UDA is a little more involved than a scalar UDF, because the processing is split into several phases, each implemented by a different function. Each phase is relatively straightforward: the "update" and "merge" phases, where most of the work is done, read an input value and combine it with some accumulated intermediate value.
As in our sample UDF from the previous example, we add function signatures to a header file (in this case, uda-sample.h). Because this is a math-oriented UDA, we make two versions of each function, one accepting an integer value and the other accepting a floating-point value.
void SumOfSquaresInit(FunctionContext* context, BigIntVal* val);
void SumOfSquaresInit(FunctionContext* context, DoubleVal* val);
void SumOfSquaresUpdate(FunctionContext* context, const BigIntVal& input, BigIntVal* val);
void SumOfSquaresUpdate(FunctionContext* context, const DoubleVal& input, DoubleVal* val);
void SumOfSquaresMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst);
void SumOfSquaresMerge(FunctionContext* context, const DoubleVal& src, DoubleVal* dst);
BigIntVal SumOfSquaresFinalize(FunctionContext* context, const BigIntVal& val);
DoubleVal SumOfSquaresFinalize(FunctionContext* context, const DoubleVal& val);
We add the function bodies to a C++ source file (in this case, uda-sample.cc):
void SumOfSquaresInit(FunctionContext* context, BigIntVal* val) {
val->is_null = false;
val->val = 0;
}
void SumOfSquaresInit(FunctionContext* context, DoubleVal* val) {
val->is_null = false;
val->val = 0.0;
}
void SumOfSquaresUpdate(FunctionContext* context, const BigIntVal& input, BigIntVal* val) {
if (input.is_null) return;
val->val += input.val * input.val;
}
void SumOfSquaresUpdate(FunctionContext* context, const DoubleVal& input, DoubleVal* val) {
if (input.is_null) return;
val->val += input.val * input.val;
}
void SumOfSquaresMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst) {
dst->val += src.val;
}
void SumOfSquaresMerge(FunctionContext* context, const DoubleVal& src, DoubleVal* dst) {
dst->val += src.val;
}
BigIntVal SumOfSquaresFinalize(FunctionContext* context, const BigIntVal& val) {
return val;
}
DoubleVal SumOfSquaresFinalize(FunctionContext* context, const DoubleVal& val) {
return val;
}
As with the sample UDF, we build a shared library and put it into HDFS:
$ make
[ 0%] Generating udf_samples/uda-sample.ll
[ 16%] Built target uda-sample-ir
Scanning dependencies of target udasample
[ 33%] Building CXX object CMakeFiles/udasample.dir/uda-sample.o
Linking CXX shared library udf_samples/libudasample.so
[ 33%] Built target udasample
Scanning dependencies of target uda-sample-test
[ 50%] Building CXX object CMakeFiles/uda-sample-test.dir/uda-sample-test.o
Linking CXX executable udf_samples/uda-sample-test
[ 50%] Built target uda-sample-test
[ 50%] Generating udf_samples/udf-sample.ll
[ 66%] Built target udf-sample-ir
[ 83%] Built target udfsample
[100%] Built target udf-sample-test
$ hdfs dfs -put ./udf_samples/libudasample.so /user/hive/udfs/libudasample.so
To create the SQL function, we issue a CREATE AGGREGATE FUNCTION
statement and specify
the underlying C++ function names for the different phases:
[localhost:21000] > use udf_testing;
[localhost:21000] > create table sos (x bigint, y double);
[localhost:21000] > insert into sos values (1, 1.1), (2, 2.2), (3, 3.3), (4, 4.4);
Inserted 4 rows in 1.10s
[localhost:21000] > create aggregate function sum_of_squares(bigint) returns bigint
> location '/user/hive/udfs/libudasample.so'
> init_fn='SumOfSquaresInit'
> update_fn='SumOfSquaresUpdate'
> merge_fn='SumOfSquaresMerge'
> finalize_fn='SumOfSquaresFinalize';
[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1*1 + 2*2 + 3*3 + 4*4;
+-------------------------------+
| 1 * 1 + 2 * 2 + 3 * 3 + 4 * 4 |
+-------------------------------+
| 30 |
+-------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select sum_of_squares(x) from sos;
+------------------------+
| udfs.sum_of_squares(x) |
+------------------------+
| 30 |
+------------------------+
Returned 1 row(s) in 0.35s
Until we create the overloaded version of the UDA, it can only handle a single data type. To allow it to
handle DOUBLE
as well as BIGINT
, we issue another CREATE
AGGREGATE FUNCTION
statement:
[localhost:21000] > select sum_of_squares(y) from sos;
ERROR: AnalysisException: No matching function with signature: udfs.sum_of_squares(DOUBLE).
[localhost:21000] > create aggregate function sum_of_squares(double) returns double
> location '/user/hive/udfs/libudasample.so'
> init_fn='SumOfSquaresInit'
> update_fn='SumOfSquaresUpdate'
> merge_fn='SumOfSquaresMerge'
> finalize_fn='SumOfSquaresFinalize';
[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1.1*1.1 + 2.2*2.2 + 3.3*3.3 + 4.4*4.4;
+-----------------------------------------------+
| 1.1 * 1.1 + 2.2 * 2.2 + 3.3 * 3.3 + 4.4 * 4.4 |
+-----------------------------------------------+
| 36.3 |
+-----------------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select sum_of_squares(y) from sos;
+------------------------+
| udfs.sum_of_squares(y) |
+------------------------+
| 36.3 |
+------------------------+
Returned 1 row(s) in 0.35s
Typically, you use a UDA in queries with GROUP BY
clauses, to produce a result set with
a separate aggregate value for each combination of values from the GROUP BY
clause.
Let's change our sample table to use 0
to indicate rows containing even values, and
1
to flag rows containing odd values. Then the GROUP BY
query can
return two values, the sum of the squares for the even values, and the sum of the squares for the odd
values:
[localhost:21000] > insert overwrite sos values (1, 1), (2, 0), (3, 1), (4, 0);
Inserted 4 rows in 1.24s
[localhost:21000] > -- Compute 1 squared + 3 squared, and 2 squared + 4 squared;
[localhost:21000] > select y, sum_of_squares(x) from sos group by y;
+---+------------------------+
| y | udfs.sum_of_squares(x) |
+---+------------------------+
| 1 | 10 |
| 0 | 20 |
+---+------------------------+
Returned 2 row(s) in 0.43s
When the Impala authorization feature is enabled:
CREATE FUNCTION
statement requires:CREATE
privilege on the database.ALL
privilege on URI where URI is the value
you specified for the LOCATION
in the
CREATE FUNCTION
statement. See Impala Authorization for details about authorization in Impala.
The following limitations and restrictions apply to Impala UDFs in the current release:
The Hive current_user()
function cannot be called from a Java UDF
through Impala.
rand()
to
produce different values for each invocation. It must not retrieve data from external sources, such as
from disk or over the network.
CREATE FUNCTION
syntax for Java UDFs, which excludes the function signature,
to remove the limitation entirely.
CHAR
and VARCHAR
types cannot be used as input arguments or return
values for UDFs.