CREATE FUNCTION Statement

Creates a user-defined function (UDF), which you can use to implement custom logic during SELECT or INSERT operations.

Syntax:

The syntax is different depending on whether you create a scalar UDF, which is called once for each row and implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by multiple functions that compute intermediate results across sets of rows.

In Impala 2.5 and higher, the syntax is also different for creating or dropping scalar Java-based UDFs. The statements for Java UDFs use a new syntax, without any argument types or return type specified. Java-based UDFs created using the new syntax persist across restarts of the Impala catalog server, and can be shared transparently between Impala and Hive.

To create a persistent scalar C++ UDF with CREATE FUNCTION:

CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...])
  RETURNS return_type
  LOCATION 'hdfs_path_to_dot_so'
  SYMBOL='symbol_name'
To create a persistent Java UDF with CREATE FUNCTION:
CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name
  LOCATION 'hdfs_path_to_jar'
  SYMBOL='class_name'

To create a persistent UDA, which must be written in C++, issue a CREATE AGGREGATE FUNCTION statement:

CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...])
  RETURNS return_type
  [INTERMEDIATE type_spec]
  LOCATION 'hdfs_path'
  [INIT_FN='function]
  UPDATE_FN='function
  MERGE_FN='function
  [PREPARE_FN='function]
  [CLOSEFN='function]
  [SERIALIZE_FN='function]
  [FINALIZE_FN='function]

Statement type: DDL

Varargs notation:

Note:

Variable-length argument lists are supported for C++ UDFs, but currently not for Java UDFs.

If the underlying implementation of your function accepts a variable number of arguments:

See Variable-Length Argument Lists for how to code a C++ UDF to accept variable-length argument lists.

Scalar and aggregate functions:

The simplest kind of user-defined function returns a single scalar value each time it is called, typically once for each row in the result set. This general kind of function is what is usually meant by UDF. User-defined aggregate functions (UDAs) are a specialized kind of UDF that produce a single value based on the contents of multiple rows. You usually use UDAs in combination with a GROUP BY clause to condense a large result set into a smaller one, or even a single row summarizing column values across an entire table.

You create UDAs by using the CREATE AGGREGATE FUNCTION syntax. The clauses INIT_FN, UPDATE_FN, MERGE_FN, SERIALIZE_FN, FINALIZE_FN, and INTERMEDIATE only apply when you create a UDA rather than a scalar UDF.

The *_FN clauses specify functions to call at different phases of function processing.

If you use a consistent naming convention for each of the underlying functions, Impala can automatically determine the names based on the first such clause, so the others are optional.

For end-to-end examples of UDAs, see User-Defined Functions (UDFs).

Complex type considerations:

Currently, Impala UDFs cannot accept arguments or return values of the Impala complex types (STRUCT, ARRAY, or MAP).

Usage notes:

If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed metadata has been received by all the Impala nodes. See SYNC_DDL Query Option for details.

Compatibility:

Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types (not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but not Impala UDFs written in C++.

The Hive current_user() function cannot be called from a Java UDF through Impala.

Persistence:

In Impala 2.5 and higher, Impala UDFs and UDAs written in C++ are persisted in the metastore database. Java UDFs are also persisted, if they were created with the new CREATE FUNCTION syntax for Java UDFs, where the Java function argument and return types are omitted. Java-based UDFs created with the old CREATE FUNCTION syntax do not persist across restarts because they are held in the memory of the catalogd daemon. Until you re-create such Java UDFs using the new CREATE FUNCTION syntax, you must reload those Java-based UDFs by running the original CREATE FUNCTION statements again each time you restart the catalogd daemon. Prior to Impala 2.5 the requirement to reload functions after a restart applied to both C++ and Java functions.

Cancellation: Cannot be cancelled.

HDFS permissions: This statement does not touch any HDFS files or directories, therefore no HDFS permissions are required.

Examples:

For additional examples of all kinds of user-defined functions, see User-Defined Functions (UDFs).

The following example shows how to take a Java jar file and make all the functions inside one of its classes into UDFs under a single (overloaded) function name in Impala. Each CREATE FUNCTION or DROP FUNCTION statement applies to all the overloaded Java functions with the same name. This example uses the signatureless syntax for CREATE FUNCTION and DROP FUNCTION, which is available in Impala 2.5 and higher.

At the start, the jar file is in the local filesystem. Then it is copied into HDFS, so that it is available for Impala to reference through the CREATE FUNCTION statement and queries that refer to the Impala function name.


$ jar -tvf udf-examples.jar
     0 Mon Feb 22 04:06:50 PST 2016 META-INF/
   122 Mon Feb 22 04:06:48 PST 2016 META-INF/MANIFEST.MF
     0 Mon Feb 22 04:06:46 PST 2016 org/
     0 Mon Feb 22 04:06:46 PST 2016 org/apache/
     0 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/
  2460 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/IncompatibleUdfTest.class
   541 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdfException.class
  3438 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/JavaUdfTest.class
  5872 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdf.class
...
$ hdfs dfs -put udf-examples.jar /user/impala/udfs
$ hdfs dfs -ls /user/impala/udfs
Found 2 items
-rw-r--r--   3 jrussell supergroup        853 2015-10-09 14:05 /user/impala/udfs/hello_world.jar
-rw-r--r--   3 jrussell supergroup       7366 2016-06-08 14:25 /user/impala/udfs/udf-examples.jar

In impala-shell, the CREATE FUNCTION refers to the HDFS path of the jar file and the fully qualified class name inside the jar. Each of the functions inside the class becomes an Impala function, each one overloaded under the specified Impala function name.


[localhost:21000] > create function testudf location '/user/impala/udfs/udf-examples.jar' symbol='org.apache.impala.TestUdf';
[localhost:21000] > show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN, BOOLEAN)    | JAVA        | true          |
| DOUBLE      | testudf(DOUBLE)                       | JAVA        | true          |
| DOUBLE      | testudf(DOUBLE, DOUBLE)               | JAVA        | true          |
| DOUBLE      | testudf(DOUBLE, DOUBLE, DOUBLE)       | JAVA        | true          |
| FLOAT       | testudf(FLOAT)                        | JAVA        | true          |
| FLOAT       | testudf(FLOAT, FLOAT)                 | JAVA        | true          |
| FLOAT       | testudf(FLOAT, FLOAT, FLOAT)          | JAVA        | true          |
| INT         | testudf(INT)                          | JAVA        | true          |
| DOUBLE      | testudf(INT, DOUBLE)                  | JAVA        | true          |
| INT         | testudf(INT, INT)                     | JAVA        | true          |
| INT         | testudf(INT, INT, INT)                | JAVA        | true          |
| SMALLINT    | testudf(SMALLINT)                     | JAVA        | true          |
| SMALLINT    | testudf(SMALLINT, SMALLINT)           | JAVA        | true          |
| SMALLINT    | testudf(SMALLINT, SMALLINT, SMALLINT) | JAVA        | true          |
| STRING      | testudf(STRING)                       | JAVA        | true          |
| STRING      | testudf(STRING, STRING)               | JAVA        | true          |
| STRING      | testudf(STRING, STRING, STRING)       | JAVA        | true          |
| TINYINT     | testudf(TINYINT)                      | JAVA        | true          |
+-------------+---------------------------------------+-------------+---------------+

These are all simple functions that return their single arguments, or sum, concatenate, and so on their multiple arguments. Impala determines which overloaded function to use based on the number and types of the arguments.


insert into bigint_x values (1), (2), (4), (3);
select testudf(x) from bigint_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| 1               |
| 2               |
| 4               |
| 3               |
+-----------------+

insert into int_x values (1), (2), (4), (3);
select testudf(x, x+1, x*x) from int_x;
+-------------------------------+
| udfs.testudf(x, x + 1, x * x) |
+-------------------------------+
| 4                             |
| 9                             |
| 25                            |
| 16                            |
+-------------------------------+

select testudf(x) from string_x;
+-----------------+
| udfs.testudf(x) |
+-----------------+
| one             |
| two             |
| four            |
| three           |
+-----------------+
select testudf(x,x) from string_x;
+--------------------+
| udfs.testudf(x, x) |
+--------------------+
| oneone             |
| twotwo             |
| fourfour           |
| threethree         |
+--------------------+

The previous example used the same Impala function name as the name of the class. This example shows how the Impala function name is independent of the underlying Java class or function names. A second CREATE FUNCTION statement results in a set of overloaded functions all named my_func, to go along with the overloaded functions all named testudf.


create function my_func location '/user/impala/udfs/udf-examples.jar'
  symbol='org.apache.impala.TestUdf';

show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | my_func(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | my_func(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | my_func(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...

The corresponding DROP FUNCTION statement with no signature drops all the overloaded functions with that name.


drop function my_func;
show functions;
+-------------+---------------------------------------+-------------+---------------+
| return type | signature                             | binary type | is persistent |
+-------------+---------------------------------------+-------------+---------------+
| BIGINT      | testudf(BIGINT)                       | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN)                      | JAVA        | true          |
| BOOLEAN     | testudf(BOOLEAN, BOOLEAN)             | JAVA        | true          |
...

The signatureless CREATE FUNCTION syntax for Java UDFs ensures that the functions shown in this example remain available after the Impala service (specifically, the Catalog Server) are restarted.

Related information:

User-Defined Functions (UDFs) for more background information, usage instructions, and examples for Impala UDFs; DROP FUNCTION Statement