Enabling Sentry Authorization for Impala

Authorization determines which users are allowed to access which resources, and what operations they are allowed to perform. In Impala 1.1 and higher, you use Apache Sentry for authorization. Sentry adds a fine-grained authorization framework for Hadoop. By default (when authorization is not enabled), Impala does all read and write operations with the privileges of the impala user, which is suitable for a development/test environment but not for a secure production environment. When authorization is enabled, Impala uses the OS user ID of the user who runs impala-shell or other client program, and associates various privileges with each user.

Note: Sentry is typically used in conjunction with Kerberos authentication, which defines which hosts are allowed to connect to each server. Using the combination of Sentry and Kerberos prevents malicious users from being able to connect by creating a named account on an untrusted machine. See Enabling Kerberos Authentication for Impala for details about Kerberos authentication.

See the following sections for details about using the Impala authorization features:

The Sentry Privilege Model

Privileges can be granted on different objects in the schema. Any privilege that can be granted is associated with a level in the object hierarchy. If a privilege is granted on a parent object in the hierarchy, the child object automatically inherits it. This is the same privilege model as Hive and other database systems.

The objects in the Impala schema hierarchy are:

Server
    URI
    Database
        Table
            Column

The table-level privileges apply to views as well. Anywhere you specify a table name, you can specify a view name instead.

In Impala 2.3 and higher, you can specify privileges for individual columns.

The table below lists the minimum level of privileges and the scope required to execute SQL statements in Impala 3.0 and higher. The following notations are used:
  • ANY denotes the SELECT, INSERT, CREATE, ALTER, DROP, or REFRESH privilege.
  • ALL privilege denotes the SELECT, INSERT, CREATE, ALTER, DROP, and REFRESH privileges.
  • The owner of an object effectively has the ALL privilege on the object.
  • The parent levels of the specified scope are implicitly supported where a scope refers to the specific level in the object hierarchy that the privilege is granted. For example, if a privilege is listed with the TABLE scope, the same privilege granted on DATABASE and SERVER will allow the user to execute the specified SQL statement.
SQL Statement Privileges Scope
SELECT SELECT TABLE
WITH SELECT SELECT TABLE
EXPLAIN SELECT SELECT TABLE
INSERT INSERT TABLE
EXPLAIN INSERT INSERT TABLE
TRUNCATE INSERT TABLE
LOAD INSERT TABLE
  ALL URI
CREATE DATABASE CREATE SERVER
CREATE DATABASE LOCATION CREATE SERVER
  ALL URI
CREATE TABLE CREATE DATABASE
CREATE TABLE LIKE CREATE DATABASE
  SELECT, INSERT, or REFRESH TABLE
CREATE TABLE AS SELECT CREATE DATABASE
  INSERT DATABASE
  SELECT TABLE
EXPLAIN CREATE TABLE AS SELECT CREATE DATABASE
  INSERT DATABASE
  SELECT TABLE
CREATE TABLE LOCATION CREATE TABLE
  ALL URI
CREATE VIEW CREATE DATABASE
  SELECT TABLE
ALTER DATABASE SET OWNER ALL WITH GRANT DATABASE
ALTER TABLE ALTER TABLE
ALTER TABLE SET LOCATION ALTER TABLE
  ALL URI
ALTER TABLE RENAME CREATE DATABASE
  ALL TABLE
ALTER TABLE SET OWNER ALL WITH GRANT TABLE
ALTER VIEW ALTER TABLE
  SELECT TABLE
ALTER VIEW RENAME CREATE DATABASE
  ALL TABLE
ALTER VIEW SET OWNER ALL WITH GRANT VIEW
DROP DATABASE DROP DATABASE
DROP TABLE DROP TABLE
DROP VIEW DROP TABLE
CREATE FUNCTION CREATE DATABASE
  ALL URI
DROP FUNCTION DROP DATABASE
COMPUTE STATS ALTER and SELECT TABLE
DROP STATS ALTER TABLE
INVALIDATE METADATA REFRESH SERVER
INVALIDATE METADATA <table> REFRESH TABLE
REFRESH <table> REFRESH TABLE
REFRESH FUNCTIONS REFRESH DATABASE
COMMENT ON DATABASE ALTER DATABASE
COMMENT ON TABLE ALTER TABLE
COMMENT ON VIEW ALTER TABLE
COMMENT ON COLUMN ALTER TABLE
DESCRIBE DATABASE SELECT, INSERT, or REFRESH DATABASE
DESCRIBE <table/view> SELECT, INSERT, or REFRESH TABLE
If the user has the SELECT privilege at the COLUMN level, only the columns the user has access will show. SELECT COLUMN
USE ANY TABLE
SHOW DATABASES ANY TABLE
SHOW TABLES ANY TABLE
SHOW FUNCTIONS SELECT, INSERT, or REFRESH DATABASE
SHOW PARTITIONS SELECT, INSERT, or REFRESH TABLE
SHOW TABLE STATS SELECT, INSERT, or REFRESH TABLE
SHOW COLUMN STATS SELECT, INSERT, or REFRESH TABLE
SHOW FILES SELECT, INSERT, or REFRESH TABLE
SHOW CREATE TABLE SELECT, INSERT, or REFRESH TABLE
SHOW CREATE VIEW SELECT, INSERT, or REFRESH TABLE
SHOW CREATE FUNCTION SELECT, INSERT, or REFRESH DATABASE
SHOW RANGE PARTITIONS (Kudu only) SELECT, INSERT, or REFRESH TABLE
UPDATE (Kudu only) ALL TABLE
EXPLAIN UPDATE (Kudu only) ALL TABLE
UPSERT (Kudu only) ALL TABLE
WITH UPSERT (Kudu only) ALL TABLE
EXPLAIN UPSERT (Kudu only) ALL TABLE
DELETE (Kudu only) ALL TABLE
EXPLAIN DELETE (Kudu only) ALL TABLE

Originally, privileges were encoded in a policy file, stored in HDFS. This mode of operation is still an option, but the emphasis of privilege management is moving towards being SQL-based. The mode of operation with GRANT and REVOKE statements instead of the policy file requires that a special Sentry service be enabled; this service stores, retrieves, and manipulates privilege information stored inside the metastore database.

Note:

Although this document refers to the ALL privilege, currently if you use the policy file mode, you do not use the actual keyword ALL in the policy file. When you code role entries in the policy file:

  • To specify the ALL privilege for a server, use a role like server=server_name.
  • To specify the ALL privilege for a database, use a role like server=server_name->db=database_name.
  • To specify the ALL privilege for a table, use a role like server=server_name->db=database_name->table=table_name->action=*.

If you change privileges in Sentry, e.g. adding a user, removing a user, modifying privileges, you must clear the Impala Catalog server cache by running the INVALIDATE METADATA statement. INVALIDATE METADATA is not required if you make the changes to privileges within Impala.

Starting the impalad Daemon with Sentry Authorization Enabled

To run the impalad daemon with authorization enabled, you add one or more options to the IMPALA_SERVER_ARGS declaration in the /etc/default/impala configuration file:

For example, you might adapt your /etc/default/impala configuration to contain lines like the following. To use the Sentry service rather than the policy file:

IMPALA_SERVER_ARGS=" \
-server_name=server1 \
...

Or to use the policy file, as in releases prior to Impala 1.4:

IMPALA_SERVER_ARGS=" \
-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
-server_name=server1 \
...

The preceding examples set up a symbolic name of server1 to refer to the current instance of Impala. Specify the symbolic name for the sentry.hive.server property in the sentry-site.xml configuration file for Hive, as well as in the -server_name option for impalad.

Now restart the impalad daemons on all the nodes.

Using Impala with the Sentry Service

When you use the Sentry service, you set up privileges through the GRANT and REVOKE statements in either Impala or Hive. Then both components use those same privileges automatically. (Impala added the GRANT and REVOKE statements in Impala 2.0.)

For information about using the Impala GRANT and REVOKE statements, see GRANT Statement (Impala 2.0 or higher only) and REVOKE Statement (Impala 2.0 or higher only).

URIs represent the file paths you specify as part of statements such as CREATE EXTERNAL TABLE and LOAD DATA. Typically, you specify what look like UNIX paths, but these locations can also be prefixed with hdfs:// to make clear that they are really URIs. To set privileges for a URI, specify the name of a directory, and the privilege applies to all the files in that directory and any directories underneath it.

URIs must start with hdfs://, s3a://, adl://, or file://. If a URI starts with an absolute path, the path will be appended to the default filesystem prefix. For example, if you specify:

GRANT ALL ON URI '/tmp';
The above statement effectively becomes the following where the default filesystem is HDFS.

GRANT ALL ON URI 'hdfs://localhost:20500/tmp';
When defining URIs for HDFS, you must also specify the NameNode. For example:
GRANT ALL ON URI file:///path/to/dir TO <role>
GRANT ALL ON URI hdfs://namenode:port/path/to/dir TO <role>
Warning:

Because the NameNode host and port must be specified, it is strongly recommended that you use High Availability (HA). This ensures that the URI will remain constant even if the NameNode changes. For example:

GRANT ALL ON URI hdfs://ha-nn-uri/path/to/dir TO <role>

Examples of Setting up Authorization for Security Scenarios

The following examples show how to set up authorization to deal with various scenarios.

A User with No Privileges

If a user has no privileges at all, that user cannot access any schema objects in the system. The error messages do not disclose the names or existence of objects that the user is not authorized to read.

This is the experience you want a user to have if they somehow log into a system where they are not an authorized Impala user. Or in a real deployment, a user might have no privileges because they are not a member of any of the authorized groups.

Examples of Privileges for Administrative Users

In this example, the SQL statements grant the entire_server role all privileges on both the databases and URIs within the server.

CREATE ROLE entire_server;
GRANT ROLE entire_server TO GROUP admin_group;
GRANT ALL ON SERVER server1 TO ROLE entire_server;

A User with Privileges for Specific Databases and Tables

If a user has privileges for specific tables in specific databases, the user can access those things but nothing else. They can see the tables and their parent databases in the output of SHOW TABLES and SHOW DATABASES, USE the appropriate databases, and perform the relevant actions (SELECT and/or INSERT) based on the table privileges. To actually create a table requires the ALL privilege at the database level, so you might define separate roles for the user that sets up a schema and other users or applications that perform day-to-day operations on the tables.


CREATE ROLE one_database;
GRANT ROLE one_database TO GROUP admin_group;
GRANT ALL ON DATABASE db1 TO ROLE one_database;

CREATE ROLE instructor;
GRANT ROLE instructor TO GROUP trainers;
GRANT ALL ON TABLE db1.lesson TO ROLE instructor;

# This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP.
CREATE ROLE student;
GRANT ROLE student TO GROUP visitors;
GRANT SELECT ON TABLE db1.training TO ROLE student;

Privileges for Working with External Data Files

When data is being inserted through the LOAD DATA statement, or is referenced from an HDFS location outside the normal Impala database directories, the user also needs appropriate permissions on the URIs corresponding to those HDFS locations.

In this example:

  • The external_table role can insert into and query the Impala table, external_table.sample.
  • The staging_dir role can specify the HDFS path /user/impala-user/external_data with the LOAD DATA statement. When Impala queries or loads data files, it operates on all the files in that directory, not just a single file, so any Impala LOCATION parameters refer to a directory rather than an individual file.
CREATE ROLE external_table;
GRANT ROLE external_table TO GROUP impala_users;
GRANT ALL ON TABLE external_table.sample TO ROLE external_table;

CREATE ROLE staging_dir;
GRANT ROLE staging TO GROUP impala_users;
GRANT ALL ON URI 'hdfs://127.0.0.1:8020/user/impala-user/external_data' TO ROLE staging_dir;

Separating Administrator Responsibility from Read and Write Privileges

To create a database, you need the full privilege on that database while day-to-day operations on tables within that database can be performed with lower levels of privilege on specific table. Thus, you might set up separate roles for each database or application: an administrative one that could create or drop the database, and a user-level one that can access only the relevant tables.

In this example, the responsibilities are divided between users in 3 different groups:

  • Members of the supergroup group have the training_sysadmin role and so can set up a database named training.
  • Members of the impala_users group have the instructor role and so can create, insert into, and query any tables in the training database, but cannot create or drop the database itself.
  • Members of the visitor group have the student role and so can query those tables in the training database.
CREATE ROLE training_sysadmin;
GRANT ROLE training_sysadmin TO GROUP supergroup;
GRANT ALL ON DATABASE training1 TO ROLE training_sysadmin;

CREATE ROLE instructor;
GRANT ROLE instructor TO GROUP impala_users;
GRANT ALL ON TABLE training1.course1 TO ROLE instructor;

CREATE ROLE visitor;
GRANT ROLE student TO GROUP visitor;
GRANT SELECT ON TABLE training1.course1 TO ROLE student;

Using Impala with the Sentry Policy File

The policy file is a file that you put in a designated location in HDFS, and is read during the startup of the impalad daemon when you specify both the -server_name and -authorization_policy_file startup options. It controls which objects (databases, tables, and HDFS directory paths) can be accessed by the user who connects to impalad, and what operations that user can perform on the objects.

Note: The policy-file based authorization was deprecated in Impala 2.6. We recommend managing privileges through SQL statements as described in Using Impala with the Sentry Service. If you are still using policy files, plan to migrate to the new approach some time in the future.

The location of the policy file is listed in the auth-site.xml configuration file.

When authorization is enabled, Impala uses the policy file as a whitelist, representing every privilege available to any user on any object. That is, only operations specified for the appropriate combination of object, role, group, and user are allowed. All other operations are not allowed. If a group or role is defined multiple times in the policy file, the last definition takes precedence.

To understand the notion of whitelisting, set up a minimal policy file that does not provide any privileges for any object. When you connect to an Impala node where this policy file is in effect, you get no results for SHOW DATABASES, and an error when you issue any SHOW TABLES, USE database_name, DESCRIBE table_name, SELECT, and or other statements that expect to access databases or tables, even if the corresponding databases and tables exist.

The contents of the policy file are cached, to avoid a performance penalty for each query. The policy file is re-checked by each impalad node every 5 minutes. When you make a non-time-sensitive change such as adding new privileges or new users, you can let the change take effect automatically a few minutes later. If you remove or reduce privileges, and want the change to take effect immediately, restart the impalad daemon on all nodes, again specifying the -server_name and -authorization_policy_file options so that the rules from the updated policy file are applied.

Policy File Format

The policy file uses the familiar .ini format, divided into the major sections [groups] and [roles].

There is also an optional [databases] section, which allows you to specify a specific policy file for a particular database, as explained in Using Multiple Policy Files for Different Databases.

Another optional section, [users], allows you to override the OS-level mapping of users to groups; that is an advanced technique primarily for testing and debugging, and is beyond the scope of this document.

In the [groups] section, you define various categories of users and select which roles are associated with each category. The group and usernames correspond to Linux groups and users on the server where the impalad daemon runs.

The group and usernames in the [groups] section correspond to Hadoop groups and users on the server where the impalad daemon runs. When you access Impala through the impalad interpreter, for purposes of authorization, the user is the logged-in Linux user and the groups are the Linux groups that user is a member of. When you access Impala through the ODBC or JDBC interfaces, the user and password specified through the connection string are used as login credentials for the Linux server, and authorization is based on that username and the associated Linux group membership.

In the [roles] section, you a set of roles. For each role, you specify precisely the set of privileges is available. That is, which objects users with that role can access, and what operations they can perform on those objects. This is the lowest-level category of security information; the other sections in the policy file map the privileges to higher-level divisions of groups and users. In the [groups] section, you specify which roles are associated with which groups. The group and usernames correspond to Linux groups and users on the server where the impalad daemon runs. The privileges are specified using patterns like:
server=server_name->db=database_name->table=table_name->action=SELECT
server=server_name->db=database_name->table=table_name->action=CREATE
server=server_name->db=database_name->table=table_name->action=ALL
For the server_name value, substitute the same symbolic name you specify with the impalad -server_name option. You can use * wildcard characters at each level of the privilege specification to allow access to all such objects. For example:
server=impala-host.example.com->db=default->table=t1->action=SELECT
server=impala-host.example.com->db=*->table=*->action=CREATE
server=impala-host.example.com->db=*->table=audit_log->action=SELECT
server=impala-host.example.com->db=default->table=t1->action=*

Using Multiple Policy Files for Different Databases

For an Impala cluster with many databases being accessed by many users and applications, it might be cumbersome to update the security policy file for each privilege change or each new database, table, or view. You can allow security to be managed separately for individual databases, by setting up a separate policy file for each database:

  • Add the optional [databases] section to the main policy file.
  • Add entries in the [databases] section for each database that has its own policy file.
  • For each listed database, specify the HDFS path of the appropriate policy file.

For example:

[databases]
# Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini

To enable URIs in per-DB policy files, the Java configuration option sentry.allow.uri.db.policyfile must be set to true. For example:

JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"
Important: Enabling URIs in per-DB policy files introduces a security risk by allowing the owner of the db-level policy file to grant himself/herself load privileges to anything the impala user has read permissions for in HDFS (including data in other databases controlled by different db-level policy files).

Setting Up Schema Objects for a Secure Impala Deployment

In your role definitions, you must specify privileges at the level of individual databases and tables, or all databases or all tables within a database. To simplify the structure of these rules, plan ahead of time how to name your schema objects so that data with different authorization requirements is divided into separate databases.

If you are adding security on top of an existing Impala deployment, you can rename tables or even move them between databases using the ALTER TABLE statement.

Debugging Failed Sentry Authorization Requests

Sentry logs all facts that lead up to authorization decisions at the debug level. If you do not understand why Sentry is denying access, the best way to debug is to temporarily turn on debug logging:
  • Add log4j.logger.org.apache.sentry=DEBUG to the log4j.properties file on each host in the cluster, in the appropriate configuration directory for each service.
Specifically, look for exceptions and messages such as:
FilePermission server..., RequestPermission server...., result [true|false]
which indicate each evaluation Sentry makes. The FilePermission is from the policy file, while RequestPermission is the privilege required for the query. A RequestPermission will iterate over all appropriate FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false indicating "Access Denied".

The DEFAULT Database in a Secure Deployment

Because of the extra emphasis on granular access controls in a secure deployment, you should move any important or sensitive information out of the DEFAULT database into a named database whose privileges are specified in the policy file. Sometimes you might need to give privileges on the DEFAULT database for administrative reasons; for example, as a place you can reliably specify with a USE statement when preparing to drop a database.