Enabling Sentry Authorization for Impala
Authorization determines which users are allowed to access which resources, and what operations they are
allowed to perform. In Impala 1.1 and higher, you use Apache Sentry for
authorization. Sentry adds a fine-grained authorization framework for Hadoop. By default (when authorization
is not enabled), Impala does all read and write operations with the privileges of the impala
user, which is suitable for a development/test environment but not for a secure production environment. When
authorization is enabled, Impala uses the OS user ID of the user who runs impala-shell or
other client program, and associates various privileges with each user.
See the following sections for details about using the Impala authorization features:
The Sentry Privilege Model
Privileges can be granted on different objects in the schema. Any privilege that can be granted is associated with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the child object automatically inherits it. This is the same privilege model as Hive and other database systems such as MySQL.
The object hierarchy for Impala covers Server, URI, Database, Table, and Column. (The Table privileges apply to views as well; anywhere you specify a table name, you can specify a view name instead.) Column-level authorization is available in Impala 2.3 and higher. Previously, you constructed views to query specific columns and assigned privilege based on the views rather than the base tables. Now, you can use Impala's GRANT Statement (Impala 2.0 or higher only) and REVOKE Statement (Impala 2.0 or higher only) statements to assign and revoke privileges from specific columns in a table.
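For example, with the Sentry service enabled, column-level privileges can be managed with statements along these lines. This is a sketch only; the role, group, table, and column names (analyst_role, analysts, sales.orders, customer_id, order_total) are hypothetical:

```sql
-- Requires Impala 2.3 or higher and the Sentry service.
CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;

-- Allow reading only two columns of the (hypothetical) sales.orders table.
GRANT SELECT (customer_id, order_total) ON TABLE sales.orders TO ROLE analyst_role;

-- Later, withdraw access to one of the columns.
REVOKE SELECT (order_total) ON TABLE sales.orders FROM ROLE analyst_role;
```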
A restricted set of privileges determines what you can do with each object:

- SELECT privilege: Lets you read data from a table or view, for example with the SELECT statement, the INSERT...SELECT syntax, or CREATE TABLE...LIKE. Also required to issue the DESCRIBE statement or the EXPLAIN statement for a query against a particular table. Only objects for which a user has this privilege are shown in the output of the SHOW DATABASES and SHOW TABLES statements. The REFRESH and INVALIDATE METADATA statements only access metadata for tables for which the user has this privilege.
- INSERT privilege: Lets you write data to a table. Applies to the INSERT and LOAD DATA statements.
- ALL privilege: Lets you create or modify the object. Required to run DDL statements such as CREATE TABLE, ALTER TABLE, or DROP TABLE for a table; CREATE DATABASE or DROP DATABASE for a database; or CREATE VIEW, ALTER VIEW, or DROP VIEW for a view. Also required for the URI of the "location" parameter of the CREATE EXTERNAL TABLE and LOAD DATA statements.
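As a sketch of how these privileges map to statements (the database, table, and role setup here is hypothetical), a user whose role carries only the SELECT privilege on a table could run the first two statements below but would be rejected for the last two:

```sql
-- Allowed with SELECT privilege on the (hypothetical) sales.orders table:
SELECT count(*) FROM sales.orders;
EXPLAIN SELECT * FROM sales.orders;

-- Rejected: writing requires the INSERT privilege on the target table.
INSERT INTO sales.orders SELECT * FROM sales.orders_staging;

-- Rejected: DDL such as DROP TABLE requires the ALL privilege.
DROP TABLE sales.orders;
```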
Privileges can be specified for a table or view before that object actually exists. If you do not have sufficient privilege to perform an operation, the error message does not disclose if the object exists or not.
Originally, privileges were encoded in a policy file stored in HDFS. This mode of operation is still an option, but the emphasis of privilege management is moving toward being SQL-based. Impala can make use of privileges assigned through GRANT and REVOKE statements issued through Hive, and in Impala 2.0 and higher it has GRANT and REVOKE statements of its own. The mode of operation with GRANT and REVOKE statements instead of the policy file requires that a special Sentry service be enabled; this service stores, retrieves, and manipulates privilege information stored inside the metastore database.
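Before Impala had GRANT and REVOKE statements of its own, such privileges were typically assigned through Hive. A hedged sketch of the kind of statements involved, with hypothetical role, group, and database names (report_writer, reporting_users, reporting_db):

```sql
-- Issued through the Hive shell; the Sentry service must be enabled.
CREATE ROLE report_writer;
GRANT ROLE report_writer TO GROUP reporting_users;
GRANT ALL ON DATABASE reporting_db TO ROLE report_writer;

-- Withdraw the privilege again.
REVOKE ALL ON DATABASE reporting_db FROM ROLE report_writer;
```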
Starting the impalad Daemon with Sentry Authorization Enabled
To run the impalad daemon with authorization enabled, you add one or more options to the IMPALA_SERVER_ARGS declaration in the /etc/default/impala configuration file:
- The -server_name option turns on Sentry authorization for Impala. The authorization rules refer to a symbolic server name, and you specify the name to use as the argument to the -server_name option.
- If you specify just -server_name, Impala uses the Sentry service for authorization, relying on the results of GRANT and REVOKE statements issued through Hive. (This mode of operation is available in Impala 1.4.0 and higher.) Prior to Impala 1.4.0, or if you want to continue storing privilege rules in the policy file, also specify the -authorization_policy_file option as in the following item.
- Specifying the -authorization_policy_file option in addition to -server_name makes Impala read privilege information from a policy file, rather than from the metastore database. The argument to the -authorization_policy_file option specifies the HDFS path to the policy file that defines the privileges on different schema objects.
For example, you might adapt your /etc/default/impala configuration to contain lines like the following. To use the Sentry service rather than the policy file:
IMPALA_SERVER_ARGS=" \
-server_name=server1 \
...
Or to use the policy file, as in releases prior to Impala 1.4:
IMPALA_SERVER_ARGS=" \
-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
-server_name=server1 \
...
The preceding examples set up a symbolic name of server1 to refer to the current instance of Impala. This symbolic name is used in the following ways:

- Specify the server1 value for the sentry.hive.server property in the sentry-site.xml configuration file for Hive, as well as in the -server_name option for impalad. If the impalad daemon is not already running, start it as described in Starting Impala. If it is already running, restart it with the command sudo /etc/init.d/impala-server restart. Run the appropriate commands on all the nodes where impalad normally runs.
- If you use the mode of operation using the policy file, the rules in the [roles] section of the policy file refer to this same server1 name. For example, the following rule sets up a role report_generator that lets users with that role query any table in a database named reporting_db on a node where the impalad daemon was started up with the -server_name=server1 option:

  [roles]
  report_generator = server=server1->db=reporting_db->table=*->action=SELECT
When impalad is started with one or both of the -server_name=server1
and -authorization_policy_file
options, Impala authorization is enabled. If Impala detects
any errors or inconsistencies in the authorization settings or the policy file, the daemon refuses to
start.
Using Impala with the Sentry Service (Impala 1.4 or higher only)
When you use the Sentry service rather than the policy file, you set up privileges through GRANT and REVOKE statements in either Impala or Hive; then both components use those same privileges automatically. (Impala added the GRANT and REVOKE statements in Impala 2.0.)
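For instance, once privileges are managed by the Sentry service, a session in impala-shell might assign and inspect them like this. The role and table names (student_role, training.lessons) are hypothetical:

```sql
GRANT SELECT ON TABLE training.lessons TO ROLE student_role;

-- List the privileges currently held by the role.
SHOW GRANT ROLE student_role;

REVOKE SELECT ON TABLE training.lessons FROM ROLE student_role;
```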
Using Impala with the Sentry Policy File
The policy file is a file that you put in a designated location in HDFS, and is read during the startup of
the impalad daemon when you specify both the -server_name
and
-authorization_policy_file
startup options. It controls which objects (databases, tables,
and HDFS directory paths) can be accessed by the user who connects to impalad, and what
operations that user can perform on the objects.
The Sentry service, as described in Using Impala with the Sentry Service (Impala 1.4 or higher only), stores authorization metadata in a relational database. This means you can manage user privileges for Impala tables using traditional GRANT and REVOKE SQL statements, rather than the policy file approach described here. If you are still using policy files, migrate to the database-backed service whenever practical.
The location of the policy file is listed in the auth-site.xml configuration file. To minimize overhead, the security information from this file is cached by each impalad daemon and refreshed automatically, with a default interval of 5 minutes. After making a substantial change to security policies, restart all Impala daemons to pick up the changes immediately.
Policy File Location and Format
The policy file uses the familiar .ini format, divided into the major sections [groups] and [roles]. There is also an optional [databases] section, which allows you to specify a specific policy file for a particular database, as explained in Using Multiple Policy Files for Different Databases. Another optional section, [users], allows you to override the OS-level mapping of users to groups; that is an advanced technique primarily for testing and debugging, and is beyond the scope of this document.
In the [groups] section, you define various categories of users and select which roles are associated with each category. The group and usernames correspond to Linux groups and users on the server where the impalad daemon runs. When you access Impala through the impala-shell interpreter, for purposes of authorization, the user is the logged-in Linux user and the groups are the Linux groups that user is a member of. When you access Impala through the ODBC or JDBC interfaces, the user and password specified through the connection string are used as login credentials for the Linux server, and authorization is based on that username and the associated Linux group membership.
In the [roles] section, you define a set of roles. For each role, you specify precisely the set of privileges that is available: that is, which objects users with that role can access, and what operations they can perform on those objects. This is the lowest-level category of security information; the other sections in the policy file map the privileges to higher-level divisions of groups and users. In the [groups] section, you specify which roles are associated with which groups. The group and usernames correspond to Linux groups and users on the server where the impalad daemon runs. The privileges are specified using patterns like:
server=server_name->db=database_name->table=table_name->action=SELECT
server=server_name->db=database_name->table=table_name->action=CREATE
server=server_name->db=database_name->table=table_name->action=ALL
For the server_name value, substitute the same symbolic name you specify with the
impalad -server_name
option. You can use *
wildcard
characters at each level of the privilege specification to allow access to all such objects. For example:
server=impala-host.example.com->db=default->table=t1->action=SELECT
server=impala-host.example.com->db=*->table=*->action=CREATE
server=impala-host.example.com->db=*->table=audit_log->action=SELECT
server=impala-host.example.com->db=default->table=t1->action=*
When authorization is enabled, Impala uses the policy file as a whitelist, representing every privilege available to any user on any object. That is, only operations specified for the appropriate combination of object, role, group, and user are allowed; all other operations are not allowed. If a group or role is defined multiple times in the policy file, the last definition takes precedence.
To understand the notion of whitelisting, set up a minimal policy file that does not provide any privileges for any object. When you connect to an Impala node where this policy file is in effect, you get no results for SHOW DATABASES, and an error when you issue any SHOW TABLES, USE database_name, DESCRIBE table_name, SELECT, or other statements that expect to access databases or tables, even if the corresponding databases and tables exist.
The contents of the policy file are cached, to avoid a performance penalty for each query. The policy
file is re-checked by each impalad node every 5 minutes. When you make a
non-time-sensitive change such as adding new privileges or new users, you can let the change take effect
automatically a few minutes later. If you remove or reduce privileges, and want the change to take effect
immediately, restart the impalad daemon on all nodes, again specifying the
-server_name
and -authorization_policy_file
options so that the rules
from the updated policy file are applied.
Examples of Policy File Rules for Security Scenarios
The following examples show rules that might go in the policy file to deal with various authorization-related scenarios. For illustration purposes, this section shows several very small policy files with only a few rules each. In your environment, typically you would define many roles to cover all the scenarios involving your own databases, tables, and applications, and a smaller number of groups, whose members are given the privileges from one or more roles.
A User with No Privileges
If a user has no privileges at all, that user cannot access any schema objects in the system. The error messages do not disclose the names or existence of objects that the user is not authorized to read.
This is the experience you want a user to have if they somehow log into a system where they are not an authorized Impala user. In a real deployment with a filled-in policy file, a user might have no privileges because they are not a member of any of the relevant groups mentioned in the policy file.
Examples of Privileges for Administrative Users
When an administrative user has broad access to tables or databases, the associated rules in the
[roles]
section typically use wildcards and/or inheritance. For example, in the
following sample policy file, db=*
refers to all databases and
db=*->table=*
refers to all tables in all databases.
Omitting the rightmost portion of a rule means that the privileges apply to all the objects that could
be specified there. For example, in the following sample policy file, the
all_databases
role has all privileges for all tables in all databases, while the
one_database
role has all privileges for all tables in one specific database. The
all_databases
role does not grant privileges on URIs, so a group with that role could
not issue a CREATE TABLE
statement with a LOCATION
clause. The
entire_server
role has all privileges on both databases and URIs within the server.
[groups]
supergroup = all_databases
[roles]
read_all_tables = server=server1->db=*->table=*->action=SELECT
all_tables = server=server1->db=*->table=*
all_databases = server=server1->db=*
one_database = server=server1->db=test_db
entire_server = server=server1
A User with Privileges for Specific Databases and Tables
If a user has privileges for specific tables in specific databases, the user can access those things but nothing else. They can see the tables and their parent databases in the output of SHOW TABLES and SHOW DATABASES, USE the appropriate databases, and perform the relevant actions (SELECT and/or INSERT) based on the table privileges. To actually create a table requires the ALL privilege at the database level, so you might define separate roles for the user that sets up a schema and other users or applications that perform day-to-day operations on the tables.
The following sample policy file shows some of the syntax that is appropriate as the policy file grows,
such as the #
comment syntax, \
continuation syntax, and comma
separation for roles assigned to groups or privileges assigned to roles.
[groups]
employee = training_sysadmin, instructor
visitor = student
[roles]
training_sysadmin = server=server1->db=training, \
server=server1->db=instructor_private, \
server=server1->db=lesson_development
instructor = server=server1->db=training->table=*->action=*, \
server=server1->db=instructor_private->table=*->action=*, \
server=server1->db=lesson_development->table=lesson*
# This particular course is all about queries, so the students can SELECT but not INSERT or CREATE/DROP.
student = server=server1->db=training->table=lesson_*->action=SELECT
Privileges for Working with External Data Files
When data is being inserted through the LOAD DATA
statement, or is referenced from an
HDFS location outside the normal Impala database directories, the user also needs appropriate
permissions on the URIs corresponding to those HDFS locations.
In this sample policy file:
- The external_table role lets us insert into and query the Impala table external_table.sample.
- The staging_dir role lets us specify the HDFS path /user/username/external_data with the LOAD DATA statement. Remember, when Impala queries or loads data files, it operates on all the files in that directory, not just a single file, so any Impala LOCATION parameters refer to a directory rather than an individual file.
- We included the IP address and port of the Hadoop name node in the HDFS URI of the staging_dir rule. We found those details in /etc/hadoop/conf/core-site.xml, under the fs.default.name element. That is what we use in any roles that specify URIs (that is, the locations of directories in HDFS).
- We start this example after the table external_table.sample is already created. In the policy file for the example, we have already taken away the external_table_admin role from the username group, and replaced it with the lesser-privileged external_table role.
- We assign privileges to a subdirectory underneath /user/username in HDFS, because such privileges also apply to any subdirectories underneath. If we had assigned privileges to the parent directory /user/username, it would be too easy to mess up other files by specifying a wrong location by mistake.
- The username under the [groups] section refers to the username group. (In this example, there is a username user that is a member of a username group.)
Policy file:
[groups]
username = external_table, staging_dir
[roles]
external_table_admin = server=server1->db=external_table
external_table = server=server1->db=external_table->table=sample->action=*
staging_dir = server=server1->uri=hdfs://127.0.0.1:8020/user/username/external_data->action=*
impala-shell session:
[localhost:21000] > use external_table;
Query: use external_table
[localhost:21000] > show tables;
Query: show tables
Query finished, fetching results ...
+--------+
| name |
+--------+
| sample |
+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > select * from sample;
Query: select * from sample
Query finished, fetching results ...
+-----+
| x |
+-----+
| 1 |
| 5 |
| 150 |
+-----+
Returned 3 row(s) in 1.04s
[localhost:21000] > load data inpath '/user/username/external_data' into table sample;
Query: load data inpath '/user/username/external_data' into table sample
Query finished, fetching results ...
+----------------------------------------------------------+
| summary |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.26s
[localhost:21000] > select * from sample;
Query: select * from sample
Query finished, fetching results ...
+-------+
| x |
+-------+
| 2 |
| 4 |
| 6 |
| 8 |
| 64738 |
| 49152 |
| 1 |
| 5 |
| 150 |
+-------+
Returned 9 row(s) in 0.22s
[localhost:21000] > load data inpath '/user/username/unauthorized_data' into table sample;
Query: load data inpath '/user/username/unauthorized_data' into table sample
ERROR: AuthorizationException: User 'username' does not have privileges to access: hdfs://127.0.0.1:8020/user/username/unauthorized_data
Separating Administrator Responsibility from Read and Write Privileges
Remember that creating a database requires full privilege on that database, while day-to-day operations on tables within that database can be performed with lower levels of privilege on specific tables. Thus, you might set up separate roles for each database or application: an administrative one that can create or drop the database, and a user-level one that can access only the relevant tables.
For example, this policy file divides responsibilities between users in 3 different groups:
- Members of the supergroup group have the training_sysadmin role and so can set up a database named training.
- Members of the employee group have the instructor role and so can create, insert into, and query any tables in the training database, but cannot create or drop the database itself.
- Members of the visitor group have the student role and so can query those tables in the training database.
[groups]
supergroup = training_sysadmin
employee = instructor
visitor = student
[roles]
training_sysadmin = server=server1->db=training
instructor = server=server1->db=training->table=*->action=*
student = server=server1->db=training->table=*->action=SELECT
Using Multiple Policy Files for Different Databases
For an Impala cluster with many databases being accessed by many users and applications, it might be cumbersome to update the security policy file for each privilege change or each new database, table, or view. You can allow security to be managed separately for individual databases, by setting up a separate policy file for each database:
- Add the optional [databases] section to the main policy file.
- Add entries in the [databases] section for each database that has its own policy file.
- For each listed database, specify the HDFS path of the appropriate policy file.
For example:
[databases]
# Defines the location of the per-DB policy files for the 'customers' and 'sales' databases.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini
To enable URIs in per-DB policy files, the Java configuration option sentry.allow.uri.db.policyfile
must be set to true
. For example:
JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"
Enabling URIs in per-DB policy files introduces a security risk: the owner of a db-level policy file can grant privileges to load data from anything the impala user has read permissions for in HDFS (including data in other databases controlled by different db-level policy files).
Setting Up Schema Objects for a Secure Impala Deployment
Remember that in your role definitions, you specify privileges at the level of individual databases and tables, or all databases or all tables within a database. To simplify the structure of these rules, plan ahead of time how to name your schema objects so that data with different authorization requirements is divided into separate databases.
If you are adding security on top of an existing Impala deployment, remember that you can rename tables or
even move them between databases using the ALTER TABLE
statement. In Impala, creating new
databases is a relatively inexpensive operation, basically just creating a new directory in HDFS.
You can also plan the security scheme and set up the policy file before the actual schema objects named in
the policy file exist. Because the authorization capability is based on whitelisting, a user can only
create a new database or table if the required privilege is already in the policy file: either by listing
the exact name of the object being created, or a *
wildcard to match all the applicable
objects within the appropriate container.
Privilege Model and Object Hierarchy
Privileges can be granted on different objects in the schema. Any privilege that can be granted is associated with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the child object automatically inherits it. This is the same privilege model as Hive and other database systems such as MySQL.
The kinds of objects in the schema hierarchy are:

- Server
- URI
- Database
- Table
- Column
The server name is specified by the -server_name
option when impalad
starts. Specify the same name for all impalad nodes in the cluster.
URIs represent the HDFS paths you specify as part of statements such as CREATE EXTERNAL TABLE and LOAD DATA. Typically, you specify what look like UNIX paths, but these locations can also be prefixed with hdfs:// to make clear that they are really URIs. To set privileges for a URI, specify the name of a directory, and the privilege applies to all the files in that directory and any directories underneath it.
In Impala 2.3 and higher, you can specify privileges for individual columns.
Formerly, to specify read privileges at this level, you created a view that queried specific columns
and/or partitions from a base table, and gave SELECT
privilege on the view but not
the underlying table. Now, you can use Impala's GRANT Statement (Impala 2.0 or higher only) and
REVOKE Statement (Impala 2.0 or higher only) statements to assign and revoke privileges from specific columns
in a table.
URIs must start with hdfs:// or file://. If a URI starts with anything else, it will cause an exception and the policy file will be invalid. When defining URIs for HDFS, you must also specify the NameNode. For example:
data_read = server=server1->uri=file:///path/to/dir, \
server=server1->uri=hdfs://namenode:port/path/to/dir
Because the NameNode host and port must be specified, enable High Availability (HA) to ensure that the URI will remain constant even if the NameNode changes.
data_read = server=server1->uri=file:///path/to/dir, \
server=server1->uri=hdfs://ha-nn-uri/path/to/dir
| Privilege | Object |
|---|---|
| INSERT | DB, TABLE |
| SELECT | DB, TABLE, COLUMN |
| ALL | SERVER, TABLE, DB, URI |
Although this document refers to the ALL privilege, currently if you use the policy file mode, you do not use the actual keyword ALL in the policy file. When you code role entries in the policy file:

- To specify the ALL privilege for a server, use a role like server=server_name.
- To specify the ALL privilege for a database, use a role like server=server_name->db=database_name.
- To specify the ALL privilege for a table, use a role like server=server_name->db=database_name->table=table_name->action=*.
| Operation | Scope | Privileges | URI |
|---|---|---|---|
| EXPLAIN | TABLE; COLUMN | SELECT | |
| LOAD DATA | TABLE | INSERT | URI |
| CREATE DATABASE | SERVER | ALL | |
| DROP DATABASE | DATABASE | ALL | |
| CREATE TABLE | DATABASE | ALL | |
| DROP TABLE | TABLE | ALL | |
| DESCRIBE TABLE (output shows all columns if the user has table-level privileges or SELECT privilege on at least one table column) | TABLE | SELECT/INSERT | |
| ALTER TABLE .. ADD COLUMNS | TABLE | ALL on DATABASE | |
| ALTER TABLE .. REPLACE COLUMNS | TABLE | ALL on DATABASE | |
| ALTER TABLE .. CHANGE column | TABLE | ALL on DATABASE | |
| ALTER TABLE .. RENAME | TABLE | ALL on DATABASE | |
| ALTER TABLE .. SET TBLPROPERTIES | TABLE | ALL on DATABASE | |
| ALTER TABLE .. SET FILEFORMAT | TABLE | ALL on DATABASE | |
| ALTER TABLE .. SET LOCATION | TABLE | ALL on DATABASE | URI |
| ALTER TABLE .. ADD PARTITION | TABLE | ALL on DATABASE | |
| ALTER TABLE .. ADD PARTITION location | TABLE | ALL on DATABASE | URI |
| ALTER TABLE .. DROP PARTITION | TABLE | ALL on DATABASE | |
| ALTER TABLE .. PARTITION SET FILEFORMAT | TABLE | ALL on DATABASE | |
| ALTER TABLE .. SET SERDEPROPERTIES | TABLE | ALL on DATABASE | |
| CREATE VIEW (allowed if you have column-level SELECT access to the columns being selected) | DATABASE; SELECT on TABLE | ALL | |
| DROP VIEW | VIEW/TABLE | ALL | |
| ALTER VIEW (you need ALL privilege on the named view and the parent database, plus SELECT privilege for any tables or views referenced by the view query; once the view is created or altered by a high-privileged system administrator, it can be queried by a lower-privileged user who does not have full query privileges for the base tables) | VIEW/TABLE | ALL, SELECT | |
| CREATE EXTERNAL TABLE | DATABASE (ALL), URI (SELECT) | ALL, SELECT | |
| SELECT (you can grant the SELECT privilege on a view to give users access to specific columns of a table they do not otherwise have access to; see the Apache Sentry documentation for details on allowed column-level operations) | VIEW/TABLE; COLUMN | SELECT | |
| USE <dbName> | Any | | |
| CREATE FUNCTION | SERVER | ALL | |
| DROP FUNCTION | SERVER | ALL | |
| REFRESH <table name> or REFRESH <table name> PARTITION (<partition_spec>) | TABLE | SELECT/INSERT | |
| INVALIDATE METADATA | SERVER | ALL | |
| INVALIDATE METADATA <table name> | TABLE | SELECT/INSERT | |
| COMPUTE STATS | TABLE | ALL | |
| SHOW TABLE STATS, SHOW PARTITIONS | TABLE | SELECT/INSERT | |
| SHOW COLUMN STATS | TABLE | SELECT/INSERT | |
| SHOW FUNCTIONS | DATABASE | SELECT | |
| SHOW TABLES | No special privileges needed to issue the statement, but only shows objects you are authorized for | | |
| SHOW DATABASES, SHOW SCHEMAS | No special privileges needed to issue the statement, but only shows objects you are authorized for | | |
Debugging Failed Sentry Authorization Requests
- Add log4j.logger.org.apache.sentry=DEBUG to the log4j.properties file on each host in the cluster, in the appropriate configuration directory for each service.

Then look in the service logs for messages such as:

FilePermission server..., RequestPermission server...., result [true|false]

which indicate each evaluation Sentry makes. The FilePermission is from the policy file, while RequestPermission is the privilege required for the query. A RequestPermission iterates over all appropriate FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false, indicating "Access Denied".
The DEFAULT Database in a Secure Deployment
Because of the extra emphasis on granular access controls in a secure deployment, you should move any
important or sensitive information out of the DEFAULT
database into a named database whose
privileges are specified in the policy file. Sometimes you might need to give privileges on the
DEFAULT
database for administrative reasons; for example, as a place you can reliably
specify with a USE
statement when preparing to drop a database.
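For example, because a session cannot drop the database it is currently using, switching to the DEFAULT database first is a common pattern. A minimal sketch, with a hypothetical database name:

```sql
-- Switch to DEFAULT before dropping the (hypothetical) scratch database.
USE default;
DROP DATABASE temp_analysis_db;
```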