Security Guidelines for Impala
The following are the major steps to harden a cluster running Impala against accidents and mistakes, or malicious attackers trying to access sensitive data:
- Secure the root account. The root user can tamper with the impalad daemon, read and write the data files in HDFS, log into other user accounts, and access other system services that are beyond the control of Impala.
- Restrict membership in the sudoers list (in the /etc/sudoers file). Users who can run the sudo command can do many of the same things as the root user.
- Ensure the Hadoop ownership and permissions for Impala data files are restricted.
- Ensure the Hadoop ownership and permissions for Impala log files are restricted.
- Ensure that the Impala web UI (available by default on port 25000 on each Impala node) is password-protected. See Impala Web User Interface for Debugging for details.
- Create a policy file that specifies which Impala privileges are available to users in particular Hadoop groups (which by default map to Linux OS groups). Create the associated Linux groups using the groupadd command if necessary.
- The Impala authorization feature makes use of the HDFS file ownership and permissions mechanism; for background information, see the HDFS Permissions Guide. Set up users and assign them to groups at the OS level, corresponding to the different categories of users with different access levels for various databases, tables, and HDFS locations (URIs). Create the associated Linux users using the useradd command if necessary, and add them to the appropriate groups with the usermod command.
- Design your databases, tables, and views so that policy rules can stay simple and consistent. For example, if all tables related to an application are inside a single database, you can assign privileges for that database and use the * wildcard for the table name. If you are creating views with different privileges than the underlying base tables, you might put the views in a separate database so that you can use the * wildcard for the database containing the base tables, while specifying the precise names of the individual views. (For specifying table or database names, you either specify the exact name or * to mean all the databases on a server, or all the tables and views in a database.)
- Enable authorization by running the impalad daemons with the -server_name and -authorization_policy_file options on all nodes. (The authorization feature does not apply to the statestored daemon, which has no access to schema objects or data files.)
- Set up authentication using Kerberos, to make sure users really are who they say they are.
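
As a rough illustration of the ownership and permissions steps above, the following sketch lists files under a local log directory that grant any access to users outside the owning user and group. The directory /var/log/impala and the function name list_world_accessible are assumptions made for this example, not values taken from this guide; adjust them for your installation. For data files stored in HDFS, the analogous checks would use the hdfs dfs -ls, -chown, and -chmod commands instead of find.

```shell
#!/bin/sh
# Sketch only: report entries that are readable, writable, or executable
# by "other". LOG_DIR is an assumed location; substitute the directory
# your daemons actually log to.
LOG_DIR="${LOG_DIR:-/var/log/impala}"

# Print every non-symlink entry under the given directory that grants
# at least one permission bit to users outside the owner and group.
list_world_accessible() {
    find "$1" ! -type l \( -perm -004 -o -perm -002 -o -perm -001 \) -print
}

# Scan only when the directory exists, so the sketch is safe to run anywhere.
if [ -d "$LOG_DIR" ]; then
    list_world_accessible "$LOG_DIR"
fi
```

Any path this prints is a candidate for tightening with chmod o-rwx, after confirming that no legitimate process depends on that access.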
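
The group, policy file, and startup option steps above might fit together roughly as in the following sketch. Every name in it (the analysts group, the user alice, the sales database, the role analyst_role, the server name server1, and the file location) is hypothetical. The rule syntax reflects the policy file format as commonly documented for Impala authorization, so verify it against the documentation for your release. Privileged commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: all group, user, database, role, and server names are hypothetical.
POLICY_FILE="${POLICY_FILE:-${TMPDIR:-/tmp}/auth-policy.ini}"
SERVER_NAME="server1"

# Print privileged commands instead of running them.
# Set DRY_RUN=0 (and run as root) to execute them for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create an OS-level group and user, then add the user to the group.
run groupadd analysts
run useradd alice
run usermod -a -G analysts alice

# Write a policy file: the [groups] section maps a group to a role, and the
# [roles] section grants that role SELECT on every table and view in one
# database by using the * wildcard for the table name.
cat > "$POLICY_FILE" <<EOF
[groups]
analysts = analyst_role

[roles]
analyst_role = server=$SERVER_NAME->db=sales->table=*->action=SELECT
EOF

# Show the authorization options each impalad would be started with.
# The path impalad needs may differ from where this sketch writes the file.
echo "impalad -server_name=$SERVER_NAME -authorization_policy_file=$POLICY_FILE"
```

Keeping all of an application's tables in one database is what lets a single wildcard rule like the one above cover them.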