The following are the major steps to harden a cluster running Impala against accidents and mistakes, or malicious attackers trying to access sensitive data:
Secure the root account. The root user can tamper with the impalad daemon, read and write the data files in HDFS, log into other user accounts, and access other system services that are beyond the control of Impala.
Restrict membership in the sudoers list (in the /etc/sudoers file). The users who can run the sudo command can do many of the same things as the root user.
Ensure the Hadoop ownership and permissions for Impala data files are restricted.
Ensure the Hadoop ownership and permissions for Impala log files are restricted.
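One way to audit the two permission items above is to scan a directory listing for entries that grant access to the "other" class. The following is a minimal sketch, assuming the listing text comes from a command such as hdfs dfs -ls, whose first column is a symbolic mode like drwxrwx--- and whose last column is the path. The function names and the sample paths are hypothetical, and paths containing spaces are not handled.

```python
from typing import List


def world_accessible(permission: str) -> bool:
    """Return True if the 'other' class has read, write, or execute access.

    `permission` is a symbolic mode such as 'drwxrwx---'. A trailing
    marker (for example '+' when ACLs are present) is ignored.
    """
    if len(permission) < 10:
        raise ValueError(f"unexpected permission string: {permission!r}")
    other = permission[7:10]
    # 't' in the last position means sticky bit plus execute for others.
    return other[0] == "r" or other[1] == "w" or other[2] in ("x", "t")


def flag_open_entries(listing: str) -> List[str]:
    """Return the paths in a listing whose mode is open to all users."""
    flagged = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the "Found N items" header.
        if len(fields) < 8 or fields[0] == "Found":
            continue
        if world_accessible(fields[0]):
            flagged.append(fields[-1])
    return flagged


# Hypothetical listing, for illustration only.
SAMPLE_LISTING = """Found 2 items
drwxrwx---   - impala impala          0 2013-06-01 10:00 /user/hive/warehouse/sales.db
drwxr-xr-x   - impala impala          0 2013-06-01 10:05 /user/hive/warehouse/logs_raw
"""

if __name__ == "__main__":
    for path in flag_open_entries(SAMPLE_LISTING):
        print("world-accessible:", path)
```

Any path the sketch reports is a candidate for tightening its mode or ownership; whether read access for other users is acceptable depends on the site's policy.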
Ensure that the Impala web UI (available by default on port 25000 on each Impala node) is password-protected. See Impala Web User Interface for Debugging for details.
Create a policy file that specifies which Impala privileges are available to users in particular Hadoop groups (which by default map to Linux OS groups). Create the associated Linux groups using the groupadd command if necessary.
The Impala authorization feature makes use of the HDFS file ownership and permissions mechanism; for background information, see the HDFS Permissions Guide. Set up users and assign them to groups at the OS level, corresponding to the different categories of users with different access levels for various databases, tables, and HDFS locations (URIs). Create the associated Linux users using the useradd command if necessary, and add them to the appropriate groups with the usermod command.
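The OS-level setup described above can be planned as a list of groupadd, useradd, and usermod invocations. The following is a minimal sketch that only builds and prints the commands rather than running them; the group and user names are hypothetical, and the helper function is not part of Impala. It assumes none of the groups or users exist yet, so in practice you would skip the commands for accounts that are already present.

```python
import shlex
from typing import Dict, List


def account_commands(groups: Dict[str, List[str]]) -> List[List[str]]:
    """Build the commands for a {group: [users]} mapping, in execution order."""
    commands: List[List[str]] = []
    seen_users = set()
    for group, users in groups.items():
        commands.append(["groupadd", group])
        for user in users:
            # Create each user once, even if it belongs to several groups.
            if user not in seen_users:
                commands.append(["useradd", user])
                seen_users.add(user)
            # -a appends to the user's supplementary groups instead of
            # replacing them; -G names the group to add.
            commands.append(["usermod", "-a", "-G", group, user])
    return commands


if __name__ == "__main__":
    # Hypothetical categories of users with different access levels.
    plan = account_commands({"analysts": ["alice", "bob"], "etl": ["bob"]})
    for argv in plan:
        print(shlex.join(argv))
```

Reviewing the printed plan before running it as a privileged user keeps the step consistent with the earlier advice to limit what is done through root and sudo.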
Structure your databases, tables, and views so that policy rules can stay simple and consistent. For example, if all tables related to an application are inside a single database, you can assign privileges for that database and use the * wildcard for the table name. If you are creating views with different privileges than the underlying base tables, you might put the views in a separate database so that you can use the * wildcard for the database containing the base tables, while specifying the precise names of the individual views. (For a table or database name, you specify either the exact name, or * to mean all the databases on a server or all the tables and views in a database.)
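The naming rule above (an exact name, or * for everything at that level) can be illustrated with a small matcher. This is a sketch of the semantics described in the text, not Impala's implementation and not the policy file syntax; the rule tuples, database names, and view names are hypothetical.

```python
from typing import Iterable, Tuple

Rule = Tuple[str, str]  # (database pattern, table-or-view pattern)


def name_matches(pattern: str, name: str) -> bool:
    """A pattern is either an exact name or '*', which matches any name."""
    return pattern == "*" or pattern == name


def covered(rules: Iterable[Rule], database: str, table: str) -> bool:
    """Return True if any rule applies to database.table."""
    return any(
        name_matches(db_pattern, database) and name_matches(tbl_pattern, table)
        for db_pattern, tbl_pattern in rules
    )


# One database per application: a single wildcard rule covers every table.
app_rules = [("sales_app", "*")]

# Views kept in their own database: each view is named precisely, while the
# base tables would be covered by a separate ("sales_base", "*") rule that is
# given to a different group.
analyst_rules = [("sales_views", "orders_summary"), ("sales_views", "top_customers")]

if __name__ == "__main__":
    print(covered(app_rules, "sales_app", "orders"))
    print(covered(analyst_rules, "sales_views", "orders_summary"))
    print(covered(analyst_rules, "sales_base", "orders"))
```

Keeping base tables and views in separate databases is what lets the analyst rules stay a short list of exact names while the base tables remain out of reach.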
Enable authorization by running the impalad daemons with the -server_name and -authorization_policy_file options on all nodes. (The authorization feature does not apply to the statestored daemon, which has no access to schema objects or data files.)
Set up authentication using Kerberos, to make sure users really are who they say they are.