Securing Impala Data and Log Files
One aspect of security is to protect files from unauthorized access at the filesystem level. For example, if you store sensitive data in HDFS, you specify permissions on the associated files and directories in HDFS to restrict read and write permissions to the appropriate users and groups.
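As a rough illustration of restricting HDFS permissions, the following Python sketch builds the hdfs dfs -chown and hdfs dfs -chmod commands for a directory. The path, owner, group, and mode are hypothetical placeholders rather than values from this document, and the commands are only executed when run=True and the hdfs client is available on the PATH.

```python
import subprocess


def hdfs_restrict_commands(path, owner, group, mode="770"):
    """Return the hdfs CLI commands that limit `path` to its owner and group.

    Mode 770 is an assumption: full access for owner and group, none for
    other users, so a group member such as the impala user can still
    read and write.
    """
    return [
        ["hdfs", "dfs", "-chown", "-R", f"{owner}:{group}", path],
        ["hdfs", "dfs", "-chmod", "-R", mode, path],
    ]


def restrict_hdfs_path(path, owner, group, mode="770", run=False):
    """Print the commands; execute them only when run=True."""
    commands = hdfs_restrict_commands(path, owner, group, mode)
    for cmd in commands:
        print(" ".join(cmd))
        if run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical database directory and ownership; adjust for your cluster.
    restrict_hdfs_path("/user/hive/warehouse/finance.db", "hive", "hive")
```

By default the sketch only prints the commands, so you can review them before applying them to a real cluster.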
If you issue queries containing sensitive values in the WHERE clause, such as financial account numbers, those values are stored in Impala log files in the Linux filesystem, and you must secure those files as well. For the locations of Impala log files, see Using Impala Logging.
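One way to audit the log files is to look for any that users outside the owning user and group can read or write. The following Python sketch does this with the standard library only. The directory /var/log/impala is an assumed location used for illustration; see Using Impala Logging for the actual location on your system.

```python
import os
import stat


def find_world_accessible(log_dir):
    """Return regular files under log_dir that other users can read or write."""
    exposed = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            mode = os.lstat(path).st_mode
            if not stat.S_ISREG(mode):
                continue  # skip symlinks and special files
            if mode & (stat.S_IROTH | stat.S_IWOTH):
                exposed.append(path)
    return sorted(exposed)


if __name__ == "__main__":
    # Assumed log directory; check your own configuration.
    log_dir = "/var/log/impala"
    if os.path.isdir(log_dir):
        for path in find_world_accessible(log_dir):
            print("readable or writable by others:", path)
```

Any file the script reports is a candidate for tighter permissions, for example by removing access for other users with chmod.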
All Impala read and write operations are performed under the filesystem privileges of the impala user. The impala user must be able to read all directories and data files that you query, and write into all the directories and data files for INSERT and LOAD DATA statements. At a minimum, make sure the impala user is in the hive group so that it can access files and directories shared between Impala and Hive. See User Account Requirements for more details.
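To confirm the group membership described above, a small check such as the following Python sketch may help. It consults the account database of the host where it runs, so it reflects that host's view only; the helper name user_in_group is made up for this example.

```python
import grp
import pwd


def user_in_group(user, group):
    """Return True if `user` has `group` as a primary or supplementary group."""
    try:
        user_entry = pwd.getpwnam(user)
        group_entry = grp.getgrnam(group)
    except KeyError:
        return False  # the user or the group does not exist on this host
    is_primary = user_entry.pw_gid == group_entry.gr_gid
    is_supplementary = user in group_entry.gr_mem
    return is_primary or is_supplementary


if __name__ == "__main__":
    print("impala in hive group:", user_in_group("impala", "hive"))
```

If the check reports False on a host that runs Impala, add the impala user to the hive group using your usual account management tools.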
Setting file permissions is necessary for Impala to function correctly, but is not an effective security practice by itself:
- The way to ensure that only authorized users can submit requests for databases and tables they are allowed to access is to set up Ranger authorization, as explained in Impala Authorization. With authorization enabled, the checking of the user ID and group is done by Impala, and unauthorized access is blocked by Impala itself. The actual low-level read and write requests are still done by the impala user, so you must have appropriate file and directory permissions for that user ID.
- You must also set up Kerberos authentication, as described in Enabling Kerberos Authentication for Impala, so that users can only connect from trusted hosts. With Kerberos enabled, if someone connects a new host to the network and creates user IDs that match your privileged IDs, they will be blocked from connecting to Impala at all from that host.