To monitor how Impala data is being used within your organization, ensure that your Impala authorization and authentication policies are effective. To detect attempts at intrusion or unauthorized access to Impala data, you can use the auditing feature in Impala 1.2.1 and higher:
To enable auditing, include the option --audit_event_log_dir=directory_path in your impalad startup options. The log directory must be a local directory on the server, not an HDFS directory.
To control how many queries are recorded in each audit event log file, include the option --max_audit_event_log_file_size=number_of_queries in the impalad startup options.
To limit how many audit event log files are kept on each host, include the option --max_audit_event_log_files=number_of_log_files in the impalad startup options. Once the limit is reached, older files are rotated out using the same mechanism as for other Impala log files. The default value for this setting is 0, representing an unlimited number of audit event log files.
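As a rough illustration of how the three startup options fit together, the following Python sketch assembles them into an argument list. The helper name audit_flags, the directory /var/log/impala/audit, and the numeric values are hypothetical and chosen only for the example; only the flag names come from the text above.

```python
def audit_flags(log_dir, max_queries_per_file=None, max_files=None):
    """Build the audit-related impalad startup options as a list of strings.

    log_dir must be a local directory on the host, not an HDFS path.
    The two limits are optional; when omitted, the flag is left out so
    that impalad uses its own default.
    """
    flags = [f"--audit_event_log_dir={log_dir}"]
    if max_queries_per_file is not None:
        flags.append(f"--max_audit_event_log_file_size={max_queries_per_file}")
    if max_files is not None:
        flags.append(f"--max_audit_event_log_files={max_files}")
    return flags


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(audit_flags("/var/log/impala/audit", 1000, 20)))
```

How these options reach impalad (a defaults file, a management console, or the command line) depends on how your cluster is deployed.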
The auditing feature only imposes performance overhead while auditing is enabled.
Because any Impala host can process a query, enable auditing on all hosts where the impalad daemon runs. Each host stores its own log files, in a directory in the local filesystem. The log data is periodically flushed to disk (through an fsync() system call) to avoid loss of audit data in case of a crash.
The runtime overhead of auditing applies to whichever host serves as the coordinator for the query, that is, the host you connect to when you issue the query. This might be the same host for all queries, or different applications or users might connect to and issue queries through different hosts.
To avoid excessive I/O overhead on busy coordinator hosts, Impala syncs the audit log data (using the fsync() system call) periodically rather than after every query. Currently, the fsync() calls are issued at a fixed interval, every 5 seconds.
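The trade-off described above, syncing at an interval instead of after every record, can be sketched in Python. This is not Impala's implementation: the class PeriodicSyncLog and its parameters are hypothetical, and as a simplification it checks the elapsed time on each append rather than running a background timer.

```python
import os
import time


class PeriodicSyncLog:
    """Append-only log that calls fsync() at most once per interval."""

    def __init__(self, path, sync_interval=5.0, clock=time.monotonic):
        self._file = open(path, "a", encoding="utf-8")
        self._interval = sync_interval
        self._clock = clock  # injectable so the behavior can be tested
        self._last_sync = clock()
        self.sync_count = 0

    def append(self, line):
        # Writing is cheap: the data goes to a buffer, not straight to disk.
        self._file.write(line + "\n")
        # Pay the fsync() cost only when the interval has elapsed.
        if self._clock() - self._last_sync >= self._interval:
            self.sync()

    def sync(self):
        self._file.flush()              # push Python's buffer to the OS
        os.fsync(self._file.fileno())   # ask the OS to write it to disk
        self._last_sync = self._clock()
        self.sync_count += 1

    def close(self):
        self.sync()
        self._file.close()
```

With this design, records written since the last sync can be lost in a crash, which is the cost accepted in exchange for lower I/O overhead.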
By default, Impala avoids losing any audit log data in the case of an error during a logging operation (such as a disk full error), by immediately shutting down impalad on the host where the auditing problem occurred. You can override this setting by specifying the option --abort_on_failed_audit_event=false in the impalad startup options.
The audit log files represent the query information in JSON format, one query per line. Typically, rather than looking at the log files themselves, you should use cluster-management software to consolidate the log data from all Impala hosts and filter and visualize the results in useful ways. (If you do examine the raw log data, you might run the files through a JSON pretty-printer first.)
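If you do want to inspect a raw log file, a minimal Python sketch such as the following pretty-prints each line. It relies only on the one-JSON-document-per-line layout described above; the function name pretty_print_audit_lines and the file path in the usage comment are hypothetical.

```python
import json


def pretty_print_audit_lines(lines):
    """Return one indented JSON string per non-empty input line."""
    formatted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        formatted.append(json.dumps(record, indent=2, sort_keys=True))
    return formatted


# Usage with a hypothetical file name:
# with open("/var/log/impala/audit/some_audit_log_file") as f:
#     for block in pretty_print_audit_lines(f):
#         print(block)
```

A file object can be passed directly because iterating over it yields one line at a time.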
All the information about schema objects accessed by the query is encoded in a single nested record on the same line. For example, the audit log for an INSERT ... SELECT statement records that a select operation occurs on the source table and an insert operation occurs on the destination table. The audit log for a query against a view records the base table accessed by the view, or multiple base tables in the case of a view that includes a join query. Every Impala operation that corresponds to a SQL statement is recorded in the audit logs, whether the operation succeeds or fails. Impala records more information for a successful operation than for a failed one, because an unauthorized query is stopped immediately, before all the query planning is completed.
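To make the nested record concrete, here is a sketch that groups the objects touched by an INSERT ... SELECT statement by the operation performed on them. The sample record is invented for illustration, and the field names catalog_objects, name, object_type, privilege, and sql_statement are assumptions about the record layout; check them against your own audit log files before relying on them.

```python
import json

# Hypothetical audit record for illustration; field names are assumptions.
SAMPLE_LINE = json.dumps({
    "sql_statement": "INSERT INTO db.sales_copy SELECT * FROM db.sales",
    "catalog_objects": [
        {"name": "db.sales", "object_type": "TABLE", "privilege": "SELECT"},
        {"name": "db.sales_copy", "object_type": "TABLE", "privilege": "INSERT"},
    ],
})


def objects_by_privilege(line, objects_key="catalog_objects"):
    """Map each operation type to the names of the objects it touched."""
    record = json.loads(line)
    grouped = {}
    for obj in record.get(objects_key, []):
        grouped.setdefault(obj["privilege"], []).append(obj["name"])
    return grouped


if __name__ == "__main__":
    print(objects_by_privilege(SAMPLE_LINE))
```

A record with no nested object list produces an empty mapping here rather than raising an error.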
The information logged for each query includes:
How each object is being used (SELECT, INSERT, CREATE, and so on)
The following types of SQL operations are recorded in the audit log:
The audit log does not contain entries for queries that could not be parsed and analyzed. For example, a query that fails due to a syntax error is not recorded in the audit log.
The audit log does not contain queries that fail due to a reference to a table that does not exist.
Certain statements in the impala-shell interpreter, such as CONNECT, SUMMARY, PROFILE, SET, and QUIT, do not correspond to actual SQL queries, and these statements are not recorded in the audit log.