Viewing Lineage Information for Impala Data
Lineage is a feature that helps you track where data originated and how it
propagates through the system via SQL statements such as SELECT, INSERT, and
CREATE TABLE AS SELECT.
This type of tracking is important in high-security configurations, especially in highly regulated industries such as healthcare, pharmaceuticals, financial services, and intelligence. For this kind of sensitive data, it is important to know all the places in the system that contain that data or other data derived from it; to verify who has accessed that data; and to be able to double-check that the data used to make a decision was processed correctly and not tampered with.
Column Lineage
Column lineage tracks information in fine detail, at the level of particular columns rather than entire tables.
For example, if you have a table with information derived from web logs, you might copy that data into
other tables as part of the ETL process. The ETL operations might involve transformations through
expressions and function calls, and rearranging the columns into more or fewer tables
(normalizing or denormalizing the data). Then for reporting, you might issue
queries against multiple tables and views. In this example, column lineage helps you determine that data
that entered the system as RAW_LOGS.FIELD1 was then turned into WEBSITE_REPORTS.IP_ADDRESS through an
INSERT ... SELECT statement. Or, conversely, you could start with a reporting query against a view, and
trace the origin of the data in a field such as TOP_10_VISITORS.USER_ID back to the underlying table and
even further back to the point where the data was first loaded into Impala.
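The backward trace described above can be pictured as a walk over a graph of column-to-column edges. The following Python sketch is only an illustration of that idea, not how Impala implements lineage. The edge list is hand-written, and the columns RAW_LOGS.FIELD2 and WEBSITE_REPORTS.USER_ID are hypothetical names added to complete the chain.

```python
from collections import defaultdict

# Hypothetical lineage edges (source column -> derived column).
# RAW_LOGS.FIELD2 and WEBSITE_REPORTS.USER_ID are invented for illustration.
EDGES = [
    ("RAW_LOGS.FIELD1", "WEBSITE_REPORTS.IP_ADDRESS"),
    ("RAW_LOGS.FIELD2", "WEBSITE_REPORTS.USER_ID"),
    ("WEBSITE_REPORTS.USER_ID", "TOP_10_VISITORS.USER_ID"),
]


def build_parent_index(edges):
    """Map each derived column to the set of columns it was computed from."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    return parents


def trace_origins(column, edges):
    """Return the columns with no recorded source that feed into `column`."""
    parents = build_parent_index(edges)
    origins, seen, stack = set(), set(), [column]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting a column
        seen.add(current)
        sources = parents.get(current)
        if not sources:
            origins.add(current)  # nothing upstream: this is an origin
        else:
            stack.extend(sources)
    return origins
```

Calling `trace_origins("TOP_10_VISITORS.USER_ID", EDGES)` walks from the view column through the reporting table back to the raw log column it came from.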
When you have tables where you need to track or control access to sensitive information at the column level, see Impala Authorization for how to implement column-level security. You set up authorization using the Ranger framework, create views that refer to specific sets of columns, and then assign authorization privileges to those views rather than the underlying tables.
Lineage Data for Impala
The lineage feature is enabled by default. When lineage logging is enabled, the serialized column lineage graph is computed for each query and stored in a specialized log file in JSON format.
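Because the lineage graph is stored as JSON, it can be post-processed with ordinary tooling. The Python sketch below shows one way to turn a single record into source and target column pairs. The record layout here (the `queryText`, `vertices`, `edges`, `sources`, `targets`, `vertexId`, and `edgeType` fields) is an assumption made for illustration, and the sample record is hand-written, so check the field names against the log files your own Impala version produces.

```python
import json

# Hand-written sample record. The field names and overall layout are
# assumptions for illustration, not a specification of Impala's log format.
SAMPLE_RECORD = json.dumps({
    "queryText": "INSERT INTO website_reports SELECT field1 FROM raw_logs",
    "vertices": [
        {"id": 0, "vertexId": "default.website_reports.ip_address"},
        {"id": 1, "vertexId": "default.raw_logs.field1"},
    ],
    "edges": [{"sources": [1], "targets": [0], "edgeType": "PROJECTION"}],
})


def column_edges(record_json):
    """Return (source_column, target_column) pairs from one lineage record."""
    record = json.loads(record_json)
    # Resolve numeric vertex ids to readable column names.
    names = {v["id"]: v["vertexId"] for v in record.get("vertices", [])}
    pairs = []
    for edge in record.get("edges", []):
        for src in edge.get("sources", []):
            for dst in edge.get("targets", []):
                pairs.append((names[src], names[dst]))
    return pairs
```

Under these assumptions, `column_edges(SAMPLE_RECORD)` yields one pair linking the raw log column to the report column.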
Impala records queries in the lineage log if they complete successfully, or fail due to authorization
errors. For write operations such as INSERT and CREATE TABLE AS SELECT, the statement is recorded in the
lineage log only if it completes successfully. Therefore, the lineage feature tracks data that was
accessed by successful queries, or that unsuccessful queries attempted to access before being blocked by
an authorization failure. These kinds of queries represent data that really was accessed, or where the
attempted access could represent malicious activity.
Impala does not record queries in the lineage log if they fail due to syntax errors, or if they fail or are cancelled before they reach the stage of requesting rows from the result set.
To enable or disable this feature, set or remove the -lineage_event_log_dir configuration option for the
impalad daemon.