Viewing Lineage Information for Impala Data

Lineage is a feature that helps you track where data originated, and how data propagates through the system through SQL statements such as SELECT, INSERT, and CREATE TABLE AS SELECT.

This type of tracking is important in high-security configurations, especially in highly regulated industries such as healthcare, pharmaceuticals, financial services and intelligence. For such kinds of sensitive data, it is important to know all the places in the system that contain that data or other data derived from it; to verify who has accessed that data; and to be able to doublecheck that the data used to make a decision was processed correctly and not tampered with.

Column Lineage

Column lineage tracks information in fine detail, at the level of particular columns rather than entire tables.

For example, if you have a table with information derived from web logs, you might copy that data into other tables as part of the ETL process. The ETL operations might involve transformations through expressions and function calls, and rearranging the columns into more or fewer tables (normalizing or denormalizing the data). Then for reporting, you might issue queries against multiple tables and views. In this example, column lineage helps you determine that data that entered the system as RAW_LOGS.FIELD1 was then turned into WEBSITE_REPORTS.IP_ADDRESS through an INSERT ... SELECT statement. Or, conversely, you could start with a reporting query against a view, and trace the origin of the data in a field such as TOP_10_VISITORS.USER_ID back to the underlying table and even further back to the point where the data was first loaded into Impala.

When you have tables where you need to track or control access to sensitive information at the column level, see Impala Authorization for how to implement column-level security. You set up authorization using the Ranger framework, create views that refer to specific sets of columns, and then assign authorization privileges to those views rather than the underlying tables.

Lineage Data for Impala

The lineage feature is enabled by default. When lineage logging is enabled, the serialized column lineage graph is computed for each query and stored in a specialized log file in JSON format.

Impala records queries in the lineage log if they complete successfully, or fail due to authorization errors. For write operations such as INSERT and CREATE TABLE AS SELECT, the statement is recorded in the lineage log only if it successfully completes. Therefore, the lineage feature tracks data that was accessed by successful queries, or that was attempted to be accessed by unsuccessful queries that were blocked due to authorization failure. These kinds of queries represent data that really was accessed, or where the attempted access could represent malicious activity.

Impala does not record in the lineage log queries that fail due to syntax errors or that fail or are cancelled before they reach the stage of requesting rows from the result set.

To enable or disable this feature, set or remove the -lineage_event_log_dir configuration option for the impalad daemon.