Viewing Lineage Information for Impala Data
Lineage is a feature that helps you track where data originated, and how that data propagates through the system by way of SQL statements such as SELECT, INSERT, and CREATE TABLE AS SELECT.
    
This type of tracking is important in high-security configurations, especially in highly regulated industries such as healthcare, pharmaceuticals, financial services, and intelligence. For sensitive data of this kind, it is important to know every place in the system that contains the data or other data derived from it, to verify who has accessed the data, and to be able to double-check that the data used to make a decision was processed correctly and not tampered with.
Column Lineage
Column lineage tracks information in fine detail, at the level of particular columns rather than entire tables. For example, if you have a table with information derived from web logs, you might copy that data into other tables as part of the ETL process. The ETL operations might involve transformations through expressions and function calls, and rearranging the columns into more or fewer tables (normalizing or denormalizing the data). Then for reporting, you might issue queries against multiple tables and views. In this example, column lineage helps you determine that data that entered the system as RAW_LOGS.FIELD1 was then turned into WEBSITE_REPORTS.IP_ADDRESS through an INSERT ... SELECT statement. Or, conversely, you could start with a reporting query against a view, and trace the origin of the data in a field such as TOP_10_VISITORS.USER_ID back to the underlying table and even further back to the point where the data was first loaded into Impala.
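The forward and backward tracing described above can be pictured as a walk over a directed graph of columns. The following is a minimal Python sketch of that idea, not Impala code. Only the RAW_LOGS.FIELD1 to WEBSITE_REPORTS.IP_ADDRESS edge comes from the example in the text; the other edges, and the names EDGES, build_index, and trace, are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical column-level lineage edges: (source_column, target_column).
# The first edge mirrors the example in the text; the other two are invented
# so that the backward trace has something to follow.
EDGES = [
    ("RAW_LOGS.FIELD1", "WEBSITE_REPORTS.IP_ADDRESS"),
    ("RAW_LOGS.FIELD2", "WEBSITE_REPORTS.USER_ID"),
    ("WEBSITE_REPORTS.USER_ID", "TOP_10_VISITORS.USER_ID"),
]


def build_index(edges):
    """Return (parents, children) adjacency maps for the edge list."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
        children[source].add(target)
    return parents, children


def trace(column, index):
    """Return every column reachable from `column` by following `index`."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for neighbor in index.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


parents, children = build_index(EDGES)

# Backward: where did the reporting column come from?
upstream = trace("TOP_10_VISITORS.USER_ID", parents)

# Forward: where did the raw field end up?
downstream = trace("RAW_LOGS.FIELD1", children)
```

Following the parents map answers "where did this value originate?", and following the children map answers "which places now contain data derived from this column?".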
      
When you have tables where you need to track or control access to sensitive information at the column level, see Impala Authorization for how to implement column-level security. You set up authorization using the Ranger framework, create views that refer to specific sets of columns, and then assign authorization privileges to those views rather than the underlying tables.
Lineage Data for Impala
The lineage feature is enabled by default. When lineage logging is enabled, Impala computes the serialized column lineage graph for each query and stores it in a specialized log file in JSON format. Impala records a query in the lineage log if it completes successfully or fails due to an authorization error. Write operations such as INSERT and CREATE TABLE AS SELECT are recorded only if they complete successfully. The lineage feature therefore tracks data that was accessed by successful queries, and data that unsuccessful queries attempted to access before being blocked by an authorization failure. These queries represent data that really was accessed, or attempted access that could represent malicious activity.
      
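Because the lineage graph is stored as JSON, it can be inspected with ordinary scripting tools. The Python sketch below rests on assumptions the text does not confirm: that each non-empty line of a log file is one standalone JSON record, and that a record carries a "vertices" list (each entry with "id" and "vertexId") and an "edges" list (each entry with "sources" and "targets" lists of vertex ids). Check these field names against an actual log file before relying on them. The function names are hypothetical.

```python
import json
from pathlib import Path


def read_lineage_records(log_dir):
    """Yield one dict per lineage record found in the files under log_dir.

    Assumption: every non-empty line is a standalone JSON object.
    """
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def column_edges(record):
    """Resolve a record's edges into (source_name, target_name) pairs.

    Assumption: the "vertices"/"edges" layout described in the lead-in.
    """
    names = {v["id"]: v["vertexId"] for v in record.get("vertices", [])}
    pairs = []
    for edge in record.get("edges", []):
        for source in edge.get("sources", []):
            for target in edge.get("targets", []):
                pairs.append((names[source], names[target]))
    return pairs
```

A caller would pass the directory configured for the lineage log to read_lineage_records and feed each record to column_edges to list the column-to-column dependencies of that query.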
Impala does not write a query to the lineage log if it fails due to a syntax error, or if it fails or is cancelled before it reaches the stage of requesting rows from the result set.
To enable or disable this feature, set or remove the -lineage_event_log_dir configuration option for the impalad daemon.
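The rules in this section about which statements reach the lineage log can be condensed into a small decision function. This is only a restatement of the prose as a Python sketch; Outcome and is_recorded are hypothetical names and are not part of Impala.

```python
from enum import Enum


class Outcome(Enum):
    """How a statement ended (hypothetical labels for this sketch)."""
    SUCCESS = "success"
    AUTHORIZATION_ERROR = "authorization_error"
    SYNTAX_ERROR = "syntax_error"
    # Failed or was cancelled before requesting rows from the result set.
    ENDED_BEFORE_FETCH = "ended_before_fetch"


def is_recorded(is_write, outcome):
    """Return True if the statement would appear in the lineage log."""
    if outcome is Outcome.SUCCESS:
        # Successful reads and writes are always recorded.
        return True
    if outcome is Outcome.AUTHORIZATION_ERROR:
        # Blocked reads are recorded; writes are recorded only on success.
        return not is_write
    # Syntax errors and statements that end before fetching rows are skipped.
    return False
```

For example, a SELECT blocked by an authorization failure is recorded, while an INSERT blocked the same way is not.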