

Codegen cache for low latency queries

Apache Impala is a high-performance engine, written primarily in C++, for executing low-latency SQL queries. At a high level, Impala generates a distributed query plan, admits the query once sufficient capacity is available, and finally executes the query. For a more in-depth description of these phases, please refer to Impala: A Modern, Open-Source SQL Engine for Hadoop.

Query Execution

During Distributed Execution, each fragment of the query plan is run on one or more Impala executors, with a degree of parallelism determined by the planner. A fragment is a distinct block of work that can be executed on a single node, and often comprises steps such as scanning and filtering rows from files (or other data sources), hashing that data to group or order it, and sending it to other executors via an exchange for distributed aggregation.

Code Generation

The steps taken within each fragment comprise the bulk of the work an executor does, and databases use different techniques to optimize that work. The actual operations needed will depend on the types of the specific columns being manipulated, which may be simple scalar types or complex data such as structs and arrays. At the beginning of executing each fragment, Impala leverages the LLVM project to generate machine code specific to the steps and columns in the fragment.

Code generation can dramatically speed up the operations performed on each row, but it has an upfront cost that offsets that benefit. This overhead matters for sub-second and low-second queries: a codegen time of, say, 100-250 ms is significant if the query only takes 2 seconds to finish. Typical examples are queries on Kudu tables that finish in seconds. Historically, we recommended that users either set DISABLE_CODEGEN=true or raise DISABLE_CODEGEN_ROWS_THRESHOLD, so that Impala disables codegen for very small queries.

DISABLE_CODEGEN_ROWS_THRESHOLD currently works by estimating the number of rows processed on each node and disabling codegen when the estimate falls below the threshold. There are scenarios, however, where the planner's estimate is incorrect, or where the query is complex enough that codegen would actually have helped.
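The threshold decision can be sketched as follows. This is an illustrative model, not Impala's planner code; the function name and the example threshold value are assumptions.

```python
def should_disable_codegen(estimated_rows_per_node, threshold, disable_codegen):
    """Illustrative sketch of the planner's codegen decision.

    Codegen is skipped when it is disabled globally (DISABLE_CODEGEN=true),
    or when the estimated number of rows processed per node falls below
    the DISABLE_CODEGEN_ROWS_THRESHOLD query option.
    """
    if disable_codegen:
        return True
    return estimated_rows_per_node < threshold

# With an example threshold of 50000 rows, a tiny query skips codegen:
print(should_disable_codegen(1200, 50000, False))  # True
```

Note that this is exactly where a wrong row estimate hurts: an underestimate disables codegen for a query that would have benefited from it.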

To help mitigate the cost of codegen for short-running queries that are run repeatedly, we've introduced a new codegen caching feature. With the codegen cache enabled, the code generated for a query is cached, and subsequent runs are faster because they don't need to regenerate that code.

Using Cloudera Data Warehouse 1.9.2 with Runtime 2024.0.18.0-206 on AWS EC2 r5d.4xlarge instances, we performed a TPC-DS 1 TB benchmark with 10 executors to evaluate codegen cache performance. Across the whole test suite we saw geometric mean times improve by 4.8%. Since we expect codegen cache to help more with faster queries, we isolate the queries that executed in less than 2s:

Codegen cache performance

For these queries, we see a geometric mean improvement of 22%, significantly improving the performance of low latency queries by eliminating most of the code generation time.

The Codegen Cache

Codegen function caching was added to reduce the cost of code generation when repeating queries or running substantially similar ones, by caching the results of code generation. The codegen cache in Impala works at the fragment level, meaning that it caches and reuses the machine code for specific fragments of a query.

When Impala generates code using LLVM and the codegen cache is enabled, it will store the generated objects using LLVM’s Object Caching. Impala goes through several steps during codegen:

  1. Load pre-parsed and partially optimized Impala library functions so that new code generation can reference them.
  2. Define functions representing the operations to be performed using LLVM’s intermediate representation (IR).
  3. Prune unused library functions loaded in step (1).
  4. Run LLVM’s builtin passes to optimize the IR generated through steps 1-3.
  5. Generate machine code from the optimized IR.

The most time-consuming of these are the optimization passes (step 4) and machine code generation (step 5). When using the codegen cache, Impala performs steps 1-3, then constructs a key from a serialization of the IR. It then looks up that key in the codegen cache: on a hit, the result is a machine code object ready for immediate use; on a miss, steps 4 and 5 are performed to generate machine code, which is then stored in the cache and used.
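The lookup flow above can be sketched as follows. This is a minimal model of the control flow, assuming a hash of the serialized IR as the key; the real cache stores LLVM machine-code objects, and the class and function names here are hypothetical.

```python
import hashlib

class CodegenCache:
    """Illustrative sketch of the codegen cache lookup flow."""

    def __init__(self):
        self._entries = {}

    def key_for(self, ir_text):
        # The key is derived from a serialization of the fragment's IR
        # produced by steps 1-3.
        return hashlib.sha256(ir_text.encode()).hexdigest()

    def lookup_or_compile(self, ir_text, compile_fn):
        key = self.key_for(ir_text)
        obj = self._entries.get(key)
        if obj is None:
            # Cache miss: run the expensive steps 4-5 (optimize the IR,
            # emit machine code), then store the result for future queries.
            obj = compile_fn(ir_text)
            self._entries[key] = obj
        return obj
```

A repeated query fragment produces the same IR serialization, so the second lookup returns the stored object without invoking the compile step at all.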

The codegen cache stores all objects in memory. Its capacity is determined by CODEGEN_CACHE_CAPACITY. When the cache is full, it evicts the least-recently-used (LRU) object to make space for new entries.
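A byte-bounded LRU cache of this kind can be sketched as follows. This is a minimal illustration of the eviction policy, not Impala's implementation; the class name and entry layout are assumptions.

```python
from collections import OrderedDict

class LruCodegenCache:
    """Sketch of an LRU cache bounded by total size in bytes,
    analogous to CODEGEN_CACHE_CAPACITY."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._entries = OrderedDict()  # key -> (obj, size)

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key, obj, size):
        if key in self._entries:
            self.used -= self._entries.pop(key)[1]
        # Evict least-recently-used entries until the new object fits.
        while self._entries and self.used + size > self.capacity:
            _, (_, old_size) = self._entries.popitem(last=False)
            self.used -= old_size
        if size <= self.capacity:
            self._entries[key] = (obj, size)
            self.used += size
```

Because eviction is by recency, the machine code for frequently repeated fragments tends to stay resident while one-off fragments age out.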

Example of Caching Codegen Functions

Consider the following table:

create table sales_data (product_id int, category string, sales double);

We run two similar queries sequentially:

  1. select category, sum(sales) from sales_data where category = 'a' group by category;
  2. select category, sum(sales) from sales_data where category = 'b' group by category;

After running Query 1, the query profile shows the plan as follows, with zero cached functions and a total codegen compilation time of several dozen milliseconds for each fragment.

F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
...
04:EXCHANGE [UNPARTITIONED]
...
F01:PLAN FRAGMENT [HASH(category)] hosts=1 instances=1
03:AGGREGATE [FINALIZE]
...
02:EXCHANGE [HASH(category)]
...
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
01:AGGREGATE [STREAMING]
...
00:SCAN HDFS [default.sales_data, RANDOM]
...
        Fragment F02:
          CodeGen:
            ...
            - NumCachedFunctions: 0 (0)
            ...
            - NumOptimizedFunctions: 2 (2)
            ...
            - TotalTime: 52.000ms
        Fragment F01:
          CodeGen:
             ...
             - NumCachedFunctions: 0 (0)
             ...
             - NumOptimizedFunctions: 20 (20)
             ...
             - TotalTime: 100.000ms
        Fragment F00:
          CodeGen:
             ...
             - NumCachedFunctions: 0 (0)
             ...
             - NumOptimizedFunctions: 20 (20)
             ...
             - TotalTime: 116.000ms

After running Query 2, the functions of fragments F02 and F01 are successfully loaded from the codegen cache, because these fragments are identical in both queries, greatly reducing the total codegen compilation time. Fragment F00, however, does not hit the codegen cache because the two queries use different predicates (in our case, category = 'a' vs. category = 'b'). As a result, the codegen functions in the corresponding scan nodes are treated as distinct in the current version.
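The miss on F00 follows from how the key is built: the predicate constant is embedded in the fragment's generated IR, so the two queries serialize to different IR and therefore different keys. A minimal sketch, where the IR strings are stand-ins and the hashing scheme is illustrative:

```python
import hashlib

def cache_key(ir_text):
    # Key derived from the fragment's serialized IR.
    return hashlib.sha256(ir_text.encode()).hexdigest()

# F01/F02 generate the same IR for both queries -> same key, cache hit.
agg_ir = "agg: sum(sales) group by category"
assert cache_key(agg_ir) == cache_key(agg_ir)

# F00 bakes the predicate literal into the IR -> different keys, miss.
key_a = cache_key("scan: filter category = 'a'")
key_b = cache_key("scan: filter category = 'b'")
assert key_a != key_b
```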

        Fragment F02:
          CodeGen:
            ...
            - NumCachedFunctions: 2 (2)
            ...
            - NumOptimizedFunctions: 2 (2)
            ...
            - TotalTime: 32.000ms
        Fragment F01:
          CodeGen:
            ...
            - NumCachedFunctions: 20 (20)
            ...
            - NumOptimizedFunctions: 20 (20)
            ...
            - TotalTime: 40.000ms
        Fragment F00:
          CodeGen:
            ...
            - NumCachedFunctions: 0 (0)
            ...
            - NumOptimizedFunctions: 20 (20)
            ...
            - TotalTime: 112.000ms

Note that native UDFs are not supported by the codegen cache: if a fragment contains any native UDF, the codegen for that fragment won't be cached.

Summary

The codegen cache has been supported and enabled by default since Impala 4.3. You can adjust the amount of memory reserved for it by setting the startup flag CODEGEN_CACHE_CAPACITY.

Interested in contributing? We have future work planned for codegen caching: IMPALA-13187

Reblogged with edits from Engineering@Cloudera on Medium


Healing Iceberg Tables with Impala

Apache Iceberg handles dynamically changing data at large scale. However, frequent modifications come at a cost: eventually, tables will become fragmented. This degrades the performance of read operations over time. To address this challenge, we introduced table maintenance features in Apache Impala, the high-performance, distributed DB engine for big data.

The new OPTIMIZE statement merges small data files and eliminates delete files to uphold table health. It allows rewriting the table according to the latest schema and partition layout, and also offers the flexibility of file filtering to optimize recurring maintenance jobs. Additionally, the DROP PARTITION statement allows selective partition removal based on predicates.

Discover in this session how Impala ensures high performance on top of dynamically changing data.

[slides]

Appeared in Community Over Code NA 2024


Intelligent Utilization Aware Autoscaling for Impala Virtual Compute Clusters

Sizing virtual compute clusters for diverse and complex workloads is hard. Queries often require large clusters to meet SLA requirements, which can lead to low utilization and excessive cloud spend. To solve this problem, we present intelligent autoscaling for Impala virtual warehouses. This powerful feature dynamically analyzes query execution plans and resource requirements and adjusts cluster size to follow the workload. Once cluster size limits have been set, the autoscaler works in the background to maintain the ideal cluster size and is transparent to users. Observability UIs have also been enhanced to help understand autoscaler behavior and further tune workloads.

We achieved two key goals with the design: delivering better ROI on compute cost, and making it easy for admins and users to manage Virtual Clusters.

[slides]

Appeared in Community Over Code NA 2024


Impalas living on Iceberg

Apache Impala is a horizontally scalable database engine renowned for its emphasis on query efficiency. Over recent years, the Impala team has dedicated substantial effort to support Iceberg tables. Impala can read, write, modify, and optimize Iceberg tables. Additionally, it facilitates schema and partition evolution through DDL statements.

Attendees can anticipate a comprehensive overview of all Iceberg-related features within Impala, along with insights into forthcoming developments expected this year. Given Impala's Java/C++ hybrid architecture — where the frontend, responsible for analysis and planning, is Java-based while backend executors are coded in C++ — the integration process encountered its own set of challenges. This presentation will delve into some details of this integration work, shedding light on the technical nuances.

[slides]

Appeared in Community Over Code NA 2024


This Impala not only reads, but modifies and optimizes Iceberg tables

Apache Impala is a distributed, massively parallel query engine for big data. Initially, it focused on fast query execution on top of large datasets that were ingested via long-running batch jobs. The table schema and the ingested data typically remained unchanged, and row-level modifications were impractical to say the least.

Today’s expectations for modern data warehouse engines have risen significantly. Users now want to have RDBMS-like capabilities in their data warehouses. E.g., they often need to comply with regulations like GDPR or CCPA, i.e. they need to be able to remove or update records belonging to certain individuals.

Apache Iceberg is a cutting-edge table format that delivers advanced write capabilities for large-scale data. It allows schema and partition evolution, time-travel, and the focus of this talk: row-level modifications and table maintenance features. Impala has had support for reading Iceberg tables and inserting data for a while, but the capability of deleting and updating rows only recently became available.

Frequent modifications come with a cost: eventually, the table will become full of small data files and so-called delete files. This degrades the performance of read operations over time. The new table maintenance statement in Impala, OPTIMIZE, merges small data files and eliminates delete files to keep the table healthy. To make partition-level maintenance easier, the DROP PARTITION statement allows selective partition removal based on predicates.

Join us for this session to discover how Apache Impala evolved to meet emerging requirements without compromising performance.

Appeared in https://eu.communityovercode.org/sessions/2024/this-impala-not-only-reads-but-modifies-and-optimizes-iceberg-tables/


Let’s see how fast Impala runs on Iceberg

Apache Impala is a distributed massively parallel query engine designed for high-performance querying of large-scale data. There has been a long list of new features recently around supporting Apache Iceberg tables such as reading, writing, time traveling, and so on. However, in a big data environment it is also a must to be performant. Since Impala has been designed to be fast, it has its own way of reading Iceberg tables. Other engines might simply use the Iceberg library to perform reads, while Impala has a C++ implementation itself optimized for speed.

Nowadays, even big data storage techniques have to offer the possibility not just to store data but also to alter and delete data on a row level. Apache Iceberg solves this by using delete files that live alongside the data files. It is the responsibility of the query engines to then apply the delete files on the data files when querying the data. To efficiently read the data of such tables we implemented new Iceberg-specific operators in Impala.

In this talk we will go into the implementation details and reveal what is the secret behind Impala’s great performance in general and also when reading Iceberg tables with position delete files. We will also show some measurements where we compare Impala’s performance with other open-source query engines.

By the end of this talk, you should have a high-level understanding of Impala’s and Iceberg’s architecture, the performance tricks we implemented in Impala specifically for Iceberg, and you will see how Impala competes with other engines.

Appeared in https://eu.communityovercode.org/sessions/2024/lets-see-how-fast-impala-runs-on-iceberg/


Anatomy of reading Apache Parquet files (from the Apache Impala perspective)

Reading file formats efficiently is a crucial part of big data systems - in selective scans data is often only big before hitting the first filter and becomes manageable during the rest of the processing. The talk describes this early stage of query execution in Apache Impala, from reading the bytes of Parquet files on the filesystem to applying predicates and runtime filters on individual rows.

Apache Impala is a distributed massively parallel analytic query engine written in C++ and Java. It is optimized both for object stores (S3, ABFS) and on-prem distributed file systems (HDFS, Ozone). Apache Parquet is one of the most widely used open source column-oriented file formats in Big Data.

Impala has its own Parquet scanner written in C++ instead of using existing Parquet libraries like Parquet-mr or Parquet-cpp. This allows tighter integration with I/O and memory management, enabling features like:

  • Data caching to memory and local drive
  • Execution within memory bounds
  • Efficient parallelism

These features all play an important role in giving Impala an edge in the world of Big Data query engines.

Appeared in https://eu.communityovercode.org/sessions/2024/anatomy-parquet-files/



Impala 2.5 performance overview

Impala has proven to be a high-performance analytics query engine since the beginning. Even as an initial production release in 2013, it demonstrated performance 2x faster than a traditional DBMS, and each subsequent release has continued to demonstrate the wide performance gap between Impala’s analytic-database architecture and SQL-on-Apache Hadoop alternatives. Today, we are excited to continue that track record via some important performance gains for Impala 2.5 (with more to come on the roadmap), summarized below.

Overall, compared to Impala 2.3, in Impala 2.5:

  • TPC-DS queries run on average 4.3x faster.
  • TPC-H queries run 2.2x faster on flat tables, and 1.71x faster on nested tables.

Nested Types in Impala

This document discusses nested data types in Impala, including structs, maps, and arrays. It provides an example schema using these types, describes Impala's SQL syntax extensions for querying nested data, and discusses techniques for advanced querying capabilities like correlated subqueries. The execution model materializes minimal nested structures in memory and uses new execution nodes to handle nested data types.

Presented in Impala Meetup, PA, March 24th, 2015