This Impala not only reads, but modifies and optimizes Iceberg tables

Apache Impala is a distributed, massively parallel query engine for big data. Initially, it focused on fast query execution on top of large datasets that were ingested via long-running batch jobs. The table schema and the ingested data typically remained unchanged, and row-level modifications were impractical to say the least.

Today’s expectations for modern data warehouse engines have risen significantly. Users now want RDBMS-like capabilities in their data warehouses. For example, they often need to comply with regulations like GDPR or CCPA, which means they must be able to remove or update records belonging to certain individuals.

Apache Iceberg is a cutting-edge table format that delivers advanced write capabilities for large-scale data. It allows schema and partition evolution, time travel, and, the focus of this talk, row-level modifications and table maintenance features. Impala has supported reading Iceberg tables and inserting data for a while, but the ability to delete and update rows only recently became available.
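Row-level modifications in Impala use standard SQL statements. A minimal sketch, with a made-up table name and schema for illustration:

```sql
-- Illustrative Iceberg table (names and columns are hypothetical).
CREATE TABLE customers (id BIGINT, email STRING, country STRING)
STORED BY ICEBERG;

-- Remove the records of a specific individual, e.g. for a GDPR request.
DELETE FROM customers WHERE id = 42;

-- Correct a record in place.
UPDATE customers SET country = 'DE' WHERE id = 42;
```

Rather than rewriting whole data files, these statements record the affected rows in Iceberg delete files, which readers apply on top of the data.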

Frequent modifications come with a cost: eventually, the table fills up with small data files and so-called delete files, which degrades the performance of read operations over time. The new table maintenance statement in Impala, OPTIMIZE, merges small data files and eliminates delete files to keep the table healthy. To make partition-level maintenance easier, the DROP PARTITION statement allows selective partition removal based on predicates.
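A sketch of the two maintenance statements (table names and the partitioning column are made up for illustration):

```sql
-- Compact small data files and merge delete files into the data.
OPTIMIZE TABLE customers;

-- Drop every partition matching a predicate, assuming the table
-- is partitioned by a date column.
ALTER TABLE sales DROP PARTITION (sale_date < '2020-01-01');
```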

Join us for this session to discover how Apache Impala evolved to meet emerging requirements without compromising performance.

Appeared in https://eu.communityovercode.org/sessions/2024/this-impala-not-only-reads-but-modifies-and-optimizes-iceberg-tables/


Let’s see how fast Impala runs on Iceberg

Apache Impala is a distributed, massively parallel query engine designed for high-performance querying of large-scale data. There has been a long list of new features recently around supporting Apache Iceberg tables, such as reading, writing, time traveling, and so on. However, in a big data environment it is also a must to be performant. Since Impala has been designed to be fast, it has its own way of reading Iceberg tables: while other engines might simply use the Iceberg library to perform reads, Impala has its own C++ implementation optimized for speed.

Nowadays, big data storage technologies must offer the ability not just to store data, but also to alter and delete it at the row level. Apache Iceberg solves this by using delete files that live alongside the data files. It is then the responsibility of the query engines to apply the delete files to the data files when querying the data. To efficiently read such tables, we implemented new Iceberg-specific operators in Impala.
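Conceptually, applying Iceberg position delete files amounts to an anti-join between the data rows and the (file path, row position) pairs stored in the delete files. A sketch in SQL, with hypothetical relation names standing in for the file contents:

```sql
-- data_rows and position_deletes are illustrative stand-ins for the
-- contents of Iceberg data files and position delete files.
SELECT d.*
FROM data_rows d
LEFT ANTI JOIN position_deletes del
  ON  d.file_path    = del.file_path
  AND d.row_position = del.pos;
```

The Iceberg-specific operators mentioned above are designed to perform this step efficiently.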

In this talk we will go into the implementation details and reveal the secret behind Impala’s great performance in general, and in particular when reading Iceberg tables with position delete files. We will also show some measurements comparing Impala’s performance with other open-source query engines.

By the end of this talk, you will have a high-level understanding of Impala’s and Iceberg’s architecture and of the performance tricks we implemented in Impala specifically for Iceberg, and you will see how Impala competes with other engines.

Appeared in https://eu.communityovercode.org/sessions/2024/lets-see-how-fast-impala-runs-on-iceberg/


Anatomy of reading Apache Parquet files (from the Apache Impala perspective)

Reading file formats efficiently is a crucial part of big data systems: in selective scans, data is often only big before hitting the first filter, and becomes manageable during the rest of the processing. The talk describes this early stage of query execution in Apache Impala, from reading the bytes of Parquet files on the filesystem to applying predicates and runtime filters on individual rows.

Apache Impala is a distributed, massively parallel analytic query engine written in C++ and Java. It is optimized both for object stores (S3, ABFS) and on-prem distributed file systems (HDFS, Ozone). Apache Parquet is one of the most widely used open source column-oriented file formats in Big Data.

Impala has its own Parquet scanner written in C++ instead of using existing Parquet libraries like Parquet-mr or Parquet-cpp. This allows tighter integration with IO and memory management, enabling features like:

  • Data caching to memory and local drive
  • Execution within memory bounds
  • Efficient parallelism

These features all play an important role in giving Impala an edge in the world of Big Data query engines.

Appeared in https://eu.communityovercode.org/sessions/2024/anatomy-parquet-files/



Impala 2.5 performance overview

Impala has proven to be a high-performance analytics query engine since the beginning. Even as an initial production release in 2013, it demonstrated performance 2x faster than a traditional DBMS, and each subsequent release has continued to demonstrate the wide performance gap between Impala’s analytic-database architecture and SQL-on-Apache Hadoop alternatives. Today, we are excited to continue that track record via some important performance gains for Impala 2.5 (with more to come on the roadmap), summarized below.

Overall, compared to Impala 2.3, in Impala 2.5:

  • TPC-DS queries run on average 4.3x faster.
  • TPC-H queries run 2.2x faster on flat tables, and 1.71x faster on nested tables.

Nested Types in Impala

This document discusses nested data types in Impala, including structs, maps, and arrays. It provides an example schema using these types, describes Impala's SQL syntax extensions for querying nested data, and discusses techniques for advanced querying capabilities like correlated subqueries. The execution model materializes minimal nested structures in memory and uses new execution nodes to handle nested data types.
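A minimal sketch of the SQL syntax extensions (schema and names are made up for illustration): an ARRAY column can be referenced in the FROM clause like a table, with each element implicitly correlated to the row that contains it:

```sql
-- Hypothetical schema with a nested ARRAY of STRUCTs.
CREATE TABLE customers (
  id BIGINT,
  name STRING,
  orders ARRAY<STRUCT<order_id: BIGINT, total: DOUBLE>>
) STORED AS PARQUET;

-- The array appears in the FROM clause; each element is joined
-- with its containing row.
SELECT c.name, o.item.order_id, o.item.total
FROM customers c, c.orders o
WHERE o.item.total > 100;
```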

Presented at the Impala Meetup, PA, March 24th, 2015


Impala: A Modern, Open-Source SQL Engine for Hadoop

Presented at The Conference on Innovative Data Systems Research (CIDR) 2015.

ABSTRACT

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components, and briefly demonstrates its superior performance compared with other popular SQL-on-Hadoop systems.

Paper | Slides