Using Impala with Amazon S3 Object Store

You can use Impala to query data residing on the Amazon S3 object store. This capability allows convenient access to a storage system that is remotely managed, accessible from anywhere, and integrated with various cloud-based services. Impala can query files in any supported file format from S3. The S3 storage location can be for an entire table, or individual partitions in a partitioned table.

Best Practices for Using Impala with S3

The following guidelines summarize the best practices described in the rest of this topic:
  • Specify the S3 credentials centrally in the core-site.xml file rather than embedding them in s3a:// URLs.
  • Always use the s3a:// prefix in LOCATION clauses; Impala does not support the s3:// or s3n:// prefixes.
  • After you add or remove files in S3 outside of Impala, issue a REFRESH statement for the affected table.
  • For internal (managed) tables that reside in S3, use DROP TABLE table_name PURGE rather than the default DROP TABLE statement.
  • Consider the S3_SKIP_INSERT_STAGING query option to speed up INSERT statements for S3 tables and partitions.
  • Follow the same query tuning recommendations for S3 tables as for HDFS tables, such as running COMPUTE STATS.

How Impala SQL Statements Work with S3

Impala SQL statements work with data in S3 as follows:
  • The CREATE TABLE, ALTER TABLE, and CREATE DATABASE statements designate data in S3 through the LOCATION clause with an s3a:// prefix.
  • The SELECT statement transparently accesses data files stored in S3.
  • The DML statements (INSERT, LOAD DATA, and CREATE TABLE AS SELECT) can write data to tables and partitions that reside in S3.
  • The DROP TABLE statement removes the associated S3 data files for internal (managed) tables, and leaves them in place for external tables.
  • The REFRESH statement makes Impala aware of data files added to or removed from an S3 location outside of Impala.

For usage information about Impala SQL statements with S3 tables, see Creating Impala Databases, Tables, and Partitions for Data Stored in S3 and Using Impala DML Statements for S3 Data.

Specifying Impala Credentials to Access Data in S3

To allow Impala to access data in S3, specify values for the following configuration settings in your core-site.xml file:


<property>
  <name>fs.s3a.access.key</name>
  <value>your_access_key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>your_secret_key</value>
</property>

After specifying the credentials, restart both the Impala and Hive services. Restarting Hive is required because Impala statements, such as CREATE TABLE, go through the Hive Metastore.

Important:

Although you can specify the access key ID and secret key as part of the s3a:// URL in the LOCATION attribute, doing so makes this sensitive information visible in many places, such as DESCRIBE FORMATTED output and Impala log files. Therefore, specify this information centrally in the core-site.xml file, and restrict read access to that file to only trusted users.

See Authenticating with S3 for additional authentication mechanisms to access S3.

Loading Data into S3 for Impala Queries

If your ETL pipeline involves moving data into S3 and then querying through Impala, you can either use Impala DML statements to create, move, or copy the data, or use the same data loading techniques as you would for non-Impala data.

Using Impala DML Statements for S3 Data

The Impala DML statements (INSERT, LOAD DATA, and CREATE TABLE AS SELECT) can write data into a table or partition that resides in S3. The syntax of the DML statements is the same as for any other tables because the S3 location for tables and partitions is specified by an s3a:// prefix in the LOCATION attribute of CREATE TABLE or ALTER TABLE statements. If you bring data into S3 using the normal S3 transfer mechanisms instead of Impala DML statements, issue a REFRESH statement for the table before using Impala to query the S3 data.
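
For example, the following statements create an S3-backed table and write to it with Impala DML. The table name, column definitions, and bucket path are illustrative assumptions rather than part of any existing schema:

CREATE TABLE sales_s3 (id BIGINT, amount DECIMAL(10,2))
   LOCATION 's3a://impala-demo/sales_s3';

-- Impala DML writes the data files directly under the s3a:// location.
INSERT INTO sales_s3 VALUES (1, 9.99), (2, 19.99);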

Because of differences between S3 and traditional filesystems, DML operations for S3 tables can take longer than for tables on HDFS. For example, both the LOAD DATA statement and the final stage of the INSERT and CREATE TABLE AS SELECT statements involve moving files from one directory to another. (In the case of INSERT and CREATE TABLE AS SELECT, the files are moved from a temporary staging directory to the final destination directory.) Because S3 does not support a "rename" operation for existing objects, in these cases Impala actually copies the data files from one location to another and then removes the original files. In Impala 2.6 and higher, the S3_SKIP_INSERT_STAGING query option provides a way to speed up INSERT statements for S3 tables and partitions, with the tradeoff that a problem during statement execution could leave data in an inconsistent state. It does not apply to INSERT OVERWRITE or LOAD DATA statements. See S3_SKIP_INSERT_STAGING Query Option for details.
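
For example, in impala-shell you might enable the option for the current session before running an INSERT; the source table name here is an illustrative assumption:

-- Skip the staging directory and write directly to the destination.
SET S3_SKIP_INSERT_STAGING=true;
INSERT INTO sales_s3 SELECT id, amount FROM sales_staging;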

Manually Loading Data into Impala Tables in S3

As an alternative, or on earlier Impala releases without DML support for S3, you can use the Amazon-provided methods to bring data files into S3 for querying through Impala. See the Amazon S3 web site for details.

Important:
For best compatibility with the S3 write support in Impala 2.6 and higher:
  • Use native Hadoop techniques to create data files in S3 for querying through Impala.
  • Use the PURGE clause of DROP TABLE when dropping internal (managed) tables.
By default, when you drop an internal (managed) table, the data files are moved to the HDFS trashcan. This operation is expensive for tables that reside on the Amazon S3 object store. Therefore, for S3 tables, prefer to use DROP TABLE table_name PURGE rather than the default DROP TABLE statement. The PURGE clause makes Impala delete the data files immediately, skipping the HDFS trashcan. For the PURGE clause to work effectively, you must originally create the data files on S3 using one of the tools from the Hadoop ecosystem, such as hadoop fs -cp, or INSERT in Impala or Hive.
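
For example, the following statement (with an illustrative table name) removes an internal S3 table and deletes its data files immediately, bypassing the HDFS trashcan:

DROP TABLE sales_s3_archive PURGE;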

After you upload data files to a location already mapped to an Impala table or partition, or if you delete files in S3 from such a location, issue the REFRESH statement to make Impala aware of the new set of data files.
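
As an illustration, the following impala-shell session uploads a data file with a standard Hadoop command and then refreshes the table metadata; the local file name, bucket path, and table name are assumptions for the example:

!hdfs dfs -put /tmp/new_sales.parq s3a://impala-demo/sales_s3/
REFRESH sales_s3;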

Creating Impala Databases, Tables, and Partitions for Data Stored in S3

To create a table that resides in S3, run the CREATE TABLE or ALTER TABLE statement with the LOCATION clause.

ALTER TABLE can set the LOCATION property for an individual partition, so that some data in a table resides in S3 and other data in the same table resides on HDFS.

The syntax for the LOCATION clause is:

LOCATION 's3a://bucket_name/path/to/file'

The file system prefix is always s3a://. Impala does not support the s3:// or s3n:// prefixes.

For a partitioned table, either specify a separate LOCATION clause for each new partition, or specify a base LOCATION for the table and set up a directory structure in S3 to mirror the way Impala partitioned tables are structured in HDFS.

You point a nonpartitioned table or an individual partition at S3 by specifying a single directory path in S3, which could be any arbitrary directory. To replicate the structure of an entire Impala partitioned table or database in S3 requires more care, with directories and subdirectories nested and named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if necessary in HDFS, and recording the complete directory structure so that you can replicate it in S3.

When working with multiple tables with data files stored in S3, you can create a database with a LOCATION attribute pointing to an S3 path. Specify a URL of the form s3a://bucket/root/path/for/database for the LOCATION attribute of the database. Any tables created inside that database automatically create directories underneath the one specified by the database LOCATION attribute.

The following example creates a table in which the partition for year 2017 resides on HDFS and the partition for year 2018 resides in S3.

The partition for year 2018 includes a LOCATION attribute with an s3a:// URL, and so refers to data residing in S3, under a specific path underneath the bucket impala-demo.

CREATE TABLE mostly_on_hdfs (x INT) PARTITIONED BY (year INT);
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2017);
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2018)
   LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1';

The following session creates a database and a partitioned table residing entirely in S3, with multiple partition key columns.

CREATE DATABASE db_on_s3 LOCATION 's3a://impala-demo/dir1/dir2/dir3';
CREATE TABLE partitioned_multiple_keys (x INT)
   PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT);

ALTER TABLE partitioned_multiple_keys
   ADD PARTITION (year=2015,month=1,day=1);
ALTER TABLE partitioned_multiple_keys
   ADD PARTITION (year=2015,month=1,day=31);

!hdfs dfs -ls -R s3a://impala-demo/dir1/dir2/dir3
2015-03-17 13:56:34          0 dir1/dir2/dir3/
2015-03-17 16:47:13          0 dir1/dir2/dir3/partitioned_multiple_keys/
2015-03-17 16:47:44          0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/
2015-03-17 16:47:50          0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/

The CREATE DATABASE and CREATE TABLE statements create the associated directory paths if they do not already exist. You can specify multiple levels of directories, and the CREATE statement creates all appropriate levels, similar to using mkdir -p.

Use the standard S3 file upload methods to put the actual data files into the right locations. You can also put the directory paths and data files in place before creating the associated Impala databases or tables, and Impala automatically uses the data from the appropriate location after the associated databases and tables are created.

Use the ALTER TABLE statement with the LOCATION clause to switch whether an existing table or partition points to data in HDFS or S3. For example, if you have an Impala table or partition pointing to data files in HDFS or S3, and you later transfer those data files to the other filesystem, use the ALTER TABLE statement to adjust the LOCATION attribute of the corresponding table or partition to reflect that change.
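
For example, after copying the data files for the year 2018 partition from S3 back to HDFS, you might repoint that partition with a statement like the following; the HDFS path is an illustrative assumption:

ALTER TABLE mostly_on_hdfs PARTITION (year=2018)
   SET LOCATION 'hdfs://your_namenode_host/user/impala/mostly_on_hdfs/year=2018';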

Internal and External Tables Located in S3

Just as with tables located on HDFS storage, you can designate S3-based tables as either internal (managed by Impala) or external, by using the CREATE TABLE or CREATE EXTERNAL TABLE syntax, respectively.

When you drop an internal table, the files associated with the table are removed, even if they are in S3 storage. When you drop an external table, the files associated with the table are left alone, and are still available for access by other tools or components.

If the data in S3 is intended to be long-lived and accessed by other tools in addition to Impala, create any associated S3 tables with the CREATE EXTERNAL TABLE syntax, so that the files are not deleted from S3 when the table is dropped.

If the data in S3 is only needed for querying by Impala and can be safely discarded once the Impala workflow is complete, create the associated S3 tables using the CREATE TABLE syntax, so that dropping the table also deletes the corresponding data files in S3.
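
For example, the following statements (with illustrative names and paths) create an external table whose S3 data files survive a DROP TABLE, and an internal table whose data files are removed when the table is dropped:

CREATE EXTERNAL TABLE shared_s3_data (x INT)
   LOCATION 's3a://impala-demo/shared/long_lived_data';

CREATE TABLE scratch_s3_data (x INT)
   LOCATION 's3a://impala-demo/scratch/temporary_data';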

Running and Tuning Impala Queries for Data Stored in S3

Once a table or partition is designated as residing in S3, the SELECT statement transparently accesses the data files from the appropriate storage layer.

Understanding and Tuning Impala Query Performance for S3 Data

Here are techniques you can use to interpret explain plans and profiles for queries against S3 data, and tips to achieve the best performance possible for such queries.

All else being equal, performance is expected to be lower for queries running against data in S3 rather than HDFS. The actual mechanics of the SELECT statement are somewhat different when the data is in S3. Although the work is still distributed across the DataNodes of the cluster, Impala might parallelize the work for a distributed query differently for data on HDFS and S3.

S3 does not have the same block notion as HDFS, so Impala uses heuristics to determine how to split up large S3 files for processing in parallel. Because all hosts can access any S3 data file with equal efficiency, the distribution of work might be different than for HDFS data, where the data blocks are physically read using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to read the S3 data might be spread evenly across the hosts of the cluster, the fact that all data is initially retrieved across the network means that the overall query performance is likely to be lower for S3 data than for HDFS data.

Use the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to control the Parquet-specific split size. The default value is 256 MB.
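
For example, to reduce the split size to 128 MB for the current impala-shell session (the value can be specified in bytes):

SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;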

When optimizing aspects of complex queries, such as the join order, Impala treats tables on HDFS and S3 the same way. Therefore, follow all the same tuning recommendations for S3 tables as for HDFS ones, such as using the COMPUTE STATS statement to help Impala construct accurate estimates of row counts and cardinality. See Tuning Impala for Performance for details.
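
For example, to gather statistics for one of the tables defined earlier:

COMPUTE STATS partitioned_multiple_keys;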

In query profile reports, the numbers for BytesReadLocal, BytesReadShortCircuit, BytesReadDataNodeCached, and BytesReadRemoteUnexpected are blank because those metrics come from HDFS. By definition, all the I/O for S3 tables involves remote reads.

Restrictions on Impala Support for S3

The following restrictions apply when using Impala with S3: