You can use Impala to query HBase tables. This is useful for accessing any of your existing HBase tables via SQL and performing analytics over them. HDFS and Kudu tables are preferred over HBase for analytic workloads and offer superior performance. Kudu supports efficient inserts, updates and deletes of small numbers of rows and can replace HBase for most analytics-oriented use cases. See Using Impala to Query Kudu Tables for information on using Impala with Kudu.
From the perspective of an Impala user, coming from an RDBMS background, HBase is a kind of key-value store where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the various fields of the value are mapped to the other columns in the Impala table.
For background information on HBase, see the Apache HBase documentation.
When you use Impala with HBase:
CREATE
TABLE
statement currently does not support custom SerDes and some other syntax needed for these
tables:
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
clause on the Hive CREATE
TABLE
statement.
TBLPROPERTIES("hbase.table.name" = "table_name_in_hbase")
on the
Hive CREATE TABLE
statement.
#string
keyword, or map it to a STRING
column.
INVALIDATE METADATA
statement in impala-shell to make Impala aware of
the new table.)
WHERE
clause to find a single key value or a
range of key values wherever practical, by testing the Impala column
corresponding to the HBase row key. Avoid queries that do full-table
scans, which are efficient for regular Impala tables but inefficient
in HBase.
To work with an HBase table from Impala, ensure that the impala
user has read/write
privileges for the HBase table, using the GRANT
command in the HBase shell. For details
about HBase security, see the Security chapter in the Apache HBase documentation.
HBase works out of the box with Impala. There is no mandatory configuration needed to use these two components together.
To avoid delays if HBase is unavailable during Impala startup or after an INVALIDATE
METADATA
statement, set timeout values similar to the following in
/etc/impala/conf/hbase-site.xml:
<property>
<name>hbase.client.retries.number</name>
<value>3</value>
</property>
<property>
<name>hbase.rpc.timeout</name>
<value>3000</value>
</property>
To understand how Impala column data types are mapped to fields in HBase, you should have some background
knowledge about HBase first. You set up the mapping by running the CREATE TABLE
statement
in the Hive shell. See
the
Hive wiki for a starting point, and Examples of Querying HBase Tables from Impala for examples.
HBase works as a kind of "bit bucket", in the sense that HBase does not enforce any typing for the key or value fields. All the type enforcement is done on the Impala side.
For best performance of Impala queries against HBase tables, most
queries will perform comparisons in the WHERE
clause
against the column that corresponds to the HBase row key. When creating
the table through the Hive shell, use the STRING
data
type for the column that corresponds to the HBase row key. Impala can
translate predicates (through operators such as =
,
<
, and BETWEEN
) against this
column into fast lookups in HBase, but this optimization ("predicate
pushdown") only works when that column is defined as
STRING
.
Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive
CREATE TABLE
statement using binary data types, represented in the Hive table definition
using the #binary
keyword, often abbreviated as #b
. Defining numeric
columns as binary can reduce the overall data volume in the HBase tables. You should still define the
column that corresponds to the HBase row key as a STRING
, to allow fast lookups using
those columns.
To understand the performance characteristics of SQL queries against data stored in HBase, you should have some background knowledge about how HBase interacts with SQL-oriented systems first. See the Hive wiki for a starting point; because Impala shares the same metastore database as Hive, the information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too.
Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This querying does not read HFiles directly. The extra communication overhead makes it important to choose what data to store in HBase or in HDFS, and construct efficient queries that can retrieve the HBase data efficiently:
WHERE
clause referencing that table,
that is a strong indicator that it is an inefficient query for an HBase table.)
Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison operations do not work. For example, if row keys are not mapped to string columns, evaluating for greater than (>) or less than (<) cannot be completed.
Predicates on non-key columns can be sent to HBase to scan as SingleColumnValueFilters
,
providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates
were applied using Impala. While there is some improvement, it is not as great when start and stop rows are
used. This is because the number of rows that HBase must examine is not limited as it is when start and
stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and
return that row. Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase
must still scan the entire table to find the correct result.
For example, here are some queries against the following Impala table, which is mapped to an HBase table.
The examples show excerpts from the output of the EXPLAIN
statement, demonstrating what
things to look for to indicate an efficient or inefficient query against an HBase table.
The first column (cust_id
) was specified as the key column in the CREATE
EXTERNAL TABLE
statement; for performance, it is important to declare this column as
STRING
. Other columns, such as BIRTH_YEAR
and
NEVER_LOGGED_ON
, are also declared as STRING
, rather than their
"natural" types of INT
or BOOLEAN
, because Impala can optimize
those types more effectively in HBase tables. For comparison, we leave one column,
YEAR_REGISTERED
, as INT
to show that filtering on this column is
inefficient.
describe hbase_table;
Query: describe hbase_table
+-----------------------+--------+---------+
| name | type | comment |
+-----------------------+--------+---------+
| cust_id | string | |
| birth_year | string | |
| never_logged_on | string | |
| private_email_address | string | |
| year_registered | int | |
+-----------------------+--------+---------+
The best case for performance involves a single row lookup using an equality comparison on the column defined as the row key:
explain select count(*) from hbase_table where cust_id = 'some_user@example.com';
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| hbase.hbase_table |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HBASE [hbase.hbase_table] |
| start key: some_user@example.com |
| stop key: some_user@example.com\0 |
+------------------------------------------------------------------------------------+
Another type of efficient query involves a range lookup on the row key column, using SQL operators such
as greater than (or equal), less than (or equal), or BETWEEN
. This example also includes
an equality test on a non-key column; because that column is a STRING
, Impala can let
HBase perform that test, indicated by the hbase filters:
line in the
EXPLAIN
output. Doing the filtering within HBase is more efficient than transmitting all
the data to Impala and doing the filtering on the Impala side.
explain select count(*) from hbase_table where cust_id between 'a' and 'b'
and never_logged_on = 'true';
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HBASE [hbase.hbase_table] |
| start key: a |
| stop key: b\0 |
| hbase filters: cols:never_logged_on EQUAL 'true' |
+------------------------------------------------------------------------------------+
The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the
entire HBase table. Impala can only push down predicates to HBase for columns declared as
STRING
. This example tests a column declared as INT
, and the
predicates:
line in the EXPLAIN
output indicates that the test is
performed after the data is transmitted to Impala.
explain select count(*) from hbase_table where year_registered = 2010;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HBASE [hbase.hbase_table] |
| predicates: year_registered = 2010 |
+------------------------------------------------------------------------------------+
The same inefficiency applies if the key column is compared to any non-constant value. Here, even though
the key column is a STRING
, and is tested using an equality operator, Impala must scan
the entire HBase table because the key column is compared to another column value rather than a constant.
explain select count(*) from hbase_table where cust_id = private_email_address;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HBASE [hbase.hbase_table] |
| predicates: cust_id = private_email_address |
+------------------------------------------------------------------------------------+
Currently, tests on the row key using OR
or IN
clauses are not
optimized into direct lookups either. Such limitations might be lifted in the future, so always check the
EXPLAIN
output to be sure whether a particular SQL construct results in an efficient
query or not for HBase tables.
explain select count(*) from hbase_table where
cust_id = 'some_user@example.com' or cust_id = 'other_user@example.com';
+----------------------------------------------------------------------------------------+
| Explain String |
+----------------------------------------------------------------------------------------+
...
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HBASE [hbase.hbase_table] |
| predicates: cust_id = 'some_user@example.com' OR cust_id = 'other_user@example.com' |
+----------------------------------------------------------------------------------------+
explain select count(*) from hbase_table where
cust_id in ('some_user@example.com', 'other_user@example.com');
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HBASE [hbase.hbase_table] |
| predicates: cust_id IN ('some_user@example.com', 'other_user@example.com') |
+------------------------------------------------------------------------------------+
Either rewrite into separate queries for each value and combine the results in the application, or combine the single-row queries using UNION ALL:
select count(*) from hbase_table where cust_id = 'some_user@example.com';
select count(*) from hbase_table where cust_id = 'other_user@example.com';
explain
select count(*) from hbase_table where cust_id = 'some_user@example.com'
union all
select count(*) from hbase_table where cust_id = 'other_user@example.com';
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
...
| | 04:AGGREGATE |
| | | output: count(*) |
| | | |
| | 03:SCAN HBASE [hbase.hbase_table] |
| | start key: other_user@example.com |
| | stop key: other_user@example.com\0 |
| | |
| 10:MERGE |
...
| 02:AGGREGATE |
| | output: count(*) |
| | |
| 01:SCAN HBASE [hbase.hbase_table] |
| start key: some_user@example.com |
| stop key: some_user@example.com\0 |
+------------------------------------------------------------------------------------+
If you have an HBase Java application that calls the
setCacheBlocks
or setCaching
methods of the class org.apache.hadoop.hbase.client.Scan, you can set these same
caching behaviors through Impala query options, to control the memory
pressure on the HBase RegionServer. For example, when doing queries in
HBase that result in full-table scans (which by default are
inefficient for HBase), you can reduce memory usage and speed up the
queries by turning off the HBASE_CACHE_BLOCKS
setting
and specifying a large number for the HBASE_CACHING
setting.
To set these options, issue commands like the following in impala-shell:
-- Same as calling setCacheBlocks(true) or setCacheBlocks(false).
set hbase_cache_blocks=true;
set hbase_cache_blocks=false;
-- Same as calling setCaching(rows).
set hbase_caching=1000;
Or update the impalad defaults file /etc/default/impala and
include settings for HBASE_CACHE_BLOCKS
and/or HBASE_CACHING
in the
-default_query_options
setting for IMPALA_SERVER_ARGS
. See
Modifying Impala Startup Options for details.
SET
statement.
The following are representative use cases for using Impala to query HBase tables:
Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically
recording many attributes for an important subject such as a user of an online service. These tables
are also often sparse, that is, most of the columns values are NULL
, 0,
false
, empty string, or other blank or placeholder value. (For example, any particular
web site user might have never used some site feature, filled in a certain field in their profile,
visited a particular part of the site, and so on.) A typical query against this kind of table is to
look up a single row to retrieve all the information about a specific subject, rather than summing,
averaging, or filtering millions of rows as in typical Impala-managed tables.
The Impala INSERT
statement works for HBase tables. The INSERT ... VALUES
syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an
HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by
INSERT ... VALUES
are extremely inefficient, so you would not use that technique with
tables containing any significant data volume.)
When you use the INSERT ... SELECT
syntax, the result in the HBase table could be fewer
rows than you expect. HBase only stores the most recent version of each unique row key, so if an
INSERT ... SELECT
statement copies over multiple rows containing the same value for the
key column, subsequent queries will only return one row with each key column value:
Although Impala does not have an UPDATE
statement, you can achieve the same effect by
doing successive INSERT
statements using the same value for the key column each time:
The Impala integration with HBase has the following limitations and restrictions, some inherited from the integration between HBase and Hive, and some unique to Impala:
If you issue a DROP TABLE
for an internal (Impala-managed) table that is mapped to an
HBase table, the underlying table is not removed in HBase. The Hive DROP TABLE
statement also removes the HBase table in this case.
The INSERT OVERWRITE
statement is not available for HBase tables. You can insert new
data, or modify an existing row by inserting a new row with the same key value, but not replace the
entire contents of the table. You can do an INSERT OVERWRITE
in Hive if you need this
capability.
If you issue a CREATE TABLE LIKE
statement for a table mapped to an HBase table, the
new table is also an HBase table, but inherits the same underlying HBase table name as the original.
The new table is effectively an alias for the old one, not a new table with identical column structure.
Avoid using CREATE TABLE LIKE
for HBase tables, to avoid any confusion.
Copying data into an HBase table using the Impala INSERT ... SELECT
syntax might
produce fewer new rows than are in the query result set. If the result set contains multiple rows with
the same value for the key column, each row supercedes any previous rows with the same key value.
Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve
the "latest" version of a particular key value.
Because the complex data types (ARRAY
, STRUCT
, and MAP
)
available in Impala 2.3 and higher are currently only supported in Parquet tables, you cannot
use these types in HBase tables that are queried through Impala.
The LOAD DATA
statement cannot be used with HBase tables.
The TABLESAMPLE
clause of the SELECT
statement does
not apply to a table reference derived from a view, a subquery, or anything other than a
real base table. This clause only works for tables backed by HDFS or HDFS-like data
files, therefore it does not apply to Kudu or HBase tables.
The following examples create an HBase table with four column families, create a corresponding table through Hive, then insert and query the table through Impala.
In HBase shell, the table
name is quoted in CREATE
and DROP
statements. Tables created in HBase
begin in "enabled" state; before dropping them through the HBase shell, you must issue a
disable 'table_name'
statement.
$ hbase shell
15/02/10 16:07:45
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
...
hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF'
0 row(s) in 4.6520 seconds
=> Hbase::Table - hbasealltypessmall
hbase(main):006:0> quit
Issue the following CREATE TABLE
statement in the Hive shell. (The Impala CREATE
TABLE
statement currently does not support the STORED BY
clause, so you switch into Hive to
create the table, then back to Impala and the impala-shell interpreter to issue the
queries.)
This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is defined as an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all.
The WITH SERDEPROPERTIES
clause
specifies that the first column (ID
) represents the row key, and maps the remaining
columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the
columns in the table, not the column names in the CREATE TABLE
statement.
The first column is defined to be the lookup key; the
STRING
data type produces the fastest key-based lookups for HBase tables.
STRING
column as the row key, as shown in this example.
$ hive
...
hive> use hbase;
OK
Time taken: 4.095 seconds
hive> CREATE EXTERNAL TABLE hbasestringids (
> id string,
> bool_col boolean,
> tinyint_col tinyint,
> smallint_col smallint,
> int_col int,
> bigint_col bigint,
> float_col float,
> double_col double,
> date_string_col string,
> string_col string,
> timestamp_col timestamp)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
> "hbase.columns.mapping" =
> ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:\
> bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,\
> stringsCF:string_col,stringsCF:timestamp_col"
> )
> TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");
OK
Time taken: 2.879 seconds
hive> quit;
Once you have established the mapping to an HBase table, you can issue DML statements and queries
from Impala. The following example shows a series of INSERT
statements followed by a query.
The ideal kind of query from a performance standpoint
retrieves a row from the table based on a row key
mapped to a string column.
An initial INVALIDATE METADATA table_name
statement makes the table created through Hive visible to Impala.
$ impala-shell -i localhost -d hbase
Starting Impala Shell without Kerberos authentication
Connected to localhost:21000
...
Query: use `hbase`
[localhost:21000] > invalidate metadata hbasestringids;
Fetched 0 row(s) in 0.09s
[localhost:21000] > desc hbasestringids;
+-----------------+-----------+---------+
| name | type | comment |
+-----------------+-----------+---------+
| id | string | |
| bool_col | boolean | |
| double_col | double | |
| float_col | float | |
| bigint_col | bigint | |
| int_col | int | |
| smallint_col | smallint | |
| tinyint_col | tinyint | |
| date_string_col | string | |
| string_col | string | |
| timestamp_col | timestamp | |
+-----------------+-----------+---------+
Fetched 11 row(s) in 0.02s
[localhost:21000] > insert into hbasestringids values ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',now());
Inserted 1 row(s) in 0.26s
[localhost:21000] > insert into hbasestringids values ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',now());
Inserted 1 row(s) in 0.12s
[localhost:21000] > select * from hbasestringids where id = '0001';
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
| id | bool_col | double_col | float_col | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col | timestamp_col |
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
| 0001 | true | 3.141 | 9.939999580383301 | 1234567 | 32768 | 4000 | 76 | 2014-12-31 | Hello world | 2015-02-10 16:36:59.764838000 |
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
Fetched 1 row(s) in 0.54s
INVALIDATE METADATA table_name
statement the next
time you connect to Impala, make Impala aware of the new table. (Prior to Impala 1.2.4,
you could not specify the table name if Impala was not aware of the table yet; in Impala
1.2.4 and higher, specifying the table name avoids reloading the metadata for other
tables that are not changed.)