Using Impala to Query External JDBC Data Sources
Apache Impala now supports reading from external JDBC data sources. An external JDBC table represents a table or a view in a remote RDBMS database or another Impala cluster. Using external JDBC tables, you can connect Impala to a database, such as MySQL, PostgreSQL, or another Impala cluster and read the data in the remote tables.
Syntax
To connect to a remote database, you create an external JDBC table with the appropriate table properties, such as the database type, JDBC URL, driver class, driver file location, JDBC username and password, and name of the remote table to be mapped to the Impala external JDBC table.
CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
(col_name data_type,
....)
STORED BY JDBC
TBLPROPERTIES (
"database.type"="value",
"jdbc.url"="value",
"jdbc.driver"="value",
"driver.url"="value",
"dbcp.username"="value",
"dbcp.password"="value",
"table"="table.name");
Examples:
CREATE EXTERNAL TABLE student_jdbc (
id INT,
bool_col BOOLEAN,
tinyint_col TINYINT,
smallint_col SMALLINT,
int_col INT,
bigint_col BIGINT,
float_col FLOAT,
double_col DOUBLE,
date_col DATE,
string_col STRING,
timestamp_col TIMESTAMP)
STORED BY JDBC
TBLPROPERTIES (
"database.type"="POSTGRES",
"jdbc.url"="jdbc:postgresql://IP_address:5432/database_name",
"jdbc.driver"="org.postgresql.Driver",
"driver.url"="/test-warehouse/data-sources/jdbc-drivers/postgresql-jdbc.jar",
"dbcp.username"="user",
"dbcp.password"="password",
"table"="student");
CREATE EXTERNAL TABLE student_jdbc (
id INT,
bool_col BOOLEAN,
tinyint_col TINYINT,
smallint_col SMALLINT,
int_col INT,
bigint_col BIGINT,
float_col FLOAT,
double_col DOUBLE,
date_col DATE,
string_col STRING,
timestamp_col TIMESTAMP)
STORED BY JDBC
TBLPROPERTIES (
"database.type"="IMPALA",
"jdbc.url"="jdbc:impala://IP_address:21050/database_name",
"jdbc.auth"="AuthMech=3",
"jdbc.properties"="MEM_LIMIT=1000000000, MAX_ERRORS = 10000",
"jdbc.driver"="com.cloudera.impala.jdbc.Driver",
"driver.url"="hdfs://test-warehouse/data-sources/jdbc-drivers/ImpalaJDBC42.jar",
"dbcp.username"="user",
"dbcp.password.keystore"="jceks://hdfs/test-warehouse/data-sources/test.jceks",
"dbcp.password.key"="password-key",
"table"="student");
Table Properties
While creating an external JDBC table, you are required to specify the following table properties:
database.type
: POSTGRES, MYSQL, or IMPALAjdbc.url
: JDBC connection string with the required parameters — database type, hostname/IP address, port number, and database name.Example: “jdbc:impala://10.96.132.138:21050/sample_db”.
jdbc.driver
: Class name of the JDBC driverdriver.url
: URL to download the JAR file package that is used to access the external databasetable
: Name of the table in the remote database that you want to map in Impala
Besides the above required properties, you can also specify optional parameters that allow you to use different authentication methods, allow case sensitive column names in remote tables, or to specify additional database properties:
jdbc.auth
: Authentication mechanism of the JDBC driverdbcp.username
: JDBC usernamedbcp.password
: JDBC password in clear text.Note: Storing JDBC passwords in clear text is not recommended in production environments. The recommended way is to store the password in a Java keystore file.dbcp.password.key
: Key of the Java keystoredbcp.password.keystore
: URI of the keystore filejdbc.properties
: Additional properties applied to database engines, like Impala Query options. The properties are specified as comma-separated "key-value" pairs.jdbc.fetch.size
: Number of rows to fetch in a batchcolumn.mapping
: Mapping of column names between external table and Impala JDBC table.
Supported Data Types
The following column data types are supported for an Impala external JDBC table:
- Numeric data type: boolean, tinyint, smallint, int, bigint, float, double
- Decimal with scale and precision
- String type: string
- Date
- Timestamp
Limitations
You must be aware of the following limitations while using Impala external JDBC tables:
- Following column data types are not supported: char, varchar, binary,
Complex data types - struct, map, array, and nested type
- JDBC tables have to be defined one table at a time
- Writing to a JDBC table is not supported
- Only supported binary predicates with operators =, !=, <=, >=, <, > to be pushed to RDBMS
Securing the JDBC Password
The dbcp.password
table property stores the JDBC password in clear
text. To avoid the risk of a password leak, the SHOW CREATE TABLE
<table-name>
and DESCRIBE FORMATTED | EXTENDED
<table-name>
statements mask the value of the
dbcp.password
table property in their outputs.
In production environments, it is recommended that you do not store the JDBC password
in clear text using the dbcp.password
table property. Instead, you
can store the password in a Java Keystore file on HDFS or on cloud storage like
Amazon S3 using the following command:
hadoop credential create host1.password -provider jceks://hdfs/user/foo/test.jceks -v passwd1
hadoop credential create impala -provider jceks://s3a@dw-impala-test/jceks/demo.jceks -v passwd2
For more information, see the Apache Hadoop CredentialProvider API Guide.
Support for case-sensitive table and column names
The column names of tables in the remote database can be different from the external
JDBC table schema. For example, Postgres allows case-sensitive column names,
however, Impala saves column names in lowercase. In such situations, you can set the
column.mapping
table property to map column names between
Impala external JDBC tables and the remote tables.
"column.mapping"="id=id, bool_col=Bool_col, tinyint_col=Tinyint_col,
smallint_col=Smallint_col, int_col=Int_col, bigint_col=Bigint_col, float_col=Float_col, double_col=Double_col, date_col=date_col, string_col=String_col, timestamp=Timestamp");
Modifying the external JDBC table
You can use the ALTER TABLE statement to add, drop, or modify columns, or modify the table properties of existing external JDBC tables. The syntax is the same as the other Impala tables.
ALTER TABLE student_jdbc ADD COLUMN IF NOT EXISTS date_col DATE;
ALTER TABLE student_jdbc DROP COLUMN int_col;
ALTER TABLE student_jdbc CHANGE COLUMN date_col timestamp_col TIMESTAMP;
ALTER TABLE student_jdbc
SET TBLPROPERTIES ("dbcp.username"="impala", "dbcp.password"="password");
Querying external JDBC tables
Querying or reading external JDBC tables is the same as querying regular tables in Impala. You can use SELECT statements to query data and can also join the external table with other tables across databases. However, do note that the metadata for the external tables is not persisted in Hive Metastore (HMS).
SELECT * from student_jdbc;
Query options for external JDBC tables
A new query option, CLEAN_DBCP_DS_CACHE is added to save the DBCP SQL DataSource objects in the cache for a longer period of time. This allows the DBCP connection pools to be reused across multiple queries. When the value is set to false, the DBCP SQL DataSource object is not closed when its reference count is 0. The SQL DataSource object is kept in cache until the object is idle for more than 5 minutes.
Type: BOOLEAN
Default: True (1)