For most clusters that have multiple users and production availability requirements, you might want to set up a load-balancing proxy server to relay requests to and from Impala.
Set up a software package of your choice to perform these functions.
Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.
Using a load-balancing proxy server for Impala has the following advantages:
The following setup steps are a general outline that apply to any load-balancing proxy software:
-i
option in
impala-shell) to point to the load balancer instead.
Load-balancing software offers a number of algorithms to distribute requests. Each algorithm has its own characteristics that make it suitable in some situations but not others.
Sessions from the same IP address always go to the same coordinator. A good choice
for Impala workloads containing a mix of queries and DDL statements, such as
CREATE TABLE
and ALTER TABLE
. Because the
metadata changes from a DDL statement take time to propagate across the cluster,
prefer to use the Source IP Persistence in this case. If you are unable to choose
Source IP Persistence, run the DDL and subsequent queries that depend on the
results of the DDL through the same session, for example by running
impala-shell -f script_file
to submit several
statements through a single session.
You might need to perform benchmarks and load testing to determine which setting is optimal for your use case. Always set up using two load-balancing algorithms: Source IP Persistence for Hue and Leastconn for others.
In a cluster using Kerberos, applications check host credentials to verify that the host they are connecting to is the same one that is actually processing the request.
In Impala 2.11 and lower versions, once you enable a proxy server in a Kerberized cluster, users will not be able to connect to individual impala daemons directly from impala-shell.
In Impala 2.12 and higher versions, when you
enable a proxy server in a Kerberized cluster, users have an option to connect to Impala
daemons directly from impala-shell using the -b
/
--kerberos_host_fqdn
impala-shell flag. This option
can be used for testing or troubleshooting purposes, but not recommended for live
production environments as it defeats the purpose of a load balancer/proxy.
impala-shell -i impalad-1.mydomain.com -k -b loadbalancer-1.mydomain.com
impala-shell --impalad=impalad-1.mydomain.com:21000 --kerberos --kerberos_host_fqdn=loadbalancer-1.mydomain.com
See impala-shell Configuration Options for information about the option.
To validate the load-balancing proxy server, perform these extra Kerberos setup steps:
impala/proxy_host@realm
in its
keytab. If not, go back over the initial Kerberos configuration
steps for the keytab on each host running the
impalad daemon.
impala/actual_hostname@realm
to
the keytab on each host running the impalad
daemon.
$ ktutil
ktutil: read_kt proxy.keytab
ktutil: read_kt impala.keytab
ktutil: write_kt proxy_impala.keytab
ktutil: quit
klist -k keytabfile
The command lists the credentials for both principal
and
be_principal
on all nodes.
impala
user has the permission to read this merged
keytab file.
impalad
host in the cluster that participates in
the load balancing, add the following configuration options to receive client
connections coming through the load balancer proxy server:
--principal=impala/proxy_host@realm
--be_principal=impala/actual_host@realm
--keytab_file=path_to_merged_keytab
The --principal
setting prevents a client from connecting to a
coordinator impalad
using a principal other than one specified.
--be_principal
because the actual host
name is different on each host. Specify the fully qualified domain name (FQDN) for
the proxy host, not the IP address. Use the exact FQDN as returned by a reverse DNS
lookup for the associated IP address.
When a client connect to Impala, the service principal specified by the client must
match the -principal
setting of the Impala proxy server. And the
client should connect to the proxy server port.
In hue.ini, set the following to configure Hue to automatically connect to the proxy server:
[impala]
server_host=proxy_host
impala_principal=impala/proxy_host
The following are the JDBC connection string formats when connecting through the load balancer with the load balancer's host name in the principal:
jdbc:hive2://proxy_host:load_balancer_port/;principal=impala/_HOST@realm
jdbc:hive2://proxy_host:load_balancer_port/;principal=impala/proxy_host@realm
When starting impala-shell, specify the service principal via the
-b
or --kerberos_host_fqdn
flag.
When TLS/SSL is enabled for Impala, the client application, whether impala-shell, Hue,
or something else, expects the certificate common name (CN) to match the hostname that
it is connected to. With no load balancing proxy server, the hostname and certificate CN
are both that of the impalad
instance. However, with a proxy server,
the certificate presented by the impalad
instance does not match the
load balancing proxy server hostname. If you try to load-balance a TLS/SSL-enabled
Impala installation without additional configuration, you see a certificate mismatch
error when a client attempts to connect to the load balancing proxy host.
You can configure a proxy server in several ways to load balance TLS/SSL enabled Impala:
impalad
. The client and server certificates can
be managed separately. The request or resulting payload is encrypted
in transit at all times. impalad
instance with no interaction from the load balancing proxy
server. Traffic is still encrypted end-to-end.
impalad
instances is unencrypted.
This configuration presumes that cluster hosts reside on a trusted network and only
external client-facing communication need to be encrypted in-transit.
Refer to your load balancer documentation for the steps to set up Impala and the load balancer using one of the options above.
If you are not already using a load-balancing proxy, you can experiment with HAProxy a free, open source load balancer. This example shows how you might install and configure that load balancer on a Red Hat Enterprise Linux system.
Install the load balancer:
yum install haproxy
Set up the configuration file: /etc/haproxy/haproxy.cfg. See the following section for a sample configuration file.
Run the load balancer (on a single host, preferably one not running impalad):
/usr/sbin/haproxy –f /etc/haproxy/haproxy.cfg
In impala-shell, JDBC applications, or ODBC applications, connect
to the listener port of the proxy host, rather than port 21000 or 21050 on a host
actually running impalad. The sample configuration file sets
haproxy to listen on port 25003, therefore you would send all requests to
haproxy_host:25003
.
This is the sample haproxy.cfg used in this example:
global
# To have these messages end up in /var/log/haproxy.log you will
# need to:
#
# 1) configure syslog to accept network log events. This is done
# by adding the '-r' option to the SYSLOGD_OPTIONS in
# /etc/sysconfig/syslog
#
# 2) configure local2 events to go to the /var/log/haproxy.log
# file. A line like the following can be added to
# /etc/sysconfig/syslog
#
# local2.* /var/log/haproxy.log
#
log 127.0.0.1 local0
log 127.0.0.1 local1 notice
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 4000
user haproxy
group haproxy
daemon
# turn on stats unix socket
#stats socket /var/lib/haproxy/stats
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#
# The timeout values should be dependant on how you use the cluster
# and how long your queries run.
#---------------------------------------------------------------------
defaults
mode http
log global
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
maxconn 3000
timeout connect 5000
timeout client 3600s
timeout server 3600s
#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
balance
mode http
stats enable
stats auth username:password
# Setup for Impala.
# Impala client connect to load_balancer_host:25003.
# HAProxy will balance connections among the list of servers listed below.
# The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
# For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
listen impala :25003
mode tcp
option tcplog
balance leastconn
server symbolic_name_1 impala-host-1.example.com:21000 check
server symbolic_name_2 impala-host-2.example.com:21000 check
server symbolic_name_3 impala-host-3.example.com:21000 check
server symbolic_name_4 impala-host-4.example.com:21000 check
# Setup for Hue or other JDBC-enabled applications.
# In particular, Hue requires sticky sessions.
# The application connects to load_balancer_host:21051, and HAProxy balances
# connections to the associated hosts, where Impala listens for
# JDBC requests at port 21050.
listen impalajdbc :21051
mode tcp
option tcplog
balance source
server symbolic_name_5 impala-host-1.example.com:21050 check
server symbolic_name_6 impala-host-2.example.com:21050 check
server symbolic_name_7 impala-host-3.example.com:21050 check
server symbolic_name_8 impala-host-4.example.com:21050 check
check
option at end of each line in the above file to
ensure HAProxy can detect any unreachable Impalad server, and
failover can be successful. Without the TCP check, you may hit an error when the
impalad daemon to which Hue tries to connect is down.
haproxy
, be cautious about reusing the connections. If the load
balancer has set up connection timeout values, either check the connection frequently so
that it never sits idle longer than the load balancer timeout value, or check the
connection validity before using it and create a new one if the connection has been
closed.