How to Configure Impala with Dedicated Coordinators

Each host that runs the Impala Daemon acts as both a coordinator and as an executor, by default, managing metadata caching, query compilation, and query execution. In this configuration, Impala clients can connect to any Impala daemon and send query requests.

During highly concurrent workloads for large-scale queries, the dual roles can cause scalability issues because:

The following factors can further exacerbate the above issues:

If such scalability bottlenecks occur, in Impala 2.9 and higher, you can assign one dedicated role to each Impala daemon host, either as a coordinator or as an executor, to address the issues.

Using dedicated coordinators offers the following benefits:

In this configuration with dedicated coordinators / executors, you cannot connect to the dedicated executor hosts through clients such as impala-shell or business intelligence tools as only the coordinator nodes support client connections.

Determining the Optimal Number of Dedicated Coordinators

You should have the smallest number of coordinators that will still satisfy your workload requirements in a cluster. A rough estimation is 1 coordinator for every 50 executors.

To maintain a healthy state and optimal performance, it is recommended that you keep the peak utilization of all resources used by Impala, including CPU, the number of threads, the number of connections, and RPCs, under 80%.

Consider the following factors to determine the right number of coordinators in your cluster:

Start with the below set of steps to determine the initial number of coordinators:

  1. If your cluster has less than 10 nodes, we recommend that you configure one dedicated coordinator. Deploy the dedicated coordinator on a DataNode to avoid losing storage capacity. In most of cases, one dedicated coordinator is enough to support all workloads on a cluster.
  2. Add more coordinators if the dedicated coordinator CPU or network peak utilization is 80% or higher. You might need 1 coordinator for every 50 executors.
  3. If the Impala service is shared by multiple workgroups with a dynamic resource pool assigned, use one coordinator per pool to avoid admission control over admission.
  4. If high availability is required, double the number of coordinators. One set as an active set and the other as a backup set.

Advanced Tuning

Use the following guidelines to further tune the throughput and stability.
  1. The concurrency of DML statements does not typically depend on the number of coordinators or size of the cluster. Queries that return large result sets (10,000+ rows) consume more CPU and memory resources on the coordinator. Add one or two coordinators if the workload has many such queries.
  2. DDL queries, excluding COMPUTE STATS and CREATE TABLE AS SELECT, are executed only on coordinators. If your workload contains many DDL queries running concurrently, you could add one coordinator.
  3. The CPU contention on coordinators can slow down query executions when concurrency is high, especially for very short queries (<10s). Add more coordinators to avoid CPU contention.
  4. On a large cluster with 50+ nodes, the number of network connections from a coordinator to executors can grow quickly as query complexity increases. The growth is much greater on coordinators than executors. Add a few more coordinators if workloads are complex, i.e. (an average number of fragments * number of Impalad) > 500, but with the low memory/CPU usage to share the load. Watch IMPALA-4603 and IMPALA-7213 to track the progress on fixing this issue.
  5. When using multiple coordinators for DML statements, divide queries to different groups (number of groups = number of coordinators). Configure a separate dynamic resource pool for each group and direct each group of query requests to a specific coordinator. This is to avoid query over admission.
  6. The front-end connection requirement is not a factor in determining the number of dedicated coordinators. Consider setting up a connection pool at the client side instead of adding coordinators. For a short-term solution, you could increase the value of fe_service_threads on coordinators to allow more client connections.
  7. In general, you should have a very small number of coordinators so storage capacity reduction is not a concern. On a very small cluster (less than 10 nodes), deploy a dedicated coordinator on a DataNode to avoid storage capacity reduction.

Estimating Coordinator Resource Usage

Resource Safe range Notes / CM tsquery to monitor
Memory

(Max JVM heap setting +

query concurrency *

query mem_limit)

<=

80% of Impala process memory allocation

Memory usage:

SELECT mem_rss WHERE entityName = "Coordinator Instance ID" AND category = ROLE

JVM heap usage (metadata cache):

SELECT impala_jvm_heap_current_usage_bytes WHERE entityName = "Coordinator Instance ID" AND category = ROLE (only in release 5.15 and above)

TCP Connection Incoming + outgoing < 16K

Incoming connection usage:

SELECT thrift_server_backend_connections_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE

Outgoing connection usage:

SELECT backends_client_cache_clients_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE

Threads < 32K SELECT thread_manager_running_threads WHERE entityName = "Coordinator Instance ID" AND category = ROLE
CPU

Concurrency =

non-DDL query concurrency <=

number of virtual cores allocated to Impala per node

CPU usage estimation should be based on how many cores are allocated to Impala per node, not a sum of all cores of the cluster.

It is recommended that concurrency should not be more than the number of virtual cores allocated to Impala per node.

Query concurrency:

SELECT total_impala_num_queries_registered_across_impalads WHERE entityName = "IMPALA-1" AND category = SERVICE

If usage of any of the above resources exceeds the safe range, add one more coordinator.

Deploying Dedicated Coordinators and Executors

This section describes the process to configure a dedicated coordinator and a dedicated executor roles for Impala.

To configuring dedicated coordinators/executors, you specify one of the following startup flags for the impalad daemon on each host:
  • ‑‑is_executor=false for each host that does not act as an executor for Impala queries. These hosts act exclusively as query coordinators. This setting typically applies to a relatively small number of hosts, because the most common topology is to have nearly all DataNodes doing work for query execution.

  • ‑‑is_coordinator=false for each host that does not act as a coordinator for Impala queries. These hosts act exclusively as executors. The number of hosts with this setting typically increases as the cluster grows larger and handles more table partitions, data files, and concurrent queries. As the overhead for query coordination increases, it becomes more important to centralize that work on dedicated hosts.