Installing Impala

Impala is an open-source analytic database for Apache Hadoop that is designed to answer queries with low latency.

Follow these steps to set up Impala on a cluster by building from source:

  • Download the latest release, using the link on the Impala downloads page.

  • Check the README.md file for a pointer to the build instructions.

  • Verify the MD5 and SHA1 checksums and the GPG signature of the download. Check the signature against the code signing keys of the release managers.

  • Developers interested in working on Impala can clone the Impala source repository:
    
    git clone https://git-wip-us.apache.org/repos/asf/impala.git
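
The checksum step above can be sketched as a small shell helper. This is a minimal sketch, not an official Impala script: the function name verify_digest, the tarball name, and the KEYS file name are placeholders chosen for illustration, and it assumes the GNU coreutils tools md5sum and sha1sum are installed.

```shell
#!/usr/bin/env bash
# Sketch: compare a file's digest with the value published for the release.
# verify_digest is a hypothetical helper, not part of Impala.
# Usage: verify_digest <md5sum|sha1sum> <file> <expected-hex-digest>
verify_digest() {
  local tool="$1" file="$2" expected="$3" actual
  # md5sum and sha1sum print "<digest>  <file>"; keep only the digest.
  actual="$("$tool" "$file" | awk '{print $1}')"
  # Published digests are sometimes upper case; normalize before comparing.
  expected="$(printf '%s' "$expected" | tr 'A-F' 'a-f')"
  if [ "$actual" = "$expected" ]; then
    echo "OK ($tool): $file"
  else
    echo "MISMATCH ($tool): $file" >&2
    return 1
  fi
}

# Example (placeholder file name and digest, substitute the real ones):
#   verify_digest sha1sum apache-impala-X.Y.Z.tar.gz <published-sha1>
#   verify_digest md5sum  apache-impala-X.Y.Z.tar.gz <published-md5>
#
# The GPG signature is checked separately, after importing the release
# managers' keys (KEYS is an assumed file name):
#   gpg --import KEYS
#   gpg --verify apache-impala-X.Y.Z.tar.gz.asc apache-impala-X.Y.Z.tar.gz
```

A non-zero return from verify_digest, or a failed gpg --verify, means the download should not be built or installed.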
    

What is Included in an Impala Installation

Impala is made up of a set of components that can be installed on multiple nodes throughout your cluster. The key installation step for performance is to install the impalad daemon (which does most of the query processing work) on all DataNodes in the cluster.

Impala primarily consists of these executables, which should be available after you build from source:

  • impalad - The Impala daemon. Plans and executes queries against HDFS, HBase, and Amazon S3 data. Run one impalad process on each node in the cluster that has a DataNode.

  • statestored - Name service that tracks the location and status of all impalad instances in the cluster. Run one instance of this daemon on a node in your cluster. Most production deployments run this daemon on the NameNode.

  • catalogd - Metadata coordination service that broadcasts changes from Impala DDL and DML statements to all affected Impala nodes, so that new tables, newly loaded data, and so on are immediately visible to queries submitted through any Impala node. (Prior to Impala 1.2, you had to run the REFRESH or INVALIDATE METADATA statement on each node to synchronize changed metadata. Now those statements are only required if you perform the DDL or DML through an external mechanism such as Hive or by uploading data to the Amazon S3 filesystem.) Run one instance of this daemon on a node in your cluster, preferably on the same host as the statestored daemon.

  • impala-shell - Command-line interface for issuing queries to the Impala daemon. You install this on one or more hosts anywhere on your network, not necessarily DataNodes or even within the same cluster as Impala. It can connect remotely to any instance of the Impala daemon.
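
The placement guidance above can be summarized in a short shell sketch that lists which daemons to start on a host, given its Hadoop roles. The role labels (datanode, namenode) and the function name daemons_for_roles are assumptions made for illustration. Placing statestored and catalogd on the NameNode host follows the common practice described above and is not a requirement.

```shell
#!/usr/bin/env bash
# Sketch: map a host's Hadoop roles to the Impala daemons it would run.
# daemons_for_roles and the role names are hypothetical, for illustration.
daemons_for_roles() {
  local role daemons=""
  for role in "$@"; do
    case "$role" in
      # One impalad on every node that has a DataNode.
      datanode) daemons="${daemons:+$daemons }impalad" ;;
      # One statestored and one catalogd per cluster, kept on the same
      # host; this sketch puts them on the NameNode host.
      namenode) daemons="${daemons:+$daemons }statestored catalogd" ;;
      # Any other role runs no Impala daemon in this sketch.
    esac
  done
  echo "$daemons"
}

daemons_for_roles datanode            # prints: impalad
daemons_for_roles namenode datanode   # prints: statestored catalogd impalad
```

impala-shell is left out of the mapping because it can run on any host. It is pointed at a particular daemon with the -i option, for example impala-shell -i followed by the host name of a node running impalad.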

Before you start working with Impala, ensure that all prerequisites are in place. See Impala Requirements for details.