Post-Installation Configuration for Impala
This section describes the mandatory and recommended configuration settings for Impala. If Impala is installed using cluster management software, some of these configurations might be completed automatically; you must still configure short-circuit reads manually. If you want to customize your environment, consider making the changes described in this topic.
- You must enable short-circuit reads, whether or not Impala was installed with cluster management software. This setting goes in the Impala configuration settings, not the Hadoop-wide settings.
- You must enable block location tracking, and you can optionally enable native checksumming for optimal performance.
Mandatory: Short-Circuit Reads
Enabling short-circuit reads allows Impala to read local data directly from the file system, removing the need to communicate through the DataNodes and improving performance. This setting also minimizes the number of additional copies of data. Short-circuit reads require libhadoop.so (the Hadoop Native Library) to be accessible to both the server and the client. libhadoop.so is not available if you have installed from a tarball; you must install from an .rpm, .deb, or parcel to use short-circuit local reads.
To configure DataNodes for short-circuit reads:
- Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
- On all Impala nodes, configure the following properties in Impala's copy of hdfs-site.xml as shown:

    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/run/hdfs-sockets/dn</value>
    </property>
    <property>
      <name>dfs.client.file-block-storage-locations.timeout.millis</name>
      <value>10000</value>
    </property>

- If /var/run/hadoop-hdfs/ is group-writable, make sure its group is root.
  Note: If you are also going to enable block location tracking, you can skip copying configuration files and restarting DataNodes and go straight to Mandatory: Block Location Tracking. Short-circuit reads and block location tracking require the same final steps of copying files and restarting services, so you can complete those steps once after making all configuration changes. Whether you copy files and restart services now or while configuring block location tracking, short-circuit reads are not enabled until you complete those final steps.
- After applying these changes, restart all DataNodes.
Mandatory: Block Location Tracking
Enabling block location metadata allows Impala to know which disks data blocks are located on, allowing better utilization of the underlying disks. Impala will not start unless this setting is enabled.
To enable block location tracking:
- For each DataNode, add the following to the hdfs-site.xml file:

    <property>
      <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
      <value>true</value>
    </property>

- Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
- After applying these changes, restart all DataNodes.
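The copy step above appears in both procedures and can be scripted. This is an illustrative sketch only: the Hadoop configuration directory shown is a common default and an assumption, while the Impala directory is the default named in the text.

```python
# Sketch: copy the client configuration files into the Impala config
# directory, as described in the steps above.
import os
import shutil

HADOOP_CONF = "/etc/hadoop/conf"   # assumption: a typical Hadoop config directory
IMPALA_CONF = "/etc/impala/conf"   # default Impala config location from the text

def copy_client_configs(src=HADOOP_CONF, dst=IMPALA_CONF):
    """Copy core-site.xml and hdfs-site.xml from src to dst, preserving
    file metadata; return the list of files copied."""
    copied = []
    for name in ("core-site.xml", "hdfs-site.xml"):
        shutil.copy2(os.path.join(src, name), os.path.join(dst, name))
        copied.append(name)
    return copied
```

Remember that copying the files takes effect only after you restart the DataNodes, as the final step notes.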
Optional: Native Checksumming
Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if that library is available.
To enable native checksumming:
If you installed from packages, the native checksumming library is installed and set up correctly, and no additional steps are required. Conversely, if you installed by other means, such as from tarballs, native checksumming may not be available due to missing shared objects. Finding the message "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable" in the Impala logs indicates that native checksumming may be unavailable. To enable native checksumming, you must build and install libhadoop.so (the Hadoop Native Library).
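One way to spot that warning programmatically is a simple scan of the log text. This is a minimal sketch; actual log file locations vary by installation, so reading the log is left to you.

```python
# Sketch: detect the native-library warning quoted above in Impala log text.
NATIVE_WARNING = ("Unable to load native-hadoop library for your platform... "
                  "using builtin-java classes where applicable")

def native_checksumming_suspect(log_text: str) -> bool:
    """Return True if the log text contains the warning, which suggests
    that libhadoop.so (and therefore native checksumming) is unavailable."""
    return NATIVE_WARNING in log_text
```

On recent Hadoop versions, running hadoop checknative on a node also reports which native libraries could be loaded, which can corroborate what the log scan suggests.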