Benchmarking Impala Queries
Because Impala, like other Hadoop components, is designed to handle large data volumes in a distributed environment, conduct any performance tests using realistic data and cluster configurations. Use a multi-node cluster rather than a single node; run queries against tables containing terabytes of data rather than tens of gigabytes. The parallel processing techniques used by Impala are most appropriate for workloads that are beyond the capacity of a single server.
When you run queries returning large numbers of rows, the CPU time to pretty-print the output can be substantial, giving an inaccurate measurement of the actual query time. Consider using the -B option on the impala-shell command to turn off the pretty-printing, and optionally the -o option to store query results in a file rather than printing to the screen. See impala-shell Configuration Options for details.
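As a rough illustration of the advice above, the following Python sketch builds an impala-shell command line with -B (and optionally -o) and times one run from the client side. The helper names build_benchmark_command and time_query, the sample query, and the output path are hypothetical and not part of Impala. The sketch also assumes impala-shell is on the PATH, and it uses the -q option to pass the query and -i to name the host to connect to. The measured wall-clock time includes shell startup and connection overhead, so treat it as an upper bound rather than the query time itself.

```python
import subprocess
import time


def build_benchmark_command(query, output_file=None, impalad=None):
    """Return an impala-shell argument list with pretty-printing disabled.

    Hypothetical helper for illustration only.
    """
    # -B turns off pretty-printing; -q passes the query text.
    cmd = ["impala-shell", "-B", "-q", query]
    if output_file is not None:
        # -o stores the results in a file instead of printing to the screen.
        cmd += ["-o", output_file]
    if impalad is not None:
        # -i names the impalad host (and optional port) to connect to.
        cmd += ["-i", impalad]
    return cmd


def time_query(query, output_file=None, impalad=None):
    """Run the query once and return elapsed wall-clock seconds.

    Assumes impala-shell is installed and on the PATH. The result includes
    shell startup and connection time, not only query execution.
    """
    cmd = build_benchmark_command(query, output_file, impalad)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```

For example, time_query("SELECT COUNT(*) FROM sales", output_file="/tmp/results.txt") would run a query against a hypothetical sales table and write the unformatted rows to a file. Repeating the call several times and comparing the results helps separate one-time effects, such as cold caches, from steady-state performance.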