Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
|
#include <hash-table.h>
Public Member Functions | |
HashTableCtx (const std::vector< ExprContext * > &build_expr_ctxs, const std::vector< ExprContext * > &probe_expr_ctxs, bool stores_nulls, bool finds_nulls, int32_t initial_seed, int max_levels, int num_build_tuples) | |
void | Close () |
Call to cleanup any resources. More... | |
void | set_level (int level) |
int | level () const |
uint32_t | seed (int level) |
TupleRow * | row () const |
void * | last_expr_value (int expr_idx) const |
bool | last_expr_value_null (int expr_idx) const |
Returns if the expr at 'expr_idx' evaluated to NULL for the last row. More... | |
bool IR_ALWAYS_INLINE | EvalAndHashBuild (TupleRow *row, uint32_t *hash) |
bool IR_ALWAYS_INLINE | EvalAndHashProbe (TupleRow *row, uint32_t *hash) |
int | results_buffer_size () const |
llvm::Function * | CodegenEvalRow (RuntimeState *state, bool build_row) |
llvm::Function * | CodegenEquals (RuntimeState *state) |
llvm::Function * | CodegenHashCurrentRow (RuntimeState *state, bool use_murmur) |
Static Public Attributes | |
static const char * | LLVM_CLASS_NAME = "class.impala::HashTableCtx" |
Private Member Functions | |
uint32_t IR_NO_INLINE | HashCurrentRow () |
uint32_t | Hash (const void *input, int len, int32_t hash) |
Wrapper function for calling correct HashUtil function in non-codegen'd case. More... | |
bool IR_NO_INLINE | EvalBuildRow (TupleRow *row) |
bool IR_NO_INLINE | EvalProbeRow (TupleRow *row) |
uint32_t | HashVariableLenRow () |
bool | EvalRow (TupleRow *row, const std::vector< ExprContext * > &ctxs) |
bool IR_NO_INLINE | Equals (TupleRow *build_row) |
uint32_t | GetHashSeed () const |
Cross-compiled functions to access member variables used in CodegenHashCurrentRow(). More... | |
Private Attributes | |
const std::vector< ExprContext * > & | build_expr_ctxs_ |
const std::vector< ExprContext * > & | probe_expr_ctxs_ |
const bool | stores_nulls_ |
const bool | finds_nulls_ |
int | level_ |
std::vector< uint32_t > | seeds_ |
The seeds to use for hashing. Indexed by the level. More... | |
std::vector< int > | expr_values_buffer_offsets_ |
int | var_result_begin_ |
int | results_buffer_size_ |
uint8_t * | expr_values_buffer_ |
uint8_t * | expr_value_null_bits_ |
TupleRow * | row_ |
Scratch buffer to generate rows on the fly. More... | |
Friends | |
class | HashTable |
Linear or quadratic probing hash table implementation tailored to the usage pattern for partitioned hash aggregation and hash joins. The hash table stores TupleRows and allows for different exprs for insertions and finds. This is the pattern we use for joins and aggregation where the input/build tuple row descriptor is different from the find/probe descriptor. The implementation is designed to allow codegen for some paths. In addition to the hash table there is also an accompanying hash table context that is used for insertions and probes. For example, the hash table context stores evaluated expr results for the current row being processed when possible into a contiguous memory buffer. This allows for efficient hash computation. The hash table does not support removes. The hash table is not thread safe. The table is optimized for the partition hash aggregation and hash joins and is not intended to be a generic hash table implementation. The API loosely mimics the std::hashset API. The data (rows) are stored in a BufferedTupleStream. The basic data structure of this hash table is a vector of buckets. The buckets (indexed by the mod of the hash) contain a pointer to either the slot in the tuple-stream or in case of duplicate values, to the head of a linked list of nodes that in turn contain a pointer to tuple-stream slots. When inserting an entry we start at the bucket at position (hash % size) and search for either a bucket with the same hash or for an empty bucket. If a bucket with the same hash is found, we then compare for row equality and either insert a duplicate node if the equality is true, or continue the search if the row equality is false. Similarly, when probing we start from the bucket at position (hash % size) and search for an entry with the same hash or for an empty bucket. In the former case, we then check for row equality and continue the search if the row equality is false. In the latter case, the probe is not successful. When growing the hash table, the number of buckets is doubled. We trigger a resize when the fill factor is approx 75%. Due to the doubling nature of the buckets, we require that the number of buckets is a power of 2. This allows us to perform a modulo of the hash using a bitmask. We choose to use linear or quadratic probing because they exhibit good (predictable) cache behavior. We require that the number of buckets is a power of 2. This allows us to determine the bucket index using a bitmask. The first NUM_SMALL_BLOCKS of nodes_ are made of blocks less than the IO size (of 8MB) to reduce the memory footprint of small queries. TODO: Compare linear and quadratic probing and remove the loser. TODO: We currently use 32-bit hashes. There is room in the bucket structure for at least 48-bits. We should exploit this space. TODO: Consider capping the probes with a threshold value. If an insert reaches that threshold it is inserted to another linked list of overflow entries. TODO: Smarter resizes, and perhaps avoid using powers of 2 as the hash table size. TODO: this is not a fancy hash table in terms of memory access patterns (cuckoo-hashing or something that spills to disk). We will likely want to invest more time into this. TODO: hash-join and aggregation have very different access patterns. Joins insert all the rows and then calls scan to find them. Aggregation interleaves Find() and Inserts(). We may want to optimize joins more heavily for Inserts() (in particular growing). TODO: Batched interface for inserts and finds. TODO: Do we need to check mem limit exceeded so often. Check once per batch? Control block for a hash table. This class contains the logic as well as the variables needed by a thread to operate on a hash table.
Definition at line 104 of file hash-table.h.
HashTableCtx::HashTableCtx | ( | const std::vector< ExprContext * > & | build_expr_ctxs, |
const std::vector< ExprContext * > & | probe_expr_ctxs, | ||
bool | stores_nulls, | ||
bool | finds_nulls, | ||
int32_t | initial_seed, | ||
int | max_levels, | ||
int | num_build_tuples | ||
) |
Create a hash table context.
Definition at line 83 of file hash-table.cc.
References build_expr_ctxs_, impala::Expr::ComputeResultsLayout(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, probe_expr_ctxs_, results_buffer_size_, SEED_PRIMES, seeds_, and var_result_begin_.
void HashTableCtx::Close | ( | ) |
Call to cleanup any resources.
Definition at line 112 of file hash-table.cc.
References expr_value_null_bits_, expr_values_buffer_, and row_.
Function * HashTableCtx::CodegenEquals | ( | RuntimeState * | state | ) |
Codegen for evaluating a TupleRow and comparing equality against 'expr_values_buffer_'. Function signature matches HashTable::Equals().
Definition at line 820 of file hash-table.cc.
References impala::LlvmCodeGen::FnPrototype::AddArgument(), build_expr_ctxs_, impala::LlvmCodeGen::CastPtrToLlvmPtr(), impala::LlvmCodeGen::context(), impala::CodegenAnyVal::CreateCallWrapped(), impala::CodegenAnyVal::EqToNativePtr(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, impala::LlvmCodeGen::false_value(), impala::LlvmCodeGen::FinalizeFunction(), impala::RuntimeState::GetCodegen(), impala::Status::GetDetail(), impala::LlvmCodeGen::GetIntConstant(), impala::CodegenAnyVal::GetIsNull(), impala::LlvmCodeGen::GetPtrType(), impala::LlvmCodeGen::GetType(), impala::TupleRow::LLVM_CLASS_NAME, impala::ExprContext::LLVM_CLASS_NAME, LLVM_CLASS_NAME, impala::Status::ok(), impala::LlvmCodeGen::ptr_type(), row(), stores_nulls_, impala::LlvmCodeGen::true_value(), impala::TYPE_BOOLEAN, impala::TYPE_CHAR, impala::TYPE_TINYINT, and VLOG_QUERY.
Function * HashTableCtx::CodegenEvalRow | ( | RuntimeState * | state, |
bool | build_row | ||
) |
Codegen for evaluating a tuple row. Codegen'd function matches the signature for EvalBuildRow and EvalTupleRow. If build_row is true, the codegen uses the build_exprs, otherwise the probe_exprs.
Definition at line 519 of file hash-table.cc.
References impala::LlvmCodeGen::FnPrototype::AddArgument(), impala::LlvmCodeGen::boolean_type(), build_expr_ctxs_, impala::LlvmCodeGen::CastPtrToLlvmPtr(), CodegenAssignNullValue(), impala::LlvmCodeGen::context(), impala::CodegenAnyVal::CreateCallWrapped(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, impala::LlvmCodeGen::false_value(), impala::LlvmCodeGen::FinalizeFunction(), impala::RuntimeState::GetCodegen(), impala::Status::GetDetail(), impala::CodegenAnyVal::GetIsNull(), impala::LlvmCodeGen::GetPtrType(), impala::LlvmCodeGen::GetType(), impala::TupleRow::LLVM_CLASS_NAME, impala::ExprContext::LLVM_CLASS_NAME, LLVM_CLASS_NAME, impala::Status::ok(), probe_expr_ctxs_, impala::LlvmCodeGen::ptr_type(), row(), stores_nulls_, impala::CodegenAnyVal::ToNativePtr(), impala::LlvmCodeGen::true_value(), impala::TYPE_BOOLEAN, impala::TYPE_CHAR, impala::TYPE_DECIMAL, impala::TYPE_TIMESTAMP, impala::TYPE_TINYINT, and VLOG_QUERY.
Function * HashTableCtx::CodegenHashCurrentRow | ( | RuntimeState * | state, |
bool | use_murmur | ||
) |
Codegen for hashing the expr values in 'expr_values_buffer_'. Function prototype matches HashCurrentRow identically. Unlike HashCurrentRow(), the returned function only uses a single hash function, rather than switching based on level_. If 'use_murmur' is true, murmur hash is used, otherwise CRC is used if the hardware supports it (see hash-util.h).
Definition at line 647 of file hash-table.cc.
References impala::LlvmCodeGen::FnPrototype::AddArgument(), build_expr_ctxs_, impala::LlvmCodeGen::CastPtrToLlvmPtr(), impala::LlvmCodeGen::context(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, impala::LlvmCodeGen::FinalizeFunction(), impala::RuntimeState::GetCodegen(), impala::LlvmCodeGen::GetFunction(), impala::LlvmCodeGen::GetHashFunction(), impala::LlvmCodeGen::GetIntConstant(), impala::LlvmCodeGen::GetMurmurHashFunction(), impala::LlvmCodeGen::GetPtrType(), impala::LlvmCodeGen::GetType(), LLVM_CLASS_NAME, impala::Status::ok(), impala::LlvmCodeGen::ptr_type(), results_buffer_size_, seed(), stores_nulls_, impala::TYPE_CHAR, impala::TYPE_INT, impala::TYPE_STRING, impala::TYPE_TINYINT, impala::TYPE_VARCHAR, and var_result_begin_.
Returns true if the values of build_exprs evaluated over 'build_row' equal the values cached in expr_values_buffer_. This will be replaced by codegen.
Definition at line 171 of file hash-table.cc.
References build_expr_ctxs_, impala::RawValue::Eq(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, and stores_nulls_.
Referenced by impala::HashTable::Probe().
Evaluate and hash the build/probe row, returning in *hash. Returns false if this row should be rejected (doesn't need to be processed further) because it contains NULL. These need to be inlined in the IR module so we can find and replace the calls to EvalBuildRow()/EvalProbeRow().
Definition at line 23 of file hash-table.inline.h.
References EvalBuildRow(), HashCurrentRow(), and stores_nulls_.
Referenced by impala::HashTableTest::BasicTest(), impala::HashTableTest::GrowTableTest(), impala::HashTableTest::InsertFullTest(), impala::PartitionedAggregationNode::ProcessBatch(), and impala::HashTableTest::ScanTest().
Definition at line 30 of file hash-table.inline.h.
References EvalProbeRow(), finds_nulls_, HashCurrentRow(), and stores_nulls_.
Referenced by impala::HashTableTest::GrowTableTest(), impala::HashTableTest::InsertFullTest(), impala::HashTableTest::ProbeTest(), impala::PartitionedAggregationNode::ProcessBatch(), and impala::PartitionedHashJoinNode::ProcessProbeBatch().
|
inlineprivate |
Evaluate 'row' over build exprs caching the results in 'expr_values_buffer_' This will be replaced by codegen. We do not want this function inlined when cross compiled because we need to be able to differentiate between EvalBuildRow and EvalProbeRow by name and the build/probe exprs are baked into the codegen'd function.
Definition at line 203 of file hash-table.h.
References build_expr_ctxs_, and EvalRow().
Referenced by EvalAndHashBuild().
|
inlineprivate |
Evaluate 'row' over probe exprs caching the results in 'expr_values_buffer_' This will be replaced by codegen.
Definition at line 209 of file hash-table.h.
References EvalRow(), and probe_expr_ctxs_.
Referenced by EvalAndHashProbe().
|
private |
Evaluate the exprs over row and cache the results in 'expr_values_buffer_'. Returns whether any expr evaluated to NULL. This will be replaced by codegen.
Definition at line 124 of file hash-table.cc.
References build_expr_ctxs_, expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, NULL_VALUE, stores_nulls_, and impala::RawValue::Write().
Referenced by EvalBuildRow(), and EvalProbeRow().
|
private |
Cross-compiled functions to access member variables used in CodegenHashCurrentRow().
|
inlineprivate |
Wrapper function for calling correct HashUtil function in non-codegen'd case.
Use CRC hash at first level for better performance. Switch to murmur hash at subsequent levels since CRC doesn't randomize well with different seed inputs.
Definition at line 192 of file hash-table.h.
References impala::HashUtil::Hash(), level_, and impala::HashUtil::MurmurHash2_64().
Referenced by HashCurrentRow(), and HashVariableLenRow().
|
inlineprivate |
Compute the hash of the values in expr_values_buffer_. This will be replaced by codegen. We don't want this inlined for replacing with codegen'd functions so the function name does not change.
This handles NULLs implicitly since a constant seed value was put into results buffer for nulls. TODO: figure out which hash function to use. We need to generate uncorrelated hashes by changing just the seed. CRC does not have this property and FNV is okay. We should switch to something else.
Definition at line 177 of file hash-table.h.
References expr_values_buffer_, Hash(), HashVariableLenRow(), level_, results_buffer_size_, seeds_, and var_result_begin_.
Referenced by EvalAndHashBuild(), and EvalAndHashProbe().
|
private |
Compute the hash of the values in expr_values_buffer_ for rows with variable length fields (e.g. strings).
Definition at line 146 of file hash-table.cc.
References build_expr_ctxs_, expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, impala::hash, Hash(), impala::StringValue::len, level_, impala::StringValue::ptr, seeds_, impala::TYPE_STRING, impala::TYPE_VARCHAR, and var_result_begin_.
Referenced by HashCurrentRow().
|
inline |
Returns the results of the exprs at 'expr_idx' evaluated over the last row processed. This value is invalid if the expr evaluated to NULL. TODO: this is an awkward abstraction but aggregation node can take advantage of it and save some expr evaluation calls.
Definition at line 134 of file hash-table.h.
References expr_values_buffer_, and expr_values_buffer_offsets_.
|
inline |
Returns if the expr at 'expr_idx' evaluated to NULL for the last row.
Definition at line 139 of file hash-table.h.
References expr_value_null_bits_.
|
inline |
Definition at line 124 of file hash-table.h.
References level_.
|
inline |
Definition at line 151 of file hash-table.h.
References results_buffer_size_.
|
inline |
Definition at line 127 of file hash-table.h.
References row_.
Referenced by impala::HashTable::Begin(), CodegenEquals(), CodegenEvalRow(), impala::HashTable::Find(), and impala::HashTable::FirstUnmatched().
|
inline |
Definition at line 125 of file hash-table.h.
References seeds_.
Referenced by impala::PartitionedHashJoinNode::Partition::BuildHashTableInternal(), and CodegenHashCurrentRow().
|
inline |
Definition at line 329 of file hash-table.inline.h.
|
friend |
Definition at line 172 of file hash-table.h.
|
private |
Definition at line 227 of file hash-table.h.
Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalBuildRow(), EvalRow(), HashTableCtx(), and HashVariableLenRow().
|
private |
Use bytes instead of bools to be compatible with llvm. This address must not change once allocated.
Definition at line 263 of file hash-table.h.
Referenced by Close(), CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value_null().
|
private |
Buffer to store evaluated expr results. This address must not change once allocated since the address is baked into the codegen
Definition at line 259 of file hash-table.h.
Referenced by Close(), CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashCurrentRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value().
|
private |
Cache of exprs values for the current row being evaluated. This can either be a build row (during Insert()) or probe row (during Find()).
Definition at line 246 of file hash-table.h.
Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value().
|
private |
Definition at line 235 of file hash-table.h.
Referenced by EvalAndHashProbe().
|
private |
The current level this context is working on. Each level needs to use a different seed.
Definition at line 239 of file hash-table.h.
Referenced by Hash(), HashCurrentRow(), HashVariableLenRow(), and level().
|
static |
Definition at line 169 of file hash-table.h.
Referenced by CodegenEquals(), CodegenEvalRow(), and CodegenHashCurrentRow().
|
private |
Definition at line 228 of file hash-table.h.
Referenced by CodegenEvalRow(), EvalProbeRow(), and HashTableCtx().
|
private |
Byte size of 'expr_values_buffer_'. Never changes once set, can be removed with codegen.
Definition at line 255 of file hash-table.h.
Referenced by CodegenHashCurrentRow(), HashCurrentRow(), HashTableCtx(), and results_buffer_size().
|
private |
Scratch buffer to generate rows on the fly.
Definition at line 266 of file hash-table.h.
Referenced by Close(), impala::HashTable::Probe(), and row().
|
private |
The seeds to use for hashing. Indexed by the level.
Definition at line 242 of file hash-table.h.
Referenced by HashCurrentRow(), HashTableCtx(), HashVariableLenRow(), and seed().
|
private |
Constants on how the hash table should behave. Joins and aggs have slightly different behavior. TODO: these constants are an ideal candidate to be removed with codegen. TODO: ..or with template-ization
Definition at line 234 of file hash-table.h.
Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalAndHashBuild(), EvalAndHashProbe(), and EvalRow().
|
private |
Byte offset into 'expr_values_buffer_' that begins the variable length results. If -1, there are no variable length slots. Never changes once set, can be removed with codegen.
Definition at line 251 of file hash-table.h.
Referenced by CodegenHashCurrentRow(), HashCurrentRow(), HashTableCtx(), and HashVariableLenRow().