Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros
impala::HashTableCtx Class Reference

#include <hash-table.h>

Collaboration diagram for impala::HashTableCtx:

Public Member Functions

 HashTableCtx (const std::vector< ExprContext * > &build_expr_ctxs, const std::vector< ExprContext * > &probe_expr_ctxs, bool stores_nulls, bool finds_nulls, int32_t initial_seed, int max_levels, int num_build_tuples)
 
void Close ()
 Call to cleanup any resources. More...
 
void set_level (int level)
 
int level () const
 
uint32_t seed (int level)
 
TupleRowrow () const
 
void * last_expr_value (int expr_idx) const
 
bool last_expr_value_null (int expr_idx) const
 Returns if the expr at 'expr_idx' evaluated to NULL for the last row. More...
 
bool IR_ALWAYS_INLINE EvalAndHashBuild (TupleRow *row, uint32_t *hash)
 
bool IR_ALWAYS_INLINE EvalAndHashProbe (TupleRow *row, uint32_t *hash)
 
int results_buffer_size () const
 
llvm::Function * CodegenEvalRow (RuntimeState *state, bool build_row)
 
llvm::Function * CodegenEquals (RuntimeState *state)
 
llvm::Function * CodegenHashCurrentRow (RuntimeState *state, bool use_murmur)
 

Static Public Attributes

static const char * LLVM_CLASS_NAME = "class.impala::HashTableCtx"
 

Private Member Functions

uint32_t IR_NO_INLINE HashCurrentRow ()
 
uint32_t Hash (const void *input, int len, int32_t hash)
 Wrapper function for calling correct HashUtil function in non-codegen'd case. More...
 
bool IR_NO_INLINE EvalBuildRow (TupleRow *row)
 
bool IR_NO_INLINE EvalProbeRow (TupleRow *row)
 
uint32_t HashVariableLenRow ()
 
bool EvalRow (TupleRow *row, const std::vector< ExprContext * > &ctxs)
 
bool IR_NO_INLINE Equals (TupleRow *build_row)
 
uint32_t GetHashSeed () const
 Cross-compiled functions to access member variables used in CodegenHashCurrentRow(). More...
 

Private Attributes

const std::vector< ExprContext * > & build_expr_ctxs_
 
const std::vector< ExprContext * > & probe_expr_ctxs_
 
const bool stores_nulls_
 
const bool finds_nulls_
 
int level_
 
std::vector< uint32_t > seeds_
 The seeds to use for hashing. Indexed by the level. More...
 
std::vector< int > expr_values_buffer_offsets_
 
int var_result_begin_
 
int results_buffer_size_
 
uint8_t * expr_values_buffer_
 
uint8_t * expr_value_null_bits_
 
TupleRowrow_
 Scratch buffer to generate rows on the fly. More...
 

Friends

class HashTable
 

Detailed Description

Linear or quadratic probing hash table implementation tailored to the usage pattern for partitioned hash aggregation and hash joins. The hash table stores TupleRows and allows for different exprs for insertions and finds. This is the pattern we use for joins and aggregation where the input/build tuple row descriptor is different from the find/probe descriptor. The implementation is designed to allow codegen for some paths. In addition to the hash table there is also an accompanying hash table context that is used for insertions and probes. For example, the hash table context stores evaluated expr results for the current row being processed when possible into a contiguous memory buffer. This allows for efficient hash computation. The hash table does not support removes. The hash table is not thread safe. The table is optimized for the partition hash aggregation and hash joins and is not intended to be a generic hash table implementation. The API loosely mimics the std::hashset API. The data (rows) are stored in a BufferedTupleStream. The basic data structure of this hash table is a vector of buckets. The buckets (indexed by the mod of the hash) contain a pointer to either the slot in the tuple-stream or in case of duplicate values, to the head of a linked list of nodes that in turn contain a pointer to tuple-stream slots. When inserting an entry we start at the bucket at position (hash % size) and search for either a bucket with the same hash or for an empty bucket. If a bucket with the same hash is found, we then compare for row equality and either insert a duplicate node if the equality is true, or continue the search if the row equality is false. Similarly, when probing we start from the bucket at position (hash % size) and search for an entry with the same hash or for an empty bucket. In the former case, we then check for row equality and continue the search if the row equality is false. In the latter case, the probe is not successful. When growing the hash table, the number of buckets is doubled. We trigger a resize when the fill factor is approx 75%. Due to the doubling nature of the buckets, we require that the number of buckets is a power of 2. This allows us to perform a modulo of the hash using a bitmask. We choose to use linear or quadratic probing because they exhibit good (predictable) cache behavior. We require that the number of buckets is a power of 2. This allows us to determine the bucket index using a bitmask. The first NUM_SMALL_BLOCKS of nodes_ are made of blocks less than the IO size (of 8MB) to reduce the memory footprint of small queries. TODO: Compare linear and quadratic probing and remove the loser. TODO: We currently use 32-bit hashes. There is room in the bucket structure for at least 48-bits. We should exploit this space. TODO: Consider capping the probes with a threshold value. If an insert reaches that threshold it is inserted to another linked list of overflow entries. TODO: Smarter resizes, and perhaps avoid using powers of 2 as the hash table size. TODO: this is not a fancy hash table in terms of memory access patterns (cuckoo-hashing or something that spills to disk). We will likely want to invest more time into this. TODO: hash-join and aggregation have very different access patterns. Joins insert all the rows and then calls scan to find them. Aggregation interleaves Find() and Inserts(). We may want to optimize joins more heavily for Inserts() (in particular growing). TODO: Batched interface for inserts and finds. TODO: Do we need to check mem limit exceeded so often. Check once per batch? Control block for a hash table. This class contains the logic as well as the variables needed by a thread to operate on a hash table.

Definition at line 104 of file hash-table.h.

Constructor & Destructor Documentation

HashTableCtx::HashTableCtx ( const std::vector< ExprContext * > &  build_expr_ctxs,
const std::vector< ExprContext * > &  probe_expr_ctxs,
bool  stores_nulls,
bool  finds_nulls,
int32_t  initial_seed,
int  max_levels,
int  num_build_tuples 
)

Create a hash table context.

  • build_exprs are the exprs that should be used to evaluate rows during Insert().
  • probe_exprs are used during Find()
  • stores_nulls: if false, TupleRows with nulls are ignored during Insert
  • finds_nulls: if false, Find() returns End() for TupleRows with nulls even if stores_nulls is true
  • initial_seed: Initial seed value to use when computing hashes for rows with level 0. Other levels have their seeds derived from this seed.
  • The max levels we will hash with.

Definition at line 83 of file hash-table.cc.

References build_expr_ctxs_, impala::Expr::ComputeResultsLayout(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, probe_expr_ctxs_, results_buffer_size_, SEED_PRIMES, seeds_, and var_result_begin_.

Member Function Documentation

void HashTableCtx::Close ( )

Call to cleanup any resources.

Definition at line 112 of file hash-table.cc.

References expr_value_null_bits_, expr_values_buffer_, and row_.

bool HashTableCtx::Equals ( TupleRow build_row)
private

Returns true if the values of build_exprs evaluated over 'build_row' equal the values cached in expr_values_buffer_. This will be replaced by codegen.

Definition at line 171 of file hash-table.cc.

References build_expr_ctxs_, impala::RawValue::Eq(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, and stores_nulls_.

Referenced by impala::HashTable::Probe().

bool impala::HashTableCtx::EvalAndHashBuild ( TupleRow row,
uint32_t *  hash 
)
inline

Evaluate and hash the build/probe row, returning in *hash. Returns false if this row should be rejected (doesn't need to be processed further) because it contains NULL. These need to be inlined in the IR module so we can find and replace the calls to EvalBuildRow()/EvalProbeRow().

Definition at line 23 of file hash-table.inline.h.

References EvalBuildRow(), HashCurrentRow(), and stores_nulls_.

Referenced by impala::HashTableTest::BasicTest(), impala::HashTableTest::GrowTableTest(), impala::HashTableTest::InsertFullTest(), impala::PartitionedAggregationNode::ProcessBatch(), and impala::HashTableTest::ScanTest().

bool IR_NO_INLINE impala::HashTableCtx::EvalBuildRow ( TupleRow row)
inlineprivate

Evaluate 'row' over build exprs caching the results in 'expr_values_buffer_' This will be replaced by codegen. We do not want this function inlined when cross compiled because we need to be able to differentiate between EvalBuildRow and EvalProbeRow by name and the build/probe exprs are baked into the codegen'd function.

Definition at line 203 of file hash-table.h.

References build_expr_ctxs_, and EvalRow().

Referenced by EvalAndHashBuild().

bool IR_NO_INLINE impala::HashTableCtx::EvalProbeRow ( TupleRow row)
inlineprivate

Evaluate 'row' over probe exprs caching the results in 'expr_values_buffer_' This will be replaced by codegen.

Definition at line 209 of file hash-table.h.

References EvalRow(), and probe_expr_ctxs_.

Referenced by EvalAndHashProbe().

bool HashTableCtx::EvalRow ( TupleRow row,
const std::vector< ExprContext * > &  ctxs 
)
private

Evaluate the exprs over row and cache the results in 'expr_values_buffer_'. Returns whether any expr evaluated to NULL. This will be replaced by codegen.

Definition at line 124 of file hash-table.cc.

References build_expr_ctxs_, expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, NULL_VALUE, stores_nulls_, and impala::RawValue::Write().

Referenced by EvalBuildRow(), and EvalProbeRow().

uint32_t impala::HashTableCtx::GetHashSeed ( ) const
private

Cross-compiled functions to access member variables used in CodegenHashCurrentRow().

uint32_t impala::HashTableCtx::Hash ( const void *  input,
int  len,
int32_t  hash 
)
inlineprivate

Wrapper function for calling correct HashUtil function in non-codegen'd case.

Use CRC hash at first level for better performance. Switch to murmur hash at subsequent levels since CRC doesn't randomize well with different seed inputs.

Definition at line 192 of file hash-table.h.

References impala::HashUtil::Hash(), level_, and impala::HashUtil::MurmurHash2_64().

Referenced by HashCurrentRow(), and HashVariableLenRow().

uint32_t IR_NO_INLINE impala::HashTableCtx::HashCurrentRow ( )
inlineprivate

Compute the hash of the values in expr_values_buffer_. This will be replaced by codegen. We don't want this inlined for replacing with codegen'd functions so the function name does not change.

This handles NULLs implicitly since a constant seed value was put into results buffer for nulls. TODO: figure out which hash function to use. We need to generate uncorrelated hashes by changing just the seed. CRC does not have this property and FNV is okay. We should switch to something else.

Definition at line 177 of file hash-table.h.

References expr_values_buffer_, Hash(), HashVariableLenRow(), level_, results_buffer_size_, seeds_, and var_result_begin_.

Referenced by EvalAndHashBuild(), and EvalAndHashProbe().

uint32_t HashTableCtx::HashVariableLenRow ( )
private

Compute the hash of the values in expr_values_buffer_ for rows with variable length fields (e.g. strings).

Definition at line 146 of file hash-table.cc.

References build_expr_ctxs_, expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, impala::hash, Hash(), impala::StringValue::len, level_, impala::StringValue::ptr, seeds_, impala::TYPE_STRING, impala::TYPE_VARCHAR, and var_result_begin_.

Referenced by HashCurrentRow().

void* impala::HashTableCtx::last_expr_value ( int  expr_idx) const
inline

Returns the results of the exprs at 'expr_idx' evaluated over the last row processed. This value is invalid if the expr evaluated to NULL. TODO: this is an awkward abstraction but aggregation node can take advantage of it and save some expr evaluation calls.

Definition at line 134 of file hash-table.h.

References expr_values_buffer_, and expr_values_buffer_offsets_.

bool impala::HashTableCtx::last_expr_value_null ( int  expr_idx) const
inline

Returns if the expr at 'expr_idx' evaluated to NULL for the last row.

Definition at line 139 of file hash-table.h.

References expr_value_null_bits_.

int impala::HashTableCtx::level ( ) const
inline

Definition at line 124 of file hash-table.h.

References level_.

int impala::HashTableCtx::results_buffer_size ( ) const
inline

Definition at line 151 of file hash-table.h.

References results_buffer_size_.

TupleRow* impala::HashTableCtx::row ( ) const
inline
uint32_t impala::HashTableCtx::seed ( int  level)
inline
void impala::HashTableCtx::set_level ( int  level)
inline

Definition at line 329 of file hash-table.inline.h.

Friends And Related Function Documentation

friend class HashTable
friend

Definition at line 172 of file hash-table.h.

Member Data Documentation

const std::vector<ExprContext*>& impala::HashTableCtx::build_expr_ctxs_
private
uint8_t* impala::HashTableCtx::expr_value_null_bits_
private

Use bytes instead of bools to be compatible with llvm. This address must not change once allocated.

Definition at line 263 of file hash-table.h.

Referenced by Close(), CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value_null().

uint8_t* impala::HashTableCtx::expr_values_buffer_
private

Buffer to store evaluated expr results. This address must not change once allocated since the address is baked into the codegen

Definition at line 259 of file hash-table.h.

Referenced by Close(), CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashCurrentRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value().

std::vector<int> impala::HashTableCtx::expr_values_buffer_offsets_
private

Cache of exprs values for the current row being evaluated. This can either be a build row (during Insert()) or probe row (during Find()).

Definition at line 246 of file hash-table.h.

Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value().

const bool impala::HashTableCtx::finds_nulls_
private

Definition at line 235 of file hash-table.h.

Referenced by EvalAndHashProbe().

int impala::HashTableCtx::level_
private

The current level this context is working on. Each level needs to use a different seed.

Definition at line 239 of file hash-table.h.

Referenced by Hash(), HashCurrentRow(), HashVariableLenRow(), and level().

const char * HashTableCtx::LLVM_CLASS_NAME = "class.impala::HashTableCtx"
static

Definition at line 169 of file hash-table.h.

Referenced by CodegenEquals(), CodegenEvalRow(), and CodegenHashCurrentRow().

const std::vector<ExprContext*>& impala::HashTableCtx::probe_expr_ctxs_
private

Definition at line 228 of file hash-table.h.

Referenced by CodegenEvalRow(), EvalProbeRow(), and HashTableCtx().

int impala::HashTableCtx::results_buffer_size_
private

Byte size of 'expr_values_buffer_'. Never changes once set, can be removed with codegen.

Definition at line 255 of file hash-table.h.

Referenced by CodegenHashCurrentRow(), HashCurrentRow(), HashTableCtx(), and results_buffer_size().

TupleRow* impala::HashTableCtx::row_
private

Scratch buffer to generate rows on the fly.

Definition at line 266 of file hash-table.h.

Referenced by Close(), impala::HashTable::Probe(), and row().

std::vector<uint32_t> impala::HashTableCtx::seeds_
private

The seeds to use for hashing. Indexed by the level.

Definition at line 242 of file hash-table.h.

Referenced by HashCurrentRow(), HashTableCtx(), HashVariableLenRow(), and seed().

const bool impala::HashTableCtx::stores_nulls_
private

Constants on how the hash table should behave. Joins and aggs have slightly different behavior. TODO: these constants are an ideal candidate to be removed with codegen. TODO: ..or with template-ization

Definition at line 234 of file hash-table.h.

Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalAndHashBuild(), EvalAndHashProbe(), and EvalRow().

int impala::HashTableCtx::var_result_begin_
private

Byte offset into 'expr_values_buffer_' that begins the variable length results. If -1, there are no variable length slots. Never changes once set, can be removed with codegen.

Definition at line 251 of file hash-table.h.

Referenced by CodegenHashCurrentRow(), HashCurrentRow(), HashTableCtx(), and HashVariableLenRow().


The documentation for this class was generated from the following files: