#include <hash-table.h>

Collaboration diagram for impala::HashTableCtx:

Public Member Functions
	HashTableCtx (const std::vector< ExprContext * > &build_expr_ctxs, const std::vector< ExprContext * > &probe_expr_ctxs, bool stores_nulls, bool finds_nulls, int32_t initial_seed, int max_levels, int num_build_tuples)

void	Close ()
	Call to cleanup any resources. More...

void	set_level (int level)

int	level () const

uint32_t	seed (int level)

TupleRow *	row () const

void *	last_expr_value (int expr_idx) const

bool	last_expr_value_null (int expr_idx) const
	Returns if the expr at 'expr_idx' evaluated to NULL for the last row. More...

bool IR_ALWAYS_INLINE	EvalAndHashBuild (TupleRow row, uint32_t hash)

bool IR_ALWAYS_INLINE	EvalAndHashProbe (TupleRow row, uint32_t hash)

int	results_buffer_size () const

llvm::Function *	CodegenEvalRow (RuntimeState *state, bool build_row)

llvm::Function *	CodegenEquals (RuntimeState *state)

llvm::Function *	CodegenHashCurrentRow (RuntimeState *state, bool use_murmur)

Static Public Attributes
static const char *	LLVM_CLASS_NAME = "class.impala::HashTableCtx"

Private Member Functions
uint32_t IR_NO_INLINE	HashCurrentRow ()

uint32_t	Hash (const void *input, int len, int32_t hash)
	Wrapper function for calling correct HashUtil function in non-codegen'd case. More...

bool IR_NO_INLINE	EvalBuildRow (TupleRow *row)

bool IR_NO_INLINE	EvalProbeRow (TupleRow *row)

uint32_t	HashVariableLenRow ()

bool	EvalRow (TupleRow row, const std::vector< ExprContext > &ctxs)

bool IR_NO_INLINE	Equals (TupleRow *build_row)

uint32_t	GetHashSeed () const
	Cross-compiled functions to access member variables used in CodegenHashCurrentRow(). More...

Private Attributes
const std::vector< ExprContext * > &	build_expr_ctxs_

const std::vector< ExprContext * > &	probe_expr_ctxs_

const bool	stores_nulls_

const bool	finds_nulls_

int	level_

std::vector< uint32_t >	seeds_
	The seeds to use for hashing. Indexed by the level. More...

std::vector< int >	expr_values_buffer_offsets_

int	var_result_begin_

int	results_buffer_size_

uint8_t *	expr_values_buffer_

uint8_t *	expr_value_null_bits_

TupleRow *	row_
	Scratch buffer to generate rows on the fly. More...

Friends
class	HashTable

Detailed Description

Linear or quadratic probing hash table implementation tailored to the usage pattern for partitioned hash aggregation and hash joins. The hash table stores TupleRows and allows for different exprs for insertions and finds. This is the pattern we use for joins and aggregation where the input/build tuple row descriptor is different from the find/probe descriptor. The implementation is designed to allow codegen for some paths. In addition to the hash table there is also an accompanying hash table context that is used for insertions and probes. For example, the hash table context stores evaluated expr results for the current row being processed when possible into a contiguous memory buffer. This allows for efficient hash computation. The hash table does not support removes. The hash table is not thread safe. The table is optimized for the partition hash aggregation and hash joins and is not intended to be a generic hash table implementation. The API loosely mimics the std::hashset API. The data (rows) are stored in a BufferedTupleStream. The basic data structure of this hash table is a vector of buckets. The buckets (indexed by the mod of the hash) contain a pointer to either the slot in the tuple-stream or in case of duplicate values, to the head of a linked list of nodes that in turn contain a pointer to tuple-stream slots. When inserting an entry we start at the bucket at position (hash % size) and search for either a bucket with the same hash or for an empty bucket. If a bucket with the same hash is found, we then compare for row equality and either insert a duplicate node if the equality is true, or continue the search if the row equality is false. Similarly, when probing we start from the bucket at position (hash % size) and search for an entry with the same hash or for an empty bucket. In the former case, we then check for row equality and continue the search if the row equality is false. In the latter case, the probe is not successful. When growing the hash table, the number of buckets is doubled. We trigger a resize when the fill factor is approx 75%. Due to the doubling nature of the buckets, we require that the number of buckets is a power of 2. This allows us to perform a modulo of the hash using a bitmask. We choose to use linear or quadratic probing because they exhibit good (predictable) cache behavior. We require that the number of buckets is a power of 2. This allows us to determine the bucket index using a bitmask. The first NUM_SMALL_BLOCKS of nodes_ are made of blocks less than the IO size (of 8MB) to reduce the memory footprint of small queries. TODO: Compare linear and quadratic probing and remove the loser. TODO: We currently use 32-bit hashes. There is room in the bucket structure for at least 48-bits. We should exploit this space. TODO: Consider capping the probes with a threshold value. If an insert reaches that threshold it is inserted to another linked list of overflow entries. TODO: Smarter resizes, and perhaps avoid using powers of 2 as the hash table size. TODO: this is not a fancy hash table in terms of memory access patterns (cuckoo-hashing or something that spills to disk). We will likely want to invest more time into this. TODO: hash-join and aggregation have very different access patterns. Joins insert all the rows and then calls scan to find them. Aggregation interleaves Find() and Inserts(). We may want to optimize joins more heavily for Inserts() (in particular growing). TODO: Batched interface for inserts and finds. TODO: Do we need to check mem limit exceeded so often. Check once per batch? Control block for a hash table. This class contains the logic as well as the variables needed by a thread to operate on a hash table.

Definition at line 104 of file hash-table.h.

Constructor & Destructor Documentation

HashTableCtx::HashTableCtx	(	const std::vector< ExprContext * > &	build_expr_ctxs,
		const std::vector< ExprContext * > &	probe_expr_ctxs,
		bool	stores_nulls,
		bool	finds_nulls,
		int32_t	initial_seed,
		int	max_levels,
		int	num_build_tuples
	)

Create a hash table context.

build_exprs are the exprs that should be used to evaluate rows during Insert().
probe_exprs are used during Find()
stores_nulls: if false, TupleRows with nulls are ignored during Insert
finds_nulls: if false, Find() returns End() for TupleRows with nulls even if stores_nulls is true
initial_seed: Initial seed value to use when computing hashes for rows with level 0. Other levels have their seeds derived from this seed.
The max levels we will hash with.

Definition at line 83 of file hash-table.cc.

References build_expr_ctxs_, impala::Expr::ComputeResultsLayout(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, probe_expr_ctxs_, results_buffer_size_, SEED_PRIMES, seeds_, and var_result_begin_.

Member Function Documentation

void HashTableCtx::Close ( )

Call to cleanup any resources.

Definition at line 112 of file hash-table.cc.

References expr_value_null_bits_, expr_values_buffer_, and row_.

Function * HashTableCtx::CodegenEquals ( RuntimeState * state )

Codegen for evaluating a TupleRow and comparing equality against 'expr_values_buffer_'. Function signature matches HashTable::Equals().

Definition at line 820 of file hash-table.cc.

Function * HashTableCtx::CodegenEvalRow	(	RuntimeState *	state,
		bool	build_row
	)

Codegen for evaluating a tuple row. Codegen'd function matches the signature for EvalBuildRow and EvalTupleRow. If build_row is true, the codegen uses the build_exprs, otherwise the probe_exprs.

Definition at line 519 of file hash-table.cc.

Function * HashTableCtx::CodegenHashCurrentRow	(	RuntimeState *	state,
		bool	use_murmur
	)

Codegen for hashing the expr values in 'expr_values_buffer_'. Function prototype matches HashCurrentRow identically. Unlike HashCurrentRow(), the returned function only uses a single hash function, rather than switching based on level_. If 'use_murmur' is true, murmur hash is used, otherwise CRC is used if the hardware supports it (see hash-util.h).

Definition at line 647 of file hash-table.cc.

bool HashTableCtx::Equals ( TupleRow * build_row )

private

Returns true if the values of build_exprs evaluated over 'build_row' equal the values cached in expr_values_buffer_. This will be replaced by codegen.

Definition at line 171 of file hash-table.cc.

References build_expr_ctxs_, impala::RawValue::Eq(), expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, and stores_nulls_.

Referenced by impala::HashTable::Probe().

bool impala::HashTableCtx::EvalAndHashBuild	(	TupleRow *	row,
		uint32_t *	hash
	)

inline

Evaluate and hash the build/probe row, returning in *hash. Returns false if this row should be rejected (doesn't need to be processed further) because it contains NULL. These need to be inlined in the IR module so we can find and replace the calls to EvalBuildRow()/EvalProbeRow().

Definition at line 23 of file hash-table.inline.h.

References EvalBuildRow(), HashCurrentRow(), and stores_nulls_.

Referenced by impala::HashTableTest::BasicTest(), impala::HashTableTest::GrowTableTest(), impala::HashTableTest::InsertFullTest(), impala::PartitionedAggregationNode::ProcessBatch(), and impala::HashTableTest::ScanTest().

bool impala::HashTableCtx::EvalAndHashProbe	(	TupleRow *	row,
		uint32_t *	hash
	)

inline

Definition at line 30 of file hash-table.inline.h.

References EvalProbeRow(), finds_nulls_, HashCurrentRow(), and stores_nulls_.

Referenced by impala::HashTableTest::GrowTableTest(), impala::HashTableTest::InsertFullTest(), impala::HashTableTest::ProbeTest(), impala::PartitionedAggregationNode::ProcessBatch(), and impala::PartitionedHashJoinNode::ProcessProbeBatch().

bool IR_NO_INLINE impala::HashTableCtx::EvalBuildRow ( TupleRow * row )

inlineprivate

Evaluate 'row' over build exprs caching the results in 'expr_values_buffer_' This will be replaced by codegen. We do not want this function inlined when cross compiled because we need to be able to differentiate between EvalBuildRow and EvalProbeRow by name and the build/probe exprs are baked into the codegen'd function.

Definition at line 203 of file hash-table.h.

References build_expr_ctxs_, and EvalRow().

Referenced by EvalAndHashBuild().

bool IR_NO_INLINE impala::HashTableCtx::EvalProbeRow ( TupleRow * row )

inlineprivate

Evaluate 'row' over probe exprs caching the results in 'expr_values_buffer_' This will be replaced by codegen.

Definition at line 209 of file hash-table.h.

References EvalRow(), and probe_expr_ctxs_.

Referenced by EvalAndHashProbe().

bool HashTableCtx::EvalRow	(	TupleRow *	row,
		const std::vector< ExprContext * > &	ctxs
	)

private

Evaluate the exprs over row and cache the results in 'expr_values_buffer_'. Returns whether any expr evaluated to NULL. This will be replaced by codegen.

Definition at line 124 of file hash-table.cc.

References build_expr_ctxs_, expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, NULL_VALUE, stores_nulls_, and impala::RawValue::Write().

Referenced by EvalBuildRow(), and EvalProbeRow().

uint32_t impala::HashTableCtx::GetHashSeed ( ) const

private

Cross-compiled functions to access member variables used in CodegenHashCurrentRow().

uint32_t impala::HashTableCtx::Hash	(	const void *	input,
		int	len,
		int32_t	hash
	)

inlineprivate

Wrapper function for calling correct HashUtil function in non-codegen'd case.

Use CRC hash at first level for better performance. Switch to murmur hash at subsequent levels since CRC doesn't randomize well with different seed inputs.

Definition at line 192 of file hash-table.h.

References impala::HashUtil::Hash(), level_, and impala::HashUtil::MurmurHash2_64().

Referenced by HashCurrentRow(), and HashVariableLenRow().

uint32_t IR_NO_INLINE impala::HashTableCtx::HashCurrentRow ( )

inlineprivate

Compute the hash of the values in expr_values_buffer_. This will be replaced by codegen. We don't want this inlined for replacing with codegen'd functions so the function name does not change.

This handles NULLs implicitly since a constant seed value was put into results buffer for nulls. TODO: figure out which hash function to use. We need to generate uncorrelated hashes by changing just the seed. CRC does not have this property and FNV is okay. We should switch to something else.

Definition at line 177 of file hash-table.h.

References expr_values_buffer_, Hash(), HashVariableLenRow(), level_, results_buffer_size_, seeds_, and var_result_begin_.

Referenced by EvalAndHashBuild(), and EvalAndHashProbe().

uint32_t HashTableCtx::HashVariableLenRow ( )

private

Compute the hash of the values in expr_values_buffer_ for rows with variable length fields (e.g. strings).

Definition at line 146 of file hash-table.cc.

References build_expr_ctxs_, expr_value_null_bits_, expr_values_buffer_, expr_values_buffer_offsets_, impala::hash, Hash(), impala::StringValue::len, level_, impala::StringValue::ptr, seeds_, impala::TYPE_STRING, impala::TYPE_VARCHAR, and var_result_begin_.

Referenced by HashCurrentRow().

void* impala::HashTableCtx::last_expr_value ( int expr_idx ) const

inline

Returns the results of the exprs at 'expr_idx' evaluated over the last row processed. This value is invalid if the expr evaluated to NULL. TODO: this is an awkward abstraction but aggregation node can take advantage of it and save some expr evaluation calls.

Definition at line 134 of file hash-table.h.

References expr_values_buffer_, and expr_values_buffer_offsets_.

bool impala::HashTableCtx::last_expr_value_null ( int expr_idx ) const

inline

Returns if the expr at 'expr_idx' evaluated to NULL for the last row.

Definition at line 139 of file hash-table.h.

References expr_value_null_bits_.

int impala::HashTableCtx::level ( ) const

inline

Definition at line 124 of file hash-table.h.

References level_.

int impala::HashTableCtx::results_buffer_size ( ) const

inline

Definition at line 151 of file hash-table.h.

References results_buffer_size_.

TupleRow* impala::HashTableCtx::row ( ) const

inline

Definition at line 127 of file hash-table.h.

References row_.

Referenced by impala::HashTable::Begin(), CodegenEquals(), CodegenEvalRow(), impala::HashTable::Find(), and impala::HashTable::FirstUnmatched().

uint32_t impala::HashTableCtx::seed ( int level )

inline

Definition at line 125 of file hash-table.h.

References seeds_.

Referenced by impala::PartitionedHashJoinNode::Partition::BuildHashTableInternal(), and CodegenHashCurrentRow().

void impala::HashTableCtx::set_level ( int level )

inline

Definition at line 329 of file hash-table.inline.h.

Friends And Related Function Documentation

friend class HashTable

friend

Definition at line 172 of file hash-table.h.

Member Data Documentation

const std::vector<ExprContext*>& impala::HashTableCtx::build_expr_ctxs_

private

Definition at line 227 of file hash-table.h.

Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalBuildRow(), EvalRow(), HashTableCtx(), and HashVariableLenRow().

uint8_t* impala::HashTableCtx::expr_value_null_bits_

private

Use bytes instead of bools to be compatible with llvm. This address must not change once allocated.

Definition at line 263 of file hash-table.h.

Referenced by Close(), CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value_null().

uint8_t* impala::HashTableCtx::expr_values_buffer_

private

Buffer to store evaluated expr results. This address must not change once allocated since the address is baked into the codegen

Definition at line 259 of file hash-table.h.

Referenced by Close(), CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashCurrentRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value().

std::vector<int> impala::HashTableCtx::expr_values_buffer_offsets_

private

Cache of exprs values for the current row being evaluated. This can either be a build row (during Insert()) or probe row (during Find()).

Definition at line 246 of file hash-table.h.

Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalRow(), HashTableCtx(), HashVariableLenRow(), and last_expr_value().

const bool impala::HashTableCtx::finds_nulls_

private

Definition at line 235 of file hash-table.h.

Referenced by EvalAndHashProbe().

int impala::HashTableCtx::level_

private

The current level this context is working on. Each level needs to use a different seed.

Definition at line 239 of file hash-table.h.

Referenced by Hash(), HashCurrentRow(), HashVariableLenRow(), and level().

const char * HashTableCtx::LLVM_CLASS_NAME = "class.impala::HashTableCtx"

static

Definition at line 169 of file hash-table.h.

Referenced by CodegenEquals(), CodegenEvalRow(), and CodegenHashCurrentRow().

const std::vector<ExprContext*>& impala::HashTableCtx::probe_expr_ctxs_

private

Definition at line 228 of file hash-table.h.

Referenced by CodegenEvalRow(), EvalProbeRow(), and HashTableCtx().

int impala::HashTableCtx::results_buffer_size_

private

Byte size of 'expr_values_buffer_'. Never changes once set, can be removed with codegen.

Definition at line 255 of file hash-table.h.

Referenced by CodegenHashCurrentRow(), HashCurrentRow(), HashTableCtx(), and results_buffer_size().

TupleRow* impala::HashTableCtx::row_

private

Scratch buffer to generate rows on the fly.

Definition at line 266 of file hash-table.h.

Referenced by Close(), impala::HashTable::Probe(), and row().

std::vector<uint32_t> impala::HashTableCtx::seeds_

private

The seeds to use for hashing. Indexed by the level.

Definition at line 242 of file hash-table.h.

Referenced by HashCurrentRow(), HashTableCtx(), HashVariableLenRow(), and seed().

const bool impala::HashTableCtx::stores_nulls_

private

Constants on how the hash table should behave. Joins and aggs have slightly different behavior. TODO: these constants are an ideal candidate to be removed with codegen. TODO: ..or with template-ization

Definition at line 234 of file hash-table.h.

Referenced by CodegenEquals(), CodegenEvalRow(), CodegenHashCurrentRow(), Equals(), EvalAndHashBuild(), EvalAndHashProbe(), and EvalRow().

int impala::HashTableCtx::var_result_begin_

private

Byte offset into 'expr_values_buffer_' that begins the variable length results. If -1, there are no variable length slots. Never changes once set, can be removed with codegen.

Definition at line 251 of file hash-table.h.

Referenced by CodegenHashCurrentRow(), HashCurrentRow(), HashTableCtx(), and HashVariableLenRow().

The documentation for this class was generated from the following files:

be/src/exec/hash-table.h
be/src/exec/hash-table.cc
be/src/exec/hash-table.inline.h

Public Member Functions

Static Public Attributes

Private Member Functions

Private Attributes

Friends

Detailed Description

Constructor & Destructor Documentation

Member Function Documentation

Friends And Related Function Documentation

Member Data Documentation