#include <hdfs-scanner.h>

Public Member Functions
	HdfsScanner (HdfsScanNode *scan_node, RuntimeState *state)

virtual	~HdfsScanner ()

virtual Status	Prepare (ScannerContext *context)
	One-time initialisation of state that is constant across scan ranges.

virtual Status	ProcessSplit ()=0

virtual void	Close ()

Static Public Attributes
static const int	FILE_BLOCK_SIZE = 4096

static const char *	LLVM_CLASS_NAME = "class.impala::HdfsScanner"

Protected Types
typedef int(*	WriteTuplesFn )(HdfsScanner *, MemPool *, TupleRow *, int, FieldLocation *, int, int, int, int)

Protected Member Functions
Status	InitializeWriteTuplesFn (HdfsPartitionDescriptor *partition, THdfsFileFormat::type type, const std::string &scanner_name)

void	StartNewRowBatch ()
	Set batch_ to a new row batch and update tuple_mem_ accordingly.

virtual Status	InitNewRange ()=0
	Reset internal state for a new scan range.

int	GetMemory (MemPool **pool, Tuple **tuple_mem, TupleRow **tuple_row_mem)

Status	CommitRows (int num_rows)

void	AddFinalRowBatch ()

void	AttachPool (MemPool *pool, bool commit_batch)

bool IR_ALWAYS_INLINE	EvalConjuncts (TupleRow *row)

int	WriteEmptyTuples (RowBatch *row_batch, int num_tuples)

int	WriteEmptyTuples (ScannerContext *context, TupleRow *tuple_row, int num_tuples)
	Write empty tuples and commit them to the context object.

int	WriteAlignedTuples (MemPool *pool, TupleRow *tuple_row_mem, int row_size, FieldLocation *fields, int num_tuples, int max_added_tuples, int slots_per_tuple, int row_start_indx)

Status	UpdateDecompressor (const THdfsCompression::type &compression)

Status	UpdateDecompressor (const std::string &codec)

bool	ReportTupleParseError (FieldLocation *fields, uint8_t *errors, int row_idx)

virtual void	LogRowParseError (int row_idx, std::stringstream *)

bool	WriteCompleteTuple (MemPool *pool, FieldLocation *fields, Tuple *tuple, TupleRow *tuple_row, Tuple *template_tuple, uint8_t *error_fields, uint8_t *error_in_row)

void	ReportColumnParseError (const SlotDescriptor *desc, const char *data, int len)

void	InitTuple (Tuple *template_tuple, Tuple *tuple)

Tuple *	next_tuple (Tuple *t) const

TupleRow *	next_row (TupleRow *r) const

ExprContext *	GetConjunctCtx (int idx) const

Static Protected Member Functions
static llvm::Function *	CodegenWriteCompleteTuple (HdfsScanNode *, LlvmCodeGen *, const std::vector< ExprContext * > &conjunct_ctxs)

static llvm::Function *	CodegenWriteAlignedTuples (HdfsScanNode *, LlvmCodeGen *, llvm::Function *write_tuple_fn)

Protected Attributes
HdfsScanNode *	scan_node_
	The scan node that started this scanner.

RuntimeState *	state_
	RuntimeState for error reporting.

ScannerContext *	context_
	Context for this scanner.

ScannerContext::Stream *	stream_
	The first stream for context_.

std::vector< ExprContext * >	conjunct_ctxs_

Tuple *	template_tuple_

int	tuple_byte_size_
	Fixed size of each tuple, in bytes.

Tuple *	tuple_
	Current tuple pointer into tuple_mem_.

RowBatch *	batch_

uint8_t *	tuple_mem_
	The tuple memory of batch_.

int	num_errors_in_file_
	Number of errors in the current file.

boost::scoped_ptr< TextConverter >	text_converter_
	Helper class for converting text to other types.

int32_t	num_null_bytes_
	Number of null bytes in the tuple.

Status	parse_status_

boost::scoped_ptr< Codec >	decompressor_
	Decompressor class to use, if any.

THdfsCompression::type	decompression_type_
	The most recently used decompression type.

boost::scoped_ptr< MemPool >	data_buffer_pool_

RuntimeProfile::Counter *	decompress_timer_
	Time spent decompressing bytes.

WriteTuplesFn	write_tuples_fn_
	Jitted write tuples function pointer. Null if codegen is disabled.

Detailed Description

HdfsScanner is the superclass for the different HDFS file format parsers. One scanner object is created for each split, and each scanner is driven by a different thread created by the scan node. The scan node calls, in order:

Prepare
ProcessSplit
Close

ProcessSplit does not return until the split is complete or an error has occurred. The HdfsScanner works in tandem with the ScannerContext to interleave IO and parsing. If a split is compressed, a decompressor is created, either during Prepare() or at the beginning of ProcessSplit(), and used for decompressing and reading the split.

For codegen, the implementation is split into two parts:

During the Prepare() phase of the ScanNode, the scanner subclass's static Codegen() function is called to perform codegen for that scanner type for the specific tuple desc. The codegen'd function is cached in the HdfsScanNode.
During the GetNext() phase (where we create one Scanner for each scan range), the created scanner subclass can retrieve, from the scan node, the codegen'd function to use.

This way, we only codegen once per scanner type, rather than once per scanner object.

This class also encapsulates row batch management. Subclasses should call CommitRows() after writing to the current row batch, which handles creating row batches, attaching resources (IO buffers and mem pools) to the current row batch, and passing row batches up to the scan node. Subclasses can also use GetMemory() to help with per-row memory management.
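The Prepare / ProcessSplit / Close sequence described above can be sketched roughly as follows. This is only an illustration: `ToyStatus`, `ToyScanner`, `RecordingScanner` and `RunScanner` are hypothetical stand-ins, not Impala classes, and calling Close() even after a failure is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for impala::Status: an ok flag plus a message.
struct ToyStatus {
  bool ok = true;
  std::string msg;
  static ToyStatus OK() { return ToyStatus(); }
  static ToyStatus Error(const std::string& m) {
    ToyStatus s;
    s.ok = false;
    s.msg = m;
    return s;
  }
};

// Hypothetical, simplified scanner interface mirroring the three calls.
class ToyScanner {
 public:
  virtual ~ToyScanner() {}
  virtual ToyStatus Prepare() { return ToyStatus::OK(); }
  virtual ToyStatus ProcessSplit() = 0;
  virtual void Close() {}
};

// Toy subclass that records the order in which it is driven.
class RecordingScanner : public ToyScanner {
 public:
  explicit RecordingScanner(bool fail) : fail_(fail) {}
  ToyStatus Prepare() override {
    calls.push_back("Prepare");
    return ToyStatus::OK();
  }
  ToyStatus ProcessSplit() override {
    calls.push_back("ProcessSplit");
    return fail_ ? ToyStatus::Error("parse error") : ToyStatus::OK();
  }
  void Close() override { calls.push_back("Close"); }

  std::vector<std::string> calls;

 private:
  bool fail_;
};

// What a scanner thread might do for one split. Close() is called on
// both the success and the error path (an assumption of this sketch).
inline ToyStatus RunScanner(ToyScanner* scanner) {
  ToyStatus status = scanner->Prepare();
  if (status.ok) status = scanner->ProcessSplit();
  scanner->Close();
  return status;
}
```

In this sketch the driver owns the call order, so a subclass only has to supply ProcessSplit() and can rely on Close() as its last chance to release resources.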

Definition at line 91 of file hdfs-scanner.h.

Member Typedef Documentation

typedef int(* impala::HdfsScanner::WriteTuplesFn)(HdfsScanner *, MemPool *, TupleRow *, int, FieldLocation *, int, int, int, int)

protected

Matching typedef for WriteAlignedTuples for codegen. Refer to comments for that function.

Definition at line 212 of file hdfs-scanner.h.

Constructor & Destructor Documentation

HdfsScanner::HdfsScanner	(	HdfsScanNode *	scan_node,
		RuntimeState *	state
	)

Definition at line 53 of file hdfs-scanner.cc.

HdfsScanner::~HdfsScanner ( )

virtual

Definition at line 67 of file hdfs-scanner.cc.

References batch_.

Member Function Documentation

void HdfsScanner::AddFinalRowBatch ( )

protected

Attach all remaining resources from context_ to batch_ and send batch_ to the scan node. This must be called after all rows have been committed and no further resources are needed from context_ (in practice this will happen in each scanner subclass's Close() implementation).

Definition at line 145 of file hdfs-scanner.cc.

References impala::HdfsScanNode::AddMaterializedRowBatch(), batch_, context_, impala::ScannerContext::ReleaseCompletedResources(), and scan_node_.

Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), and impala::HdfsParquetScanner::Close().

void impala::HdfsScanner::AttachPool	(	MemPool *	pool,
		bool	commit_batch
	)

inline protected

Release all memory in 'pool' to batch_. If commit_batch is true, the row batch will be committed. commit_batch should be true if the attached pool is expected to be non-trivial (e.g. a decompression buffer) to minimize scanner mem usage.
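The ownership transfer described above can be pictured with a minimal sketch. `ToyPool`, `ToyBatch` and `AttachPoolToy` are hypothetical; the real code uses MemPool::AcquireData() and RowBatch::tuple_data_pool(), whose internals are not shown here, and the `committed` flag merely stands in for a call to CommitRows().

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical pool that owns a list of heap chunks; not impala::MemPool.
struct ToyPool {
  std::vector<std::vector<char>> chunks;

  void Allocate(std::size_t n) { chunks.push_back(std::vector<char>(n)); }

  std::size_t total_bytes() const {
    std::size_t total = 0;
    for (const std::vector<char>& c : chunks) total += c.size();
    return total;
  }

  // Takes ownership of every chunk in 'src', leaving 'src' empty.
  void AcquireData(ToyPool* src) {
    for (std::vector<char>& c : src->chunks) chunks.push_back(std::move(c));
    src->chunks.clear();
  }
};

// Hypothetical batch whose rows may point into memory held by its pool.
struct ToyBatch {
  ToyPool tuple_data_pool;
  bool committed = false;
};

// Moves the scanner-owned pool into the batch and optionally marks the
// batch as handed off (standing in for CommitRows()).
inline void AttachPoolToy(ToyBatch* batch, ToyPool* pool, bool commit_batch) {
  batch->tuple_data_pool.AcquireData(pool);
  if (commit_batch) batch->committed = true;
}
```

Because the chunks are moved rather than copied, any pointers into them stay valid for as long as the batch lives, which is the point of attaching the pool instead of freeing it.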

Definition at line 256 of file hdfs-scanner.h.

References impala::MemPool::AcquireData(), batch_, CommitRows(), and impala::RowBatch::tuple_data_pool().

Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsParquetScanner::Close(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and impala::HdfsRCFileScanner::ResetRowGroup().

void HdfsScanner::Close ( )

virtual

Release all resources the scanner has allocated. This is the last chance for the scanner to attach any resources to the ScannerContext object.

Reimplemented in impala::HdfsParquetScanner, impala::BaseSequenceScanner, and impala::HdfsTextScanner.

Definition at line 82 of file hdfs-scanner.cc.

References impala::Expr::Close(), conjunct_ctxs_, decompressor_, and state_.

Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsParquetScanner::Close(), and impala::HdfsScanNode::ScannerThread().

Function * HdfsScanner::CodegenWriteAlignedTuples	(	HdfsScanNode *	,
		LlvmCodeGen *	,
		llvm::Function *	write_tuple_fn
	)

static protected

Codegen function to replace WriteAlignedTuples. WriteAlignedTuples is cross compiled to IR. This function loads the precompiled IR function, modifies it and returns the resulting function.

Definition at line 495 of file hdfs-scanner.cc.

References impala::LlvmCodeGen::codegen_timer(), impala::LlvmCodeGen::FinalizeFunction(), impala::LlvmCodeGen::GetFunction(), impala::LlvmCodeGen::ReplaceCallSites(), and SCOPED_TIMER.

Referenced by impala::HdfsTextScanner::Codegen(), and impala::HdfsSequenceScanner::Codegen().

Function * HdfsScanner::CodegenWriteCompleteTuple	(	HdfsScanNode *	,
		LlvmCodeGen *	,
		const std::vector< ExprContext * > &	conjunct_ctxs
	)

static protected

Codegen function to replace WriteCompleteTuple. Should behave identically to WriteCompleteTuple.

Definition at line 296 of file hdfs-scanner.cc.

Referenced by impala::HdfsTextScanner::Codegen(), and impala::HdfsSequenceScanner::Codegen().

Status HdfsScanner::CommitRows ( int num_rows )

protected

Commit num_rows to the current row batch. If this completes the row batch, the batch is enqueued with the scan node and StartNewRowBatch() is called. Returns Status::OK if the query is not cancelled and hasn't exceeded any mem limits. A scanner can call this with 0 rows to flush any pending resources (attached pools and IO buffers) to minimize memory consumption.

Definition at line 124 of file hdfs-scanner.cc.

References impala::HdfsScanNode::AddMaterializedRowBatch(), impala::RowBatch::AtCapacity(), batch_, impala::TupleDescriptor::byte_size(), impala::Status::CANCELLED, impala::ScannerContext::cancelled(), impala::RowBatch::capacity(), impala::RuntimeState::CheckQueryState(), impala::RowBatch::CommitRows(), conjunct_ctxs_, context_, impala::ExprContext::FreeLocalAllocations(), impala::ScannerContext::num_completed_io_buffers(), impala::RowBatch::num_rows(), impala::Status::OK, impala::ScannerContext::ReleaseCompletedResources(), RETURN_IF_ERROR, scan_node_, StartNewRowBatch(), state_, impala::HdfsScanNode::tuple_desc(), and tuple_mem_.

Referenced by impala::HdfsParquetScanner::AssembleRows(), AttachPool(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), impala::HdfsRCFileScanner::ProcessRange(), and impala::HdfsParquetScanner::ProcessSplit().

bool IR_ALWAYS_INLINE impala::HdfsScanner::EvalConjuncts ( TupleRow * row )

inline protected

Convenience function for evaluating conjuncts using this scanner's ExprContexts. This must always be inlined so we can correctly replace the call to ExecNode::EvalConjuncts() during codegen.

Definition at line 266 of file hdfs-scanner.h.

References conjunct_ctxs_, and impala::ExecNode::EvalConjuncts().

Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), impala::HdfsRCFileScanner::ProcessRange(), WriteCompleteTuple(), WriteEmptyTuples(), and impala::HdfsTextScanner::WriteFields().

ExprContext * HdfsScanner::GetConjunctCtx ( int idx ) const

protected

Simple wrapper around conjunct_ctxs_. Used in the codegen'd version of WriteCompleteTuple() because it's easier than writing IR to access conjunct_ctxs_.

Definition at line 79 of file hdfs-scanner-ir.cc.

References conjunct_ctxs_, and gen_ir_descriptions::idx.

int HdfsScanner::GetMemory	(	MemPool **	pool,
		Tuple **	tuple_mem,
		TupleRow **	tuple_row_mem
	)

protected

Gets memory for outputting tuples into batch_. *pool is the mem pool that should be used for memory allocated for those tuples. *tuple_mem should be the location to output tuples, and *tuple_row_mem for outputting tuple rows. Returns the maximum number of tuples/tuple rows that can be output (before the current row batch is complete and a new one is allocated). Memory returned from this call is invalidated after calling CommitRows(). Callers must call GetMemory() again after calling CommitRows().
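The GetMemory() / CommitRows() contract can be illustrated with a toy writer. Everything here (`ToyBatch`, `ToyRowWriter`, `WriteSequence`, integer "tuples") is hypothetical and much simpler than the real RowBatch machinery; the sketch only shows why memory must be re-fetched after every commit.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity batch of int "tuples"; not impala::RowBatch.
struct ToyBatch {
  explicit ToyBatch(int capacity) : rows(capacity), num_rows(0) {}
  std::vector<int> rows;
  int num_rows;
  int capacity() const { return static_cast<int>(rows.size()); }
};

class ToyRowWriter {
 public:
  explicit ToyRowWriter(int batch_capacity)
      : capacity_(batch_capacity), batch_(new ToyBatch(batch_capacity)) {
    assert(batch_capacity > 0);
  }

  // Points *mem at the next free slot and returns how many rows still fit.
  int GetMemory(int** mem) {
    *mem = batch_->rows.data() + batch_->num_rows;
    return batch_->capacity() - batch_->num_rows;
  }

  // Commits num_rows. A full batch is handed off and replaced by a new
  // one, which invalidates pointers previously returned by GetMemory().
  void CommitRows(int num_rows) {
    assert(batch_->num_rows + num_rows <= batch_->capacity());
    batch_->num_rows += num_rows;
    if (batch_->num_rows == batch_->capacity()) {
      completed_.push_back(std::move(batch_));
      batch_.reset(new ToyBatch(capacity_));
    }
  }

  std::size_t num_completed() const { return completed_.size(); }
  const ToyBatch& completed(std::size_t i) const { return *completed_[i]; }

 private:
  int capacity_;
  std::unique_ptr<ToyBatch> batch_;
  std::vector<std::unique_ptr<ToyBatch>> completed_;
};

// Writes the values 0..total-1, re-fetching memory after every commit.
inline void WriteSequence(ToyRowWriter* writer, int total) {
  int next = 0;
  while (next < total) {
    int* mem = nullptr;
    int max_rows = writer->GetMemory(&mem);
    int n = (total - next < max_rows) ? total - next : max_rows;
    for (int i = 0; i < n; ++i) mem[i] = next + i;
    next += n;
    writer->CommitRows(n);
  }
}
```

The loop never caches `mem` across a CommitRows() call, mirroring the rule stated above.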

Definition at line 115 of file hdfs-scanner.cc.

References impala::RowBatch::AddRow(), batch_, impala::RowBatch::capacity(), impala::RowBatch::GetRow(), impala::RowBatch::num_rows(), impala::RowBatch::tuple_data_pool(), and tuple_mem_.

Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), and impala::HdfsRCFileScanner::ProcessRange().

Status HdfsScanner::InitializeWriteTuplesFn	(	HdfsPartitionDescriptor *	partition,
		THdfsFileFormat::type	type,
		const std::string &	scanner_name
	)

protected

Initializes write_tuples_fn_ to the jitted function if codegen is possible.

partition - partition descriptor for this scanner/scan range
type - type for this scanner
scanner_name - debug string name for this scanner (e.g. HdfsTextScanner)
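A rough sketch of the "codegen once per scanner type, look up per scanner object" idea follows. `FnCache`, `ToyCodegenScanner`, `Format`, and the one-argument `WriteFn` are all hypothetical simplifications (the real WriteTuplesFn takes nine arguments, and the real cache lives in HdfsScanNode); only the null-pointer fallback mirrors what the documentation states about write_tuples_fn_.

```cpp
#include <cassert>
#include <map>

// Hypothetical file-format tag standing in for THdfsFileFormat::type.
enum class Format { TEXT, SEQUENCE_FILE };

// Simplified function-pointer type for this sketch.
typedef int (*WriteFn)(int num_tuples);

// Stand-ins for the interpreted path and a "jitted" replacement. Both
// return the same result, since a codegen'd function should behave
// identically to the function it replaces.
inline int InterpretedWrite(int num_tuples) { return num_tuples; }
inline int FastWrite(int num_tuples) { return num_tuples; }

// Hypothetical scan-node-side cache: one entry per scanner type.
class FnCache {
 public:
  void Add(Format f, WriteFn fn) { fns_[f] = fn; }

  // Returns nullptr when no codegen'd function exists for this format.
  WriteFn Get(Format f) const {
    std::map<Format, WriteFn>::const_iterator it = fns_.find(f);
    return it == fns_.end() ? nullptr : it->second;
  }

 private:
  std::map<Format, WriteFn> fns_;
};

// Hypothetical scanner side: look the function up once per scan range
// and fall back to the interpreted path when the pointer is null.
struct ToyCodegenScanner {
  WriteFn write_tuples_fn = nullptr;

  void InitializeWriteFn(const FnCache& cache, Format f) {
    write_tuples_fn = cache.Get(f);
  }

  bool codegen_enabled() const { return write_tuples_fn != nullptr; }

  int Write(int n) const {
    return codegen_enabled() ? write_tuples_fn(n) : InterpretedWrite(n);
  }
};
```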

Definition at line 87 of file hdfs-scanner.cc.

References impala::HdfsPartitionDescriptor::escape_char(), impala::HdfsScanNode::GetCodegenFn(), impala::ExecNode::id(), impala::HdfsScanNode::IncNumScannersCodegenDisabled(), impala::HdfsScanNode::IncNumScannersCodegenEnabled(), impala::Status::OK, scan_node_, impala::TupleDescriptor::string_slots(), impala::HdfsScanNode::tuple_desc(), and write_tuples_fn_.

Referenced by impala::HdfsSequenceScanner::InitNewRange(), and impala::HdfsTextScanner::ResetScanner().

virtual Status impala::HdfsScanner::InitNewRange ( )

protected pure virtual

Reset internal state for a new scan range.

Implemented in impala::HdfsRCFileScanner, impala::HdfsParquetScanner, impala::HdfsSequenceScanner, impala::HdfsAvroScanner, and impala::HdfsTextScanner.

Referenced by impala::BaseSequenceScanner::ProcessSplit().

void impala::HdfsScanner::InitTuple	(	Tuple *	template_tuple,
		Tuple *	tuple
	)

inline protected

Initialize a tuple. TODO: only copy over non-null slots. TODO: InitTuple is called frequently, avoid the if, perhaps via templatization.

Definition at line 355 of file hdfs-scanner.h.

References num_null_bytes_, and tuple_byte_size_.

Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), impala::HdfsRCFileScanner::ProcessRange(), WriteCompleteTuple(), and impala::HdfsTextScanner::WriteFields().

void HdfsScanner::LogRowParseError	(	int	row_idx,
		std::stringstream *
	)

protected virtual

Utility function to append an error message for an invalid row. This is called from ReportTupleParseError(). row_idx is the index of the row in the current batch. Subclasses should override this function (e.g. text needs to join boundary rows). Since this is only in the error path, vtable overhead is acceptable.

Reimplemented in impala::HdfsSequenceScanner, and impala::HdfsTextScanner.

Definition at line 572 of file hdfs-scanner.cc.

Referenced by ReportTupleParseError().

TupleRow* impala::HdfsScanner::next_row ( TupleRow * r ) const

inline protected

Definition at line 368 of file hdfs-scanner.h.

References batch_, and impala::RowBatch::row_byte_size().

Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), impala::HdfsRCFileScanner::ProcessRange(), WriteEmptyTuples(), and impala::HdfsTextScanner::WriteFields().

Tuple* impala::HdfsScanner::next_tuple ( Tuple * t ) const

inline protected

Definition at line 363 of file hdfs-scanner.h.

References tuple_byte_size_.

Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), impala::HdfsRCFileScanner::ProcessRange(), and impala::HdfsTextScanner::WriteFields().

Status HdfsScanner::Prepare ( ScannerContext * context )

virtual

One-time initialisation of state that is constant across scan ranges.

Reimplemented in impala::HdfsRCFileScanner, impala::HdfsSequenceScanner, impala::HdfsParquetScanner, impala::BaseSequenceScanner, and impala::HdfsTextScanner.

Definition at line 71 of file hdfs-scanner.cc.

References ADD_TIMER, conjunct_ctxs_, context_, decompress_timer_, impala::HdfsScanNode::GetConjunctCtxs(), impala::ScannerContext::GetStream(), impala::HdfsScanNode::InitTemplateTuple(), impala::Status::OK, impala::ScannerContext::partition_descriptor(), impala::HdfsPartitionDescriptor::partition_key_value_ctxs(), RETURN_IF_ERROR, impala::ExecNode::runtime_profile(), scan_node_, StartNewRowBatch(), state_, stream_, and template_tuple_.

Referenced by impala::HdfsScanNode::CreateAndPrepareScanner(), impala::HdfsTextScanner::Prepare(), impala::BaseSequenceScanner::Prepare(), and impala::HdfsParquetScanner::Prepare().

virtual Status impala::HdfsScanner::ProcessSplit ( )

pure virtual

Process an entire split, reading bytes from the context's streams. Context is initialized with the split data (e.g. template tuple, partition descriptor, etc). This function should only return on error or end of scan range.

Implemented in impala::HdfsParquetScanner, impala::BaseSequenceScanner, and impala::HdfsTextScanner.

Referenced by impala::HdfsScanNode::ScannerThread().

void HdfsScanner::ReportColumnParseError	(	const SlotDescriptor *	desc,
		const char *	data,
		int	len
	)

protected

Report a parse error for the column described by desc. If abort_on_error is true, sets parse_status_ to the error message.

Definition at line 577 of file hdfs-scanner.cc.

References impala::RuntimeState::abort_on_error(), impala::SlotDescriptor::col_pos(), impala::RuntimeState::LogError(), impala::RuntimeState::LogHasSpace(), impala::HdfsScanNode::num_partition_keys(), impala::Status::ok(), parse_status_, scan_node_, state_, and impala::SlotDescriptor::type().

Referenced by impala::HdfsRCFileScanner::ProcessRange(), ReportTupleParseError(), and impala::HdfsTextScanner::WritePartialTuple().

bool HdfsScanner::ReportTupleParseError	(	FieldLocation *	fields,
		uint8_t *	errors,
		int	row_idx
	)

protected

Utility function to report parse errors for each field. If errors[i] is nonzero, fields[i] had a parse error. row_idx is the index of the row in the current batch that had the parse error. Returns false if parsing should be aborted. In this case parse_status_ is set to the error. This is called from WriteAlignedTuples.
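The per-field error array convention can be sketched as below. `ToyErrorState` and `ReportFieldErrors` are hypothetical; the real function logs through RuntimeState and sets parse_status_, which this sketch replaces with a plain vector of messages and a boolean option.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical error sink standing in for RuntimeState's log and options.
struct ToyErrorState {
  bool abort_on_error = false;
  std::vector<std::string> log;
};

// Logs one message for every field whose error flag is nonzero.
// Returns false if at least one error was seen and abort_on_error is set,
// i.e. if the caller should stop parsing.
inline bool ReportFieldErrors(ToyErrorState* state, const std::uint8_t* errors,
                              int num_fields, int row_idx) {
  bool any_error = false;
  for (int i = 0; i < num_fields; ++i) {
    if (errors[i] == 0) continue;
    any_error = true;
    state->log.push_back("row " + std::to_string(row_idx) +
                         ": parse error in field " + std::to_string(i));
  }
  return !(any_error && state->abort_on_error);
}
```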

Definition at line 546 of file hdfs-scanner.cc.

References impala::RuntimeState::abort_on_error(), impala::ScannerContext::Stream::filename(), impala::RuntimeState::LogError(), impala::RuntimeState::LogHasSpace(), LogRowParseError(), impala::HdfsScanNode::materialized_slots(), num_errors_in_file_, impala::Status::ok(), parse_status_, ReportColumnParseError(), impala::RuntimeState::ReportFileErrors(), scan_node_, state_, and stream_.

Referenced by impala::HdfsSequenceScanner::ProcessRange(), and WriteAlignedTuples().

void HdfsScanner::StartNewRowBatch ( )

protected

Set batch_ to a new row batch and update tuple_mem_ accordingly.

Definition at line 108 of file hdfs-scanner.cc.

References impala::MemPool::Allocate(), batch_, impala::RuntimeState::batch_size(), impala::ExecNode::mem_tracker(), impala::ExecNode::row_desc(), scan_node_, state_, tuple_byte_size_, impala::RowBatch::tuple_data_pool(), and tuple_mem_.

Referenced by CommitRows(), and Prepare().

Status HdfsScanner::UpdateDecompressor ( const THdfsCompression::type & compression )

protected

Update the decompressor_ object given a compression type or codec name. Depending on the old compression type and the new one, it may close the old decompressor and/or create a new one of different type.
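One way to picture the "close the old decompressor and/or create a new one" behaviour is the following sketch. `ToyCompression`, `ToyDecompressor` and `ToyDecompressorHolder` are hypothetical; the real code goes through Codec::CreateDecompressor(), and reusing the existing object when the type is unchanged is an assumption of this sketch.

```cpp
#include <cassert>
#include <memory>

// Hypothetical compression tag standing in for THdfsCompression::type.
enum class ToyCompression { NONE, GZIP, SNAPPY };

// Hypothetical decompressor; it only remembers which type it handles.
struct ToyDecompressor {
  explicit ToyDecompressor(ToyCompression t) : type(t) {}
  ToyCompression type;
};

class ToyDecompressorHolder {
 public:
  // Keeps the current decompressor if the type is unchanged. Otherwise
  // releases it and, unless the new type is NONE, creates a new one.
  // Returns true if the held object was replaced or released.
  bool Update(ToyCompression new_type) {
    if (new_type == type_) return false;
    decompressor_.reset();
    if (new_type != ToyCompression::NONE) {
      decompressor_.reset(new ToyDecompressor(new_type));
    }
    type_ = new_type;
    return true;
  }

  const ToyDecompressor* get() const { return decompressor_.get(); }

 private:
  ToyCompression type_ = ToyCompression::NONE;  // most recently used type
  std::unique_ptr<ToyDecompressor> decompressor_;
};
```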

Definition at line 513 of file hdfs-scanner.cc.

References impala::Codec::CreateDecompressor(), data_buffer_pool_, decompression_type_, decompressor_, impala::Status::OK, RETURN_IF_ERROR, scan_node_, impala::TupleDescriptor::string_slots(), and impala::HdfsScanNode::tuple_desc().

Referenced by impala::HdfsAvroScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), and impala::HdfsTextScanner::ProcessSplit().

Status impala::HdfsScanner::UpdateDecompressor ( const std::string & codec )

protected

int HdfsScanner::WriteAlignedTuples	(	MemPool *	pool,
		TupleRow *	tuple_row_mem,
		int	row_size,
		FieldLocation *	fields,
		int	num_tuples,
		int	max_added_tuples,
		int	slots_per_tuple,
		int	row_start_indx
	)

protected

Processes batches of fields and writes them out to tuple_row_mem.

'pool' mempool to allocate from for auxiliary tuple memory
'tuple_row_mem' preallocated tuple_row memory this function must use.
'fields' must start at the beginning of a tuple.
'num_tuples' number of tuples to process
'max_added_tuples' the maximum number of tuples that should be added to the batch.
'row_start_indx' is the number of rows that have already been processed as part of WritePartialTuple.

Returns the number of tuples added to the row batch. This can be less than num_tuples/tuples_till_limit because of failed conjuncts. Returns -1 if parsing should be aborted due to parse errors.
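The return-value convention (count of tuples that passed the conjuncts, or -1 on abort) can be shown with a toy. `ParseInt` and `WriteAlignedToy` are hypothetical, fields are plain strings holding integers, the conjunct is hard-coded as "first slot > 0", and a row with an unparseable field is simply skipped when not aborting. That last point is a simplification of this sketch, not a statement about Impala's behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Parses a whole string as a base-10 int; returns false on any junk.
inline bool ParseInt(const std::string& s, int* out) {
  if (s.empty()) return false;
  char* end = nullptr;
  long v = std::strtol(s.c_str(), &end, 10);
  if (*end != '\0') return false;
  *out = static_cast<int>(v);
  return true;
}

// 'fields' holds num_tuples * slots_per_tuple values, tuple after tuple.
// Appends tuples that parse cleanly and pass the toy conjunct to 'out'.
// Returns the number appended, or -1 if a parse error occurs while
// abort_on_error is set.
inline int WriteAlignedToy(const std::vector<std::string>& fields,
                           int num_tuples, int max_added_tuples,
                           int slots_per_tuple, bool abort_on_error,
                           std::vector<std::vector<int>>* out) {
  assert(fields.size() >=
         static_cast<std::size_t>(num_tuples) * slots_per_tuple);
  int added = 0;
  for (int t = 0; t < num_tuples && added < max_added_tuples; ++t) {
    std::vector<int> tuple(slots_per_tuple, 0);
    bool error_in_row = false;
    for (int s = 0; s < slots_per_tuple; ++s) {
      if (!ParseInt(fields[t * slots_per_tuple + s], &tuple[s])) {
        error_in_row = true;
      }
    }
    if (error_in_row) {
      if (abort_on_error) return -1;
      continue;  // simplification: drop the row in this toy
    }
    if (tuple[0] > 0) {  // hypothetical conjunct
      out->push_back(tuple);
      ++added;
    }
  }
  return added;
}
```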

Definition at line 33 of file hdfs-scanner-ir.cc.

References ReportTupleParseError(), template_tuple_, tuple_, tuple_byte_size_, UNLIKELY, and WriteCompleteTuple().

Referenced by impala::HdfsSequenceScanner::ProcessDecompressedBlock(), and impala::HdfsTextScanner::WriteFields().

bool HdfsScanner::WriteCompleteTuple	(	MemPool *	pool,
		FieldLocation *	fields,
		Tuple *	tuple,
		TupleRow *	tuple_row,
		Tuple *	template_tuple,
		uint8_t *	error_fields,
		uint8_t *	error_in_row
	)

protected

Writes out all slots for 'tuple' from 'fields'. 'fields' must be aligned to the start of the tuple (e.g. fields[0] maps to slots[0]). After writing the tuple, it will be evaluated against the conjuncts.

error_fields is an out array. error_fields[i] will be set to true if the ith field had a parse error.
error_in_row is an out bool. It is set to true if any field had parse errors.

Returns whether the resulting tuplerow passed the conjuncts.

The parsing of the fields and evaluating against conjuncts is combined in this function. This is done so that conjuncts can be evaluated as slots are materialized (on partial tuples).

This function is replaced by a codegen'd function at runtime. This is why the out error parameters are typed uint8_t instead of bool: we need to match this function's signature exactly for the codegen'd function. Bools as out parameters can get converted to bytes by the compiler, and rather than implicitly depending on that, we explicitly type them as bytes. TODO: revisit this

Definition at line 217 of file hdfs-scanner.cc.

References EvalConjuncts(), InitTuple(), impala::FieldLocation::len, impala::HdfsScanNode::materialized_slots(), scan_node_, impala::TupleRow::SetTuple(), text_converter_, impala::HdfsScanNode::tuple_idx(), and UNLIKELY.

Referenced by impala::HdfsSequenceScanner::ProcessRange(), and WriteAlignedTuples().

int HdfsScanner::WriteEmptyTuples	(	RowBatch *	row_batch,
		int	num_tuples
	)

protected

Utility method to write out tuples when there are no materialized fields (e.g. select count(*) or only partition keys).

num_tuples - Total number of tuples to write out.

Returns the number of tuples added to the row batch.

Definition at line 157 of file hdfs-scanner.cc.

References impala::RowBatch::AddRow(), impala::RowBatch::AddRows(), impala::RowBatch::AtCapacity(), impala::RowBatch::capacity(), impala::RowBatch::CommitLastRow(), impala::RowBatch::CommitRows(), EvalConjuncts(), impala::RowBatch::GetRow(), impala::RowBatch::INVALID_ROW_INDEX, impala::RowBatch::num_rows(), scan_node_, impala::TupleRow::SetTuple(), template_tuple_, and impala::HdfsScanNode::tuple_idx().

Referenced by impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), and impala::HdfsRCFileScanner::ProcessRange().

int HdfsScanner::WriteEmptyTuples	(	ScannerContext *	context,
		TupleRow *	tuple_row,
		int	num_tuples
	)

protected

Write empty tuples and commit them to the context object.

Definition at line 195 of file hdfs-scanner.cc.

References EvalConjuncts(), next_row(), scan_node_, impala::TupleRow::SetTuple(), template_tuple_, and impala::HdfsScanNode::tuple_idx().

Member Data Documentation

RowBatch* impala::HdfsScanner::batch_

protected

The current row batch being populated. Creating new row batches, attaching context resources, and handing off to the scan node is handled by this class in CommitRows(), but AttachPool() must be called by scanner subclasses to attach any memory allocated by that subclass. All row batches created by this class are transferred to the scan node (i.e., all batches are ultimately owned by the scan node).

Definition at line 177 of file hdfs-scanner.h.

Referenced by AddFinalRowBatch(), AttachPool(), CommitRows(), GetMemory(), next_row(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessSplit(), StartNewRowBatch(), impala::HdfsTextScanner::WriteFields(), and ~HdfsScanner().

std::vector<ExprContext*> impala::HdfsScanner::conjunct_ctxs_

protected

ExprContext for each conjunct. Each scanner has its own ExprContexts so the conjuncts can be safely evaluated in parallel.

Definition at line 154 of file hdfs-scanner.h.

Referenced by Close(), CommitRows(), EvalConjuncts(), GetConjunctCtx(), and Prepare().

ScannerContext* impala::HdfsScanner::context_

protected

Context for this scanner.

Definition at line 147 of file hdfs-scanner.h.

Referenced by AddFinalRowBatch(), impala::HdfsParquetScanner::AssembleRows(), CommitRows(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsParquetScanner::InitColumns(), impala::HdfsTextScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), Prepare(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), impala::HdfsRCFileScanner::ProcessRange(), impala::HdfsParquetScanner::ProcessSplit(), and impala::HdfsTextScanner::ResetScanner().

boost::scoped_ptr<MemPool> impala::HdfsScanner::data_buffer_pool_

protected

Pool to allocate per data block memory. This should be used with the decompressor and any other per data block allocations.

Definition at line 205 of file hdfs-scanner.h.

Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsRCFileScanner::ReadRowGroup(), impala::HdfsRCFileScanner::ResetRowGroup(), UpdateDecompressor(), and impala::HdfsTextScanner::WritePartialTuple().

RuntimeProfile::Counter* impala::HdfsScanner::decompress_timer_

protected

Time spent decompressing bytes.

Definition at line 208 of file hdfs-scanner.h.

Referenced by impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsSequenceScanner::GetRecord(), Prepare(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsRCFileScanner::ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and impala::HdfsRCFileScanner::ReadKeyBuffers().

THdfsCompression::type impala::HdfsScanner::decompression_type_

protected

The most recently used decompression type.

Definition at line 201 of file hdfs-scanner.h.

Referenced by impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), and UpdateDecompressor().

boost::scoped_ptr<Codec> impala::HdfsScanner::decompressor_

protected

Decompressor class to use, if any.

Definition at line 198 of file hdfs-scanner.h.

Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), Close(), impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::GetRecord(), impala::HdfsRCFileScanner::InitNewRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsRCFileScanner::ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsRCFileScanner::ReadKeyBuffers(), and UpdateDecompressor().

const int impala::HdfsScanner::FILE_BLOCK_SIZE = 4096

static

Assumed size of an OS file block. Used mostly when reading file format headers, etc. This probably ought to be a derived number from the environment.

Definition at line 95 of file hdfs-scanner.h.

const char * HdfsScanner::LLVM_CLASS_NAME = "class.impala::HdfsScanner"

static

Scanner subclasses must implement these static functions as well. Unfortunately, C++ does not allow static virtual functions.

Issue the initial ranges for 'files'. HdfsFileDesc groups all the splits assigned to this scan node by file. This is called before any of the scanner subclasses are created to process splits in 'files'. The strategy on how to parse the scan ranges depends on the file format.

For simple text files, all the splits are simply issued to the io mgr and one split == one scan range.
For formats with a header, the metadata is first parsed, and then the ranges are issued to the io mgr. There is one scan range for the header and one range for each split.
For columnar formats, the header is parsed and only the relevant byte ranges should be issued to the io mgr. This is one range for the metadata and one range for each column, for each split.

This function is how scanners can pick their strategy.

void IssueInitialRanges(HdfsScanNode* scan_node, const std::vector<HdfsFileDesc*>& files);

Codegen all functions for this scanner. The codegen'd function is specific to the scanner subclass but not specific to each scanner object. We don't want to codegen the functions for each scanner object.

llvm::Function* Codegen(HdfsScanNode* scan_node);
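Since C++ has no static virtual functions, one common workaround is to have every subclass provide a static function with the same name and to dispatch on the file format explicitly. The sketch below illustrates that pattern for the first two range-issuing strategies described above. `ToyTextScanner`, `ToyHeaderScanner`, `ToyFormat` and `IssueRangesFor` are hypothetical, the functions only count ranges rather than issue them, and "one header range per file" is this sketch's reading of the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each element of splits_per_file is the number of splits in one file.

// Hypothetical text scanner: one scan range per split.
struct ToyTextScanner {
  static int IssueInitialRanges(const std::vector<int>& splits_per_file) {
    int ranges = 0;
    for (std::size_t i = 0; i < splits_per_file.size(); ++i) {
      ranges += splits_per_file[i];
    }
    return ranges;
  }
};

// Hypothetical header-based scanner: one header range per file, plus one
// range per split.
struct ToyHeaderScanner {
  static int IssueInitialRanges(const std::vector<int>& splits_per_file) {
    int ranges = 0;
    for (std::size_t i = 0; i < splits_per_file.size(); ++i) {
      ranges += 1 + splits_per_file[i];
    }
    return ranges;
  }
};

// Hypothetical format tag used to pick the static function to call.
enum class ToyFormat { TEXT, WITH_HEADER };

// Explicit dispatch replaces the virtual call that C++ cannot express
// for static member functions.
inline int IssueRangesFor(ToyFormat format,
                          const std::vector<int>& splits_per_file) {
  switch (format) {
    case ToyFormat::TEXT:
      return ToyTextScanner::IssueInitialRanges(splits_per_file);
    case ToyFormat::WITH_HEADER:
      return ToyHeaderScanner::IssueInitialRanges(splits_per_file);
  }
  return 0;  // unreachable for valid enum values
}
```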

Definition at line 137 of file hdfs-scanner.h.

Referenced by CodegenWriteCompleteTuple().

int impala::HdfsScanner::num_errors_in_file_

protected

Number of errors in the current file.

Definition at line 183 of file hdfs-scanner.h.

Referenced by ReportTupleParseError().

int32_t impala::HdfsScanner::num_null_bytes_

protected

Number of null bytes in the tuple.

Definition at line 189 of file hdfs-scanner.h.

Referenced by InitTuple().

Status impala::HdfsScanner::parse_status_

protected

Contains the current parse status to minimize the number of Status objects returned. This significantly reduces the cross-compile dependencies for LLVM, since Status objects inline a bunch of string functions. Also, Status objects aren't extremely cheap to create and destroy.

Definition at line 195 of file hdfs-scanner.h.

Referenced by impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsSequenceScanner::GetRecord(), impala::HdfsAvroScanner::ParseMetadata(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), impala::HdfsRCFileScanner::ProcessRange(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsSequenceScanner::ReadBlockHeader(), impala::HdfsRCFileScanner::ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsAvroScanner::ReadFileHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadKeyBuffers(), impala::HdfsRCFileScanner::ReadNumColumnsMetadata(), impala::HdfsRCFileScanner::ReadRowGroupHeader(), impala::BaseSequenceScanner::ReadSync(), ReportColumnParseError(), ReportTupleParseError(), impala::BaseSequenceScanner::SkipToSync(), and impala::HdfsTextScanner::WriteFields().

HdfsScanNode* impala::HdfsScanner::scan_node_

protected

The scan node that started this scanner.

Definition at line 141 of file hdfs-scanner.h.

RuntimeState* impala::HdfsScanner::state_

protected

ScannerContext::Stream* impala::HdfsScanner::stream_

protected

The first stream for context_.

Definition at line 150 of file hdfs-scanner.h.

Tuple* impala::HdfsScanner::template_tuple_

protected

A partially materialized tuple with only partition key slots set. The non-partition key slots are set to NULL. The template tuple must be copied into tuple_ before any of the other slots are materialized. Pointer is NULL if there are no partition key slots. This template tuple is computed once for each file and valid for the duration of that file. It is owned by the HDFS scan node.
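The "copy the template tuple before materializing other slots" rule can be sketched on raw bytes. `InitTupleToy` is hypothetical; clearing only the leading null-indicator bytes when there is no template tuple is an assumption about the tuple layout made for this sketch, suggested by InitTuple() referencing num_null_bytes_ and tuple_byte_size_.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Prepares 'tuple' (tuple_byte_size bytes) before non-partition slots
// are written. With a template tuple, all of its bytes are copied, so
// partition-key slots arrive already set. Without one, only the first
// num_null_bytes bytes are zeroed and the rest is left untouched.
inline void InitTupleToy(const std::uint8_t* template_tuple,
                         std::uint8_t* tuple, std::size_t tuple_byte_size,
                         std::size_t num_null_bytes) {
  assert(num_null_bytes <= tuple_byte_size);
  if (template_tuple != nullptr) {
    std::memcpy(tuple, template_tuple, tuple_byte_size);
  } else {
    std::memset(tuple, 0, num_null_bytes);
  }
}
```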

Definition at line 164 of file hdfs-scanner.h.

Referenced by impala::HdfsAvroScanner::AllocateFileHeader(), impala::HdfsParquetScanner::AssembleRows(), impala::HdfsParquetScanner::CreateColumnReaders(), impala::HdfsAvroScanner::DecodeAvroData(), impala::HdfsAvroScanner::InitNewRange(), Prepare(), impala::HdfsSequenceScanner::ProcessRange(), impala::HdfsRCFileScanner::ProcessRange(), impala::HdfsAvroScanner::ResolveSchemas(), WriteAlignedTuples(), WriteEmptyTuples(), and impala::HdfsTextScanner::WriteFields().