Impala
Impala is the open source, native analytic database for Apache Hadoop.
A scanner for reading RCFiles into tuples. More...
#include <hdfs-rcfile-scanner.h>
Classes | |
struct | ColumnInfo |
struct | RcFileHeader |
Data that is fixed across headers. This struct is shared between scan ranges. More... | |
Public Member Functions | |
HdfsRCFileScanner (HdfsScanNode *scan_node, RuntimeState *state) | |
virtual | ~HdfsRCFileScanner () |
virtual Status | Prepare (ScannerContext *context) |
One-time initialisation of state that is constant across scan ranges. More... | |
void | DebugString (int indentation_level, std::stringstream *out) const |
virtual void | Close () |
virtual Status | ProcessSplit () |
Static Public Member Functions | |
static Status | IssueInitialRanges (HdfsScanNode *scan_node, const std::vector< HdfsFileDesc * > &files) |
Issue the initial ranges for all sequence container files. More... | |
Static Public Attributes | |
static const int | FILE_BLOCK_SIZE = 4096 |
static const char * | LLVM_CLASS_NAME = "class.impala::HdfsScanner" |
Protected Types | |
typedef int(* | WriteTuplesFn )(HdfsScanner *, MemPool *, TupleRow *, int, FieldLocation *, int, int, int, int) |
Protected Member Functions | |
Status | ReadSync () |
Status | SkipToSync (const uint8_t *sync, int sync_size) |
bool | finished () |
Status | InitializeWriteTuplesFn (HdfsPartitionDescriptor *partition, THdfsFileFormat::type type, const std::string &scanner_name) |
void | StartNewRowBatch () |
Set batch_ to a new row batch and update tuple_mem_ accordingly. More... | |
int | GetMemory (MemPool **pool, Tuple **tuple_mem, TupleRow **tuple_row_mem) |
Status | CommitRows (int num_rows) |
void | AddFinalRowBatch () |
void | AttachPool (MemPool *pool, bool commit_batch) |
bool IR_ALWAYS_INLINE | EvalConjuncts (TupleRow *row) |
int | WriteEmptyTuples (RowBatch *row_batch, int num_tuples) |
int | WriteEmptyTuples (ScannerContext *context, TupleRow *tuple_row, int num_tuples) |
Write empty tuples and commit them to the context object. More... | |
int | WriteAlignedTuples (MemPool *pool, TupleRow *tuple_row_mem, int row_size, FieldLocation *fields, int num_tuples, int max_added_tuples, int slots_per_tuple, int row_start_indx) |
Status | UpdateDecompressor (const THdfsCompression::type &compression) |
Status | UpdateDecompressor (const std::string &codec) |
bool | ReportTupleParseError (FieldLocation *fields, uint8_t *errors, int row_idx) |
virtual void | LogRowParseError (int row_idx, std::stringstream *) |
bool | WriteCompleteTuple (MemPool *pool, FieldLocation *fields, Tuple *tuple, TupleRow *tuple_row, Tuple *template_tuple, uint8_t *error_fields, uint8_t *error_in_row) |
void | ReportColumnParseError (const SlotDescriptor *desc, const char *data, int len) |
void | InitTuple (Tuple *template_tuple, Tuple *tuple) |
Tuple * | next_tuple (Tuple *t) const |
TupleRow * | next_row (TupleRow *r) const |
ExprContext * | GetConjunctCtx (int idx) const |
Static Protected Member Functions | |
static llvm::Function * | CodegenWriteCompleteTuple (HdfsScanNode *, LlvmCodeGen *, const std::vector< ExprContext * > &conjunct_ctxs) |
static llvm::Function * | CodegenWriteAlignedTuples (HdfsScanNode *, LlvmCodeGen *, llvm::Function *write_tuple_fn) |
Protected Attributes | |
FileHeader * | header_ |
File header for this scan range. This is not owned by the parent scan node. More... | |
bool | only_parsing_header_ |
If true, this scanner object is only for processing the header. More... | |
HdfsScanNode * | scan_node_ |
The scan node that started this scanner. More... | |
RuntimeState * | state_ |
RuntimeState for error reporting. More... | |
ScannerContext * | context_ |
Context for this scanner. More... | |
ScannerContext::Stream * | stream_ |
The first stream for context_. More... | |
std::vector< ExprContext * > | conjunct_ctxs_ |
Tuple * | template_tuple_ |
int | tuple_byte_size_ |
Fixed size of each tuple, in bytes. More... | |
Tuple * | tuple_ |
Current tuple pointer into tuple_mem_. More... | |
RowBatch * | batch_ |
uint8_t * | tuple_mem_ |
The tuple memory of batch_. More... | |
int | num_errors_in_file_ |
Number of errors in the current file. More... | |
boost::scoped_ptr< TextConverter > | text_converter_ |
Helper class for converting text to other types. More... | |
int32_t | num_null_bytes_ |
Number of null bytes in the tuple. More... | |
Status | parse_status_ |
boost::scoped_ptr< Codec > | decompressor_ |
Decompressor class to use, if any. More... | |
THdfsCompression::type | decompression_type_ |
The most recently used decompression type. More... | |
boost::scoped_ptr< MemPool > | data_buffer_pool_ |
RuntimeProfile::Counter * | decompress_timer_ |
Time spent decompressing bytes. More... | |
WriteTuplesFn | write_tuples_fn_ |
Jitted write tuples function pointer. Null if codegen is disabled. More... | |
Static Protected Attributes | |
static const int | SYNC_HASH_SIZE = 16 |
Size of the sync hash field. More... | |
static const int | HEADER_SIZE = 1024 |
static const int | SYNC_MARKER = -1 |
Sync indicator. More... | |
Private Types | |
enum | Version { SEQ6, RCF1 } |
Private Member Functions | |
virtual FileHeader * | AllocateFileHeader () |
Implementation of superclass functions. More... | |
virtual Status | ReadFileHeader () |
virtual Status | InitNewRange () |
Reset internal state for a new scan range. More... | |
virtual Status | ProcessRange () |
virtual THdfsFileFormat::type | file_format () const |
Returns type of scanner: e.g. rcfile, seqfile. More... | |
Status | ReadNumColumnsMetadata () |
Status | ReadRowGroupHeader () |
Status | ReadKeyBuffers () |
void | GetCurrentKeyBuffer (int col_idx, bool skip_col_data, uint8_t **key_buf_ptr) |
Status | ReadColumnBuffers () |
Status | NextField (int col_idx) |
Status | ReadRowGroup () |
void | ResetRowGroup () |
Reset state for a new row group. More... | |
Status | NextRow () |
Private Attributes | |
std::vector< ColumnInfo > | columns_ |
std::vector< uint8_t > | key_buffer_ |
Buffer for copying key buffers. This buffer is reused between row groups. More... | |
int | num_rows_ |
Number of rows in this row group. More... | |
int | row_pos_ |
int | key_length_ |
int | compressed_key_length_ |
bool | reuse_row_group_buffer_ |
uint8_t * | row_group_buffer_ |
int | row_group_length_ |
int | row_group_buffer_size_ |
Static Private Attributes | |
static const char *const | RCFILE_KEY_CLASS_NAME |
static const char *const | RCFILE_VALUE_CLASS_NAME |
static const char *const | RCFILE_METADATA_KEY_NUM_COLS |
static const uint8_t | RCFILE_VERSION_HEADER [4] = {'R', 'C', 'F', 1} |
A scanner for reading RCFiles into tuples.
Definition at line 231 of file hdfs-rcfile-scanner.h.
|
protectedinherited |
Matching typedef for WriteAlignedTuples for codegen. Refer to comments for that function.
Definition at line 212 of file hdfs-scanner.h.
|
private |
Enumerator | |
---|---|
SEQ6 | |
RCF1 |
Definition at line 328 of file hdfs-rcfile-scanner.h.
HdfsRCFileScanner::HdfsRCFileScanner | ( | HdfsScanNode * | scan_node, |
RuntimeState * | state | ||
) |
Definition at line 53 of file hdfs-rcfile-scanner.cc.
|
virtual |
Definition at line 57 of file hdfs-rcfile-scanner.cc.
|
protectedinherited |
Attach all remaining resources from context_ to batch_ and send batch_ to the scan node. This must be called after all rows have been committed and no further resources are needed from context_ (in practice this will happen in each scanner subclass's Close() implementation).
Definition at line 145 of file hdfs-scanner.cc.
References impala::HdfsScanNode::AddMaterializedRowBatch(), impala::HdfsScanner::batch_, impala::HdfsScanner::context_, impala::ScannerContext::ReleaseCompletedResources(), and impala::HdfsScanner::scan_node_.
Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), and impala::HdfsParquetScanner::Close().
|
privatevirtual |
Implementation of superclass functions.
Implements impala::BaseSequenceScanner.
Definition at line 227 of file hdfs-rcfile-scanner.cc.
Release all memory in 'pool' to batch_. If commit_batch is true, the row batch will be committed. commit_batch should be true if the attached pool is expected to be non-trivial (i.e. a decompression buffer) to minimize scanner mem usage.
Definition at line 256 of file hdfs-scanner.h.
References impala::MemPool::AcquireData(), impala::HdfsScanner::batch_, impala::HdfsScanner::CommitRows(), and impala::RowBatch::tuple_data_pool().
Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsParquetScanner::Close(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and ResetRowGroup().
|
virtualinherited |
Release all resources the scanner has allocated. This is the last chance for the scanner to attach any resources to the ScannerContext object.
Reimplemented from impala::HdfsScanner.
Definition at line 82 of file base-sequence-scanner.cc.
References impala::HdfsScanner::AddFinalRowBatch(), impala::HdfsScanner::AttachPool(), impala::ScannerContext::Stream::bytes_left(), impala::HdfsScanner::Close(), impala::BaseSequenceScanner::FileHeader::compression_type, impala::HdfsScanner::data_buffer_pool_, impala::HdfsScanner::decompressor_, impala::BaseSequenceScanner::file_format(), impala::BaseSequenceScanner::header_, impala::BaseSequenceScanner::num_syncs_, impala::BaseSequenceScanner::only_parsing_header_, impala::HdfsScanNode::RangeComplete(), impala::HdfsScanner::scan_node_, impala::HdfsScanner::stream_, impala::BaseSequenceScanner::total_block_size_, and VLOG_FILE.
|
staticprotectedinherited |
Codegen function to replace WriteAlignedTuples. WriteAlignedTuples is cross compiled to IR. This function loads the precompiled IR function, modifies it and returns the resulting function.
Definition at line 495 of file hdfs-scanner.cc.
References impala::LlvmCodeGen::codegen_timer(), impala::LlvmCodeGen::FinalizeFunction(), impala::LlvmCodeGen::GetFunction(), impala::LlvmCodeGen::ReplaceCallSites(), and SCOPED_TIMER.
Referenced by impala::HdfsTextScanner::Codegen(), and impala::HdfsSequenceScanner::Codegen().
|
staticprotectedinherited |
Codegen function to replace WriteCompleteTuple. Should behave identically to WriteCompleteTuple.
Definition at line 296 of file hdfs-scanner.cc.
References impala::LlvmCodeGen::FnPrototype::AddArgument(), impala::TupleDescriptor::byte_size(), impala::LlvmCodeGen::codegen_timer(), impala::LlvmCodeGen::CodegenMemcpy(), impala::TextConverter::CodegenWriteSlot(), impala::HdfsScanNode::ComputeSlotMaterializationOrder(), impala::LlvmCodeGen::context(), impala::CodegenAnyVal::CreateCallWrapped(), impala::LlvmCodeGen::false_value(), impala::LlvmCodeGen::FinalizeFunction(), impala::TupleDescriptor::GenerateLlvmStruct(), impala::Status::GetDetail(), impala::LlvmCodeGen::GetFunction(), impala::LlvmCodeGen::GetIntConstant(), impala::LlvmCodeGen::GetType(), impala::CodegenAnyVal::GetVal(), impala::HdfsScanNode::hdfs_table(), impala::FieldLocation::LLVM_CLASS_NAME, impala::TupleRow::LLVM_CLASS_NAME, impala::Tuple::LLVM_CLASS_NAME, impala::HdfsScanner::LLVM_CLASS_NAME, impala::MemPool::LLVM_CLASS_NAME, impala::HdfsScanNode::materialized_slots(), impala::HdfsTableDescriptor::null_column_value(), impala::HdfsScanNode::num_materialized_partition_keys(), impala::TupleDescriptor::num_null_bytes(), impala::Status::ok(), impala::LlvmCodeGen::OptimizeFunctionWithExprs(), impala::HdfsScanNode::runtime_state(), SCOPED_TIMER, impala::LlvmCodeGen::true_value(), impala::HdfsScanNode::tuple_desc(), impala::HdfsScanNode::tuple_idx(), impala::ColumnType::type, impala::SlotDescriptor::type(), impala::TYPE_BOOLEAN, impala::TYPE_DECIMAL, impala::TYPE_INT, impala::TYPE_TIMESTAMP, and impala::TYPE_TINYINT.
Referenced by impala::HdfsTextScanner::Codegen(), and impala::HdfsSequenceScanner::Codegen().
|
protectedinherited |
Commit num_rows to the current row batch. If this fills the row batch, the batch is enqueued with the scan node and StartNewRowBatch() is called. Returns Status::OK if the query is not cancelled and hasn't exceeded any mem limits. A scanner can call this with 0 rows to flush any pending resources (attached pools and io buffers) to minimize memory consumption.
Definition at line 124 of file hdfs-scanner.cc.
References impala::HdfsScanNode::AddMaterializedRowBatch(), impala::RowBatch::AtCapacity(), impala::HdfsScanner::batch_, impala::TupleDescriptor::byte_size(), impala::Status::CANCELLED, impala::ScannerContext::cancelled(), impala::RowBatch::capacity(), impala::RuntimeState::CheckQueryState(), impala::RowBatch::CommitRows(), impala::HdfsScanner::conjunct_ctxs_, impala::HdfsScanner::context_, impala::ExprContext::FreeLocalAllocations(), impala::ScannerContext::num_completed_io_buffers(), impala::RowBatch::num_rows(), impala::Status::OK, impala::ScannerContext::ReleaseCompletedResources(), RETURN_IF_ERROR, impala::HdfsScanner::scan_node_, impala::HdfsScanner::StartNewRowBatch(), impala::HdfsScanner::state_, impala::HdfsScanNode::tuple_desc(), and impala::HdfsScanner::tuple_mem_.
Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsScanner::AttachPool(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), ProcessRange(), and impala::HdfsParquetScanner::ProcessSplit().
void HdfsRCFileScanner::DebugString | ( | int | indentation_level, |
std::stringstream * | out | ||
) | const |
Definition at line 564 of file hdfs-rcfile-scanner.cc.
References impala::ScannerContext::Stream::filename(), impala::HdfsScanner::scan_node_, impala::HdfsScanner::stream_, and impala::HdfsScanNode::tuple_idx().
|
inlineprotectedinherited |
Convenience function for evaluating conjuncts using this scanner's ExprContexts. This must always be inlined so we can correctly replace the call to ExecNode::EvalConjuncts() during codegen.
Definition at line 266 of file hdfs-scanner.h.
References impala::HdfsScanner::conjunct_ctxs_, and impala::ExecNode::EvalConjuncts().
Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), ProcessRange(), impala::HdfsScanner::WriteCompleteTuple(), impala::HdfsScanner::WriteEmptyTuples(), and impala::HdfsTextScanner::WriteFields().
|
inlineprivatevirtual |
Returns type of scanner: e.g. rcfile, seqfile.
Implements impala::BaseSequenceScanner.
Definition at line 263 of file hdfs-rcfile-scanner.h.
|
inlineprotectedinherited |
Definition at line 117 of file base-sequence-scanner.h.
References impala::BaseSequenceScanner::finished_.
Referenced by impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), and ProcessRange().
|
protectedinherited |
Simple wrapper around conjunct_ctxs_. Used in the codegen'd version of WriteCompleteTuple() because it's easier than writing IR to access conjunct_ctxs_.
Definition at line 79 of file hdfs-scanner-ir.cc.
References impala::HdfsScanner::conjunct_ctxs_, and gen_ir_descriptions::idx.
|
private |
Process the current key buffer.
Inputs:
col_idx: column to process.
skip_col_data: if true, just skip over the key data.
Input/Output:
key_buf_ptr: pointer to the buffered file data; this will be moved past the data for this column.
Sets: col_buf_len_, col_buf_uncompressed_len_, col_key_bufs_, col_bufs_off_.
Definition at line 344 of file hdfs-rcfile-scanner.cc.
References impala::HdfsRCFileScanner::ColumnInfo::buffer_len, columns_, impala::ReadWriteUtil::GetVInt(), impala::HdfsRCFileScanner::ColumnInfo::key_buffer, row_group_length_, impala::HdfsRCFileScanner::ColumnInfo::start_offset, and impala::HdfsRCFileScanner::ColumnInfo::uncompressed_buffer_len.
Referenced by ReadKeyBuffers().
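The reference list above names impala::ReadWriteUtil::GetVInt() as the routine that reads values out of the key buffer. The following is a minimal sketch of decoding one variable-length integer, assuming the encoding is the one used by Hadoop's WritableUtils. VIntSize and DecodeVLong are hypothetical names chosen for this example and are not part of Impala.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes occupied by a Hadoop-style variable-length integer whose
// first byte is `first` (assumed encoding, see the lead-in).
inline int VIntSize(int8_t first) {
  if (first >= -112) return 1;            // value is stored in the first byte
  if (first < -120) return -119 - first;  // negative multi-byte value
  return -111 - first;                    // positive multi-byte value
}

// Decodes one value starting at buf[0], stores it in *value and returns the
// number of bytes consumed. The caller must guarantee that enough bytes are
// available in buf.
inline int DecodeVLong(const uint8_t* buf, int64_t* value) {
  assert(buf != nullptr && value != nullptr);
  const int8_t first = static_cast<int8_t>(buf[0]);
  const int len = VIntSize(first);
  if (len == 1) {
    *value = first;
    return 1;
  }
  // The remaining len - 1 bytes hold the magnitude in big-endian order.
  uint64_t magnitude = 0;
  for (int i = 1; i < len; ++i) {
    magnitude = (magnitude << 8) | buf[i];
  }
  // Negative values are stored as their bitwise complement.
  const bool negative = first < -120;
  *value = negative ? static_cast<int64_t>(~magnitude)
                    : static_cast<int64_t>(magnitude);
  return len;
}
```

A caller in the style of GetCurrentKeyBuffer() would advance its buffer pointer by the returned byte count after each decoded value.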
|
protectedinherited |
Gets memory for outputting tuples into batch_. *pool is the mem pool that should be used for memory allocated for those tuples. *tuple_mem should be the location to output tuples, and *tuple_row_mem for outputting tuple rows. Returns the maximum number of tuples/tuple rows that can be output (before the current row batch is complete and a new one is allocated). Memory returned from this call is invalidated after calling CommitRows. Callers must call GetMemory again after calling CommitRows.
Definition at line 115 of file hdfs-scanner.cc.
References impala::RowBatch::AddRow(), impala::HdfsScanner::batch_, impala::RowBatch::capacity(), impala::RowBatch::GetRow(), impala::RowBatch::num_rows(), impala::RowBatch::tuple_data_pool(), and impala::HdfsScanner::tuple_mem_.
Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), and ProcessRange().
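The calling pattern implied by GetMemory() and CommitRows() can be sketched with a toy model. ToyScanner and WriteRows are hypothetical stand-ins written for this example; they only count rows and do not use Impala's MemPool, Tuple or RowBatch types.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for the scanner's batch handling (hypothetical, not Impala code).
class ToyScanner {
 public:
  explicit ToyScanner(int capacity) : capacity_(capacity) { assert(capacity > 0); }

  // Returns how many more rows fit into the current batch.
  int GetMemory() const { return capacity_ - rows_in_batch_; }

  // Commits num_rows to the current batch. A full batch is handed off and a
  // new, empty batch is started.
  void CommitRows(int num_rows) {
    assert(num_rows >= 0 && num_rows <= GetMemory());
    rows_in_batch_ += num_rows;
    if (rows_in_batch_ == capacity_) {
      completed_batches_.push_back(rows_in_batch_);
      rows_in_batch_ = 0;
    }
  }

  const std::vector<int>& completed_batches() const { return completed_batches_; }
  int rows_in_batch() const { return rows_in_batch_; }

 private:
  int capacity_;
  int rows_in_batch_ = 0;
  std::vector<int> completed_batches_;
};

// Writes total_rows rows, asking for memory again after every commit because
// the result of the previous GetMemory() call is no longer valid.
inline void WriteRows(ToyScanner* scanner, int total_rows) {
  assert(scanner != nullptr && total_rows >= 0);
  int remaining = total_rows;
  while (remaining > 0) {
    const int max_rows = scanner->GetMemory();
    const int n = std::min(max_rows, remaining);
    scanner->CommitRows(n);
    remaining -= n;
  }
}
```

With a capacity of 4, writing 10 rows hands off two full batches and leaves 2 rows in the current batch.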
|
protectedinherited |
Initializes write_tuples_fn_ to the jitted function if codegen is possible.
Definition at line 87 of file hdfs-scanner.cc.
References impala::HdfsPartitionDescriptor::escape_char(), impala::HdfsScanNode::GetCodegenFn(), impala::ExecNode::id(), impala::HdfsScanNode::IncNumScannersCodegenDisabled(), impala::HdfsScanNode::IncNumScannersCodegenEnabled(), impala::Status::OK, impala::HdfsScanner::scan_node_, impala::TupleDescriptor::string_slots(), impala::HdfsScanNode::tuple_desc(), and impala::HdfsScanner::write_tuples_fn_.
Referenced by impala::HdfsSequenceScanner::InitNewRange(), and impala::HdfsTextScanner::ResetScanner().
|
privatevirtual |
Reset internal state for a new scan range.
Implements impala::HdfsScanner.
Definition at line 68 of file hdfs-rcfile-scanner.cc.
References impala::BaseSequenceScanner::FileHeader::codec, columns_, impala::Codec::CreateDecompressor(), impala::HdfsScanner::decompressor_, impala::HdfsScanNode::GetMaterializedSlotIdx(), impala::HdfsScanNode::hdfs_table(), impala::BaseSequenceScanner::header_, impala::BaseSequenceScanner::FileHeader::is_compressed, impala::TableDescriptor::num_cols(), impala::HdfsScanNode::num_partition_keys(), impala::Status::OK, impala::BaseSequenceScanner::only_parsing_header_, RETURN_IF_ERROR, reuse_row_group_buffer_, row_group_buffer_size_, impala::HdfsScanner::scan_node_, impala::ScannerContext::Stream::set_contains_tuple_data(), impala::HdfsScanNode::SKIP_COLUMN, impala::HdfsScanner::stream_, impala::TupleDescriptor::string_slots(), and impala::HdfsScanNode::tuple_desc().
|
inlineprotectedinherited |
Initialize a tuple. TODO: only copy over non-null slots. TODO: InitTuple is called frequently, avoid the if, perhaps via templatization.
Definition at line 355 of file hdfs-scanner.h.
References impala::HdfsScanner::num_null_bytes_, and impala::HdfsScanner::tuple_byte_size_.
Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), ProcessRange(), impala::HdfsScanner::WriteCompleteTuple(), and impala::HdfsTextScanner::WriteFields().
|
staticinherited |
Issue the initial ranges for all sequence container files.
Definition at line 40 of file base-sequence-scanner.cc.
References impala::HdfsScanNode::AddDiskIoRanges(), impala::HdfsScanNode::AllocateScanRange(), impala::BaseSequenceScanner::HEADER_SIZE, impala::Status::OK, impala::ScanRangeMetadata::partition_id, and RETURN_IF_ERROR.
Referenced by impala::HdfsScanNode::GetNext().
|
protectedvirtualinherited |
Utility function to append an error message for an invalid row. This is called from ReportTupleParseError(). row_idx is the index of the row in the current batch. Subclasses should override this function (i.e. text needs to join boundary rows). Since this is only in the error path, vtable overhead is acceptable.
Reimplemented in impala::HdfsSequenceScanner, and impala::HdfsTextScanner.
Definition at line 572 of file hdfs-scanner.cc.
Referenced by impala::HdfsScanner::ReportTupleParseError().
Definition at line 368 of file hdfs-scanner.h.
References impala::HdfsScanner::batch_, and impala::RowBatch::row_byte_size().
Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), ProcessRange(), impala::HdfsScanner::WriteEmptyTuples(), and impala::HdfsTextScanner::WriteFields().
Definition at line 363 of file hdfs-scanner.h.
References impala::HdfsScanner::tuple_byte_size_.
Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsAvroScanner::DecodeAvroData(), ProcessRange(), and impala::HdfsTextScanner::WriteFields().
|
inlineprivate |
Look at the next field in the specified column buffer.
Input: col_idx: column of the field.
Modifies: cur_field_length_rep_[col_idx], key_buf_pos_[col_idx], cur_field_length_[col_idx].
Definition at line 368 of file hdfs-rcfile-scanner.cc.
References impala::HdfsRCFileScanner::ColumnInfo::buffer_pos, columns_, impala::HdfsRCFileScanner::ColumnInfo::current_field_len, impala::HdfsRCFileScanner::ColumnInfo::current_field_len_rep, impala::ScannerContext::Stream::file_offset(), impala::ReadWriteUtil::GetVLong(), impala::HdfsRCFileScanner::ColumnInfo::key_buffer, impala::HdfsRCFileScanner::ColumnInfo::key_buffer_pos, impala::Status::OK, and impala::HdfsScanner::stream_.
Referenced by NextRow().
|
inlineprivate |
Move to next row. Calls NextField on each column that we are reading. Modifies: row_pos_
Definition at line 400 of file hdfs-rcfile-scanner.cc.
References columns_, NextField(), num_rows_, impala::Status::OK, RETURN_IF_ERROR, and row_pos_.
Referenced by ProcessRange().
|
virtual |
One-time initialisation of state that is constant across scan ranges.
Reimplemented from impala::BaseSequenceScanner.
Definition at line 60 of file hdfs-rcfile-scanner.cc.
References impala::HdfsScanNode::hdfs_table(), impala::HdfsScanNode::IncNumScannersCodegenDisabled(), impala::HdfsTableDescriptor::null_column_value(), impala::Status::OK, impala::BaseSequenceScanner::Prepare(), RETURN_IF_ERROR, impala::HdfsScanner::scan_node_, and impala::HdfsScanner::text_converter_.
|
privatevirtual |
Process the current range until it ends or an error occurs. Note that this might be called multiple times if we skip over bad data. This function should read from the underlying ScannerContext, materializing tuples to the context. When this function is called, it is guaranteed to be at the start of a data block (i.e. right after the sync marker).
Implements impala::BaseSequenceScanner.
Definition at line 451 of file hdfs-rcfile-scanner.cc.
References impala::RuntimeState::abort_on_error(), impala::HdfsRCFileScanner::ColumnInfo::buffer_pos, impala::SlotDescriptor::col_pos(), columns_, impala::HdfsScanner::CommitRows(), impala::HdfsScanner::context_, COUNTER_ADD, impala::HdfsRCFileScanner::ColumnInfo::current_field_len, impala::ScannerContext::Stream::eof(), impala::RuntimeState::ErrorLog(), impala::HdfsScanner::EvalConjuncts(), impala::ScannerContext::Stream::filename(), impala::BaseSequenceScanner::finished(), impala::HdfsScanner::GetMemory(), impala::HdfsScanner::InitTuple(), impala::RuntimeState::LogError(), impala::RuntimeState::LogHasSpace(), impala::HdfsRCFileScanner::ColumnInfo::materialize_column, impala::ScanNode::materialize_tuple_timer(), impala::HdfsScanNode::materialized_slots(), impala::HdfsScanner::next_row(), impala::HdfsScanner::next_tuple(), NextRow(), impala::SlotDescriptor::null_indicator_offset(), impala::HdfsScanNode::num_partition_keys(), num_rows_, impala::Status::OK, impala::HdfsScanner::parse_status_, pool, impala::ExecNode::ReachedLimit(), impala::ScannerContext::Stream::ReadInt(), ReadRowGroup(), impala::BaseSequenceScanner::ReadSync(), impala::HdfsScanner::ReportColumnParseError(), impala::RuntimeState::ReportFileErrors(), ResetRowGroup(), RETURN_IF_ERROR, RETURN_IF_FALSE, row_group_buffer_, row_group_length_, row_pos_, impala::ScanNode::rows_read_counter(), impala::HdfsScanner::scan_node_, SCOPED_TIMER, impala::Tuple::SetNull(), impala::TupleRow::SetTuple(), impala::HdfsRCFileScanner::ColumnInfo::start_offset, impala::HdfsScanner::state_, impala::HdfsScanner::stream_, impala::BaseSequenceScanner::SYNC_MARKER, impala::HdfsScanner::template_tuple_, impala::HdfsScanner::text_converter_, impala::HdfsScanNode::tuple_idx(), and impala::HdfsScanner::WriteEmptyTuples().
|
virtualinherited |
Process an entire split, reading bytes from the context's streams. Context is initialized with the split data (e.g. template tuple, partition descriptor, etc). This function should only return on error or end of scan range.
Implements impala::HdfsScanner.
Definition at line 100 of file base-sequence-scanner.cc.
References impala::RuntimeState::abort_on_error(), impala::ObjectPool::Add(), impala::HdfsScanNode::AddDiskIoRanges(), impala::BaseSequenceScanner::AllocateFileHeader(), impala::BaseSequenceScanner::bytes_skipped_counter_, impala::BaseSequenceScanner::CloseFileRanges(), COUNTER_ADD, impala::ScannerContext::Stream::eof(), impala::ScannerContext::Stream::file_offset(), impala::ScannerContext::Stream::filename(), impala::BaseSequenceScanner::finished_, impala::HdfsScanNode::GetFileDesc(), impala::HdfsScanNode::GetFileMetadata(), impala::BaseSequenceScanner::header_, impala::BaseSequenceScanner::FileHeader::header_size, impala::HdfsScanner::InitNewRange(), impala::BaseSequenceScanner::FileHeader::is_compressed, impala::Status::IsCancelled(), impala::Status::IsMemLimitExceeded(), impala::RuntimeState::LogError(), impala::Status::msg(), impala::RuntimeState::obj_pool(), impala::Status::OK, impala::Status::ok(), impala::BaseSequenceScanner::only_parsing_header_, impala::HdfsScanner::parse_status_, impala::BaseSequenceScanner::ProcessRange(), impala::BaseSequenceScanner::ReadFileHeader(), RETURN_IF_ERROR, RETURN_IF_FALSE, impala::HdfsScanner::scan_node_, impala::ScannerContext::Stream::set_contains_tuple_data(), impala::HdfsScanNode::SetFileMetadata(), impala::ScannerContext::Stream::SkipBytes(), impala::BaseSequenceScanner::SkipToSync(), impala::HdfsScanner::state_, impala::HdfsScanner::stream_, impala::BaseSequenceScanner::FileHeader::sync, and impala::BaseSequenceScanner::SYNC_HASH_SIZE.
|
private |
Read the row group column buffers. Sets: column_buffer_: filled with either file data or decompressed data.
Definition at line 413 of file hdfs-rcfile-scanner.cc.
References impala::HdfsRCFileScanner::ColumnInfo::buffer_len, columns_, impala::HdfsScanner::decompress_timer_, impala::HdfsScanner::decompressor_, impala::BaseSequenceScanner::header_, impala::BaseSequenceScanner::FileHeader::is_compressed, impala::Status::OK, impala::HdfsScanner::parse_status_, impala::ScannerContext::Stream::ReadBytes(), RETURN_IF_ERROR, RETURN_IF_FALSE, row_group_buffer_, row_group_length_, SCOPED_TIMER, impala::ScannerContext::Stream::SkipBytes(), impala::HdfsRCFileScanner::ColumnInfo::start_offset, impala::HdfsScanner::stream_, impala::HdfsRCFileScanner::ColumnInfo::uncompressed_buffer_len, and VLOG_FILE.
Referenced by ReadRowGroup().
|
privatevirtual |
Read the file header. The underlying ScannerContext is at the start of the file header. This function must read the file header (which advances context_ past it) and initialize header_.
Implements impala::BaseSequenceScanner.
Definition at line 107 of file hdfs-rcfile-scanner.cc.
References impala::BaseSequenceScanner::FileHeader::codec, impala::Codec::CODEC_MAP, impala::BaseSequenceScanner::FileHeader::compression_type, impala::ScannerContext::Stream::filename(), impala::BaseSequenceScanner::header_, impala::BaseSequenceScanner::FileHeader::header_size, impala::ReadWriteUtil::HexDump(), impala::BaseSequenceScanner::FileHeader::is_compressed, impala::Status::OK, impala::HdfsScanner::parse_status_, RCF1, RCFILE_KEY_CLASS_NAME, RCFILE_VALUE_CLASS_NAME, RCFILE_VERSION_HEADER, impala::ScannerContext::Stream::ReadBoolean(), impala::ScannerContext::Stream::ReadBytes(), ReadNumColumnsMetadata(), impala::ScannerContext::Stream::ReadText(), RETURN_IF_ERROR, RETURN_IF_FALSE, SEQ6, impala::HdfsSequenceScanner::SEQFILE_VERSION_HEADER, impala::HdfsScanner::stream_, impala::BaseSequenceScanner::FileHeader::sync, impala::BaseSequenceScanner::SYNC_HASH_SIZE, impala::ScannerContext::Stream::total_bytes_returned(), impala::HdfsRCFileScanner::RcFileHeader::version, and VLOG_FILE.
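A minimal sketch of how the leading magic bytes could be mapped to the Version enumerators listed on this page. The RCFile bytes come from RCFILE_VERSION_HEADER above. The sequence file bytes {'S', 'E', 'Q', 6} are an assumption, since this page only references SEQFILE_VERSION_HEADER without showing its value. DetectVersion and the UNKNOWN enumerator are hypothetical additions for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// SEQ6 and RCF1 mirror the Version enum on this page; UNKNOWN is added here
// so the sketch can report an unrecognised header.
enum class Version { SEQ6, RCF1, UNKNOWN };

// Value taken from RCFILE_VERSION_HEADER as documented above.
constexpr uint8_t kRcFileVersionHeader[4] = {'R', 'C', 'F', 1};
// Assumed value of HdfsSequenceScanner::SEQFILE_VERSION_HEADER.
constexpr uint8_t kSeqFileVersionHeader[4] = {'S', 'E', 'Q', 6};

// Classifies a file by its first four bytes.
inline Version DetectVersion(const uint8_t* magic) {
  assert(magic != nullptr);
  if (std::memcmp(magic, kRcFileVersionHeader, 4) == 0) return Version::RCF1;
  if (std::memcmp(magic, kSeqFileVersionHeader, 4) == 0) return Version::SEQ6;
  return Version::UNKNOWN;
}
```

The real ReadFileHeader() goes on to read further header fields, such as the key and value class names, the compression flag and the sync hash, which this sketch does not model.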
|
private |
Read the rowgroup key buffers, decompress if necessary. The "keys" are really the lengths for the column values. They are read here and then used to decode the values in the column buffer. Calls GetCurrentKeyBuffer for each column to process the key data.
Definition at line 308 of file hdfs-rcfile-scanner.cc.
References columns_, compressed_key_length_, impala::HdfsScanner::decompress_timer_, impala::HdfsScanner::decompressor_, GetCurrentKeyBuffer(), impala::ReadWriteUtil::GetVInt(), impala::BaseSequenceScanner::header_, impala::BaseSequenceScanner::FileHeader::is_compressed, key_buffer_, key_length_, num_rows_, impala::Status::OK, impala::HdfsScanner::parse_status_, impala::ScannerContext::Stream::ReadBytes(), RETURN_IF_ERROR, RETURN_IF_FALSE, row_group_length_, SCOPED_TIMER, impala::HdfsScanner::stream_, and VLOG_FILE.
Referenced by ReadRowGroup().
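The idea that the "keys" are field lengths used to decode the column values can be illustrated as follows. This is a simplified sketch: SliceColumn is a hypothetical helper, the lengths are assumed to be already decoded into a vector, and any repetition encoding of lengths is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits one column buffer into per-row values using the field lengths that
// were read from the key buffer. Hypothetical helper, not part of Impala.
inline std::vector<std::string> SliceColumn(const std::string& column_buffer,
                                            const std::vector<int>& field_lengths) {
  std::vector<std::string> values;
  values.reserve(field_lengths.size());
  std::size_t offset = 0;
  for (int len : field_lengths) {
    // Each length must be non-negative and stay inside the column buffer.
    assert(len >= 0);
    const std::size_t n = static_cast<std::size_t>(len);
    assert(offset + n <= column_buffer.size());
    values.push_back(column_buffer.substr(offset, n));
    offset += n;
  }
  return values;
}
```

For example, a column buffer "abcde" with lengths 1, 0 and 4 yields the values "a", "" and "bcde".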
|
private |
Reads the RCFile Header Metadata section in the current file to determine the number of columns. Other pieces of the metadata are ignored.
Definition at line 197 of file hdfs-rcfile-scanner.cc.
References impala::ScannerContext::Stream::filename(), impala::BaseSequenceScanner::header_, impala::HdfsRCFileScanner::RcFileHeader::num_cols, impala::Status::OK, impala::StringParser::PARSE_OVERFLOW, impala::HdfsScanner::parse_status_, impala::StringParser::PARSE_SUCCESS, RCFILE_METADATA_KEY_NUM_COLS, impala::ScannerContext::Stream::ReadInt(), impala::ScannerContext::Stream::ReadText(), RETURN_IF_FALSE, and impala::HdfsScanner::stream_.
Referenced by ReadFileHeader().
|
private |
Read a row group (except for the sync marker and sync) into buffers. Calls: ReadRowGroupHeader, ReadKeyBuffers, ReadColumnBuffers.
Definition at line 254 of file hdfs-rcfile-scanner.cc.
References impala::HdfsScanner::data_buffer_pool_, impala::ExecNode::mem_tracker(), num_rows_, impala::Status::OK, ReadColumnBuffers(), ReadKeyBuffers(), ReadRowGroupHeader(), ResetRowGroup(), RETURN_IF_ERROR, reuse_row_group_buffer_, row_group_buffer_, row_group_buffer_size_, row_group_length_, impala::HdfsScanner::scan_node_, impala::RuntimeState::SetMemLimitExceeded(), and impala::HdfsScanner::state_.
Referenced by ProcessRange().
|
private |
Reads the row group header starting after the sync. Sets: key_length_, compressed_key_length_, num_rows_.
Definition at line 278 of file hdfs-rcfile-scanner.cc.
References compressed_key_length_, impala::ScannerContext::Stream::file_offset(), key_length_, impala::Status::OK, impala::HdfsScanner::parse_status_, impala::ScannerContext::Stream::ReadInt(), RETURN_IF_FALSE, and impala::HdfsScanner::stream_.
Referenced by ReadRowGroup().
|
protectedinherited |
Read and validate sync marker against header_->sync. Returns non-ok if the sync marker did not match. Scanners should always use this function to read sync markers, otherwise finished() might not be updated correctly. If finished() returns true after calling this function, scanners must not process any more records.
Definition at line 170 of file base-sequence-scanner.cc.
References impala::BaseSequenceScanner::block_start_, impala::ScannerContext::Stream::eof(), impala::ScannerContext::Stream::eosr(), impala::ScannerContext::Stream::file_offset(), impala::BaseSequenceScanner::finished_, impala::ScannerContext::Stream::GetBytes(), impala::hash, impala::BaseSequenceScanner::header_, impala::ReadWriteUtil::HexDump(), impala::BaseSequenceScanner::num_syncs_, impala::Status::OK, impala::HdfsScanner::parse_status_, RETURN_IF_FALSE, impala::HdfsScanner::stream_, impala::BaseSequenceScanner::FileHeader::sync, impala::BaseSequenceScanner::SYNC_HASH_SIZE, and impala::BaseSequenceScanner::total_block_size_.
Referenced by impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), and ProcessRange().
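The contract described above (compare against the header's sync, fail on mismatch, update the finished flag) might look roughly like the following. This is a sketch over an in-memory buffer, not the stream-based implementation; the 16-byte marker length, SyncStateSketch, and ReadSyncSketch are assumptions introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kSyncSizeSketch = 16;  // Assumed marker length.

struct SyncStateSketch {
  size_t pos = 0;         // Read offset into the data.
  bool finished = false;  // Plays the role of finished().
  int num_syncs = 0;      // Number of validated markers.
};

// Reads one marker at st->pos and compares it with `expected`.
// Returns false if the marker is truncated or does not match.
// On success, advances pos and sets finished once pos reaches range_end.
inline bool ReadSyncSketch(const uint8_t* data, size_t data_len,
                           size_t range_end, const uint8_t* expected,
                           SyncStateSketch* st) {
  assert(data != nullptr && expected != nullptr && st != nullptr);
  if (st->pos > data_len || data_len - st->pos < kSyncSizeSketch) return false;
  if (std::memcmp(data + st->pos, expected, kSyncSizeSketch) != 0) return false;
  st->pos += kSyncSizeSketch;
  ++st->num_syncs;
  if (st->pos >= range_end) st->finished = true;
  return true;
}
```

A caller following the rule above would check the finished flag after every successful call and stop processing records once it is set.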
|
protectedinherited |
Report a parse error for the column described by desc. If abort_on_error is true, sets parse_status_ to the error message.
Definition at line 577 of file hdfs-scanner.cc.
References impala::RuntimeState::abort_on_error(), impala::SlotDescriptor::col_pos(), impala::RuntimeState::LogError(), impala::RuntimeState::LogHasSpace(), impala::HdfsScanNode::num_partition_keys(), impala::Status::ok(), impala::HdfsScanner::parse_status_, impala::HdfsScanner::scan_node_, impala::HdfsScanner::state_, and impala::SlotDescriptor::type().
Referenced by ProcessRange(), impala::HdfsScanner::ReportTupleParseError(), and impala::HdfsTextScanner::WritePartialTuple().
|
protectedinherited |
Utility function to report parse errors for each field. If errors[i] is nonzero, fields[i] had a parse error. row_idx is the index of the row in the current batch that had the parse error. Returns false if parsing should be aborted; in this case parse_status_ is set to the error. This is called from WriteAlignedTuples.
Definition at line 546 of file hdfs-scanner.cc.
References impala::RuntimeState::abort_on_error(), impala::ScannerContext::Stream::filename(), impala::RuntimeState::LogError(), impala::RuntimeState::LogHasSpace(), impala::HdfsScanner::LogRowParseError(), impala::HdfsScanNode::materialized_slots(), impala::HdfsScanner::num_errors_in_file_, impala::Status::ok(), impala::HdfsScanner::parse_status_, impala::HdfsScanner::ReportColumnParseError(), impala::RuntimeState::ReportFileErrors(), impala::HdfsScanner::scan_node_, impala::HdfsScanner::state_, and impala::HdfsScanner::stream_.
Referenced by impala::HdfsSequenceScanner::ProcessRange(), and impala::HdfsScanner::WriteAlignedTuples().
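The errors[i] convention can be illustrated with a small standalone helper. This is only a sketch: CollectFieldErrors and its parameters are hypothetical, and the real function also logs through RuntimeState, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Collects the indices of fields whose error flag is nonzero.
// Returns false when parsing should stop, i.e. when at least one field
// failed and abort_on_error is set; returns true otherwise.
inline bool CollectFieldErrors(const std::vector<uint8_t>& errors,
                               bool abort_on_error,
                               std::vector<size_t>* bad_fields) {
  assert(bad_fields != nullptr);
  bad_fields->clear();
  for (size_t i = 0; i < errors.size(); ++i) {
    if (errors[i] != 0) bad_fields->push_back(i);
  }
  return bad_fields->empty() || !abort_on_error;
}
```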
|
private |
Reset state for a new row group.
Definition at line 231 of file hdfs-rcfile-scanner.cc.
References impala::HdfsScanner::AttachPool(), columns_, compressed_key_length_, impala::HdfsScanner::data_buffer_pool_, key_length_, num_rows_, reuse_row_group_buffer_, row_group_buffer_size_, and row_pos_.
Referenced by ProcessRange(), and ReadRowGroup().
|
protectedinherited |
Utility function to advance past the next sync marker, reading bytes from stream_. If no sync is found in the scan range, returns Status::OK and sets finished_ to true. It is safe to call this function past eosr.
Definition at line 212 of file base-sequence-scanner.cc.
References impala::BaseSequenceScanner::block_start_, impala::ScannerContext::Stream::bytes_left(), impala::ScannerContext::Stream::eof(), impala::ScannerContext::Stream::eosr(), impala::ScannerContext::Stream::file_offset(), impala::ScannerContext::Stream::filename(), impala::BaseSequenceScanner::FindSyncBlock(), impala::BaseSequenceScanner::finished_, impala::ScannerContext::Stream::GetBuffer(), impala::ScannerContext::Stream::GetBytes(), impala::BaseSequenceScanner::num_syncs_, offset, impala::Status::OK, impala::HdfsScanner::parse_status_, RETURN_IF_ERROR, RETURN_IF_FALSE, impala::ScannerContext::Stream::SkipBytes(), impala::HdfsScanner::stream_, and VLOG_FILE.
Referenced by impala::BaseSequenceScanner::ProcessSplit().
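Advancing past a sync marker amounts to a byte-pattern search. The sketch below does this over a single in-memory buffer with std::search; the real function reads incrementally from stream_, so treat OffsetPastSync as a hypothetical simplification rather than the scanner's algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the offset just past the first occurrence of `sync` in `data`,
// or -1 if the marker does not occur (the "no sync found" case).
inline int64_t OffsetPastSync(const std::vector<uint8_t>& data,
                              const std::vector<uint8_t>& sync) {
  assert(!sync.empty());
  const auto it =
      std::search(data.begin(), data.end(), sync.begin(), sync.end());
  if (it == data.end()) return -1;
  return static_cast<int64_t>(it - data.begin()) +
         static_cast<int64_t>(sync.size());
}
```

In this simplified model, a return value of -1 corresponds to the case where the scanner would set finished_ and still report success.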
|
protectedinherited |
Set batch_ to a new row batch and update tuple_mem_ accordingly.
Definition at line 108 of file hdfs-scanner.cc.
References impala::MemPool::Allocate(), impala::HdfsScanner::batch_, impala::RuntimeState::batch_size(), impala::ExecNode::mem_tracker(), impala::ExecNode::row_desc(), impala::HdfsScanner::scan_node_, impala::HdfsScanner::state_, impala::HdfsScanner::tuple_byte_size_, impala::RowBatch::tuple_data_pool(), and impala::HdfsScanner::tuple_mem_.
Referenced by impala::HdfsScanner::CommitRows(), and impala::HdfsScanner::Prepare().
|
protectedinherited |
Update the decompressor_ object given a compression type or codec name. Depending on the old compression type and the new one, it may close the old decompressor and/or create a new one of different type.
Definition at line 513 of file hdfs-scanner.cc.
References impala::Codec::CreateDecompressor(), impala::HdfsScanner::data_buffer_pool_, impala::HdfsScanner::decompression_type_, impala::HdfsScanner::decompressor_, impala::Status::OK, RETURN_IF_ERROR, impala::HdfsScanner::scan_node_, impala::TupleDescriptor::string_slots(), and impala::HdfsScanNode::tuple_desc().
Referenced by impala::HdfsAvroScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), and impala::HdfsTextScanner::ProcessSplit().
|
protectedinherited |
|
protectedinherited |
Processes batches of fields and writes them out to tuple_row_mem.
Definition at line 33 of file hdfs-scanner-ir.cc.
References impala::HdfsScanner::ReportTupleParseError(), impala::HdfsScanner::template_tuple_, impala::HdfsScanner::tuple_, impala::HdfsScanner::tuple_byte_size_, UNLIKELY, and impala::HdfsScanner::WriteCompleteTuple().
Referenced by impala::HdfsSequenceScanner::ProcessDecompressedBlock(), and impala::HdfsTextScanner::WriteFields().
|
protectedinherited |
Writes out all slots for 'tuple' from 'fields'. 'fields' must be aligned to the start of the tuple (e.g. fields[0] maps to slots[0]). After writing the tuple, it will be evaluated against the conjuncts.
Definition at line 217 of file hdfs-scanner.cc.
References impala::HdfsScanner::EvalConjuncts(), impala::HdfsScanner::InitTuple(), impala::FieldLocation::len, impala::HdfsScanNode::materialized_slots(), impala::HdfsScanner::scan_node_, impala::TupleRow::SetTuple(), impala::HdfsScanner::text_converter_, impala::HdfsScanNode::tuple_idx(), and UNLIKELY.
Referenced by impala::HdfsSequenceScanner::ProcessRange(), and impala::HdfsScanner::WriteAlignedTuples().
|
protectedinherited |
Utility method to write out tuples when there are no materialized fields (e.g. select count(*) or only partition keys). num_tuples is the total number of tuples to write out. Returns the number of tuples added to the row batch.
Definition at line 157 of file hdfs-scanner.cc.
References impala::RowBatch::AddRow(), impala::RowBatch::AddRows(), impala::RowBatch::AtCapacity(), impala::RowBatch::capacity(), impala::RowBatch::CommitLastRow(), impala::RowBatch::CommitRows(), impala::HdfsScanner::EvalConjuncts(), impala::RowBatch::GetRow(), impala::RowBatch::INVALID_ROW_INDEX, impala::RowBatch::num_rows(), impala::HdfsScanner::scan_node_, impala::TupleRow::SetTuple(), impala::HdfsScanner::template_tuple_, and impala::HdfsScanNode::tuple_idx().
Referenced by impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), and ProcessRange().
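Because the return value can be smaller than num_tuples, callers need to handle a partially satisfied request. The sketch below models only the capacity arithmetic, assuming there are no conjuncts to evaluate; RowBatchSketch and WriteEmptyTuplesSketch are hypothetical names, not Impala types.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a row batch: fixed capacity, current row count.
struct RowBatchSketch {
  int capacity;
  int num_rows;
};

// Adds up to num_tuples empty rows, limited by the free space in the batch.
// Returns how many rows were actually added.
inline int WriteEmptyTuplesSketch(RowBatchSketch* batch, int num_tuples) {
  assert(batch != nullptr && num_tuples >= 0);
  const int free_rows = batch->capacity - batch->num_rows;
  const int added = std::min(num_tuples, free_rows);
  batch->num_rows += added;
  return added;
}
```

A caller with more tuples than the batch can hold would hand off the full batch and call again with the remainder.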
|
protectedinherited |
Write empty tuples and commit them to the context object.
Definition at line 195 of file hdfs-scanner.cc.
References impala::HdfsScanner::EvalConjuncts(), impala::HdfsScanner::next_row(), impala::HdfsScanner::scan_node_, impala::TupleRow::SetTuple(), impala::HdfsScanner::template_tuple_, and impala::HdfsScanNode::tuple_idx().
|
protectedinherited |
The current row batch being populated. Creating new row batches, attaching context resources, and handing off to the scan node is handled by this class in CommitRows(), but AttachPool() must be called by scanner subclasses to attach any memory allocated by that subclass. All row batches created by this class are transferred to the scan node (i.e., all batches are ultimately owned by the scan node).
Definition at line 177 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::AddFinalRowBatch(), impala::HdfsScanner::AttachPool(), impala::HdfsScanner::CommitRows(), impala::HdfsScanner::GetMemory(), impala::HdfsScanner::next_row(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessSplit(), impala::HdfsScanner::StartNewRowBatch(), impala::HdfsTextScanner::WriteFields(), and impala::HdfsScanner::~HdfsScanner().
|
private |
Vector of column descriptions for each column in the file (i.e., may contain a different number of non-partition columns than are in the table metadata). Indexed by column index, including non-materialized columns.
Definition at line 376 of file hdfs-rcfile-scanner.h.
Referenced by GetCurrentKeyBuffer(), InitNewRange(), NextField(), NextRow(), ProcessRange(), ReadColumnBuffers(), ReadKeyBuffers(), and ResetRowGroup().
|
private |
Compressed size of the row group's key buffers. Read from the row group header.
Definition at line 394 of file hdfs-rcfile-scanner.h.
Referenced by ReadKeyBuffers(), ReadRowGroupHeader(), and ResetRowGroup().
|
protectedinherited |
ExprContext for each conjunct. Each scanner has its own ExprContexts so the conjuncts can be safely evaluated in parallel.
Definition at line 154 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::Close(), impala::HdfsScanner::CommitRows(), impala::HdfsScanner::EvalConjuncts(), impala::HdfsScanner::GetConjunctCtx(), and impala::HdfsScanner::Prepare().
|
protectedinherited |
Context for this scanner.
Definition at line 147 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::AddFinalRowBatch(), impala::HdfsParquetScanner::AssembleRows(), impala::HdfsScanner::CommitRows(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsParquetScanner::InitColumns(), impala::HdfsTextScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), impala::HdfsScanner::Prepare(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), ProcessRange(), impala::HdfsParquetScanner::ProcessSplit(), and impala::HdfsTextScanner::ResetScanner().
|
protectedinherited |
Pool used to allocate per-data-block memory. This should be used with the decompressor and any other per-data-block allocations.
Definition at line 205 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ReadCompressedBlock(), ReadRowGroup(), ResetRowGroup(), impala::HdfsScanner::UpdateDecompressor(), and impala::HdfsTextScanner::WritePartialTuple().
|
protectedinherited |
Time spent decompressing bytes.
Definition at line 208 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsSequenceScanner::GetRecord(), impala::HdfsScanner::Prepare(), impala::HdfsAvroScanner::ProcessRange(), ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and ReadKeyBuffers().
|
protectedinherited |
The most recently used decompression type.
Definition at line 201 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), and impala::HdfsScanner::UpdateDecompressor().
|
protectedinherited |
Decompressor class to use, if any.
Definition at line 198 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsScanner::Close(), impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::GetRecord(), InitNewRange(), impala::HdfsAvroScanner::ProcessRange(), ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), ReadKeyBuffers(), and impala::HdfsScanner::UpdateDecompressor().
|
staticinherited |
Assumed size of an OS file block. Used mostly when reading file format headers, etc. This probably ought to be a derived number from the environment.
Definition at line 95 of file hdfs-scanner.h.
|
protectedinherited |
File header for this scan range. This is not owned by the parent scan node.
Definition at line 127 of file base-sequence-scanner.h.
Referenced by impala::BaseSequenceScanner::Close(), impala::HdfsSequenceScanner::GetRecord(), impala::HdfsAvroScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), InitNewRange(), impala::HdfsAvroScanner::ParseMetadata(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), impala::BaseSequenceScanner::ProcessSplit(), ReadColumnBuffers(), impala::HdfsAvroScanner::ReadFileHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), ReadFileHeader(), ReadKeyBuffers(), ReadNumColumnsMetadata(), and impala::BaseSequenceScanner::ReadSync().
|
staticprotectedinherited |
Estimate of header size in bytes. This is the initial number of bytes to issue per file. If the estimate is too low, more bytes will be read as necessary.
Definition at line 121 of file base-sequence-scanner.h.
Referenced by impala::BaseSequenceScanner::IssueInitialRanges().
|
private |
Buffer for copying key buffers. This buffer is reused between row groups.
Definition at line 379 of file hdfs-rcfile-scanner.h.
Referenced by ReadKeyBuffers().
|
private |
Size of the row group's key buffers. Read from the row group header.
Definition at line 390 of file hdfs-rcfile-scanner.h.
Referenced by ReadKeyBuffers(), ReadRowGroupHeader(), and ResetRowGroup().
|
staticinherited |
Scanner subclasses must implement these static functions as well. Unfortunately, C++ does not allow static virtual functions. Issue the initial ranges for 'files'. HdfsFileDesc groups all the splits assigned to this scan node by file. This is called before any of the scanner subclasses are created to process splits in 'files'. The strategy on how to parse the scan ranges depends on the file format.
Definition at line 137 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::CodegenWriteCompleteTuple().
|
protectedinherited |
Number of errors in the current file.
Definition at line 183 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::ReportTupleParseError().
|
protectedinherited |
Number of null bytes in the tuple.
Definition at line 189 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::InitTuple().
|
private |
Number of rows in this row group object.
Definition at line 382 of file hdfs-rcfile-scanner.h.
Referenced by NextRow(), ProcessRange(), ReadKeyBuffers(), ReadRowGroup(), and ResetRowGroup().
|
protectedinherited |
If true, this scanner object is only for processing the header.
Definition at line 130 of file base-sequence-scanner.h.
Referenced by impala::BaseSequenceScanner::Close(), impala::BaseSequenceScanner::CloseFileRanges(), impala::HdfsAvroScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), InitNewRange(), and impala::BaseSequenceScanner::ProcessSplit().
|
protectedinherited |
Contains the current parse status to minimize the number of Status objects returned. This significantly reduces the cross-compile dependencies for LLVM, since Status objects inline a bunch of string functions. Also, Status objects aren't extremely cheap to create and destroy.
Definition at line 195 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsSequenceScanner::GetRecord(), impala::HdfsAvroScanner::ParseMetadata(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), ProcessRange(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsSequenceScanner::ReadBlockHeader(), ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsAvroScanner::ReadFileHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), ReadFileHeader(), ReadKeyBuffers(), ReadNumColumnsMetadata(), ReadRowGroupHeader(), impala::BaseSequenceScanner::ReadSync(), impala::HdfsScanner::ReportColumnParseError(), impala::HdfsScanner::ReportTupleParseError(), impala::BaseSequenceScanner::SkipToSync(), and impala::HdfsTextScanner::WriteFields().
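The pattern described here, where hot-path helpers return a plain bool and record details in a member, can be sketched as follows. StatusSketch and ParserSketch are hypothetical types written for this example; they are not Impala's Status or scanner classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal status: a flag plus a message.
struct StatusSketch {
  bool ok = true;
  std::string msg;
};

class ParserSketch {
 public:
  // Hot-path helper: returns a bool instead of a status object.
  // On failure the details are stored in parse_status_.
  bool ParseDigit(char c, int* out) {
    assert(out != nullptr);
    if (c < '0' || c > '9') {
      parse_status_.ok = false;
      parse_status_.msg = std::string("not a digit: ") + c;
      return false;
    }
    *out = c - '0';
    return true;
  }

  // Callers inspect the stored status only after a helper returns false.
  const StatusSketch& parse_status() const { return parse_status_; }

 private:
  StatusSketch parse_status_;
};
```

The helper's signature involves only built-in types, which is the property that keeps status-related code out of the generated hot path.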
|
staticprivate |
The key class name located in the RCFile header. This is always "org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer".
Definition at line 243 of file hdfs-rcfile-scanner.h.
Referenced by ReadFileHeader().
|
staticprivate |
RCFile metadata key for determining the number of columns present in the RCFile: "hive.io.rcfile.column.number"
Definition at line 251 of file hdfs-rcfile-scanner.h.
Referenced by ReadNumColumnsMetadata().
|
staticprivate |
The value class name located in the RCFile header. This is always "org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer".
Definition at line 247 of file hdfs-rcfile-scanner.h.
Referenced by ReadFileHeader().
|
staticprivate |
The four-byte RCFile unique version header present at the beginning of the file: {'R', 'C', 'F', 1}.
Definition at line 255 of file hdfs-rcfile-scanner.h.
Referenced by ReadFileHeader().
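Checking for this header is a four-byte comparison. The sketch below shows one way to do it; kRcFileVersionHeaderSketch and HasRcFileVersionHeader are names invented for this example, and only the byte values come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// The four bytes described above: 'R', 'C', 'F', followed by the value 1.
static const uint8_t kRcFileVersionHeaderSketch[4] = {'R', 'C', 'F', 1};

// Returns true if the buffer starts with the RCFile version header.
inline bool HasRcFileVersionHeader(const uint8_t* buf, size_t len) {
  assert(buf != nullptr);
  return len >= sizeof(kRcFileVersionHeaderSketch) &&
         std::memcmp(buf, kRcFileVersionHeaderSketch,
                     sizeof(kRcFileVersionHeaderSketch)) == 0;
}
```

Note that the last byte is the integer 1, not the character '1'.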
|
private |
If true, the row_group_buffer_ can be reused across row groups, otherwise, it (more specifically the data_buffer_pool_ that allocated the row_group_buffer_) must be attached to the row batch.
Definition at line 399 of file hdfs-rcfile-scanner.h.
Referenced by InitNewRange(), ReadRowGroup(), and ResetRowGroup().
|
private |
Buffer containing the entire row group. We allocate a buffer for the entire row group, skipping non-materialized columns.
Definition at line 403 of file hdfs-rcfile-scanner.h.
Referenced by ProcessRange(), ReadColumnBuffers(), and ReadRowGroup().
|
private |
This is the allocated size of 'row_group_buffer_'. 'row_group_buffer_' is reused across row groups and will grow as necessary.
Definition at line 411 of file hdfs-rcfile-scanner.h.
Referenced by InitNewRange(), ReadRowGroup(), and ResetRowGroup().
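The "reused and grows as necessary" behaviour can be sketched with a buffer that only reallocates when a request exceeds its allocated size. The scanner allocates from a memory pool; this example substitutes std::vector, and ReusableBufferSketch is a hypothetical class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ReusableBufferSketch {
 public:
  // Returns storage of at least `needed` bytes, growing only when the
  // current allocation is too small. Existing contents are not meaningful
  // to the caller after this call.
  uint8_t* Reserve(size_t needed) {
    if (needed > buf_.size()) buf_.resize(needed);
    return buf_.data();
  }

  // Allocated size, analogous in role to row_group_buffer_size_.
  size_t allocated_size() const { return buf_.size(); }

 private:
  std::vector<uint8_t> buf_;
};
```

The allocated size is tracked separately from the number of valid bytes, which changes with every row group.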
|
private |
Sum of the bytes lengths of the materialized columns in the current row group. This is the number of valid bytes in row_group_buffer_.
Definition at line 407 of file hdfs-rcfile-scanner.h.
Referenced by GetCurrentKeyBuffer(), ProcessRange(), ReadColumnBuffers(), ReadKeyBuffers(), and ReadRowGroup().
|
private |
Current row position in this row group. This value is incremented each time NextRow() is called.
Definition at line 386 of file hdfs-rcfile-scanner.h.
Referenced by NextRow(), ProcessRange(), and ResetRowGroup().
|
protectedinherited |
The scan node that started this scanner.
Definition at line 141 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::AddFinalRowBatch(), impala::HdfsParquetScanner::AssembleRows(), impala::HdfsParquetScanner::BaseColumnReader::BaseColumnReader(), impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsParquetScanner::Close(), impala::BaseSequenceScanner::CloseFileRanges(), impala::HdfsScanner::CommitRows(), impala::HdfsParquetScanner::CreateColumnReaders(), impala::HdfsParquetScanner::CreateReader(), DebugString(), impala::HdfsAvroScanner::DecodeAvroData(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsParquetScanner::InitColumns(), impala::HdfsScanner::InitializeWriteTuplesFn(), impala::HdfsTextScanner::InitNewRange(), impala::HdfsAvroScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), InitNewRange(), impala::HdfsAvroScanner::ParseMetadata(), impala::HdfsTextScanner::Prepare(), impala::BaseSequenceScanner::Prepare(), impala::HdfsParquetScanner::Prepare(), impala::HdfsScanner::Prepare(), impala::HdfsSequenceScanner::Prepare(), Prepare(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), ProcessRange(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), ReadRowGroup(), impala::HdfsScanner::ReportColumnParseError(), impala::HdfsScanner::ReportTupleParseError(), impala::HdfsTextScanner::ResetScanner(), impala::HdfsAvroScanner::ResolveSchemas(), impala::HdfsScanner::StartNewRowBatch(), impala::HdfsScanner::UpdateDecompressor(), impala::HdfsAvroScanner::VerifyTypesMatch(), impala::HdfsScanner::WriteCompleteTuple(), impala::HdfsScanner::WriteEmptyTuples(), impala::HdfsTextScanner::WriteFields(), and impala::HdfsTextScanner::WritePartialTuple().
|
protectedinherited |
RuntimeState for error reporting.
Definition at line 144 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::Close(), impala::HdfsScanner::CommitRows(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsTextScanner::Prepare(), impala::HdfsScanner::Prepare(), impala::HdfsSequenceScanner::Prepare(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), ProcessRange(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::BaseSequenceScanner::ReadPastSize(), ReadRowGroup(), impala::HdfsScanner::ReportColumnParseError(), impala::HdfsScanner::ReportTupleParseError(), impala::HdfsAvroScanner::ResolveSchemas(), impala::HdfsScanner::StartNewRowBatch(), impala::HdfsParquetScanner::ValidateColumn(), and impala::HdfsTextScanner::WriteFields().
|
protectedinherited |
The first stream for context_.
Definition at line 150 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::Close(), impala::BaseSequenceScanner::Close(), impala::HdfsParquetScanner::CreateColumnReaders(), DebugString(), impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsTextScanner::FindFirstTuple(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::GetRecord(), impala::HdfsTextScanner::InitNewRange(), InitNewRange(), NextField(), impala::HdfsAvroScanner::ParseMetadata(), impala::BaseSequenceScanner::Prepare(), impala::HdfsScanner::Prepare(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), ProcessRange(), impala::HdfsTextScanner::ProcessSplit(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsParquetScanner::ProcessSplit(), impala::HdfsSequenceScanner::ReadBlockHeader(), ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsAvroScanner::ReadFileHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), ReadFileHeader(), ReadKeyBuffers(), ReadNumColumnsMetadata(), ReadRowGroupHeader(), impala::BaseSequenceScanner::ReadSync(), impala::HdfsScanner::ReportTupleParseError(), impala::BaseSequenceScanner::SkipToSync(), impala::HdfsParquetScanner::ValidateFileMetadata(), impala::HdfsAvroScanner::VerifyTypesMatch(), and impala::HdfsTextScanner::WriteFields().
|
staticprotectedinherited |
Size of the sync hash field.
Definition at line 49 of file base-sequence-scanner.h.
Referenced by impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsAvroScanner::ReadFileHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), ReadFileHeader(), and impala::BaseSequenceScanner::ReadSync().
|
staticprotectedinherited |
Sync indicator.
Definition at line 124 of file base-sequence-scanner.h.
Referenced by impala::HdfsSequenceScanner::ProcessRange(), and ProcessRange().
|
protectedinherited |
A partially materialized tuple with only partition key slots set. The non-partition key slots are set to NULL. The template tuple must be copied into tuple_ before any of the other slots are materialized. Pointer is NULL if there are no partition key slots. This template tuple is computed once for each file and valid for the duration of that file. It is owned by the HDFS scan node.
Definition at line 164 of file hdfs-scanner.h.
Referenced by impala::HdfsAvroScanner::AllocateFileHeader(), impala::HdfsParquetScanner::AssembleRows(), impala::HdfsParquetScanner::CreateColumnReaders(), impala::HdfsAvroScanner::DecodeAvroData(), impala::HdfsAvroScanner::InitNewRange(), impala::HdfsScanner::Prepare(), impala::HdfsSequenceScanner::ProcessRange(), ProcessRange(), impala::HdfsAvroScanner::ResolveSchemas(), impala::HdfsScanner::WriteAlignedTuples(), impala::HdfsScanner::WriteEmptyTuples(), and impala::HdfsTextScanner::WriteFields().
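The copy-before-materialize rule can be shown with raw bytes. This is a simplified sketch: InitTupleSketch is a hypothetical function, and zeroing the whole tuple when there is no template is an assumption made for the example, not necessarily what the scanner does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Prepares a tuple for materialization. If a template tuple exists, its
// bytes (partition key slots set, other slots NULL) are copied in first;
// otherwise the tuple is zeroed (assumption for this sketch).
inline void InitTupleSketch(const uint8_t* template_tuple, uint8_t* tuple,
                            size_t tuple_byte_size) {
  assert(tuple != nullptr);
  if (template_tuple != nullptr) {
    std::memcpy(tuple, template_tuple, tuple_byte_size);
  } else {
    std::memset(tuple, 0, tuple_byte_size);
  }
}
```

Writing the remaining slots afterwards means partition key values are paid for once per file rather than once per row.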
|
protectedinherited |
Helper class for converting text to other types.
Definition at line 186 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::InitNewRange(), impala::HdfsSequenceScanner::InitNewRange(), Prepare(), ProcessRange(), impala::HdfsScanner::WriteCompleteTuple(), and impala::HdfsTextScanner::WritePartialTuple().
|
protectedinherited |
Current tuple pointer into tuple_mem_.
Definition at line 170 of file hdfs-scanner.h.
Referenced by impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsSequenceScanner::ProcessRange(), impala::HdfsScanner::WriteAlignedTuples(), and impala::HdfsTextScanner::WriteFields().
|
protectedinherited |
Fixed size of each tuple, in bytes.
Definition at line 167 of file hdfs-scanner.h.
Referenced by impala::HdfsParquetScanner::AssembleRows(), impala::HdfsScanner::InitTuple(), impala::HdfsScanner::next_tuple(), impala::HdfsScanner::StartNewRowBatch(), and impala::HdfsScanner::WriteAlignedTuples().
|
protectedinherited |
The tuple memory of batch_.
Definition at line 180 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::CommitRows(), impala::HdfsScanner::GetMemory(), and impala::HdfsScanner::StartNewRowBatch().
|
protectedinherited |
Jitted write tuples function pointer. Null if codegen is disabled.
Definition at line 215 of file hdfs-scanner.h.
Referenced by impala::HdfsScanner::InitializeWriteTuplesFn(), impala::HdfsSequenceScanner::ProcessDecompressedBlock(), and impala::HdfsTextScanner::WriteFields().