Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
|
#include <scanner-context.h>
Public Types | |
typedef boost::function< int(int64_t)> | ReadPastSizeCallback |
Public Member Functions | |
bool | GetBytes (int64_t requested_len, uint8_t **buffer, int64_t *out_len, Status *status, bool peek=false) |
Status | GetBuffer (bool peek, uint8_t **buffer, int64_t *out_len) |
void | set_contains_tuple_data (bool v) |
void | set_read_past_size_cb (ReadPastSizeCallback cb) |
int64_t | bytes_left () |
Return the number of bytes left in the range for this stream. More... | |
bool | eosr () const |
bool | eof () const |
If true, the stream has reached the end of the file. More... | |
const char * | filename () |
const DiskIoMgr::ScanRange * | scan_range () |
const HdfsFileDesc * | file_desc () |
int64_t | file_offset () const |
Returns the buffer's current offset in the file. More... | |
int64_t | total_bytes_returned () |
Returns the total number of bytes returned. More... | |
bool | ReadBoolean (bool *boolean, Status *) |
bool | ReadInt (int32_t *val, Status *, bool peek=false) |
bool | ReadVLong (int64_t *val, Status *) |
bool | ReadVInt (int32_t *val, Status *) |
bool | ReadZLong (int64_t *val, Status *) |
Read a zigzag encoded long. More... | |
bool | SkipBytes (int64_t length, Status *) |
Skip over the next length bytes in the specified HDFS file. More... | |
bool | ReadBytes (int64_t length, uint8_t **buf, Status *, bool peek=false) |
bool | ReadText (uint8_t **buf, int64_t *length, Status *) |
bool | SkipText (Status *) |
Skip this text object. More... | |
Private Member Functions | |
Stream (ScannerContext *parent) | |
Status | GetBytesInternal (int64_t requested_len, uint8_t **buffer, bool peek, int64_t *out_len) |
Status | GetNextBuffer (int64_t read_past_size=0) |
void | ReleaseCompletedResources (RowBatch *batch, bool done) |
Status | ReportIncompleteRead (int64_t length, int64_t bytes_read) |
Error-reporting functions. More... | |
Status | ReportInvalidRead (int64_t length) |
Private Attributes | |
ScannerContext * | parent_ |
DiskIoMgr::ScanRange * | scan_range_ |
const HdfsFileDesc * | file_desc_ |
bool | contains_tuple_data_ |
int64_t | total_bytes_returned_ |
Total number of bytes returned from GetBytes() More... | |
int64_t | file_len_ |
ReadPastSizeCallback | read_past_size_cb_ |
DiskIoMgr::BufferDescriptor * | io_buffer_ |
The current io buffer. This starts as NULL before we've read any bytes. More... | |
uint8_t * | io_buffer_pos_ |
Next byte to read in io_buffer_. More... | |
int64_t | io_buffer_bytes_left_ |
Bytes left in io_buffer_. More... | |
boost::scoped_ptr< MemPool > | boundary_pool_ |
boost::scoped_ptr< StringBuffer > | boundary_buffer_ |
uint8_t * | boundary_buffer_pos_ |
int64_t | boundary_buffer_bytes_left_ |
uint8_t ** | output_buffer_pos_ |
int64_t * | output_buffer_bytes_left_ |
std::list < DiskIoMgr::BufferDescriptor * > | completed_io_buffers_ |
Friends | |
class | ScannerContext |
Encapsulates a stream (continuous byte range) that can be read. A context can contain one or more streams. For non-columnar files, there is only one stream; for columnar, there is one stream per column.
Definition at line 66 of file scanner-context.h.
typedef boost::function<int (int64_t)> impala::ScannerContext::Stream::ReadPastSizeCallback |
Callback that returns the buffer size to use when reading past the end of the scan range. By default a constant value is used, which scanners can override with this callback. The callback takes the file offset of the asynchronous read (this may be more than file_offset() due to data being assembled in the boundary buffer). Reading past the end of the scan range is likely a remote read, so we want to minimize the number of io requests as well as the data volume.
Definition at line 105 of file scanner-context.h.
|
private |
Definition at line 52 of file scanner-context.cc.
|
inline |
Return the number of bytes left in the range for this stream.
Definition at line 109 of file scanner-context.h.
References impala::DiskIoMgr::RequestRange::len(), scan_range_, and total_bytes_returned_.
Referenced by impala::BaseSequenceScanner::Close(), and impala::BaseSequenceScanner::SkipToSync().
|
inline |
If true, the stream has reached the end of the file.
Definition at line 116 of file scanner-context.h.
References file_len_, and file_offset().
Referenced by eosr(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsSequenceScanner::ProcessRange(), impala::HdfsRCFileScanner::ProcessRange(), impala::BaseSequenceScanner::ProcessSplit(), impala::BaseSequenceScanner::ReadSync(), and impala::BaseSequenceScanner::SkipToSync().
|
inline |
If true, all bytes in this scan range have been returned or we have reached eof (the scan range could be longer than the file).
Definition at line 113 of file scanner-context.h.
References eof(), impala::DiskIoMgr::RequestRange::len(), scan_range_, and total_bytes_returned_.
Referenced by impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsTextScanner::ProcessRange(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), impala::BaseSequenceScanner::ReadSync(), and impala::BaseSequenceScanner::SkipToSync().
|
inline |
Definition at line 120 of file scanner-context.h.
References file_desc_.
Referenced by impala::HdfsTextScanner::Close(), impala::HdfsScanNode::CreateAndPrepareScanner(), impala::HdfsTextScanner::InitNewRange(), and impala::HdfsTextScanner::ProcessSplit().
|
inline |
Returns the buffer's current offset in the file.
Definition at line 123 of file scanner-context.h.
References impala::DiskIoMgr::RequestRange::offset(), scan_range_, and total_bytes_returned_.
Referenced by eof(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsRCFileScanner::NextField(), impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsSequenceScanner::ReadBlockHeader(), impala::HdfsRCFileScanner::ReadRowGroupHeader(), impala::BaseSequenceScanner::ReadSync(), and impala::BaseSequenceScanner::SkipToSync().
|
inline |
Definition at line 118 of file scanner-context.h.
References impala::DiskIoMgr::RequestRange::file(), and scan_range_.
Referenced by impala::ScannerContext::AddStream(), impala::HdfsParquetScanner::AssembleRows(), impala::HdfsParquetScanner::CreateColumnReaders(), impala::HdfsRCFileScanner::DebugString(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsTextScanner::FinishScanRange(), impala::HdfsAvroScanner::ParseMetadata(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsRCFileScanner::ProcessRange(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsSequenceScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadNumColumnsMetadata(), impala::HdfsScanner::ReportTupleParseError(), impala::BaseSequenceScanner::SkipToSync(), impala::HdfsParquetScanner::ValidateFileMetadata(), impala::HdfsAvroScanner::VerifyTypesMatch(), and impala::HdfsTextScanner::WriteFields().
Gets the bytes from the first available buffer within the scan range. This may be the boundary buffer used to stitch IO buffers together. If we are past the end of the scan range, no bytes are returned.
Definition at line 171 of file scanner-context.cc.
References impala::Status::CANCELLED, impala::Status::OK, and RETURN_IF_ERROR.
Referenced by impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and impala::BaseSequenceScanner::SkipToSync().
|
inline |
Returns up to requested_len bytes or an error. This can block if bytes are not available.
Handle the fast common path where all the bytes are in the first buffer. This is the path used by sequence/rc/parquet file formats to read a very small number (i.e. single int) of bytes.
Definition at line 31 of file scanner-context.inline.h.
References GetBytesInternal(), LIKELY, impala::Status::ok(), output_buffer_bytes_left_, output_buffer_pos_, ReportInvalidRead(), total_bytes_returned_, and UNLIKELY.
Referenced by impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), impala::BaseSequenceScanner::ReadSync(), and impala::BaseSequenceScanner::SkipToSync().
|
private |
GetBytes helper to handle the slow path. If peek is set then return the data but do not move the current offset.
Definition at line 216 of file scanner-context.cc.
References impala::Status::CANCELLED, impala::Status::OK, and RETURN_IF_ERROR.
Referenced by GetBytes().
|
private |
Gets (and blocks) for the next io buffer. After fetching all buffers in the scan range, performs synchronous reads past the scan range until EOF. When performing a synchronous read, the read size is the max of read_past_size and the result returned by read_past_size_cb_() (or DEFAULT_READ_PAST_SIZE if no callback is set). read_past_size is not used otherwise. Updates io_buffer_, io_buffer_bytes_left_, and io_buffer_pos_. If GetNextBuffer() is called after all bytes in the file have been returned, io_buffer_bytes_left_ will be set to 0. In the non-error case, io_buffer_ is never set to NULL, even if it contains 0 bytes.
Definition at line 115 of file scanner-context.cc.
References impala::Status::CANCELLED, DEFAULT_READ_PAST_SIZE, offset, impala::Status::OK, RETURN_IF_ERROR, SCOPED_TIMER, and VLOG_FILE.
Read a Boolean primitive value written using Java serialization. Equivalent to java.io.DataInput.readBoolean()
Definition at line 95 of file scanner-context.inline.h.
References RETURN_IF_FALSE.
Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadFileHeader().
|
inline |
Read length bytes into the supplied buffer. The returned buffer is owned by this object.
Definition at line 56 of file scanner-context.inline.h.
References RETURN_IF_FALSE, and UNLIKELY.
Referenced by impala::HdfsSequenceScanner::GetRecord(), impala::HdfsAvroScanner::ParseMetadata(), impala::HdfsAvroScanner::ProcessRange(), impala::HdfsRCFileScanner::ReadColumnBuffers(), impala::HdfsSequenceScanner::ReadCompressedBlock(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), impala::HdfsAvroScanner::ReadFileHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadKeyBuffers().
Read an Integer primitive value written using Java serialization. Equivalent to java.io.DataInput.readInt()
Definition at line 102 of file scanner-context.inline.h.
References RETURN_IF_FALSE.
Referenced by impala::HdfsSequenceScanner::ProcessBlockCompressedScanRange(), impala::HdfsSequenceScanner::ProcessRange(), impala::HdfsRCFileScanner::ProcessRange(), impala::HdfsSequenceScanner::ReadBlockHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadNumColumnsMetadata(), and impala::HdfsRCFileScanner::ReadRowGroupHeader().
Read a Writable Text value from the supplied file. Ref: org.apache.hadoop.io.WritableUtils.readString() The returned buffer is owned by this object.
Definition at line 88 of file scanner-context.inline.h.
References RETURN_IF_FALSE.
Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadNumColumnsMetadata().
Read a variable length Integer value written using Writable serialization. Ref: org.apache.hadoop.io.WritableUtils.readVInt()
Definition at line 109 of file scanner-context.inline.h.
References RETURN_IF_FALSE.
Read a variable-length Long value written using Writable serialization. Ref: org.apache.hadoop.io.WritableUtils.readVLong()
Definition at line 116 of file scanner-context.inline.h.
References impala::ReadWriteUtil::DecodeVIntSize(), impala::ReadWriteUtil::IsNegativeVInt(), impala::ReadWriteUtil::MAX_VINT_LEN, and RETURN_IF_FALSE.
Referenced by impala::HdfsSequenceScanner::GetRecord(), and impala::HdfsSequenceScanner::ReadCompressedBlock().
Read a zigzag encoded long.
Definition at line 148 of file scanner-context.inline.h.
References RETURN_IF_FALSE.
Referenced by impala::HdfsAvroScanner::ParseMetadata(), and impala::HdfsAvroScanner::ProcessRange().
If 'batch' is not NULL, attaches all completed io buffers and the boundary mem pool to batch. If 'done' is set, releases the completed resources. If 'batch' is NULL then contains_tuple_data_ should be false.
Definition at line 76 of file scanner-context.cc.
References impala::MemPool::AcquireData(), impala::RowBatch::AddIoBuffer(), impala::Status::CANCELLED, and impala::RowBatch::tuple_data_pool().
|
private |
Error-reporting functions.
Definition at line 286 of file scanner-context.cc.
|
private |
Definition at line 294 of file scanner-context.cc.
Referenced by GetBytes().
|
inline |
Definition at line 119 of file scanner-context.h.
References scan_range_.
Referenced by impala::HdfsTextScanner::FindFirstTuple(), and impala::HdfsParquetScanner::ProcessFooter().
|
inline |
Sets whether of not the resulting tuples contain ptrs into memory owned by the scanner context. This by default, is inferred from the scan_node tuple descriptor (i.e. contains string slots) but can be overridden. If possible, this should be set to false to reduce memory usage as resources can be reused and recycled more quickly.
Definition at line 97 of file scanner-context.h.
References contains_tuple_data_.
Referenced by impala::HdfsParquetScanner::InitColumns(), impala::HdfsTextScanner::InitNewRange(), impala::HdfsRCFileScanner::InitNewRange(), and impala::BaseSequenceScanner::ProcessSplit().
|
inline |
Definition at line 106 of file scanner-context.h.
References read_past_size_cb_.
Referenced by impala::BaseSequenceScanner::Prepare().
Skip over the next length bytes in the specified HDFS file.
TODO: consider implementing a Skip in the context/stream object that's more efficient than GetBytes.
Definition at line 70 of file scanner-context.inline.h.
References RETURN_IF_FALSE, and UNLIKELY.
Referenced by impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsSequenceScanner::GetRecord(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsRCFileScanner::ReadColumnBuffers(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and impala::BaseSequenceScanner::SkipToSync().
Skip this text object.
Definition at line 82 of file scanner-context.inline.h.
Referenced by impala::HdfsSequenceScanner::ReadCompressedBlock(), and impala::HdfsSequenceScanner::ReadFileHeader().
|
inline |
Returns the total number of bytes returned.
Definition at line 126 of file scanner-context.h.
References total_bytes_returned_.
Referenced by impala::HdfsAvroScanner::ReadFileHeader(), impala::HdfsSequenceScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadFileHeader(), and impala::HdfsScanNode::ScannerThread().
|
friend |
Definition at line 163 of file scanner-context.h.
|
private |
Definition at line 198 of file scanner-context.h.
|
private |
Definition at line 200 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream().
|
private |
Definition at line 199 of file scanner-context.h.
|
private |
The boundary buffer is used to copy multiple IO buffers from the scan range into a single buffer to return to the scanner. After copying all or part of an IO buffer into the boundary buffer, the current buffer's state is updated to no longer include the copied bytes (e.g., io_buffer_bytes_left_ is decremented). Conceptually, the data in the boundary buffer always comes before that in the current buffer, and all the bytes in the stream are either already returned to the scanner, in the current IO buffer, or in the boundary buffer.
Definition at line 197 of file scanner-context.h.
|
private |
List of buffers that are completed but still have bytes referenced by the caller. On the next GetBytes() call, these buffers are released (the caller by calling GetBytes() signals it is done with its previous bytes). At this point the buffers are either returned to the io mgr or attached to the current row batch.
Definition at line 214 of file scanner-context.h.
|
private |
If true, tuples will contain pointers into memory contained in this object. That memory (io buffers or boundary buffers) must be attached to the row batch.
Definition at line 170 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream(), and set_contains_tuple_data().
|
private |
Definition at line 166 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream(), and file_desc().
|
private |
File length. Initialized with file_desc_->file_length but updated if eof is found earlier, i.e. the file was truncated.
Definition at line 177 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream(), and eof().
|
private |
The current io buffer. This starts as NULL before we've read any bytes.
Definition at line 182 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream().
|
private |
Bytes left in io_buffer_.
Definition at line 188 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream().
|
private |
Next byte to read in io_buffer_.
Definition at line 185 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream().
|
private |
Points to either io_buffer_bytes_left_ or boundary_buffer_bytes_left_ (initialized to a static zero-value int before calling GetBytes())
Definition at line 208 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream(), and GetBytes().
|
private |
Points to either io_buffer_pos_ or boundary_buffer_pos_ (initialized to NULL before calling GetBytes())
Definition at line 204 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream(), and GetBytes().
|
private |
Definition at line 164 of file scanner-context.h.
|
private |
Definition at line 179 of file scanner-context.h.
Referenced by set_read_past_size_cb().
|
private |
Definition at line 165 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream(), bytes_left(), eosr(), file_offset(), filename(), and scan_range().
|
private |
Total number of bytes returned from GetBytes()
Definition at line 173 of file scanner-context.h.
Referenced by impala::ScannerContext::AddStream(), bytes_left(), eosr(), file_offset(), GetBytes(), and total_bytes_returned().