Impala
Impala is the open source, native analytic database for Apache Hadoop.
impala::ScannerContext::Stream Class Reference

#include <scanner-context.h>

Public Types

typedef boost::function< int(int64_t)> ReadPastSizeCallback
 

Public Member Functions

bool GetBytes (int64_t requested_len, uint8_t **buffer, int64_t *out_len, Status *status, bool peek=false)
 
Status GetBuffer (bool peek, uint8_t **buffer, int64_t *out_len)
 
void set_contains_tuple_data (bool v)
 
void set_read_past_size_cb (ReadPastSizeCallback cb)
 
int64_t bytes_left ()
 Return the number of bytes left in the range for this stream. More...
 
bool eosr () const
 
bool eof () const
 If true, the stream has reached the end of the file. More...
 
const char * filename ()
 
const DiskIoMgr::ScanRange * scan_range ()
 
const HdfsFileDesc * file_desc ()
 
int64_t file_offset () const
 Returns the buffer's current offset in the file. More...
 
int64_t total_bytes_returned ()
 Returns the total number of bytes returned. More...
 
bool ReadBoolean (bool *boolean, Status *)
 
bool ReadInt (int32_t *val, Status *, bool peek=false)
 
bool ReadVLong (int64_t *val, Status *)
 
bool ReadVInt (int32_t *val, Status *)
 
bool ReadZLong (int64_t *val, Status *)
 Read a zigzag encoded long. More...
 
bool SkipBytes (int64_t length, Status *)
 Skip over the next length bytes in the specified HDFS file. More...
 
bool ReadBytes (int64_t length, uint8_t **buf, Status *, bool peek=false)
 
bool ReadText (uint8_t **buf, int64_t *length, Status *)
 
bool SkipText (Status *)
 Skip this text object. More...
 

Private Member Functions

 Stream (ScannerContext *parent)
 
Status GetBytesInternal (int64_t requested_len, uint8_t **buffer, bool peek, int64_t *out_len)
 
Status GetNextBuffer (int64_t read_past_size=0)
 
void ReleaseCompletedResources (RowBatch *batch, bool done)
 
Status ReportIncompleteRead (int64_t length, int64_t bytes_read)
 Error-reporting functions. More...
 
Status ReportInvalidRead (int64_t length)
 

Private Attributes

ScannerContext * parent_
 
DiskIoMgr::ScanRange * scan_range_
 
const HdfsFileDesc * file_desc_
 
bool contains_tuple_data_
 
int64_t total_bytes_returned_
 Total number of bytes returned from GetBytes() More...
 
int64_t file_len_
 
ReadPastSizeCallback read_past_size_cb_
 
DiskIoMgr::BufferDescriptor * io_buffer_
 The current io buffer. This starts as NULL before we've read any bytes. More...
 
uint8_t * io_buffer_pos_
 Next byte to read in io_buffer_. More...
 
int64_t io_buffer_bytes_left_
 Bytes left in io_buffer_. More...
 
boost::scoped_ptr< MemPool > boundary_pool_
 
boost::scoped_ptr< StringBuffer > boundary_buffer_
 
uint8_t * boundary_buffer_pos_
 
int64_t boundary_buffer_bytes_left_
 
uint8_t ** output_buffer_pos_
 
int64_t * output_buffer_bytes_left_
 
std::list< DiskIoMgr::BufferDescriptor * > completed_io_buffers_
 

Friends

class ScannerContext
 

Detailed Description

Encapsulates a stream (continuous byte range) that can be read. A context can contain one or more streams. For non-columnar files, there is only one stream; for columnar, there is one stream per column.

Definition at line 66 of file scanner-context.h.

Member Typedef Documentation

typedef boost::function<int (int64_t)> impala::ScannerContext::Stream::ReadPastSizeCallback

Callback that returns the buffer size to use when reading past the end of the scan range. By default a constant value is used, which scanners can override with this callback. The callback takes the file offset of the asynchronous read (this may be more than file_offset() due to data being assembled in the boundary buffer). Reading past the end of the scan range is likely a remote read, so we want to minimize the number of io requests as well as the data volume.
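A minimal sketch of what such a callback might look like. It uses std::function in place of boost::function so that it compiles without Boost, and the sizing policy (read up to the next fixed-interval boundary) and the name MakeBoundaryCallback are hypothetical, chosen only to illustrate the int(int64_t) shape of the callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Stand-in for ReadPastSizeCallback; std::function replaces boost::function
// here so the sketch has no Boost dependency.
using ReadPastSizeCallback = std::function<int(int64_t)>;

// Hypothetical policy: given the file offset of the read, ask for just enough
// bytes to reach the next multiple of `interval`, so that a single request
// covers the remainder of the current block.
inline ReadPastSizeCallback MakeBoundaryCallback(int64_t interval) {
  assert(interval > 0);
  return [interval](int64_t file_offset) {
    return static_cast<int>(interval - (file_offset % interval));
  };
}
```

A scanner would then pass the result to set_read_past_size_cb(); what policy a real scanner uses is not stated on this page.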

Definition at line 105 of file scanner-context.h.

Constructor & Destructor Documentation

ScannerContext::Stream::Stream ( ScannerContext *  parent)
private

Definition at line 52 of file scanner-context.cc.

Member Function Documentation

int64_t impala::ScannerContext::Stream::bytes_left ( )
inline

Return the number of bytes left in the range for this stream.

Definition at line 109 of file scanner-context.h.

References impala::DiskIoMgr::RequestRange::len(), scan_range_, and total_bytes_returned_.

Referenced by impala::BaseSequenceScanner::Close(), and impala::BaseSequenceScanner::SkipToSync().

Status ScannerContext::Stream::GetBuffer ( bool  peek,
uint8_t **  buffer,
int64_t *  out_len 
)

Gets the bytes from the first available buffer within the scan range. This may be the boundary buffer used to stitch IO buffers together. If we are past the end of the scan range, no bytes are returned.

Definition at line 171 of file scanner-context.cc.

References impala::Status::CANCELLED, impala::Status::OK, and RETURN_IF_ERROR.

Referenced by impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsParquetScanner::ProcessFooter(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and impala::BaseSequenceScanner::SkipToSync().

bool ScannerContext::Stream::GetBytes ( int64_t  requested_len,
uint8_t **  buffer,
int64_t *  out_len,
Status *  status,
bool  peek = false 
)
inline

Returns up to requested_len bytes or an error. This can block if bytes are not available.

  • requested_len is the number of bytes requested. This function returns that many bytes unless end of file is reached or an error occurs.
  • If peek is true, the scan range position is not incremented (i.e. repeated calls with peek = true will return the same data).
  • *buffer on return is a pointer to the buffer. The memory is owned by the ScannerContext and should not be modified. If the buffer is entirely from one disk io buffer, a pointer inside that buffer is returned directly. If the requested buffer straddles io buffers, a copy is done here.
  • *out_len is the number of bytes returned.
  • *status is set if there is an error.

Returns true if the call succeeded (i.e. status->ok()). This should only be called from the scanner thread. Note that this will return bytes past the end of the scan range until the end of the file.

Handle the fast common path where all the bytes are in the first buffer. This is the path used by sequence/rc/parquet file formats to read a very small number (i.e. single int) of bytes.
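The peek and out_len behavior described above can be illustrated with a small in-memory stand-in. PeekableStream is hypothetical: it models a single contiguous buffer only, omits the Status argument, and treats a negative requested_len as the invalid-read case, which is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for the stream. It models only the peek and
// short-read behavior; there are no IO buffers, no blocking, and no Status.
class PeekableStream {
 public:
  explicit PeekableStream(std::vector<uint8_t> data) : data_(std::move(data)) {}

  // Returns true on success. A negative requested_len is rejected
  // (assumed to correspond to the invalid-read error case).
  bool GetBytes(int64_t requested_len, uint8_t** buffer, int64_t* out_len,
                bool peek = false) {
    assert(buffer != nullptr && out_len != nullptr);
    if (requested_len < 0) return false;
    int64_t left = static_cast<int64_t>(data_.size()) - pos_;
    *out_len = requested_len < left ? requested_len : left;  // short at EOF
    *buffer = data_.data() + pos_;  // pointer into memory owned by the stream
    if (!peek) pos_ += *out_len;    // peek leaves the position untouched
    return true;
  }

  int64_t total_bytes_returned() const { return pos_; }

 private:
  std::vector<uint8_t> data_;
  int64_t pos_ = 0;
};
```

Two consecutive calls with peek = true return the same pointer and length; only a call with peek = false advances the position.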

Definition at line 31 of file scanner-context.inline.h.

References GetBytesInternal(), LIKELY, impala::Status::ok(), output_buffer_bytes_left_, output_buffer_pos_, ReportInvalidRead(), total_bytes_returned_, and UNLIKELY.

Referenced by impala::HdfsTextScanner::FillByteBuffer(), impala::HdfsTextScanner::FillByteBufferCompressedFile(), impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), impala::BaseSequenceScanner::ReadSync(), and impala::BaseSequenceScanner::SkipToSync().

Status ScannerContext::Stream::GetBytesInternal ( int64_t  requested_len,
uint8_t **  buffer,
bool  peek,
int64_t *  out_len 
)
private

GetBytes helper to handle the slow path. If peek is set then return the data but do not move the current offset.

Definition at line 216 of file scanner-context.cc.

References impala::Status::CANCELLED, impala::Status::OK, and RETURN_IF_ERROR.

Referenced by GetBytes().

Status ScannerContext::Stream::GetNextBuffer ( int64_t  read_past_size = 0)
private

Gets (and blocks for) the next io buffer. After fetching all buffers in the scan range, performs synchronous reads past the scan range until EOF. When performing a synchronous read, the read size is the max of read_past_size and the result returned by read_past_size_cb_() (or DEFAULT_READ_PAST_SIZE if no callback is set). read_past_size is not used otherwise. Updates io_buffer_, io_buffer_bytes_left_, and io_buffer_pos_. If GetNextBuffer() is called after all bytes in the file have been returned, io_buffer_bytes_left_ will be set to 0. In the non-error case, io_buffer_ is never set to NULL, even if it contains 0 bytes.
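The read-size rule for synchronous reads can be written down as a short sketch. SyncReadSize and kAssumedDefaultReadPastSize are hypothetical names; the page names DEFAULT_READ_PAST_SIZE but does not give its value, so 1024 below is a placeholder, and std::function stands in for boost::function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

using ReadPastSizeCallback = std::function<int(int64_t)>;

// Placeholder value; the actual DEFAULT_READ_PAST_SIZE is not given here.
constexpr int64_t kAssumedDefaultReadPastSize = 1024;

// Size of a synchronous read past the scan range: the larger of the caller's
// read_past_size and the callback result, falling back to the default when
// no callback is set.
inline int64_t SyncReadSize(int64_t read_past_size, int64_t file_offset,
                            const ReadPastSizeCallback& cb) {
  assert(read_past_size >= 0);
  int64_t cb_size = cb ? cb(file_offset) : kAssumedDefaultReadPastSize;
  return std::max(read_past_size, cb_size);
}
```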

Definition at line 115 of file scanner-context.cc.

References impala::Status::CANCELLED, DEFAULT_READ_PAST_SIZE, offset, impala::Status::OK, RETURN_IF_ERROR, SCOPED_TIMER, and VLOG_FILE.

bool ScannerContext::Stream::ReadBoolean ( bool *  boolean,
Status *  status 
)
inline

Read a Boolean primitive value written using Java serialization. Equivalent to java.io.DataInput.readBoolean()

Definition at line 95 of file scanner-context.inline.h.

References RETURN_IF_FALSE.

Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadFileHeader().

bool ScannerContext::Stream::ReadInt ( int32_t *  val,
Status *  status,
bool  peek = false 
)
inline

bool ScannerContext::Stream::ReadText ( uint8_t **  buf,
int64_t *  length,
Status *  status 
)
inline

Read a Writable Text value from the supplied file. Ref: org.apache.hadoop.io.WritableUtils.readString(). The returned buffer is owned by this object.

Definition at line 88 of file scanner-context.inline.h.

References RETURN_IF_FALSE.

Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), impala::HdfsRCFileScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadNumColumnsMetadata().

bool ScannerContext::Stream::ReadVInt ( int32_t *  val,
Status *  status 
)
inline

Read a variable length Integer value written using Writable serialization. Ref: org.apache.hadoop.io.WritableUtils.readVInt()

Definition at line 109 of file scanner-context.inline.h.

References RETURN_IF_FALSE.

bool ScannerContext::Stream::ReadVLong ( int64_t *  val,
Status *  status 
)
inline

Read a variable-length Long value written using Writable serialization. Ref: org.apache.hadoop.io.WritableUtils.readVLong()
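For readers unfamiliar with the Writable variable-length format, the following standalone sketch decodes one value from a byte buffer, following the logic of Hadoop's WritableUtils.readVLong. The helper names echo those listed under References but are reimplemented here for illustration; this is not Impala's code, and DecodeVLong with its buffer-based signature is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Total encoded size in bytes, derived from the first byte
// (same rule as WritableUtils.decodeVIntSize).
inline int DecodeVIntSize(int8_t first) {
  if (first >= -112) return 1;
  if (first < -120) return -119 - first;
  return -111 - first;
}

// True if the first byte marks a negative value
// (same rule as WritableUtils.isNegativeVInt).
inline bool IsNegativeVInt(int8_t first) {
  return first < -120 || (first >= -112 && first < 0);
}

// Decodes one VLong from buf[0..len). Returns the number of bytes consumed,
// or -1 if the buffer is too short to hold the whole value.
inline int DecodeVLong(const uint8_t* buf, size_t len, int64_t* val) {
  assert(buf != nullptr && val != nullptr);
  if (len == 0) return -1;
  int8_t first = static_cast<int8_t>(buf[0]);
  int size = DecodeVIntSize(first);
  if (static_cast<size_t>(size) > len) return -1;
  if (size == 1) {
    *val = first;  // small values are stored directly in the first byte
    return 1;
  }
  uint64_t acc = 0;
  for (int i = 1; i < size; ++i) acc = (acc << 8) | buf[i];  // big-endian
  // Negative values are stored as the one's complement of the magnitude.
  *val = static_cast<int64_t>(IsNegativeVInt(first) ? ~acc : acc);
  return size;
}
```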

Definition at line 116 of file scanner-context.inline.h.

References impala::ReadWriteUtil::DecodeVIntSize(), impala::ReadWriteUtil::IsNegativeVInt(), impala::ReadWriteUtil::MAX_VINT_LEN, and RETURN_IF_FALSE.

Referenced by impala::HdfsSequenceScanner::GetRecord(), and impala::HdfsSequenceScanner::ReadCompressedBlock().

bool ScannerContext::Stream::ReadZLong ( int64_t *  val,
Status *  status 
)
inline

Read a zigzag encoded long.
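A short sketch of zigzag decoding may help here. The page says only that the value is zigzag encoded; the base-128 varint framing below is an assumption based on how Avro serializes longs (the callers listed for this function are in the Avro scanner). ZigZagDecode and DecodeZLong are hypothetical names operating on a plain byte buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Maps the unsigned zigzag form back to a signed value:
// 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
inline int64_t ZigZagDecode(uint64_t n) {
  return static_cast<int64_t>(n >> 1) ^ -static_cast<int64_t>(n & 1);
}

// Reads a base-128 varint (7 payload bits per byte, low group first, high bit
// set on every byte except the last) and zigzag-decodes it. Returns the number
// of bytes consumed, or -1 if the value is truncated or longer than 10 bytes.
inline int DecodeZLong(const uint8_t* buf, size_t len, int64_t* val) {
  assert(buf != nullptr && val != nullptr);
  uint64_t raw = 0;
  for (size_t i = 0; i < len && i < 10; ++i) {
    raw |= static_cast<uint64_t>(buf[i] & 0x7F) << (7 * i);
    if ((buf[i] & 0x80) == 0) {
      *val = ZigZagDecode(raw);
      return static_cast<int>(i + 1);
    }
  }
  return -1;
}
```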

Definition at line 148 of file scanner-context.inline.h.

References RETURN_IF_FALSE.

Referenced by impala::HdfsAvroScanner::ParseMetadata(), and impala::HdfsAvroScanner::ProcessRange().

void ScannerContext::Stream::ReleaseCompletedResources ( RowBatch *  batch,
bool  done 
)
private

If 'batch' is not NULL, attaches all completed io buffers and the boundary mem pool to batch. If 'done' is set, releases the completed resources. If 'batch' is NULL then contains_tuple_data_ should be false.

Definition at line 76 of file scanner-context.cc.

References impala::MemPool::AcquireData(), impala::RowBatch::AddIoBuffer(), impala::Status::CANCELLED, and impala::RowBatch::tuple_data_pool().

Status ScannerContext::Stream::ReportIncompleteRead ( int64_t  length,
int64_t  bytes_read 
)
private

Error-reporting functions.

Definition at line 286 of file scanner-context.cc.

Status ScannerContext::Stream::ReportInvalidRead ( int64_t  length)
private

Definition at line 294 of file scanner-context.cc.

Referenced by GetBytes().

const DiskIoMgr::ScanRange* impala::ScannerContext::Stream::scan_range ( )
inline

void impala::ScannerContext::Stream::set_contains_tuple_data ( bool  v)
inline

Sets whether or not the resulting tuples contain pointers into memory owned by the scanner context. By default this is inferred from the scan_node tuple descriptor (i.e. whether it contains string slots), but it can be overridden. If possible, this should be set to false to reduce memory usage, as resources can then be reused and recycled more quickly.

Definition at line 97 of file scanner-context.h.

References contains_tuple_data_.

Referenced by impala::HdfsParquetScanner::InitColumns(), impala::HdfsTextScanner::InitNewRange(), impala::HdfsRCFileScanner::InitNewRange(), and impala::BaseSequenceScanner::ProcessSplit().

void impala::ScannerContext::Stream::set_read_past_size_cb ( ReadPastSizeCallback  cb)
inline

Definition at line 106 of file scanner-context.h.

References read_past_size_cb_.

Referenced by impala::BaseSequenceScanner::Prepare().

bool ScannerContext::Stream::SkipBytes ( int64_t  length,
Status *  status 
)
inline

Skip over the next length bytes in the specified HDFS file.

TODO: consider implementing a Skip in the context/stream object that's more efficient than GetBytes.

Definition at line 70 of file scanner-context.inline.h.

References RETURN_IF_FALSE, and UNLIKELY.

Referenced by impala::HdfsTextScanner::FillByteBufferGzip(), impala::HdfsSequenceScanner::GetRecord(), impala::BaseSequenceScanner::ProcessSplit(), impala::HdfsRCFileScanner::ReadColumnBuffers(), impala::HdfsParquetScanner::BaseColumnReader::ReadDataPage(), and impala::BaseSequenceScanner::SkipToSync().

bool ScannerContext::Stream::SkipText ( Status *  status)
inline

int64_t impala::ScannerContext::Stream::total_bytes_returned ( )
inline

Friends And Related Function Documentation

friend class ScannerContext
friend

Definition at line 163 of file scanner-context.h.

Member Data Documentation

boost::scoped_ptr<StringBuffer> impala::ScannerContext::Stream::boundary_buffer_
private

Definition at line 198 of file scanner-context.h.

int64_t impala::ScannerContext::Stream::boundary_buffer_bytes_left_
private

Definition at line 200 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream().

uint8_t* impala::ScannerContext::Stream::boundary_buffer_pos_
private

Definition at line 199 of file scanner-context.h.

boost::scoped_ptr<MemPool> impala::ScannerContext::Stream::boundary_pool_
private

The boundary buffer is used to copy multiple IO buffers from the scan range into a single buffer to return to the scanner. After copying all or part of an IO buffer into the boundary buffer, the current buffer's state is updated to no longer include the copied bytes (e.g., io_buffer_bytes_left_ is decremented). Conceptually, the data in the boundary buffer always comes before that in the current buffer, and all the bytes in the stream are either already returned to the scanner, in the current IO buffer, or in the boundary buffer.
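The stitching behavior described above can be sketched with plain vectors standing in for IO buffers. StitchingReader is hypothetical and greatly simplified: it has no MemPool, no blocking, and no reads past the scan range. It only shows the direct-pointer path versus the copy path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical reader over a fixed list of in-memory "IO buffers".
class StitchingReader {
 public:
  explicit StitchingReader(std::vector<std::vector<uint8_t>> io_buffers)
      : io_buffers_(std::move(io_buffers)) {}

  // Returns a pointer to up to `len` contiguous bytes; *out_len is the count.
  // The pointer is valid until the next call.
  const uint8_t* Get(int64_t len, int64_t* out_len) {
    assert(len >= 0 && out_len != nullptr);
    SkipExhausted();
    if (cur_ < io_buffers_.size() && Left() >= len) {
      // Request fits in the current IO buffer: return a pointer into it.
      const uint8_t* p = io_buffers_[cur_].data() + pos_;
      pos_ += len;
      *out_len = len;
      copied_ = false;
      return p;
    }
    // Request straddles IO buffers: copy the pieces into the boundary buffer.
    boundary_.clear();
    copied_ = true;
    while (static_cast<int64_t>(boundary_.size()) < len &&
           cur_ < io_buffers_.size()) {
      int64_t need = len - static_cast<int64_t>(boundary_.size());
      int64_t n = std::min(need, Left());
      const uint8_t* src = io_buffers_[cur_].data() + pos_;
      boundary_.insert(boundary_.end(), src, src + n);
      pos_ += n;  // copied bytes no longer count as part of the current buffer
      SkipExhausted();
    }
    *out_len = static_cast<int64_t>(boundary_.size());
    return boundary_.data();
  }

  bool last_read_copied() const { return copied_; }

 private:
  int64_t Left() const {
    return static_cast<int64_t>(io_buffers_[cur_].size()) - pos_;
  }
  void SkipExhausted() {
    while (cur_ < io_buffers_.size() && Left() == 0) {
      ++cur_;
      pos_ = 0;
    }
  }

  std::vector<std::vector<uint8_t>> io_buffers_;
  std::vector<uint8_t> boundary_;
  size_t cur_ = 0;
  int64_t pos_ = 0;
  bool copied_ = false;
};
```

As in the description, bytes in the boundary buffer always precede those remaining in the current IO buffer, and every byte is in exactly one of three places: already returned, in the boundary buffer, or in the current IO buffer.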

Definition at line 197 of file scanner-context.h.

std::list<DiskIoMgr::BufferDescriptor*> impala::ScannerContext::Stream::completed_io_buffers_
private

List of buffers that are completed but still have bytes referenced by the caller. On the next GetBytes() call, these buffers are released (the caller by calling GetBytes() signals it is done with its previous bytes). At this point the buffers are either returned to the io mgr or attached to the current row batch.

Definition at line 214 of file scanner-context.h.

bool impala::ScannerContext::Stream::contains_tuple_data_
private

If true, tuples will contain pointers into memory contained in this object. That memory (io buffers or boundary buffers) must be attached to the row batch.

Definition at line 170 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream(), and set_contains_tuple_data().

const HdfsFileDesc* impala::ScannerContext::Stream::file_desc_
private

Definition at line 166 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream(), and file_desc().

int64_t impala::ScannerContext::Stream::file_len_
private

File length. Initialized with file_desc_->file_length but updated if eof is found earlier, i.e. the file was truncated.

Definition at line 177 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream(), and eof().

DiskIoMgr::BufferDescriptor* impala::ScannerContext::Stream::io_buffer_
private

The current io buffer. This starts as NULL before we've read any bytes.

Definition at line 182 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream().

int64_t impala::ScannerContext::Stream::io_buffer_bytes_left_
private

Bytes left in io_buffer_.

Definition at line 188 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream().

uint8_t* impala::ScannerContext::Stream::io_buffer_pos_
private

Next byte to read in io_buffer_.

Definition at line 185 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream().

int64_t* impala::ScannerContext::Stream::output_buffer_bytes_left_
private

Points to either io_buffer_bytes_left_ or boundary_buffer_bytes_left_ (initialized to a static zero-value int before calling GetBytes())

Definition at line 208 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream(), and GetBytes().

uint8_t** impala::ScannerContext::Stream::output_buffer_pos_
private

Points to either io_buffer_pos_ or boundary_buffer_pos_ (initialized to NULL before calling GetBytes())
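The double indirection used by output_buffer_pos_ and output_buffer_bytes_left_ can be illustrated as follows. Cursor, UseIoBuffer, UseBoundaryBuffer, and TryFastRead are hypothetical names; the sketch only shows why pointing at one of two position/length pairs lets a fast path read through a single code path regardless of which buffer is active.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cursor holding two position/length pairs and a pair of
// pointers that select whichever one is currently active.
struct Cursor {
  Cursor() = default;
  Cursor(const Cursor&) = delete;             // members point into *this
  Cursor& operator=(const Cursor&) = delete;

  // Shared zero counter, so the fast path fails cleanly before any buffer
  // has been selected (mirrors the "static zero-value int" in the text).
  static int64_t* ZeroBytesLeft() {
    static int64_t zero = 0;
    return &zero;
  }

  uint8_t* io_buffer_pos = nullptr;
  int64_t io_buffer_bytes_left = 0;
  uint8_t* boundary_buffer_pos = nullptr;
  int64_t boundary_buffer_bytes_left = 0;

  uint8_t** output_buffer_pos = nullptr;
  int64_t* output_buffer_bytes_left = ZeroBytesLeft();

  void UseIoBuffer() {
    output_buffer_pos = &io_buffer_pos;
    output_buffer_bytes_left = &io_buffer_bytes_left;
  }
  void UseBoundaryBuffer() {
    output_buffer_pos = &boundary_buffer_pos;
    output_buffer_bytes_left = &boundary_buffer_bytes_left;
  }

  // Serves the request from the active buffer if it holds enough bytes;
  // advancing through the pointers updates the selected pair in place.
  bool TryFastRead(int64_t len, uint8_t** out) {
    assert(out != nullptr);
    if (len <= 0 || *output_buffer_bytes_left < len) return false;
    *out = *output_buffer_pos;
    *output_buffer_pos += len;
    *output_buffer_bytes_left -= len;
    return true;
  }
};
```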

Definition at line 204 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream(), and GetBytes().

ScannerContext* impala::ScannerContext::Stream::parent_
private

Definition at line 164 of file scanner-context.h.

ReadPastSizeCallback impala::ScannerContext::Stream::read_past_size_cb_
private

Definition at line 179 of file scanner-context.h.

Referenced by set_read_past_size_cb().

DiskIoMgr::ScanRange* impala::ScannerContext::Stream::scan_range_
private

int64_t impala::ScannerContext::Stream::total_bytes_returned_
private

Total number of bytes returned from GetBytes()

Definition at line 173 of file scanner-context.h.

Referenced by impala::ScannerContext::AddStream(), bytes_left(), eosr(), file_offset(), GetBytes(), and total_bytes_returned().


The documentation for this class was generated from the following files: