Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros
impala::RleEncoder Class Reference

#include <rle-encoding.h>

Collaboration diagram for impala::RleEncoder:

Public Member Functions

 RleEncoder (uint8_t *buffer, int buffer_len, int bit_width)
 
bool Put (uint64_t value)
 
int Flush ()
 
void Clear ()
 Resets all the state in the encoder. More...
 
uint8_t * buffer ()
 Returns pointer to underlying buffer. More...
 
int32_t len ()
 

Static Public Member Functions

static int MinBufferSize (int bit_width)
 
static int MaxBufferSize (int bit_width, int num_values)
 Returns the maximum byte size it could take to encode 'num_values'. More...
 

Private Member Functions

void FlushBufferedValues (bool done)
 
void FlushLiteralRun (bool update_indicator_byte)
 
void FlushRepeatedRun ()
 Flushes a repeated run to the underlying buffer. More...
 
void CheckBufferFull ()
 

Private Attributes

const int bit_width_
 Number of bits needed to encode the value. More...
 
BitWriter bit_writer_
 Underlying buffer. More...
 
bool buffer_full_
 If true, the buffer is full and subsequent Put()'s will fail. More...
 
int max_run_byte_size_
 The maximum byte size a single run can take. More...
 
int64_t buffered_values_ [8]
 
int num_buffered_values_
 Number of values in buffered_values_. More...
 
int64_t current_value_
 
int repeat_count_
 
int literal_count_
 
uint8_t * literal_indicator_byte_
 

Static Private Attributes

static const int MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8
 

Detailed Description

Class to incrementally build the rle data. This class does not allocate any memory. The encoding has two modes: encoding repeated runs and literal runs. If the run is sufficiently short, it is more efficient to encode as a literal run. This class does so by buffering 8 values at a time. If they are not all the same they are added to the literal run. If they are the same, they are added to the repeated run. When we switch modes, the previous run is flushed out.

Definition at line 111 of file rle-encoding.h.

Constructor & Destructor Documentation

impala::RleEncoder::RleEncoder ( uint8_t *  buffer,
int  buffer_len,
int  bit_width 
)
inline

buffer/buffer_len: preallocated output buffer. bit_width: max number of bits for value. TODO: consider adding a min_repeated_run_length so the caller can control when values should be encoded as repeated runs. Currently this is derived based on the bit_width, which can determine a storage optimal choice. TODO: allow 0 bit_width (and have dict encoder use it)

Definition at line 119 of file rle-encoding.h.

References bit_width_, Clear(), max_run_byte_size_, and MinBufferSize().

Member Function Documentation

uint8_t* impala::RleEncoder::buffer ( )
inline

Returns pointer to underlying buffer.

Definition at line 161 of file rle-encoding.h.

References bit_writer_, and impala::BitWriter::buffer().

void impala::RleEncoder::CheckBufferFull ( )
inlineprivate

Checks and sets buffer_full_. This must be called after flushing a run to make sure there are enough bytes remaining to encode the next run.

Definition at line 397 of file rle-encoding.h.

References bit_writer_, buffer_full_, impala::BitWriter::buffer_len(), impala::BitWriter::bytes_written(), and max_run_byte_size_.

Referenced by FlushLiteralRun(), and FlushRepeatedRun().

void impala::RleEncoder::Clear ( )
inline

Resets all the state in the encoder.

Definition at line 404 of file rle-encoding.h.

References bit_writer_, buffer_full_, impala::BitWriter::Clear(), current_value_, literal_count_, literal_indicator_byte_, num_buffered_values_, and repeat_count_.

Referenced by RleEncoder().

int impala::RleEncoder::Flush ( )
inline

Flushes any pending values to the underlying buffer. Returns the total number of bytes written

Definition at line 370 of file rle-encoding.h.

References bit_writer_, buffered_values_, impala::BitWriter::bytes_written(), impala::BitWriter::Flush(), FlushLiteralRun(), FlushRepeatedRun(), literal_count_, num_buffered_values_, and repeat_count_.

Referenced by impala::TEST(), impala::ValidateRle(), and impala::DictEncoderBase::WriteData().

void impala::RleEncoder::FlushBufferedValues ( bool  done)
inlineprivate

Flushes any buffered values. If this is part of a repeated run, this is largely a no-op. If it is part of a literal run, this will call FlushLiteralRun, which writes out the buffered literal values. If 'done' is true, the current run would be written even if it would normally have been buffered more. This should only be called at the end, when the encoder has received all values even if it would normally continue to be buffered.

Flush the values that have been buffered. At this point we decide whether we need to switch between the run types or continue the current one.

Definition at line 340 of file rle-encoding.h.

References FlushLiteralRun(), literal_count_, literal_indicator_byte_, num_buffered_values_, and repeat_count_.

Referenced by Put().

void impala::RleEncoder::FlushLiteralRun ( bool  update_indicator_byte)
inlineprivate

Flushes literal values to the underlying buffer. If update_indicator_byte, then the current literal run is complete and the indicator byte is updated.

Definition at line 295 of file rle-encoding.h.

References bit_width_, bit_writer_, buffered_values_, CheckBufferFull(), impala::BitWriter::GetNextBytePtr(), literal_count_, literal_indicator_byte_, num_buffered_values_, and impala::BitWriter::PutValue().

Referenced by Flush(), and FlushBufferedValues().

void impala::RleEncoder::FlushRepeatedRun ( )
inlineprivate

Flushes a repeated run to the underlying buffer.

Definition at line 325 of file rle-encoding.h.

References bit_width_, bit_writer_, impala::BitUtil::Ceil(), CheckBufferFull(), current_value_, num_buffered_values_, impala::BitWriter::PutAligned(), impala::BitWriter::PutVlqInt(), and repeat_count_.

Referenced by Flush(), and Put().

int32_t impala::RleEncoder::len ( )
inline

Definition at line 162 of file rle-encoding.h.

References bit_writer_, and impala::BitWriter::bytes_written().

Referenced by impala::DictEncoderBase::WriteData().

static int impala::RleEncoder::MaxBufferSize ( int  bit_width,
int  num_values 
)
inlinestatic

Returns the maximum byte size it could take to encode 'num_values'.

Definition at line 142 of file rle-encoding.h.

References impala::BitUtil::Ceil(), MAX_VALUES_PER_LITERAL_RUN, and MinBufferSize().

Referenced by impala::DictEncoderBase::EstimatedDataEncodedSize().

static int impala::RleEncoder::MinBufferSize ( int  bit_width)
inlinestatic

Returns the minimum buffer size needed to use the encoder for 'bit_width' This is the maximum length of a single run for 'bit_width'. It is not valid to pass a buffer less than this length.

1 indicator byte and MAX_VALUES_PER_LITERAL_RUN 'bit_width' values.

Up to MAX_VLQ_BYTE_LEN indicator and a single 'bit_width' value.

Definition at line 132 of file rle-encoding.h.

References impala::BitUtil::Ceil(), MAX_VALUES_PER_LITERAL_RUN, and impala::BitReader::MAX_VLQ_BYTE_LEN.

Referenced by MaxBufferSize(), RleEncoder(), and impala::TEST().

bool impala::RleEncoder::Put ( uint64_t  value)
inline

Encode value. Returns true if the value fits in buffer, false otherwise. This value must be representable with bit_width_ bits.

This function buffers input values 8 at a time. After seeing all 8 values, it decides whether they should be encoded as a literal or repeated run.

Definition at line 264 of file rle-encoding.h.

References bit_width_, buffer_full_, buffered_values_, current_value_, FlushBufferedValues(), FlushRepeatedRun(), LIKELY, literal_count_, num_buffered_values_, repeat_count_, and UNLIKELY.

Referenced by impala::TEST(), impala::ValidateRle(), and impala::DictEncoderBase::WriteData().

Member Data Documentation

const int impala::RleEncoder::bit_width_
private

Number of bits needed to encode the value.

Definition at line 191 of file rle-encoding.h.

Referenced by FlushLiteralRun(), FlushRepeatedRun(), Put(), and RleEncoder().

BitWriter impala::RleEncoder::bit_writer_
private

Underlying buffer.

Definition at line 194 of file rle-encoding.h.

Referenced by buffer(), CheckBufferFull(), Clear(), Flush(), FlushLiteralRun(), FlushRepeatedRun(), and len().

bool impala::RleEncoder::buffer_full_
private

If true, the buffer is full and subsequent Put()'s will fail.

Definition at line 197 of file rle-encoding.h.

Referenced by CheckBufferFull(), Clear(), and Put().

int64_t impala::RleEncoder::buffered_values_[8]
private

We need to buffer at most 8 values for literals. This happens when the bit_width is 1 (so 8 values fit in one byte). TODO: generalize this to other bit widths

Definition at line 205 of file rle-encoding.h.

Referenced by Flush(), FlushLiteralRun(), and Put().

int64_t impala::RleEncoder::current_value_
private

The current (also last) value that was written and the count of how many times in a row that value has been seen. This is maintained even if we are in a literal run. If the repeat_count_ get high enough, we switch to encoding repeated runs.

Definition at line 214 of file rle-encoding.h.

Referenced by Clear(), FlushRepeatedRun(), and Put().

int impala::RleEncoder::literal_count_
private

Number of literals in the current run. This does not include the literals that might be in buffered_values_. Only after we've got a group big enough can we decide if they should part of the literal_count_ or repeat_count_

Definition at line 220 of file rle-encoding.h.

Referenced by Clear(), Flush(), FlushBufferedValues(), FlushLiteralRun(), and Put().

uint8_t* impala::RleEncoder::literal_indicator_byte_
private

Pointer to a byte in the underlying buffer that stores the indicator byte. This is reserved as soon as we need a literal run but the value is written when the literal run is complete.

Definition at line 225 of file rle-encoding.h.

Referenced by Clear(), FlushBufferedValues(), and FlushLiteralRun().

int impala::RleEncoder::max_run_byte_size_
private

The maximum byte size a single run can take.

Definition at line 200 of file rle-encoding.h.

Referenced by CheckBufferFull(), and RleEncoder().

const int impala::RleEncoder::MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8
staticprivate

The maximum number of values in a single literal run (number of groups encodable by a 1-byte indicator * 8)

Definition at line 188 of file rle-encoding.h.

Referenced by MaxBufferSize(), and MinBufferSize().

int impala::RleEncoder::num_buffered_values_
private

Number of values in buffered_values_.

Definition at line 208 of file rle-encoding.h.

Referenced by Clear(), Flush(), FlushBufferedValues(), FlushLiteralRun(), FlushRepeatedRun(), and Put().

int impala::RleEncoder::repeat_count_
private

Definition at line 215 of file rle-encoding.h.

Referenced by Clear(), Flush(), FlushBufferedValues(), FlushRepeatedRun(), and Put().


The documentation for this class was generated from the following file: