Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
|
#include <rle-encoding.h>
Public Member Functions | |
RleEncoder (uint8_t *buffer, int buffer_len, int bit_width) | |
bool | Put (uint64_t value) |
int | Flush () |
void | Clear () |
Resets all the state in the encoder. More... | |
uint8_t * | buffer () |
Returns pointer to underlying buffer. More... | |
int32_t | len () |
Static Public Member Functions | |
static int | MinBufferSize (int bit_width) |
static int | MaxBufferSize (int bit_width, int num_values) |
Returns the maximum byte size it could take to encode 'num_values'. More... | |
Private Member Functions | |
void | FlushBufferedValues (bool done) |
void | FlushLiteralRun (bool update_indicator_byte) |
void | FlushRepeatedRun () |
Flushes a repeated run to the underlying buffer. More... | |
void | CheckBufferFull () |
Private Attributes | |
const int | bit_width_ |
Number of bits needed to encode the value. More... | |
BitWriter | bit_writer_ |
Underlying buffer. More... | |
bool | buffer_full_ |
If true, the buffer is full and subsequent Put()'s will fail. More... | |
int | max_run_byte_size_ |
The maximum byte size a single run can take. More... | |
int64_t | buffered_values_ [8] |
int | num_buffered_values_ |
Number of values in buffered_values_. More... | |
int64_t | current_value_ |
int | repeat_count_ |
int | literal_count_ |
uint8_t * | literal_indicator_byte_ |
Static Private Attributes | |
static const int | MAX_VALUES_PER_LITERAL_RUN = (1 << 6) * 8 |
Class to incrementally build the rle data. This class does not allocate any memory. The encoding has two modes: encoding repeated runs and literal runs. If the run is sufficiently short, it is more efficient to encode as a literal run. This class does so by buffering 8 values at a time. If they are not all the same they are added to the literal run. If they are the same, they are added to the repeated run. When we switch modes, the previous run is flushed out.
Definition at line 111 of file rle-encoding.h.
|
inline |
buffer/buffer_len: preallocated output buffer. bit_width: max number of bits for value. TODO: consider adding a min_repeated_run_length so the caller can control when values should be encoded as repeated runs. Currently this is derived based on the bit_width, which can determine a storage optimal choice. TODO: allow 0 bit_width (and have dict encoder use it)
Definition at line 119 of file rle-encoding.h.
References bit_width_, Clear(), max_run_byte_size_, and MinBufferSize().
|
inline |
Returns pointer to underlying buffer.
Definition at line 161 of file rle-encoding.h.
References bit_writer_, and impala::BitWriter::buffer().
|
inlineprivate |
Checks and sets buffer_full_. This must be called after flushing a run to make sure there are enough bytes remaining to encode the next run.
Definition at line 397 of file rle-encoding.h.
References bit_writer_, buffer_full_, impala::BitWriter::buffer_len(), impala::BitWriter::bytes_written(), and max_run_byte_size_.
Referenced by FlushLiteralRun(), and FlushRepeatedRun().
|
inline |
Resets all the state in the encoder.
Definition at line 404 of file rle-encoding.h.
References bit_writer_, buffer_full_, impala::BitWriter::Clear(), current_value_, literal_count_, literal_indicator_byte_, num_buffered_values_, and repeat_count_.
Referenced by RleEncoder().
|
inline |
Flushes any pending values to the underlying buffer. Returns the total number of bytes written
Definition at line 370 of file rle-encoding.h.
References bit_writer_, buffered_values_, impala::BitWriter::bytes_written(), impala::BitWriter::Flush(), FlushLiteralRun(), FlushRepeatedRun(), literal_count_, num_buffered_values_, and repeat_count_.
Referenced by impala::TEST(), impala::ValidateRle(), and impala::DictEncoderBase::WriteData().
|
inlineprivate |
Flushes any buffered values. If this is part of a repeated run, this is largely a no-op. If it is part of a literal run, this will call FlushLiteralRun, which writes out the buffered literal values. If 'done' is true, the current run would be written even if it would normally have been buffered more. This should only be called at the end, when the encoder has received all values even if it would normally continue to be buffered.
Flush the values that have been buffered. At this point we decide whether we need to switch between the run types or continue the current one.
Definition at line 340 of file rle-encoding.h.
References FlushLiteralRun(), literal_count_, literal_indicator_byte_, num_buffered_values_, and repeat_count_.
Referenced by Put().
|
inlineprivate |
Flushes literal values to the underlying buffer. If update_indicator_byte, then the current literal run is complete and the indicator byte is updated.
Definition at line 295 of file rle-encoding.h.
References bit_width_, bit_writer_, buffered_values_, CheckBufferFull(), impala::BitWriter::GetNextBytePtr(), literal_count_, literal_indicator_byte_, num_buffered_values_, and impala::BitWriter::PutValue().
Referenced by Flush(), and FlushBufferedValues().
|
inlineprivate |
Flushes a repeated run to the underlying buffer.
Definition at line 325 of file rle-encoding.h.
References bit_width_, bit_writer_, impala::BitUtil::Ceil(), CheckBufferFull(), current_value_, num_buffered_values_, impala::BitWriter::PutAligned(), impala::BitWriter::PutVlqInt(), and repeat_count_.
|
inline |
Definition at line 162 of file rle-encoding.h.
References bit_writer_, and impala::BitWriter::bytes_written().
Referenced by impala::DictEncoderBase::WriteData().
|
inlinestatic |
Returns the maximum byte size it could take to encode 'num_values'.
Definition at line 142 of file rle-encoding.h.
References impala::BitUtil::Ceil(), MAX_VALUES_PER_LITERAL_RUN, and MinBufferSize().
Referenced by impala::DictEncoderBase::EstimatedDataEncodedSize().
|
inlinestatic |
Returns the minimum buffer size needed to use the encoder for 'bit_width' This is the maximum length of a single run for 'bit_width'. It is not valid to pass a buffer less than this length.
1 indicator byte and MAX_VALUES_PER_LITERAL_RUN 'bit_width' values.
Up to MAX_VLQ_BYTE_LEN indicator and a single 'bit_width' value.
Definition at line 132 of file rle-encoding.h.
References impala::BitUtil::Ceil(), MAX_VALUES_PER_LITERAL_RUN, and impala::BitReader::MAX_VLQ_BYTE_LEN.
Referenced by MaxBufferSize(), RleEncoder(), and impala::TEST().
Encode value. Returns true if the value fits in buffer, false otherwise. This value must be representable with bit_width_ bits.
This function buffers input values 8 at a time. After seeing all 8 values, it decides whether they should be encoded as a literal or repeated run.
Definition at line 264 of file rle-encoding.h.
References bit_width_, buffer_full_, buffered_values_, current_value_, FlushBufferedValues(), FlushRepeatedRun(), LIKELY, literal_count_, num_buffered_values_, repeat_count_, and UNLIKELY.
Referenced by impala::TEST(), impala::ValidateRle(), and impala::DictEncoderBase::WriteData().
|
private |
Number of bits needed to encode the value.
Definition at line 191 of file rle-encoding.h.
Referenced by FlushLiteralRun(), FlushRepeatedRun(), Put(), and RleEncoder().
|
private |
Underlying buffer.
Definition at line 194 of file rle-encoding.h.
Referenced by buffer(), CheckBufferFull(), Clear(), Flush(), FlushLiteralRun(), FlushRepeatedRun(), and len().
|
private |
If true, the buffer is full and subsequent Put()'s will fail.
Definition at line 197 of file rle-encoding.h.
Referenced by CheckBufferFull(), Clear(), and Put().
|
private |
We need to buffer at most 8 values for literals. This happens when the bit_width is 1 (so 8 values fit in one byte). TODO: generalize this to other bit widths
Definition at line 205 of file rle-encoding.h.
Referenced by Flush(), FlushLiteralRun(), and Put().
|
private |
The current (also last) value that was written and the count of how many times in a row that value has been seen. This is maintained even if we are in a literal run. If the repeat_count_ get high enough, we switch to encoding repeated runs.
Definition at line 214 of file rle-encoding.h.
Referenced by Clear(), FlushRepeatedRun(), and Put().
|
private |
Number of literals in the current run. This does not include the literals that might be in buffered_values_. Only after we've got a group big enough can we decide if they should part of the literal_count_ or repeat_count_
Definition at line 220 of file rle-encoding.h.
Referenced by Clear(), Flush(), FlushBufferedValues(), FlushLiteralRun(), and Put().
|
private |
Pointer to a byte in the underlying buffer that stores the indicator byte. This is reserved as soon as we need a literal run but the value is written when the literal run is complete.
Definition at line 225 of file rle-encoding.h.
Referenced by Clear(), FlushBufferedValues(), and FlushLiteralRun().
|
private |
The maximum byte size a single run can take.
Definition at line 200 of file rle-encoding.h.
Referenced by CheckBufferFull(), and RleEncoder().
|
staticprivate |
The maximum number of values in a single literal run (number of groups encodable by a 1-byte indicator * 8)
Definition at line 188 of file rle-encoding.h.
Referenced by MaxBufferSize(), and MinBufferSize().
|
private |
Number of values in buffered_values_.
Definition at line 208 of file rle-encoding.h.
Referenced by Clear(), Flush(), FlushBufferedValues(), FlushLiteralRun(), FlushRepeatedRun(), and Put().
|
private |
Definition at line 215 of file rle-encoding.h.
Referenced by Clear(), Flush(), FlushBufferedValues(), FlushRepeatedRun(), and Put().