Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
|
Decoder class for RLE encoded data. More...
#include <rle-encoding.h>
Public Member Functions | |
RleDecoder (uint8_t *buffer, int buffer_len, int bit_width) | |
RleDecoder () | |
template<typename T > | |
bool | Get (T *val) |
Gets the next value. Returns false if there are no more. More... | |
Private Attributes | |
BitReader | bit_reader_ |
int | bit_width_ |
uint64_t | current_value_ |
uint32_t | repeat_count_ |
uint32_t | literal_count_ |
Decoder class for RLE encoded data.
Utility classes to do run length encoding (RLE) for fixed bit width values. If runs are sufficiently long, RLE is used, otherwise, the values are just bit-packed (literal encoding). For both types of runs, there is a byte-aligned indicator which encodes the length of the run and the type of the run. This encoding has the benefit that when there aren't any long enough runs, values are always decoded at fixed (can be precomputed) bit offsets OR both the value and the run length are byte aligned. This allows for very efficient decoding implementations. The encoding is: encoded-block := run* run := literal-run | repeated-run literal-run := literal-indicator < literal bytes > repeated-run := repeated-indicator < repeated value. padded to byte boundary > literal-indicator := varint_encode( number_of_groups << 1 | 1) repeated-indicator := varint_encode( number_of_repetitions << 1 ) Each run is preceded by a varint. The varint's least significant bit is used to indicate whether the run is a literal run or a repeated run. The rest of the varint is used to determine the length of the run (eg how many times the value repeats). In the case of literal runs, the run length is always a multiple of 8 (i.e. encode in groups of 8), so that no matter the bit-width of the value, the sequence will end on a byte boundary without padding. Given that we know it is a multiple of 8, we store the number of 8-groups rather than the actual number of encoded ints. (This means that the total number of encoded values can not be determined from the encoded data, since the number of values in the last group may not be a multiple of 8). For the last group of literal runs, we pad the group to 8 with zeros. This allows for 8 at a time decoding on the read side without the need for additional checks. There is a break-even point when it is more storage efficient to do run length encoding. For 1 bit-width values, that point is 8 values. They require 2 bytes for both the repeated encoding or the literal encoding. This value can always be computed based on the bit-width.
100 1s followed by 100 0s: <varint(100 << 1)> <1, padded to 1 byte> <varint(100 << 1)> <0, padded to 1 byte>
Definition at line 77 of file rle-encoding.h.
|
inline |
Create a decoder object. buffer/buffer_len is the decoded data. bit_width is the width of each value (before encoding).
Definition at line 81 of file rle-encoding.h.
References bit_width_.
|
inline |
Definition at line 91 of file rle-encoding.h.
|
inline |
Gets the next value. Returns false if there are no more.
Definition at line 229 of file rle-encoding.h.
References bit_reader_, bit_width_, impala::BitUtil::Ceil(), current_value_, impala::BitReader::GetAligned(), impala::BitReader::GetValue(), impala::BitReader::GetVlqInt(), LIKELY, literal_count_, repeat_count_, and UNLIKELY.
Referenced by impala::TEST(), and impala::ValidateRle().
|
private |
Definition at line 98 of file rle-encoding.h.
Referenced by Get().
|
private |
Definition at line 99 of file rle-encoding.h.
Referenced by Get(), and RleDecoder().
|
private |
Definition at line 100 of file rle-encoding.h.
Referenced by Get().
|
private |
Definition at line 102 of file rle-encoding.h.
Referenced by Get().
|
private |
Definition at line 101 of file rle-encoding.h.
Referenced by Get().