Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros
impala::RleDecoder Class Reference

Decoder class for RLE encoded data. More...

#include <rle-encoding.h>

Collaboration diagram for impala::RleDecoder:

Public Member Functions

 RleDecoder (uint8_t *buffer, int buffer_len, int bit_width)
 
 RleDecoder ()
 
template<typename T >
bool Get (T *val)
 Gets the next value. Returns false if there are no more. More...
 

Private Attributes

BitReader bit_reader_
 
int bit_width_
 
uint64_t current_value_
 
uint32_t repeat_count_
 
uint32_t literal_count_
 

Detailed Description

Decoder class for RLE encoded data.

Utility classes to do run length encoding (RLE) for fixed bit width values. If runs are sufficiently long, RLE is used, otherwise, the values are just bit-packed (literal encoding). For both types of runs, there is a byte-aligned indicator which encodes the length of the run and the type of the run. This encoding has the benefit that when there aren't any long enough runs, values are always decoded at fixed (can be precomputed) bit offsets OR both the value and the run length are byte aligned. This allows for very efficient decoding implementations. The encoding is: encoded-block := run* run := literal-run | repeated-run literal-run := literal-indicator < literal bytes > repeated-run := repeated-indicator < repeated value. padded to byte boundary > literal-indicator := varint_encode( number_of_groups << 1 | 1) repeated-indicator := varint_encode( number_of_repetitions << 1 ) Each run is preceded by a varint. The varint's least significant bit is used to indicate whether the run is a literal run or a repeated run. The rest of the varint is used to determine the length of the run (eg how many times the value repeats). In the case of literal runs, the run length is always a multiple of 8 (i.e. encode in groups of 8), so that no matter the bit-width of the value, the sequence will end on a byte boundary without padding. Given that we know it is a multiple of 8, we store the number of 8-groups rather than the actual number of encoded ints. (This means that the total number of encoded values can not be determined from the encoded data, since the number of values in the last group may not be a multiple of 8). For the last group of literal runs, we pad the group to 8 with zeros. This allows for 8 at a time decoding on the read side without the need for additional checks. There is a break-even point when it is more storage efficient to do run length encoding. For 1 bit-width values, that point is 8 values. They require 2 bytes for both the repeated encoding or the literal encoding. This value can always be computed based on the bit-width.

TODO: think about how to use this for strings. The bit packing isn't quite the same. Examples with bit-width 1 (eg encoding booleans):

100 1s followed by 100 0s: <varint(100 << 1)> <1, padded to 1 byte>  <varint(100 << 1)> <0, padded to 1 byte>

  • (total 4 bytes) alternating 1s and 0s (200 total): 200 ints = 25 groups of 8 <varint((25 << 1) | 1)> <25 bytes of values, bitpacked> (total 26 bytes, 1 byte overhead)

Definition at line 77 of file rle-encoding.h.

Constructor & Destructor Documentation

impala::RleDecoder::RleDecoder ( uint8_t *  buffer,
int  buffer_len,
int  bit_width 
)
inline

Create a decoder object. buffer/buffer_len is the decoded data. bit_width is the width of each value (before encoding).

Definition at line 81 of file rle-encoding.h.

References bit_width_.

impala::RleDecoder::RleDecoder ( )
inline

Definition at line 91 of file rle-encoding.h.

Member Function Documentation

template<typename T >
bool impala::RleDecoder::Get ( T *  val)
inline

Member Data Documentation

BitReader impala::RleDecoder::bit_reader_
private

Definition at line 98 of file rle-encoding.h.

Referenced by Get().

int impala::RleDecoder::bit_width_
private

Definition at line 99 of file rle-encoding.h.

Referenced by Get(), and RleDecoder().

uint64_t impala::RleDecoder::current_value_
private

Definition at line 100 of file rle-encoding.h.

Referenced by Get().

uint32_t impala::RleDecoder::literal_count_
private

Definition at line 102 of file rle-encoding.h.

Referenced by Get().

uint32_t impala::RleDecoder::repeat_count_
private

Definition at line 101 of file rle-encoding.h.

Referenced by Get().


The documentation for this class was generated from the following file: