Impala
Impala is the open source, native analytic database for Apache Hadoop.
|
#include <codec.h>
Public Types | |
typedef std::map< const std::string, const THdfsCompression::type > | CodecMap |
Map from codec string to compression format. More... | |
Public Member Functions | |
virtual | ~Codec () |
virtual Status | ProcessBlock (bool output_preallocated, int64_t input_length, const uint8_t *input, int64_t *output_length, uint8_t **output)=0 |
Process a block of data, either compressing or decompressing it. More... | |
Status | ProcessBlock32 (bool output_preallocated, int input_length, const uint8_t *input, int *output_length, uint8_t **output) |
virtual Status | ProcessBlockStreaming (int64_t input_length, const uint8_t *input, int64_t *input_bytes_read, int64_t *output_length, uint8_t **output, bool *eos) |
virtual int64_t | MaxOutputLen (int64_t input_len, const uint8_t *input=NULL)=0 |
virtual void | Close () |
Must be called on codec before destructor for final cleanup. More... | |
virtual std::string | file_extension () const =0 |
File extension to use for this compression codec. More... | |
bool | reuse_output_buffer () const |
Static Public Member Functions | |
static Status | CreateDecompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *decompressor) |
static Status | CreateDecompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *decompressor) |
Alternate factory method: takes a codec string and populates a scoped pointer. More... | |
static Status | CreateCompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *compressor) |
static Status | CreateCompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *compressor) |
Alternate factory method: takes a codec string and populates a scoped pointer. More... | |
static std::string | GetCodecName (THdfsCompression::type) |
Return the name of a compression algorithm. More... | |
static Status | GetHadoopCodecClassName (THdfsCompression::type, std::string *out_name) |
Returns the java class name for the given compression type. More... | |
Static Public Attributes | |
static const char *const | DEFAULT_COMPRESSION |
These are the codec string representations used in Hadoop. More... | |
static const char *const | GZIP_COMPRESSION = "org.apache.hadoop.io.compress.GzipCodec" |
static const char *const | BZIP2_COMPRESSION = "org.apache.hadoop.io.compress.BZip2Codec" |
static const char *const | SNAPPY_COMPRESSION = "org.apache.hadoop.io.compress.SnappyCodec" |
static const char *const | UNKNOWN_CODEC_ERROR |
static const CodecMap | CODEC_MAP |
static const int | MAX_BLOCK_SIZE = (2L * 1024 * 1024 * 1024) - 1 |
Protected Member Functions | |
Codec (MemPool *mem_pool, bool reuse_buffer) | |
virtual Status | Init ()=0 |
Initialize the codec. This should only be called once. More... | |
Protected Attributes | |
MemPool * | memory_pool_ |
Pool to allocate the buffer to hold transformed data. More... | |
boost::scoped_ptr< MemPool > | temp_memory_pool_ |
bool | reuse_buffer_ |
Can we reuse the output buffer or do we need to allocate on each call? More... | |
uint8_t * | out_buffer_ |
int64_t | buffer_length_ |
Length of the output buffer. More... | |
Create a compression object. This is the base class for all compression algorithms. A compression algorithm is either a compressor or a decompressor. To add a new algorithm, both a compressor and a decompressor will generally be added, and each of these objects inherits from this class. The objects are instantiated in the Create static methods defined here. The type of compression is defined in the Thrift interface THdfsCompression.
TODO: make this pure virtual (no members) so that external codecs (e.g. Lzo) can implement this without binary dependency issues.
TODO: this interface is clunky. There should be one class that implements both the compress and decompress APIs to remove duplication.
typedef std::map<const std::string, const THdfsCompression::type> impala::Codec::CodecMap |
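As an illustration of how a map of this shape can resolve a Hadoop codec string to a compression format, here is a minimal sketch. `CompressionSketch`, `CodecMapSketch`, `CodecMapForSketch` and `LookupCodecSketch` are hypothetical names, the enum is only a stand-in for THdfsCompression::type, and the three entries are taken from the codec strings listed on this page; the real contents of CODEC_MAP are not shown here and may differ.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the Thrift enum THdfsCompression::type.
enum class CompressionSketch { NONE, GZIP, BZIP2, SNAPPY };

// Same shape as Codec::CodecMap: codec string -> compression format.
typedef std::map<const std::string, const CompressionSketch> CodecMapSketch;

// Assumed contents, built from the codec strings documented on this page.
inline const CodecMapSketch& CodecMapForSketch() {
  static const CodecMapSketch kMap = {
      {"org.apache.hadoop.io.compress.GzipCodec", CompressionSketch::GZIP},
      {"org.apache.hadoop.io.compress.BZip2Codec", CompressionSketch::BZIP2},
      {"org.apache.hadoop.io.compress.SnappyCodec", CompressionSketch::SNAPPY},
  };
  return kMap;
}

// Returns true and sets *format when the codec string is known; returns
// false (leaving *format untouched) for an unknown codec string.
inline bool LookupCodecSketch(const std::string& codec,
                              CompressionSketch* format) {
  const CodecMapSketch& map = CodecMapForSketch();
  CodecMapSketch::const_iterator it = map.find(codec);
  if (it == map.end()) return false;
  *format = it->second;
  return true;
}
```

A string-based factory would presumably perform a lookup like this first and report an unknown-codec error when it fails, then defer to the format-based factory.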
Create a compression operator.
Inputs:
mem_pool: memory pool to allocate the output buffer. If mem_pool is NULL then the caller must always preallocate *output in ProcessBlock().
reuse_buffer: if false, always allocate a new buffer rather than reuse.
Definition at line 164 of file codec.cc.
References impala::MemPool::mem_tracker(), memory_pool_, and temp_memory_pool_.
|
virtual |
Must be called on codec before destructor for final cleanup.
Definition at line 174 of file codec.cc.
References impala::MemPool::AcquireData(), memory_pool_, and temp_memory_pool_.
|
static |
Create a compressor.
Input:
mem_pool: the memory pool used to store the compressed data.
reuse: if true the allocated buffer can be reused.
format: the type of compressor to create.
Output:
compressor: scoped pointer to the compressor class to use.
Referenced by impala::HdfsParquetTableWriter::BaseColumnWriter::BaseColumnWriter(), impala::HdfsSequenceTableWriter::Init(), impala::HdfsTextTableWriter::Init(), impala::HdfsAvroTableWriter::Init(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), impala::RowBatch::Serialize(), impala::TEST_F(), and impala::TestCompression().
|
static |
Alternate factory method: takes a codec string and populates a scoped pointer.
|
static |
Create a decompressor.
Input:
mem_pool: the memory pool used to store the decompressed data.
reuse: if true the allocated buffer can be reused.
format: the type of decompressor to create.
Output:
decompressor: scoped pointer to the decompressor class to use.
If mem_pool is NULL, then the resulting codec will never allocate memory and the caller must be responsible for it.
Referenced by impala::HdfsRCFileScanner::InitNewRange(), impala::HdfsParquetScanner::BaseColumnReader::Reset(), impala::RowBatch::RowBatch(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), and impala::HdfsScanner::UpdateDecompressor().
|
static |
Alternate factory method: takes a codec string and populates a scoped pointer.
|
pure virtual |
File extension to use for this compression codec.
Implemented in impala::Lz4Compressor, impala::SnappyBlockDecompressor, impala::SnappyCompressor, impala::Lz4Decompressor, impala::SnappyBlockCompressor, impala::SnappyDecompressor, impala::BzipCompressor, impala::BzipDecompressor, impala::GzipCompressor, and impala::GzipDecompressor.
|
static |
Return the name of a compression algorithm.
Definition at line 50 of file codec.cc.
Referenced by impala::HdfsParquetTableWriter::Init().
|
static |
Returns the java class name for the given compression type.
Definition at line 59 of file codec.cc.
References impala::Status::OK.
Referenced by impala::HdfsSequenceTableWriter::Init().
|
protected pure virtual |
Initialize the codec. This should only be called once.
Implemented in impala::Lz4Compressor, impala::SnappyBlockDecompressor, impala::SnappyCompressor, impala::Lz4Decompressor, impala::SnappyBlockCompressor, impala::SnappyDecompressor, impala::BzipCompressor, impala::BzipDecompressor, impala::GzipCompressor, and impala::GzipDecompressor.
|
pure virtual |
Returns the maximum result length from applying the codec to input. Note this is not the exact result length, simply a bound to allow preallocating a buffer. This must be an O(1) operation (i.e. cannot read all of input). Codecs that don't support this should return -1.
Implemented in impala::Lz4Compressor, impala::SnappyBlockDecompressor, impala::SnappyCompressor, impala::Lz4Decompressor, impala::SnappyBlockCompressor, impala::SnappyDecompressor, impala::BzipCompressor, impala::BzipDecompressor, impala::GzipCompressor, and impala::GzipDecompressor.
Referenced by impala::DecompressorTest::CompressAndDecompress(), and impala::DecompressorTest::CompressAndDecompressNoOutputAllocated().
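The contract above (a constant-time upper bound, or -1 when no bound is available) can be sketched as follows. Both functions are hypothetical, and the bound formula is an arbitrary worst-case expansion chosen for illustration, not the formula of any codec listed on this page.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bound: computed from the input length alone, so it is O(1)
// and never reads the input bytes. Returns -1 when no bound is known.
inline int64_t MaxOutputLenSketch(int64_t input_len, bool bound_known) {
  if (!bound_known) return -1;
  return input_len + input_len / 255 + 16;
}

// Caller side: preallocate only when a bound is available. A false return
// means the caller must let the codec allocate the output instead.
inline bool PreallocateSketch(int64_t max_len, std::vector<uint8_t>* buffer) {
  if (max_len < 0) return false;
  buffer->resize(static_cast<std::size_t>(max_len));
  return true;
}
```

Because the result is only a bound, the caller still has to read the actual output length reported by the processing call.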
|
pure virtual |
Process a block of data, either compressing or decompressing it.
If output_preallocated is true, *output_length must be the length of *output and data will be written directly to *output (*output must be big enough to contain the transformed output). If output_preallocated is false, *output will be allocated from the codec's mempool; in this case, a mempool must have been passed into the constructor. In either case, *output_length will be set to the actual length of the transformed output.
Inputs:
input_length: length of the data to process
input: data to process
Implemented in impala::Lz4Compressor, impala::SnappyBlockDecompressor, impala::SnappyCompressor, impala::Lz4Decompressor, impala::SnappyBlockCompressor, impala::SnappyDecompressor, impala::BzipCompressor, impala::BzipDecompressor, impala::GzipCompressor, and impala::GzipDecompressor.
Referenced by impala::DecompressorTest::CompressAndDecompress(), impala::DecompressorTest::CompressAndDecompressNoOutputAllocated(), impala::DecompressorTest::CompressAndStreamingDecompress(), and ProcessBlock32().
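The two output modes described for ProcessBlock() can be illustrated with a toy codec that copies its input unchanged. This is a sketch under stated assumptions: `SketchStatus` and `IdentityCodecSketch` are hypothetical and are not Impala classes, and a std::vector stands in for memory obtained from the codec's memory pool.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for impala::Status; only success/failure is modeled.
struct SketchStatus {
  bool ok;
};

// Toy "codec" that copies input to output. It exists only to show the
// output_preallocated contract.
class IdentityCodecSketch {
 public:
  SketchStatus ProcessBlock(bool output_preallocated, int64_t input_length,
                            const uint8_t* input, int64_t* output_length,
                            uint8_t** output) {
    if (input_length < 0) return SketchStatus{false};
    if (output_preallocated) {
      // Caller owns *output; on entry *output_length is its capacity.
      if (*output_length < input_length) return SketchStatus{false};
    } else {
      // Stand-in for allocating the output from the codec's memory pool.
      buffer_.resize(static_cast<std::size_t>(input_length));
      *output = buffer_.data();
    }
    if (input_length > 0) {
      std::memcpy(*output, input, static_cast<std::size_t>(input_length));
    }
    // In both modes, report the actual length of the transformed output.
    *output_length = input_length;
    return SketchStatus{true};
  }

 private:
  std::vector<uint8_t> buffer_;  // plays the role of pool-owned memory
};
```

In the preallocated mode the caller typically sizes its buffer from MaxOutputLen() and then reads back the real length from *output_length.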
Status Codec::ProcessBlock32 (bool output_preallocated, int input_length, const uint8_t *input, int *output_length, uint8_t **output)
Wrapper to the actual ProcessBlock() function. This wrapper uses lengths as ints and not int64_ts. We need to keep this interface because the Parquet thrift uses ints. See IMPALA-1116.
Definition at line 181 of file codec.cc.
References impala::Status::OK, ProcessBlock(), RETURN_IF_ERROR, and UNLIKELY.
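A wrapper of this kind widens the int lengths to int64_t, calls the 64-bit routine, and narrows the result back after checking that it fits. The sketch below shows that pattern only; `ProcessBlock64Sketch` and `ProcessBlock32Sketch` are hypothetical, the doubling behaviour is invented so that the overflow path can be exercised, and the actual checks in codec.cc may differ.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical 64-bit routine standing in for Codec::ProcessBlock(). It
// pretends every input expands to exactly twice its length.
inline bool ProcessBlock64Sketch(int64_t input_length, int64_t* output_length) {
  if (input_length < 0) return false;
  *output_length = input_length * 2;
  return true;
}

// int-based wrapper: widen on the way in, range-check and narrow on the
// way out, so callers that only have int fields can use the 64-bit API.
inline bool ProcessBlock32Sketch(int input_length, int* output_length) {
  int64_t input_length64 = input_length;
  int64_t output_length64 = *output_length;
  if (!ProcessBlock64Sketch(input_length64, &output_length64)) return false;
  // Refuse results that cannot be represented in an int.
  if (output_length64 > std::numeric_limits<int>::max()) return false;
  *output_length = static_cast<int>(output_length64);
  return true;
}
```

On failure the caller's *output_length is left unchanged in this sketch, which keeps a stale but valid value rather than a truncated one.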
|
inline virtual |
Process data like ProcessBlock(), but can consume partial input and may only produce partial output. *input_bytes_read returns the number of bytes of input that have been consumed. Even if all input has been consumed, the caller must continue calling to fetch output until *eos returns true.
Reimplemented in impala::GzipDecompressor.
Definition at line 117 of file codec.h.
Referenced by impala::DecompressorTest::CompressAndStreamingDecompress().
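The calling pattern this implies (advance the input by *input_bytes_read, and keep calling until *eos even after the input is exhausted) can be sketched with a toy streaming codec. `ChunkedCopySketch` and `DrainSketch` are hypothetical: the toy consumes at most 8 input bytes and emits at most 4 output bytes per call, and it is told the total stream length up front, whereas a real decompressor would detect end of stream from the data itself.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy streaming "codec" that copies bytes through in small pieces.
class ChunkedCopySketch {
 public:
  explicit ChunkedCopySketch(int64_t total_length) : total_(total_length) {}

  bool ProcessBlockStreaming(int64_t input_length, const uint8_t* input,
                             int64_t* input_bytes_read, int64_t* output_length,
                             uint8_t** output, bool* eos) {
    if (input_length < 0) return false;
    // Partial consumption: take at most 8 bytes of input per call.
    int64_t take = std::min<int64_t>(input_length, 8);
    take = std::min<int64_t>(take, total_ - consumed_);
    pending_.insert(pending_.end(), input, input + take);
    consumed_ += take;
    *input_bytes_read = take;
    // Partial production: emit at most 4 buffered bytes per call.
    const int64_t available = static_cast<int64_t>(pending_.size()) - emitted_;
    const int64_t emit = std::min<int64_t>(available, 4);
    std::copy(pending_.begin() + emitted_, pending_.begin() + emitted_ + emit,
              chunk_);
    emitted_ += emit;
    *output = chunk_;
    *output_length = emit;
    *eos = (emitted_ == total_);
    return true;
  }

 private:
  int64_t total_;
  int64_t consumed_ = 0;
  int64_t emitted_ = 0;
  std::vector<uint8_t> pending_;  // consumed but not yet emitted bytes
  uint8_t chunk_[4];              // output buffer reused on every call
};

// Caller loop: keeps calling until eos, even once no input remains.
inline std::vector<uint8_t> DrainSketch(const std::vector<uint8_t>& data) {
  ChunkedCopySketch codec(static_cast<int64_t>(data.size()));
  std::vector<uint8_t> result;
  const uint8_t* input = data.data();
  int64_t remaining = static_cast<int64_t>(data.size());
  bool eos = false;
  while (!eos) {
    int64_t bytes_read = 0;
    int64_t out_len = 0;
    uint8_t* out = nullptr;
    if (!codec.ProcessBlockStreaming(remaining, input, &bytes_read, &out_len,
                                     &out, &eos)) {
      break;
    }
    input += bytes_read;
    remaining -= bytes_read;
    // Copy out immediately: the codec reuses its output buffer.
    result.insert(result.end(), out, out + out_len);
  }
  return result;
}
```

With a 10-byte input, the third call in this loop passes zero remaining input and still returns the last 2 output bytes before *eos becomes true.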
|
inline |
Definition at line 135 of file codec.h.
References reuse_buffer_.
|
protected |
Length of the output buffer.
Definition at line 168 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), impala::GzipCompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and impala::GzipDecompressor::ProcessBlockStreaming().
|
static |
|
static |
Definition at line 52 of file codec.h.
Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadFileHeader().
|
static |
|
static |
|
static |
Largest block we will compress or decompress: 2GB minus one byte (2^31 - 1). Real compressed blocks are never this big, but the limit guards against a corrupt file whose block length holds some very large number.
Definition at line 140 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and SnappyBlockDecompress().
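The arithmetic behind this limit, and the kind of guard it enables, can be sketched as follows. `kMaxBlockSizeSketch` and `IsPlausibleBlockLengthSketch` are hypothetical names; the sketch uses a `2LL` literal so the multiplication is done in at least 64 bits on every platform, which is an assumption of this example rather than a statement about the original definition.

```cpp
#include <cstdint>
#include <limits>

// (2 * 1024^3) - 1 = 2147483647, the largest value a 32-bit signed
// integer can hold.
constexpr int64_t kMaxBlockSizeSketch = (2LL * 1024 * 1024 * 1024) - 1;
static_assert(kMaxBlockSizeSketch == std::numeric_limits<int32_t>::max(),
              "2 GiB - 1 equals the maximum 32-bit signed value");

// Reject block lengths read from a file header before allocating or
// decompressing: a negative or oversized value indicates corruption.
inline bool IsPlausibleBlockLengthSketch(int64_t block_length) {
  return block_length >= 0 && block_length <= kMaxBlockSizeSketch;
}
```

Checking the length before any allocation means a corrupt header fails fast instead of triggering a multi-gigabyte allocation.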
|
protected |
Pool to allocate the buffer to hold transformed data.
Definition at line 154 of file codec.h.
Referenced by Close(), Codec(), impala::GzipDecompressor::ProcessBlock(), impala::GzipCompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and impala::GzipDecompressor::ProcessBlockStreaming().
|
protected |
Buffer to hold transformed data. Either passed from the caller or allocated from memory_pool_.
Definition at line 165 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), impala::GzipCompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and impala::GzipDecompressor::ProcessBlockStreaming().
|
protected |
Can we reuse the output buffer or do we need to allocate on each call?
Definition at line 161 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), impala::GzipCompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), impala::GzipDecompressor::ProcessBlockStreaming(), and reuse_output_buffer().
|
static |
|
protected |
Temporary memory pool: if the estimated output size turns out to be too small, this pool lets us free the unused buffers.
Definition at line 158 of file codec.h.
Referenced by Close(), Codec(), impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), and impala::BzipCompressor::ProcessBlock().
|
static |