Impala
Impala is the open source, native analytic database for Apache Hadoop.
#include <compress.h>
Public Types | |
enum | Format { ZLIB, DEFLATE, GZIP } |
Compression formats supported by the zlib library. | |
typedef std::map< const std::string, const THdfsCompression::type > | CodecMap |
Map from codec string to compression format. | |
Public Member Functions | |
virtual | ~GzipCompressor () |
virtual int64_t | MaxOutputLen (int64_t input_len, const uint8_t *input=NULL) |
virtual Status | ProcessBlock (bool output_preallocated, int64_t input_length, const uint8_t *input, int64_t *output_length, uint8_t **output) |
Process a block of data, either compressing or decompressing it. | |
virtual std::string | file_extension () const |
File extension to use for this compression codec. | |
Status | ProcessBlock32 (bool output_preallocated, int input_length, const uint8_t *input, int *output_length, uint8_t **output) |
virtual Status | ProcessBlockStreaming (int64_t input_length, const uint8_t *input, int64_t *input_bytes_read, int64_t *output_length, uint8_t **output, bool *eos) |
virtual void | Close () |
Must be called on codec before destructor for final cleanup. | |
bool | reuse_output_buffer () const |
Static Public Member Functions | |
static Status | CreateDecompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *decompressor) |
static Status | CreateDecompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *decompressor) |
Alternate factory method: takes a codec string and populates a scoped pointer. | |
static Status | CreateCompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *compressor) |
static Status | CreateCompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *compressor) |
Alternate factory method: takes a codec string and populates a scoped pointer. | |
static std::string | GetCodecName (THdfsCompression::type) |
Return the name of a compression algorithm. | |
static Status | GetHadoopCodecClassName (THdfsCompression::type, std::string *out_name) |
Returns the java class name for the given compression type. | |
Static Public Attributes | |
static const char *const | DEFAULT_COMPRESSION |
These are the codec string representations used in Hadoop. | |
static const char *const | GZIP_COMPRESSION = "org.apache.hadoop.io.compress.GzipCodec" |
static const char *const | BZIP2_COMPRESSION = "org.apache.hadoop.io.compress.BZip2Codec" |
static const char *const | SNAPPY_COMPRESSION = "org.apache.hadoop.io.compress.SnappyCodec" |
static const char *const | UNKNOWN_CODEC_ERROR |
static const CodecMap | CODEC_MAP |
static const int | MAX_BLOCK_SIZE = (2L * 1024 * 1024 * 1024) - 1 |
Protected Attributes | |
MemPool * | memory_pool_ |
Pool to allocate the buffer to hold transformed data. | |
boost::scoped_ptr< MemPool > | temp_memory_pool_ |
bool | reuse_buffer_ |
Can we reuse the output buffer or do we need to allocate on each call? | |
uint8_t * | out_buffer_ |
int64_t | buffer_length_ |
Length of the output buffer. | |
Private Member Functions | |
GzipCompressor (Format format, MemPool *mem_pool=NULL, bool reuse_buffer=false) | |
virtual Status | Init () |
Initialize the codec. This should only be called once. | |
Status | Compress (int64_t input_length, const uint8_t *input, int64_t *output_length, uint8_t *output) |
Private Attributes | |
Format | format_ |
z_stream | stream_ |
Structure used to communicate with the library. | |
Static Private Attributes | |
static const int | WINDOW_BITS = 15 |
These are magic numbers from zlib.h. Not clear why they are not defined there. | |
static const int | GZIP_CODEC = 16 |
Friends | |
class | Codec |
Different compression classes. The classes all expose the same API and abstract the underlying calls to the compression libraries. TODO: reconsider the abstracted API.
Definition at line 32 of file compress.h.
|
inherited |
Compression formats supported by the zlib library.
Enumerator | |
---|---|
ZLIB | |
DEFLATE | |
GZIP |
Definition at line 35 of file compress.h.
|
virtual |
Definition at line 40 of file compress.cc.
References stream_.
|
private |
Definition at line 34 of file compress.cc.
References stream_.
|
virtual, inherited |
Must be called on codec before destructor for final cleanup.
Definition at line 174 of file codec.cc.
References impala::MemPool::AcquireData(), impala::Codec::memory_pool_, and impala::Codec::temp_memory_pool_.
|
private |
Compresses 'input' into 'output'. The output buffer must be preallocated and large enough to hold the compressed data. On entry, *output_length must hold the length of the output buffer; on return, it holds the length of the compressed output.
Definition at line 82 of file compress.cc.
References MaxOutputLen(), impala::Status::OK, and stream_.
Referenced by ProcessBlock().
|
static, inherited |
Create a compressor.
Input:
mem_pool: the memory pool used to store the compressed data.
reuse: if true, the allocated buffer can be reused.
format: the type of compressor to create.
Output:
compressor: scoped pointer to the compressor class to use.
Referenced by impala::HdfsParquetTableWriter::BaseColumnWriter::BaseColumnWriter(), impala::HdfsSequenceTableWriter::Init(), impala::HdfsTextTableWriter::Init(), impala::HdfsAvroTableWriter::Init(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), impala::RowBatch::Serialize(), impala::TEST_F(), and impala::TestCompression().
|
static, inherited |
Alternate factory method: takes a codec string and populates a scoped pointer.
|
static, inherited |
Create a decompressor.
Input:
mem_pool: the memory pool used to store the decompressed data.
reuse: if true, the allocated buffer can be reused.
format: the type of decompressor to create.
Output:
decompressor: scoped pointer to the decompressor class to use.
If mem_pool is NULL, then the resulting codec will never allocate memory and the caller must be responsible for it.
Referenced by impala::HdfsRCFileScanner::InitNewRange(), impala::HdfsParquetScanner::BaseColumnReader::Reset(), impala::RowBatch::RowBatch(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), and impala::HdfsScanner::UpdateDecompressor().
|
static, inherited |
Alternate factory method: takes a codec string and populates a scoped pointer.
|
inline, virtual |
File extension to use for this compression codec.
Implements impala::Codec.
Definition at line 46 of file compress.h.
|
static, inherited |
Return the name of a compression algorithm.
Definition at line 50 of file codec.cc.
Referenced by impala::HdfsParquetTableWriter::Init().
|
static, inherited |
Returns the java class name for the given compression type.
Definition at line 59 of file codec.cc.
References impala::Status::OK.
Referenced by impala::HdfsSequenceTableWriter::Init().
|
private, virtual |
Initialize the codec. This should only be called once.
Implements impala::Codec.
Definition at line 44 of file compress.cc.
References DEFLATE, format_, GZIP, GZIP_CODEC, impala::Status::OK, stream_, and WINDOW_BITS.
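Init() references format_, WINDOW_BITS and GZIP_CODEC. One plausible reading, assuming the codec hands these to zlib's deflateInit2(), is that the two constants are combined into its windowBits argument: zlib documents that a negative value selects raw deflate and that adding 16 selects a gzip wrapper. The sketch below only computes that value and does not call zlib; the helper name WindowBitsFor and the k-prefixed constants are hypothetical.

```cpp
#include <cassert>

// Mirrors the Format enum listed on this page.
enum class Format { ZLIB, DEFLATE, GZIP };

// Values copied from the static private attributes on this page.
const int kWindowBits = 15;  // WINDOW_BITS
const int kGzipCodec = 16;   // GZIP_CODEC

// Hypothetical helper: chooses the windowBits value that zlib's
// deflateInit2() expects for each container format.
inline int WindowBitsFor(Format format) {
  switch (format) {
    case Format::DEFLATE:
      return -kWindowBits;             // raw deflate stream, no header
    case Format::GZIP:
      return kWindowBits + kGzipCodec; // gzip header and trailer
    case Format::ZLIB:
      return kWindowBits;              // zlib header and trailer
  }
  return kWindowBits;  // unreachable for valid enum values
}
```

Keeping the mapping in one small function makes the "magic numbers" visible in a single place, which is the concern the WINDOW_BITS comment raises.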
|
virtual |
Returns the maximum result length from applying the codec to input. Note this is not the exact result length, simply a bound to allow preallocating a buffer. This must be an O(1) operation (i.e. cannot read all of input). Codecs that don't support this should return -1.
Implements impala::Codec.
Definition at line 61 of file compress.cc.
References format_, GZIP, stream_, and UNLIKELY.
Referenced by Compress(), and ProcessBlock().
|
virtual |
Process a block of data, either compressing or decompressing it.
If output_preallocated is true, *output_length must be the length of *output and data will be written directly to *output (*output must be big enough to contain the transformed output). If output_preallocated is false, *output will be allocated from the codec's mempool. In this case, a mempool must have been passed into the constructor. In either case, *output_length will be set to the actual length of the transformed output.
Inputs:
input_length: length of the data to process
input: data to process
Implements impala::Codec.
Definition at line 110 of file compress.cc.
References impala::MemPool::Allocate(), impala::Codec::buffer_length_, Compress(), MaxOutputLen(), impala::Codec::memory_pool_, impala::Status::OK, impala::Codec::out_buffer_, RETURN_IF_ERROR, and impala::Codec::reuse_buffer_.
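The two output modes of ProcessBlock() can be illustrated with a toy codec. This is a minimal sketch, not the Impala implementation: the "transform" is a plain copy, a vector of vectors stands in for the MemPool, a bool stands in for Status, and the class name IdentityCodecSketch is hypothetical. Only the buffer contract described above is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy codec: copies input to output, so the worst-case output length
// equals the input length.
class IdentityCodecSketch {
 public:
  explicit IdentityCodecSketch(bool reuse_buffer) : reuse_buffer_(reuse_buffer) {}

  int64_t MaxOutputLen(int64_t input_len) const { return input_len; }

  // Returns false on error (the documented method returns a Status).
  bool ProcessBlock(bool output_preallocated, int64_t input_length,
                    const uint8_t* input, int64_t* output_length,
                    uint8_t** output) {
    if (input_length < 0) return false;
    const int64_t max_len = MaxOutputLen(input_length);
    if (output_preallocated) {
      // Caller owns *output; *output_length is its capacity on entry.
      if (*output_length < max_len) return false;
    } else {
      // Allocate from the codec's own storage, reusing the most recent
      // buffer when reuse is allowed and it is large enough.
      if (!reuse_buffer_ || buffers_.empty() ||
          static_cast<int64_t>(buffers_.back().size()) < max_len) {
        buffers_.emplace_back(static_cast<std::size_t>(max_len));
      }
      *output = buffers_.back().data();
    }
    if (input_length > 0) {
      std::memcpy(*output, input, static_cast<std::size_t>(input_length));
    }
    *output_length = input_length;  // actual length of the transformed output
    return true;
  }

 private:
  bool reuse_buffer_;
  std::vector<std::vector<uint8_t> > buffers_;  // stand-in for a MemPool
};
```

In both modes the caller reads the result length from *output_length after the call; the only difference is who provides the memory behind *output.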
|
inherited |
Wrapper to the actual ProcessBlock() function. This wrapper uses lengths as ints and not int64_ts. We need to keep this interface because the Parquet thrift uses ints. See IMPALA-1116.
Definition at line 181 of file codec.cc.
References impala::Status::OK, impala::Codec::ProcessBlock(), RETURN_IF_ERROR, and UNLIKELY.
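The int-based wrapper described above can be sketched as follows: widen the lengths on the way in, call the 64-bit routine, and range-check the result on the way out. This is an illustration under stated assumptions, not the code in codec.cc: the buffers are omitted, a bool stands in for Status, and Process64Fn, EchoLength64, HugeLength64 and ProcessBlock32Sketch are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Signature of a stand-in for the 64-bit ProcessBlock() (hypothetical;
// the data pointers are left out to keep the sketch short).
typedef bool (*Process64Fn)(int64_t input_length, int64_t* output_length);

// Stub: the output is exactly as long as the input.
inline bool EchoLength64(int64_t input_length, int64_t* output_length) {
  *output_length = input_length;
  return true;
}

// Stub: reports an output length that cannot fit in an int.
inline bool HugeLength64(int64_t /*input_length*/, int64_t* output_length) {
  *output_length = int64_t{1} << 40;
  return true;
}

// int-based wrapper around a 64-bit routine.
inline bool ProcessBlock32Sketch(int input_length, int* output_length,
                                 Process64Fn process64) {
  if (input_length < 0) return false;        // reject negative lengths
  int64_t output_length64 = *output_length;  // widen the in/out parameter
  if (!process64(input_length, &output_length64)) return false;
  // Refuse results that would be truncated when narrowed back to int.
  if (output_length64 < 0 ||
      output_length64 > std::numeric_limits<int>::max()) {
    return false;
  }
  *output_length = static_cast<int>(output_length64);
  return true;
}
```

On failure the caller's *output_length is left untouched, so a truncated value never leaks out of the wrapper.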
|
inline, virtual, inherited |
Process data like ProcessBlock(), but can consume partial input and may only produce partial output. *input_bytes_read returns the number of bytes of input that have been consumed. Even if all input has been consumed, the caller must continue calling to fetch output until *eos returns true.
Reimplemented in impala::GzipDecompressor.
Definition at line 117 of file codec.h.
Referenced by impala::DecompressorTest::CompressAndStreamingDecompress().
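The calling convention for ProcessBlockStreaming() implies a loop on the caller's side: advance the input by *input_bytes_read, collect whatever output was produced, and keep calling until *eos is true. The sketch below shows that loop against a toy codec. Everything here is hypothetical (ChunkedCopySketch, DrainAll, the 4-byte chunk size); in particular, the toy reports eos once all consumed bytes have been emitted, whereas a real gzip decompressor would report it from the stream's own end marker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::size_t kChunkSketch = 4;  // max bytes emitted per call (arbitrary)

// Toy streaming codec: consumes all offered input at once but emits it
// in small chunks, so output can remain pending after input is exhausted.
class ChunkedCopySketch {
 public:
  ChunkedCopySketch() : offset_(0) {}

  bool ProcessBlockStreaming(int64_t input_length, const uint8_t* input,
                             int64_t* input_bytes_read, int64_t* output_length,
                             uint8_t** output, bool* eos) {
    if (input_length < 0) return false;
    if (input_length > 0) {
      pending_.insert(pending_.end(), input, input + input_length);
    }
    *input_bytes_read = input_length;
    const std::size_t n = std::min(kChunkSketch, pending_.size() - offset_);
    const std::vector<uint8_t>::const_iterator first =
        pending_.begin() + static_cast<std::ptrdiff_t>(offset_);
    out_.assign(first, first + static_cast<std::ptrdiff_t>(n));
    offset_ += n;
    *output = out_.data();
    *output_length = static_cast<int64_t>(n);
    *eos = (offset_ == pending_.size());
    return true;
  }

 private:
  std::vector<uint8_t> pending_;  // consumed but not yet fully emitted
  std::vector<uint8_t> out_;      // buffer handed back to the caller
  std::size_t offset_;            // next byte of pending_ to emit
};

// Caller-side loop: keeps calling until eos, even with no input left.
inline std::vector<uint8_t> DrainAll(ChunkedCopySketch* codec,
                                     const std::vector<uint8_t>& input,
                                     int* calls) {
  std::vector<uint8_t> result;
  const uint8_t* in = input.data();
  int64_t remaining = static_cast<int64_t>(input.size());
  bool eos = false;
  *calls = 0;
  while (!eos) {
    int64_t bytes_read = 0;
    int64_t out_len = 0;
    uint8_t* out = nullptr;
    if (!codec->ProcessBlockStreaming(remaining, in, &bytes_read, &out_len,
                                      &out, &eos)) {
      break;
    }
    ++*calls;
    in += bytes_read;
    remaining -= bytes_read;
    result.insert(result.end(), out, out + out_len);
  }
  return result;
}
```

The output is copied out on every iteration because the codec may overwrite its buffer on the next call.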
|
inline, inherited |
Definition at line 135 of file codec.h.
References impala::Codec::reuse_buffer_.
|
friend |
Definition at line 49 of file compress.h.
|
protected, inherited |
Length of the output buffer.
Definition at line 168 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and impala::GzipDecompressor::ProcessBlockStreaming().
|
static, inherited |
|
static, inherited |
Definition at line 52 of file codec.h.
Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadFileHeader().
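The member summary describes CodecMap as a map from codec string to compression format, and the file-header readers listed above use it. A minimal sketch of such a lookup follows, using only the three codec strings whose values appear on this page. The enum CompressionTypeSketch is a hypothetical stand-in for THdfsCompression::type, and the helper names are assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for THdfsCompression::type.
enum class CompressionTypeSketch { GZIP, BZIP2, SNAPPY };

typedef std::map<std::string, CompressionTypeSketch> CodecMapSketch;

// Builds the map once; the keys are the Hadoop codec class names
// listed under the static public attributes on this page.
inline const CodecMapSketch& CodecMapInstance() {
  static const CodecMapSketch kMap = {
      {"org.apache.hadoop.io.compress.GzipCodec", CompressionTypeSketch::GZIP},
      {"org.apache.hadoop.io.compress.BZip2Codec", CompressionTypeSketch::BZIP2},
      {"org.apache.hadoop.io.compress.SnappyCodec", CompressionTypeSketch::SNAPPY},
  };
  return kMap;
}

// Returns false when the codec string is not in the map, which is the
// case where a caller would report an unknown-codec error.
inline bool LookupCodec(const std::string& codec, CompressionTypeSketch* out) {
  const CodecMapSketch::const_iterator it = CodecMapInstance().find(codec);
  if (it == CodecMapInstance().end()) return false;
  *out = it->second;
  return true;
}
```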
|
static, inherited |
|
private |
Definition at line 53 of file compress.h.
Referenced by Init(), and MaxOutputLen().
|
static, private |
Definition at line 60 of file compress.h.
Referenced by Init().
|
static, inherited |
|
static, inherited |
Largest block we will compress/decompress: 2GB. We are dealing with compressed blocks that are never this big but we want to guard against a corrupt file that has the block length as some large number.
Definition at line 140 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and SnappyBlockDecompress().
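The guard described above amounts to rejecting any block length read from a file that is negative or larger than MAX_BLOCK_SIZE before acting on it. A minimal sketch, assuming the length arrives as a 64-bit integer; the helper name IsPlausibleBlockLength is hypothetical, and the constant is computed in 64-bit arithmetic so the expression does not depend on the width of long.

```cpp
#include <cassert>
#include <cstdint>

// Same value as MAX_BLOCK_SIZE on this page: 2 GiB minus one byte.
const int64_t kMaxBlockSize = (int64_t{2} * 1024 * 1024 * 1024) - 1;

// Hypothetical helper: true when a block length taken from a file header
// is within the range a codec is willing to process.
inline bool IsPlausibleBlockLength(int64_t length_from_file) {
  return length_from_file >= 0 && length_from_file <= kMaxBlockSize;
}
```

Checking the length before allocating means a corrupt header fails fast instead of triggering a multi-gigabyte allocation.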
|
protected, inherited |
Pool to allocate the buffer to hold transformed data.
Definition at line 154 of file codec.h.
Referenced by impala::Codec::Close(), impala::Codec::Codec(), impala::GzipDecompressor::ProcessBlock(), ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and impala::GzipDecompressor::ProcessBlockStreaming().
|
protected, inherited |
Buffer to hold transformed data. Either passed from the caller or allocated from memory_pool_.
Definition at line 165 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and impala::GzipDecompressor::ProcessBlockStreaming().
|
protected, inherited |
Can we reuse the output buffer or do we need to allocate on each call?
Definition at line 161 of file codec.h.
Referenced by impala::GzipDecompressor::ProcessBlock(), ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::BzipCompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockCompressor::ProcessBlock(), impala::SnappyCompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), impala::GzipDecompressor::ProcessBlockStreaming(), and impala::Codec::reuse_output_buffer().
|
static, inherited |
|
private |
Structure used to communicate with the library.
Definition at line 56 of file compress.h.
Referenced by Compress(), GzipCompressor(), Init(), MaxOutputLen(), and ~GzipCompressor().
|
protected, inherited |
Temporary memory pool: if the output size estimate turns out to be too small, this pool can be used to free the unused buffers.
Definition at line 158 of file codec.h.
Referenced by impala::Codec::Close(), impala::Codec::Codec(), impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), and impala::BzipCompressor::ProcessBlock().
|
static, inherited |
|
static, private |
These are magic numbers from zlib.h. Not clear why they are not defined there.
Definition at line 59 of file compress.h.
Referenced by Init().