Impala
Impala is the open source, native analytic database for Apache Hadoop.
impala::GzipCompressor Class Reference

#include <compress.h>


Public Types

enum  Format { ZLIB, DEFLATE, GZIP }
 Compression formats supported by the zlib library.

typedef std::map< const std::string, const THdfsCompression::type > CodecMap
 Map from codec string to compression format.
 

Public Member Functions

virtual ~GzipCompressor ()
 
virtual int64_t MaxOutputLen (int64_t input_len, const uint8_t *input=NULL)
 
virtual Status ProcessBlock (bool output_preallocated, int64_t input_length, const uint8_t *input, int64_t *output_length, uint8_t **output)
 Process a block of data, either compressing or decompressing it.

virtual std::string file_extension () const
 File extension to use for this compression codec.

Status ProcessBlock32 (bool output_preallocated, int input_length, const uint8_t *input, int *output_length, uint8_t **output)

virtual Status ProcessBlockStreaming (int64_t input_length, const uint8_t *input, int64_t *input_bytes_read, int64_t *output_length, uint8_t **output, bool *eos)

virtual void Close ()
 Must be called on codec before destructor for final cleanup.
 
bool reuse_output_buffer () const
 

Static Public Member Functions

static Status CreateDecompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *decompressor)
 
static Status CreateDecompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *decompressor)
 Alternate factory method: takes a codec string and populates a scoped pointer.

static Status CreateCompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *compressor)

static Status CreateCompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *compressor)
 Alternate factory method: takes a codec string and populates a scoped pointer.

static std::string GetCodecName (THdfsCompression::type)
 Return the name of a compression algorithm.

static Status GetHadoopCodecClassName (THdfsCompression::type, std::string *out_name)
 Returns the Java class name for the given compression type.
 

Static Public Attributes

static const char *const DEFAULT_COMPRESSION
 These are the codec string representations used in Hadoop.
 
static const char *const GZIP_COMPRESSION = "org.apache.hadoop.io.compress.GzipCodec"
 
static const char *const BZIP2_COMPRESSION = "org.apache.hadoop.io.compress.BZip2Codec"
 
static const char *const SNAPPY_COMPRESSION = "org.apache.hadoop.io.compress.SnappyCodec"
 
static const char *const UNKNOWN_CODEC_ERROR
 
static const CodecMap CODEC_MAP
 
static const int MAX_BLOCK_SIZE = (2L * 1024 * 1024 * 1024) - 1
 

Protected Attributes

MemPool * memory_pool_
 Pool to allocate the buffer to hold transformed data.

boost::scoped_ptr< MemPool > temp_memory_pool_

bool reuse_buffer_
 Can we reuse the output buffer or do we need to allocate on each call?

uint8_t * out_buffer_

int64_t buffer_length_
 Length of the output buffer.
 

Private Member Functions

 GzipCompressor (Format format, MemPool *mem_pool=NULL, bool reuse_buffer=false)
 
virtual Status Init ()
 Initialize the codec. This should only be called once.
 
Status Compress (int64_t input_length, const uint8_t *input, int64_t *output_length, uint8_t *output)
 

Private Attributes

Format format_
 
z_stream stream_
 Structure used to communicate with the library.


Static Private Attributes

static const int WINDOW_BITS = 15
 These are magic numbers from zlib.h. Not clear why they are not defined there.
 
static const int GZIP_CODEC = 16
 

Friends

class Codec
 

Detailed Description

Different compression classes. The classes all expose the same API and abstract the underlying calls to the compression libraries. TODO: reconsider the abstracted API

Definition at line 32 of file compress.h.

Member Typedef Documentation

typedef std::map<const std::string, const THdfsCompression::type> impala::Codec::CodecMap
inherited

Map from codec string to compression format.

Definition at line 51 of file codec.h.

Member Enumeration Documentation

enum impala::GzipCompressor::Format

Compression formats supported by the zlib library.

Enumerator
ZLIB 
DEFLATE 
GZIP 

Definition at line 35 of file compress.h.

Constructor & Destructor Documentation

GzipCompressor::~GzipCompressor ( )
virtual

Definition at line 40 of file compress.cc.

References stream_.

GzipCompressor::GzipCompressor ( Format  format,
MemPool *  mem_pool = NULL,
bool  reuse_buffer = false 
)
private

Definition at line 34 of file compress.cc.

References stream_.

Member Function Documentation

void Codec::Close ( )
virtual inherited

Must be called on codec before destructor for final cleanup.

Definition at line 174 of file codec.cc.

References impala::MemPool::AcquireData(), impala::Codec::memory_pool_, and impala::Codec::temp_memory_pool_.

Status GzipCompressor::Compress ( int64_t  input_length,
const uint8_t *  input,
int64_t *  output_length,
uint8_t *  output 
)
private

Compresses 'input' into 'output'. The output buffer must be preallocated and large enough to hold the compressed data. On entry, *output_length must hold the length of the output buffer; on return, it holds the length of the compressed output.

Definition at line 82 of file compress.cc.

References MaxOutputLen(), impala::Status::OK, and stream_.

Referenced by ProcessBlock().

static Status impala::Codec::CreateCompressor ( MemPool *  mem_pool,
bool  reuse,
THdfsCompression::type  format,
boost::scoped_ptr< Codec > *  compressor 
)
static inherited

Create a compressor.

Input:
  mem_pool: the memory pool used to store the compressed data.
  reuse: if true, the allocated buffer can be reused.
  format: the type of compressor to create.
Output:
  compressor: scoped pointer to the compressor class to use.

Referenced by impala::HdfsParquetTableWriter::BaseColumnWriter::BaseColumnWriter(), impala::HdfsSequenceTableWriter::Init(), impala::HdfsTextTableWriter::Init(), impala::HdfsAvroTableWriter::Init(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), impala::RowBatch::Serialize(), impala::TEST_F(), and impala::TestCompression().

static Status impala::Codec::CreateCompressor ( MemPool *  mem_pool,
bool  reuse,
const std::string &  codec,
boost::scoped_ptr< Codec > *  compressor 
)
static inherited

Alternate factory method: takes a codec string and populates a scoped pointer.

static Status impala::Codec::CreateDecompressor ( MemPool *  mem_pool,
bool  reuse,
THdfsCompression::type  format,
boost::scoped_ptr< Codec > *  decompressor 
)
static inherited

Create a decompressor.

Input:
  mem_pool: the memory pool used to store the decompressed data.
  reuse: if true, the allocated buffer can be reused.
  format: the type of decompressor to create.
Output:
  decompressor: scoped pointer to the decompressor class to use.

If mem_pool is NULL, the resulting codec never allocates memory and the caller is responsible for providing the output buffer.

Referenced by impala::HdfsRCFileScanner::InitNewRange(), impala::HdfsParquetScanner::BaseColumnReader::Reset(), impala::RowBatch::RowBatch(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), and impala::HdfsScanner::UpdateDecompressor().

static Status impala::Codec::CreateDecompressor ( MemPool *  mem_pool,
bool  reuse,
const std::string &  codec,
boost::scoped_ptr< Codec > *  decompressor 
)
static inherited

Alternate factory method: takes a codec string and populates a scoped pointer.

virtual std::string impala::GzipCompressor::file_extension ( ) const
inline virtual

File extension to use for this compression codec.

Implements impala::Codec.

Definition at line 46 of file compress.h.

string Codec::GetCodecName ( THdfsCompression::type  type)
static inherited

Return the name of a compression algorithm.

Definition at line 50 of file codec.cc.

Referenced by impala::HdfsParquetTableWriter::Init().

Status Codec::GetHadoopCodecClassName ( THdfsCompression::type  ,
std::string *  out_name 
)
static inherited

Returns the java class name for the given compression type.

Definition at line 59 of file codec.cc.

References impala::Status::OK.

Referenced by impala::HdfsSequenceTableWriter::Init().

Status GzipCompressor::Init ( )
private virtual

Initialize the codec. This should only be called once.

Implements impala::Codec.

Definition at line 44 of file compress.cc.

References DEFLATE, format_, GZIP, GZIP_CODEC, impala::Status::OK, stream_, and WINDOW_BITS.

int64_t GzipCompressor::MaxOutputLen ( int64_t  input_len,
const uint8_t *  input = NULL 
)
virtual

Returns the maximum result length from applying the codec to input. Note this is not the exact result length, simply a bound to allow preallocating a buffer. This must be an O(1) operation (i.e. cannot read all of input). Codecs that don't support this should return -1.

Implements impala::Codec.

Definition at line 61 of file compress.cc.

References format_, GZIP, stream_, and UNLIKELY.

Referenced by Compress(), and ProcessBlock().

Status GzipCompressor::ProcessBlock ( bool  output_preallocated,
int64_t  input_length,
const uint8_t *  input,
int64_t *  output_length,
uint8_t **  output 
)
virtual

Process a block of data, either compressing or decompressing it.

If output_preallocated is true, *output_length must be the length of *output and data will be written directly to *output (*output must be big enough to contain the transformed output). If output_preallocated is false, *output will be allocated from the codec's mempool. In this case, a mempool must have been passed into the constructor.

In either case, *output_length will be set to the actual length of the transformed output.

Inputs:
  input_length: length of the data to process
  input: data to process

Implements impala::Codec.

Definition at line 110 of file compress.cc.

References impala::MemPool::Allocate(), impala::Codec::buffer_length_, Compress(), MaxOutputLen(), impala::Codec::memory_pool_, impala::Status::OK, impala::Codec::out_buffer_, RETURN_IF_ERROR, and impala::Codec::reuse_buffer_.
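
The two output modes described above can be illustrated with a small stand-alone sketch. This is not Impala's implementation: `ToyCodec` is a hypothetical class whose "transformation" is a plain copy, it returns `bool` instead of `Status`, and a `std::vector` stands in for memory owned by a `MemPool`.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical toy codec: copies input to output so that the buffer-handling
// contract can be shown without any compression library.
class ToyCodec {
 public:
  // Upper bound on the output size for a given input size (exact for a copy).
  int64_t MaxOutputLen(int64_t input_len) const { return input_len; }

  bool ProcessBlock(bool output_preallocated, int64_t input_length,
                    const uint8_t* input, int64_t* output_length,
                    uint8_t** output) {
    const int64_t needed = MaxOutputLen(input_length);
    if (output_preallocated) {
      // The caller owns *output; *output_length is its capacity on entry.
      if (*output == nullptr || *output_length < needed) return false;
    } else {
      // The codec owns the buffer (stand-in for a MemPool allocation).
      pool_.resize(static_cast<std::size_t>(needed));
      *output = pool_.data();
    }
    if (input_length > 0) {
      std::memcpy(*output, input, static_cast<std::size_t>(input_length));
    }
    *output_length = input_length;  // actual length of the transformed data
    return true;
  }

 private:
  std::vector<uint8_t> pool_;
};
```

In both modes the caller reads the result length back from `*output_length`; only the ownership of the buffer differs.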

Status Codec::ProcessBlock32 ( bool  output_preallocated,
int  input_length,
const uint8_t *  input,
int *  output_length,
uint8_t **  output 
)
inherited

Wrapper to the actual ProcessBlock() function. This wrapper uses lengths as ints and not int64_ts. We need to keep this interface because the Parquet thrift uses ints. See IMPALA-1116.

Definition at line 181 of file codec.cc.

References impala::Status::OK, impala::Codec::ProcessBlock(), RETURN_IF_ERROR, and UNLIKELY.
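
A wrapper of this kind has to widen the lengths on the way in and check that the result still fits in an `int` on the way out. The sketch below shows that pattern under simplifying assumptions: both function names are hypothetical, `bool` replaces `Status`, and the 64-bit stand-in merely reports an output twice as long as the input so that the overflow path can be exercised.

```cpp
#include <cstdint>
#include <limits>

// Stand-in for a 64-bit ProcessBlock(): pretends the output is twice as long
// as the input. No data is actually processed.
inline bool ProcessBlock64(int64_t input_length, int64_t* output_length) {
  *output_length = input_length * 2;
  return true;
}

// Hypothetical 32-bit wrapper: widens the lengths, forwards the call, and
// fails if the resulting length cannot be represented as an int.
inline bool ProcessBlock32Sketch(int input_length, int* output_length) {
  if (input_length < 0 || *output_length < 0) return false;
  int64_t output_length_64 = *output_length;
  if (!ProcessBlock64(input_length, &output_length_64)) return false;
  if (output_length_64 > std::numeric_limits<int>::max()) return false;
  *output_length = static_cast<int>(output_length_64);
  return true;
}
```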

virtual Status impala::Codec::ProcessBlockStreaming ( int64_t  input_length,
const uint8_t *  input,
int64_t *  input_bytes_read,
int64_t *  output_length,
uint8_t **  output,
bool *  eos 
)
inline virtual inherited

Process data like ProcessBlock(), but can consume partial input and may only produce partial output. *input_bytes_read returns the number of bytes of input that have been consumed. Even if all input has been consumed, the caller must continue calling to fetch output until *eos returns true.

Reimplemented in impala::GzipDecompressor.

Definition at line 117 of file codec.h.

Referenced by impala::DecompressorTest::CompressAndStreamingDecompress().
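
The calling convention described above implies a loop on the caller's side: advance the input by `*input_bytes_read`, collect whatever output was produced, and repeat until `*eos` is true. A minimal sketch of such a loop follows. `ToyStreamingCodec` and `DrainStream` are hypothetical; the toy codec copies at most four bytes per call to force partial progress, and returns `bool` instead of `Status`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int64_t kChunk = 4;  // toy limit on bytes consumed per call

// Hypothetical streaming codec that copies its input in small chunks and
// reports end-of-stream once the announced total has been consumed.
class ToyStreamingCodec {
 public:
  explicit ToyStreamingCodec(int64_t total_length) : remaining_(total_length) {}

  bool ProcessBlockStreaming(int64_t input_length, const uint8_t* input,
                             int64_t* input_bytes_read, int64_t* output_length,
                             uint8_t** output, bool* eos) {
    const int64_t n = std::min(input_length, kChunk);
    buffer_.assign(input, input + n);
    *input_bytes_read = n;
    *output_length = n;
    *output = buffer_.data();
    remaining_ -= n;
    *eos = (remaining_ <= 0);
    return true;
  }

 private:
  int64_t remaining_;
  std::vector<uint8_t> buffer_;
};

// Drives the codec until it reports end-of-stream, appending each partial
// output to the result.
inline std::vector<uint8_t> DrainStream(ToyStreamingCodec* codec,
                                        const std::vector<uint8_t>& input) {
  std::vector<uint8_t> result;
  int64_t offset = 0;
  bool eos = false;
  while (!eos) {
    int64_t bytes_read = 0;
    int64_t output_length = 0;
    uint8_t* output = nullptr;
    if (!codec->ProcessBlockStreaming(
            static_cast<int64_t>(input.size()) - offset, input.data() + offset,
            &bytes_read, &output_length, &output, &eos)) {
      break;
    }
    result.insert(result.end(), output, output + output_length);
    offset += bytes_read;
  }
  return result;
}
```

The output pointer is only valid until the next call in this sketch, which is why each chunk is copied out before looping.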

bool impala::Codec::reuse_output_buffer ( ) const
inline inherited

Definition at line 135 of file codec.h.

References impala::Codec::reuse_buffer_.

Friends And Related Function Documentation

friend class Codec
friend

Definition at line 49 of file compress.h.

Member Data Documentation

const char *const Codec::BZIP2_COMPRESSION = "org.apache.hadoop.io.compress.BZip2Codec"
static inherited

Definition at line 46 of file codec.h.

const Codec::CodecMap Codec::CODEC_MAP
static inherited
Initial value:
= map_list_of
("", THdfsCompression::NONE)
(DEFAULT_COMPRESSION, THdfsCompression::DEFAULT)
(GZIP_COMPRESSION, THdfsCompression::GZIP)
(BZIP2_COMPRESSION, THdfsCompression::BZIP2)
(SNAPPY_COMPRESSION, THdfsCompression::SNAPPY_BLOCKED)

Definition at line 52 of file codec.h.

Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadFileHeader().
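
A lookup against a map of this shape might look like the sketch below. It is written to compile without Impala's headers, so several names are assumptions: `CompressionType` stands in for the Thrift-generated `THdfsCompression::type`, the map uses non-const key and value types for simplicity, and `GetCodecMap` and `LookupCodec` are hypothetical helpers.

```cpp
#include <map>
#include <string>

// Stand-in for THdfsCompression::type; only the values that appear in the
// initializer above are modeled.
enum class CompressionType { NONE, DEFAULT, GZIP, BZIP2, SNAPPY_BLOCKED };

using CodecMapSketch = std::map<std::string, CompressionType>;

// Builds the map once, using the Hadoop codec class names listed on this page.
inline const CodecMapSketch& GetCodecMap() {
  static const CodecMapSketch kMap = {
      {"", CompressionType::NONE},
      {"org.apache.hadoop.io.compress.DefaultCodec", CompressionType::DEFAULT},
      {"org.apache.hadoop.io.compress.GzipCodec", CompressionType::GZIP},
      {"org.apache.hadoop.io.compress.BZip2Codec", CompressionType::BZIP2},
      {"org.apache.hadoop.io.compress.SnappyCodec",
       CompressionType::SNAPPY_BLOCKED},
  };
  return kMap;
}

// Hypothetical helper: returns true and sets *out if the codec string is known.
inline bool LookupCodec(const std::string& codec, CompressionType* out) {
  const auto it = GetCodecMap().find(codec);
  if (it == GetCodecMap().end()) return false;
  *out = it->second;
  return true;
}
```

A caller that reads a codec class name from a file header would treat a failed lookup as an unsupported codec.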

const char *const Codec::DEFAULT_COMPRESSION
static inherited
Initial value:
=
"org.apache.hadoop.io.compress.DefaultCodec"

These are the codec string representations used in Hadoop.

Definition at line 44 of file codec.h.

Format impala::GzipCompressor::format_
private

Definition at line 53 of file compress.h.

Referenced by Init(), and MaxOutputLen().

const int impala::GzipCompressor::GZIP_CODEC = 16
static private

Definition at line 60 of file compress.h.

Referenced by Init().

const char *const Codec::GZIP_COMPRESSION = "org.apache.hadoop.io.compress.GzipCodec"
static inherited

Definition at line 45 of file codec.h.

const int impala::Codec::MAX_BLOCK_SIZE = (2L * 1024 * 1024 * 1024) - 1
static inherited

Largest block we will compress/decompress: 2GB. Compressed blocks are never this big in practice, but the limit guards against a corrupt file whose block length field holds some very large number.

Definition at line 140 of file codec.h.

Referenced by impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and SnappyBlockDecompress().
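
The documented value works out to 2^31 - 1 = 2147483647. A guard of the kind described above could be sketched as follows. This is an illustration rather than Impala's code: `kMaxBlockSize` restates the constant with an explicitly 64-bit operand, so the multiplication cannot overflow where `long` is 32 bits, and `IsPlausibleBlockLength` is a hypothetical helper.

```cpp
#include <cstdint>

// Same value as the documented constant, computed in 64-bit arithmetic.
constexpr int64_t kMaxBlockSize = (int64_t{2} * 1024 * 1024 * 1024) - 1;

// Hypothetical helper: rejects block lengths read from a file header that are
// negative or above the limit, either of which suggests a corrupt file.
inline bool IsPlausibleBlockLength(int64_t length_from_header) {
  return length_from_header >= 0 && length_from_header <= kMaxBlockSize;
}
```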

const char *const Codec::SNAPPY_COMPRESSION = "org.apache.hadoop.io.compress.SnappyCodec"
static inherited

Definition at line 47 of file codec.h.

z_stream impala::GzipCompressor::stream_
private

Structure used to communicate with the library.

Definition at line 56 of file compress.h.

Referenced by Compress(), GzipCompressor(), Init(), MaxOutputLen(), and ~GzipCompressor().

boost::scoped_ptr<MemPool> impala::Codec::temp_memory_pool_
protected inherited

Temporary memory pool: if the initial output size turns out to be too small, this pool lets us free the unused buffers.

Definition at line 158 of file codec.h.

Referenced by impala::Codec::Close(), impala::Codec::Codec(), impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), and impala::BzipCompressor::ProcessBlock().

const char *const Codec::UNKNOWN_CODEC_ERROR
static inherited
Initial value:
=
"This compression codec is currently unsupported: "

Definition at line 48 of file codec.h.

const int impala::GzipCompressor::WINDOW_BITS = 15
static private

These are magic numbers from zlib.h. Not clear why they are not defined there.

Definition at line 59 of file compress.h.

Referenced by Init().
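
For context, zlib documents the `windowBits` argument of `deflateInit2()` as follows: a value from 8 to 15 selects the zlib wrapper, a negated value selects raw deflate with no wrapper, and adding 16 selects a gzip header and trailer. The sketch below shows how the two constants could combine with the `Format` enum under that convention. The helper name `WindowBitsFor` is hypothetical, and the mapping is an assumption based on zlib's documentation rather than a copy of `Init()`.

```cpp
enum Format { ZLIB, DEFLATE, GZIP };

const int WINDOW_BITS = 15;  // log2 of the largest window zlib accepts (32 KB)
const int GZIP_CODEC = 16;   // added to windowBits to request a gzip wrapper

// Hypothetical helper: maps a format to the windowBits value that zlib's
// deflateInit2() expects for it.
inline int WindowBitsFor(Format format) {
  switch (format) {
    case DEFLATE:
      return -WINDOW_BITS;  // raw deflate stream, no header or trailer
    case GZIP:
      return WINDOW_BITS + GZIP_CODEC;  // gzip header and trailer
    case ZLIB:
      break;
  }
  return WINDOW_BITS;  // zlib header and trailer
}
```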


The documentation for this class was generated from the following files: