Impala
Impala is the open source, native analytic database for Apache Hadoop.
impala::Codec Class Reference (abstract)

#include <codec.h>


Public Types

typedef std::map< const std::string, const THdfsCompression::type > CodecMap
 Map from codec string to compression format.
 

Public Member Functions

virtual ~Codec ()
 
virtual Status ProcessBlock (bool output_preallocated, int64_t input_length, const uint8_t *input, int64_t *output_length, uint8_t **output)=0
 Process a block of data, either compressing or decompressing it.
 
Status ProcessBlock32 (bool output_preallocated, int input_length, const uint8_t *input, int *output_length, uint8_t **output)
 
virtual Status ProcessBlockStreaming (int64_t input_length, const uint8_t *input, int64_t *input_bytes_read, int64_t *output_length, uint8_t **output, bool *eos)
 
virtual int64_t MaxOutputLen (int64_t input_len, const uint8_t *input=NULL)=0
 
virtual void Close ()
 Must be called on the codec before the destructor for final cleanup.
 
virtual std::string file_extension () const =0
 File extension to use for this compression codec.
 
bool reuse_output_buffer () const
 

Static Public Member Functions

static Status CreateDecompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *decompressor)
 
static Status CreateDecompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *decompressor)
 Alternate factory method: takes a codec string and populates a scoped pointer.
 
static Status CreateCompressor (MemPool *mem_pool, bool reuse, THdfsCompression::type format, boost::scoped_ptr< Codec > *compressor)
 
static Status CreateCompressor (MemPool *mem_pool, bool reuse, const std::string &codec, boost::scoped_ptr< Codec > *compressor)
 Alternate factory method: takes a codec string and populates a scoped pointer.
 
static std::string GetCodecName (THdfsCompression::type)
 Return the name of a compression algorithm.
 
static Status GetHadoopCodecClassName (THdfsCompression::type, std::string *out_name)
 Returns the Java class name for the given compression type.
 

Static Public Attributes

static const char *const DEFAULT_COMPRESSION
 These are the codec string representations used in Hadoop.
 
static const char *const GZIP_COMPRESSION = "org.apache.hadoop.io.compress.GzipCodec"
 
static const char *const BZIP2_COMPRESSION = "org.apache.hadoop.io.compress.BZip2Codec"
 
static const char *const SNAPPY_COMPRESSION = "org.apache.hadoop.io.compress.SnappyCodec"
 
static const char *const UNKNOWN_CODEC_ERROR
 
static const CodecMap CODEC_MAP
 
static const int MAX_BLOCK_SIZE = (2L * 1024 * 1024 * 1024) - 1
 

Protected Member Functions

 Codec (MemPool *mem_pool, bool reuse_buffer)
 
virtual Status Init ()=0
 Initialize the codec. This should only be called once.
 

Protected Attributes

MemPool * memory_pool_
 Pool to allocate the buffer to hold transformed data.
 
boost::scoped_ptr< MemPool > temp_memory_pool_
 
bool reuse_buffer_
 Can we reuse the output buffer or do we need to allocate on each call?
 
uint8_t * out_buffer_
 
int64_t buffer_length_
 Length of the output buffer.
 

Detailed Description

Create a compression object. This is the base class for all compression algorithms. A compression algorithm is either a compressor or a decompressor. To add a new algorithm, both a compressor and a decompressor are generally added, and each of these objects inherits from this class. The objects are instantiated by the static Create methods defined here. The type of compression is defined in the Thrift interface THdfsCompression.

TODO: make this pure virtual (no members) so that external codecs (e.g. Lzo) can implement this without binary dependency issues.

TODO: this interface is clunky. There should be one class that implements both the compress and decompress APIs to remove duplication.

Definition at line 41 of file codec.h.
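
The factory arrangement described above can be illustrated with a minimal, self-contained sketch. Everything in it is a simplified stand-in rather than Impala code: `Format` replaces THdfsCompression::type, `CodecSketch` replaces Codec, `bool` replaces Status, `std::unique_ptr` is swapped in for boost::scoped_ptr, the MemPool and reuse arguments are omitted, and the two toy subclasses and their file extensions are hypothetical. Returning success with an empty pointer for the uncompressed format is also an assumption of this sketch.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for THdfsCompression::type.
enum class Format { NONE, GZIP, SNAPPY };

// Simplified stand-in for the Codec base class (no MemPool, bool instead of Status).
class CodecSketch {
 public:
  virtual ~CodecSketch() = default;
  virtual std::string file_extension() const = 0;

  // Factory: builds the concrete codec for `format` and initializes it.
  static bool CreateCompressor(Format format,
                               std::unique_ptr<CodecSketch>* compressor);

 protected:
  virtual bool Init() = 0;  // invoked once, by the factory
};

// Toy subclasses; the extensions returned here are illustrative values.
class ToyGzipCompressor : public CodecSketch {
 public:
  std::string file_extension() const override { return ".gz"; }

 protected:
  bool Init() override { return true; }
};

class ToySnappyCompressor : public CodecSketch {
 public:
  std::string file_extension() const override { return ".snappy"; }

 protected:
  bool Init() override { return true; }
};

bool CodecSketch::CreateCompressor(Format format,
                                   std::unique_ptr<CodecSketch>* compressor) {
  assert(compressor != nullptr);
  switch (format) {
    case Format::NONE:
      compressor->reset();  // assumption: uncompressed data needs no codec
      return true;
    case Format::GZIP:
      compressor->reset(new ToyGzipCompressor());
      break;
    case Format::SNAPPY:
      compressor->reset(new ToySnappyCompressor());
      break;
  }
  if (!*compressor) return false;  // value outside the enum
  return (*compressor)->Init();
}
```

Keeping Init() protected and calling it only from the factory is one way to honor the "should only be called once" requirement, since callers never see an uninitialized codec.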

Member Typedef Documentation

typedef std::map<const std::string, const THdfsCompression::type> impala::Codec::CodecMap

Map from codec string to compression format.

Definition at line 51 of file codec.h.

Constructor & Destructor Documentation

virtual impala::Codec::~Codec ( )
inline virtual

Definition at line 89 of file codec.h.

Codec::Codec ( MemPool *  mem_pool,
bool  reuse_buffer 
)
protected

Create a compression operator.
Inputs:
mem_pool: memory pool from which to allocate the output buffer. If mem_pool is NULL, the caller must always preallocate *output in ProcessBlock().
reuse_buffer: if false, always allocate a new buffer rather than reusing the existing one.

Definition at line 164 of file codec.cc.

References impala::MemPool::mem_tracker(), memory_pool_, and temp_memory_pool_.

Member Function Documentation

void Codec::Close ( )
virtual

Must be called on codec before destructor for final cleanup.

Definition at line 174 of file codec.cc.

References impala::MemPool::AcquireData(), memory_pool_, and temp_memory_pool_.

static Status impala::Codec::CreateCompressor ( MemPool *  mem_pool,
bool  reuse,
THdfsCompression::type  format,
boost::scoped_ptr< Codec > *  compressor 
)
static

Create a compressor.
Input:
mem_pool: the memory pool used to store the compressed data.
reuse: if true, the allocated buffer can be reused.
format: the type of compressor to create.
Output:
compressor: scoped pointer to the compressor class to use.

Referenced by impala::HdfsParquetTableWriter::BaseColumnWriter::BaseColumnWriter(), impala::HdfsSequenceTableWriter::Init(), impala::HdfsTextTableWriter::Init(), impala::HdfsAvroTableWriter::Init(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), impala::RowBatch::Serialize(), impala::TEST_F(), and impala::TestCompression().

static Status impala::Codec::CreateCompressor ( MemPool *  mem_pool,
bool  reuse,
const std::string &  codec,
boost::scoped_ptr< Codec > *  compressor 
)
static

Alternate factory method: takes a codec string and populates a scoped pointer.

static Status impala::Codec::CreateDecompressor ( MemPool *  mem_pool,
bool  reuse,
THdfsCompression::type  format,
boost::scoped_ptr< Codec > *  decompressor 
)
static

Create a decompressor.
Input:
mem_pool: the memory pool used to store the decompressed data.
reuse: if true, the allocated buffer can be reused.
format: the type of decompressor to create.
Output:
decompressor: scoped pointer to the decompressor class to use.
If mem_pool is NULL, the resulting codec will never allocate memory and the caller is responsible for providing it.

Referenced by impala::HdfsRCFileScanner::InitNewRange(), impala::HdfsParquetScanner::BaseColumnReader::Reset(), impala::RowBatch::RowBatch(), impala::DecompressorTest::RunTest(), impala::DecompressorTest::RunTestStreaming(), and impala::HdfsScanner::UpdateDecompressor().

static Status impala::Codec::CreateDecompressor ( MemPool *  mem_pool,
bool  reuse,
const std::string &  codec,
boost::scoped_ptr< Codec > *  decompressor 
)
static

Alternate factory method: takes a codec string and populates a scoped pointer.

string Codec::GetCodecName ( THdfsCompression::type  type)
static

Return the name of a compression algorithm.

Definition at line 50 of file codec.cc.

Referenced by impala::HdfsParquetTableWriter::Init().

Status Codec::GetHadoopCodecClassName ( THdfsCompression::type  ,
std::string *  out_name 
)
static

Returns the Java class name for the given compression type.

Definition at line 59 of file codec.cc.

References impala::Status::OK.

Referenced by impala::HdfsSequenceTableWriter::Init().

virtual int64_t impala::Codec::MaxOutputLen ( int64_t  input_len,
const uint8_t *  input = NULL 
)
pure virtual

Returns the maximum result length from applying the codec to input. Note this is not the exact result length, simply a bound to allow preallocating a buffer. This must be an O(1) operation (i.e. cannot read all of input). Codecs that don't support this should return -1.

Implemented in impala::Lz4Compressor, impala::SnappyBlockDecompressor, impala::SnappyCompressor, impala::Lz4Decompressor, impala::SnappyBlockCompressor, impala::SnappyDecompressor, impala::BzipCompressor, impala::BzipDecompressor, impala::GzipCompressor, and impala::GzipDecompressor.

Referenced by impala::DecompressorTest::CompressAndDecompress(), and impala::DecompressorTest::CompressAndDecompressNoOutputAllocated().
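
A caller might use the bound as follows. This is only a sketch of one possible calling pattern, not code from Impala: `OutputPlan` and `PlanOutput` are hypothetical names, and the choice to fall back to codec-side allocation when the bound is -1 is an assumption about how a caller could react.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: records whether the caller will hand the codec a
// preallocated buffer, and holds that buffer if so.
struct OutputPlan {
  bool output_preallocated;
  std::vector<uint8_t> buffer;
};

// max_output_len is the value a MaxOutputLen()-style call returned:
// a non-negative upper bound, or -1 when the codec cannot give one.
OutputPlan PlanOutput(int64_t max_output_len) {
  assert(max_output_len >= -1);
  OutputPlan plan;
  plan.output_preallocated = (max_output_len >= 0);
  if (plan.output_preallocated) {
    // The bound is not the exact result length, only a safe capacity.
    plan.buffer.resize(static_cast<size_t>(max_output_len));
  }
  return plan;
}
```

When `output_preallocated` comes back false, the caller would pass false to ProcessBlock() and let the codec allocate from its own pool.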

virtual Status impala::Codec::ProcessBlock ( bool  output_preallocated,
int64_t  input_length,
const uint8_t *  input,
int64_t *  output_length,
uint8_t **  output 
)
pure virtual

Process a block of data, either compressing or decompressing it.

If output_preallocated is true, *output_length must be the length of *output and data will be written directly to *output (*output must be big enough to contain the transformed output).
If output_preallocated is false, *output will be allocated from the codec's mempool. In this case, a mempool must have been passed into the constructor.
In either case, *output_length will be set to the actual length of the transformed output.
Inputs:
input_length: length of the data to process.
input: data to process.

Implemented in impala::Lz4Compressor, impala::SnappyBlockDecompressor, impala::SnappyCompressor, impala::Lz4Decompressor, impala::SnappyBlockCompressor, impala::SnappyDecompressor, impala::BzipCompressor, impala::BzipDecompressor, impala::GzipCompressor, and impala::GzipDecompressor.

Referenced by impala::DecompressorTest::CompressAndDecompress(), impala::DecompressorTest::CompressAndDecompressNoOutputAllocated(), impala::DecompressorTest::CompressAndStreamingDecompress(), and ProcessBlock32().
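
The two output modes can be seen in a toy codec that simply copies its input. This is a sketch under stated assumptions: `IdentityCodec` is hypothetical, it returns `bool` where the real interface returns Status, and a `std::vector` member stands in for the codec's MemPool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical codec whose "transformation" is a plain copy. It mirrors only
// the output_preallocated contract described above.
class IdentityCodec {
 public:
  // Returns false on failure (stand-in for a non-OK Status).
  bool ProcessBlock(bool output_preallocated, int64_t input_length,
                    const uint8_t* input, int64_t* output_length,
                    uint8_t** output) {
    assert(input_length >= 0);
    if (output_preallocated) {
      // Caller owns *output; on entry *output_length is its capacity.
      if (*output_length < input_length) return false;
    } else {
      // Codec owns the buffer; the vector stands in for the codec's mempool.
      buffer_.resize(static_cast<size_t>(input_length));
      *output = buffer_.data();
    }
    if (input_length > 0) {
      std::memcpy(*output, input, static_cast<size_t>(input_length));
    }
    // In either mode, report the actual transformed length.
    *output_length = input_length;
    return true;
  }

 private:
  std::vector<uint8_t> buffer_;
};
```

Note that `*output_length` is an in/out parameter in the preallocated mode (capacity in, actual length out) and output-only otherwise.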

Status Codec::ProcessBlock32 ( bool  output_preallocated,
int  input_length,
const uint8_t *  input,
int *  output_length,
uint8_t **  output 
)

Wrapper to the actual ProcessBlock() function. This wrapper uses lengths as ints and not int64_ts. We need to keep this interface because the Parquet thrift uses ints. See IMPALA-1116.

Definition at line 181 of file codec.cc.

References impala::Status::OK, ProcessBlock(), RETURN_IF_ERROR, and UNLIKELY.
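
A wrapper of this kind has to widen the int lengths on the way in and narrow the result on the way out. The sketch below shows one plausible shape; it is an assumption, not the contents of codec.cc. `Block64Fn` and `ProcessBlock32Sketch` are hypothetical names, `bool` replaces Status, and the overflow check before narrowing is what this sketch assumes the wrapper needs.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>

// Hypothetical signature for a 64-bit ProcessBlock()-style callable.
using Block64Fn = std::function<bool(bool, int64_t, const uint8_t*,
                                     int64_t*, uint8_t**)>;

// Adapts int lengths to the int64_t interface. Returns false on failure.
bool ProcessBlock32Sketch(const Block64Fn& process64, bool output_preallocated,
                          int input_length, const uint8_t* input,
                          int* output_length, uint8_t** output) {
  assert(output_length != nullptr);
  int64_t input_len64 = input_length;     // widening is always safe
  int64_t output_len64 = *output_length;  // carries the capacity if preallocated
  if (!process64(output_preallocated, input_len64, input, &output_len64,
                 output)) {
    return false;
  }
  // Narrowing back to int is only safe if the result fits.
  if (output_len64 > std::numeric_limits<int>::max()) return false;
  *output_length = static_cast<int>(output_len64);
  return true;
}
```

On failure the caller's `*output_length` is left untouched, so a stale but valid value is never replaced by a truncated one.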

virtual Status impala::Codec::ProcessBlockStreaming ( int64_t  input_length,
const uint8_t *  input,
int64_t *  input_bytes_read,
int64_t *  output_length,
uint8_t **  output,
bool *  eos 
)
inline virtual

Process data like ProcessBlock(), but can consume partial input and may only produce partial output. *input_bytes_read returns the number of bytes of input that have been consumed. Even if all input has been consumed, the caller must continue calling to fetch output until *eos returns true.

Reimplemented in impala::GzipDecompressor.

Definition at line 117 of file codec.h.

Referenced by impala::DecompressorTest::CompressAndStreamingDecompress().
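
The calling loop implied by this contract can be sketched with a toy codec. `ChunkedCopier` and `DrainAll` are hypothetical: the toy consumes all offered input on the first call, returns at most a fixed number of bytes per call, and reports *eos once its queue is drained, which is a simplification of what a real streaming decompressor would signal. `bool` again replaces Status.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical streaming "codec" that copies data through in small chunks.
class ChunkedCopier {
 public:
  explicit ChunkedCopier(size_t chunk) : chunk_(chunk) { assert(chunk > 0); }

  bool ProcessBlockStreaming(int64_t input_length, const uint8_t* input,
                             int64_t* input_bytes_read, int64_t* output_length,
                             uint8_t** output, bool* eos) {
    ++calls_;
    // Consume everything offered and queue it internally.
    if (input_length > 0) {
      pending_.insert(pending_.end(), input, input + input_length);
    }
    *input_bytes_read = input_length;
    // Hand back at most chunk_ bytes of the queued data per call.
    const size_t n = std::min(chunk_, pending_.size() - offset_);
    out_.assign(pending_.begin() + static_cast<std::ptrdiff_t>(offset_),
                pending_.begin() + static_cast<std::ptrdiff_t>(offset_ + n));
    offset_ += n;
    *output = out_.data();
    *output_length = static_cast<int64_t>(n);
    *eos = (offset_ == pending_.size());
    return true;
  }

  int calls() const { return calls_; }

 private:
  size_t chunk_;
  size_t offset_ = 0;
  int calls_ = 0;
  std::vector<uint8_t> pending_;
  std::vector<uint8_t> out_;
};

// Caller-side loop: advance by *input_bytes_read and keep calling until *eos,
// even after all input has been handed over.
std::vector<uint8_t> DrainAll(ChunkedCopier* codec,
                              const std::vector<uint8_t>& in) {
  std::vector<uint8_t> result;
  const uint8_t* p = in.data();
  int64_t remaining = static_cast<int64_t>(in.size());
  bool eos = false;
  while (!eos) {
    int64_t read = 0;
    int64_t out_len = 0;
    uint8_t* out = nullptr;
    if (!codec->ProcessBlockStreaming(remaining, p, &read, &out_len, &out,
                                      &eos)) {
      break;
    }
    p += read;
    remaining -= read;
    result.insert(result.end(), out, out + out_len);
  }
  return result;
}
```

With a 10-byte input and a chunk size of 4, the loop makes three calls: the first consumes all input, and the remaining two are issued with zero input bytes purely to fetch output.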

bool impala::Codec::reuse_output_buffer ( ) const
inline

Definition at line 135 of file codec.h.

References reuse_buffer_.

Member Data Documentation

const char *const Codec::BZIP2_COMPRESSION = "org.apache.hadoop.io.compress.BZip2Codec"
static

Definition at line 46 of file codec.h.

const Codec::CodecMap Codec::CODEC_MAP
static
Initial value:
= map_list_of
("", THdfsCompression::NONE)
(DEFAULT_COMPRESSION, THdfsCompression::DEFAULT)
(GZIP_COMPRESSION, THdfsCompression::GZIP)
(BZIP2_COMPRESSION, THdfsCompression::BZIP2)
(SNAPPY_COMPRESSION, THdfsCompression::SNAPPY_BLOCKED)

Definition at line 52 of file codec.h.

Referenced by impala::HdfsSequenceScanner::ReadFileHeader(), and impala::HdfsRCFileScanner::ReadFileHeader().
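
A lookup against a table like this one might look as follows. The sketch is self-contained and does not use Impala types: `Compression` is a hypothetical stand-in for THdfsCompression::type, `CodecTable` and `LookupCodec` are illustrative names, and a braced initializer list replaces boost's map_list_of. The keys and the error prefix are the string values listed on this page.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for THdfsCompression::type.
enum class Compression { NONE, DEFAULT, GZIP, BZIP2, SNAPPY_BLOCKED };

// Same entries as the CODEC_MAP initial value shown above.
const std::map<std::string, Compression>& CodecTable() {
  static const std::map<std::string, Compression> table = {
      {"", Compression::NONE},
      {"org.apache.hadoop.io.compress.DefaultCodec", Compression::DEFAULT},
      {"org.apache.hadoop.io.compress.GzipCodec", Compression::GZIP},
      {"org.apache.hadoop.io.compress.BZip2Codec", Compression::BZIP2},
      {"org.apache.hadoop.io.compress.SnappyCodec", Compression::SNAPPY_BLOCKED},
  };
  return table;
}

// Resolves a Hadoop codec class name read from a file header.
// Returns true and sets *format when known; otherwise fills *error.
bool LookupCodec(const std::string& name, Compression* format,
                 std::string* error) {
  assert(format != nullptr && error != nullptr);
  const auto it = CodecTable().find(name);
  if (it == CodecTable().end()) {
    *error = "This compression codec is currently unsupported: " + name;
    return false;
  }
  *format = it->second;
  return true;
}
```

Note that the Hadoop SnappyCodec class name maps to the SNAPPY_BLOCKED format rather than to plain SNAPPY, as the initial value above shows.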

const char *const Codec::DEFAULT_COMPRESSION
static
Initial value:
=
"org.apache.hadoop.io.compress.DefaultCodec"

These are the codec string representations used in Hadoop.

Definition at line 44 of file codec.h.

const char *const Codec::GZIP_COMPRESSION = "org.apache.hadoop.io.compress.GzipCodec"
static

Definition at line 45 of file codec.h.

const int impala::Codec::MAX_BLOCK_SIZE = (2L * 1024 * 1024 * 1024) - 1
static

Largest block we will compress/decompress: 2GB. We are dealing with compressed blocks that are never this big but we want to guard against a corrupt file that has the block length as some large number.

Definition at line 140 of file codec.h.

Referenced by impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), impala::SnappyDecompressor::ProcessBlock(), impala::SnappyBlockDecompressor::ProcessBlock(), and SnappyBlockDecompress().
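
The arithmetic and the intended guard can be checked in a few lines. `kMaxBlockSize` and `IsPlausibleBlockLength` are hypothetical names, and whether the real decompressors also reject a length exactly equal to the limit is not stated here, so the inclusive comparison is an assumption. The sketch uses 2LL so the multiplication is done in 64 bits on every platform.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Same value as MAX_BLOCK_SIZE: 2 * 1024^3 - 1 = 2147483647.
constexpr int64_t kMaxBlockSize = (2LL * 1024 * 1024 * 1024) - 1;
static_assert(kMaxBlockSize == std::numeric_limits<int32_t>::max(),
              "2 GiB - 1 is the largest signed 32-bit value");

// Rejects block lengths that a corrupt file header could claim.
bool IsPlausibleBlockLength(int64_t length_from_header) {
  return length_from_header >= 0 && length_from_header <= kMaxBlockSize;
}
```

Checking the length before allocating or decompressing keeps a corrupt header from triggering an enormous allocation.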

const char *const Codec::SNAPPY_COMPRESSION = "org.apache.hadoop.io.compress.SnappyCodec"
static

Definition at line 47 of file codec.h.

boost::scoped_ptr<MemPool> impala::Codec::temp_memory_pool_
protected

Temporary memory pool: if the output buffer turns out to be too small, this pool lets us free the unused buffers.

Definition at line 158 of file codec.h.

Referenced by Close(), Codec(), impala::GzipDecompressor::ProcessBlock(), impala::BzipDecompressor::ProcessBlock(), and impala::BzipCompressor::ProcessBlock().

const char *const Codec::UNKNOWN_CODEC_ERROR
static
Initial value:
=
"This compression codec is currently unsupported: "

Definition at line 48 of file codec.h.


The documentation for this class was generated from the following files: