Impala
Impala is the open source, native analytic database for Apache Hadoop.
impala::HdfsAvroTableWriter Class Reference

#include <hdfs-avro-table-writer.h>


Public Member Functions

 HdfsAvroTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output, const HdfsPartitionDescriptor *partition, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_exprs)
 
virtual ~HdfsAvroTableWriter ()
 
virtual Status Init ()
 Do initialization of writer.
 
virtual Status Finalize ()
 
virtual Status InitNewFile ()
 Called when a new file is started.
 
virtual void Close ()
 Called once when this writer should clean up any resources.
 
virtual uint64_t default_block_size () const
 
virtual std::string file_extension () const
 Returns the file extension for this writer.
 
virtual Status AppendRowBatch (RowBatch *rows, const std::vector< int32_t > &row_group_indices, bool *new_file)
 
TInsertStats & stats ()
 Returns the stats for this writer.
 

Protected Member Functions

Status Write (const char *data, int32_t len)
 Write to the current HDFS file.
 
Status Write (const uint8_t *data, int32_t len)
 
template<typename T >
Status Write (T v)
 

Protected Attributes

HdfsTableSink * parent_
 Parent table sink object.
 
RuntimeState * state_
 Runtime state.
 
OutputPartition * output_
 Structure describing partition written to by this writer.
 
const HdfsTableDescriptor * table_desc_
 Table descriptor of table to be written.
 
std::vector< ExprContext * > output_expr_ctxs_
 Expressions that materialize output values.
 
TInsertStats stats_
 Subclass should populate any file format specific stats.
 

Static Protected Attributes

static const int HDFS_FLUSH_WRITE_SIZE = 50 * 1024
 

Private Member Functions

void ConsumeRow (TupleRow *row)
 Processes a single row, appending to out_.
 
void AppendField (const ColumnType &type, const void *value)
 Adds an encoded field to out_.
 
Status WriteFileHeader ()
 Writes the Avro file header to HDFS.
 
Status Flush ()
 

Private Attributes

WriteStream out_
 Buffer which holds accumulated output.
 
boost::scoped_ptr< MemPool > mem_pool_
 
uint64_t unflushed_rows_
 Number of rows consumed since last flush.
 
std::string codec_name_
 Name of codec, only set if codec_type_ != NONE.
 
THdfsCompression::type codec_type_
 Type of the codec, will be NONE if no compression is used.
 
boost::scoped_ptr< Codec > compressor_
 The codec for compressing, only set if codec_type_ != NONE.
 
std::string sync_marker_
 16-byte sync marker (a UUID).
 

Detailed Description

Consumes rows and outputs the rows into an Avro file in HDFS. Each Avro file contains blocks of records (rows). The file metadata specifies the schema of the records in addition to the name of the codec, if any, used to compress blocks. The structure is:

  [ Metadata ]
  [ Sync Marker ]
  [ Data Block ]
    ...
  [ Data Block ]

Each Data Block consists of:

  [ Number of Rows in Block ]
  [ Size of serialized objects, after compression ]
  [ Serialized objects, compressed ]
  [ Sync Marker ]

If compression is used, each block is compressed individually. The block size defaults to about 64KB before compression.

This writer implements the Avro 1.7.7 spec: http://avro.apache.org/docs/1.7.7/spec.html
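
The data block layout described above can be sketched in a few lines. This is an illustration only, not the code in hdfs-avro-table-writer.cc: the names AppendAvroLong and BuildDataBlock are hypothetical. It assumes, following the Avro specification, that the row count and the byte size are written as Avro longs (zigzag-mapped, then variable-length encoded), and that the payload passed in has already been serialized and, if a codec is in use, compressed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends an Avro "long": zigzag-map the signed value to an unsigned one,
// then emit it as a base-128 varint, least-significant 7-bit group first.
void AppendAvroLong(int64_t value, std::string* out) {
  uint64_t n = (static_cast<uint64_t>(value) << 1) ^
               static_cast<uint64_t>(value >> 63);
  while (n >= 0x80) {
    out->push_back(static_cast<char>((n & 0x7F) | 0x80));
    n >>= 7;
  }
  out->push_back(static_cast<char>(n));
}

// Hypothetical helper: frames an already-serialized payload as one data block:
// [ number of rows ] [ payload size ] [ payload ] [ sync marker ].
std::string BuildDataBlock(int64_t num_rows, const std::string& payload,
                           const std::string& sync_marker) {
  assert(sync_marker.size() == 16);  // the sync marker is 16 bytes long
  std::string block;
  AppendAvroLong(num_rows, &block);
  AppendAvroLong(static_cast<int64_t>(payload.size()), &block);
  block += payload;
  block += sync_marker;
  return block;
}
```

A caller would invoke BuildDataBlock once per flush with the number of buffered rows, then append the result to the file after the metadata and the first sync marker.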

Definition at line 56 of file hdfs-avro-table-writer.h.

Constructor & Destructor Documentation

HdfsAvroTableWriter::HdfsAvroTableWriter ( HdfsTableSink * parent,
RuntimeState * state,
OutputPartition * output,
const HdfsPartitionDescriptor * partition,
const HdfsTableDescriptor * table_desc,
const std::vector< ExprContext * > &  output_exprs 
)

Definition at line 49 of file hdfs-avro-table-writer.cc.

References mem_pool_, and impala::HdfsTableSink::mem_tracker().

virtual impala::HdfsAvroTableWriter::~HdfsAvroTableWriter ( )
inline virtual

Definition at line 64 of file hdfs-avro-table-writer.h.

Member Function Documentation

Status HdfsAvroTableWriter::AppendRowBatch ( RowBatch * rows,
const std::vector< int32_t > &  row_group_indices,
bool * new_file 
)
virtual

virtual void impala::HdfsAvroTableWriter::Close ( )
inline virtual

Called once when this writer should clean up any resources.

Implements impala::HdfsTableWriter.

Definition at line 69 of file hdfs-avro-table-writer.h.

References mem_pool_.

void HdfsAvroTableWriter::ConsumeRow ( TupleRow * row )
private

virtual uint64_t impala::HdfsAvroTableWriter::default_block_size ( ) const
inline virtual

Default block size to use for this file format. If the file format doesn't care, it should return 0 and the hdfs config default will be used.
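
The fallback rule described above amounts to a single comparison. The helper below is a hypothetical sketch of how a caller might apply it; EffectiveBlockSize and its parameter names are not part of the Impala API.

```cpp
#include <cstdint>

// Hypothetical helper: a writer that returns 0 from default_block_size()
// has no preference, so the HDFS configuration default is used instead.
uint64_t EffectiveBlockSize(uint64_t writer_default_block_size,
                            uint64_t hdfs_config_default) {
  return writer_default_block_size != 0 ? writer_default_block_size
                                        : hdfs_config_default;
}
```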

Implements impala::HdfsTableWriter.

Definition at line 70 of file hdfs-avro-table-writer.h.

virtual std::string impala::HdfsAvroTableWriter::file_extension ( ) const
inline virtual

Returns the file extension for this writer.

Implements impala::HdfsTableWriter.

Definition at line 71 of file hdfs-avro-table-writer.h.

virtual Status impala::HdfsAvroTableWriter::Finalize ( )
inline virtual

Finalize this partition. The writer must finish processing so that all data has been written out by the time this call returns. This is called once for each call to InitNewFile().

Implements impala::HdfsTableWriter.

Definition at line 67 of file hdfs-avro-table-writer.h.

References Flush().

Status HdfsAvroTableWriter::Init ( )
virtual

Do initialization of writer.

The sequence of calls to this object is:

  1. Init()
  2. InitNewFile()
  3. AppendRowBatch() - called repeatedly
  4. Finalize()

For file formats that are splittable (and therefore can be written to an arbitrarily large file), steps 1-4 are performed once. For file formats that are not splittable (i.e. columnar formats, compressed text), step 1 is performed once and steps 2-4 are repeated for each file.
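
The call sequence above can be traced with a small stand-in. RecordingWriter and DriveWriter are hypothetical names introduced for illustration and do not mirror the real HdfsTableWriter or HdfsTableSink classes. The sketch assumes that the caller reacts to the new_file output parameter by finalizing the current file and starting the next one; the per-file batch limit is made up to trigger that path.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a table writer. It only records which lifecycle
// method was called, so the resulting order can be inspected.
class RecordingWriter {
 public:
  // batches_per_file == 0 models a splittable format that never rolls over.
  explicit RecordingWriter(int batches_per_file)
      : batches_per_file_(batches_per_file) {}

  void Init() { calls_.push_back("Init"); }
  void InitNewFile() {
    calls_.push_back("InitNewFile");
    batches_in_file_ = 0;
  }
  // Sets *new_file once the (made-up) per-file batch limit is reached.
  void AppendRowBatch(bool* new_file) {
    calls_.push_back("AppendRowBatch");
    ++batches_in_file_;
    *new_file = batches_per_file_ > 0 && batches_in_file_ == batches_per_file_;
  }
  void Finalize() { calls_.push_back("Finalize"); }

  const std::vector<std::string>& calls() const { return calls_; }

 private:
  int batches_per_file_;
  int batches_in_file_ = 0;
  std::vector<std::string> calls_;
};

// Drives the writer through the documented sequence and returns the call log.
std::vector<std::string> DriveWriter(int total_batches, int batches_per_file) {
  RecordingWriter writer(batches_per_file);
  writer.Init();         // step 1: once per writer
  writer.InitNewFile();  // step 2: once per file
  for (int i = 0; i < total_batches; ++i) {
    bool new_file = false;
    writer.AppendRowBatch(&new_file);  // step 3: repeated
    if (new_file && i + 1 < total_batches) {
      writer.Finalize();     // step 4 for the file that just filled up
      writer.InitNewFile();  // back to step 2 for the next file
    }
  }
  writer.Finalize();  // step 4 for the last file
  return writer.calls();
}
```

With three batches and no rollover, the log contains one Init, one InitNewFile, three AppendRowBatch entries and one Finalize. With a limit of two batches per file, a Finalize and InitNewFile pair appears between the second and third batch, so each InitNewFile is matched by exactly one Finalize.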

Implements impala::HdfsTableWriter.

Definition at line 135 of file hdfs-avro-table-writer.cc.

References AVRO_DEFAULT_CODEC, codec_name_, codec_type_, compressor_, impala::Codec::CreateCompressor(), impala::GenerateUUIDString(), mem_pool_, impala::name, impala::Status::OK, impala::RuntimeState::query_options(), RETURN_IF_ERROR, impala::HdfsTableWriter::state_, and sync_marker_.

virtual Status impala::HdfsAvroTableWriter::InitNewFile ( )
inline virtual

Called when a new file is started.

Implements impala::HdfsTableWriter.

Definition at line 68 of file hdfs-avro-table-writer.h.

References WriteFileHeader().

TInsertStats& impala::HdfsTableWriter::stats ( )
inline inherited

Returns the stats for this writer.

Definition at line 86 of file hdfs-table-writer.h.

References impala::HdfsTableWriter::stats_.

template<typename T >
Status impala::HdfsTableWriter::Write ( T v )
inline protected inherited

Definition at line 107 of file hdfs-table-writer.h.

References impala::HdfsTableWriter::Write().

Member Data Documentation

std::string impala::HdfsAvroTableWriter::codec_name_
private

Name of codec, only set if codec_type_ != NONE.

Definition at line 104 of file hdfs-avro-table-writer.h.

Referenced by Init(), and WriteFileHeader().

THdfsCompression::type impala::HdfsAvroTableWriter::codec_type_
private

Type of the codec, will be NONE if no compression is used.

Definition at line 107 of file hdfs-avro-table-writer.h.

Referenced by Flush(), and Init().

boost::scoped_ptr<Codec> impala::HdfsAvroTableWriter::compressor_
private

The codec for compressing, only set if codec_type_ != NONE.

Definition at line 110 of file hdfs-avro-table-writer.h.

Referenced by Flush(), and Init().

const int impala::HdfsTableWriter::HDFS_FLUSH_WRITE_SIZE = 50 * 1024
static protected inherited

Size, in bytes, of output to buffer before calling Write() (which calls hdfsWrite), to minimize the overhead of Write().
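
The buffering idea behind this constant can be sketched as follows. BufferedSink is a hypothetical class, not part of Impala: it accumulates bytes in memory and performs the (simulated) expensive write only once the threshold is reached. How each concrete writer applies the constant is not shown on this page, so treat the flush policy here as an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical buffered sink: small appends are batched so that the costly
// write call happens once per flush_size bytes instead of once per append.
class BufferedSink {
 public:
  explicit BufferedSink(size_t flush_size) : flush_size_(flush_size) {}

  void Append(const std::string& data) {
    buffer_ += data;
    if (buffer_.size() >= flush_size_) Flush();
  }

  // Hands the buffered bytes to the underlying write and empties the buffer.
  void Flush() {
    if (buffer_.empty()) return;
    bytes_written_ += buffer_.size();  // stands in for the real write call
    ++write_calls_;
    buffer_.clear();
  }

  int write_calls() const { return write_calls_; }
  size_t bytes_written() const { return bytes_written_; }

 private:
  size_t flush_size_;
  std::string buffer_;
  int write_calls_ = 0;
  size_t bytes_written_ = 0;
};
```

With a threshold of 50 * 1024 bytes, fifty appends of 1 KB each result in a single write; a final explicit Flush() pushes out whatever remains below the threshold.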

Definition at line 98 of file hdfs-table-writer.h.

Referenced by impala::HdfsTextTableWriter::HdfsTextTableWriter(), and impala::HdfsTextTableWriter::Init().

boost::scoped_ptr<MemPool> impala::HdfsAvroTableWriter::mem_pool_
private

Memory pool used by codec to allocate output buffer. Owned by this class. Initialized using parent's memtracker.

Definition at line 98 of file hdfs-avro-table-writer.h.

Referenced by Close(), HdfsAvroTableWriter(), and Init().

WriteStream impala::HdfsAvroTableWriter::out_
private

Buffer which holds accumulated output.

Definition at line 94 of file hdfs-avro-table-writer.h.

Referenced by AppendField(), AppendRowBatch(), Flush(), and WriteFileHeader().

OutputPartition* impala::HdfsTableWriter::output_
protected inherited

TInsertStats impala::HdfsTableWriter::stats_
protected inherited

Subclass should populate any file format specific stats.

Definition at line 127 of file hdfs-table-writer.h.

Referenced by impala::HdfsParquetTableWriter::Finalize(), impala::HdfsTableWriter::stats(), and impala::HdfsTableWriter::Write().

std::string impala::HdfsAvroTableWriter::sync_marker_
private

16-byte sync marker (a UUID).

Definition at line 113 of file hdfs-avro-table-writer.h.

Referenced by Flush(), Init(), and WriteFileHeader().

uint64_t impala::HdfsAvroTableWriter::unflushed_rows_
private

Number of rows consumed since last flush.

Definition at line 101 of file hdfs-avro-table-writer.h.

Referenced by ConsumeRow(), and Flush().


The documentation for this class was generated from the following files: