Impala
Impala is the open source, native analytic database for Apache Hadoop.
impala::HdfsTableWriter Class Reference (abstract)

#include <hdfs-table-writer.h>

Inheritance diagram for impala::HdfsTableWriter:
Collaboration diagram for impala::HdfsTableWriter:

Public Member Functions

 HdfsTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output_partition, const HdfsPartitionDescriptor *partition_desc, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_expr_ctxs)
 
virtual ~HdfsTableWriter ()
 
virtual Status Init ()=0
 Do initialization of writer.
 
virtual Status InitNewFile ()=0
 Called when a new file is started.
 
virtual Status AppendRowBatch (RowBatch *batch, const std::vector< int32_t > &row_group_indices, bool *new_file)=0
 
virtual Status Finalize ()=0
 
virtual void Close ()=0
 Called once when this writer should clean up any resources.
 
TInsertStats & stats ()
 Returns the stats for this writer.
 
virtual uint64_t default_block_size () const =0
 
virtual std::string file_extension () const =0
 Returns the file extension for this writer.
 

Protected Member Functions

Status Write (const char *data, int32_t len)
 Write to the current hdfs file.
 
Status Write (const uint8_t *data, int32_t len)
 
template<typename T >
Status Write (T v)
 

Protected Attributes

HdfsTableSink * parent_
 Parent table sink object.
 
RuntimeState * state_
 Runtime state.
 
OutputPartition * output_
 Structure describing partition written to by this writer.
 
const HdfsTableDescriptor * table_desc_
 Table descriptor of table to be written.
 
std::vector< ExprContext * > output_expr_ctxs_
 Expressions that materialize output values.
 
TInsertStats stats_
 Subclass should populate any file format specific stats.
 

Static Protected Attributes

static const int HDFS_FLUSH_WRITE_SIZE = 50 * 1024
 

Detailed Description

Pure virtual class for writing to hdfs table partition files. Subclasses implement the code needed to write to a specific file type. A subclass needs to implement functions to format and add rows to the file and to do whatever processing is needed prior to closing the file.

Definition at line 33 of file hdfs-table-writer.h.

Constructor & Destructor Documentation

impala::HdfsTableWriter::HdfsTableWriter ( HdfsTableSink * parent,
RuntimeState * state,
OutputPartition * output_partition,
const HdfsPartitionDescriptor * partition_desc,
const HdfsTableDescriptor * table_desc,
const std::vector< ExprContext * > & output_expr_ctxs
)

The implementation of a writer may reference the parameters to the constructor during the lifetime of the object.

output_partition - Information on the output partition file.
partition_desc - The descriptor for the partition being written.
table_desc - The descriptor for the table being written.
output_expr_ctxs - Expressions which generate the output values.

Definition at line 21 of file hdfs-table-writer.cc.

References impala::HdfsTableSink::DebugString(), impala::TableDescriptor::num_clustering_cols(), impala::TableDescriptor::num_cols(), output_expr_ctxs_, parent_, and table_desc_.

virtual impala::HdfsTableWriter::~HdfsTableWriter ( )
inline virtual

Definition at line 47 of file hdfs-table-writer.h.

Member Function Documentation

virtual Status impala::HdfsTableWriter::AppendRowBatch ( RowBatch * batch,
const std::vector< int32_t > & row_group_indices,
bool * new_file
)
pure virtual

Appends the current batch of rows to the partition. If there are multiple partitions, row_group_indices contains the indices of the rows that belong to this partition; otherwise all rows in the batch are appended. If the current file is full, the writer stops appending and returns with *new_file == true. A new file will then be opened and the same row batch will be passed again, so the writer must track how much of the batch it had already processed before asking for a new file. Otherwise the writer returns with *new_file == false.

Implemented in impala::HdfsAvroTableWriter, impala::HdfsParquetTableWriter, impala::HdfsTextTableWriter, and impala::HdfsSequenceTableWriter.
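
The new_file handshake described above can be sketched as follows. This is a minimal illustration and not the Impala implementation: SketchWriter, WriteBatch, max_rows_per_file and the use of a plain vector as the row batch are all hypothetical stand-ins, row_group_indices is omitted, and a "file" is considered full after a fixed number of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a table writer; not the Impala class.
class SketchWriter {
 public:
  explicit SketchWriter(std::size_t max_rows_per_file)
      : max_rows_per_file_(max_rows_per_file) {
    assert(max_rows_per_file > 0);
  }

  // Starts a new (simulated) file.
  void InitNewFile() {
    rows_in_file_ = 0;
    ++files_opened_;
  }

  // Appends rows from 'batch'. If the current file fills up before the batch
  // is consumed, sets *new_file to true and returns; the caller is expected
  // to open a new file and pass the same batch again.
  void AppendRowBatch(const std::vector<int32_t>& batch, bool* new_file) {
    *new_file = false;
    while (next_row_ < batch.size()) {
      if (rows_in_file_ == max_rows_per_file_) {
        *new_file = true;  // next_row_ remembers where to resume.
        return;
      }
      written_.push_back(batch[next_row_]);
      ++next_row_;
      ++rows_in_file_;
    }
    next_row_ = 0;  // Batch fully consumed; start at row 0 of the next batch.
  }

  const std::vector<int32_t>& written() const { return written_; }
  int files_opened() const { return files_opened_; }

 private:
  const std::size_t max_rows_per_file_;
  std::size_t rows_in_file_ = 0;
  std::size_t next_row_ = 0;  // Progress within the batch being appended.
  int files_opened_ = 0;
  std::vector<int32_t> written_;
};

// Caller-side loop: re-submits the same batch until the writer stops asking
// for a new file.
inline void WriteBatch(SketchWriter* writer, const std::vector<int32_t>& batch) {
  bool new_file = false;
  do {
    writer->AppendRowBatch(batch, &new_file);
    if (new_file) writer->InitNewFile();
  } while (new_file);
}
```

The point of the sketch is the next_row_ counter: because the same batch is passed again after a new file is opened, the writer has to remember its position or it would write the leading rows twice.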

virtual void impala::HdfsTableWriter::Close ( )
pure virtual

Called once when this writer should clean up any resources.

Implemented in impala::HdfsParquetTableWriter, impala::HdfsAvroTableWriter, impala::HdfsTextTableWriter, and impala::HdfsSequenceTableWriter.

virtual uint64_t impala::HdfsTableWriter::default_block_size ( ) const
pure virtual

Default block size to use for this file format. If the file format doesn't care, it should return 0 and the hdfs config default will be used.

Implemented in impala::HdfsParquetTableWriter, impala::HdfsAvroTableWriter, impala::HdfsTextTableWriter, and impala::HdfsSequenceTableWriter.
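
The "return 0 to use the hdfs config default" convention amounts to a simple fallback on the caller's side. The helper below is a hypothetical sketch of that rule; ResolveBlockSize and its parameter names are not part of Impala.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: picks the block size for a new file.
// 'writer_default' is what a writer's default_block_size() returned;
// 'hdfs_config_default' is the value taken from the HDFS configuration.
inline uint64_t ResolveBlockSize(uint64_t writer_default,
                                 uint64_t hdfs_config_default) {
  // A return value of 0 means the file format has no preference.
  return writer_default != 0 ? writer_default : hdfs_config_default;
}
```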

virtual std::string impala::HdfsTableWriter::file_extension ( ) const
pure virtual

Returns the file extension for this writer.

virtual Status impala::HdfsTableWriter::Finalize ( )
pure virtual

Finalize this partition. By the time this call returns, the writer must have finished processing all data and written it out. This is called once for each call to InitNewFile().

Implemented in impala::HdfsParquetTableWriter, impala::HdfsAvroTableWriter, impala::HdfsTextTableWriter, and impala::HdfsSequenceTableWriter.

virtual Status impala::HdfsTableWriter::Init ( )
pure virtual

Do initialization of writer.

The sequence of calls to this object is:

  1. Init()
  2. InitNewFile()
  3. AppendRowBatch() - called repeatedly
  4. Finalize()

For file formats that are splittable (and can therefore be written to an arbitrarily large file), the sequence 1-4 runs once. For file formats that are not splittable (e.g. columnar formats, compressed text), step 1 runs once and steps 2-4 are repeated for each file.

Implemented in impala::HdfsAvroTableWriter, impala::HdfsParquetTableWriter, impala::HdfsTextTableWriter, and impala::HdfsSequenceTableWriter.
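
The call sequence can be made concrete with a small driver. The sketch below is illustrative only: WriterLifecycle, RecordingWriter and DriveWriter are hypothetical names, bool stands in for Status, an int stands in for a row batch, and a file is treated as full after a fixed number of batches.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified interface that mirrors the method names above.
class WriterLifecycle {
 public:
  virtual ~WriterLifecycle() = default;
  virtual bool Init() = 0;
  virtual bool InitNewFile() = 0;
  virtual bool AppendRowBatch(int batch_id, bool* new_file) = 0;
  virtual bool Finalize() = 0;
};

// Records each call so the resulting order can be inspected.
class RecordingWriter : public WriterLifecycle {
 public:
  explicit RecordingWriter(int batches_per_file)
      : batches_per_file_(batches_per_file) {
    assert(batches_per_file > 0);
  }
  bool Init() override { log_.push_back("Init"); return true; }
  bool InitNewFile() override {
    log_.push_back("InitNewFile");
    batches_in_file_ = 0;
    return true;
  }
  bool AppendRowBatch(int /*batch_id*/, bool* new_file) override {
    if (batches_in_file_ == batches_per_file_) {
      *new_file = true;  // File is full; nothing was appended.
      return true;
    }
    *new_file = false;
    ++batches_in_file_;
    log_.push_back("AppendRowBatch");
    return true;
  }
  bool Finalize() override { log_.push_back("Finalize"); return true; }
  const std::vector<std::string>& log() const { return log_; }

 private:
  const int batches_per_file_;
  int batches_in_file_ = 0;
  std::vector<std::string> log_;
};

// Drives a writer through the documented sequence: step 1 once, then steps
// 2-4 once per file.
inline bool DriveWriter(WriterLifecycle* w, int num_batches) {
  if (!w->Init()) return false;
  if (!w->InitNewFile()) return false;
  for (int i = 0; i < num_batches; ++i) {
    bool new_file = false;
    if (!w->AppendRowBatch(i, &new_file)) return false;
    if (new_file) {
      // Close out the full file, open the next one, and retry the batch.
      if (!w->Finalize()) return false;
      if (!w->InitNewFile()) return false;
      if (!w->AppendRowBatch(i, &new_file)) return false;
    }
  }
  return w->Finalize();
}
```

With two batches per file and three batches in total, the recorded order is Init, InitNewFile, AppendRowBatch, AppendRowBatch, Finalize, InitNewFile, AppendRowBatch, Finalize, so every InitNewFile() is paired with exactly one Finalize().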

virtual Status impala::HdfsTableWriter::InitNewFile ( )
pure virtual

Called when a new file is started.

TInsertStats& impala::HdfsTableWriter::stats ( )
inline

Returns the stats for this writer.

Definition at line 86 of file hdfs-table-writer.h.

References stats_.

template<typename T >
Status impala::HdfsTableWriter::Write ( T v )
inline protected

Definition at line 107 of file hdfs-table-writer.h.

References Write().
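
This page only states that the templated overload references Write(). One plausible reading, shown here purely as an assumption, is that it forwards the raw bytes of the value to the byte-pointer overload. ByteSink is a hypothetical stand-in that collects bytes in memory instead of writing to HDFS and returns void instead of Status.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical in-memory sink; not the Impala class.
class ByteSink {
 public:
  // Appends 'len' bytes starting at 'data'.
  void Write(const uint8_t* data, int32_t len) {
    assert(len >= 0);
    bytes_.insert(bytes_.end(), data, data + len);
  }

  // Assumed behaviour: write the in-memory representation of 'v' by
  // forwarding to the byte-pointer overload.
  template <typename T>
  void Write(T v) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable values can be written byte-wise");
    Write(reinterpret_cast<const uint8_t*>(&v),
          static_cast<int32_t>(sizeof(T)));
  }

  const std::vector<uint8_t>& bytes() const { return bytes_; }

 private:
  std::vector<uint8_t> bytes_;
};
```

Note that a byte-wise write of this kind emits the value in host byte order, so a file format that mandates a particular endianness would need to convert first.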

Member Data Documentation

const int impala::HdfsTableWriter::HDFS_FLUSH_WRITE_SIZE = 50 * 1024
static protected

Number of bytes of output to buffer before calling Write() (which calls hdfsWrite). Buffering minimizes the per-call overhead of Write().

Definition at line 98 of file hdfs-table-writer.h.

Referenced by impala::HdfsTextTableWriter::HdfsTextTableWriter(), and impala::HdfsTextTableWriter::Init().
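
The buffering pattern this constant supports can be sketched as below. BufferedSink is a hypothetical class that only counts flushes rather than calling hdfsWrite; the threshold value is the one documented above, and flushing once the buffer reaches (rather than exceeds) the threshold is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sink that batches small appends into larger writes.
class BufferedSink {
 public:
  // Same value as HDFS_FLUSH_WRITE_SIZE documented above.
  static constexpr std::size_t kFlushSize = 50 * 1024;

  // Buffers 'row' and flushes once the buffer reaches the threshold.
  void Append(const std::string& row) {
    buffer_ += row;
    if (buffer_.size() >= kFlushSize) Flush();
  }

  // Stands in for a single expensive Write() call to the file system.
  void Flush() {
    if (buffer_.empty()) return;
    bytes_written_ += buffer_.size();
    ++write_calls_;
    buffer_.clear();
  }

  std::size_t bytes_written() const { return bytes_written_; }
  int write_calls() const { return write_calls_; }

 private:
  std::string buffer_;
  std::size_t bytes_written_ = 0;
  int write_calls_ = 0;
};
```

A caller still has to invoke Flush() once more before closing the file so that any bytes below the threshold are not lost.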

OutputPartition* impala::HdfsTableWriter::output_
protected

Structure describing partition written to by this writer.

Definition at line 118 of file hdfs-table-writer.h.

Referenced by impala::HdfsTextTableWriter::AppendRowBatch(), impala::HdfsParquetTableWriter::AppendRowBatch(), impala::HdfsParquetTableWriter::InitNewFile(), and Write().

TInsertStats impala::HdfsTableWriter::stats_
protected

Subclass should populate any file format specific stats.

Definition at line 127 of file hdfs-table-writer.h.

Referenced by impala::HdfsParquetTableWriter::Finalize(), stats(), and Write().


The documentation for this class was generated from the following files:

  hdfs-table-writer.h
  hdfs-table-writer.cc