Impala
Impala is the open source, native analytic database for Apache Hadoop.
impala::HdfsSequenceTableWriter Class Reference

#include <hdfs-sequence-table-writer.h>


Public Member Functions

 HdfsSequenceTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output, const HdfsPartitionDescriptor *partition, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_exprs)
 
 ~HdfsSequenceTableWriter ()
 
virtual Status Init ()
 Do initialization of writer.
 
virtual Status Finalize ()
 
virtual Status InitNewFile ()
 Called when a new file is started.
 
virtual void Close ()
 Called once when this writer should clean up any resources.
 
virtual uint64_t default_block_size () const
 
virtual std::string file_extension () const
 Returns the file extension for this writer.
 
virtual Status AppendRowBatch (RowBatch *rows, const std::vector< int32_t > &row_group_indices, bool *new_file)
 
TInsertStats & stats ()
 Returns the stats for this writer.
 

Protected Member Functions

Status Write (const char *data, int32_t len)
 Write to the current hdfs file.
 
Status Write (const uint8_t *data, int32_t len)
 
template<typename T >
Status Write (T v)
 

Protected Attributes

HdfsTableSink * parent_
 Parent table sink object.
 
RuntimeState * state_
 Runtime state.
 
OutputPartition * output_
 Structure describing partition written to by this writer.
 
const HdfsTableDescriptor * table_desc_
 Table descriptor of table to be written.
 
std::vector< ExprContext * > output_expr_ctxs_
 Expressions that materialize output values.
 
TInsertStats stats_
 Subclass should populate any file format specific stats.
 

Static Protected Attributes

static const int HDFS_FLUSH_WRITE_SIZE = 50 * 1024
 

Private Member Functions

Status ConsumeRow (TupleRow *row)
 processes a single row, delegates to Compress or NoCompress ConsumeRow().
 
Status WriteFileHeader ()
 writes the SEQ file header to HDFS
 
Status WriteCompressedBlock ()
 writes the contents of out_ as a single compressed block
 
void EncodeRow (TupleRow *row, WriteStream *buf)
 
void WriteEscapedString (const StringValue *str_val, WriteStream *buf)
 writes the str_val to the buffer, escaping special characters
 
Status Flush ()
 

Private Attributes

uint64_t approx_block_size_
 
WriteStream out_
 buffer which holds accumulated output
 
WriteStream row_buf_
 Temporary Buffer for a single row.
 
MemPool * mem_pool_
 memory pool used by codec to allocate output buffer
 
bool compress_flag_
 true if compression is enabled
 
uint64_t unflushed_rows_
 number of rows consumed since last flush
 
std::string codec_name_
 name of codec, only set if compress_flag_
 
boost::scoped_ptr< Codec > compressor_
 the codec for compressing, only set if compress_flag_
 
bool record_compression_
 true if compression is applied on each record individually
 
char field_delim_
 Character delimiting fields.
 
char escape_char_
 Escape character for text encoding.
 
std::string sync_marker_
 16 byte sync marker (a uuid)
 
std::string neg1_sync_marker_
 A -1 in front of the sync marker, used in decompressed formats.
 

Static Private Attributes

static const char * VALUE_CLASS_NAME = "org.apache.hadoop.io.Text"
 Name of java class to use when reading the values.
 
static uint8_t SEQ6_CODE [4] = {'S', 'E', 'Q', 6}
 Magic characters used to identify the file type.
 

Detailed Description

Consumes rows and writes them to a sequence file in HDFS. Output is buffered to fill sequence file blocks.

Definition at line 38 of file hdfs-sequence-table-writer.h.

Constructor & Destructor Documentation

impala::HdfsSequenceTableWriter::HdfsSequenceTableWriter ( HdfsTableSink * parent,
RuntimeState * state,
OutputPartition * output,
const HdfsPartitionDescriptor * partition,
const HdfsTableDescriptor * table_desc,
const std::vector< ExprContext * > & output_exprs 
)
impala::HdfsSequenceTableWriter::~HdfsSequenceTableWriter ( )
inline

Definition at line 46 of file hdfs-sequence-table-writer.h.

Member Function Documentation

virtual void impala::HdfsSequenceTableWriter::Close ( )
inline virtual

Called once when this writer should clean up any resources.

Implements impala::HdfsTableWriter.

Definition at line 51 of file hdfs-sequence-table-writer.h.

virtual uint64_t impala::HdfsSequenceTableWriter::default_block_size ( ) const
inline virtual

Default block size to use for this file format. If the file format doesn't care, it should return 0 and the hdfs config default will be used.

Implements impala::HdfsTableWriter.

Definition at line 52 of file hdfs-sequence-table-writer.h.

virtual std::string impala::HdfsSequenceTableWriter::file_extension ( ) const
inline virtual

Returns the file extension for this writer.

Implements impala::HdfsTableWriter.

Definition at line 53 of file hdfs-sequence-table-writer.h.

virtual Status impala::HdfsSequenceTableWriter::Finalize ( )
inline virtual

Finalize this partition. The writer must finish processing all data and have it written out by the time this call returns. This is called once for each call to InitNewFile().

Implements impala::HdfsTableWriter.

Definition at line 49 of file hdfs-sequence-table-writer.h.

References Flush().

Status impala::HdfsSequenceTableWriter::Flush ( )
private

Flushes the output, clearing out_ and writing to HDFS. If compress_flag_ is set, the contents of out_ are written as a single compressed block.

Definition at line 291 of file hdfs-sequence-table-writer.cc.

References impala::WriteStream::Clear(), compress_flag_, impala::HdfsTableSink::hdfs_write_timer(), impala::Status::OK, out_, impala::HdfsTableWriter::parent_, record_compression_, RETURN_IF_ERROR, SCOPED_TIMER, impala::WriteStream::String(), unflushed_rows_, impala::HdfsTableWriter::Write(), and WriteCompressedBlock().

Referenced by AppendRowBatch(), and Finalize().

Status impala::HdfsSequenceTableWriter::Init ( )
virtual

Do initialization of writer.

The sequence of calls to this object is:

  1. Init()
  2. InitNewFile()
  3. AppendRowBatch() - called repeatedly
  4. Finalize()

For file formats that are splittable (and therefore can be written to an arbitrarily large file), steps 1-4 are called once. For file formats that are not splittable (e.g. columnar formats, compressed text), step 1 is called once and steps 2-4 are repeated for each file.
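
The call order above can be illustrated with a small stand-in class. This is only a sketch: `SketchWriter`, `WriteFiles` and the string log are hypothetical names introduced here and are not part of Impala; the real methods return `Status` and `AppendRowBatch()` takes a `RowBatch`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a table writer; it only records the call order.
class SketchWriter {
 public:
  void Init() { log_.push_back("Init"); }
  void InitNewFile() { log_.push_back("InitNewFile"); }
  void AppendRowBatch() { log_.push_back("AppendRowBatch"); }
  void Finalize() { log_.push_back("Finalize"); }
  const std::vector<std::string>& log() const { return log_; }

 private:
  std::vector<std::string> log_;
};

// Drives a writer the way a non-splittable format would be driven:
// Init() once, then InitNewFile(), repeated AppendRowBatch() and
// Finalize() once per file.
inline std::vector<std::string> WriteFiles(int num_files, int batches_per_file) {
  SketchWriter writer;
  writer.Init();
  for (int f = 0; f < num_files; ++f) {
    writer.InitNewFile();
    for (int b = 0; b < batches_per_file; ++b) writer.AppendRowBatch();
    writer.Finalize();
  }
  return writer.log();
}
```

Calling `WriteFiles(1, n)` corresponds to the splittable case, where the whole sequence runs once.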

Implements impala::HdfsTableWriter.

Definition at line 54 of file hdfs-sequence-table-writer.cc.

References codec_name_, compress_flag_, compressor_, impala::Codec::CreateCompressor(), impala::GenerateUUIDString(), impala::Codec::GetHadoopCodecClassName(), mem_pool_, neg1_sync_marker_, impala::Status::OK, impala::ReadWriteUtil::PutInt(), impala::RuntimeState::query_options(), record_compression_, RETURN_IF_ERROR, impala::HdfsTableWriter::state_, and sync_marker_.

virtual Status impala::HdfsSequenceTableWriter::InitNewFile ( )
inlinevirtual

Called when a new file is started.

Implements impala::HdfsTableWriter.

Definition at line 50 of file hdfs-sequence-table-writer.h.

References WriteFileHeader().

TInsertStats& impala::HdfsTableWriter::stats ( )
inline inherited

Returns the stats for this writer.

Definition at line 86 of file hdfs-table-writer.h.

References impala::HdfsTableWriter::stats_.

template<typename T >
Status impala::HdfsTableWriter::Write ( T v )
inline protected inherited

Definition at line 107 of file hdfs-table-writer.h.

References impala::HdfsTableWriter::Write().
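
Because the template overload references HdfsTableWriter::Write(), it presumably forwards to one of the byte-oriented overloads. A minimal sketch of that pattern follows, assuming the value is written as its raw in-memory bytes (host byte order). `ByteSink` is a hypothetical class used only for illustration; it is not Impala's implementation and returns `void` instead of `Status`.

```cpp
#include <cstdint>
#include <string>

// Hypothetical byte sink standing in for the HDFS file; not Impala's class.
class ByteSink {
 public:
  // Byte-oriented write that the typed overload funnels into.
  void Write(const uint8_t* data, int32_t len) {
    bytes_.append(reinterpret_cast<const char*>(data), len);
  }

  // Writes the in-memory representation of a trivially copyable value.
  template <typename T>
  void Write(T v) {
    Write(reinterpret_cast<const uint8_t*>(&v), static_cast<int32_t>(sizeof(T)));
  }

  const std::string& bytes() const { return bytes_; }

 private:
  std::string bytes_;
};
```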

void impala::HdfsSequenceTableWriter::WriteEscapedString ( const StringValue * str_val,
WriteStream * buf 
)
inline private

writes the str_val to the buffer, escaping special characters

Definition at line 196 of file hdfs-sequence-table-writer.cc.

References escape_char_, field_delim_, impala::StringValue::len, impala::StringValue::ptr, and impala::WriteStream::WriteByte().

Referenced by EncodeRow().
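
The escaping step can be sketched as follows. The set of special characters is an assumption drawn from the members referenced above (field_delim_ and escape_char_); `AppendEscaped` is a hypothetical helper that uses `std::string` in place of `StringValue` and `WriteStream`.

```cpp
#include <string>

// Appends `input` to `out`, prefixing every occurrence of the field delimiter
// or of the escape character itself with the escape character.
inline void AppendEscaped(const std::string& input, char field_delim,
                          char escape_char, std::string* out) {
  for (char c : input) {
    if (c == field_delim || c == escape_char) out->push_back(escape_char);
    out->push_back(c);
  }
}
```

With `,` as the delimiter and `\` as the escape character, the input `a,b\c` becomes `a\,b\\c`.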

Member Data Documentation

uint64_t impala::HdfsSequenceTableWriter::approx_block_size_
private

desired size of each block (bytes); actual block size will vary +/- the size of a row; this is before compression is applied.

Definition at line 84 of file hdfs-sequence-table-writer.h.

Referenced by AppendRowBatch(), and HdfsSequenceTableWriter().
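
A minimal sketch of size-based block buffering, assuming rows are appended to a buffer that is flushed once it reaches the target size. `BlockBuffer` and its in-memory list of blocks are hypothetical and stand in for out_ and the HDFS write; the real flush condition in AppendRowBatch() may differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical buffer that groups encoded rows into blocks of roughly
// `approx_block_size` bytes, measured before any compression.
class BlockBuffer {
 public:
  explicit BlockBuffer(size_t approx_block_size)
      : approx_block_size_(approx_block_size) {}

  // Appends one encoded row and flushes once the threshold is reached, so a
  // block exceeds the target by less than the size of one row.
  void ConsumeRow(const std::string& encoded_row) {
    out_ += encoded_row;
    if (out_.size() >= approx_block_size_) Flush();
  }

  // Emits whatever is buffered as one block and clears the buffer.
  void Flush() {
    if (out_.empty()) return;
    blocks_.push_back(out_);
    out_.clear();
  }

  const std::vector<std::string>& blocks() const { return blocks_; }

 private:
  size_t approx_block_size_;
  std::string out_;
  std::vector<std::string> blocks_;
};
```

A final `Flush()` is still needed after the last row, which matches Finalize() referencing Flush().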

std::string impala::HdfsSequenceTableWriter::codec_name_
private

name of codec, only set if compress_flag_

Definition at line 102 of file hdfs-sequence-table-writer.h.

Referenced by Init(), and WriteFileHeader().

bool impala::HdfsSequenceTableWriter::compress_flag_
private

true if compression is enabled

Definition at line 96 of file hdfs-sequence-table-writer.h.

Referenced by AppendRowBatch(), ConsumeRow(), Flush(), Init(), WriteCompressedBlock(), and WriteFileHeader().

boost::scoped_ptr<Codec> impala::HdfsSequenceTableWriter::compressor_
private

the codec for compressing, only set if compress_flag_

Definition at line 104 of file hdfs-sequence-table-writer.h.

Referenced by ConsumeRow(), Init(), and WriteCompressedBlock().

char impala::HdfsSequenceTableWriter::escape_char_
private

Escape character for text encoding.

Definition at line 113 of file hdfs-sequence-table-writer.h.

Referenced by HdfsSequenceTableWriter(), and WriteEscapedString().

char impala::HdfsSequenceTableWriter::field_delim_
private

Character delimiting fields.

Definition at line 110 of file hdfs-sequence-table-writer.h.

Referenced by EncodeRow(), HdfsSequenceTableWriter(), and WriteEscapedString().

const int impala::HdfsTableWriter::HDFS_FLUSH_WRITE_SIZE = 50 * 1024
static protected inherited

Number of bytes of output to buffer before calling Write() (which calls hdfsWrite), chosen to minimize the overhead of Write().

Definition at line 98 of file hdfs-table-writer.h.

Referenced by impala::HdfsTextTableWriter::HdfsTextTableWriter(), and impala::HdfsTextTableWriter::Init().

MemPool* impala::HdfsSequenceTableWriter::mem_pool_
private

memory pool used by codec to allocate output buffer

Definition at line 93 of file hdfs-sequence-table-writer.h.

Referenced by Init().

std::string impala::HdfsSequenceTableWriter::neg1_sync_marker_
private

A -1 in front of the sync marker, used in decompressed formats.

Definition at line 118 of file hdfs-sequence-table-writer.h.

Referenced by AppendRowBatch(), and Init().
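
One way to build such a marker is sketched below. It assumes the -1 is a 4-byte integer, which is suggested by Init() referencing ReadWriteUtil::PutInt() but not stated on this page. `MakeNeg1SyncMarker` is a hypothetical helper, not Impala's code.

```cpp
#include <cstdint>
#include <string>

// Builds a 4-byte two's-complement -1 followed by the sync marker. All four
// bytes of -1 are 0xFF, so the result does not depend on byte order.
inline std::string MakeNeg1SyncMarker(const std::string& sync_marker) {
  std::string result(sizeof(int32_t), static_cast<char>(0xFF));
  result += sync_marker;
  return result;
}
```

For a 16 byte sync marker the result is 20 bytes long.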

WriteStream impala::HdfsSequenceTableWriter::out_
private

buffer which holds accumulated output

Definition at line 87 of file hdfs-sequence-table-writer.h.

Referenced by AppendRowBatch(), ConsumeRow(), Flush(), WriteCompressedBlock(), and WriteFileHeader().

OutputPartition* impala::HdfsTableWriter::output_
protected inherited
std::vector<ExprContext*> impala::HdfsTableWriter::output_expr_ctxs_
protected inherited
bool impala::HdfsSequenceTableWriter::record_compression_
private

true if compression is applied on each record individually

Definition at line 107 of file hdfs-sequence-table-writer.h.

Referenced by ConsumeRow(), Flush(), Init(), and WriteFileHeader().

WriteStream impala::HdfsSequenceTableWriter::row_buf_
private

Temporary Buffer for a single row.

Definition at line 90 of file hdfs-sequence-table-writer.h.

Referenced by ConsumeRow(), and EncodeRow().

uint8_t impala::HdfsSequenceTableWriter::SEQ6_CODE = {'S', 'E', 'Q', 6}
static private

Magic characters used to identify the file type.

Definition at line 123 of file hdfs-sequence-table-writer.h.

Referenced by WriteFileHeader().
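
As an illustration of how these magic bytes identify a file, the sketch below checks whether a buffer starts with them. `kSeq6Code` mirrors the value of SEQ6_CODE shown above; `StartsWithSeq6` is a hypothetical reader-side helper and is not part of this class.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Same four bytes as SEQ6_CODE: the ASCII letters "SEQ" and the version 6.
static const uint8_t kSeq6Code[4] = {'S', 'E', 'Q', 6};

// Returns true if `file_bytes` begins with the SEQ6 magic.
inline bool StartsWithSeq6(const std::string& file_bytes) {
  return file_bytes.size() >= sizeof(kSeq6Code) &&
         std::memcmp(file_bytes.data(), kSeq6Code, sizeof(kSeq6Code)) == 0;
}
```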

TInsertStats impala::HdfsTableWriter::stats_
protected inherited

Subclass should populate any file format specific stats.

Definition at line 127 of file hdfs-table-writer.h.

Referenced by impala::HdfsParquetTableWriter::Finalize(), impala::HdfsTableWriter::stats(), and impala::HdfsTableWriter::Write().

std::string impala::HdfsSequenceTableWriter::sync_marker_
private

16 byte sync marker (a uuid)

Definition at line 116 of file hdfs-sequence-table-writer.h.

Referenced by Init(), WriteCompressedBlock(), and WriteFileHeader().

uint64_t impala::HdfsSequenceTableWriter::unflushed_rows_
private

number of rows consumed since last flush

Definition at line 99 of file hdfs-sequence-table-writer.h.

Referenced by ConsumeRow(), Flush(), and WriteCompressedBlock().

const char * impala::HdfsSequenceTableWriter::VALUE_CLASS_NAME = "org.apache.hadoop.io.Text"
static private

Name of java class to use when reading the values.

Definition at line 121 of file hdfs-sequence-table-writer.h.

Referenced by WriteFileHeader().


The documentation for this class was generated from the following files: