Impala
Impala is the open source, native analytic database for Apache Hadoop.
#include <hdfs-avro-table-writer.h>
Public Member Functions
HdfsAvroTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output, const HdfsPartitionDescriptor *partition, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_exprs)
virtual ~HdfsAvroTableWriter ()
virtual Status Init ()
    Do initialization of writer.
virtual Status Finalize ()
virtual Status InitNewFile ()
    Called when a new file is started.
virtual void Close ()
    Called once when this writer should clean up any resources.
virtual uint64_t default_block_size () const
virtual std::string file_extension () const
    Returns the file extension for this writer.
virtual Status AppendRowBatch (RowBatch *rows, const std::vector< int32_t > &row_group_indices, bool *new_file)
TInsertStats & stats ()
    Returns the stats for this writer.
Protected Member Functions
Status Write (const char *data, int32_t len)
    Write to the current hdfs file.
Status Write (const uint8_t *data, int32_t len)
template<typename T >
Status Write (T v)
Protected Attributes
HdfsTableSink * parent_
    Parent table sink object.
RuntimeState * state_
    Runtime state.
OutputPartition * output_
    Structure describing partition written to by this writer.
const HdfsTableDescriptor * table_desc_
    Table descriptor of table to be written.
std::vector< ExprContext * > output_expr_ctxs_
    Expressions that materialize output values.
TInsertStats stats_
    Subclass should populate any file format specific stats.
Static Protected Attributes
static const int HDFS_FLUSH_WRITE_SIZE = 50 * 1024
Private Member Functions
void ConsumeRow (TupleRow *row)
    Processes a single row, appending to out_.
void AppendField (const ColumnType &type, const void *value)
    Adds an encoded field to out_.
Status WriteFileHeader ()
    Writes the Avro file header to HDFS.
Status Flush ()
Private Attributes
WriteStream out_
    Buffer which holds accumulated output.
boost::scoped_ptr< MemPool > mem_pool_
uint64_t unflushed_rows_
    Number of rows consumed since last flush.
std::string codec_name_
    Name of codec, only set if codec_type_ != NONE.
THdfsCompression::type codec_type_
    Type of the codec, will be NONE if no compression is used.
boost::scoped_ptr< Codec > compressor_
    The codec for compressing, only set if codec_type_ != NONE.
std::string sync_marker_
    16-byte sync marker (a UUID).
Consumes rows and outputs the rows into an Avro file in HDFS. Each Avro file contains a block of records (rows). The file metadata specifies the schema of the records in addition to the name of the codec, if any, used to compress blocks.
The structure is:
    [ Metadata ]
    [ Sync Marker ]
    [ Data Block ]
    ...
    [ Data Block ]
Each Data Block consists of:
    [ Number of Rows in Block ]
    [ Size of serialized objects, after compression ]
    [ Serialized objects, compressed ]
    [ Sync Marker ]
If compression is used, each block is compressed individually. The block size defaults to about 64KB before compression.
This writer implements the Avro 1.7.7 spec: http://avro.apache.org/docs/1.7.7/spec.html
Definition at line 56 of file hdfs-avro-table-writer.h.
HdfsAvroTableWriter::HdfsAvroTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output, const HdfsPartitionDescriptor *partition, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_exprs)
Definition at line 49 of file hdfs-avro-table-writer.cc.
References mem_pool_, and impala::HdfsTableSink::mem_tracker().
virtual HdfsAvroTableWriter::~HdfsAvroTableWriter () [inline, virtual]
Definition at line 64 of file hdfs-avro-table-writer.h.
void HdfsAvroTableWriter::AppendField (const ColumnType &type, const void *value) [inline, private]
Adds an encoded field to out_.
Definition at line 68 of file hdfs-avro-table-writer.cc.
References impala::BitUtil::ByteSwap(), impala::ColumnType::GetDecimalByteSize(), impala::INVALID_TYPE, impala::StringValue::len, out_, impala::ColumnType::precision, impala::StringValue::ptr, impala::ColumnType::type, impala::TYPE_BIGINT, impala::TYPE_BINARY, impala::TYPE_BOOLEAN, impala::TYPE_DATE, impala::TYPE_DATETIME, impala::TYPE_DECIMAL, impala::TYPE_DOUBLE, impala::TYPE_FLOAT, impala::TYPE_INT, impala::TYPE_NULL, impala::TYPE_SMALLINT, impala::TYPE_STRING, impala::TYPE_TIMESTAMP, impala::TYPE_TINYINT, impala::WriteStream::WriteByte(), impala::WriteStream::WriteBytes(), impala::WriteStream::WriteZInt(), and impala::WriteStream::WriteZLong().
Referenced by ConsumeRow().
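The helpers referenced above (WriteByte, WriteBytes, WriteZInt, WriteZLong) correspond to the primitive binary encodings in the Avro specification. The following is a minimal sketch of those encodings for a few types, not Impala's AppendField. `Buffer`, `AppendZLong`, `AppendBoolean`, `AppendDouble` and `AppendString` are hypothetical names introduced for illustration, and a plain `std::string` stands in for out_.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for the out_ buffer; this is not Impala's WriteStream.
using Buffer = std::string;

// Avro int/long: zig-zag encode, then emit 7 bits per byte, low-order group first.
inline void AppendZLong(Buffer* out, int64_t v) {
  uint64_t n = (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
  while (n >= 0x80) {
    out->push_back(static_cast<char>((n & 0x7F) | 0x80));
    n >>= 7;
  }
  out->push_back(static_cast<char>(n));
}

// Avro boolean: a single byte, 0 or 1.
inline void AppendBoolean(Buffer* out, bool v) {
  out->push_back(static_cast<char>(v ? 1 : 0));
}

// Avro double: 8 bytes, IEEE 754 bit pattern in little-endian order.
inline void AppendDouble(Buffer* out, double v) {
  uint64_t bits;
  static_assert(sizeof(bits) == sizeof(v), "double must be 64 bits");
  std::memcpy(&bits, &v, sizeof(bits));
  for (int i = 0; i < 8; ++i) {
    out->push_back(static_cast<char>((bits >> (8 * i)) & 0xFF));
  }
}

// Avro string/bytes: the length as a long, followed by that many bytes.
inline void AppendString(Buffer* out, const char* ptr, int32_t len) {
  assert(len >= 0);
  AppendZLong(out, len);
  out->append(ptr, static_cast<std::size_t>(len));
}
```

For example, `AppendString(&buf, "foo", 3)` appends the four bytes 06 66 6f 6f, which is the string example given in the Avro specification.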
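virtual Status HdfsAvroTableWriter::AppendRowBatch (RowBatch *rows, const std::vector< int32_t > &row_group_indices, bool *new_file) [virtual]
Outputs the given rows into an HDFS Avro file. The rows are buffered to fill an Avro data block.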
Implements impala::HdfsTableWriter.
Definition at line 168 of file hdfs-avro-table-writer.cc.
References ConsumeRow(), COUNTER_ADD, DEFAULT_AVRO_BLOCK_SIZE, impala::HdfsTableSink::encode_timer(), Flush(), impala::RowBatch::GetRow(), impala::RowBatch::num_rows(), impala::Status::OK, out_, impala::HdfsTableWriter::parent_, impala::HdfsTableSink::rows_inserted_counter(), SCOPED_TIMER, and impala::WriteStream::Size().
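The references above (ConsumeRow(), out_, WriteStream::Size(), DEFAULT_AVRO_BLOCK_SIZE, Flush()) suggest a buffer-then-flush loop. The sketch below illustrates that pattern under stated assumptions: `BufferingWriter` and `kBlockSizeBytes` are hypothetical names, the 64KB threshold is taken from the class description, rows are already-encoded byte strings, and checking the threshold after every row (rather than once per batch) is a choice made here, not something this page confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed threshold, based on the "about 64KB" figure in the class description.
constexpr std::size_t kBlockSizeBytes = 64 * 1024;

// Hypothetical buffer-and-flush writer; rows are pre-encoded byte strings.
struct BufferingWriter {
  std::string out;                               // accumulated encoded rows
  std::size_t unflushed_rows = 0;                // rows appended since last flush
  std::vector<std::size_t> flushed_row_counts;   // one entry per emitted block

  // Emits the buffered rows as one block and resets the buffer.
  void Flush() {
    if (unflushed_rows == 0) return;  // nothing buffered, no empty block
    flushed_row_counts.push_back(unflushed_rows);
    out.clear();
    unflushed_rows = 0;
    assert(out.empty());
  }

  // Appends each row and flushes once the buffer exceeds the threshold.
  void AppendRows(const std::vector<std::string>& rows) {
    for (const std::string& row : rows) {
      out += row;
      ++unflushed_rows;
      if (out.size() > kBlockSizeBytes) Flush();
    }
  }
};
```

A final call to `Flush()` after the last batch writes any rows still in the buffer, which matches the note on this page that Flush() is also referenced by Finalize().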
virtual void HdfsAvroTableWriter::Close () [inline, virtual]
Called once when this writer should clean up any resources.
Implements impala::HdfsTableWriter.
Definition at line 69 of file hdfs-avro-table-writer.h.
References mem_pool_.
void HdfsAvroTableWriter::ConsumeRow (TupleRow *row) [private]
Processes a single row, appending to out_.
Definition at line 58 of file hdfs-avro-table-writer.cc.
References AppendField(), impala::TableDescriptor::num_clustering_cols(), impala::TableDescriptor::num_cols(), impala::HdfsTableWriter::output_expr_ctxs_, impala::HdfsTableWriter::table_desc_, and unflushed_rows_.
Referenced by AppendRowBatch().
virtual uint64_t HdfsAvroTableWriter::default_block_size () const [inline, virtual]
Default block size to use for this file format. If the file format doesn't care, it should return 0 and the hdfs config default will be used.
Implements impala::HdfsTableWriter.
Definition at line 70 of file hdfs-avro-table-writer.h.
virtual std::string HdfsAvroTableWriter::file_extension () const [inline, virtual]
Returns the file extension for this writer.
Implements impala::HdfsTableWriter.
Definition at line 71 of file hdfs-avro-table-writer.h.
virtual Status HdfsAvroTableWriter::Finalize () [inline, virtual]
Finalize this partition. The writer must finish processing all data and have it written out by the time this call returns. This is called once for each call to InitNewFile().
Implements impala::HdfsTableWriter.
Definition at line 67 of file hdfs-avro-table-writer.h.
References Flush().
Status HdfsAvroTableWriter::Flush () [private]
Writes the contents of out_ to HDFS as a single Avro file block. Returns an error if write to HDFS fails.
Definition at line 226 of file hdfs-avro-table-writer.cc.
References impala::WriteStream::Clear(), codec_type_, impala::HdfsTableSink::compress_timer(), compressor_, impala::SnappyCompressor::ComputeChecksum(), impala::HdfsTableSink::hdfs_write_timer(), impala::Status::OK, out_, impala::HdfsTableWriter::parent_, RETURN_IF_ERROR, SCOPED_TIMER, impala::WriteStream::Size(), impala::WriteStream::String(), sync_marker_, unflushed_rows_, impala::HdfsTableWriter::Write(), and impala::WriteStream::WriteZLong().
Referenced by AppendRowBatch(), and Finalize().
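The data block layout in the class description (row count, serialized size, serialized objects, sync marker) can be sketched as follows. This is an illustration based on the Avro specification, not the body of Flush(): `AppendZLong` and `FrameBlock` are hypothetical names, and the payload is assumed to be already serialized and, when a codec is in use, already compressed. Codec-specific trailers, such as the CRC32 checksum the Avro specification requires after each snappy-compressed block, are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Avro long: zig-zag encode, then emit 7 bits per byte, low-order group first.
inline void AppendZLong(std::string* out, int64_t v) {
  uint64_t n = (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
  while (n >= 0x80) {
    out->push_back(static_cast<char>((n & 0x7F) | 0x80));
    n >>= 7;
  }
  out->push_back(static_cast<char>(n));
}

// Frames one data block: [row count][payload size][payload][sync marker].
// `payload` holds the serialized rows (compressed first if a codec is used).
inline std::string FrameBlock(int64_t num_rows, const std::string& payload,
                              const std::string& sync_marker) {
  assert(sync_marker.size() == 16);  // the sync marker is always 16 bytes
  std::string block;
  AppendZLong(&block, num_rows);
  AppendZLong(&block, static_cast<int64_t>(payload.size()));
  block += payload;
  block += sync_marker;
  return block;
}
```

Because the size field counts the bytes that follow it in the file, it is written after compression; a reader can then skip a whole block without decompressing it.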
virtual Status HdfsAvroTableWriter::Init () [virtual]
Do initialization of writer.
The sequence of calls to this object are:
Implements impala::HdfsTableWriter.
Definition at line 135 of file hdfs-avro-table-writer.cc.
References AVRO_DEFAULT_CODEC, codec_name_, codec_type_, compressor_, impala::Codec::CreateCompressor(), impala::GenerateUUIDString(), mem_pool_, impala::name, impala::Status::OK, impala::RuntimeState::query_options(), RETURN_IF_ERROR, impala::HdfsTableWriter::state_, and sync_marker_.
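The list of calls did not survive on this page. From the member descriptions (Init() initializes the writer, InitNewFile() is called when a new file is started, Finalize() is called once for each call to InitNewFile(), and Close() is called once for cleanup), one plausible ordering is sketched below. `RecordingWriter` and `WriteFiles` are hypothetical names, and the exact ordering is an inference from those descriptions rather than a statement of Impala's contract.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in that only records which lifecycle hook was invoked.
struct RecordingWriter {
  std::vector<std::string> calls;
  void Init() { calls.push_back("Init"); }
  void InitNewFile() { calls.push_back("InitNewFile"); }
  void AppendRowBatch() { calls.push_back("AppendRowBatch"); }
  void Finalize() { calls.push_back("Finalize"); }
  void Close() { calls.push_back("Close"); }
};

// Drives the writer through `num_files` files of `batches_per_file` batches each.
inline void WriteFiles(RecordingWriter* w, int num_files, int batches_per_file) {
  assert(num_files >= 0 && batches_per_file >= 0);
  w->Init();                      // once per writer
  for (int f = 0; f < num_files; ++f) {
    w->InitNewFile();             // once per file
    for (int b = 0; b < batches_per_file; ++b) {
      w->AppendRowBatch();        // repeatedly while rows arrive
    }
    w->Finalize();                // once per InitNewFile()
  }
  w->Close();                     // once, to release resources
}
```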
virtual Status HdfsAvroTableWriter::InitNewFile () [inline, virtual]
Called when a new file is started.
Implements impala::HdfsTableWriter.
Definition at line 68 of file hdfs-avro-table-writer.h.
References WriteFileHeader().
TInsertStats & HdfsTableWriter::stats () [inline, inherited]
Returns the stats for this writer.
Definition at line 86 of file hdfs-table-writer.h.
References impala::HdfsTableWriter::stats_.
Status HdfsTableWriter::Write (const char *data, int32_t len) [inline, protected, inherited]
Write to the current hdfs file.
Definition at line 101 of file hdfs-table-writer.h.
Referenced by impala::HdfsTextTableWriter::Flush(), impala::HdfsSequenceTableWriter::Flush(), Flush(), impala::HdfsParquetTableWriter::FlushCurrentRowGroup(), impala::HdfsTableWriter::Write(), impala::HdfsSequenceTableWriter::WriteCompressedBlock(), impala::HdfsParquetTableWriter::WriteFileFooter(), impala::HdfsSequenceTableWriter::WriteFileHeader(), WriteFileHeader(), and impala::HdfsParquetTableWriter::WriteFileHeader().
Status HdfsTableWriter::Write (const uint8_t *data, int32_t len) [protected, inherited]
Definition at line 36 of file hdfs-table-writer.cc.
References impala::HdfsTableSink::bytes_written_counter(), COUNTER_ADD, impala::OutputPartition::current_file_name, impala::GetHdfsErrorMsg(), impala::OutputPartition::hdfs_connection, impala::Status::OK, impala::HdfsTableWriter::output_, impala::HdfsTableWriter::parent_, impala::HdfsTableWriter::stats_, and impala::OutputPartition::tmp_hdfs_file.
template<typename T > Status HdfsTableWriter::Write (T v) [inline, protected, inherited]
Definition at line 107 of file hdfs-table-writer.h.
References impala::HdfsTableWriter::Write().
Status HdfsAvroTableWriter::WriteFileHeader () [private]
Writes the Avro file header to HDFS.
Definition at line 193 of file hdfs-avro-table-writer.cc.
References AVRO_CODEC_STR, impala::HdfsTableDescriptor::avro_schema(), AVRO_SCHEMA_STR, impala::WriteStream::Clear(), codec_name_, OBJ1, impala::Status::OK, out_, RETURN_IF_ERROR, impala::WriteStream::String(), sync_marker_, impala::HdfsTableWriter::table_desc_, impala::HdfsTableWriter::Write(), impala::WriteStream::WriteBytes(), and impala::WriteStream::WriteZLong().
Referenced by InitNewFile().
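According to the Avro specification, an object container file starts with the four magic bytes 'O', 'b', 'j', 1, followed by a metadata map and the 16-byte sync marker. The sketch below builds such a header; it is an illustration of the spec, not the body of WriteFileHeader(). `AppendZLong`, `AppendBytes` and `BuildHeader` are hypothetical names, and omitting the avro.codec entry when no codec name is given is an assumption made here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Avro long: zig-zag encode, then emit 7 bits per byte, low-order group first.
inline void AppendZLong(std::string* out, int64_t v) {
  uint64_t n = (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
  while (n >= 0x80) {
    out->push_back(static_cast<char>((n & 0x7F) | 0x80));
    n >>= 7;
  }
  out->push_back(static_cast<char>(n));
}

// Avro string/bytes: the length as a long, followed by the raw bytes.
inline void AppendBytes(std::string* out, const std::string& s) {
  AppendZLong(out, static_cast<int64_t>(s.size()));
  out->append(s);
}

// Builds the file header: magic, metadata map, sync marker.
inline std::string BuildHeader(const std::string& schema_json,
                               const std::string& codec_name,
                               const std::string& sync_marker) {
  assert(sync_marker.size() == 16);
  std::string header("Obj\x01", 4);        // magic bytes
  const bool has_codec = !codec_name.empty();
  AppendZLong(&header, has_codec ? 2 : 1); // number of map entries in this block
  AppendBytes(&header, "avro.schema");
  AppendBytes(&header, schema_json);
  if (has_codec) {                         // assumption: skip the entry when unset
    AppendBytes(&header, "avro.codec");
    AppendBytes(&header, codec_name);
  }
  AppendZLong(&header, 0);                 // a zero count terminates the map
  header += sync_marker;
  return header;
}
```

The same sync marker written here must also follow every data block, which is how a reader recognizes block boundaries when it starts reading in the middle of a file.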
std::string HdfsAvroTableWriter::codec_name_ [private]
Name of codec, only set if codec_type_ != NONE.
Definition at line 104 of file hdfs-avro-table-writer.h.
Referenced by Init(), and WriteFileHeader().
THdfsCompression::type HdfsAvroTableWriter::codec_type_ [private]
Type of the codec, will be NONE if no compression is used.
Definition at line 107 of file hdfs-avro-table-writer.h.
boost::scoped_ptr< Codec > HdfsAvroTableWriter::compressor_ [private]
The codec for compressing, only set if codec_type_ != NONE.
Definition at line 110 of file hdfs-avro-table-writer.h.
const int HdfsTableWriter::HDFS_FLUSH_WRITE_SIZE = 50 * 1024 [static, protected, inherited]
Size, in bytes, of output to buffer before calling Write() (which calls hdfsWrite), to minimize the overhead of Write().
Definition at line 98 of file hdfs-table-writer.h.
Referenced by impala::HdfsTextTableWriter::HdfsTextTableWriter(), and impala::HdfsTextTableWriter::Init().
boost::scoped_ptr< MemPool > HdfsAvroTableWriter::mem_pool_ [private]
Memory pool used by codec to allocate output buffer. Owned by this class. Initialized using parent's memtracker.
Definition at line 98 of file hdfs-avro-table-writer.h.
Referenced by Close(), HdfsAvroTableWriter(), and Init().
WriteStream HdfsAvroTableWriter::out_ [private]
Buffer which holds accumulated output.
Definition at line 94 of file hdfs-avro-table-writer.h.
Referenced by AppendField(), AppendRowBatch(), Flush(), and WriteFileHeader().
OutputPartition * HdfsTableWriter::output_ [protected, inherited]
Structure describing partition written to by this writer.
Definition at line 118 of file hdfs-table-writer.h.
Referenced by impala::HdfsTextTableWriter::AppendRowBatch(), impala::HdfsParquetTableWriter::AppendRowBatch(), impala::HdfsParquetTableWriter::InitNewFile(), and impala::HdfsTableWriter::Write().
std::vector< ExprContext * > HdfsTableWriter::output_expr_ctxs_ [protected, inherited]
Expressions that materialize output values.
Definition at line 124 of file hdfs-table-writer.h.
Referenced by impala::HdfsSequenceTableWriter::AppendRowBatch(), impala::HdfsTextTableWriter::AppendRowBatch(), ConsumeRow(), impala::HdfsParquetTableWriter::CreateSchema(), impala::HdfsSequenceTableWriter::EncodeRow(), impala::HdfsTableWriter::HdfsTableWriter(), and impala::HdfsParquetTableWriter::Init().
HdfsTableSink * HdfsTableWriter::parent_ [protected, inherited]
Parent table sink object.
Definition at line 112 of file hdfs-table-writer.h.
Referenced by impala::HdfsSequenceTableWriter::AppendRowBatch(), impala::HdfsTextTableWriter::AppendRowBatch(), impala::HdfsParquetTableWriter::AppendRowBatch(), AppendRowBatch(), impala::HdfsTextTableWriter::Close(), impala::HdfsSequenceTableWriter::ConsumeRow(), impala::HdfsSequenceTableWriter::EncodeRow(), impala::HdfsParquetTableWriter::Finalize(), impala::HdfsTextTableWriter::Flush(), impala::HdfsSequenceTableWriter::Flush(), Flush(), impala::HdfsTableWriter::HdfsTableWriter(), impala::HdfsTextTableWriter::Init(), impala::HdfsTableWriter::Write(), and impala::HdfsSequenceTableWriter::WriteCompressedBlock().
RuntimeState * HdfsTableWriter::state_ [protected, inherited]
Runtime state.
Definition at line 115 of file hdfs-table-writer.h.
Referenced by impala::HdfsParquetTableWriter::default_block_size(), impala::HdfsSequenceTableWriter::Init(), impala::HdfsTextTableWriter::Init(), impala::HdfsParquetTableWriter::Init(), and Init().
TInsertStats HdfsTableWriter::stats_ [protected, inherited]
Subclass should populate any file format specific stats.
Definition at line 127 of file hdfs-table-writer.h.
Referenced by impala::HdfsParquetTableWriter::Finalize(), impala::HdfsTableWriter::stats(), and impala::HdfsTableWriter::Write().
std::string HdfsAvroTableWriter::sync_marker_ [private]
16-byte sync marker (a UUID).
Definition at line 113 of file hdfs-avro-table-writer.h.
Referenced by Flush(), Init(), and WriteFileHeader().
const HdfsTableDescriptor * HdfsTableWriter::table_desc_ [protected, inherited]
Table descriptor of table to be written.
Definition at line 121 of file hdfs-table-writer.h.
Referenced by impala::HdfsParquetTableWriter::AddRowGroup(), impala::HdfsSequenceTableWriter::AppendRowBatch(), impala::HdfsTextTableWriter::AppendRowBatch(), ConsumeRow(), impala::HdfsParquetTableWriter::CreateSchema(), impala::HdfsSequenceTableWriter::EncodeRow(), impala::HdfsParquetTableWriter::FlushCurrentRowGroup(), impala::HdfsTableWriter::HdfsTableWriter(), impala::HdfsParquetTableWriter::Init(), and WriteFileHeader().
uint64_t HdfsAvroTableWriter::unflushed_rows_ [private]
Number of rows consumed since last flush.
Definition at line 101 of file hdfs-avro-table-writer.h.
Referenced by ConsumeRow(), and Flush().