Impala
Impala is the open source, native analytic database for Apache Hadoop.
#include <hdfs-parquet-table-writer.h>
Classes | |
class | BaseColumnWriter |
class | BoolColumnWriter |
class | ColumnWriter |
Public Member Functions | |
HdfsParquetTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output_partition, const HdfsPartitionDescriptor *part_desc, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_expr_ctxs) | |
~HdfsParquetTableWriter () | |
virtual Status | Init () |
Initialize column information. | |
virtual Status | InitNewFile () |
virtual Status | AppendRowBatch (RowBatch *batch, const std::vector< int32_t > &row_group_indices, bool *new_file) |
Appends parquet representation of rows in the batch to the current file. | |
virtual Status | Finalize () |
Write out all the data. | |
virtual void | Close () |
Called once when this writer should clean up any resources. | |
virtual uint64_t | default_block_size () const |
Returns the target HDFS block size to use. | |
virtual std::string | file_extension () const |
Returns the file extension for this writer. | |
TInsertStats & | stats () |
Returns the stats for this writer. | |
Protected Member Functions | |
Status | Write (const char *data, int32_t len) |
Write to the current hdfs file. | |
Status | Write (const uint8_t *data, int32_t len) |
template<typename T > | |
Status | Write (T v) |
Protected Attributes | |
HdfsTableSink * | parent_ |
Parent table sink object. | |
RuntimeState * | state_ |
Runtime state. | |
OutputPartition * | output_ |
Structure describing partition written to by this writer. | |
const HdfsTableDescriptor * | table_desc_ |
Table descriptor of table to be written. | |
std::vector< ExprContext * > | output_expr_ctxs_ |
Expressions that materialize output values. | |
TInsertStats | stats_ |
Subclass should populate any file format specific stats. | |
Static Protected Attributes | |
static const int | HDFS_FLUSH_WRITE_SIZE = 50 * 1024 |
Private Member Functions | |
int64_t | MinBlockSize () const |
Minimum allowable block size in bytes. This is a function of the number of columns. | |
Status | CreateSchema () |
Status | WriteFileHeader () |
Write the file header information to the output file. | |
Status | WriteFileFooter () |
Write the file metadata and footer. | |
Status | FlushCurrentRowGroup () |
Status | AddRowGroup () |
Private Attributes | |
boost::scoped_ptr< ThriftSerializer > | thrift_serializer_ |
parquet::FileMetaData | file_metadata_ |
File metadata thrift description. | |
parquet::RowGroup * | current_row_group_ |
The current row group being written to. | |
std::vector< BaseColumnWriter * > | columns_ |
Array of pointers to column information. | |
int64_t | row_count_ |
Number of rows in current file. | |
int64_t | file_size_estimate_ |
int64_t | file_size_limit_ |
Limit on the total size of the file. | |
int64_t | file_pos_ |
boost::scoped_ptr< MemPool > | reusable_col_mem_pool_ |
boost::scoped_ptr< MemPool > | per_file_mem_pool_ |
int | row_idx_ |
std::vector< uint8_t > | compression_staging_buffer_ |
TParquetInsertStats | parquet_stats_ |
For each column, the on disk size written. | |
Static Private Attributes | |
static const int | DEFAULT_DATA_PAGE_SIZE = 64 * 1024 |
Default data page size. In bytes. | |
static const int64_t | MAX_DATA_PAGE_SIZE = 1024 * 1024 * 1024 |
static const int | HDFS_BLOCK_SIZE = 256 * 1024 * 1024 |
Default hdfs block size. In bytes. | |
static const int | HDFS_BLOCK_ALIGNMENT = 1024 * 1024 |
Align block sizes to this constant. In bytes. | |
static const int | ROW_GROUP_SIZE = HDFS_BLOCK_SIZE |
Default row group size. In bytes. | |
static const int | HDFS_MIN_FILE_SIZE = 8 * 1024 * 1024 |
Minimum file size. If the configured size is less, fail. | |
Friends | |
class | BaseColumnWriter |
template<typename T > | |
class | ColumnWriter |
class | BoolColumnWriter |
The writer consumes all rows passed to it and writes the evaluated output_exprs as a Parquet file in HDFS. TODO: (parts of the format that are not implemented)
Definition at line 49 of file hdfs-parquet-table-writer.h.
HdfsParquetTableWriter::HdfsParquetTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output_partition, const HdfsPartitionDescriptor *part_desc, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_expr_ctxs) |
Definition at line 618 of file hdfs-parquet-table-writer.cc.
HdfsParquetTableWriter::~HdfsParquetTableWriter () |
Definition at line 632 of file hdfs-parquet-table-writer.cc.
Status HdfsParquetTableWriter::AddRowGroup () |
private |
Adds a row group to the metadata and updates current_row_group_ to point to the new row group. Any existing current_row_group_ is flushed.
Definition at line 766 of file hdfs-parquet-table-writer.cc.
References impala::TableDescriptor::col_names(), columns_, current_row_group_, file_metadata_, FlushCurrentRowGroup(), impala::IMPALA_TO_PARQUET_TYPES, impala::TableDescriptor::num_clustering_cols(), impala::Status::OK, impala::PLAIN, RETURN_IF_ERROR, and impala::HdfsTableWriter::table_desc_.
Referenced by InitNewFile().
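Status HdfsParquetTableWriter::AppendRowBatch (RowBatch *batch, const std::vector< int32_t > &row_group_indices, bool *new_file) |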
virtual |
Appends parquet representation of rows in the batch to the current file.
Implements impala::HdfsTableWriter.
Definition at line 862 of file hdfs-parquet-table-writer.cc.
References columns_, impala::HdfsTableSink::encode_timer(), file_size_estimate_, file_size_limit_, impala::RowBatch::GetRow(), impala::OutputPartition::num_rows, impala::RowBatch::num_rows(), impala::Status::OK, impala::HdfsTableWriter::output_, impala::HdfsTableWriter::parent_, RETURN_IF_ERROR, row_count_, row_idx_, and SCOPED_TIMER.
void HdfsParquetTableWriter::Close () |
virtual |
Called once when this writer should clean up any resources.
Implements impala::HdfsTableWriter.
Definition at line 910 of file hdfs-parquet-table-writer.cc.
References columns_, compression_staging_buffer_, per_file_mem_pool_, and reusable_col_mem_pool_.
Status HdfsParquetTableWriter::CreateSchema () |
private |
Fills in the schema portion of the file metadata, converting the schema in table_desc_ into the format used by the file metadata.
Definition at line 733 of file hdfs-parquet-table-writer.cc.
References impala::TableDescriptor::col_names(), columns_, impala::ParquetPlainEncoder::DecimalSize(), file_metadata_, impala::IMPALA_TO_PARQUET_TYPES, impala::TableDescriptor::num_clustering_cols(), impala::Status::OK, impala::HdfsTableWriter::output_expr_ctxs_, impala::HdfsTableWriter::table_desc_, impala::ColumnType::type, impala::TYPE_CHAR, impala::TYPE_DECIMAL, and impala::TYPE_VARCHAR.
Referenced by Init().
uint64_t HdfsParquetTableWriter::default_block_size () const |
virtual |
Returns the target HDFS block size to use.
Implements impala::HdfsTableWriter.
Definition at line 797 of file hdfs-parquet-table-writer.cc.
References HDFS_BLOCK_ALIGNMENT, HDFS_BLOCK_SIZE, MinBlockSize(), impala::RuntimeState::query_options(), impala::BitUtil::RoundUp(), and impala::HdfsTableWriter::state_.
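The references above suggest how the target block size could be derived: a configured or default size, a per-column minimum, and rounding up to HDFS_BLOCK_ALIGNMENT. The following is a minimal sketch under those assumptions, not the Impala implementation. `TargetBlockSize`, the local `RoundUp` helper (a stand-in for impala::BitUtil::RoundUp()), and the `configured_size` parameter are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Constants copied from the class summary.
constexpr int64_t kHdfsBlockSize = 256LL * 1024 * 1024;  // HDFS_BLOCK_SIZE
constexpr int64_t kHdfsBlockAlignment = 1024 * 1024;     // HDFS_BLOCK_ALIGNMENT

// Hypothetical stand-in for impala::BitUtil::RoundUp(): rounds `value` up to
// the next multiple of `factor` (factor must be positive).
constexpr int64_t RoundUp(int64_t value, int64_t factor) {
  return (value + factor - 1) / factor * factor;
}

// Assumption: a positive `configured_size` (for example from a query option)
// overrides the default; otherwise the default is raised to `min_block_size`.
int64_t TargetBlockSize(int64_t configured_size, int64_t min_block_size) {
  int64_t block_size = configured_size > 0
                           ? configured_size
                           : std::max(kHdfsBlockSize, min_block_size);
  // Align the result so that it is a multiple of the alignment constant.
  return RoundUp(block_size, kHdfsBlockAlignment);
}
```

With no configured size and a small minimum, the sketch returns the 256 MB default unchanged because it is already aligned.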
std::string HdfsParquetTableWriter::file_extension () const |
inline, virtual |
Returns the file extension for this writer.
Implements impala::HdfsTableWriter.
Definition at line 79 of file hdfs-parquet-table-writer.h.
Status HdfsParquetTableWriter::Finalize () |
virtual |
Write out all the data.
Implements impala::HdfsTableWriter.
Definition at line 897 of file hdfs-parquet-table-writer.cc.
References COUNTER_ADD, file_metadata_, FlushCurrentRowGroup(), impala::HdfsTableSink::hdfs_write_timer(), impala::Status::OK, impala::HdfsTableWriter::parent_, parquet_stats_, RETURN_IF_ERROR, row_count_, impala::HdfsTableSink::rows_inserted_counter(), SCOPED_TIMER, impala::HdfsTableWriter::stats_, and WriteFileFooter().
Status HdfsParquetTableWriter::FlushCurrentRowGroup () |
private |
Flushes the current row group to file. This will compute the final offsets of column chunks, updating the file metadata.
Definition at line 928 of file hdfs-parquet-table-writer.cc.
References impala::TableDescriptor::col_names(), columns_, current_row_group_, file_pos_, impala::TableDescriptor::num_clustering_cols(), impala::Status::OK, parquet_stats_, RETURN_IF_ERROR, impala::HdfsTableWriter::table_desc_, thrift_serializer_, and impala::HdfsTableWriter::Write().
Referenced by AddRowGroup(), and Finalize().
Status HdfsParquetTableWriter::Init () |
virtual |
Initialize column information.
Implements impala::HdfsTableWriter.
Definition at line 635 of file hdfs-parquet-table-writer.cc.
References impala::ObjectPool::Add(), BoolColumnWriter, columns_, CreateSchema(), file_metadata_, impala::ColumnType::GetByteSize(), impala::Codec::GetCodecName(), IMPALA_BUILD_HASH, IMPALA_BUILD_VERSION, impala::TableDescriptor::num_clustering_cols(), impala::TableDescriptor::num_cols(), impala::RuntimeState::obj_pool(), impala::Status::OK, impala::HdfsTableWriter::output_expr_ctxs_, impala::PARQUET_CURRENT_VERSION, impala::RuntimeState::query_options(), RETURN_IF_ERROR, impala::HdfsTableWriter::state_, impala::HdfsTableWriter::table_desc_, impala::ColumnType::type, impala::HdfsParquetTableWriter::BaseColumnWriter::type(), impala::TYPE_BIGINT, impala::TYPE_BOOLEAN, impala::TYPE_CHAR, impala::TYPE_DECIMAL, impala::TYPE_DOUBLE, impala::TYPE_FLOAT, impala::TYPE_INT, impala::TYPE_SMALLINT, impala::TYPE_STRING, impala::TYPE_TIMESTAMP, impala::TYPE_TINYINT, impala::TYPE_VARCHAR, and VLOG_FILE.
Status HdfsParquetTableWriter::InitNewFile () |
virtual |
Initializes a new file. This resets the file metadata object and writes the file header to the output file.
Implements impala::HdfsTableWriter.
Definition at line 814 of file hdfs-parquet-table-writer.cc.
References AddRowGroup(), columns_, current_row_group_, DEFAULT_DATA_PAGE_SIZE, file_metadata_, file_pos_, file_size_estimate_, file_size_limit_, impala::HdfsTableSink::GetFileBlockSize(), HDFS_MIN_FILE_SIZE, MinBlockSize(), impala::Status::OK, impala::HdfsTableWriter::output_, per_file_mem_pool_, RETURN_IF_ERROR, row_count_, and WriteFileHeader().
int64_t HdfsParquetTableWriter::MinBlockSize () const |
private |
Minimum allowable block size in bytes. This is a function of the number of columns.
Definition at line 792 of file hdfs-parquet-table-writer.cc.
References columns_, and DEFAULT_DATA_PAGE_SIZE.
Referenced by default_block_size(), and InitNewFile().
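The description only states that the minimum depends on the number of columns, and the references point to columns_ and DEFAULT_DATA_PAGE_SIZE. One plausible reading, shown here purely as a sketch, is that each column needs room for a fixed number of data pages. The `pages_per_column` multiplier is a hypothetical parameter; the actual factor is not given on this page.

```cpp
#include <cstdint>

constexpr int64_t kDefaultDataPageSize = 64 * 1024;  // DEFAULT_DATA_PAGE_SIZE

// Assumption: every column must be able to buffer `pages_per_column` data
// pages within one block, so the minimum grows linearly with the column count.
int64_t MinBlockSizeSketch(int64_t num_columns, int64_t pages_per_column) {
  return pages_per_column * kDefaultDataPageSize * num_columns;
}
```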
TInsertStats & HdfsTableWriter::stats () |
inline, inherited |
Returns the stats for this writer.
Definition at line 86 of file hdfs-table-writer.h.
References impala::HdfsTableWriter::stats_.
Status HdfsTableWriter::Write (const char *data, int32_t len) |
inline, protected, inherited |
Write to the current hdfs file.
Definition at line 101 of file hdfs-table-writer.h.
Referenced by impala::HdfsTextTableWriter::Flush(), impala::HdfsSequenceTableWriter::Flush(), impala::HdfsAvroTableWriter::Flush(), FlushCurrentRowGroup(), impala::HdfsTableWriter::Write(), impala::HdfsSequenceTableWriter::WriteCompressedBlock(), WriteFileFooter(), impala::HdfsSequenceTableWriter::WriteFileHeader(), impala::HdfsAvroTableWriter::WriteFileHeader(), and WriteFileHeader().
Status HdfsTableWriter::Write (const uint8_t *data, int32_t len) |
protected, inherited |
Definition at line 36 of file hdfs-table-writer.cc.
References impala::HdfsTableSink::bytes_written_counter(), COUNTER_ADD, impala::OutputPartition::current_file_name, impala::GetHdfsErrorMsg(), impala::OutputPartition::hdfs_connection, impala::Status::OK, impala::HdfsTableWriter::output_, impala::HdfsTableWriter::parent_, impala::HdfsTableWriter::stats_, and impala::OutputPartition::tmp_hdfs_file.
template<typename T > Status HdfsTableWriter::Write (T v) |
inline, protected, inherited |
Definition at line 107 of file hdfs-table-writer.h.
References impala::HdfsTableWriter::Write().
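The three Write() overloads are documented as forwarding to one another. A minimal sketch of that pattern follows, assuming the typed overloads simply reinterpret their argument as bytes. `ByteWriter` is a hypothetical class that appends to an in-memory buffer instead of an HDFS file and returns void instead of Status.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the writer: collects bytes in memory.
class ByteWriter {
 public:
  // Byte-oriented primitive that the other overloads forward to.
  void Write(const uint8_t* data, int32_t len) {
    bytes_.insert(bytes_.end(), data, data + len);
  }

  // Convenience overload for character data.
  void Write(const char* data, int32_t len) {
    Write(reinterpret_cast<const uint8_t*>(data), len);
  }

  // Writes the in-memory representation of a trivially copyable value.
  // The resulting byte order is that of the host machine.
  template <typename T>
  void Write(T v) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable values can be written as raw bytes");
    Write(reinterpret_cast<const uint8_t*>(&v),
          static_cast<int32_t>(sizeof(T)));
  }

  const std::vector<uint8_t>& bytes() const { return bytes_; }

 private:
  std::vector<uint8_t> bytes_;
};
```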
Status HdfsParquetTableWriter::WriteFileFooter () |
private |
Write the file metadata and footer.
Definition at line 977 of file hdfs-parquet-table-writer.cc.
References file_metadata_, impala::Status::OK, impala::PARQUET_VERSION_NUMBER, RETURN_IF_ERROR, thrift_serializer_, and impala::HdfsTableWriter::Write().
Referenced by Finalize().
Status HdfsParquetTableWriter::WriteFileHeader () |
private |
Write the file header information to the output file.
Definition at line 920 of file hdfs-parquet-table-writer.cc.
References file_pos_, file_size_estimate_, impala::Status::OK, impala::PARQUET_VERSION_NUMBER, RETURN_IF_ERROR, and impala::HdfsTableWriter::Write().
Referenced by InitNewFile().
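For orientation, the Parquet format frames a file with the 4-byte magic number "PAR1" at both ends. The footer consists of the serialized file metadata, its length as a 4-byte little-endian integer, and the magic number. The sketch below illustrates that framing with an opaque metadata string. `AppendFileHeader` and `AppendFileFooter` are hypothetical helpers that write to a std::string rather than to HDFS, and they do not perform the Thrift serialization that the real writer does.

```cpp
#include <cstdint>
#include <string>

// The 4-byte Parquet magic number.
const char kParquetMagic[4] = {'P', 'A', 'R', '1'};

// Appends the file header (the magic number) to `file`.
void AppendFileHeader(std::string* file) {
  file->append(kParquetMagic, sizeof(kParquetMagic));
}

// Appends the footer: the serialized metadata, its length as a 4-byte
// little-endian integer, and the magic number again.
void AppendFileFooter(const std::string& serialized_metadata,
                      std::string* file) {
  file->append(serialized_metadata);
  const uint32_t len = static_cast<uint32_t>(serialized_metadata.size());
  for (int i = 0; i < 4; ++i) {
    file->push_back(static_cast<char>((len >> (8 * i)) & 0xFF));
  }
  file->append(kParquetMagic, sizeof(kParquetMagic));
}
```

A reader can locate the metadata by reading the last 8 bytes of the file, which is why the length is written after the metadata rather than before it.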
friend class BaseColumnWriter |
friend |
Definition at line 103 of file hdfs-parquet-table-writer.h.
friend class BoolColumnWriter |
friend |
Definition at line 108 of file hdfs-parquet-table-writer.h.
Referenced by Init().
template<typename T > friend class ColumnWriter |
friend |
Definition at line 106 of file hdfs-parquet-table-writer.h.
std::vector< BaseColumnWriter * > HdfsParquetTableWriter::columns_ |
private |
Array of pointers to column information.
Definition at line 143 of file hdfs-parquet-table-writer.h.
Referenced by AddRowGroup(), AppendRowBatch(), Close(), CreateSchema(), FlushCurrentRowGroup(), Init(), InitNewFile(), and MinBlockSize().
std::vector< uint8_t > HdfsParquetTableWriter::compression_staging_buffer_ |
private |
Staging buffer to use to compress data. This is used only if compression is enabled and is reused between all data pages.
Definition at line 178 of file hdfs-parquet-table-writer.h.
Referenced by Close().
parquet::RowGroup * HdfsParquetTableWriter::current_row_group_ |
private |
The current row group being written to.
Definition at line 140 of file hdfs-parquet-table-writer.h.
Referenced by AddRowGroup(), FlushCurrentRowGroup(), and InitNewFile().
const int HdfsParquetTableWriter::DEFAULT_DATA_PAGE_SIZE = 64 * 1024 |
static, private |
Default data page size. In bytes.
Definition at line 83 of file hdfs-parquet-table-writer.h.
Referenced by InitNewFile(), and MinBlockSize().
parquet::FileMetaData HdfsParquetTableWriter::file_metadata_ |
private |
File metadata thrift description.
Definition at line 137 of file hdfs-parquet-table-writer.h.
Referenced by AddRowGroup(), CreateSchema(), Finalize(), Init(), InitNewFile(), and WriteFileFooter().
int64_t HdfsParquetTableWriter::file_pos_ |
private |
The file location in the current output file. This is the number of bytes that have been written to the file so far. The metadata uses file offsets in a few places.
Definition at line 161 of file hdfs-parquet-table-writer.h.
Referenced by FlushCurrentRowGroup(), InitNewFile(), and WriteFileHeader().
int64_t HdfsParquetTableWriter::file_size_estimate_ |
private |
Current estimate of the total size of the file. The file size estimate includes the running size of the (uncompressed) dictionary, the size of all finalized (compressed) data pages and their page headers. If this size exceeds file_size_limit_, the current data is written and a new file is started.
Definition at line 153 of file hdfs-parquet-table-writer.h.
Referenced by AppendRowBatch(), InitNewFile(), and WriteFileHeader().
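The rollover rule described above can be sketched as a small tracker. `FileSizeEstimate` and its field names are hypothetical; only the two components of the estimate and the comparison against the limit are taken from the description.

```cpp
#include <cstdint>

// Hypothetical tracker for the quantities named in the description.
struct FileSizeEstimate {
  int64_t dictionary_bytes = 0;      // running size of uncompressed dictionaries
  int64_t finalized_page_bytes = 0;  // compressed data pages plus page headers
  int64_t limit = 0;                 // corresponds to file_size_limit_

  // Current estimate of the total file size.
  int64_t Total() const { return dictionary_bytes + finalized_page_bytes; }

  // True once the estimate exceeds the limit, i.e. the current data should be
  // written out and a new file started.
  bool NeedsNewFile() const { return Total() > limit; }
};
```

Because the dictionary is counted uncompressed while data pages are counted compressed, the estimate in this sketch errs on the side of starting a new file early rather than overshooting the limit.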
int64_t HdfsParquetTableWriter::file_size_limit_ |
private |
Limit on the total size of the file.
Definition at line 156 of file hdfs-parquet-table-writer.h.
Referenced by AppendRowBatch(), and InitNewFile().
const int HdfsParquetTableWriter::HDFS_BLOCK_ALIGNMENT = 1024 * 1024 |
static, private |
Align block sizes to this constant. In bytes.
Definition at line 93 of file hdfs-parquet-table-writer.h.
Referenced by default_block_size().
const int HdfsParquetTableWriter::HDFS_BLOCK_SIZE = 256 * 1024 * 1024 |
static, private |
Default hdfs block size. In bytes.
Definition at line 90 of file hdfs-parquet-table-writer.h.
Referenced by default_block_size().
const int HdfsTableWriter::HDFS_FLUSH_WRITE_SIZE = 50 * 1024 |
static, protected, inherited |
Size, in bytes, of output to buffer before calling Write() (which calls hdfsWrite), to minimize the overhead of Write().
Definition at line 98 of file hdfs-table-writer.h.
Referenced by impala::HdfsTextTableWriter::HdfsTextTableWriter(), and impala::HdfsTextTableWriter::Init().
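The buffering idea behind this constant can be sketched as follows. `BufferedSink` is a hypothetical class; its Flush() only counts calls and bytes where a real writer would call Write().

```cpp
#include <cstddef>
#include <string>

// Hypothetical buffered sink: accumulates output and issues a (costly) write
// only once at least `flush_size` bytes are pending.
class BufferedSink {
 public:
  explicit BufferedSink(size_t flush_size) : flush_size_(flush_size) {}

  // Buffers `data` and flushes when the threshold is reached.
  void Append(const std::string& data) {
    buffer_ += data;
    if (buffer_.size() >= flush_size_) Flush();
  }

  // Stands in for the real Write() call; here it only records statistics.
  void Flush() {
    if (buffer_.empty()) return;
    bytes_written_ += buffer_.size();
    ++write_calls_;
    buffer_.clear();
  }

  size_t write_calls() const { return write_calls_; }
  size_t bytes_written() const { return bytes_written_; }

 private:
  size_t flush_size_;
  size_t write_calls_ = 0;
  size_t bytes_written_ = 0;
  std::string buffer_;
};
```

With a threshold of 50 * 1024 bytes, one hundred 1 KB appends result in two write calls instead of one hundred.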
const int HdfsParquetTableWriter::HDFS_MIN_FILE_SIZE = 8 * 1024 * 1024 |
static, private |
Minimum file size. If the configured size is less, fail.
Definition at line 99 of file hdfs-parquet-table-writer.h.
Referenced by InitNewFile().
const int64_t HdfsParquetTableWriter::MAX_DATA_PAGE_SIZE = 1024 * 1024 * 1024 |
static, private |
Max data page size. In bytes. TODO: May need to be increased after addressing IMPALA-1619.
Definition at line 87 of file hdfs-parquet-table-writer.h.
OutputPartition * HdfsTableWriter::output_ |
protected, inherited |
Structure describing partition written to by this writer.
Definition at line 118 of file hdfs-table-writer.h.
Referenced by impala::HdfsTextTableWriter::AppendRowBatch(), AppendRowBatch(), InitNewFile(), and impala::HdfsTableWriter::Write().
std::vector< ExprContext * > HdfsTableWriter::output_expr_ctxs_ |
protected, inherited |
Expressions that materialize output values.
Definition at line 124 of file hdfs-table-writer.h.
Referenced by impala::HdfsSequenceTableWriter::AppendRowBatch(), impala::HdfsTextTableWriter::AppendRowBatch(), impala::HdfsAvroTableWriter::ConsumeRow(), CreateSchema(), impala::HdfsSequenceTableWriter::EncodeRow(), impala::HdfsTableWriter::HdfsTableWriter(), and Init().
HdfsTableSink * HdfsTableWriter::parent_ |
protected, inherited |
Parent table sink object.
Definition at line 112 of file hdfs-table-writer.h.
Referenced by impala::HdfsSequenceTableWriter::AppendRowBatch(), impala::HdfsTextTableWriter::AppendRowBatch(), AppendRowBatch(), impala::HdfsAvroTableWriter::AppendRowBatch(), impala::HdfsTextTableWriter::Close(), impala::HdfsSequenceTableWriter::ConsumeRow(), impala::HdfsSequenceTableWriter::EncodeRow(), Finalize(), impala::HdfsTextTableWriter::Flush(), impala::HdfsSequenceTableWriter::Flush(), impala::HdfsAvroTableWriter::Flush(), impala::HdfsTableWriter::HdfsTableWriter(), impala::HdfsTextTableWriter::Init(), impala::HdfsTableWriter::Write(), and impala::HdfsSequenceTableWriter::WriteCompressedBlock().
TParquetInsertStats HdfsParquetTableWriter::parquet_stats_ |
private |
For each column, the on disk size written.
Definition at line 181 of file hdfs-parquet-table-writer.h.
Referenced by Finalize(), and FlushCurrentRowGroup().
boost::scoped_ptr< MemPool > HdfsParquetTableWriter::per_file_mem_pool_ |
private |
Memory for column/block buffers that is allocated per file. We need to reset this pool after flushing a file.
Definition at line 169 of file hdfs-parquet-table-writer.h.
Referenced by Close(), and InitNewFile().
boost::scoped_ptr< MemPool > HdfsParquetTableWriter::reusable_col_mem_pool_ |
private |
Memory for column/block buffers that are reused for the duration of the writer (i.e. reused across files).
Definition at line 165 of file hdfs-parquet-table-writer.h.
Referenced by Close().
int64_t HdfsParquetTableWriter::row_count_ |
private |
Number of rows in current file.
Definition at line 146 of file hdfs-parquet-table-writer.h.
Referenced by AppendRowBatch(), Finalize(), and InitNewFile().
const int HdfsParquetTableWriter::ROW_GROUP_SIZE = HDFS_BLOCK_SIZE |
static, private |
Default row group size. In bytes.
Definition at line 96 of file hdfs-parquet-table-writer.h.
int HdfsParquetTableWriter::row_idx_ |
private |
Current position in the batch being written. This must be persistent across calls since the writer may stop in the middle of a row batch and ask for a new file.
Definition at line 174 of file hdfs-parquet-table-writer.h.
Referenced by AppendRowBatch().
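The need for a persistent batch position can be illustrated with a simplified, hypothetical writer. In this sketch each row is an int and a "file" is full after `rows_per_file` rows. Only the resume logic mirrors the description above; the real writer decides when to roll over based on its file size estimate.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified writer used only to show the resume logic.
class ResumableWriter {
 public:
  explicit ResumableWriter(size_t rows_per_file)
      : rows_per_file_(rows_per_file) {}

  // Appends rows from `batch`, starting at the saved position. Sets *new_file
  // to true and returns early if the current file is full; the caller then
  // starts a new file and calls Append() again with the same batch.
  void Append(const std::vector<int>& batch, bool* new_file) {
    *new_file = false;
    while (row_idx_ < batch.size()) {
      if (rows_in_file_ == rows_per_file_) {
        *new_file = true;
        return;  // row_idx_ is kept so the next call resumes here
      }
      written_.push_back(batch[row_idx_]);
      ++rows_in_file_;
      ++row_idx_;
    }
    row_idx_ = 0;  // batch fully consumed; reset for the next batch
  }

  // Resets the per-file row count, as a new output file would.
  void StartNewFile() { rows_in_file_ = 0; }

  const std::vector<int>& written() const { return written_; }

 private:
  size_t rows_per_file_;
  size_t rows_in_file_ = 0;
  size_t row_idx_ = 0;  // position in the batch currently being written
  std::vector<int> written_;
};
```

A caller would loop on Append() while *new_file is true, calling StartNewFile() between attempts, so that no row of the batch is written twice or skipped.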
RuntimeState * HdfsTableWriter::state_ |
protected, inherited |
Runtime state.
Definition at line 115 of file hdfs-table-writer.h.
Referenced by default_block_size(), impala::HdfsSequenceTableWriter::Init(), impala::HdfsTextTableWriter::Init(), Init(), and impala::HdfsAvroTableWriter::Init().
TInsertStats HdfsTableWriter::stats_ |
protected, inherited |
Subclass should populate any file format specific stats.
Definition at line 127 of file hdfs-table-writer.h.
Referenced by Finalize(), impala::HdfsTableWriter::stats(), and impala::HdfsTableWriter::Write().
const HdfsTableDescriptor * HdfsTableWriter::table_desc_ |
protected, inherited |
Table descriptor of table to be written.
Definition at line 121 of file hdfs-table-writer.h.
Referenced by AddRowGroup(), impala::HdfsSequenceTableWriter::AppendRowBatch(), impala::HdfsTextTableWriter::AppendRowBatch(), impala::HdfsAvroTableWriter::ConsumeRow(), CreateSchema(), impala::HdfsSequenceTableWriter::EncodeRow(), FlushCurrentRowGroup(), impala::HdfsTableWriter::HdfsTableWriter(), Init(), and impala::HdfsAvroTableWriter::WriteFileHeader().
boost::scoped_ptr< ThriftSerializer > HdfsParquetTableWriter::thrift_serializer_ |
private |
Thrift serializer utility object. Reusing this object allows for fewer memory allocations.
Definition at line 134 of file hdfs-parquet-table-writer.h.
Referenced by FlushCurrentRowGroup(), and WriteFileFooter().