Impala
Impalaistheopensource,nativeanalyticdatabaseforApacheHadoop.
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros
impala::HdfsParquetTableWriter Class Reference

#include <hdfs-parquet-table-writer.h>

Inheritance diagram for impala::HdfsParquetTableWriter:
Collaboration diagram for impala::HdfsParquetTableWriter:

Classes

class  BaseColumnWriter
 
class  BoolColumnWriter
 
class  ColumnWriter
 

Public Member Functions

 HdfsParquetTableWriter (HdfsTableSink *parent, RuntimeState *state, OutputPartition *output_partition, const HdfsPartitionDescriptor *part_desc, const HdfsTableDescriptor *table_desc, const std::vector< ExprContext * > &output_expr_ctxs)
 
 ~HdfsParquetTableWriter ()
 
virtual Status Init ()
 Initialize column information. More...
 
virtual Status InitNewFile ()
 
virtual Status AppendRowBatch (RowBatch *batch, const std::vector< int32_t > &row_group_indices, bool *new_file)
 Appends parquet representation of rows in the batch to the current file. More...
 
virtual Status Finalize ()
 Write out all the data. More...
 
virtual void Close ()
 Called once when this writer should cleanup any resources. More...
 
virtual uint64_t default_block_size () const
 Returns the target HDFS block size to use. More...
 
virtual std::string file_extension () const
 Returns the file extension for this writer. More...
 
TInsertStats & stats ()
 Returns the stats for this writer. More...
 

Protected Member Functions

Status Write (const char *data, int32_t len)
 Write to the current hdfs file. More...
 
Status Write (const uint8_t *data, int32_t len)
 
template<typename T >
Status Write (T v)
 

Protected Attributes

HdfsTableSinkparent_
 Parent table sink object. More...
 
RuntimeStatestate_
 Runtime state. More...
 
OutputPartitionoutput_
 Structure describing partition written to by this writer. More...
 
const HdfsTableDescriptortable_desc_
 Table descriptor of table to be written. More...
 
std::vector< ExprContext * > output_expr_ctxs_
 Expressions that materialize output values. More...
 
TInsertStats stats_
 Subclass should populate any file format specific stats. More...
 

Static Protected Attributes

static const int HDFS_FLUSH_WRITE_SIZE = 50 * 1024
 

Private Member Functions

int64_t MinBlockSize () const
 Minimum allowable block size in bytes. This is a function of the number of columns. More...
 
Status CreateSchema ()
 
Status WriteFileHeader ()
 Write the file header information to the output file. More...
 
Status WriteFileFooter ()
 Write the file metadata and footer. More...
 
Status FlushCurrentRowGroup ()
 
Status AddRowGroup ()
 

Private Attributes

boost::scoped_ptr
< ThriftSerializer
thrift_serializer_
 
parquet::FileMetaData file_metadata_
 File metdata thrift description. More...
 
parquet::RowGroup * current_row_group_
 The current row group being written to. More...
 
std::vector< BaseColumnWriter * > columns_
 array of pointers to column information. More...
 
int64_t row_count_
 Number of rows in current file. More...
 
int64_t file_size_estimate_
 
int64_t file_size_limit_
 Limit on the total size of the file. More...
 
int64_t file_pos_
 
boost::scoped_ptr< MemPoolreusable_col_mem_pool_
 
boost::scoped_ptr< MemPoolper_file_mem_pool_
 
int row_idx_
 
std::vector< uint8_t > compression_staging_buffer_
 
TParquetInsertStats parquet_stats_
 For each column, the on disk size written. More...
 

Static Private Attributes

static const int DEFAULT_DATA_PAGE_SIZE = 64 * 1024
 Default data page size. In bytes. More...
 
static const int64_t MAX_DATA_PAGE_SIZE = 1024 * 1024 * 1024
 
static const int HDFS_BLOCK_SIZE = 256 * 1024 * 1024
 Default hdfs block size. In bytes. More...
 
static const int HDFS_BLOCK_ALIGNMENT = 1024 * 1024
 Align block sizes to this constant. In bytes. More...
 
static const int ROW_GROUP_SIZE = HDFS_BLOCK_SIZE
 Default row group size. In bytes. More...
 
static const int HDFS_MIN_FILE_SIZE = 8 * 1024 * 1024
 Minimum file size. If the configured size is less, fail. More...
 

Friends

class BaseColumnWriter
 
template<typename T >
class ColumnWriter
 
class BoolColumnWriter
 

Detailed Description

The writer consumes all rows passed to it and writes the evaluated output_exprs as a parquet file in hdfs. TODO: (parts of the format that are not implemented)

  • group var encoding
  • compression
  • multiple row groups per file TODO: we need a mechanism to pass the equivalent of serde params to this class from the FE. This includes:
  • compression & codec
  • type of encoding to use for each type

Definition at line 49 of file hdfs-parquet-table-writer.h.

Constructor & Destructor Documentation

HdfsParquetTableWriter::HdfsParquetTableWriter ( HdfsTableSink parent,
RuntimeState state,
OutputPartition output_partition,
const HdfsPartitionDescriptor part_desc,
const HdfsTableDescriptor table_desc,
const std::vector< ExprContext * > &  output_expr_ctxs 
)

Definition at line 618 of file hdfs-parquet-table-writer.cc.

HdfsParquetTableWriter::~HdfsParquetTableWriter ( )

Definition at line 632 of file hdfs-parquet-table-writer.cc.

Member Function Documentation

Status HdfsParquetTableWriter::AddRowGroup ( )
private

Adds a row group to the metadata and updates current_row_group_ to the new row group. current_row_group_ will be flushed.

Definition at line 766 of file hdfs-parquet-table-writer.cc.

References impala::TableDescriptor::col_names(), columns_, current_row_group_, file_metadata_, FlushCurrentRowGroup(), impala::IMPALA_TO_PARQUET_TYPES, impala::TableDescriptor::num_clustering_cols(), impala::Status::OK, impala::PLAIN, RETURN_IF_ERROR, and impala::HdfsTableWriter::table_desc_.

Referenced by InitNewFile().

Status HdfsParquetTableWriter::AppendRowBatch ( RowBatch batch,
const std::vector< int32_t > &  row_group_indices,
bool new_file 
)
virtual
void HdfsParquetTableWriter::Close ( )
virtual

Called once when this writer should cleanup any resources.

Implements impala::HdfsTableWriter.

Definition at line 910 of file hdfs-parquet-table-writer.cc.

References columns_, compression_staging_buffer_, per_file_mem_pool_, and reusable_col_mem_pool_.

uint64_t HdfsParquetTableWriter::default_block_size ( ) const
virtual
virtual std::string impala::HdfsParquetTableWriter::file_extension ( ) const
inlinevirtual

Returns the file extension for this writer.

Implements impala::HdfsTableWriter.

Definition at line 79 of file hdfs-parquet-table-writer.h.

Status HdfsParquetTableWriter::FlushCurrentRowGroup ( )
private

Flushes the current row group to file. This will compute the final offsets of column chunks, updating the file metadata.

Definition at line 928 of file hdfs-parquet-table-writer.cc.

References impala::TableDescriptor::col_names(), columns_, current_row_group_, file_pos_, impala::TableDescriptor::num_clustering_cols(), impala::Status::OK, parquet_stats_, RETURN_IF_ERROR, impala::HdfsTableWriter::table_desc_, thrift_serializer_, and impala::HdfsTableWriter::Write().

Referenced by AddRowGroup(), and Finalize().

Status HdfsParquetTableWriter::InitNewFile ( )
virtual
int64_t HdfsParquetTableWriter::MinBlockSize ( ) const
private

Minimum allowable block size in bytes. This is a function of the number of columns.

Definition at line 792 of file hdfs-parquet-table-writer.cc.

References columns_, and DEFAULT_DATA_PAGE_SIZE.

Referenced by default_block_size(), and InitNewFile().

TInsertStats& impala::HdfsTableWriter::stats ( )
inlineinherited

Returns the stats for this writer.

Definition at line 86 of file hdfs-table-writer.h.

References impala::HdfsTableWriter::stats_.

template<typename T >
Status impala::HdfsTableWriter::Write ( v)
inlineprotectedinherited

Definition at line 107 of file hdfs-table-writer.h.

References impala::HdfsTableWriter::Write().

Status HdfsParquetTableWriter::WriteFileFooter ( )
private
Status HdfsParquetTableWriter::WriteFileHeader ( )
private

Write the file header information to the output file.

Definition at line 920 of file hdfs-parquet-table-writer.cc.

References file_pos_, file_size_estimate_, impala::Status::OK, impala::PARQUET_VERSION_NUMBER, RETURN_IF_ERROR, and impala::HdfsTableWriter::Write().

Referenced by InitNewFile().

Friends And Related Function Documentation

friend class BaseColumnWriter
friend

Definition at line 103 of file hdfs-parquet-table-writer.h.

friend class BoolColumnWriter
friend

Definition at line 108 of file hdfs-parquet-table-writer.h.

Referenced by Init().

template<typename T >
friend class ColumnWriter
friend

Definition at line 106 of file hdfs-parquet-table-writer.h.

Member Data Documentation

std::vector<BaseColumnWriter*> impala::HdfsParquetTableWriter::columns_
private

array of pointers to column information.

Definition at line 143 of file hdfs-parquet-table-writer.h.

Referenced by AddRowGroup(), AppendRowBatch(), Close(), CreateSchema(), FlushCurrentRowGroup(), Init(), InitNewFile(), and MinBlockSize().

std::vector<uint8_t> impala::HdfsParquetTableWriter::compression_staging_buffer_
private

Staging buffer to use to compress data. This is used only if compression is enabled and is reused between all data pages.

Definition at line 178 of file hdfs-parquet-table-writer.h.

Referenced by Close().

parquet::RowGroup* impala::HdfsParquetTableWriter::current_row_group_
private

The current row group being written to.

Definition at line 140 of file hdfs-parquet-table-writer.h.

Referenced by AddRowGroup(), FlushCurrentRowGroup(), and InitNewFile().

const int impala::HdfsParquetTableWriter::DEFAULT_DATA_PAGE_SIZE = 64 * 1024
staticprivate

Default data page size. In bytes.

Definition at line 83 of file hdfs-parquet-table-writer.h.

Referenced by InitNewFile(), and MinBlockSize().

parquet::FileMetaData impala::HdfsParquetTableWriter::file_metadata_
private

File metdata thrift description.

Definition at line 137 of file hdfs-parquet-table-writer.h.

Referenced by AddRowGroup(), CreateSchema(), Finalize(), Init(), InitNewFile(), and WriteFileFooter().

int64_t impala::HdfsParquetTableWriter::file_pos_
private

The file location in the current output file. This is the number of bytes that have been written to the file so far. The metadata uses file offsets in a few places.

Definition at line 161 of file hdfs-parquet-table-writer.h.

Referenced by FlushCurrentRowGroup(), InitNewFile(), and WriteFileHeader().

int64_t impala::HdfsParquetTableWriter::file_size_estimate_
private

Current estimate of the total size of the file. The file size estimate includes the running size of the (uncompressed) dictionary, the size of all finalized (compressed) data pages and their page headers. If this size exceeds file_size_limit_, the current data is written and a new file is started.

Definition at line 153 of file hdfs-parquet-table-writer.h.

Referenced by AppendRowBatch(), InitNewFile(), and WriteFileHeader().

int64_t impala::HdfsParquetTableWriter::file_size_limit_
private

Limit on the total size of the file.

Definition at line 156 of file hdfs-parquet-table-writer.h.

Referenced by AppendRowBatch(), and InitNewFile().

const int impala::HdfsParquetTableWriter::HDFS_BLOCK_ALIGNMENT = 1024 * 1024
staticprivate

Align block sizes to this constant. In bytes.

Definition at line 93 of file hdfs-parquet-table-writer.h.

Referenced by default_block_size().

const int impala::HdfsParquetTableWriter::HDFS_BLOCK_SIZE = 256 * 1024 * 1024
staticprivate

Default hdfs block size. In bytes.

Definition at line 90 of file hdfs-parquet-table-writer.h.

Referenced by default_block_size().

const int impala::HdfsTableWriter::HDFS_FLUSH_WRITE_SIZE = 50 * 1024
staticprotectedinherited

Size to buffer output before calling Write() (which calls hdfsWrite), in bytes to minimize the overhead of Write()

Definition at line 98 of file hdfs-table-writer.h.

Referenced by impala::HdfsTextTableWriter::HdfsTextTableWriter(), and impala::HdfsTextTableWriter::Init().

const int impala::HdfsParquetTableWriter::HDFS_MIN_FILE_SIZE = 8 * 1024 * 1024
staticprivate

Minimum file size. If the configured size is less, fail.

Definition at line 99 of file hdfs-parquet-table-writer.h.

Referenced by InitNewFile().

const int64_t impala::HdfsParquetTableWriter::MAX_DATA_PAGE_SIZE = 1024 * 1024 * 1024
staticprivate

Max data page size. In bytes. TODO: May need to be increased after addressing IMPALA-1619.

Definition at line 87 of file hdfs-parquet-table-writer.h.

OutputPartition* impala::HdfsTableWriter::output_
protectedinherited

Structure describing partition written to by this writer.

Definition at line 118 of file hdfs-table-writer.h.

Referenced by impala::HdfsTextTableWriter::AppendRowBatch(), AppendRowBatch(), InitNewFile(), and impala::HdfsTableWriter::Write().

std::vector<ExprContext*> impala::HdfsTableWriter::output_expr_ctxs_
protectedinherited
TParquetInsertStats impala::HdfsParquetTableWriter::parquet_stats_
private

For each column, the on disk size written.

Definition at line 181 of file hdfs-parquet-table-writer.h.

Referenced by Finalize(), and FlushCurrentRowGroup().

boost::scoped_ptr<MemPool> impala::HdfsParquetTableWriter::per_file_mem_pool_
private

Memory for column/block buffers that is allocated per file. We need to reset this pool after flushing a file.

Definition at line 169 of file hdfs-parquet-table-writer.h.

Referenced by Close(), and InitNewFile().

boost::scoped_ptr<MemPool> impala::HdfsParquetTableWriter::reusable_col_mem_pool_
private

Memory for column/block buffers that are reused for the duration of the writer (i.e. reused across files).

Definition at line 165 of file hdfs-parquet-table-writer.h.

Referenced by Close().

int64_t impala::HdfsParquetTableWriter::row_count_
private

Number of rows in current file.

Definition at line 146 of file hdfs-parquet-table-writer.h.

Referenced by AppendRowBatch(), Finalize(), and InitNewFile().

const int impala::HdfsParquetTableWriter::ROW_GROUP_SIZE = HDFS_BLOCK_SIZE
staticprivate

Default row group size. In bytes.

Definition at line 96 of file hdfs-parquet-table-writer.h.

int impala::HdfsParquetTableWriter::row_idx_
private

Current position in the batch being written. This must be persistent across calls since the writer may stop in the middle of a row batch and ask for a new file.

Definition at line 174 of file hdfs-parquet-table-writer.h.

Referenced by AppendRowBatch().

RuntimeState* impala::HdfsTableWriter::state_
protectedinherited
TInsertStats impala::HdfsTableWriter::stats_
protectedinherited

Subclass should populate any file format specific stats.

Definition at line 127 of file hdfs-table-writer.h.

Referenced by Finalize(), impala::HdfsTableWriter::stats(), and impala::HdfsTableWriter::Write().

boost::scoped_ptr<ThriftSerializer> impala::HdfsParquetTableWriter::thrift_serializer_
private

Thrift serializer utility object. Reusing this object allows for fewer memory allocations.

Definition at line 134 of file hdfs-parquet-table-writer.h.

Referenced by FlushCurrentRowGroup(), and WriteFileFooter().


The documentation for this class was generated from the following files: