Impala
Impala is the open source, native analytic database for Apache Hadoop.
impala::DelimitedTextParser Class Reference

#include <delimited-text-parser.h>

Public Member Functions

 DelimitedTextParser (int num_cols, int num_partition_keys, const bool *is_materialized_col, char tuple_delim, char field_delim_= '\0', char collection_item_delim= '^', char escape_char= '\0')
 num_cols is the total number of columns including partition keys.
 
void ParserReset ()
 Called to initialize parser at beginning of scan range.
 
bool AtTupleStart ()
 Check if we are at the start of a tuple.
 
char escape_char () const
 
Status ParseFieldLocations (int max_tuples, int64_t remaining_len, char **byte_buffer_ptr, char **row_end_locations, FieldLocation *field_locations, int *num_tuples, int *num_fields, char **next_column_start)
 
template<bool process_escapes>
void ParseSingleTuple (int64_t len, char *buffer, FieldLocation *field_locations, int *num_fields)
 Simplified version of ParseSse which does not handle tuple delimiters.
 
int FindFirstInstance (const char *buffer, int len)
 
bool ReturnCurrentColumn () const
 
template<bool process_escapes>
void FillColumns (int len, char **last_column, int *num_fields, impala::FieldLocation *field_locations)
 
bool HasUnfinishedTuple ()
 

Private Member Functions

void ParserInit (HdfsScanNode *scan_node)
 Initialize the parser state.
 
template<bool process_escapes>
void AddColumn (int len, char **next_column_start, int *num_fields, FieldLocation *field_locations)
 
template<bool process_escapes>
void ParseSse (int max_tuples, int64_t *remaining_len, char **byte_buffer_ptr, char **row_end_locations_, FieldLocation *field_locations, int *num_tuples, int *num_fields, char **next_column_start)
 

Private Attributes

__m128i xmm_tuple_search_
 SSE(xmm) register containing the tuple search character.
 
__m128i xmm_delim_search_
 SSE(xmm) register containing the delimiter search character.
 
int num_delims_
 The number of delimiters contained in xmm_delim_search_, i.e. its length.
 
__m128i xmm_escape_search_
 SSE(xmm) register containing the escape search character.
 
char field_delim_
 Character delimiting fields (to become slots).
 
bool process_escapes_
 True if this parser should handle escape characters.
 
char escape_char_
 Escape character. Only used if process_escapes_ is true.
 
char collection_item_delim_
 Character delimiting collection items (to become slots).
 
char tuple_delim_
 Character delimiting tuples.
 
bool current_column_has_escape_
 
bool last_char_is_escape_
 Whether or not the previous character was the escape character.
 
int32_t last_row_delim_offset_
 
uint16_t low_mask_ [16]
 Precomputed masks to process escape characters.
 
uint16_t high_mask_ [16]
 
int num_cols_
 Number of columns in the table (including partition columns)
 
int num_partition_keys_
 Number of partition columns in the table.
 
const bool * is_materialized_col_
 
int column_idx_
 Index to keep track of the current column in the current file.
 
bool unfinished_tuple_
 True if the last tuple is unfinished (not ended with tuple delimiter).
 

Detailed Description

Definition at line 25 of file delimited-text-parser.h.

Constructor & Destructor Documentation

DelimitedTextParser::DelimitedTextParser ( int  num_cols,
int  num_partition_keys,
const bool *  is_materialized_col,
char  tuple_delim,
char  field_delim_ = '\0',
char  collection_item_delim = '^',
char  escape_char = '\0' 
)

num_cols is the total number of columns including partition keys.

The Delimited Text Parser parses text rows that are delimited by specific characters:

  • tuple_delim: delimits tuples
  • field_delim: delimits fields
  • collection_item_delim: delimits collection items
  • escape_char: escapes delimiters, making them part of the data

is_materialized_col should be initialized to an array of length 'num_cols', with is_materialized_col[i] = <true if column i should be materialized, false otherwise>. Owned by caller.

The main method is ParseData which fills in a vector of pointers and lengths to the fields. It also can handle an escape character which masks a tuple or field delimiter that occurs in the data.
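As a rough illustration of the is_materialized_col contract described above, the caller could build the flag array like this. This is a minimal sketch, not Impala code: BuildMaterializedFlags is a hypothetical helper, and the use of std::unique_ptr to express caller ownership is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical helper (not part of Impala): builds an array of length num_cols
// in which flags[i] is true only for the column indexes listed in materialized.
// The caller keeps ownership of the array, matching "Owned by caller" above.
inline std::unique_ptr<bool[]> BuildMaterializedFlags(
    int num_cols, const std::vector<int>& materialized) {
  assert(num_cols >= 0);
  std::unique_ptr<bool[]> flags(new bool[num_cols]());  // value-initialized to false
  for (int idx : materialized) {
    assert(idx >= 0 && idx < num_cols);
    flags[idx] = true;
  }
  return flags;
}
```

The raw pointer obtained from flags.get() would then be what a constructor taking const bool * expects; the array must outlive the parser because the parser does not own it.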

Definition at line 24 of file delimited-text-parser.cc.

References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, collection_item_delim_, escape_char_, field_delim_, high_mask_, low_mask_, num_delims_, ParserReset(), process_escapes_, tuple_delim_, xmm_delim_search_, xmm_escape_search_, and xmm_tuple_search_.

Member Function Documentation

template<bool process_escapes>
void impala::DelimitedTextParser::AddColumn ( int  len,
char **  next_column_start,
int *  num_fields,
FieldLocation *  field_locations 
)
inline private

Helper routine to add a column to the field_locations vector.

Template parameter:

  • process_escapes: if true, the column may have escape characters and the negative of the len will be stored.

Input:

  • len: length of the current column.

Input/Output:

  • next_column_start: start of the current column, moved to the start of the next.
  • num_fields: current number of fields processed, updated to next field.

Output:

  • field_locations: updated with start and length of current field.
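The negative-length convention can be sketched in isolation as follows. This is a simplified stand-in rather than the Impala implementation: FieldLoc and AddColumnSketch are hypothetical names, the escape flag is passed as an argument instead of being read from parser state, the materialization check is omitted, and advancing by len + 1 (to skip the delimiter) is an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for impala::FieldLocation: start pointer plus length.
struct FieldLoc {
  char* start;
  int len;
};

// Records one column and moves *next_column_start to the start of the next one.
// When has_escape is true the length is stored negated, so a later stage can
// tell that the field still needs unescaping.
inline void AddColumnSketch(int len, bool has_escape, char** next_column_start,
                            int* num_fields,
                            std::vector<FieldLoc>* field_locations) {
  assert(len >= 0);
  field_locations->push_back({*next_column_start, has_escape ? -len : len});
  ++(*num_fields);
  // Assumption: skip the column itself and the single delimiter that ended it.
  *next_column_start += len + 1;
}
```

Encoding the flag in the sign avoids a separate per-field boolean; a consumer recovers the true length by taking the absolute value.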

Definition at line 53 of file delimited-text-parser.inline.h.

References column_idx_, current_column_has_escape_, impala::FieldLocation::len, ReturnCurrentColumn(), and impala::FieldLocation::start.

bool impala::DelimitedTextParser::AtTupleStart ( )
inline

Check if we are at the start of a tuple.

Definition at line 53 of file delimited-text-parser.h.

References column_idx_, and num_partition_keys_.

char impala::DelimitedTextParser::escape_char ( ) const
inline

Definition at line 55 of file delimited-text-parser.h.

References escape_char_.

template<bool process_escapes>
void impala::DelimitedTextParser::FillColumns ( int  len,
char **  last_column,
int *  num_fields,
impala::FieldLocation *  field_locations 
)
inline

Fill in columns missing at the end of the tuple. len and last_column may contain the length and the pointer to the last column on which the file ended without a delimiter. Fills in the offsets and lengths in field_locations. If parsing stopped on a delimiter and there is no last column, then len will be 0. Other columns beyond that are filled with 0-length fields. num_fields points to an initialized count of fields and will be incremented by the number of fields added. field_locations will be updated with the start and length of the fields.
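The padding behaviour described above can be approximated with the sketch below. It is an illustration under assumptions, not the Impala code: FieldLoc and FillColumnsSketch are hypothetical names, the current column index is passed in explicitly, the count of added fields is returned instead of being written through num_fields, and every column is treated as materialized.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for impala::FieldLocation.
struct FieldLoc {
  char* start;
  int len;
};

// Emits the unterminated last column (if there is one), then pads with
// zero-length fields until num_cols columns exist. Returns the number of
// fields added.
inline int FillColumnsSketch(int len, char* last_column, int column_idx,
                             int num_cols,
                             std::vector<FieldLoc>* field_locations) {
  assert(column_idx <= num_cols);
  int added = 0;
  if (last_column != nullptr && column_idx < num_cols) {
    field_locations->push_back({last_column, len});
    ++column_idx;
    ++added;
  }
  for (; column_idx < num_cols; ++column_idx, ++added) {
    field_locations->push_back({nullptr, 0});  // missing trailing column
  }
  return added;
}
```

For a four-column table whose last row is "a,b" with no trailing delimiter, calling this after the first column has been recorded yields the field "b" followed by two empty fields.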

Definition at line 71 of file delimited-text-parser.inline.h.

References column_idx_, and num_cols_.

int DelimitedTextParser::FindFirstInstance ( const char *  buffer,
int  len 
)

FindFirstInstance returns the position after the first non-escaped tuple delimiter from the starting offset. Used to find the start of a tuple if jumping into the middle of a text file. Also used to find the sync marker for Sequence and RC files. If no tuple delimiter is found within the buffer, returns -1.
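A scalar sketch of this contract is shown below. It is not the SSE-based implementation: FindFirstInstanceSketch is a hypothetical name, the delimiter and escape character are passed as arguments, '\0' is assumed to mean "no escaping", and any state carried over from a previous buffer (such as a trailing escape or carriage-return handling) is ignored.

```cpp
#include <cassert>

// Returns the offset just past the first tuple delimiter that is not escaped,
// or -1 if the buffer contains none. escape_char == '\0' disables escaping.
inline int FindFirstInstanceSketch(const char* buffer, int len,
                                   char tuple_delim, char escape_char) {
  assert(len >= 0);
  bool escaped = false;  // true when the previous character was an escape
  for (int i = 0; i < len; ++i) {
    const char c = buffer[i];
    if (escaped) {
      escaped = false;  // this character is data, whatever it is
      continue;
    }
    if (escape_char != '\0' && c == escape_char) {
      escaped = true;
      continue;
    }
    if (c == tuple_delim) return i + 1;
  }
  return -1;
}
```

Returning the position after the delimiter means the result can be used directly as the start offset of the next tuple.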

Definition at line 194 of file delimited-text-parser.cc.

References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, escape_char_, impala::CpuInfo::IsSupported(), last_row_delim_offset_, process_escapes_, impala::CpuInfo::SSE4_2, impala::SSE4_cmpestrm(), impala::SSEUtil::SSE_BITMASK, impala::SSEUtil::STRCHR_MODE, tuple_delim_, and xmm_tuple_search_.

Referenced by impala::Validate().

bool impala::DelimitedTextParser::HasUnfinishedTuple ( )
inline

Return true if we have not seen a tuple delimiter for the current tuple being parsed (i.e., the last byte read was not a tuple delimiter).

Definition at line 121 of file delimited-text-parser.h.

References unfinished_tuple_.

Status DelimitedTextParser::ParseFieldLocations ( int  max_tuples,
int64_t  remaining_len,
char **  byte_buffer_ptr,
char **  row_end_locations,
FieldLocation *  field_locations,
int *  num_tuples,
int *  num_fields,
char **  next_column_start 
)

Parses a byte buffer for the field and tuple breaks. This function will write the field start & len to field_locations, which can then be written out to tuples. This function uses SSE (the Intel x86 instruction set extension "Streaming SIMD Extensions") if the hardware supports SSE4.2 instructions. SSE4.2 added string processing instructions that allow for processing 16 characters at a time. Otherwise, this function walks the file_buffer_ character by character.

Input Parameters:

  • max_tuples: The maximum number of tuples that should be parsed. This is used to control how the batching works.
  • remaining_len: Length of data remaining in byte_buffer_ptr.
  • byte_buffer_ptr: Pointer to the buffer containing the data to be parsed.

Output Parameters:

  • field_locations: array of pointers to data fields and their lengths
  • num_tuples: Number of tuples parsed
  • num_fields: Number of materialized fields parsed
  • next_column_start: pointer within file_buffer_ where the next field starts after the return from the call to ParseData
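The character-by-character fallback can be pictured with the sketch below. It is a simplified model under assumptions, not the Impala code: ParseScalarSketch, ParseResult and FieldLoc are hypothetical names, results are returned in a struct rather than through output pointers, every column is treated as materialized, and collection delimiters and carriage-return handling are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for impala::FieldLocation.
struct FieldLoc {
  const char* start;
  int len;  // negated when the field contains an escape character
};

struct ParseResult {
  std::vector<FieldLoc> fields;       // one entry per field seen
  std::vector<const char*> row_ends;  // position of each tuple delimiter
  const char* next_column_start;      // where the next field begins
};

// Walks buf one character at a time, recording field and tuple boundaries
// until len bytes are consumed or max_tuples tuples have been completed.
inline ParseResult ParseScalarSketch(const char* buf, int64_t len,
                                     int max_tuples, char tuple_delim,
                                     char field_delim, char escape_char) {
  assert(len >= 0 && max_tuples >= 0);
  ParseResult out;
  out.next_column_start = buf;
  bool last_char_is_escape = false;
  bool column_has_escape = false;
  for (int64_t i = 0; i < len; ++i) {
    if (static_cast<int>(out.row_ends.size()) >= max_tuples) break;
    const char c = buf[i];
    if (last_char_is_escape) {  // escaped character is plain data
      last_char_is_escape = false;
      continue;
    }
    if (escape_char != '\0' && c == escape_char) {
      last_char_is_escape = true;
      column_has_escape = true;
      continue;
    }
    if (c != field_delim && c != tuple_delim) continue;
    const int field_len = static_cast<int>(buf + i - out.next_column_start);
    out.fields.push_back(
        {out.next_column_start, column_has_escape ? -field_len : field_len});
    out.next_column_start = buf + i + 1;
    column_has_escape = false;
    if (c == tuple_delim) out.row_ends.push_back(buf + i);
  }
  return out;
}
```

Stopping once max_tuples rows are complete leaves next_column_start at the first byte of the next row, so a caller can resume from there with the next batch.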

Definition at line 98 of file delimited-text-parser.cc.

References collection_item_delim_, column_idx_, current_column_has_escape_, escape_char_, field_delim_, impala::CpuInfo::IsSupported(), last_char_is_escape_, last_row_delim_offset_, num_partition_keys_, impala::Status::OK, process_escapes_, impala::CpuInfo::SSE4_2, tuple_delim_, and unfinished_tuple_.

Referenced by impala::Validate().

void impala::DelimitedTextParser::ParserInit ( HdfsScanNode *  scan_node)
private

Initialize the parser state.

void DelimitedTextParser::ParserReset ( )

Called to initialize parser at beginning of scan range.

Definition at line 90 of file delimited-text-parser.cc.

References column_idx_, current_column_has_escape_, last_char_is_escape_, last_row_delim_offset_, and num_partition_keys_.

Referenced by DelimitedTextParser(), and impala::Validate().

template<bool process_escapes>
void impala::DelimitedTextParser::ParseSingleTuple ( int64_t  len,
char *  buffer,
FieldLocation *  field_locations,
int *  num_fields 
)
inline

Simplified version of ParseSse which does not handle tuple delimiters.

Parse a single tuple from buffer.

  • buffer/len are input parameters for the entire record.
  • on return field_locations will contain the start/len for each materialized col.
  • *num_fields returns the number of fields processed.

This function is used to parse sequence file records which do not need to parse for tuple delimiters.

Definition at line 221 of file delimited-text-parser.inline.h.

References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, collection_item_delim_, column_idx_, current_column_has_escape_, escape_char_, field_delim_, high_mask_, impala::CpuInfo::IsSupported(), last_char_is_escape_, LIKELY, low_mask_, num_delims_, num_partition_keys_, impala::ProcessEscapeMask(), impala::CpuInfo::SSE4_2, impala::SSE4_cmpestrm(), impala::SSEUtil::SSE_BITMASK, impala::SSEUtil::STRCHR_MODE, xmm_delim_search_, and xmm_escape_search_.

template<bool process_escapes>
void impala::DelimitedTextParser::ParseSse ( int  max_tuples,
int64_t *  remaining_len,
char **  byte_buffer_ptr,
char **  row_end_locations,
FieldLocation *  field_locations,
int *  num_tuples,
int *  num_fields,
char **  next_column_start 
)
inline private

Helper routine to parse delimited text using SSE instructions. Takes the same arguments as ParseFieldLocations. If the template argument 'process_escapes' is true, this function will handle escapes; otherwise, it will assume the text is unescaped. By using templates, we can special-case the unescaped path for better performance: when process_escapes is false, the escape-handling code is optimized away by the compiler.

SSE optimized raw text file parsing. SSE4_2 added an instruction (with 3 modes) for text processing. The modes mimic strchr, strstr and strcmp. For text parsing, we can leverage the strchr functionality. The instruction operates on two SSE registers:

  • the needle (what you are searching for)
  • the haystack (where you are searching in)

Both registers can contain up to 16 characters. The result is a 16-bit mask with a bit set for each character in the haystack that matched any character in the needle. For example:

  Needle   = 'abcd000000000000' (we're searching for any a's, b's, c's or d's)
  Haystack = 'asdfghjklhjbdwwc' (the raw string)
  Result   = '1010000000011001'
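The needle/haystack example above can be reproduced with a portable emulation of the strchr-like mode. This sketch deliberately avoids SSE intrinsics so it runs on any hardware; StrchrMaskSketch and MaskToString are hypothetical names, and the needle is passed with its explicit length (here just "abcd") rather than zero-padded to 16 characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Portable emulation of the "match any needle character" comparison:
// bit i of the result is set when haystack[i] equals some needle character.
inline unsigned StrchrMaskSketch(const std::string& needle,
                                 const std::string& haystack) {
  assert(needle.size() <= 16 && haystack.size() <= 16);
  unsigned mask = 0;
  for (std::size_t i = 0; i < haystack.size(); ++i) {
    if (needle.find(haystack[i]) != std::string::npos) mask |= 1u << i;
  }
  return mask;
}

// Renders the low n bits of mask in haystack order (position 0 first),
// which is the order used in the example above.
inline std::string MaskToString(unsigned mask, std::size_t n) {
  std::string s(n, '0');
  for (std::size_t i = 0; i < n; ++i) {
    if (mask & (1u << i)) s[i] = '1';
  }
  return s;
}
```

MaskToString(StrchrMaskSketch("abcd", "asdfghjklhjbdwwc"), 16) gives "1010000000011001", matching the Result line. The hardware instruction produces the whole mask in one step, which is where the 16-characters-at-a-time speedup comes from.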

Definition at line 98 of file delimited-text-parser.inline.h.

References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, collection_item_delim_, column_idx_, current_column_has_escape_, escape_char_, field_delim_, high_mask_, impala::CpuInfo::IsSupported(), last_char_is_escape_, last_row_delim_offset_, LIKELY, low_mask_, num_delims_, num_partition_keys_, impala::ProcessEscapeMask(), impala::CpuInfo::SSE4_2, impala::SSE4_cmpestrm(), impala::SSEUtil::SSE_BITMASK, impala::SSEUtil::STRCHR_MODE, tuple_delim_, unfinished_tuple_, UNLIKELY, xmm_delim_search_, and xmm_escape_search_.

bool impala::DelimitedTextParser::ReturnCurrentColumn ( ) const
inline

Will we return the current column to the query? Hive allows cols at the end of the table that are not in the schema. We'll just ignore those columns.
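Based on the members this function references (column_idx_, is_materialized_col_ and num_cols_), the check plausibly looks like the sketch below. This is an assumption about the logic, written as a free function with a hypothetical name, not a copy of the Impala source.

```cpp
#include <cassert>

// A column is returned only if it lies within the table schema and is flagged
// as materialized. Columns at or beyond num_cols (extra trailing columns in
// the file) are ignored without reading the flag array.
inline bool ReturnCurrentColumnSketch(int column_idx, int num_cols,
                                      const bool* is_materialized_col) {
  assert(column_idx >= 0);
  return column_idx < num_cols && is_materialized_col[column_idx];
}
```

The bounds test comes first so that the short-circuit evaluation never indexes past the end of the flag array.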

Definition at line 102 of file delimited-text-parser.h.

References column_idx_, is_materialized_col_, and num_cols_.

Referenced by AddColumn().

Member Data Documentation

char impala::DelimitedTextParser::collection_item_delim_
private

Character delimiting collection items (to become slots).

Definition at line 175 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), ParseFieldLocations(), ParseSingleTuple(), and ParseSse().

int impala::DelimitedTextParser::column_idx_
private

Index to keep track of the current column in the current file.

Definition at line 211 of file delimited-text-parser.h.

Referenced by AddColumn(), AtTupleStart(), FillColumns(), ParseFieldLocations(), ParserReset(), ParseSingleTuple(), ParseSse(), and ReturnCurrentColumn().

bool impala::DelimitedTextParser::current_column_has_escape_
private

Whether or not the current column has an escape character in it (and needs to be unescaped)

Definition at line 182 of file delimited-text-parser.h.

Referenced by AddColumn(), ParseFieldLocations(), ParserReset(), ParseSingleTuple(), and ParseSse().

char impala::DelimitedTextParser::escape_char_
private

Escape character. Only used if process_escapes_ is true.

Definition at line 172 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), escape_char(), FindFirstInstance(), ParseFieldLocations(), ParseSingleTuple(), and ParseSse().

char impala::DelimitedTextParser::field_delim_
private

Character delimiting fields (to become slots).

Definition at line 166 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), ParseFieldLocations(), ParseSingleTuple(), and ParseSse().

uint16_t impala::DelimitedTextParser::high_mask_[16]
private

Definition at line 198 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().

const bool* impala::DelimitedTextParser::is_materialized_col_
private

For each col index [0, num_cols_), true if the column should be materialized. Not owned.

Definition at line 208 of file delimited-text-parser.h.

Referenced by ReturnCurrentColumn().

bool impala::DelimitedTextParser::last_char_is_escape_
private

Whether or not the previous character was the escape character.

Definition at line 185 of file delimited-text-parser.h.

Referenced by ParseFieldLocations(), ParserReset(), ParseSingleTuple(), and ParseSse().

int32_t impala::DelimitedTextParser::last_row_delim_offset_
private

Used for special processing of \r. This will be the offset of the last instance of \r from the end of the current buffer being searched, unless the last row delimiter was not a \r, in which case it will be -1. If the last character in a buffer is \r then the value will be 0. At the start of processing a new buffer, if last_row_delim_offset_ is 0 then it is set to be one more than the size of the buffer so that if the buffer starts with \n it is processed as \r\n.

Definition at line 194 of file delimited-text-parser.h.

Referenced by FindFirstInstance(), ParseFieldLocations(), ParserReset(), and ParseSse().

uint16_t impala::DelimitedTextParser::low_mask_[16]
private

Precomputed masks to process escape characters.

Definition at line 197 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().

int impala::DelimitedTextParser::num_cols_
private

Number of columns in the table (including partition columns)

Definition at line 201 of file delimited-text-parser.h.

Referenced by FillColumns(), and ReturnCurrentColumn().

int impala::DelimitedTextParser::num_delims_
private

The number of delimiters contained in xmm_delim_search_, i.e. its length.

Definition at line 160 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().

int impala::DelimitedTextParser::num_partition_keys_
private

Number of partition columns in the table.

Definition at line 204 of file delimited-text-parser.h.

Referenced by AtTupleStart(), ParseFieldLocations(), ParserReset(), ParseSingleTuple(), and ParseSse().

bool impala::DelimitedTextParser::process_escapes_
private

True if this parser should handle escape characters.

Definition at line 169 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), FindFirstInstance(), and ParseFieldLocations().

char impala::DelimitedTextParser::tuple_delim_
private

Character delimiting tuples.

Definition at line 178 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), FindFirstInstance(), ParseFieldLocations(), and ParseSse().

bool impala::DelimitedTextParser::unfinished_tuple_
private

True if the last tuple is unfinished (not ended with tuple delimiter).

Definition at line 214 of file delimited-text-parser.h.

Referenced by HasUnfinishedTuple(), ParseFieldLocations(), and ParseSse().

__m128i impala::DelimitedTextParser::xmm_delim_search_
private

SSE(xmm) register containing the delimiter search character.

Definition at line 157 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().

__m128i impala::DelimitedTextParser::xmm_escape_search_
private

SSE(xmm) register containing the escape search character.

Definition at line 163 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().

__m128i impala::DelimitedTextParser::xmm_tuple_search_
private

SSE(xmm) register containing the tuple search character.

Definition at line 154 of file delimited-text-parser.h.

Referenced by DelimitedTextParser(), and FindFirstInstance().


The documentation for this class was generated from the following files: