Impala
Impala is the open source, native analytic database for Apache Hadoop.
#include <delimited-text-parser.h>
Public Member Functions
DelimitedTextParser (int num_cols, int num_partition_keys, const bool *is_materialized_col, char tuple_delim, char field_delim_ = '\0', char collection_item_delim = '^', char escape_char = '\0')
num_cols is the total number of columns including partition keys.
void | ParserReset ()
Called to initialize parser at beginning of scan range.
bool | AtTupleStart ()
Check if we are at the start of a tuple.
char | escape_char () const
Status | ParseFieldLocations (int max_tuples, int64_t remaining_len, char **byte_buffer_ptr, char **row_end_locations, FieldLocation *field_locations, int *num_tuples, int *num_fields, char **next_column_start)
template<bool process_escapes>
void | ParseSingleTuple (int64_t len, char *buffer, FieldLocation *field_locations, int *num_fields)
Simplified version of ParseSse which does not handle tuple delimiters.
int | FindFirstInstance (const char *buffer, int len)
bool | ReturnCurrentColumn () const
template<bool process_escapes>
void | FillColumns (int len, char **last_column, int *num_fields, impala::FieldLocation *field_locations)
bool | HasUnfinishedTuple ()
Private Member Functions
void | ParserInit (HdfsScanNode *scan_node)
Initialize the parser state.
template<bool process_escapes>
void | AddColumn (int len, char **next_column_start, int *num_fields, FieldLocation *field_locations)
template<bool process_escapes>
void | ParseSse (int max_tuples, int64_t *remaining_len, char **byte_buffer_ptr, char **row_end_locations_, FieldLocation *field_locations, int *num_tuples, int *num_fields, char **next_column_start)
Private Attributes
__m128i | xmm_tuple_search_
SSE(xmm) register containing the tuple search character.
__m128i | xmm_delim_search_
SSE(xmm) register containing the delimiter search character.
int | num_delims_
The number of delimiters contained in xmm_delim_search_, i.e. its length.
__m128i | xmm_escape_search_
SSE(xmm) register containing the escape search character.
char | field_delim_
Character delimiting fields (to become slots).
bool | process_escapes_
True if this parser should handle escape characters.
char | escape_char_
Escape character. Only used if process_escapes_ is true.
char | collection_item_delim_
Character delimiting collection items (to become slots).
char | tuple_delim_
Character delimiting tuples.
bool | current_column_has_escape_
bool | last_char_is_escape_
Whether or not the previous character was the escape character.
int32_t | last_row_delim_offset_
uint16_t | low_mask_ [16]
Precomputed masks to process escape characters.
uint16_t | high_mask_ [16]
int | num_cols_
Number of columns in the table (including partition columns).
int | num_partition_keys_
Number of partition columns in the table.
const bool * | is_materialized_col_
int | column_idx_
Index to keep track of the current column in the current file.
bool | unfinished_tuple_
True if the last tuple is unfinished (not ended with tuple delimiter).
Definition at line 25 of file delimited-text-parser.h.
DelimitedTextParser::DelimitedTextParser (int num_cols, int num_partition_keys, const bool *is_materialized_col, char tuple_delim, char field_delim_ = '\0', char collection_item_delim = '^', char escape_char = '\0')
num_cols is the total number of columns including partition keys.
The Delimited Text Parser parses text rows that are delimited by specific characters:
tuple_delim: delimits tuples
field_delim: delimits fields
collection_item_delim: delimits collection items
escape_char: escapes delimiters, making them part of the data
is_materialized_col should be initialized to an array of length 'num_cols', with is_materialized_col[i] = <true if column i should be materialized, false otherwise>. Owned by caller.
The main method is ParseFieldLocations, which fills in a vector of pointers and lengths to the fields. It can also handle an escape character which masks a tuple or field delimiter that occurs in the data.
Definition at line 24 of file delimited-text-parser.cc.
References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, collection_item_delim_, escape_char_, field_delim_, high_mask_, low_mask_, num_delims_, ParserReset(), process_escapes_, tuple_delim_, xmm_delim_search_, xmm_escape_search_, and xmm_tuple_search_.
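To illustrate the roles of the four special characters, here is a minimal scalar sketch. SplitDelimited is a hypothetical standalone helper, not part of Impala; it copies fields into strings instead of recording pointers and lengths, and it assumes that '\0' as escape_char means "no escaping".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not Impala code): splits 'data' into rows of raw fields.
// An escaped character never acts as a delimiter. The escape character itself
// is kept in the field, because this sketch only locates fields and does not
// unescape them.
std::vector<std::vector<std::string>> SplitDelimited(
    const std::string& data, char tuple_delim, char field_delim,
    char collection_item_delim, char escape_char) {
  assert(tuple_delim != field_delim);
  const bool process_escapes = escape_char != '\0';  // assumption of this sketch
  std::vector<std::vector<std::string>> rows;
  std::vector<std::string> row;
  std::string field;
  bool last_char_is_escape = false;
  for (char c : data) {
    if (last_char_is_escape) {
      field.push_back(c);  // masked by the preceding escape character
      last_char_is_escape = false;
    } else if (process_escapes && c == escape_char) {
      field.push_back(c);
      last_char_is_escape = true;
    } else if (c == field_delim || c == collection_item_delim) {
      row.push_back(field);
      field.clear();
    } else if (c == tuple_delim) {
      row.push_back(field);
      field.clear();
      rows.push_back(row);
      row.clear();
    } else {
      field.push_back(c);
    }
  }
  // A trailing tuple without a tuple delimiter is still reported.
  if (!field.empty() || !row.empty()) {
    row.push_back(field);
    rows.push_back(row);
  }
  return rows;
}
```

With ',' as the field delimiter and a backslash as the escape character, the input `a,b\,c` yields two fields, `a` and `b\,c`.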
template<bool process_escapes>
void DelimitedTextParser::AddColumn (int len, char **next_column_start, int *num_fields, FieldLocation *field_locations) [inline, private]
Helper routine to add a column to the field_locations vector.
Template parameter:
process_escapes: if true, the column may have escape characters and the negative of the len will be stored.
len: length of the current column.
Input/Output:
next_column_start: start of the current column, moved to the start of the next.
num_fields: current number of fields processed, updated to next field.
Output:
field_locations: updated with start and length of current field.
Definition at line 53 of file delimited-text-parser.inline.h.
References column_idx_, current_column_has_escape_, impala::FieldLocation::len, ReturnCurrentColumn(), and impala::FieldLocation::start.
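The negative-length convention can be sketched as follows. This is a simplified free function, not the member itself: the FieldLocation struct is a stand-in whose field types are assumed, and the has_escape and materialize flags are passed explicitly here where the class would consult its own state.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for impala::FieldLocation; only 'start' and 'len' are modeled and
// their exact types are an assumption.
struct FieldLocation {
  char* start;
  int32_t len;
};

// Hypothetical free-function version of the helper. 'has_escape' and
// 'materialize' replace the member state used by the real class.
void AddColumn(bool has_escape, bool materialize, int len,
               char** next_column_start, int* num_fields,
               FieldLocation* field_locations) {
  assert(len >= 0);
  if (materialize) {
    field_locations[*num_fields].start = *next_column_start;
    // A negative length flags a field that still needs unescaping.
    field_locations[*num_fields].len = has_escape ? -len : len;
    ++(*num_fields);
  }
  // Skip the field and the delimiter that ended it.
  *next_column_start += len + 1;
}
```

A consumer of field_locations would take the absolute value of len as the byte count and treat a negative sign as "contains escape characters".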
bool DelimitedTextParser::AtTupleStart () [inline]
Check if we are at the start of a tuple.
Definition at line 53 of file delimited-text-parser.h.
References column_idx_, and num_partition_keys_.
char DelimitedTextParser::escape_char () const [inline]
Definition at line 55 of file delimited-text-parser.h.
References escape_char_.
template<bool process_escapes>
void DelimitedTextParser::FillColumns (int len, char **last_column, int *num_fields, impala::FieldLocation *field_locations) [inline]
Fill in columns missing at the end of the tuple. len and last_column may contain the length and the pointer to the last column on which the file ended without a delimiter. Fills in the offsets and lengths in field_locations. If parsing stopped on a delimiter and there is no last column then len will be 0. Other columns beyond that are filled with 0 length fields. num_fields points to an initialized count of fields and will be incremented by the number of fields added. field_locations will be updated with the start and length of the fields.
Definition at line 71 of file delimited-text-parser.inline.h.
References column_idx_, and num_cols_.
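A minimal sketch of the padding behavior, assuming every column is materialized and passing the column index and column count as explicit arguments (the class keeps them as column_idx_ and num_cols_). The FieldLocation struct is again a stand-in with assumed field types.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for impala::FieldLocation (field types are an assumption).
struct FieldLocation {
  char* start;
  int32_t len;
};

// Hypothetical free-function version: the first added field gets the pending
// last column, every remaining column gets a zero-length field.
void FillColumns(int num_cols, int* column_idx, int len, char* last_column,
                 int* num_fields, FieldLocation* field_locations) {
  assert(*column_idx <= num_cols);
  while (*column_idx < num_cols) {
    field_locations[*num_fields].start = last_column;
    field_locations[*num_fields].len = len;
    ++(*num_fields);
    ++(*column_idx);
    // Columns after the first one added are empty.
    last_column = nullptr;
    len = 0;
  }
}
```

For a four-column table whose file ends with `a,bc`, one field is already recorded when the buffer runs out; the call adds `bc` with length 2 and two zero-length fields.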
int DelimitedTextParser::FindFirstInstance (const char *buffer, int len)
FindFirstInstance returns the position after the first non-escaped tuple delimiter from the starting offset. Used to find the start of a tuple if jumping into the middle of a text file. Also used to find the sync marker for Sequenced and RC files. If no tuple delimiter is found within the buffer, returns -1.
Definition at line 194 of file delimited-text-parser.cc.
References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, escape_char_, impala::CpuInfo::IsSupported(), last_row_delim_offset_, process_escapes_, impala::CpuInfo::SSE4_2, impala::SSE4_cmpestrm(), impala::SSEUtil::SSE_BITMASK, impala::SSEUtil::STRCHR_MODE, tuple_delim_, and xmm_tuple_search_.
Referenced by impala::Validate().
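The contract described above can be sketched with a plain scalar scan. This is not the SSE-based implementation: the free function below takes the delimiter and escape character as arguments, treats '\0' as "no escape character" (an assumption), and decides whether a delimiter is escaped by counting the escape characters directly before it.

```cpp
#include <cassert>

// Hypothetical scalar version. Returns the index just past the first tuple
// delimiter that is not escaped, or -1 if the buffer contains none.
int FindFirstInstance(const char* buffer, int len, char tuple_delim,
                      char escape_char) {
  assert(len >= 0);
  for (int i = 0; i < len; ++i) {
    if (buffer[i] != tuple_delim) continue;
    // An odd number of escape characters right before the delimiter means
    // the delimiter itself is escaped; an even number means the escapes
    // cancel each other out.
    int num_escapes = 0;
    if (escape_char != '\0') {
      for (int j = i - 1; j >= 0 && buffer[j] == escape_char; --j) {
        ++num_escapes;
      }
    }
    if (num_escapes % 2 == 0) return i + 1;
  }
  return -1;
}
```

Counting backwards from the delimiter matters when a scan starts in the middle of a file, because the scanner cannot know whether earlier bytes left it inside an escape sequence.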
bool DelimitedTextParser::HasUnfinishedTuple () [inline]
Return true if we have not seen a tuple delimiter for the current tuple being parsed (i.e., the last byte read was not a tuple delimiter).
Definition at line 121 of file delimited-text-parser.h.
References unfinished_tuple_.
Status DelimitedTextParser::ParseFieldLocations (int max_tuples, int64_t remaining_len, char **byte_buffer_ptr, char **row_end_locations, FieldLocation *field_locations, int *num_tuples, int *num_fields, char **next_column_start)
Parses a byte buffer for the field and tuple breaks. This function will write the field start & len to field_locations which can then be written out to tuples. This function uses SSE (the Intel x86 instruction set extension "Streaming SIMD Extensions") if the hardware supports SSE4.2 instructions. SSE4.2 added string processing instructions that allow for processing 16 characters at a time. Otherwise, this function walks the file_buffer_ character by character.
Input Parameters:
max_tuples: The maximum number of tuples that should be parsed. This is used to control how the batching works.
remaining_len: Length of data remaining in the byte_buffer_ptr.
byte_buffer_ptr: Pointer to the buffer containing the data to be parsed.
Output Parameters:
field_locations: array of pointers to data fields and their lengths
num_tuples: Number of tuples parsed
num_fields: Number of materialized fields parsed
next_column_start: pointer within file_buffer_ where the next field starts after the return from the call to ParseFieldLocations
Definition at line 98 of file delimited-text-parser.cc.
References collection_item_delim_, column_idx_, current_column_has_escape_, escape_char_, field_delim_, impala::CpuInfo::IsSupported(), last_char_is_escape_, last_row_delim_offset_, num_partition_keys_, impala::Status::OK, process_escapes_, impala::CpuInfo::SSE4_2, tuple_delim_, and unfinished_tuple_.
Referenced by impala::Validate().
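The batching contract (stop after max_tuples, report counts, advance the buffer pointer) can be sketched with a character-by-character loop. This is a hypothetical simplification: it ignores escapes and partition keys, materializes every column, passes the delimiters as arguments, resets both counters on entry, and uses a stand-in FieldLocation with assumed types.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for impala::FieldLocation (field types are an assumption).
struct FieldLocation {
  const char* start;
  int len;
};

// Hypothetical scalar walk over the buffer. The caller must size
// 'field_locations' and 'row_end_locations' for the batch.
void ParseFieldLocationsScalar(char tuple_delim, char field_delim,
                               int max_tuples, int64_t remaining_len,
                               const char** byte_buffer_ptr,
                               const char** row_end_locations,
                               FieldLocation* field_locations, int* num_tuples,
                               int* num_fields,
                               const char** next_column_start) {
  assert(max_tuples > 0);
  *num_tuples = 0;
  *num_fields = 0;
  *next_column_start = *byte_buffer_ptr;
  while (remaining_len > 0 && *num_tuples < max_tuples) {
    const char c = **byte_buffer_ptr;
    if (c == field_delim || c == tuple_delim) {
      // Record the field that ends at this delimiter.
      field_locations[*num_fields].start = *next_column_start;
      field_locations[*num_fields].len =
          static_cast<int>(*byte_buffer_ptr - *next_column_start);
      ++(*num_fields);
      *next_column_start = *byte_buffer_ptr + 1;
      if (c == tuple_delim) {
        row_end_locations[*num_tuples] = *byte_buffer_ptr;
        ++(*num_tuples);
      }
    }
    ++(*byte_buffer_ptr);
    --remaining_len;
  }
}
```

Calling it with max_tuples = 2 on a buffer holding three rows stops right after the second tuple delimiter, so a second call can resume from the updated pointer.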
void DelimitedTextParser::ParserInit (HdfsScanNode *scan_node) [private]
Initialize the parser state.
void DelimitedTextParser::ParserReset ()
Called to initialize parser at beginning of scan range.
Definition at line 90 of file delimited-text-parser.cc.
References column_idx_, current_column_has_escape_, last_char_is_escape_, last_row_delim_offset_, and num_partition_keys_.
Referenced by DelimitedTextParser(), and impala::Validate().
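Judging only from the members this function and AtTupleStart() reference, the reset and the tuple-start check might look like the sketch below. The struct, the concrete reset values, and the comparison in AtTupleStart are assumptions made for illustration, not the documented implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the state touched by ParserReset().
struct ParserState {
  int num_partition_keys;
  int column_idx;
  bool current_column_has_escape;
  bool last_char_is_escape;
  int32_t last_row_delim_offset;

  // Assumed reset values: no pending escape, no remembered row delimiter,
  // and the column index positioned just past the partition key columns,
  // which are not stored in the data file.
  void ParserReset() {
    assert(num_partition_keys >= 0);
    current_column_has_escape = false;
    last_char_is_escape = false;
    last_row_delim_offset = -1;
    column_idx = num_partition_keys;
  }

  // Assumed check: a tuple starts when no file column has been consumed yet.
  bool AtTupleStart() const { return column_idx == num_partition_keys; }
};
```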
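template<bool process_escapes>
void DelimitedTextParser::ParseSingleTuple (int64_t len, char *buffer, FieldLocation *field_locations, int *num_fields) [inline]
Simplified version of ParseSse which does not handle tuple delimiters.
Parse a single tuple from buffer.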
Definition at line 221 of file delimited-text-parser.inline.h.
References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, collection_item_delim_, column_idx_, current_column_has_escape_, escape_char_, field_delim_, high_mask_, impala::CpuInfo::IsSupported(), last_char_is_escape_, LIKELY, low_mask_, num_delims_, num_partition_keys_, impala::ProcessEscapeMask(), impala::CpuInfo::SSE4_2, impala::SSE4_cmpestrm(), impala::SSEUtil::SSE_BITMASK, impala::SSEUtil::STRCHR_MODE, xmm_delim_search_, and xmm_escape_search_.
template<bool process_escapes>
void DelimitedTextParser::ParseSse (int max_tuples, int64_t *remaining_len, char **byte_buffer_ptr, char **row_end_locations_, FieldLocation *field_locations, int *num_tuples, int *num_fields, char **next_column_start) [inline, private]
Helper routine to parse delimited text using SSE instructions. Identical arguments as ParseFieldLocations. If the template argument 'process_escapes' is true, this function will handle escapes; otherwise, it will assume the text is unescaped. By using templates, we can special-case the unescaped path for better performance: in that instantiation the escape handling is optimized away by the compiler.
SSE-optimized raw text file parsing. SSE4.2 added an instruction (with 3 modes) for text processing. The modes mimic strchr, strstr and strcmp. For text parsing, we can leverage the strchr functionality. The instruction operates on two SSE registers.
Definition at line 98 of file delimited-text-parser.inline.h.
References impala::SSEUtil::CHARS_PER_128_BIT_REGISTER, collection_item_delim_, column_idx_, current_column_has_escape_, escape_char_, field_delim_, high_mask_, impala::CpuInfo::IsSupported(), last_char_is_escape_, last_row_delim_offset_, LIKELY, low_mask_, num_delims_, num_partition_keys_, impala::ProcessEscapeMask(), impala::CpuInfo::SSE4_2, impala::SSE4_cmpestrm(), impala::SSEUtil::SSE_BITMASK, impala::SSEUtil::STRCHR_MODE, tuple_delim_, unfinished_tuple_, UNLIKELY, xmm_delim_search_, and xmm_escape_search_.
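The strchr-like mode can be tried in isolation with the `_mm_cmpestrm` intrinsic, which returns a bit mask with one bit per input character. The sketch below is not Impala's code: DelimMask, DelimMaskSse and DelimMaskScalar are hypothetical names, the SSE path is compiled only for GCC or Clang on x86-64, and a scalar loop serves as both the fallback and the reference.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>  // SSE4.2 intrinsics
#define SKETCH_HAS_SSE42 1
#endif

// Scalar reference: bit i is set if chunk[i] is one of the delimiters.
uint16_t DelimMaskScalar(const char* chunk, int chunk_len, const char* delims,
                         int num_delims) {
  assert(chunk_len >= 0 && chunk_len <= 16);
  uint16_t mask = 0;
  for (int i = 0; i < chunk_len; ++i) {
    for (int d = 0; d < num_delims; ++d) {
      if (chunk[i] == delims[d]) mask |= static_cast<uint16_t>(1u << i);
    }
  }
  return mask;
}

#ifdef SKETCH_HAS_SSE42
// One register holds the characters to search for, the other holds up to 16
// characters of input. EQUAL_ANY makes the instruction behave like strchr.
__attribute__((target("sse4.2")))
uint16_t DelimMaskSse(const char* chunk, int chunk_len, const char* delims,
                      int num_delims) {
  assert(chunk_len >= 0 && chunk_len <= 16);
  assert(num_delims >= 0 && num_delims <= 16);
  char needle[16] = {0};
  char haystack[16] = {0};
  std::memcpy(needle, delims, num_delims);
  std::memcpy(haystack, chunk, chunk_len);
  const __m128i xmm_delims =
      _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle));
  const __m128i xmm_buffer =
      _mm_loadu_si128(reinterpret_cast<const __m128i*>(haystack));
  const __m128i xmm_mask = _mm_cmpestrm(
      xmm_delims, num_delims, xmm_buffer, chunk_len,
      _SIDD_CMP_EQUAL_ANY | _SIDD_UBYTE_OPS | _SIDD_BIT_MASK);
  return static_cast<uint16_t>(_mm_extract_epi16(xmm_mask, 0));
}
#endif

// Uses the SSE4.2 path when the CPU supports it, the scalar loop otherwise.
uint16_t DelimMask(const char* chunk, int chunk_len, const char* delims,
                   int num_delims) {
#ifdef SKETCH_HAS_SSE42
  if (__builtin_cpu_supports("sse4.2")) {
    return DelimMaskSse(chunk, chunk_len, delims, num_delims);
  }
#endif
  return DelimMaskScalar(chunk, chunk_len, delims, num_delims);
}
```

A parser would then walk the set bits of the mask from lowest to highest to find each delimiter position within the 16-byte chunk.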
bool DelimitedTextParser::ReturnCurrentColumn () const [inline]
Will we return the current column to the query? Hive allows cols at the end of the table that are not in the schema. We'll just ignore those columns.
Definition at line 102 of file delimited-text-parser.h.
References column_idx_, is_materialized_col_, and num_cols_.
Referenced by AddColumn().
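Given the three members this function references, the check plausibly reduces to a bounds test followed by a lookup. The free function below is a sketch under that assumption, with the member state passed as arguments.

```cpp
#include <cassert>

// Hypothetical free-function version. Columns at or beyond num_cols are
// trailing columns that are not in the schema and are never returned.
bool ReturnCurrentColumn(int column_idx, int num_cols,
                         const bool* is_materialized_col) {
  assert(column_idx >= 0);
  return column_idx < num_cols && is_materialized_col[column_idx];
}
```

The bounds test comes first so that the array is never indexed past its 'num_cols' entries.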
char DelimitedTextParser::collection_item_delim_ [private]
Character delimiting collection items (to become slots).
Definition at line 175 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), ParseFieldLocations(), ParseSingleTuple(), and ParseSse().
int DelimitedTextParser::column_idx_ [private]
Index to keep track of the current column in the current file.
Definition at line 211 of file delimited-text-parser.h.
Referenced by AddColumn(), AtTupleStart(), FillColumns(), ParseFieldLocations(), ParserReset(), ParseSingleTuple(), ParseSse(), and ReturnCurrentColumn().
bool DelimitedTextParser::current_column_has_escape_ [private]
Whether or not the current column has an escape character in it (and needs to be unescaped).
Definition at line 182 of file delimited-text-parser.h.
Referenced by AddColumn(), ParseFieldLocations(), ParserReset(), ParseSingleTuple(), and ParseSse().
char DelimitedTextParser::escape_char_ [private]
Escape character. Only used if process_escapes_ is true.
Definition at line 172 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), escape_char(), FindFirstInstance(), ParseFieldLocations(), ParseSingleTuple(), and ParseSse().
char DelimitedTextParser::field_delim_ [private]
Character delimiting fields (to become slots).
Definition at line 166 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), ParseFieldLocations(), ParseSingleTuple(), and ParseSse().
uint16_t DelimitedTextParser::high_mask_[16] [private]
Definition at line 198 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().
const bool* DelimitedTextParser::is_materialized_col_ [private]
For each col index [0, num_cols_), true if the column should be materialized. Not owned.
Definition at line 208 of file delimited-text-parser.h.
Referenced by ReturnCurrentColumn().
bool DelimitedTextParser::last_char_is_escape_ [private]
Whether or not the previous character was the escape character.
Definition at line 185 of file delimited-text-parser.h.
Referenced by ParseFieldLocations(), ParserReset(), ParseSingleTuple(), and ParseSse().
int32_t DelimitedTextParser::last_row_delim_offset_ [private]
Used for special processing of \r. This will be the offset of the last instance of \r from the end of the current buffer being searched unless the last row delimiter was not a \r, in which case it will be -1. If the last character in a buffer is \r then the value will be 0. At the start of processing a new buffer, if last_row_delim_offset_ is 0 then it is set to be one more than the size of the buffer so that if the buffer starts with \n it is processed as \r\n.
Definition at line 194 of file delimited-text-parser.h.
Referenced by FindFirstInstance(), ParseFieldLocations(), ParserReset(), and ParseSse().
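The problem this member addresses is a carriage-return/line-feed pair that straddles two buffers. The sketch below shows the idea with a plain boolean carried between buffers instead of the offset bookkeeping described above, so it is a deliberately simpler technique. RowCounter is a hypothetical type, and treating a lone carriage return or a lone line feed as a row delimiter is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical row counter that treats CR, LF and CR LF as one row delimiter
// each, even when CR ends one buffer and LF starts the next.
struct RowCounter {
  bool last_char_was_cr = false;  // carried across Consume() calls
  int num_rows = 0;

  void Consume(const char* buf, int len) {
    assert(len >= 0);
    for (int i = 0; i < len; ++i) {
      const char c = buf[i];
      if (c == '\r') {
        ++num_rows;
        last_char_was_cr = true;
      } else {
        // A line feed directly after a carriage return belongs to the same
        // delimiter and must not start another row.
        if (c == '\n' && !last_char_was_cr) ++num_rows;
        last_char_was_cr = false;
      }
    }
  }
};
```

Feeding the two buffers `a` + carriage return and line feed + `b` + line feed counts two rows, the same as feeding the concatenated text in one call.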
uint16_t DelimitedTextParser::low_mask_[16] [private]
Precomputed masks to process escape characters.
Definition at line 197 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().
int DelimitedTextParser::num_cols_ [private]
Number of columns in the table (including partition columns).
Definition at line 201 of file delimited-text-parser.h.
Referenced by FillColumns(), and ReturnCurrentColumn().
int DelimitedTextParser::num_delims_ [private]
The number of delimiters contained in xmm_delim_search_, i.e. its length.
Definition at line 160 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().
int DelimitedTextParser::num_partition_keys_ [private]
Number of partition columns in the table.
Definition at line 204 of file delimited-text-parser.h.
Referenced by AtTupleStart(), ParseFieldLocations(), ParserReset(), ParseSingleTuple(), and ParseSse().
bool DelimitedTextParser::process_escapes_ [private]
True if this parser should handle escape characters.
Definition at line 169 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), FindFirstInstance(), and ParseFieldLocations().
char DelimitedTextParser::tuple_delim_ [private]
Character delimiting tuples.
Definition at line 178 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), FindFirstInstance(), ParseFieldLocations(), and ParseSse().
bool DelimitedTextParser::unfinished_tuple_ [private]
True if the last tuple is unfinished (not ended with tuple delimiter).
Definition at line 214 of file delimited-text-parser.h.
Referenced by HasUnfinishedTuple(), ParseFieldLocations(), and ParseSse().
__m128i DelimitedTextParser::xmm_delim_search_ [private]
SSE(xmm) register containing the delimiter search character.
Definition at line 157 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().
__m128i DelimitedTextParser::xmm_escape_search_ [private]
SSE(xmm) register containing the escape search character.
Definition at line 163 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), ParseSingleTuple(), and ParseSse().
__m128i DelimitedTextParser::xmm_tuple_search_ [private]
SSE(xmm) register containing the tuple search character.
Definition at line 154 of file delimited-text-parser.h.
Referenced by DelimitedTextParser(), and FindFirstInstance().