Class foreign_storage::LazyParquetChunkLoader
-
class
LazyParquetChunkLoader
A lazy Parquet-to-chunk loader.
Public Functions
-
std::list<std::unique_ptr<ChunkMetadata>>
loadChunk
(const std::vector<RowGroupInterval> &row_group_intervals, const int parquet_column_index, std::list<Chunk_NS::Chunk> &chunks, StringDictionary *string_dictionary = nullptr, RejectedRowIndices *rejected_row_indices = nullptr)
Load a number of row groups of a column in a parquet file into a chunk.
NOTE: if more than one chunk is supplied, the first chunk must be the chunk corresponding to the logical column, while the remaining chunks correspond to physical columns (in ascending order of column id). Similarly, if a metadata update is expected, the list of ChunkMetadata unique pointers returned will correspond directly to the chunks list.
- Return
An empty list when no metadata update is applicable; otherwise a list of ChunkMetadata unique pointers with which to update the corresponding column chunk metadata. NOTE: Only ChunkMetadata.sqlType and the min and max values of ChunkMetadata.chunkStats are valid; other values are not set.
- Parameters
row_group_intervals
: - inclusive intervals [start,end] that specify the row groups to load
parquet_column_index
: - the logical column index in the parquet file (and omnisci db) of the column to load
chunks
: - a list containing the chunks to load
string_dictionary
: - a string dictionary for the column being loaded, if applicable
rejected_row_indices
: - optional; if specified, errors will be tracked in this data structure while loading
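Because each interval is inclusive on both ends, an interval [start,end] covers end - start + 1 row groups. The following is a minimal sketch of that arithmetic only. RowGroupIntervalSketch is a hypothetical stand-in for RowGroupInterval, and its field names (file_path, start_index, end_index) are assumptions that this page does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for foreign_storage::RowGroupInterval.
// The field names are assumed for illustration.
struct RowGroupIntervalSketch {
  std::string file_path;
  int start_index;  // first row group to load (inclusive)
  int end_index;    // last row group to load (inclusive)
};

// Counts the row groups covered by a list of inclusive intervals.
std::size_t count_row_groups(const std::vector<RowGroupIntervalSketch>& intervals) {
  std::size_t total = 0;
  for (const auto& interval : intervals) {
    assert(interval.start_index >= 0 &&
           interval.start_index <= interval.end_index);
    // Both endpoints are included, hence the "+ 1".
    total += static_cast<std::size_t>(interval.end_index -
                                      interval.start_index + 1);
  }
  return total;
}
```

For example, the intervals [0,2] in one file and [1,1] in another cover four row groups in total.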
-
std::list<RowGroupMetadata>
metadataScan
(const std::vector<std::string> &file_paths, const ForeignTableSchema &schema, const bool do_metadata_stats_validation = true)
Perform a metadata scan for the paths specified.
- Return
a list of the row group metadata extracted from
file_paths
- Parameters
file_paths
: - (ordered) files of the metadata scan
schema
: - schema of the foreign table to perform metadata scan for
do_metadata_stats_validation
: - validate stats in metadata of parquet files if true
-
std::pair<size_t, size_t>
loadRowGroups
(const RowGroupInterval &row_group_interval, const std::map<int, Chunk_NS::Chunk> &chunks, const ForeignTableSchema &schema, const std::map<int, StringDictionary *> &column_dictionaries, const int num_threads = 1)
Load row groups of data into given chunks.
Note that only logical chunks are expected, because the data is read in an intermediate form into the underlying buffers. This member is intended to be used for import.
- Return
[num_rows_completed,num_rows_rejected] - the number of rows loaded and the number of rows rejected while loading
- Parameters
row_group_interval
: - specifies which row groups to load
chunks
: - map of column index to the chunk into which data will be loaded
schema
: - schema of the foreign table to perform metadata scan for
column_dictionaries
: - a map of string dictionaries for columns that require it
num_threads
: - number of threads to utilize while reading (if applicable)
NOTE: Currently, loading one row group at a time is required.
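Since one row group must be loaded per call, a caller that wants several row groups has to loop and sum the returned pairs. The following is a minimal sketch of that pattern, not the library's own code. LoadOneRowGroup is a hypothetical callback standing in for a single-row-group call to loadRowGroups, and load_one_at_a_time is an illustrative helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical callback standing in for a call that loads exactly one row
// group and returns {num_rows_completed, num_rows_rejected}.
using LoadOneRowGroup =
    std::function<std::pair<std::size_t, std::size_t>(int row_group)>;

// Loads row groups [first, last] (inclusive) one at a time and sums the
// per-call counts into a single {completed, rejected} pair.
std::pair<std::size_t, std::size_t> load_one_at_a_time(
    int first, int last, const LoadOneRowGroup& load_one) {
  assert(first >= 0);
  std::size_t completed = 0;
  std::size_t rejected = 0;
  for (int row_group = first; row_group <= last; ++row_group) {
    const auto [rows_completed, rows_rejected] = load_one(row_group);
    completed += rows_completed;
    rejected += rows_rejected;
  }
  return {completed, rejected};
}
```

In real use the callback would wrap a call to loadRowGroups with an interval whose start and end are both row_group. How that interval is constructed is not shown on this page.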
Public Static Functions
-
bool
isColumnMappingSupported
(const ColumnDescriptor *omnisci_column, const parquet::ColumnDescriptor *parquet_column)
Determine if a Parquet to OmniSci column mapping is supported.
- Return
true if the column mapping is supported by LazyParquetChunkLoader, false otherwise
- Parameters
omnisci_column
: - the column descriptor of the OmniSci column
parquet_column
: - the column descriptor of the Parquet column
Public Static Attributes
-
const int
batch_reader_num_elements
= 4096
Private Functions
-
std::list<std::unique_ptr<ChunkMetadata>>
appendRowGroups
(const std::vector<RowGroupInterval> &row_group_intervals, const int parquet_column_index, const ColumnDescriptor *column_descriptor, std::list<Chunk_NS::Chunk> &chunks, StringDictionary *string_dictionary, RejectedRowIndices *rejected_row_indices)
Private Members
-
std::shared_ptr<arrow::fs::FileSystem>
file_system_
-
FileReaderMap *
file_reader_cache_
-
const RenderGroupAnalyzerMap *
render_group_analyzer_map_
-
std::list<std::unique_ptr<ChunkMetadata>>