Class foreign_storage::LazyParquetChunkLoader

class LazyParquetChunkLoader

A lazy Parquet-to-chunk loader.

Public Functions

LazyParquetChunkLoader(std::shared_ptr<arrow::fs::FileSystem> file_system, FileReaderMap *file_reader_cache, const RenderGroupAnalyzerMap *render_group_analyzer_map)
std::list<std::unique_ptr<ChunkMetadata>> loadChunk(const std::vector<RowGroupInterval> &row_group_intervals, const int parquet_column_index, std::list<Chunk_NS::Chunk> &chunks, StringDictionary *string_dictionary = nullptr, RejectedRowIndices *rejected_row_indices = nullptr)

Load a number of row groups of a column in a parquet file into a chunk

NOTE: if more than one chunk is supplied, the first chunk must correspond to the logical column, while the remaining chunks correspond to physical columns (in ascending order of column id). Similarly, if a metadata update is expected, the returned list of ChunkMetadata pointers corresponds directly to the list of chunks.
Return

An empty list when no metadata update is applicable, otherwise a list of ChunkMetadata pointers with which to update the corresponding column chunk metadata. NOTE: Only ChunkMetadata.sqlType and the min & max values of ChunkMetadata.chunkStats are valid; other values are not set.

Parameters
  • row_group_intervals: - inclusive intervals [start,end] that specify the row groups to load

  • parquet_column_index: - the logical column index in the Parquet file (and OmniSci DB) of the column to load

  • chunks: - a list containing the chunks to load

  • string_dictionary: - a string dictionary for the column being loaded, if applicable

  • rejected_row_indices: - optional; if specified, errors will be tracked in this data structure while loading
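
The inclusive [start,end] convention for row group intervals can be illustrated with a small standalone sketch. The struct below is a hypothetical stand-in for RowGroupInterval, and its field names (file_path, start_index, end_index) are assumptions for illustration only. The helper counts how many row groups a list of such intervals covers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for RowGroupInterval; field names are assumptions.
struct RowGroupIntervalSketch {
  std::string file_path;  // Parquet file that the interval refers to
  int start_index;        // first row group to load (inclusive)
  int end_index;          // last row group to load (inclusive)
};

// Count the row groups covered by a list of inclusive [start,end] intervals.
std::size_t countRowGroups(const std::vector<RowGroupIntervalSketch>& intervals) {
  std::size_t total = 0;
  for (const auto& interval : intervals) {
    // An inclusive interval needs 0 <= start <= end.
    assert(interval.start_index >= 0 &&
           interval.start_index <= interval.end_index);
    // Both endpoints are included, hence the +1.
    total += static_cast<std::size_t>(interval.end_index - interval.start_index) + 1;
  }
  return total;
}
```

Under this convention, an interval with start equal to end covers exactly one row group.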

std::list<RowGroupMetadata> metadataScan(const std::vector<std::string> &file_paths, const ForeignTableSchema &schema, const bool do_metadata_stats_validation = true)

Perform a metadata scan for the paths specified.

Return

a list of the row group metadata extracted from file_paths

Parameters
  • file_paths: - (ordered) files of the metadata scan

  • schema: - schema of the foreign table to perform metadata scan for

  • do_metadata_stats_validation: - validate stats in metadata of parquet files if true

std::pair<size_t, size_t> loadRowGroups(const RowGroupInterval &row_group_interval, const std::map<int, Chunk_NS::Chunk> &chunks, const ForeignTableSchema &schema, const std::map<int, StringDictionary *> &column_dictionaries, const int num_threads = 1)

Load row groups of data into given chunks.

Note that only logical chunks are expected, because the data is read in an intermediate form into the underlying buffers. This member is intended to be used for import.

Return

[num_rows_completed, num_rows_rejected] - the number of rows loaded and the number of rows rejected while loading

Parameters
  • row_group_interval: - specifies which row groups to load

  • chunks: - map of column index to chunk which data will be loaded into

  • schema: - schema of the foreign table to load data for

  • column_dictionaries: - a map of string dictionaries for columns that require it

  • num_threads: - number of threads to utilize while reading (if applicable)

NOTE: Currently, loading one row group at a time is required.
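
Since only one row group can be loaded per call, a caller holding a wider interval would need to split it first. The following is a minimal sketch of that splitting step, not the library's own code. SingleFileInterval is a hypothetical stand-in for RowGroupInterval, and its field names are assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for RowGroupInterval; field names are assumptions.
struct SingleFileInterval {
  std::string file_path;
  int start_index;  // inclusive
  int end_index;    // inclusive
};

// Split an inclusive interval into intervals that each cover one row group,
// so that each piece could be passed to a separate load call.
std::vector<SingleFileInterval> splitIntoSingleRowGroups(
    const SingleFileInterval& interval) {
  assert(interval.start_index <= interval.end_index);
  std::vector<SingleFileInterval> pieces;
  for (int row_group = interval.start_index; row_group <= interval.end_index;
       ++row_group) {
    // start == end marks a single row group.
    pieces.push_back({interval.file_path, row_group, row_group});
  }
  return pieces;
}
```

Each resulting piece has start_index equal to end_index, and the per-call [num_rows_completed, num_rows_rejected] pairs can then be summed by the caller.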

Public Static Functions

bool isColumnMappingSupported(const ColumnDescriptor *omnisci_column, const parquet::ColumnDescriptor *parquet_column)

Determine if a Parquet to OmniSci column mapping is supported.

Return

true if the column mapping is supported by LazyParquetChunkLoader, false otherwise

Parameters
  • omnisci_column: - the column descriptor of the OmniSci column

  • parquet_column: - the column descriptor of the Parquet column

Public Static Attributes

const int batch_reader_num_elements = 4096
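
As an illustration only, and assuming this constant caps the number of values requested per batch read (the reference above does not state how it is used), the sketch below shows how a column of num_values values would break down into batches. The names kBatchReaderNumElements and batchSizes are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the documented value of batch_reader_num_elements.
constexpr std::int64_t kBatchReaderNumElements = 4096;

// Return the size of each batch needed to read num_values values when at
// most kBatchReaderNumElements values are read per batch (an assumption).
std::vector<std::int64_t> batchSizes(std::int64_t num_values) {
  assert(num_values >= 0);
  std::vector<std::int64_t> sizes;
  std::int64_t remaining = num_values;
  while (remaining > 0) {
    const std::int64_t batch = std::min(remaining, kBatchReaderNumElements);
    sizes.push_back(batch);
    remaining -= batch;
  }
  return sizes;
}
```

For example, 10000 values would be read as two full batches of 4096 followed by a final batch of 1808.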

Private Functions

std::list<std::unique_ptr<ChunkMetadata>> appendRowGroups(const std::vector<RowGroupInterval> &row_group_intervals, const int parquet_column_index, const ColumnDescriptor *column_descriptor, std::list<Chunk_NS::Chunk> &chunks, StringDictionary *string_dictionary, RejectedRowIndices *rejected_row_indices)

Private Members

std::shared_ptr<arrow::fs::FileSystem> file_system_
FileReaderMap *file_reader_cache_
const RenderGroupAnalyzerMap *render_group_analyzer_map_