Class foreign_storage::ParquetImporter

class ParquetImporter : public foreign_storage::AbstractFileStorageDataWrapper

Public Functions

ParquetImporter()
ParquetImporter(const int db_id, const ForeignTable *foreign_table, const UserMapping *user_mapping)
void populateChunkMetadata(ChunkMetadataVector &chunk_metadata_vector)

Populates given chunk metadata vector with metadata for all chunks in related foreign table.

Parameters
  • chunk_metadata_vector: - vector that will be populated with chunk metadata

void populateChunkBuffers(const ChunkToBufferMap &required_buffers, const ChunkToBufferMap &optional_buffers, AbstractBuffer *delete_buffer)

Populates given chunk buffers identified by chunk keys. All provided chunk buffers are expected to be for the same fragment.

Parameters
  • required_buffers: - chunk buffers that must always be populated

  • optional_buffers: - chunk buffers that can be optionally populated, if the data wrapper has to scan through chunk data anyways (typically for row wise data formats)

  • delete_buffer: - chunk buffer for fragment’s delete column, if non-null data wrapper is expected to mark deleted rows in buffer and continue processing

std::string getSerializedDataWrapper() const

Serialize internal state of wrapper into file at given path if implemented

void restoreDataWrapperInternals(const std::string &file_path, const ChunkMetadataVector &chunk_metadata)

Restore internal state of datawrapper

Parameters
  • file_path: - location of file created by serializeMetadata

  • chunk_metadata_vector: - vector of chunk metadata recovered from disk

bool isRestored() const
ParallelismLevel getCachedParallelismLevel() const

Gets the desired level of parallelism for the data wrapper when a cache is in use. This affects the optional buffers that the data wrapper is made aware of during data requests.

ParallelismLevel getNonCachedParallelismLevel() const

Gets the desired level of parallelism for the data wrapper when no cache is in use. This affects the optional buffers that the data wrapper is made aware of during data requests.

std::unique_ptr<import_export::ImportBatchResult> getNextImportBatch()

Produce the next ImportBatchResult for import. This is the only functionality of ParquetImporter that is required to be implemented.

Return

a ImportBatchResult for import.

std::vector<std::pair<const ColumnDescriptor *, StringDictionary *>> getStringDictionaries() const

Return string dictionaries that are used per column.

Return

a vector of StringDictionary and ColumnDescriptor pairs

int getMaxNumUsefulThreads() const

Get the maximum number of threads that can do useful computation.

void setNumThreads(const int num_threads)

Set the number of threads to use internally when reading batches.

Private Functions

std::set<std::string> getAllFilePaths()

Private Members

const int db_id_
const ForeignTable *foreign_table_
int num_threads_
std::unique_ptr<AbstractRowGroupIntervalTracker> row_group_interval_tracker_
std::unique_ptr<ForeignTableSchema> schema_
std::shared_ptr<arrow::fs::FileSystem> file_system_
std::unique_ptr<FileReaderMap> file_reader_cache_
std::vector<std::pair<const ColumnDescriptor *, StringDictionary *>> string_dictionaries_per_column_
std::shared_mutex row_group_interval_tracker_mutex_
std::shared_mutex string_dictionaries_per_column_mutex_