parq_tools.utils.memory_utils.parquet_memory_usage
- parq_tools.utils.memory_utils.parquet_memory_usage(input_path, chunk_size=100000, columns=None, max_chunks=None, report_pandas=True, index_columns=None)
Estimate the memory usage of a Parquet file, per column, in three ways:
- compressed (on disk)
- decompressed (Arrow/pyarrow in-memory)
- pandas DataFrame memory usage (optional)
The file is processed in chunks for scalability.
- Parameters:
input_path (Path) – Path to the Parquet file.
chunk_size (int) – Number of rows per chunk to process.
columns (Optional[list[str]]) – Columns to include. If None, use all columns.
max_chunks (Optional[int]) – If set, only process up to this many chunks (for Arrow/pyarrow sampling).
report_pandas (bool) – Whether to report pandas DataFrame memory usage. Default True.
index_columns (Optional[list[str]]) – List of columns to mark as index columns in the report.
- Returns:
Detailed memory usage report with the following structure:
{
    'columns': {
        col: {
            'compressed_bytes': int,      # On-disk size for this column
            'decompressed_bytes': int,    # In-memory (Arrow) size for this column
            'pandas_bytes': int or None,  # In-memory (pandas) size for this column, or None if not reported
            'dtype': str,                 # Arrow dtype string
            'is_index': bool              # True if column is marked as index
        },
        ...
    },
    'total_compressed_bytes': int,        # Total on-disk size
    'total_decompressed_bytes': int,      # Total Arrow in-memory size
    'total_pandas_bytes': int or None,    # Total pandas in-memory size, or None if not reported
    'shape': tuple                        # (n_rows, n_cols)
}
- Return type:
dict
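Example

A minimal usage sketch based on the signature and return structure above; the file path and column names are placeholders, not part of the library.

    from pathlib import Path
    from parq_tools.utils.memory_utils import parquet_memory_usage

    # Estimate memory usage for two (hypothetical) columns, sampling at most
    # 10 chunks of 50,000 rows each and marking "id" as an index column.
    report = parquet_memory_usage(
        Path("data/example.parquet"),  # placeholder path
        chunk_size=50_000,
        columns=["id", "value"],       # placeholder column names
        max_chunks=10,
        report_pandas=True,
        index_columns=["id"],
    )

    # Totals and shape come straight from the report structure documented above.
    n_rows, n_cols = report["shape"]
    print(f"{n_rows} rows x {n_cols} columns")
    print(f"compressed on disk:   {report['total_compressed_bytes']:,} bytes")
    print(f"decompressed (Arrow): {report['total_decompressed_bytes']:,} bytes")
    print(f"pandas DataFrame:     {report['total_pandas_bytes']:,} bytes")

    # Per-column breakdown.
    for name, stats in report["columns"].items():
        print(name, stats["dtype"], stats["compressed_bytes"], stats["is_index"])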