parq_tools.utils.memory_utils.parquet_memory_usage

parq_tools.utils.memory_utils.parquet_memory_usage(input_path, chunk_size=100000, columns=None, max_chunks=None, report_pandas=True, index_columns=None)[source]

Estimate memory usage of a Parquet file, per column, in three ways:

  • compressed size on disk

  • decompressed size in memory (Arrow/pyarrow)

  • pandas DataFrame memory usage (optional)

The file is processed in chunks so that large files can be scanned without loading them fully into memory.

Parameters:
  • input_path (Path) – Path to the Parquet file.

  • chunk_size (int) – Number of rows per chunk to process.

  • columns (Optional[list[str]]) – Columns to include. If None, use all columns.

  • max_chunks (Optional[int]) – If set, process at most this many chunks (used to sample the Arrow/pyarrow in-memory estimate).

  • report_pandas (bool) – Whether to report pandas DataFrame memory usage. Default True.

  • index_columns (Optional[list[str]]) – List of columns to mark as index columns in the report.
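
For example, a minimal call sketch using the parameters above; the file path and column names are purely illustrative:

from pathlib import Path

from parq_tools.utils.memory_utils import parquet_memory_usage

# Sample only the first five chunks and skip the pandas estimate to keep
# the scan cheap; "data.parquet", "id" and "value" are placeholder names.
report = parquet_memory_usage(
    Path("data.parquet"),
    chunk_size=50_000,
    columns=["id", "value"],
    max_chunks=5,
    report_pandas=False,
    index_columns=["id"],
)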

Returns:

Detailed memory usage report with the following structure:

{
    'columns': {
        col: {
            'compressed_bytes': int,      # On-disk size for this column
            'decompressed_bytes': int,   # In-memory (Arrow) size for this column
            'pandas_bytes': int or None, # In-memory (pandas) size for this column, or None if not reported
            'dtype': str,                # Arrow dtype string
            'is_index': bool             # True if column is marked as index
        },
        ...
    },
    'total_compressed_bytes': int,      # Total on-disk size
    'total_decompressed_bytes': int,    # Total Arrow in-memory size
    'total_pandas_bytes': int or None,  # Total pandas in-memory size, or None if not reported
    'shape': tuple                      # (n_rows, n_cols)
}

Return type:

dict
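
A sketch of reading the returned report; the file path is illustrative and the keys follow the structure shown above:

from pathlib import Path

from parq_tools.utils.memory_utils import parquet_memory_usage

report = parquet_memory_usage(Path("data.parquet"))  # illustrative path

n_rows, n_cols = report['shape']
print(f"{n_rows} rows x {n_cols} columns")

for col, stats in report['columns'].items():
    # Guard against zero-byte columns when computing the compression ratio.
    ratio = stats['decompressed_bytes'] / max(stats['compressed_bytes'], 1)
    marker = '*' if stats['is_index'] else ' '
    print(f"{marker} {col:<20} {stats['dtype']:<12} "
          f"disk={stats['compressed_bytes']:,} B  "
          f"arrow={stats['decompressed_bytes']:,} B  "
          f"ratio={ratio:.1f}x")

print(f"Total on disk: {report['total_compressed_bytes']:,} B")
print(f"Total in Arrow memory: {report['total_decompressed_bytes']:,} B")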