Note: Go to the end to download the full example code.
Memory Usage
Parquet files are compressed, columnar data files. This is great for storage and performance, but when working with large datasets it is also useful to understand memory usage on a per-column basis. This enables the user to optimise their data processing and storage strategies.
import tempfile
from pathlib import Path

import pandas as pd

from parq_tools.utils.demo_block_model import create_demo_blockmodel
from parq_tools.utils.memory_utils import parquet_memory_usage, print_parquet_memory_usage
Create a Parquet file for profiling
temp_dir = Path(tempfile.gettempdir()) / "memory_usage_example"
temp_dir.mkdir(parents=True, exist_ok=True)
parquet_file_path: Path = temp_dir / "test_blockmodel.parquet"
# Create a reasonably large model example
df: pd.DataFrame = create_demo_blockmodel(shape=(300, 100, 100), block_size=(10, 10, 5),
corner=(0, 0, 0))
# Add a categorical column and a string column
df["depth_as_string"] = df["depth"].astype(str)
df["depth_as_category"] = pd.Categorical(df["depth"].astype(str))
df.to_parquet(parquet_file_path)
print("Shape:", df.shape)
Shape: (3000000, 5)
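The DataFrame reports five columns because the demo block model is indexed by x, y and z; to_parquet writes those index levels as regular columns, which is why the reports below count eight. As a quick cross-check before profiling the file, pandas can report per-column in-memory sizes directly. A minimal sketch using the standard memory_usage API (these are sizes of the DataFrame in memory, before compression):
# Cross-check with pandas itself: deep=True counts Python object payloads,
# so the raw string column dominates the in-memory footprint.
in_memory = df.memory_usage(deep=True)  # bytes per column (plus the index)
print((in_memory / 1e6).round(1).astype(str) + " MB")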
Memory usage reports
Generate a memory usage report for the Parquet file in several ways.
Full report with index marking
report = parquet_memory_usage(parquet_file_path, index_columns=["x", "y", "z"])
print("\nFull memory usage report (with pandas):")
print_parquet_memory_usage(report)
Full memory usage report (with pandas):
Shape: (3000000, 8)
Total compressed: 27.9 MB
Total decompressed (Arrow): 167.5 MB
Total pandas: 294.4 MB
Per-column breakdown:
Column              Dtype         Compressed  Arrow    Pandas    Dtype details
c_order_xyz         int64         12.2 MB     23.2 MB  22.9 MB
f_order_zyx         int64         15.2 MB     23.2 MB  22.9 MB
depth               double        136.1 KB    23.2 MB  22.9 MB
depth_as_string     string        135.9 KB    25.1 MB  153.9 MB
depth_as_category   dictionary<>  135.8 KB    2.9 MB   3.2 MB    values=string, indices=int8, ordered=0
x*                  double        2.8 KB      23.2 MB  22.9 MB
y*                  double        6.6 KB      23.2 MB  22.9 MB
z*                  double        136.1 KB    23.2 MB  22.9 MB
(* = index column)
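The table makes the trade-off explicit: the same values cost 153.9 MB as pandas strings but only 3.2 MB as a dictionary-encoded categorical. A sketch of the conversion, as a hypothetical follow-up that is not part of the original script:
# Converting a low-cardinality string column to a pandas categorical
# typically shrinks it dramatically (roughly 50x here, per the table above).
as_category = df["depth_as_string"].astype("category")
print(f"string:   {df['depth_as_string'].memory_usage(deep=True) / 1e6:.1f} MB")
print(f"category: {as_category.memory_usage(deep=True) / 1e6:.1f} MB")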
Report without pandas memory usage
report_no_pandas = parquet_memory_usage(parquet_file_path, report_pandas=False)
print("\nMemory usage report (Arrow only, no pandas):")
print_parquet_memory_usage(report_no_pandas)
Memory usage report (Arrow only, no pandas):
Shape: (3000000, 8)
Total compressed: 27.9 MB
Total decompressed (Arrow): 167.5 MB
Per-column breakdown:
Column              Dtype         Compressed  Arrow    Pandas  Dtype details
c_order_xyz         int64         12.2 MB     23.2 MB  -
f_order_zyx         int64         15.2 MB     23.2 MB  -
depth               double        136.1 KB    23.2 MB  -
depth_as_string     string        135.9 KB    25.1 MB  -
depth_as_category   dictionary<>  135.8 KB    2.9 MB   -       values=string, indices=int8, ordered=0
x                   double        2.8 KB      23.2 MB  -
y                   double        6.6 KB      23.2 MB  -
z                   double        136.1 KB    23.2 MB  -
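Skipping the pandas estimate is useful when only on-disk versus decompressed Arrow sizes matter. A sketch of deriving per-column compression ratios, assuming report_no_pandas exposes the same "columns" dictionary layout shown at the end of this example:
# Compression ratio = decompressed Arrow bytes / compressed Parquet bytes.
# The coordinate columns compress extremely well, likely because their
# sorted, repetitive values suit run-length-style encodings.
for name, col in report_no_pandas["columns"].items():
    ratio = col["decompressed_bytes"] / max(col["compressed_bytes"], 1)
    print(f"{name:20s} {ratio:10,.0f}x")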
Report for a subset of columns
subset_cols = ["x", "y", "depth", "depth_as_category"]
report_subset = parquet_memory_usage(parquet_file_path, columns=subset_cols, index_columns=["x", "y"])
print("\nMemory usage report (subset of columns):")
print_parquet_memory_usage(report_subset)
Memory usage report (subset of columns):
Shape: (3000000, 4)
Total compressed: 281.3 KB
Total decompressed (Arrow): 72.6 MB
Total pandas: 71.8 MB
Per-column breakdown:
Column              Dtype         Compressed  Arrow    Pandas   Dtype details
x*                  double        2.8 KB      23.2 MB  22.9 MB
y*                  double        6.6 KB      23.2 MB  22.9 MB
depth               double        136.1 KB    23.2 MB  22.9 MB
depth_as_category   dictionary<>  135.8 KB    2.9 MB   3.2 MB   values=string, indices=int8, ordered=0
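To actually realise these savings when loading the data, project the same column subset at read time; this is standard pandas/Parquet column projection rather than anything parq_tools-specific:
# Only the requested columns are decompressed and materialised.
df_subset = pd.read_parquet(parquet_file_path, columns=subset_cols)
print(f"{df_subset.memory_usage(deep=True).sum() / 1e6:.1f} MB")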
Accessing the structured dictionary
The same information is available as a structured dictionary, which is useful for programmatic use.
print("\nAccessing the structured dictionary:")
print({k: v for k, v in report["columns"].items() if k in subset_cols})
Accessing the structured dictionary:
{'depth':             {'compressed_bytes': 139375, 'decompressed_bytes': 24375000,
                       'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': False},
 'depth_as_category': {'compressed_bytes': 139040, 'decompressed_bytes': 3028096,
                       'pandas_bytes': 3305472,
                       'dtype': 'dictionary<values=string, indices=int8, ordered=0>',
                       'is_index': False},
 'x':                 {'compressed_bytes': 2820, 'decompressed_bytes': 24375000,
                       'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': True},
 'y':                 {'compressed_bytes': 6804, 'decompressed_bytes': 24375000,
                       'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': True}}
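For example, the dictionary makes it easy to rank columns by their projected pandas footprint; a small sketch over the keys shown above:
# Sort columns by estimated pandas memory, largest first.
heaviest = sorted(report["columns"].items(),
                  key=lambda kv: kv[1]["pandas_bytes"], reverse=True)
for name, col in heaviest:
    print(f"{name:20s} {col['pandas_bytes'] / 1e6:8.1f} MB")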
Total running time of the script: (0 minutes 5.267 seconds)