Note: Go to the end to download the full example code.
Memory Usage
Parquet files are compressed, columnar data files. This is great for storage and performance, but when working with large datasets it is also useful to understand memory usage on a per-column basis. This enables the user to optimise their data processing and storage strategies.
import tempfile
from pathlib import Path

import pandas as pd

from parq_tools.utils.demo_block_model import create_demo_blockmodel
from parq_tools.utils.memory_utils import parquet_memory_usage, print_parquet_memory_usage
Create a Parquet file for profiling
temp_dir = Path(tempfile.gettempdir()) / "memory_usage_example"
temp_dir.mkdir(parents=True, exist_ok=True)
parquet_file_path: Path = temp_dir / "test_blockmodel.parquet"
# Create a reasonably large model example
df: pd.DataFrame = create_demo_blockmodel(shape=(300, 100, 100), block_size=(10, 10, 5),
corner=(0, 0, 0))
# Add a categorical column and a string column
df["depth_as_string"] = df["depth"].astype(str)
df["depth_as_category"] = pd.Categorical(df["depth"].astype(str))
df.to_parquet(parquet_file_path)
print("Shape:", df.shape)
Shape: (3000000, 5)
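The DataFrame reports five columns because the demo block model is indexed by x, y and z; to_parquet writes those index levels as regular columns, which is why the reports below count eight. As a quick cross-check before profiling the file, pandas can report per-column in-memory sizes directly. A minimal sketch using the standard memory_usage API (these are sizes of the DataFrame in memory, before compression):
# Cross-check with pandas itself: deep=True counts Python object payloads,
# so the raw string column dominates the in-memory footprint.
in_memory = df.memory_usage(deep=True)  # bytes per column (plus the index)
print((in_memory / 1e6).round(1).astype(str) + " MB")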
Memory usage reports
Generate a memory usage report for the Parquet file in several ways.
Full report with index marking
report = parquet_memory_usage(parquet_file_path, index_columns=["x", "y", "z"])
print("\nFull memory usage report (with pandas):")
print_parquet_memory_usage(report)
Full memory usage report (with pandas):
Shape: (3000000, 8)
Total compressed: 27.9 MB
Total decompressed (Arrow): 167.5 MB
Total pandas: 294.4 MB
Per-column breakdown:
Column              Dtype         Compressed  Arrow    Pandas    Dtype details
c_order_xyz         int64         12.2 MB     23.2 MB  22.9 MB
f_order_zyx         int64         15.2 MB     23.2 MB  22.9 MB
depth               double        136.1 KB    23.2 MB  22.9 MB
depth_as_string     string        135.9 KB    25.1 MB  153.9 MB
depth_as_category   dictionary<>  135.8 KB    2.9 MB   3.2 MB    values=string, indices=int8, ordered=0
x*                  double        2.8 KB      23.2 MB  22.9 MB
y*                  double        6.6 KB      23.2 MB  22.9 MB
z*                  double        136.1 KB    23.2 MB  22.9 MB
(* = index column)
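The table makes the trade-off explicit: the same values cost 153.9 MB as pandas strings but only 3.2 MB as a dictionary-encoded categorical. A sketch of the conversion, as a hypothetical follow-up that is not part of the original script:
# Converting a low-cardinality string column to a pandas categorical
# typically shrinks it dramatically (roughly 50x here, per the table above).
as_category = df["depth_as_string"].astype("category")
print(f"string:   {df['depth_as_string'].memory_usage(deep=True) / 1e6:.1f} MB")
print(f"category: {as_category.memory_usage(deep=True) / 1e6:.1f} MB")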
Report without pandas memory usage
report_no_pandas = parquet_memory_usage(parquet_file_path, report_pandas=False)
print("\nMemory usage report (Arrow only, no pandas):")
print_parquet_memory_usage(report_no_pandas)
Memory usage report (Arrow only, no pandas):
Shape: (3000000, 8)
Total compressed: 27.9 MB
Total decompressed (Arrow): 167.5 MB
Per-column breakdown:
Column              Dtype         Compressed  Arrow    Pandas  Dtype details
c_order_xyz         int64         12.2 MB     23.2 MB  -
f_order_zyx         int64         15.2 MB     23.2 MB  -
depth               double        136.1 KB    23.2 MB  -
depth_as_string     string        135.9 KB    25.1 MB  -
depth_as_category   dictionary<>  135.8 KB    2.9 MB   -       values=string, indices=int8, ordered=0
x                   double        2.8 KB      23.2 MB  -
y                   double        6.6 KB      23.2 MB  -
z                   double        136.1 KB    23.2 MB  -
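Skipping the pandas estimate is useful when only on-disk versus decompressed Arrow sizes matter. A sketch of deriving per-column compression ratios, assuming report_no_pandas exposes the same "columns" dictionary layout shown at the end of this example:
# Compression ratio = decompressed Arrow bytes / compressed Parquet bytes.
# The coordinate columns compress extremely well, likely because their
# sorted, repetitive values suit run-length-style encodings.
for name, col in report_no_pandas["columns"].items():
    ratio = col["decompressed_bytes"] / max(col["compressed_bytes"], 1)
    print(f"{name:20s} {ratio:10,.0f}x")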
Report for a subset of columns
subset_cols = ["x", "y", "depth", "depth_as_category"]
report_subset = parquet_memory_usage(parquet_file_path, columns=subset_cols, index_columns=["x", "y"])
print("\nMemory usage report (subset of columns):")
print_parquet_memory_usage(report_subset)
Memory usage report (subset of columns):
Shape: (3000000, 4)
Total compressed: 281.3 KB
Total decompressed (Arrow): 72.6 MB
Total pandas: 71.8 MB
Per-column breakdown:
Column              Dtype         Compressed  Arrow    Pandas   Dtype details
x*                  double        2.8 KB      23.2 MB  22.9 MB
y*                  double        6.6 KB      23.2 MB  22.9 MB
depth               double        136.1 KB    23.2 MB  22.9 MB
depth_as_category   dictionary<>  135.8 KB    2.9 MB   3.2 MB   values=string, indices=int8, ordered=0
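To actually realise these savings when loading the data, project the same column subset at read time; this is standard pandas/Parquet column projection rather than anything parq_tools-specific:
# Only the requested columns are decompressed and materialised.
df_subset = pd.read_parquet(parquet_file_path, columns=subset_cols)
print(f"{df_subset.memory_usage(deep=True).sum() / 1e6:.1f} MB")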
Accessing the structured dictionary
The same information is available as a structured dictionary, which is useful for programmatic use.
print("\nAccessing the structured dictionary:")
print({k: v for k, v in report["columns"].items() if k in subset_cols})
Accessing the structured dictionary:
{'depth':             {'compressed_bytes': 139375, 'decompressed_bytes': 24375000,
                       'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': False},
 'depth_as_category': {'compressed_bytes': 139040, 'decompressed_bytes': 3028096,
                       'pandas_bytes': 3305472,
                       'dtype': 'dictionary<values=string, indices=int8, ordered=0>',
                       'is_index': False},
 'x':                 {'compressed_bytes': 2820, 'decompressed_bytes': 24375000,
                       'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': True},
 'y':                 {'compressed_bytes': 6804, 'decompressed_bytes': 24375000,
                       'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': True}}
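For example, the dictionary makes it easy to rank columns by their projected pandas footprint; a small sketch over the keys shown above:
# Sort columns by estimated pandas memory, largest first.
heaviest = sorted(report["columns"].items(),
                  key=lambda kv: kv[1]["pandas_bytes"], reverse=True)
for name, col in heaviest:
    print(f"{name:20s} {col['pandas_bytes'] / 1e6:8.1f} MB")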
Total running time of the script: (0 minutes 5.267 seconds)