.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/08_memory_usage.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_08_memory_usage.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_08_memory_usage.py:


Memory Usage
============

Parquet files are compressed, columnar data files. This is great for storage and
performance, but when working with large datasets it is useful to understand
memory usage by column. This enables the user to optimise their data processing
and storage strategies.

.. GENERATED FROM PYTHON SOURCE LINES 10-19

.. code-block:: Python

    import tempfile

    import pandas as pd
    from pathlib import Path

    from parq_tools import ParquetProfileReport
    from parq_tools.utils.demo_block_model import create_demo_blockmodel
    from parq_tools.utils.memory_utils import parquet_memory_usage, print_parquet_memory_usage

.. GENERATED FROM PYTHON SOURCE LINES 20-22

Create a Parquet file for profiling
-----------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 22-37

.. code-block:: Python

    temp_dir = Path(tempfile.gettempdir()) / "memory_usage_example"
    temp_dir.mkdir(parents=True, exist_ok=True)

    parquet_file_path: Path = temp_dir / "test_blockmodel.parquet"

    # Create a reasonably large model example
    df: pd.DataFrame = create_demo_blockmodel(
        shape=(300, 100, 100), block_size=(10, 10, 5), corner=(0, 0, 0)
    )
    # Add a string column and a categorical column
    df["depth_as_string"] = df["depth"].astype(str)
    df["depth_as_category"] = pd.Categorical(df["depth"].astype(str))
    df.to_parquet(parquet_file_path)
    print("Shape:", df.shape)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Shape: (3000000, 5)

.. GENERATED FROM PYTHON SOURCE LINES 38-41

Memory usage reports
--------------------
Generate a memory usage report for the Parquet file in several ways.
.. GENERATED FROM PYTHON SOURCE LINES 43-45

Full report with index marking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Columns passed via ``index_columns`` are marked with an asterisk in the report.

.. GENERATED FROM PYTHON SOURCE LINES 45-49

.. code-block:: Python

    report = parquet_memory_usage(parquet_file_path, index_columns=["x", "y", "z"])
    print("\nFull memory usage report (with pandas):")
    print_parquet_memory_usage(report)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Full memory usage report (with pandas):
    Shape: (3000000, 8)
    Total compressed: 27.9 MB
    Total decompressed (Arrow): 167.5 MB
    Total pandas: 294.4 MB

    Per-column breakdown:
    Column             Dtype         Compressed  Arrow     Pandas    Dtype details
    c_order_xyz        int64         12.2 MB     23.2 MB   22.9 MB
    f_order_zyx        int64         15.2 MB     23.2 MB   22.9 MB
    depth              double        136.1 KB    23.2 MB   22.9 MB
    depth_as_string    string        135.9 KB    25.1 MB   153.9 MB
    depth_as_category  dictionary<>  135.8 KB    2.9 MB    3.2 MB    values=string, indices=int8, ordered=0
    x*                 double        2.8 KB      23.2 MB   22.9 MB
    y*                 double        6.6 KB      23.2 MB   22.9 MB
    z*                 double        136.1 KB    23.2 MB   22.9 MB

.. GENERATED FROM PYTHON SOURCE LINES 50-52

Report without pandas memory usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 52-56

.. code-block:: Python

    report_no_pandas = parquet_memory_usage(parquet_file_path, report_pandas=False)
    print("\nMemory usage report (Arrow only, no pandas):")
    print_parquet_memory_usage(report_no_pandas)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Memory usage report (Arrow only, no pandas):
    Shape: (3000000, 8)
    Total compressed: 27.9 MB
    Total decompressed (Arrow): 167.5 MB

    Per-column breakdown:
    Column             Dtype         Compressed  Arrow     Pandas    Dtype details
    c_order_xyz        int64         12.2 MB     23.2 MB   -
    f_order_zyx        int64         15.2 MB     23.2 MB   -
    depth              double        136.1 KB    23.2 MB   -
    depth_as_string    string        135.9 KB    25.1 MB   -
    depth_as_category  dictionary<>  135.8 KB    2.9 MB    -         values=string, indices=int8, ordered=0
    x                  double        2.8 KB      23.2 MB   -
    y                  double        6.6 KB      23.2 MB   -
    z                  double        136.1 KB    23.2 MB   -
.. GENERATED FROM PYTHON SOURCE LINES 57-59

Report for a subset of columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 59-64

.. code-block:: Python

    subset_cols = ["x", "y", "depth", "depth_as_category"]
    report_subset = parquet_memory_usage(parquet_file_path, columns=subset_cols, index_columns=["x", "y"])
    print("\nMemory usage report (subset of columns):")
    print_parquet_memory_usage(report_subset)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Memory usage report (subset of columns):
    Shape: (3000000, 4)
    Total compressed: 281.3 KB
    Total decompressed (Arrow): 72.6 MB
    Total pandas: 71.8 MB

    Per-column breakdown:
    Column             Dtype         Compressed  Arrow     Pandas    Dtype details
    x*                 double        2.8 KB      23.2 MB   22.9 MB
    y*                 double        6.6 KB      23.2 MB   22.9 MB
    depth              double        136.1 KB    23.2 MB   22.9 MB
    depth_as_category  dictionary<>  135.8 KB    2.9 MB    3.2 MB    values=string, indices=int8, ordered=0

.. GENERATED FROM PYTHON SOURCE LINES 65-68

Accessing the structured dictionary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The report is a structured dictionary, which is useful for programmatic use.

.. GENERATED FROM PYTHON SOURCE LINES 68-70

.. code-block:: Python

    print("\nAccessing the structured dictionary:")
    print({k: v for k, v in report["columns"].items() if k in subset_cols})

.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Accessing the structured dictionary:
    {'depth': {'compressed_bytes': 139375, 'decompressed_bytes': 24375000, 'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': False}, 'depth_as_category': {'compressed_bytes': 139040, 'decompressed_bytes': 3028096, 'pandas_bytes': 3305472, 'dtype': 'dictionary', 'is_index': False}, 'x': {'compressed_bytes': 2820, 'decompressed_bytes': 24375000, 'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': True}, 'y': {'compressed_bytes': 6804, 'decompressed_bytes': 24375000, 'pandas_bytes': 24000000, 'dtype': 'double', 'is_index': True}}

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 5.267 seconds)
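The per-column entries printed above are plain dictionaries, so they can be
post-processed with ordinary Python. The following is a minimal sketch and not
part of ``parq_tools``: the values are copied by hand from the output above so
that the snippet runs on its own, and ``expansion_ratios`` is a hypothetical
helper introduced only for illustration. It ranks columns by how much they
expand when going from compressed Parquet to pandas memory.

.. code-block:: Python

    # Per-column entries copied from the report["columns"] output shown above.
    columns = {
        "depth": {"compressed_bytes": 139375, "decompressed_bytes": 24375000,
                  "pandas_bytes": 24000000, "dtype": "double", "is_index": False},
        "depth_as_category": {"compressed_bytes": 139040, "decompressed_bytes": 3028096,
                              "pandas_bytes": 3305472, "dtype": "dictionary", "is_index": False},
        "x": {"compressed_bytes": 2820, "decompressed_bytes": 24375000,
              "pandas_bytes": 24000000, "dtype": "double", "is_index": True},
        "y": {"compressed_bytes": 6804, "decompressed_bytes": 24375000,
              "pandas_bytes": 24000000, "dtype": "double", "is_index": True},
    }


    def expansion_ratios(cols):
        """Hypothetical helper: pandas_bytes / compressed_bytes, largest first."""
        ratios = {
            name: info["pandas_bytes"] / info["compressed_bytes"]
            for name, info in cols.items()
        }
        return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


    ratios = expansion_ratios(columns)
    for name, ratio in ratios.items():
        # Mark index columns with an asterisk, as the printed report does.
        marker = "*" if columns[name]["is_index"] else ""
        print(f"{name + marker:<20}{ratio:>10.1f}x")

    # Sum the pandas footprint and convert bytes to units of 2**20 bytes.
    total_pandas_mib = sum(info["pandas_bytes"] for info in columns.values()) / 2**20
    print(f"Total pandas: {total_pandas_mib:.1f}")

With these values the highly repetitive coordinate columns ``x`` and ``y``
rank first, followed by ``depth`` and then ``depth_as_category``. The summed
pandas size comes to about 71.8 when divided by 2**20, which agrees with the
``Total pandas`` figure in the subset report and suggests that the report's
MB values are binary megabytes.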
.. _sphx_glr_download_auto_examples_08_memory_usage.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 08_memory_usage.ipynb <08_memory_usage.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 08_memory_usage.py <08_memory_usage.py>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery