.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/09_lazy_parquet_df.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_09_lazy_parquet_df.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_09_lazy_parquet_df.py:

LazyParquetDF
=============

A focused example demonstrating the LazyParquetDF "lazyframe" API for indexed
Parquet loading, lazy column access, filtering, and chunked iteration and
saving. It complements the filtering and memory-usage examples by showing how
to work interactively with a Parquet-backed, DataFrame-like object without
loading all columns into memory at once.

.. GENERATED FROM PYTHON SOURCE LINES 13-21

.. code-block:: Python

    from pathlib import Path
    import tempfile

    import pandas as pd

    from parq_tools.lazy_parquet import LazyParquetDF

.. GENERATED FROM PYTHON SOURCE LINES 22-29

Create a Parquet file
---------------------

We first build a small DataFrame. In practice this could be a much larger
dataset. Here we keep ``"i"`` as a regular data column rather than an index so
that it can be referenced directly in lazy operations such as ``filter`` and
``iter_row_chunks``.

.. GENERATED FROM PYTHON SOURCE LINES 29-45

.. code-block:: Python

    def create_parquet(path: Path) -> None:
        df = pd.DataFrame(
            {
                "i": [0, 1, 2, 3],
                "j": [10, 11, 12, 13],
                "value": [1.0, 2.0, 3.0, 4.0],
            }
        )
        df.to_parquet(path)


    parquet_path = Path(tempfile.gettempdir()) / "lazyparquetdf_example.parquet"
    create_parquet(parquet_path)

.. GENERATED FROM PYTHON SOURCE LINES 46-53

Construct a LazyParquetDF
-------------------------

When no ``index_col`` is given, LazyParquetDF reconstructs the logical index
from the pandas metadata stored in the Parquet file. For this file that is
just a simple RangeIndex, and ``"i"`` remains a regular data column.

.. GENERATED FROM PYTHON SOURCE LINES 53-59
.. code-block:: Python

    lazy = LazyParquetDF(parquet_path)

    print("Shape:", lazy.shape)
    print("Columns:", lazy.columns)
    print("Index name:", lazy.index.name)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Shape: (4, 3)
    Columns: ['i', 'j', 'value']
    Index name: None

.. GENERATED FROM PYTHON SOURCE LINES 60-65

Lazy column access
------------------

Columns are loaded on demand. The ``dtypes`` property reports only the
columns that have been materialised so far.

.. GENERATED FROM PYTHON SOURCE LINES 65-77

.. code-block:: Python

    print("Dtypes before loading any column:")
    print(lazy.dtypes)

    # Access a single column; only this column is loaded into memory.
    value_series = lazy["value"]
    print("Loaded 'value' column:")
    print(value_series)

    print("Dtypes after loading 'value':")
    print(lazy.dtypes)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Dtypes before loading any column:
    Series([], dtype: object)
    Loaded 'value' column:
    0    1.0
    1    2.0
    2    3.0
    3    4.0
    Name: value, dtype: float64
    Dtypes after loading 'value':
    value    float64
    dtype: object

.. GENERATED FROM PYTHON SOURCE LINES 78-84

Filtering and query
-------------------

``filter`` takes a PyArrow-style predicate and reads only the columns needed
to evaluate it, while ``query`` operates on a materialised pandas DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 84-93

.. code-block:: Python

    filtered = lazy.filter(("i", ">", 1))
    print("\nFiltered rows where i > 1 (filter):")
    print(filtered)

    queried = lazy.query("value > 2.0")
    print("\nFiltered rows where value > 2.0 (query):")
    print(queried)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Filtered rows where i > 1 (filter):
       i
    0  2
    1  3

    Filtered rows where value > 2.0 (query):
       i   j  value
    2  2  12    3.0
    3  3  13    4.0

.. GENERATED FROM PYTHON SOURCE LINES 94-100

Chunked iteration and saving
----------------------------

We can add derived columns, then iterate over the dataset in row-wise chunks
and save back to Parquet without holding the full DataFrame in memory.

.. GENERATED FROM PYTHON SOURCE LINES 100-116
.. code-block:: Python

    lazy["double"] = lazy["value"] * 2

    print("\nIterating in chunks of size 2 (columns i and double):")
    for chunk in lazy.iter_row_chunks(chunk_size=2, columns=["i", "double"]):
        print(chunk)

    out_path = parquet_path.with_name("lazyparquetdf_example_out.parquet")

    # Save using chunked write-back.
    lazy.to_parquet(out_path, allow_overwrite=True, chunk_size=2)

    print("\nRound-trip check:")
    roundtrip = pd.read_parquet(out_path)
    print(roundtrip)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Iterating in chunks of size 2 (columns i and double):
       i  double
    0  0     2.0
    1  1     4.0
       i  double
    2  2     6.0
    3  3     8.0

    Round-trip check:
       i   j  value  double
    0  0  10    1.0     2.0
    1  1  11    2.0     4.0
    2  2  12    3.0     6.0
    3  3  13    4.0     8.0

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.048 seconds)

.. _sphx_glr_download_auto_examples_09_lazy_parquet_df.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 09_lazy_parquet_df.ipynb <09_lazy_parquet_df.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 09_lazy_parquet_df.py <09_lazy_parquet_df.py>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_