LazyParquetDF
A focused example demonstrating the LazyParquetDF “lazyframe” API for indexed Parquet loading, lazy column access, filtering, and chunked iteration / saving.
This complements the filtering and memory-usage examples by showing how to work interactively with a Parquet-backed DataFrame-like object without loading all columns into memory at once.
```python
from pathlib import Path
import tempfile

import pandas as pd

from parq_tools.lazy_parquet import LazyParquetDF
```
Create a Parquet file
We first build a small DataFrame. In practice this could be a much larger
dataset. Here we keep "i" as a regular data column rather than an index
so that it can be referenced directly in lazy operations like filter
and iter_row_chunks.
```python
def create_parquet(path: Path) -> None:
    df = pd.DataFrame(
        {
            "i": [0, 1, 2, 3],
            "j": [10, 11, 12, 13],
            "value": [1.0, 2.0, 3.0, 4.0],
        }
    )
    df.to_parquet(path)


parquet_path = Path(tempfile.gettempdir()) / "lazyparquetdf_example.parquet"
create_parquet(parquet_path)
```
Construct a LazyParquetDF
When no index_col is given, LazyParquetDF reconstructs the logical
index using the pandas metadata stored in the Parquet file. For this
file it will just be a simple RangeIndex, and "i" remains a regular
data column.
```python
lazy = LazyParquetDF(parquet_path)

print("Shape:", lazy.shape)
print("Columns:", lazy.columns)
print("Index name:", lazy.index.name)
```
```
Shape: (4, 3)
Columns: ['i', 'j', 'value']
Index name: None
```
Lazy column access
Columns are loaded on demand. The dtypes property only reports
columns that have been materialised so far.
```python
print("Dtypes before loading any column:")
print(lazy.dtypes)

# Access a single column; only this column is loaded into memory.
value_series = lazy["value"]
print("Loaded 'value' column:")
print(value_series)

print("Dtypes after loading 'value':")
print(lazy.dtypes)
```
```
Dtypes before loading any column:
Series([], dtype: object)
Loaded 'value' column:
0    1.0
1    2.0
2    3.0
3    4.0
Name: value, dtype: float64
Dtypes after loading 'value':
value    float64
dtype: object
```
Filtering and query
filter uses a PyArrow-style predicate and reads only the columns
needed for the filter, while query operates on a materialised
pandas DataFrame.
```python
filtered = lazy.filter(("i", ">", 1))
print("\nFiltered rows where i > 1 (filter):")
print(filtered)

queried = lazy.query("value > 2.0")
print("\nFiltered rows where value > 2.0 (query):")
print(queried)
```
```
Filtered rows where i > 1 (filter):
   i
0  2
1  3

Filtered rows where value > 2.0 (query):
   i   j  value
2  2  12    3.0
3  3  13    4.0
```
Chunked iteration and saving
We can add derived columns, then iterate over the dataset in row-wise chunks and save back to Parquet without holding the full DataFrame in memory.
```python
lazy["double"] = lazy["value"] * 2

print("\nIterating in chunks of size 2 (columns i and double):")
for chunk in lazy.iter_row_chunks(chunk_size=2, columns=["i", "double"]):
    print(chunk)

out_path = parquet_path.with_name("lazyparquetdf_example_out.parquet")

# Save using chunked write-back.
lazy.to_parquet(out_path, allow_overwrite=True, chunk_size=2)

print("\nRound-trip check:")
roundtrip = pd.read_parquet(out_path)
print(roundtrip)
```
```
Iterating in chunks of size 2 (columns i and double):
   i  double
0  0     2.0
1  1     4.0
   i  double
2  2     6.0
3  3     8.0

Round-trip check:
   i   j  value  double
0  0  10    1.0     2.0
1  1  11    2.0     4.0
2  2  12    3.0     6.0
3  3  13    4.0     8.0
```
Total running time of the script: (0 minutes 0.048 seconds)