parq_tools.lazy_parquet.LazyParquetDF
- class parq_tools.lazy_parquet.LazyParquetDF(path, index_col=None)[source]
Lazy, column-on-demand DataFrame backed by a Parquet file.
This lightweight, DataFrame-like object exposes a familiar subset of the pandas API, but loads data lazily from a Parquet file. Columns are only materialized into memory when they are first accessed.
- Parameters:
path (Path) – Path to the Parquet file.
index_col (str or sequence of str, optional) – Optional column(s) to use as the index. If provided, those columns are eagerly loaded and set as the index (supporting both single index and MultiIndex).
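Example (a minimal construction sketch; the file path and column name are hypothetical):
>>> from pathlib import Path
>>> from parq_tools.lazy_parquet import LazyParquetDF
>>> df = LazyParquetDF(Path("data.parquet"))                        # nothing materialized yet
>>> df_by_id = LazyParquetDF(Path("data.parquet"), index_col="id")  # "id" is loaded eagerly as the index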
Methods
- __init__(path[, index_col])
- add_column(name, data) – Explicit helper for adding a new column (df[name] = data).
- describe([percentiles, include, exclude, ...]) – Generate descriptive statistics of the dataset.
- filter(*predicates) – Filter rows using explicit PyArrow-style predicate tuples.
- head([n]) – Return the first n rows as a pandas DataFrame.
- info([buf]) – Print a concise summary of the lazy Parquet-backed DataFrame.
- iter_row_chunks([chunk_size, columns]) – Iterate over the dataset in row-wise chunks.
- load_columns(columns) – Eagerly load one or more columns into the internal cache.
- query(expr) – Evaluate a boolean expression using pandas-style query syntax.
- save(*[, allow_overwrite, chunk_size]) – Save the logical DataFrame back to its original Parquet path.
- to_pandas() – Materialize all columns as a pandas DataFrame.
- to_parquet(path, *[, allow_overwrite, ...]) – Write the logical DataFrame to a Parquet file.
Attributes
- columns – List of all logical column names.
- dtypes – Return dtypes for columns currently materialized in the cache.
- index – Index for the dataset.
- shape – Tuple of (number of rows, number of columns).
- add_column(name, data)[source]
Explicit helper for adding a new column (df[name] = data).
- Return type:
None
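Example (a sketch continuing from the constructor example; the column name and values are hypothetical, and the data must match the row count):
>>> import pandas as pd
>>> df.add_column("flag", pd.Series([True, False, True]))  # same effect as df["flag"] = ...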
- property columns: List[str]
List of all logical column names.
This includes Parquet schema columns in schema order, minus any index columns (when the index is constructed from pandas metadata or via the explicit index_col argument), followed by any new columns that have been added via assignment. The order is preserved across operations and is used by chunked iteration and write-back.
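Example (a sketch reusing the imports from the constructor example, assuming a hypothetical two-row file with schema columns a and b plus an index column id):
>>> df = LazyParquetDF(Path("data.parquet"), index_col="id")
>>> df.columns                        # schema order, index column excluded
['a', 'b']
>>> df.add_column("flag", pd.Series([True, False]))
>>> df.columns                        # added columns follow the schema columns
['a', 'b', 'flag']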
- describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)[source]
Generate descriptive statistics of the dataset.
- Return type:
DataFrame
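Example (a sketch; the percentiles are arbitrary, and include="all" is assumed to follow the pandas convention):
>>> stats = df.describe(percentiles=[0.1, 0.9], include="all")
>>> stats.loc["mean"]                 # one row per statistic, as in pandas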
- property dtypes: Series
Return dtypes for columns currently materialized in the cache.
This mirrors pandas.DataFrame.dtypes: only columns that actually exist in the internal pandas DataFrame are reported. Lazily-backed columns that have not yet been loaded are not included; use info() to inspect all available columns and their lazy/loaded status.
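Example (a sketch of the loaded-only behavior; the column name is hypothetical):
>>> df = LazyParquetDF(Path("data.parquet"))
>>> df.dtypes                         # empty: nothing materialized yet
>>> df.load_columns(["a"])
>>> df.dtypes                         # now reports the dtype of "a" only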
- filter(*predicates)[source]
Filter rows using explicit PyArrow-style predicate tuples.
- Return type:
DataFrame
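Example (predicates use the PyArrow (column, op, value) tuple form; the column names and values are hypothetical):
>>> subset = df.filter(("year", ">=", 2020), ("region", "==", "EU"))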
- property index: Index
Index for the dataset.
Returns a RangeIndex when no data has been loaded yet, or the index of the internal cached DataFrame (which may be a MultiIndex).
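Example (a sketch; the column name is hypothetical):
>>> fresh = LazyParquetDF(Path("data.parquet"))
>>> fresh.index                       # RangeIndex: nothing loaded yet
>>> fresh.load_columns(["a"])
>>> fresh.index                       # index of the internal cached DataFrame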
- info(buf=None)[source]
Print a concise summary of the lazy Parquet-backed DataFrame.
- Return type:
None
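Example (a sketch; it is assumed that buf accepts any writable text stream, mirroring pandas.DataFrame.info):
>>> import io
>>> buf = io.StringIO()
>>> df.info(buf=buf)                  # write the summary to the buffer instead of stdout
>>> summary = buf.getvalue()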
- iter_row_chunks(chunk_size=100000, columns=None)[source]
Iterate over the dataset in row-wise chunks.
- Return type:
Iterable[DataFrame]
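Example (a streaming aggregation sketch; the chunk size and column names are hypothetical):
>>> total_rows = 0
>>> for chunk in df.iter_row_chunks(chunk_size=50_000, columns=["a", "b"]):
...     total_rows += len(chunk)      # each chunk is an ordinary pandas DataFrame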
- load_columns(columns)[source]
Eagerly load one or more columns into the internal cache.
- Return type:
None
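Example (a sketch; the column names are hypothetical):
>>> df.load_columns(["a", "b"])       # one eager read covering both columns
>>> df.dtypes                         # the loaded columns now appear here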
- query(expr)[source]
Evaluate a boolean expression using pandas-style query syntax.
- Return type:
DataFrame
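Example (a sketch; the expression and column names are hypothetical):
>>> hits = df.query("a > 0 and b == 'x'")   # returns a pandas DataFrame of matching rows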
- save(*, allow_overwrite=False, chunk_size=100000, **pq_write_kwargs)[source]
Save the logical DataFrame back to its original Parquet path.
- Return type:
None
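Example (a hedged sketch; since save targets the file the object was opened from, it is assumed that overwriting must be enabled explicitly):
>>> df.save(allow_overwrite=True, chunk_size=50_000)   # stream the logical DataFrame back to its original path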