parq_tools.lazy_parquet.LazyParquetDF
- class parq_tools.lazy_parquet.LazyParquetDF(path, index_col=None)[source]
Lazy, column-on-demand DataFrame backed by a Parquet file.
This lightweight, DataFrame-like object exposes a familiar subset of the pandas API, but loads data lazily from a Parquet file. Columns are only materialized into memory when they are first accessed.
- Parameters:
path (Path) – Path to the Parquet file.
index_col (str or sequence of str, optional) – Optional column(s) to use as the index. If provided, those columns are eagerly loaded and set as the index (supporting both single index and MultiIndex).
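Example (a minimal construction sketch; the file path and column name are hypothetical):
>>> from pathlib import Path
>>> from parq_tools.lazy_parquet import LazyParquetDF
>>> df = LazyParquetDF(Path("data.parquet"))                        # nothing materialized yet
>>> df_by_id = LazyParquetDF(Path("data.parquet"), index_col="id")  # "id" is loaded eagerly as the index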
Methods
- __init__(path[, index_col])
- add_column(name, data) – Explicit helper for adding a new column (df[name] = data).
- describe([percentiles, include, exclude, ...]) – Generate descriptive statistics of the dataset.
- filter(*predicates) – Filter rows using explicit PyArrow-style predicate tuples.
- head([n]) – Return the first n rows as a pandas DataFrame.
- info([buf]) – Print a concise summary of the lazy Parquet-backed DataFrame.
- iter_row_chunks([chunk_size, columns]) – Iterate over the dataset in row-wise chunks.
- load_columns(columns) – Eagerly load one or more columns into the internal cache.
- query(expr) – Evaluate a boolean expression using pandas-style query syntax.
- save(*[, allow_overwrite, chunk_size]) – Save the logical DataFrame back to its original Parquet path.
- to_pandas() – Materialize all columns as a pandas DataFrame.
- to_parquet(path, *[, allow_overwrite, ...]) – Write the logical DataFrame to a Parquet file.
Attributes
- columns – List of all logical column names.
- dtypes – Return dtypes for columns currently materialized in the cache.
- index – Index for the dataset.
- shape – Tuple of (number of rows, number of columns).
- add_column(name, data)[source]
Explicit helper for adding a new column (df[name] = data).
- Return type:
None
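Example (a sketch continuing from the constructor example; the column name and values are hypothetical, and the data must match the row count):
>>> import pandas as pd
>>> df.add_column("flag", pd.Series([True, False, True]))  # same effect as df["flag"] = ...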
- property columns: List[str]
List of all logical column names.
This includes Parquet schema columns in schema order, minus any index columns (when the index is constructed from pandas metadata or via the explicit index_col argument), followed by any new columns that have been added via assignment. The order is preserved across operations and is used by chunked iteration and write-back.
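Example (a sketch reusing the imports from the constructor example, assuming a hypothetical two-row file with schema columns a and b plus an index column id):
>>> df = LazyParquetDF(Path("data.parquet"), index_col="id")
>>> df.columns                        # schema order, index column excluded
['a', 'b']
>>> df.add_column("flag", pd.Series([True, False]))
>>> df.columns                        # added columns follow the schema columns
['a', 'b', 'flag']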
- describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)[source]
Generate descriptive statistics of the dataset.
- Return type:
DataFrame
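Example (a sketch; the percentiles are arbitrary, and include="all" is assumed to follow the pandas convention):
>>> stats = df.describe(percentiles=[0.1, 0.9], include="all")
>>> stats.loc["mean"]                 # one row per statistic, as in pandas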
- property dtypes: Series
Return dtypes for columns currently materialized in the cache.
This mirrors pandas.DataFrame.dtypes: only columns that actually exist in the internal pandas DataFrame are reported. Lazily-backed columns that have not yet been loaded are not included; use info() to inspect all available columns and their lazy/loaded status.
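Example (a sketch of the loaded-only behavior; the column name is hypothetical):
>>> df = LazyParquetDF(Path("data.parquet"))
>>> df.dtypes                         # empty: nothing materialized yet
>>> df.load_columns(["a"])
>>> df.dtypes                         # now reports the dtype of "a" only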
- filter(*predicates)[source]
Filter rows using explicit PyArrow-style predicate tuples.
- Return type:
DataFrame
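Example (predicates use the PyArrow (column, op, value) tuple form; the column names and values are hypothetical):
>>> subset = df.filter(("year", ">=", 2020), ("region", "==", "EU"))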
- property index: Index
Index for the dataset.
Returns a RangeIndex when no data has been loaded yet, or the index of the internal cached DataFrame (which may be a MultiIndex).
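Example (a sketch; the column name is hypothetical):
>>> fresh = LazyParquetDF(Path("data.parquet"))
>>> fresh.index                       # RangeIndex: nothing loaded yet
>>> fresh.load_columns(["a"])
>>> fresh.index                       # index of the internal cached DataFrame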
- info(buf=None)[source]
Print a concise summary of the lazy Parquet-backed DataFrame.
- Return type:
None
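Example (a sketch; it is assumed that buf accepts any writable text stream, mirroring pandas.DataFrame.info):
>>> import io
>>> buf = io.StringIO()
>>> df.info(buf=buf)                  # write the summary to the buffer instead of stdout
>>> summary = buf.getvalue()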
- iter_row_chunks(chunk_size=100000, columns=None)[source]
Iterate over the dataset in row-wise chunks.
- Return type:
Iterable[DataFrame]
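Example (a streaming aggregation sketch; the chunk size and column names are hypothetical):
>>> total_rows = 0
>>> for chunk in df.iter_row_chunks(chunk_size=50_000, columns=["a", "b"]):
...     total_rows += len(chunk)      # each chunk is an ordinary pandas DataFrame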
- load_columns(columns)[source]
Eagerly load one or more columns into the internal cache.
- Return type:
None
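Example (a sketch; the column names are hypothetical):
>>> df.load_columns(["a", "b"])       # one eager read covering both columns
>>> df.dtypes                         # the loaded columns now appear here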
- query(expr)[source]
Evaluate a boolean expression using pandas-style query syntax.
- Return type:
DataFrame
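Example (a sketch; the expression and column names are hypothetical):
>>> hits = df.query("a > 0 and b == 'x'")   # returns a pandas DataFrame of matching rows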
- save(*, allow_overwrite=False, chunk_size=100000, **pq_write_kwargs)[source]
Save the logical DataFrame back to its original Parquet path.
- Return type:
None
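Example (a hedged sketch; since save targets the file the object was opened from, it is assumed that overwriting must be enabled explicitly):
>>> df.save(allow_overwrite=True, chunk_size=50_000)   # stream the logical DataFrame back to its original path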