API Reference#
Core Modules#
df-eval: A lightweight expression evaluation engine for pandas DataFrames.
This package provides tools for evaluating expressions on pandas DataFrames, supporting schema-driven derived columns and external lookups.
- class df_eval.Engine[source]
Bases: `object`

Engine for evaluating expressions on pandas DataFrames.
The Engine class provides methods to evaluate expressions, apply transformations, and manage UDF/constant registries.
- __init__()[source]
Initialize the evaluation engine.
- enable_provenance(enabled=True)[source]
Enable or disable provenance tracking.
- register_function(name, func)[source]
Register a custom function (UDF) for use in expressions.
- register_constant(name, value)[source]
Register a constant for use in expressions.
- register_resolver(name, resolver)[source]
Register a lookup resolver for use in expressions.
Registered resolvers can be referenced by name from expressions via the `lookup()` helper, for example:

```python
engine.register_resolver("prices", price_resolver)
schema = {"price": "lookup(product, prices)"}
```
- register_pipeline_function(name, func)[source]
Register a named pipeline function for metadata-driven workflows.
Pipeline functions are invoked by higher-level orchestration layers (for example, Pandera-driven schemas) based on column metadata rather than being called directly from df-eval expression strings. A pipeline function typically accepts a
`pandas.DataFrame` slice and optional keyword arguments, and returns either a `Series` or `DataFrame` aligned with the input index.
- Return type:
- evaluate(df, expr, dtype=None)[source]
Evaluate an expression on a DataFrame.
- Parameters:
  - df (`DataFrame`) – The DataFrame to evaluate the expression on.
  - expr (`str | Expression`) – The expression to evaluate (string or Expression object).
  - dtype (`Optional[str]`) – Optional dtype to cast the result to.
- Return type:
- Returns:
The result of evaluating the expression.
- Raises:
ValueError – If the expression is invalid.
- evaluate_many(df, expressions)[source]
Evaluate multiple expressions and add them as columns.
This is an alias for apply_schema for batch evaluation.
- Parameters:
  - df (`DataFrame`) – The input DataFrame.
  - expressions (`Dict[str, str | Expression]`) – A dictionary mapping column names to expressions.
- Return type:
  DataFrame
- Returns:
A new DataFrame with the evaluated columns added.
- apply_schema(df, schema, dtypes=None)[source]
Apply a schema of derived columns to a DataFrame with topological ordering.
This method automatically handles dependencies between columns and detects cycles in the dependency graph.
- Parameters:
- Return type:
  DataFrame
- Returns:
A new DataFrame with the derived columns added.
- Raises:
CycleDetectedError – If a cycle is detected in dependencies.
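The topological ordering and cycle detection described above can be sketched with a plain depth-first sort. The following is an illustrative stdlib-only sketch, not df-eval's actual implementation: it uses a crude regex in place of `Expression.dependencies`, and raises `ValueError` where the real engine raises `CycleDetectedError`.

```python
import re

def toposort_schema(schema):
    """Order derived columns so each is computed after its dependencies.

    Dependencies are the identifiers an expression references that are
    themselves schema keys (a regex stand-in for Expression.dependencies).
    """
    deps = {
        col: set(re.findall(r"[A-Za-z_]\w*", expr)) & set(schema)
        for col, expr in schema.items()
    }
    order, done, visiting = [], set(), set()

    def visit(col):
        if col in done:
            return
        if col in visiting:
            # Back edge in the dependency graph: a cycle.
            raise ValueError(f"cycle involving {col!r}")
        visiting.add(col)
        for dep in sorted(deps[col]):
            visit(dep)
        visiting.discard(col)
        done.add(col)
        order.append(col)

    for col in schema:
        visit(col)
    return order

print(toposort_schema({"c": "a + b", "d": "c * 2", "b": "a + 1"}))  # → ['b', 'c', 'd']
```

Note that `b` and `c` are scheduled before `d` even though `d` appears earlier in the mapping; insertion order only breaks ties between independent columns.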
- apply_operations(df, operations, dtypes=None)[source]
Apply a set of operations (expr, lookup, function) to a DataFrame.
`operations` is a mapping from column name to a spec with keys:

```
{
    "kind": "expr" | "lookup" | "function",
    "expr": str | None,
    "lookup": dict | None,
    "function": dict | None,
}
```
This is intended to be used by higher-level integrations such as the Pandera helpers, which translate column metadata into this structure.
- Return type:
DataFrame
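The operations mapping can be consumed by a simple dispatch on `"kind"`. The sketch below mimics the documented spec shape over a plain dict "row" (df-eval operates on DataFrames); the `eval()` call and dict-based resolver are illustrative stand-ins, not df-eval's internals.

```python
def apply_ops(row, operations, resolvers, functions):
    """Dispatch each column spec on its "kind", mirroring the documented
    {"kind": ..., "expr": ..., "lookup": ..., "function": ...} shape."""
    out = dict(row)
    for column, spec in operations.items():
        kind = spec["kind"]
        if kind == "expr":
            # df-eval parses and evaluates expressions safely; bare eval()
            # over the row is only a stand-in for illustration.
            out[column] = eval(spec["expr"], {}, out)
        elif kind == "lookup":
            resolver = resolvers[spec["lookup"]["resolver"]]
            out[column] = resolver.get(out[spec["lookup"]["key"]])
        elif kind == "function":
            fn = functions[spec["function"]["name"]]
            out[column] = fn(*(out[name] for name in spec["function"]["inputs"]))
        else:
            raise ValueError(f"unknown kind: {kind!r}")
    return out

ops = {
    "total": {"kind": "expr", "expr": "a + b", "lookup": None, "function": None},
    "price": {"kind": "lookup", "expr": None,
              "lookup": {"resolver": "prices", "key": "product"}, "function": None},
}
row = apply_ops({"a": 2, "b": 3, "product": "p1"}, ops, {"prices": {"p1": 9.5}}, {})
print(row["total"], row["price"])  # → 5 9.5
```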
- apply_pandera_schema(df, schema, **kwargs)[source]
Apply a Pandera schema and derive df-eval columns from metadata.
This is a thin convenience wrapper around
`df_eval.pandera.apply_pandera_schema` that forwards the current engine instance so registered functions/constants and provenance settings are honored.
- Return type:
DataFrame
- iter_apply_schema_parquet_chunks(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]
Yield transformed chunks from a Parquet file or dataset.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to scan and transform per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Yields:
Transformed DataFrame chunks.
- Return type:
Iterator[DataFrame]
- apply_schema_parquet_to_df(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]
Transform a Parquet dataset chunk-by-chunk and return one DataFrame.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Return type:
  DataFrame
- Returns:
A DataFrame containing all transformed rows. Returns an empty DataFrame when the input yields no row chunks.
- apply_schema_parquet_to_parquet(input_path, output_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None, compression='snappy')[source]
Transform a Parquet dataset chunk-by-chunk and write Parquet output.
This method is optimized for out-of-memory processing: source data is streamed in row chunks, transformed with the same expression engine used for in-memory DataFrames, and written incrementally to
`output_path`.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized `output_path`.
- class df_eval.Expression(expr_str)[source]
Bases: `object`

Represents a parsed expression that can be evaluated on a DataFrame.
- expr_str
The string representation of the expression.
- dependencies
Set of column names referenced in the expression.
- __init__(expr_str)[source]
Initialize an Expression.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- static parse(expr_str)[source]
Parse an expression string into an Expression object.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- Return type:
- Returns:
An Expression object.
- Raises:
ValueError – If the expression is invalid.
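A dependency set like `Expression.dependencies` can be approximated with the stdlib `ast` module: the referenced columns are the free names in the parsed expression, excluding called function names such as `lookup`. This is a hedged sketch, not df-eval's actual parser.

```python
import ast

def expression_dependencies(expr_str):
    """Collect bare names referenced by an expression.

    Names used only as call targets (e.g. helper functions) are excluded,
    leaving the column names the expression depends on."""
    tree = ast.parse(expr_str, mode="eval")
    names, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.add(node.func.id)
        elif isinstance(node, ast.Name):
            names.add(node.id)
    return names - calls

print(sorted(expression_dependencies("safe_divide(a, b) + c")))  # → ['a', 'b', 'c']
```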
- exception df_eval.CycleDetectedError[source]
Bases: `Exception`

Raised when a cycle is detected in column dependencies.
- df_eval.lookup(series, resolver, on_missing='null')[source]
Lookup values using a resolver.
- Parameters:
  - series (`Series`) – The series containing keys to lookup.
  - resolver (`Resolver`) – The resolver to use for lookups.
  - on_missing (`str`) – How to handle missing values ("null", "raise", "keep").
    - "null": Return None/NaN for missing values
    - "raise": Raise an exception for missing values
    - "keep": Keep the original key value
- Return type:
  Series
- Returns:
A series with resolved values.
- Raises:
ValueError – If on_missing is “raise” and a key cannot be resolved.
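The three `on_missing` modes can be illustrated over a plain list of keys; the real `lookup()` operates on a pandas `Series` and a `Resolver`, so this stdlib sketch only mirrors the documented semantics.

```python
def lookup_values(keys, mapping, on_missing="null"):
    """Resolve each key against a dict, handling misses per on_missing."""
    out = []
    for key in keys:
        if key in mapping:
            out.append(mapping[key])
        elif on_missing == "null":
            out.append(None)       # "null": placeholder for missing values
        elif on_missing == "keep":
            out.append(key)        # "keep": pass the original key through
        elif on_missing == "raise":
            raise ValueError(f"cannot resolve key: {key!r}")
        else:
            raise ValueError(f"invalid on_missing: {on_missing!r}")
    return out

prices = {"p1": 9.5}
print(lookup_values(["p1", "p2"], prices))                     # → [9.5, None]
print(lookup_values(["p1", "p2"], prices, on_missing="keep"))  # → [9.5, 'p2']
```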
- class df_eval.CachedResolver(resolver, ttl_seconds=300.0)[source]
Bases: `Resolver`

Resolver with TTL cache support.
- __init__(resolver, ttl_seconds=300.0)[source]
Initialize a cached resolver.
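A TTL cache in the spirit of `CachedResolver` can be sketched with `time.monotonic`: delegate to an inner resolve callable and remember each result until its expiry. The class name and `resolve` signature below are assumptions for illustration, not df-eval's implementation.

```python
import time

class TTLCachedResolver:
    """Wrap a resolve(key) callable and cache results for ttl_seconds."""

    def __init__(self, resolve, ttl_seconds=300.0, clock=time.monotonic):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (expires_at, value)

    def resolve(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                       # still fresh: serve cached
        value = self._resolve(key)              # miss or expired: refetch
        self._cache[key] = (now + self._ttl, value)
        return value

calls = []
def slow_resolve(key):
    calls.append(key)
    return key.upper()

r = TTLCachedResolver(slow_resolve, ttl_seconds=60.0)
r.resolve("a")
r.resolve("a")
print(calls)  # → ['a'] — second call served from cache
```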
- class df_eval.DictResolver(mapping, default=None)[source]
Bases: `Resolver`

Simple dictionary-based resolver.
- __init__(mapping, default=None)[source]
Initialize a dictionary resolver.
- class df_eval.FileResolver(filepath, key_column, value_column)[source]
Bases: `Resolver`

File-based resolver that reads from CSV/JSON files.
- __init__(filepath, key_column, value_column)[source]
Initialize a file resolver.
- class df_eval.DatabaseResolver(connection_string, table, key_column, value_column)[source]
Bases: `Resolver`

Database resolver (placeholder for SQL database lookups).
- __init__(connection_string, table, key_column, value_column)[source]
Initialize a database resolver.
- class df_eval.HTTPResolver(base_url, key_param='key')[source]
Bases: `Resolver`

HTTP API resolver (placeholder for REST API lookups).
- __init__(base_url, key_param='key')[source]
Initialize an HTTP resolver.
- df_eval.df_eval_schema_from_pandera(schema, meta_key='df-eval', expr_key='expr')[source]
Build a df-eval schema mapping from Pandera per-column metadata.
- df_eval.apply_pandera_schema(df, schema, *, meta_key='df-eval', coerce=True, validate=True, validate_post=True, engine=None, error_on_overwrite=True)[source]
Validate with Pandera, apply df-eval operations, then optionally revalidate.
Columns that define df-eval metadata under
`meta_key` are considered derived and are excluded from pre-validation. This allows input frames that do not yet include derived columns.

The df-eval metadata for each column may currently define exactly one of the following keys:

```
{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "my_fn", "inputs": ["a"], "outputs": ["y"]}}
```

These are translated into an operations mapping consumed by `df_eval.engine.Engine.apply_operations()`.
- Return type:
DataFrame
- df_eval.apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, *, meta_key='df-eval', expr_key='expr', engine=None, chunk_size=100000, compression='snappy')[source]
Apply a Pandera-driven schema to Parquet input and write Parquet output.
The input scan is projected to only required source columns, and output columns are restricted to the Pandera schema order.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Any`) – Pandera SchemaModel/DataFrameModel class or DataFrameSchema.
  - meta_key (`str`) – Metadata section containing df-eval expressions.
  - expr_key (`str`) – Metadata key containing the expression text.
  - chunk_size (`int`) – Maximum rows processed per chunk.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized output path.
- df_eval.load_pandera_schema_yaml(source)[source]
Load a Pandera DataFrameSchema from YAML, preserving column metadata.
This is a thin, public wrapper around df-eval’s temporary fork of
`pandera.io.pandas_io`. It exists to work around unionai-oss/pandera#1301, where column `metadata` is not round-tripped by Pandera’s YAML/JSON IO helpers.
- df_eval.dump_pandera_schema_yaml(schema, stream=None)[source]
Dump a Pandera DataFrameSchema to YAML, preserving column metadata.
This uses df-eval’s temporary fork of
`pandera.io.pandas_io` so that column `metadata` survives a full IO round-trip. Once Pandera fixes unionai-oss/pandera#1301, this helper may be simplified to delegate directly to Pandera’s built-in IO.
- Parameters:
- Return type:
- Returns:
The YAML string if `stream` is `None`, otherwise `None`.
- df_eval.load_pandera_schema_json(source)[source]
Load a Pandera DataFrameSchema from JSON, preserving column metadata.
This mirrors
`load_pandera_schema_yaml()` but for JSON input.
- Return type:
- df_eval.dump_pandera_schema_json(schema, target=None, **kwargs)[source]
Dump a Pandera DataFrameSchema to JSON, preserving column metadata.
- Parameters:
- Return type:
- df_eval.iter_parquet_row_chunks(path, *, chunk_size=100000, columns=None)[source]
Yield Parquet rows as pandas DataFrame chunks.
This treats a Parquet file or directory-backed Parquet dataset as an out-of-memory DataFrame and streams it into manageable in-memory chunks.
- Parameters:
- Yields:
DataFrame chunks with at most
`chunk_size` rows.
- Raises:
  - FileNotFoundError – If `path` does not exist.
  - TypeError – If `path`, `chunk_size`, or `columns` have invalid types.
  - ValueError – If `chunk_size` is less than 1.
  - ImportError – If `pyarrow` is not installed.
- Return type:
Iterator[DataFrame]
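The chunking contract (each yielded chunk holds at most `chunk_size` rows; `chunk_size` below 1 is rejected) can be illustrated with a stdlib generator over any row sequence. Actually scanning Parquet requires `pyarrow`; the list here is a stand-in.

```python
def iter_row_chunks(rows, chunk_size=100000):
    """Yield consecutive slices of at most chunk_size rows, mirroring the
    documented chunking and validation behavior."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

chunks = list(iter_row_chunks(list(range(7)), chunk_size=3))
print([len(c) for c in chunks])  # → [3, 3, 1]
```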
- df_eval.write_parquet_row_chunks(chunks, output_path, *, compression='snappy')[source]
Write DataFrame chunks to a Parquet file.
- Parameters:
- Return type:
- Returns:
The normalized output path.
- Raises:
  - TypeError – If `output_path`, `compression`, or chunk values are invalid.
  - ValueError – If `compression` is empty or no chunks are provided.
  - ImportError – If `pyarrow` is not installed.
Expression Module#
Expression parsing and representation module.
This module provides the Expression class for parsing and representing expressions that can be evaluated on pandas DataFrames.
- class df_eval.expr.Expression(expr_str)[source]#
Bases: `object`

Represents a parsed expression that can be evaluated on a DataFrame.
- expr_str#
The string representation of the expression.
- dependencies#
Set of column names referenced in the expression.
- __init__(expr_str)[source]#
Initialize an Expression.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- static parse(expr_str)[source]#
Parse an expression string into an Expression object.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- Return type:
- Returns:
An Expression object.
- Raises:
ValueError – If the expression is invalid.
Engine Module#
Evaluation engine module.
This module provides the Engine class for evaluating expressions on pandas DataFrames with support for UDF registry, schema-driven derived columns with topological ordering, and provenance tracking.
- exception df_eval.engine.CycleDetectedError[source]#
Bases: `Exception`

Raised when a cycle is detected in column dependencies.
- class df_eval.engine.Engine[source]#
Bases: `object`

Engine for evaluating expressions on pandas DataFrames.
The Engine class provides methods to evaluate expressions, apply transformations, and manage UDF/constant registries.
- register_resolver(name, resolver)[source]#
Register a lookup resolver for use in expressions.
Registered resolvers can be referenced by name from expressions via the `lookup()` helper, for example:

```python
engine.register_resolver("prices", price_resolver)
schema = {"price": "lookup(product, prices)"}
```
- register_pipeline_function(name, func)[source]#
Register a named pipeline function for metadata-driven workflows.
Pipeline functions are invoked by higher-level orchestration layers (for example, Pandera-driven schemas) based on column metadata rather than being called directly from df-eval expression strings. A pipeline function typically accepts a
`pandas.DataFrame` slice and optional keyword arguments, and returns either a `Series` or `DataFrame` aligned with the input index.
- Return type:
- evaluate(df, expr, dtype=None)[source]#
Evaluate an expression on a DataFrame.
- Parameters:
  - df (`DataFrame`) – The DataFrame to evaluate the expression on.
  - expr (`str | Expression`) – The expression to evaluate (string or Expression object).
  - dtype (`Optional[str]`) – Optional dtype to cast the result to.
- Return type:
- Returns:
The result of evaluating the expression.
- Raises:
ValueError – If the expression is invalid.
- evaluate_many(df, expressions)[source]#
Evaluate multiple expressions and add them as columns.
This is an alias for apply_schema for batch evaluation.
- Parameters:
  - df (`DataFrame`) – The input DataFrame.
  - expressions (`Dict[str, str | Expression]`) – A dictionary mapping column names to expressions.
- Return type:
  DataFrame
- Returns:
A new DataFrame with the evaluated columns added.
- apply_schema(df, schema, dtypes=None)[source]#
Apply a schema of derived columns to a DataFrame with topological ordering.
This method automatically handles dependencies between columns and detects cycles in the dependency graph.
- Parameters:
- Return type:
  DataFrame
- Returns:
A new DataFrame with the derived columns added.
- Raises:
CycleDetectedError – If a cycle is detected in dependencies.
- apply_operations(df, operations, dtypes=None)[source]#
Apply a set of operations (expr, lookup, function) to a DataFrame.
`operations` is a mapping from column name to a spec with keys:

```
{
    "kind": "expr" | "lookup" | "function",
    "expr": str | None,
    "lookup": dict | None,
    "function": dict | None,
}
```
This is intended to be used by higher-level integrations such as the Pandera helpers, which translate column metadata into this structure.
- Return type:
DataFrame
- apply_pandera_schema(df, schema, **kwargs)[source]#
Apply a Pandera schema and derive df-eval columns from metadata.
This is a thin convenience wrapper around
`df_eval.pandera.apply_pandera_schema` that forwards the current engine instance so registered functions/constants and provenance settings are honored.
- Return type:
DataFrame
- iter_apply_schema_parquet_chunks(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]#
Yield transformed chunks from a Parquet file or dataset.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to scan and transform per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Yields:
Transformed DataFrame chunks.
- Return type:
Iterator[DataFrame]
- apply_schema_parquet_to_df(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]#
Transform a Parquet dataset chunk-by-chunk and return one DataFrame.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Return type:
  DataFrame
- Returns:
A DataFrame containing all transformed rows. Returns an empty DataFrame when the input yields no row chunks.
- apply_schema_parquet_to_parquet(input_path, output_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None, compression='snappy')[source]#
Transform a Parquet dataset chunk-by-chunk and write Parquet output.
This method is optimized for out-of-memory processing: source data is streamed in row chunks, transformed with the same expression engine used for in-memory DataFrames, and written incrementally to
`output_path`.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized `output_path`.
Functions Module#
Built-in functions for expression evaluation.
This module provides built-in functions that can be used in expressions. These are safe, vectorized functions that are allow-listed for use in expressions.
- df_eval.functions.safe_divide(a, b)[source]#
Safely divide two values, returning NaN for division by zero.
- df_eval.functions.clip_value(value, min_val=None, max_val=None)[source]#
Clip values to a specified range.
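The behavior of these two helpers can be sketched for scalars with the stdlib; df-eval's versions are vectorized over Series, so this is a semantic illustration only.

```python
import math

def safe_divide(a, b):
    """Return NaN instead of raising on division by zero."""
    return math.nan if b == 0 else a / b

def clip_value(value, min_val=None, max_val=None):
    """Clip a scalar into [min_val, max_val]; either bound may be None."""
    if min_val is not None and value < min_val:
        return min_val
    if max_val is not None and value > max_val:
        return max_val
    return value

print(safe_divide(1, 0))      # → nan
print(clip_value(15, 0, 10))  # → 10
```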
Lookup Module#
Lookup functionality with resolvers for external data sources.
This module provides lookup capabilities with various resolvers (database, HTTP, file) and TTL caching.
- class df_eval.lookup.CachedResolver(resolver, ttl_seconds=300.0)[source]#
Bases: `Resolver`

Resolver with TTL cache support.
- class df_eval.lookup.DictResolver(mapping, default=None)[source]#
Bases: `Resolver`

Simple dictionary-based resolver.
- class df_eval.lookup.FileResolver(filepath, key_column, value_column)[source]#
Bases: `Resolver`

File-based resolver that reads from CSV/JSON files.
- class df_eval.lookup.DatabaseResolver(connection_string, table, key_column, value_column)[source]#
Bases: `Resolver`

Database resolver (placeholder for SQL database lookups).
- class df_eval.lookup.HTTPResolver(base_url, key_param='key')[source]#
Bases: `Resolver`

HTTP API resolver (placeholder for REST API lookups).
- df_eval.lookup.lookup(series, resolver, on_missing='null')[source]#
Lookup values using a resolver.
- Parameters:
  - series (`Series`) – The series containing keys to lookup.
  - resolver (`Resolver`) – The resolver to use for lookups.
  - on_missing (`str`) – How to handle missing values ("null", "raise", "keep").
    - "null": Return None/NaN for missing values
    - "raise": Raise an exception for missing values
    - "keep": Keep the original key value
- Return type:
  Series
- Returns:
A series with resolved values.
- Raises:
ValueError – If on_missing is “raise” and a key cannot be resolved.
Pandera Module#
Pandera integration helpers for df-eval.
This module keeps Pandera support optional and layered on top of the core Engine API by translating Pandera column metadata into a df-eval schema map.
- df_eval.pandera.df_eval_schema_from_pandera(schema, meta_key='df-eval', expr_key='expr')[source]#
Build a df-eval schema mapping from Pandera per-column metadata.
- df_eval.pandera.apply_pandera_schema(df, schema, *, meta_key='df-eval', coerce=True, validate=True, validate_post=True, engine=None, error_on_overwrite=True)[source]#
Validate with Pandera, apply df-eval operations, then optionally revalidate.
Columns that define df-eval metadata under
`meta_key` are considered derived and are excluded from pre-validation. This allows input frames that do not yet include derived columns.

The df-eval metadata for each column may currently define exactly one of the following keys:

```
{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "my_fn", "inputs": ["a"], "outputs": ["y"]}}
```

These are translated into an operations mapping consumed by `df_eval.engine.Engine.apply_operations()`.
- Return type:
DataFrame
- df_eval.pandera.apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, *, meta_key='df-eval', expr_key='expr', engine=None, chunk_size=100000, compression='snappy')[source]#
Apply a Pandera-driven schema to Parquet input and write Parquet output.
The input scan is projected to only required source columns, and output columns are restricted to the Pandera schema order.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Any`) – Pandera SchemaModel/DataFrameModel class or DataFrameSchema.
  - meta_key (`str`) – Metadata section containing df-eval expressions.
  - expr_key (`str`) – Metadata key containing the expression text.
  - chunk_size (`int`) – Maximum rows processed per chunk.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized output path.
- df_eval.pandera.df_eval_operations_from_pandera(schema, meta_key='df-eval')[source]#
Build a rich df-eval operations mapping from Pandera column metadata.
Each column may define one of the following under
`metadata[meta_key]`:

```
{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "churn_model_v1", "inputs": ["age"]}}
```
The returned mapping has the shape:
```
{
    "column_name": {
        "kind": "expr" | "lookup" | "function",
        "expr": str | None,
        "lookup": dict | None,
        "function": dict | None,
    },
}
```
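The translation step can be sketched over plain metadata dicts: pick the single operation key each column defines and normalize it into the shape above. This stdlib sketch stands in for reading real Pandera `Column.metadata`.

```python
def operations_from_metadata(column_metadata, meta_key="df-eval"):
    """Translate per-column metadata dicts into the documented operations
    mapping; columns without a df-eval section are skipped."""
    operations = {}
    for column, metadata in column_metadata.items():
        spec = (metadata or {}).get(meta_key)
        if not spec:
            continue  # plain (non-derived) column
        kinds = [k for k in ("expr", "lookup", "function") if k in spec]
        if len(kinds) != 1:
            raise ValueError(f"{column}: exactly one operation key required")
        operations[column] = {
            "kind": kinds[0],
            "expr": spec.get("expr"),
            "lookup": spec.get("lookup"),
            "function": spec.get("function"),
        }
    return operations

meta = {
    "a": None,
    "total": {"df-eval": {"expr": "a + b"}},
    "price": {"df-eval": {"lookup": {"resolver": "prices", "key": "product"}}},
}
ops = operations_from_metadata(meta)
print(ops["total"]["kind"], ops["price"]["kind"])  # → expr lookup
```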
- df_eval.pandera.load_pandera_schema_yaml(source)[source]#
Load a Pandera DataFrameSchema from YAML, preserving column metadata.
This is a thin, public wrapper around df-eval’s temporary fork of
`pandera.io.pandas_io`. It exists to work around unionai-oss/pandera#1301, where column `metadata` is not round-tripped by Pandera’s YAML/JSON IO helpers.
- df_eval.pandera.dump_pandera_schema_yaml(schema, stream=None)[source]#
Dump a Pandera DataFrameSchema to YAML, preserving column metadata.
This uses df-eval’s temporary fork of
`pandera.io.pandas_io` so that column `metadata` survives a full IO round-trip. Once Pandera fixes unionai-oss/pandera#1301, this helper may be simplified to delegate directly to Pandera’s built-in IO.
- Parameters:
- Return type:
- Returns:
The YAML string if `stream` is `None`, otherwise `None`.
- df_eval.pandera.load_pandera_schema_json(source)[source]#
Load a Pandera DataFrameSchema from JSON, preserving column metadata.
This mirrors
`load_pandera_schema_yaml()` but for JSON input.
- Return type:
Parquet Module#
- df_eval.parquet.iter_parquet_row_chunks(path, *, chunk_size=100000, columns=None)[source]#
Yield Parquet rows as pandas DataFrame chunks.
This treats a Parquet file or directory-backed Parquet dataset as an out-of-memory DataFrame and streams it into manageable in-memory chunks.
- Parameters:
- Yields:
DataFrame chunks with at most
`chunk_size` rows.
- Raises:
  - FileNotFoundError – If `path` does not exist.
  - TypeError – If `path`, `chunk_size`, or `columns` have invalid types.
  - ValueError – If `chunk_size` is less than 1.
  - ImportError – If `pyarrow` is not installed.
- Return type:
Iterator[DataFrame]
- df_eval.parquet.write_parquet_row_chunks(chunks, output_path, *, compression='snappy')[source]#
Write DataFrame chunks to a Parquet file.
- Parameters:
- Return type:
- Returns:
The normalized output path.
- Raises:
  - TypeError – If `output_path`, `compression`, or chunk values are invalid.
  - ValueError – If `compression` is empty or no chunks are provided.
  - ImportError – If `pyarrow` is not installed.