API Reference#

Core Modules#

df-eval: A lightweight expression evaluation engine for pandas DataFrames.

This package provides tools for evaluating expressions on pandas DataFrames, supporting schema-driven derived columns and external lookups.

class df_eval.Engine[source]

Bases: object

Engine for evaluating expressions on pandas DataFrames.

The Engine class provides methods to evaluate expressions, apply transformations, and manage UDF/constant registries.

__init__()[source]

Initialize the evaluation engine.

enable_provenance(enabled=True)[source]

Enable or disable provenance tracking.

Parameters:

enabled (bool) – Whether to track provenance in df.attrs.

Return type:

None

register_function(name, func)[source]

Register a custom function (UDF) for use in expressions.

Parameters:
  • name (str) – The name to register the function under.

  • func (Callable[..., Any]) – The function to register.

Return type:

None

register_constant(name, value)[source]

Register a constant for use in expressions.

Parameters:
  • name (str) – The name to register the constant under.

  • value (Any) – The constant value.

Return type:

None

register_resolver(name, resolver)[source]

Register a lookup resolver for use in expressions.

Registered resolvers can be referenced by name from expressions via the lookup() helper, for example:

engine.register_resolver("prices", price_resolver)
schema = {"price": "lookup(product, prices)"}
Parameters:
  • name (str) – Name to register the resolver under.

  • resolver (Resolver) – Resolver instance (e.g., DictResolver).

Return type:

None

register_pipeline_function(name, func)[source]

Register a named pipeline function for metadata-driven workflows.

Pipeline functions are invoked by higher-level orchestration layers (for example, Pandera-driven schemas) based on column metadata rather than being called directly from df-eval expression strings. A pipeline function typically accepts a pandas.DataFrame slice and optional keyword arguments, and returns either a Series or DataFrame aligned with the input index.

Return type:

None

evaluate(df, expr, dtype=None)[source]

Evaluate an expression on a DataFrame.

Parameters:
  • df (DataFrame) – The DataFrame to evaluate the expression on.

  • expr (str | Expression) – The expression to evaluate (string or Expression object).

  • dtype (Optional[str]) – Optional dtype to cast the result to.

Return type:

Any

Returns:

The result of evaluating the expression.

Raises:

ValueError – If the expression is invalid.
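
For illustration, roughly what a call such as evaluate(df, "a + b") is described to do, using pandas' built-in DataFrame.eval as a stand-in. This is only an approximation: df-eval's own parser, UDF registry, and allow-listed functions are not modeled here.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Stand-in for engine.evaluate(df, "a + b"): evaluate a column expression
# against the DataFrame and return the resulting values.
result = df.eval("a + b")
```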

evaluate_many(df, expressions)[source]

Evaluate multiple expressions and add them as columns.

This is an alias for apply_schema, provided for batch evaluation.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • expressions (Dict[str, str | Expression]) – A dictionary mapping column names to expressions.

Return type:

DataFrame

Returns:

A new DataFrame with the evaluated columns added.

apply_schema(df, schema, dtypes=None)[source]

Apply a schema of derived columns to a DataFrame with topological ordering.

This method automatically handles dependencies between columns and detects cycles in the dependency graph.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • schema (Dict[str, str | Expression]) – A dictionary mapping column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional dictionary mapping column names to dtypes.

Return type:

DataFrame

Returns:

A new DataFrame with the derived columns added.

Raises:

CycleDetectedError – If a cycle is detected in dependencies.
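
The topological ordering described above can be sketched with the standard library's graphlib. The schema, the regex-based dependency extraction, and the column names here are invented stand-ins for Expression.dependencies, not df-eval's actual implementation.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical schema: "margin" depends on "revenue", a sibling derived column.
schema = {
    "revenue": "price * qty",
    "margin": "revenue - cost",
}

def deps(expr, derived):
    # Naive stand-in for Expression.dependencies: any identifier in the
    # expression that names another derived column counts as a dependency.
    return {tok for tok in re.findall(r"[A-Za-z_]\w*", expr) if tok in derived}

graph = {col: deps(expr, schema) for col, expr in schema.items()}
# static_order() yields dependencies first, so "revenue" precedes "margin".
order = list(TopologicalSorter(graph).static_order())
```

graphlib raises CycleError on a dependency cycle, analogous to the CycleDetectedError documented here.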

apply_operations(df, operations, dtypes=None)[source]

Apply a set of operations (expr, lookup, function) to a DataFrame.

operations is a mapping from column name to a spec with keys:

{
    "kind": "expr" | "lookup" | "function",
    "expr": str | None,
    "lookup": dict | None,
    "function": dict | None,
}

This is intended to be used by higher-level integrations such as the Pandera helpers, which translate column metadata into this structure.

Return type:

DataFrame
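
A hypothetical operations mapping in the documented shape; the column names and resolver name are invented for illustration. Each entry sets "kind" and populates exactly one of the three payload keys.

```python
# One spec per derived column, matching the documented structure.
operations = {
    "price": {
        "kind": "lookup",
        "expr": None,
        "lookup": {"resolver": "prices", "key": "product"},
        "function": None,
    },
    "total": {
        "kind": "expr",
        "expr": "price * qty",
        "lookup": None,
        "function": None,
    },
}
```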

apply_pandera_schema(df, schema, **kwargs)[source]

Apply a Pandera schema and derive df-eval columns from metadata.

This is a thin convenience wrapper around df_eval.pandera.apply_pandera_schema that forwards the current engine instance so registered functions/constants and provenance settings are honored.

Return type:

DataFrame

iter_apply_schema_parquet_chunks(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]

Yield transformed chunks from a Parquet file or dataset.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • schema (Dict[str, str | Expression]) – Mapping of derived column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional mapping of derived column names to pandas dtypes.

  • chunk_size (int) – Maximum rows to scan and transform per chunk.

  • input_columns (Sequence[str] | None) – Optional input column projection for scan efficiency.

  • output_columns (Sequence[str] | None) – Optional ordered subset of output columns to keep.

Yields:

Transformed DataFrame chunks.

Return type:

Iterator[DataFrame]

apply_schema_parquet_to_df(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]

Transform a Parquet dataset chunk-by-chunk and return one DataFrame.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • schema (Dict[str, str | Expression]) – Mapping of derived column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional mapping of derived column names to pandas dtypes.

  • chunk_size (int) – Maximum rows to process per chunk.

  • input_columns (Sequence[str] | None) – Optional input column projection for scan efficiency.

  • output_columns (Sequence[str] | None) – Optional ordered subset of output columns to keep.

Return type:

DataFrame

Returns:

A DataFrame containing all transformed rows. Returns an empty DataFrame when the input yields no row chunks.

apply_schema_parquet_to_parquet(input_path, output_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None, compression='snappy')[source]

Transform a Parquet dataset chunk-by-chunk and write Parquet output.

This method is optimized for out-of-memory processing: source data is streamed in row chunks, transformed with the same expression engine used for in-memory DataFrames, and written incrementally to output_path.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • output_path (str | Path) – Destination Parquet file.

  • schema (Dict[str, str | Expression]) – Mapping of derived column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional mapping of derived column names to pandas dtypes.

  • chunk_size (int) – Maximum rows to process per chunk.

  • input_columns (Sequence[str] | None) – Optional input column projection for scan efficiency.

  • output_columns (Sequence[str] | None) – Optional ordered subset of output columns to keep.

  • compression (str) – Parquet compression codec used for output.

Return type:

Path

Returns:

The normalized output_path.
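
The streaming shape described above can be sketched with plain lists standing in for DataFrame chunks and Parquet IO; none of this is df-eval's actual implementation.

```python
def iter_chunks(rows, chunk_size):
    # Stand-in for scanning Parquet row chunks of at most chunk_size rows.
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

def transform(chunk):
    # Stand-in for applying the expression schema to one chunk.
    return [r * 2 for r in chunk]

written = []
for chunk in iter_chunks(list(range(5)), chunk_size=2):
    # Stand-in for an incremental Parquet write: only one transformed
    # chunk is held in memory at a time.
    written.extend(transform(chunk))
```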

apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, **kwargs)[source]

Apply a Pandera schema to Parquet input and write Parquet output.

Return type:

Path

class df_eval.Expression(expr_str)[source]

Bases: object

Represents a parsed expression that can be evaluated on a DataFrame.

expr_str

The string representation of the expression.

dependencies

Set of column names referenced in the expression.

__init__(expr_str)[source]

Initialize an Expression.

Parameters:

expr_str (str) – The expression string to parse.

static parse(expr_str)[source]

Parse an expression string into an Expression object.

Parameters:

expr_str (str) – The expression string to parse.

Return type:

Expression

Returns:

An Expression object.

Raises:

ValueError – If the expression is invalid.
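
For intuition, dependency extraction of the kind the dependencies attribute describes can be sketched with the standard library's ast module. This naive version treats every bare name as a column reference and is not df-eval's actual parser.

```python
import ast

def extract_dependencies(expr_str):
    # Parse the expression and collect every bare identifier; a syntactically
    # invalid expression raises here (df-eval reports ValueError instead).
    tree = ast.parse(expr_str, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
```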

exception df_eval.CycleDetectedError[source]

Bases: Exception

Raised when a cycle is detected in column dependencies.

df_eval.lookup(series, resolver, on_missing='null')[source]

Look up values using a resolver.

Parameters:
  • series (Series) – The series containing keys to lookup.

  • resolver (Resolver) – The resolver to use for lookups.

  • on_missing (str) – How to handle missing values (“null”, “raise”, “keep”):
    - “null”: Return None/NaN for missing values.
    - “raise”: Raise an exception for missing values.
    - “keep”: Keep the original key value.

Return type:

Series

Returns:

A series with resolved values.

Raises:

ValueError – If on_missing is “raise” and a key cannot be resolved.
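
A scalar sketch of the documented on_missing modes; the real helper operates on a pandas Series and a Resolver, and the mapping here is invented for illustration.

```python
mapping = {"A": 1.0, "B": 2.5}

def lookup_one(key, on_missing="null"):
    # Resolve one key according to the documented on_missing modes.
    if key in mapping:
        return mapping[key]
    if on_missing == "raise":
        raise ValueError(f"cannot resolve {key!r}")
    if on_missing == "keep":
        return key
    return None  # "null" mode
```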

class df_eval.Resolver[source]

Bases: ABC

Abstract base class for resolvers.

abstractmethod resolve(key)[source]

Resolve a key to a value.

Parameters:

key (Any) – The key to resolve.

Return type:

Any

Returns:

The resolved value.

class df_eval.CachedResolver(resolver, ttl_seconds=300.0)[source]

Bases: Resolver

Resolver with TTL cache support.

__init__(resolver, ttl_seconds=300.0)[source]

Initialize a cached resolver.

Parameters:
  • resolver (Resolver) – The underlying resolver.

  • ttl_seconds (float) – Time-to-live for cache entries in seconds.

resolve(key)[source]

Resolve with caching.

Return type:

Any

clear_cache()[source]

Clear the cache.

Return type:

None
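
The TTL behavior described for CachedResolver can be sketched as follows; this is a behavioral illustration, not df-eval's implementation.

```python
import time

class TTLCacheSketch:
    # Minimal sketch: cache resolved values and re-resolve once an entry
    # is older than ttl_seconds.
    def __init__(self, resolve_fn, ttl_seconds=300.0):
        self._resolve_fn = resolve_fn
        self._ttl = ttl_seconds
        self._cache = {}

    def resolve(self, key):
        now = time.monotonic()
        entry = self._cache.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # cache hit, still fresh
        value = self._resolve_fn(key)
        self._cache[key] = (value, now)
        return value

    def clear_cache(self):
        self._cache.clear()
```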

class df_eval.DictResolver(mapping, default=None)[source]

Bases: Resolver

Simple dictionary-based resolver.

__init__(mapping, default=None)[source]

Initialize a dictionary resolver.

Parameters:
  • mapping (Dict[Any, Any]) – Dictionary mapping keys to values.

  • default (Any) – Default value if key not found.

resolve(key)[source]

Resolve from dictionary.

Return type:

Any
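
The documented DictResolver semantics amount to a dictionary get with a default, as in this behavioral sketch:

```python
class DictResolverSketch:
    # Sketch of the documented behavior: return the mapped value, or the
    # configured default when the key is absent.
    def __init__(self, mapping, default=None):
        self._mapping = mapping
        self._default = default

    def resolve(self, key):
        return self._mapping.get(key, self._default)
```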

class df_eval.FileResolver(filepath, key_column, value_column)[source]

Bases: Resolver

File-based resolver that reads from CSV/JSON files.

__init__(filepath, key_column, value_column)[source]

Initialize a file resolver.

Parameters:
  • filepath (str) – Path to the file.

  • key_column (str) – Name of the key column.

  • value_column (str) – Name of the value column.

resolve(key)[source]

Resolve from file.

Return type:

Any

class df_eval.DatabaseResolver(connection_string, table, key_column, value_column)[source]

Bases: Resolver

Database resolver (placeholder for SQL database lookups).

__init__(connection_string, table, key_column, value_column)[source]

Initialize a database resolver.

Parameters:
  • connection_string (str) – Database connection string.

  • table (str) – Table name.

  • key_column (str) – Name of the key column.

  • value_column (str) – Name of the value column.

resolve(key)[source]

Resolve from database.

Note: This is a placeholder. Actual implementation would require a database connection library like sqlalchemy.

Return type:

Any

class df_eval.HTTPResolver(base_url, key_param='key')[source]

Bases: Resolver

HTTP API resolver (placeholder for REST API lookups).

__init__(base_url, key_param='key')[source]

Initialize an HTTP resolver.

Parameters:
  • base_url (str) – Base URL for the API.

  • key_param (str) – Query parameter name for the key.

resolve(key)[source]

Resolve from HTTP API.

Note: This is a placeholder. Actual implementation would require an HTTP library like requests.

Return type:

Any

df_eval.df_eval_schema_from_pandera(schema, meta_key='df-eval', expr_key='expr')[source]

Build a df-eval schema mapping from Pandera per-column metadata.

Return type:

dict[str, str]

df_eval.apply_pandera_schema(df, schema, *, meta_key='df-eval', coerce=True, validate=True, validate_post=True, engine=None, error_on_overwrite=True)[source]

Validate with Pandera, apply df-eval operations, then optionally revalidate.

Columns that define df-eval metadata under meta_key are considered derived and are excluded from pre-validation. This allows input frames that do not yet include derived columns.

The df-eval metadata for each column may currently define exactly one of the following keys:

{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "my_fn", "inputs": ["a"], "outputs": ["y"]}}

These are translated into an operations mapping consumed by df_eval.engine.Engine.apply_operations().

Return type:

DataFrame
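
A hypothetical per-column metadata mapping of the form described above, keyed under the default meta_key="df-eval". The column names "total", "price", and "score" are invented; each derived column defines exactly one of the three operation keys.

```python
column_metadata = {
    "total": {"df-eval": {"expr": "price * qty"}},
    "price": {"df-eval": {"lookup": {"resolver": "prices", "key": "product"}}},
    "score": {"df-eval": {"function": {"name": "my_fn", "inputs": ["a"], "outputs": ["y"]}}},
}
```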

df_eval.apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, *, meta_key='df-eval', expr_key='expr', engine=None, chunk_size=100000, compression='snappy')[source]

Apply a Pandera-driven schema to Parquet input and write Parquet output.

The input scan is projected to only required source columns, and output columns are restricted to the Pandera schema order.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • output_path (str | Path) – Destination Parquet file.

  • schema (Any) – Pandera SchemaModel/DataFrameModel class or DataFrameSchema.

  • meta_key (str) – Metadata section containing df-eval expressions.

  • expr_key (str) – Metadata key containing the expression text.

  • engine (Engine | None) – Optional Engine instance.

  • chunk_size (int) – Maximum rows processed per chunk.

  • compression (str) – Parquet compression codec used for output.

Return type:

Path

Returns:

The normalized output path.

df_eval.load_pandera_schema_yaml(source)[source]

Load a Pandera DataFrameSchema from YAML, preserving column metadata.

This is a thin, public wrapper around df-eval’s temporary fork of pandera.io.pandas_io. It exists to work around unionai-oss/pandera#1301, where column metadata is not round-tripped by Pandera’s YAML/JSON IO helpers.

Parameters:

source (str | Path) – Path to a YAML schema file or a YAML string.

Return type:

Any

Returns:

A Pandera DataFrameSchema.

df_eval.dump_pandera_schema_yaml(schema, stream=None)[source]

Dump a Pandera DataFrameSchema to YAML, preserving column metadata.

This uses df-eval’s temporary fork of pandera.io.pandas_io so that column metadata survives a full IO round-trip. Once Pandera fixes unionai-oss/pandera#1301, this helper may be simplified to delegate directly to Pandera’s built-in IO.

Parameters:
  • schema (Any) – A Pandera SchemaModel/DataFrameModel class or DataFrameSchema.

  • stream (str | Path | None) – Optional path or file-like to write to. If None, the YAML representation is returned as a string.

Return type:

str | None

Returns:

The YAML string if stream is None, otherwise None.

df_eval.load_pandera_schema_json(source)[source]

Load a Pandera DataFrameSchema from JSON, preserving column metadata.

This mirrors load_pandera_schema_yaml() but for JSON input.

Return type:

Any

df_eval.dump_pandera_schema_json(schema, target=None, **kwargs)[source]

Dump a Pandera DataFrameSchema to JSON, preserving column metadata.

Parameters:
  • schema (Any) – A Pandera SchemaModel/DataFrameModel class or DataFrameSchema.

  • target (str | Path | None) – Optional path or file-like to write to. If None, the JSON representation is returned as a string.

  • **kwargs (Any) – Extra keyword arguments forwarded to json.dump().

Return type:

str | None

df_eval.iter_parquet_row_chunks(path, *, chunk_size=100000, columns=None)[source]

Yield Parquet rows as pandas DataFrame chunks.

This treats a Parquet file or directory-backed Parquet dataset as an out-of-memory DataFrame and streams it into manageable in-memory chunks.

Parameters:
  • path (str | Path) – Path to a Parquet file or a directory containing a Parquet dataset.

  • chunk_size (int) – Maximum number of rows to include per yielded chunk.

  • columns (Sequence[str] | None) – Optional subset of columns to project while scanning.

Yields:

DataFrame chunks with at most chunk_size rows.

Return type:

Iterator[DataFrame]

df_eval.write_parquet_row_chunks(chunks, output_path, *, compression='snappy')[source]

Write DataFrame chunks to a Parquet file.

Parameters:
  • chunks (Iterable[DataFrame]) – DataFrame chunks to write sequentially.

  • output_path (str | Path) – Destination Parquet file path.

  • compression (str) – Parquet compression codec.

Return type:

Path

Returns:

The normalized output path.

Raises:
  • TypeError – If output_path, compression, or chunk values are invalid.

  • ValueError – If compression is empty or no chunks are provided.

  • ImportError – If pyarrow is not installed.

Expression Module#

Expression parsing and representation module.

This module provides the Expression class for parsing and representing expressions that can be evaluated on pandas DataFrames.

class df_eval.expr.Expression(expr_str)[source]#

Bases: object

Represents a parsed expression that can be evaluated on a DataFrame.

expr_str#

The string representation of the expression.

dependencies#

Set of column names referenced in the expression.

__init__(expr_str)[source]#

Initialize an Expression.

Parameters:

expr_str (str) – The expression string to parse.

static parse(expr_str)[source]#

Parse an expression string into an Expression object.

Parameters:

expr_str (str) – The expression string to parse.

Return type:

Expression

Returns:

An Expression object.

Raises:

ValueError – If the expression is invalid.

Engine Module#

Evaluation engine module.

This module provides the Engine class for evaluating expressions on pandas DataFrames with support for UDF registry, schema-driven derived columns with topological ordering, and provenance tracking.

exception df_eval.engine.CycleDetectedError[source]#

Bases: Exception

Raised when a cycle is detected in column dependencies.

class df_eval.engine.Engine[source]#

Bases: object

Engine for evaluating expressions on pandas DataFrames.

The Engine class provides methods to evaluate expressions, apply transformations, and manage UDF/constant registries.

__init__()[source]#

Initialize the evaluation engine.

enable_provenance(enabled=True)[source]#

Enable or disable provenance tracking.

Parameters:

enabled (bool) – Whether to track provenance in df.attrs.

Return type:

None

register_function(name, func)[source]#

Register a custom function (UDF) for use in expressions.

Parameters:
  • name (str) – The name to register the function under.

  • func (Callable[..., Any]) – The function to register.

Return type:

None

register_constant(name, value)[source]#

Register a constant for use in expressions.

Parameters:
  • name (str) – The name to register the constant under.

  • value (Any) – The constant value.

Return type:

None

register_resolver(name, resolver)[source]#

Register a lookup resolver for use in expressions.

Registered resolvers can be referenced by name from expressions via the lookup() helper, for example:

engine.register_resolver("prices", price_resolver)
schema = {"price": "lookup(product, prices)"}
Parameters:
  • name (str) – Name to register the resolver under.

  • resolver (Resolver) – Resolver instance (e.g., DictResolver).

Return type:

None

register_pipeline_function(name, func)[source]#

Register a named pipeline function for metadata-driven workflows.

Pipeline functions are invoked by higher-level orchestration layers (for example, Pandera-driven schemas) based on column metadata rather than being called directly from df-eval expression strings. A pipeline function typically accepts a pandas.DataFrame slice and optional keyword arguments, and returns either a Series or DataFrame aligned with the input index.

Return type:

None

evaluate(df, expr, dtype=None)[source]#

Evaluate an expression on a DataFrame.

Parameters:
  • df (DataFrame) – The DataFrame to evaluate the expression on.

  • expr (str | Expression) – The expression to evaluate (string or Expression object).

  • dtype (Optional[str]) – Optional dtype to cast the result to.

Return type:

Any

Returns:

The result of evaluating the expression.

Raises:

ValueError – If the expression is invalid.

evaluate_many(df, expressions)[source]#

Evaluate multiple expressions and add them as columns.

This is an alias for apply_schema, provided for batch evaluation.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • expressions (Dict[str, str | Expression]) – A dictionary mapping column names to expressions.

Return type:

DataFrame

Returns:

A new DataFrame with the evaluated columns added.

apply_schema(df, schema, dtypes=None)[source]#

Apply a schema of derived columns to a DataFrame with topological ordering.

This method automatically handles dependencies between columns and detects cycles in the dependency graph.

Parameters:
  • df (DataFrame) – The input DataFrame.

  • schema (Dict[str, str | Expression]) – A dictionary mapping column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional dictionary mapping column names to dtypes.

Return type:

DataFrame

Returns:

A new DataFrame with the derived columns added.

Raises:

CycleDetectedError – If a cycle is detected in dependencies.

apply_operations(df, operations, dtypes=None)[source]#

Apply a set of operations (expr, lookup, function) to a DataFrame.

operations is a mapping from column name to a spec with keys:

{
    "kind": "expr" | "lookup" | "function",
    "expr": str | None,
    "lookup": dict | None,
    "function": dict | None,
}

This is intended to be used by higher-level integrations such as the Pandera helpers, which translate column metadata into this structure.

Return type:

DataFrame

apply_pandera_schema(df, schema, **kwargs)[source]#

Apply a Pandera schema and derive df-eval columns from metadata.

This is a thin convenience wrapper around df_eval.pandera.apply_pandera_schema that forwards the current engine instance so registered functions/constants and provenance settings are honored.

Return type:

DataFrame

iter_apply_schema_parquet_chunks(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]#

Yield transformed chunks from a Parquet file or dataset.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • schema (Dict[str, str | Expression]) – Mapping of derived column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional mapping of derived column names to pandas dtypes.

  • chunk_size (int) – Maximum rows to scan and transform per chunk.

  • input_columns (Sequence[str] | None) – Optional input column projection for scan efficiency.

  • output_columns (Sequence[str] | None) – Optional ordered subset of output columns to keep.

Yields:

Transformed DataFrame chunks.

Return type:

Iterator[DataFrame]

apply_schema_parquet_to_df(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]#

Transform a Parquet dataset chunk-by-chunk and return one DataFrame.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • schema (Dict[str, str | Expression]) – Mapping of derived column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional mapping of derived column names to pandas dtypes.

  • chunk_size (int) – Maximum rows to process per chunk.

  • input_columns (Sequence[str] | None) – Optional input column projection for scan efficiency.

  • output_columns (Sequence[str] | None) – Optional ordered subset of output columns to keep.

Return type:

DataFrame

Returns:

A DataFrame containing all transformed rows. Returns an empty DataFrame when the input yields no row chunks.

apply_schema_parquet_to_parquet(input_path, output_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None, compression='snappy')[source]#

Transform a Parquet dataset chunk-by-chunk and write Parquet output.

This method is optimized for out-of-memory processing: source data is streamed in row chunks, transformed with the same expression engine used for in-memory DataFrames, and written incrementally to output_path.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • output_path (str | Path) – Destination Parquet file.

  • schema (Dict[str, str | Expression]) – Mapping of derived column names to expressions.

  • dtypes (Optional[Dict[str, str]]) – Optional mapping of derived column names to pandas dtypes.

  • chunk_size (int) – Maximum rows to process per chunk.

  • input_columns (Sequence[str] | None) – Optional input column projection for scan efficiency.

  • output_columns (Sequence[str] | None) – Optional ordered subset of output columns to keep.

  • compression (str) – Parquet compression codec used for output.

Return type:

Path

Returns:

The normalized output_path.

apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, **kwargs)[source]#

Apply a Pandera schema to Parquet input and write Parquet output.

Return type:

Path

Functions Module#

Built-in functions for expression evaluation.

This module provides built-in functions that can be used in expressions. These are safe, vectorized functions that are allow-listed for use in expressions.

df_eval.functions.safe_divide(a, b)[source]#

Safely divide two values, returning NaN for division by zero.

Parameters:
  • a (Any) – The numerator.

  • b (Any) – The denominator.

Return type:

Any

Returns:

The result of a / b, or NaN if b is zero.
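
A scalar sketch of the documented contract; the real built-in is vectorized over pandas/NumPy inputs.

```python
import math

def safe_divide_sketch(a, b):
    # Return NaN for division by zero instead of raising, per the contract above.
    return math.nan if b == 0 else a / b
```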

df_eval.functions.coalesce(*args)[source]#

Return the first non-null value from the arguments.

Parameters:

*args (Any) – Values to check.

Return type:

Any

Returns:

The first non-null value, or None if all are null.
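
A scalar sketch of the documented behavior; the real built-in operates on vectorized inputs and also treats NaN as null, which this sketch does not.

```python
def coalesce_sketch(*args):
    # Return the first argument that is not None, or None if all are.
    for value in args:
        if value is not None:
            return value
    return None
```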

df_eval.functions.clip_value(value, min_val=None, max_val=None)[source]#

Clip values to a specified range.

Parameters:
  • value (Any) – The value to clip.

  • min_val (float | None) – The minimum value (optional).

  • max_val (float | None) – The maximum value (optional).

Return type:

Any

Returns:

The clipped value.

df_eval.functions.safe_abs(value)[source]#

Absolute value function.

Return type:

Any

df_eval.functions.safe_log(value)[source]#

Natural logarithm function.

Return type:

Any

df_eval.functions.safe_exp(value)[source]#

Exponential function.

Return type:

Any

df_eval.functions.safe_sqrt(value)[source]#

Square root function.

Return type:

Any

df_eval.functions.safe_clip(value, a_min, a_max)[source]#

Clip values to a range.

Return type:

Any

df_eval.functions.safe_where(condition, x, y)[source]#

Return elements from x or y depending on condition.

Return type:

Any

df_eval.functions.safe_isna(value)[source]#

Check for NaN/None values.

Return type:

Any

df_eval.functions.safe_fillna(value, fill_value)[source]#

Fill NaN/None values with a specified value.

Return type:

Any

Lookup Module#

Lookup functionality with resolvers for external data sources.

This module provides lookup capabilities with various resolvers (database, HTTP, file) and TTL caching.

class df_eval.lookup.Resolver[source]#

Bases: ABC

Abstract base class for resolvers.

abstractmethod resolve(key)[source]#

Resolve a key to a value.

Parameters:

key (Any) – The key to resolve.

Return type:

Any

Returns:

The resolved value.

class df_eval.lookup.CachedResolver(resolver, ttl_seconds=300.0)[source]#

Bases: Resolver

Resolver with TTL cache support.

__init__(resolver, ttl_seconds=300.0)[source]#

Initialize a cached resolver.

Parameters:
  • resolver (Resolver) – The underlying resolver.

  • ttl_seconds (float) – Time-to-live for cache entries in seconds.

resolve(key)[source]#

Resolve with caching.

Return type:

Any

clear_cache()[source]#

Clear the cache.

Return type:

None

class df_eval.lookup.DictResolver(mapping, default=None)[source]#

Bases: Resolver

Simple dictionary-based resolver.

__init__(mapping, default=None)[source]#

Initialize a dictionary resolver.

Parameters:
  • mapping (Dict[Any, Any]) – Dictionary mapping keys to values.

  • default (Any) – Default value if key not found.

resolve(key)[source]#

Resolve from dictionary.

Return type:

Any

class df_eval.lookup.FileResolver(filepath, key_column, value_column)[source]#

Bases: Resolver

File-based resolver that reads from CSV/JSON files.

__init__(filepath, key_column, value_column)[source]#

Initialize a file resolver.

Parameters:
  • filepath (str) – Path to the file.

  • key_column (str) – Name of the key column.

  • value_column (str) – Name of the value column.

resolve(key)[source]#

Resolve from file.

Return type:

Any

class df_eval.lookup.DatabaseResolver(connection_string, table, key_column, value_column)[source]#

Bases: Resolver

Database resolver (placeholder for SQL database lookups).

__init__(connection_string, table, key_column, value_column)[source]#

Initialize a database resolver.

Parameters:
  • connection_string (str) – Database connection string.

  • table (str) – Table name.

  • key_column (str) – Name of the key column.

  • value_column (str) – Name of the value column.

resolve(key)[source]#

Resolve from database.

Note: This is a placeholder. Actual implementation would require a database connection library like sqlalchemy.

Return type:

Any

class df_eval.lookup.HTTPResolver(base_url, key_param='key')[source]#

Bases: Resolver

HTTP API resolver (placeholder for REST API lookups).

__init__(base_url, key_param='key')[source]#

Initialize an HTTP resolver.

Parameters:
  • base_url (str) – Base URL for the API.

  • key_param (str) – Query parameter name for the key.

resolve(key)[source]#

Resolve from HTTP API.

Note: This is a placeholder. Actual implementation would require an HTTP library like requests.

Return type:

Any

df_eval.lookup.lookup(series, resolver, on_missing='null')[source]#

Look up values using a resolver.

Parameters:
  • series (Series) – The series containing keys to lookup.

  • resolver (Resolver) – The resolver to use for lookups.

  • on_missing (str) – How to handle missing values (“null”, “raise”, “keep”):
    - “null”: Return None/NaN for missing values.
    - “raise”: Raise an exception for missing values.
    - “keep”: Keep the original key value.

Return type:

Series

Returns:

A series with resolved values.

Raises:

ValueError – If on_missing is “raise” and a key cannot be resolved.

Pandera Module#

Pandera integration helpers for df-eval.

This module keeps Pandera support optional and layered on top of the core Engine API by translating Pandera column metadata into a df-eval schema map.

df_eval.pandera.df_eval_schema_from_pandera(schema, meta_key='df-eval', expr_key='expr')[source]#

Build a df-eval schema mapping from Pandera per-column metadata.

Return type:

dict[str, str]

df_eval.pandera.apply_pandera_schema(df, schema, *, meta_key='df-eval', coerce=True, validate=True, validate_post=True, engine=None, error_on_overwrite=True)[source]#

Validate with Pandera, apply df-eval operations, then optionally revalidate.

Columns that define df-eval metadata under meta_key are considered derived and are excluded from pre-validation. This allows input frames that do not yet include derived columns.

The df-eval metadata for each column may currently define exactly one of the following keys:

{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "my_fn", "inputs": ["a"], "outputs": ["y"]}}

These are translated into an operations mapping consumed by df_eval.engine.Engine.apply_operations().

Return type:

DataFrame

df_eval.pandera.apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, *, meta_key='df-eval', expr_key='expr', engine=None, chunk_size=100000, compression='snappy')[source]#

Apply a Pandera-driven schema to Parquet input and write Parquet output.

The input scan is projected to only required source columns, and output columns are restricted to the Pandera schema order.

Parameters:
  • input_path (str | Path) – Source Parquet file or directory-backed dataset.

  • output_path (str | Path) – Destination Parquet file.

  • schema (Any) – Pandera SchemaModel/DataFrameModel class or DataFrameSchema.

  • meta_key (str) – Metadata section containing df-eval expressions.

  • expr_key (str) – Metadata key containing the expression text.

  • engine (Engine | None) – Optional Engine instance.

  • chunk_size (int) – Maximum rows processed per chunk.

  • compression (str) – Parquet compression codec used for output.

Return type:

Path

Returns:

The normalized output path.

df_eval.pandera.df_eval_operations_from_pandera(schema, meta_key='df-eval')[source]#

Build a rich df-eval operations mapping from Pandera column metadata.

Each column may define one of the following under metadata[meta_key]:

{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "churn_model_v1", "inputs": ["age"]}}

The returned mapping has the shape:

{
    "column_name": {
        "kind": "expr" | "lookup" | "function",
        "expr": str | None,
        "lookup": dict | None,
        "function": dict | None,
    },
}

Return type:

dict[str, dict[str, Any]]
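
A sketch of building the operations mapping shown above, again with plain dicts standing in for Pandera column metadata (illustrative code, not df-eval's implementation):

```python
KINDS = ("expr", "lookup", "function")

def operations_from_metadata(columns, meta_key="df-eval"):
    """Translate per-column metadata into an operations mapping."""
    ops = {}
    for name, metadata in columns.items():
        section = (metadata or {}).get(meta_key) or {}
        kinds = [k for k in KINDS if k in section]
        if not kinds:
            continue  # plain input column, no operation
        if len(kinds) > 1:
            raise ValueError(f"{name!r}: define exactly one of {KINDS}")
        ops[name] = {
            "kind": kinds[0],
            "expr": section.get("expr"),
            "lookup": section.get("lookup"),
            "function": section.get("function"),
        }
    return ops

columns = {
    "total": {"df-eval": {"expr": "a + b"}},
    "price": {"df-eval": {"lookup": {"resolver": "prices", "key": "product"}}},
}
ops = operations_from_metadata(columns)
print(ops["total"]["kind"], ops["price"]["kind"])  # expr lookup
```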

df_eval.pandera.load_pandera_schema_yaml(source)[source]#

Load a Pandera DataFrameSchema from YAML, preserving column metadata.

This is a thin, public wrapper around df-eval’s temporary fork of pandera.io.pandas_io. It exists to work around unionai-oss/pandera#1301, where column metadata is not round-tripped by Pandera’s YAML/JSON IO helpers.

Parameters:

source (str | Path) – Path to a YAML schema file or a YAML string.

Return type:

Any

Returns:

A Pandera DataFrameSchema.

df_eval.pandera.dump_pandera_schema_yaml(schema, stream=None)[source]#

Dump a Pandera DataFrameSchema to YAML, preserving column metadata.

This uses df-eval’s temporary fork of pandera.io.pandas_io so that column metadata survives a full IO round-trip. Once Pandera fixes unionai-oss/pandera#1301, this helper may be simplified to delegate directly to Pandera’s built-in IO.

Parameters:
  • schema (Any) – A Pandera SchemaModel/DataFrameModel class or DataFrameSchema.

  • stream (str | Path | None) – Optional path or file-like to write to. If None, the YAML representation is returned as a string.

Return type:

str | None

Returns:

The YAML string if stream is None, otherwise None.

df_eval.pandera.load_pandera_schema_json(source)[source]#

Load a Pandera DataFrameSchema from JSON, preserving column metadata.

This mirrors load_pandera_schema_yaml() but for JSON input.

Return type:

Any

df_eval.pandera.dump_pandera_schema_json(schema, target=None, **kwargs)[source]#

Dump a Pandera DataFrameSchema to JSON, preserving column metadata.

Parameters:
  • schema (Any) – A Pandera SchemaModel/DataFrameModel class or DataFrameSchema.

  • target (str | Path | None) – Optional path or file-like to write to. If None, the JSON representation is returned as a string.

  • **kwargs (Any) – Extra keyword arguments forwarded to json.dump().

Return type:

str | None
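
The point of these helpers is that per-column metadata survives serialization. A minimal illustration with stdlib json and a plain-dict stand-in for a schema (Pandera's own IO drops the metadata field, per unionai-oss/pandera#1301):

```python
import json

# Plain-dict stand-in for a schema description with per-column metadata.
schema = {
    "columns": {
        "a": {"dtype": "int64", "metadata": None},
        "total": {"dtype": "int64",
                  "metadata": {"df-eval": {"expr": "a + b"}}},
    }
}

loaded = json.loads(json.dumps(schema))
assert loaded == schema  # metadata survives the stdlib round trip
print(loaded["columns"]["total"]["metadata"])  # {'df-eval': {'expr': 'a + b'}}
```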

Parquet Module#

df_eval.parquet.iter_parquet_row_chunks(path, *, chunk_size=100000, columns=None)[source]#

Yield Parquet rows as pandas DataFrame chunks.

This treats a Parquet file or directory-backed Parquet dataset as a larger-than-memory DataFrame and streams it into manageable in-memory chunks.

Parameters:
  • path (str | Path) – Path to a Parquet file or a directory containing a Parquet dataset.

  • chunk_size (int) – Maximum number of rows to include per yielded chunk.

  • columns (Sequence[str] | None) – Optional subset of columns to project while scanning.

Yields:

DataFrame chunks with at most chunk_size rows.

Return type:

Iterator[DataFrame]
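
The chunking semantics can be sketched with an in-memory frame; the real helper streams these slices from a Parquet scan rather than slicing a loaded DataFrame (the function name below is hypothetical):

```python
import pandas as pd

def iter_row_chunks(df, chunk_size=100_000):
    """Yield successive slices of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

df = pd.DataFrame({"x": range(7)})
sizes = [len(chunk) for chunk in iter_row_chunks(df, chunk_size=3)]
print(sizes)  # [3, 3, 1]
```

Note the final chunk may be smaller than chunk_size; concatenating all chunks reproduces the original frame.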

df_eval.parquet.write_parquet_row_chunks(chunks, output_path, *, compression='snappy')[source]#

Write DataFrame chunks to a Parquet file.

Parameters:
  • chunks (Iterable[DataFrame]) – DataFrame chunks to write sequentially.

  • output_path (str | Path) – Destination Parquet file path.

  • compression (str) – Parquet compression codec.

Return type:

Path

Returns:

The normalized output path.

Raises:
  • TypeError – If output_path, compression, or chunk values are invalid.

  • ValueError – If compression is empty or no chunks are provided.

  • ImportError – If pyarrow is not installed.