API Reference#
Core Modules#
df-eval: A lightweight expression evaluation engine for pandas DataFrames.
This package provides tools for evaluating expressions on pandas DataFrames, supporting schema-driven derived columns and external lookups.
- class df_eval.Engine[source]
Bases: `object`

Engine for evaluating expressions on pandas DataFrames.
The Engine class provides methods to evaluate expressions, apply transformations, and manage UDF/constant registries.
- __init__()[source]
Initialize the evaluation engine.
- enable_provenance(enabled=True)[source]
Enable or disable provenance tracking.
- register_function(name, func)[source]
Register a custom function (UDF) for use in expressions.
- register_constant(name, value)[source]
Register a constant for use in expressions.
- register_resolver(name, resolver)[source]
Register a lookup resolver for use in expressions.
Registered resolvers can be referenced by name from expressions via the `lookup()` helper, for example:

```python
engine.register_resolver("prices", price_resolver)
schema = {"price": "lookup(product, prices)"}
```
- register_pipeline_function(name, func)[source]
Register a named pipeline function for metadata-driven workflows.
Pipeline functions are invoked by higher-level orchestration layers (for example, Pandera-driven schemas) based on column metadata rather than being called directly from df-eval expression strings. A pipeline function typically accepts a
`pandas.DataFrame` slice and optional keyword arguments, and returns either a `Series` or `DataFrame` aligned with the input index.
- Return type:
- evaluate(df, expr, dtype=None)[source]
Evaluate an expression on a DataFrame.
- Parameters:
  - df (`DataFrame`) – The DataFrame to evaluate the expression on.
  - expr (`str | Expression`) – The expression to evaluate (string or Expression object).
  - dtype (`Optional[str]`) – Optional dtype to cast the result to.
- Return type:
- Returns:
The result of evaluating the expression.
- Raises:
ValueError – If the expression is invalid.
- evaluate_many(df, expressions)[source]
Evaluate multiple expressions and add them as columns.
This is an alias for apply_schema for batch evaluation.
- Parameters:
  - df (`DataFrame`) – The input DataFrame.
  - expressions (`Dict[str, str | Expression]`) – A dictionary mapping column names to expressions.
- Return type:
  DataFrame
- Returns:
A new DataFrame with the evaluated columns added.
- apply_schema(df, schema, dtypes=None)[source]
Apply a schema of derived columns to a DataFrame with topological ordering.
This method automatically handles dependencies between columns and detects cycles in the dependency graph.
- Parameters:
- Return type:
  DataFrame
- Returns:
A new DataFrame with the derived columns added.
- Raises:
CycleDetectedError – If a cycle is detected in dependencies.
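The topological ordering and cycle detection described above can be sketched with a plain depth-first sort. The following is an illustrative stdlib-only sketch, not df-eval's actual implementation: it uses a crude regex in place of `Expression.dependencies`, and raises `ValueError` where the real engine raises `CycleDetectedError`.

```python
import re

def toposort_schema(schema):
    """Order derived columns so each is computed after its dependencies.

    Dependencies are the identifiers an expression references that are
    themselves schema keys (a regex stand-in for Expression.dependencies).
    """
    deps = {
        col: set(re.findall(r"[A-Za-z_]\w*", expr)) & set(schema)
        for col, expr in schema.items()
    }
    order, done, visiting = [], set(), set()

    def visit(col):
        if col in done:
            return
        if col in visiting:
            # Back edge in the dependency graph: a cycle.
            raise ValueError(f"cycle involving {col!r}")
        visiting.add(col)
        for dep in sorted(deps[col]):
            visit(dep)
        visiting.discard(col)
        done.add(col)
        order.append(col)

    for col in schema:
        visit(col)
    return order

print(toposort_schema({"c": "a + b", "d": "c * 2", "b": "a + 1"}))  # → ['b', 'c', 'd']
```

Note that `b` and `c` are scheduled before `d` even though `d` appears earlier in the mapping; insertion order only breaks ties between independent columns.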
- apply_operations(df, operations, dtypes=None)[source]
Apply a set of operations (expr, lookup, function) to a DataFrame.
`operations` is a mapping from column name to a spec with keys:

```
{
    "kind": "expr" | "lookup" | "function",
    "expr": str | None,
    "lookup": dict | None,
    "function": dict | None,
}
```
This is intended to be used by higher-level integrations such as the Pandera helpers, which translate column metadata into this structure.
- Return type:
DataFrame
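The operations mapping can be consumed by a simple dispatch on `"kind"`. The sketch below mimics the documented spec shape over a plain dict "row" (df-eval operates on DataFrames); the `eval()` call and dict-based resolver are illustrative stand-ins, not df-eval's internals.

```python
def apply_ops(row, operations, resolvers, functions):
    """Dispatch each column spec on its "kind", mirroring the documented
    {"kind": ..., "expr": ..., "lookup": ..., "function": ...} shape."""
    out = dict(row)
    for column, spec in operations.items():
        kind = spec["kind"]
        if kind == "expr":
            # df-eval parses and evaluates expressions safely; bare eval()
            # over the row is only a stand-in for illustration.
            out[column] = eval(spec["expr"], {}, out)
        elif kind == "lookup":
            resolver = resolvers[spec["lookup"]["resolver"]]
            out[column] = resolver.get(out[spec["lookup"]["key"]])
        elif kind == "function":
            fn = functions[spec["function"]["name"]]
            out[column] = fn(*(out[name] for name in spec["function"]["inputs"]))
        else:
            raise ValueError(f"unknown kind: {kind!r}")
    return out

ops = {
    "total": {"kind": "expr", "expr": "a + b", "lookup": None, "function": None},
    "price": {"kind": "lookup", "expr": None,
              "lookup": {"resolver": "prices", "key": "product"}, "function": None},
}
row = apply_ops({"a": 2, "b": 3, "product": "p1"}, ops, {"prices": {"p1": 9.5}}, {})
print(row["total"], row["price"])  # → 5 9.5
```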
- apply_pandera_schema(df, schema, **kwargs)[source]
Apply a Pandera schema and derive df-eval columns from metadata.
This is a thin convenience wrapper around
`df_eval.pandera.apply_pandera_schema` that forwards the current engine instance so registered functions/constants and provenance settings are honored.
- Return type:
DataFrame
- iter_apply_schema_parquet_chunks(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]
Yield transformed chunks from a Parquet file or dataset.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to scan and transform per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Yields:
Transformed DataFrame chunks.
- Return type:
Iterator[DataFrame]
- apply_schema_parquet_to_df(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]
Transform a Parquet dataset chunk-by-chunk and return one DataFrame.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Return type:
  DataFrame
- Returns:
A DataFrame containing all transformed rows. Returns an empty DataFrame when the input yields no row chunks.
- apply_schema_parquet_to_parquet(input_path, output_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None, compression='snappy')[source]
Transform a Parquet dataset chunk-by-chunk and write Parquet output.
This method is optimized for out-of-memory processing: source data is streamed in row chunks, transformed with the same expression engine used for in-memory DataFrames, and written incrementally to
`output_path`.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized `output_path`.
- class df_eval.Expression(expr_str)[source]
Bases: `object`

Represents a parsed expression that can be evaluated on a DataFrame.
- expr_str
The string representation of the expression.
- dependencies
Set of column names referenced in the expression.
- __init__(expr_str)[source]
Initialize an Expression.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- static parse(expr_str)[source]
Parse an expression string into an Expression object.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- Return type:
- Returns:
An Expression object.
- Raises:
ValueError – If the expression is invalid.
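A dependency set like `Expression.dependencies` can be approximated with the stdlib `ast` module: the referenced columns are the free names in the parsed expression, excluding called function names such as `lookup`. This is a hedged sketch, not df-eval's actual parser.

```python
import ast

def expression_dependencies(expr_str):
    """Collect bare names referenced by an expression.

    Names used only as call targets (e.g. helper functions) are excluded,
    leaving the column names the expression depends on."""
    tree = ast.parse(expr_str, mode="eval")
    names, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.add(node.func.id)
        elif isinstance(node, ast.Name):
            names.add(node.id)
    return names - calls

print(sorted(expression_dependencies("safe_divide(a, b) + c")))  # → ['a', 'b', 'c']
```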
- exception df_eval.CycleDetectedError[source]
Bases: `Exception`

Raised when a cycle is detected in column dependencies.
- df_eval.lookup(series, resolver, on_missing='null')[source]
Lookup values using a resolver.
- Parameters:
  - series (`Series`) – The series containing keys to lookup.
  - resolver (`Resolver`) – The resolver to use for lookups.
  - on_missing (`str`) – How to handle missing values ("null", "raise", "keep").
    - "null": Return None/NaN for missing values
    - "raise": Raise an exception for missing values
    - "keep": Keep the original key value
- Return type:
  Series
- Returns:
A series with resolved values.
- Raises:
ValueError – If on_missing is “raise” and a key cannot be resolved.
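The three `on_missing` modes can be illustrated over a plain list of keys; the real `lookup()` operates on a pandas `Series` and a `Resolver`, so this stdlib sketch only mirrors the documented semantics.

```python
def lookup_values(keys, mapping, on_missing="null"):
    """Resolve each key against a dict, handling misses per on_missing."""
    out = []
    for key in keys:
        if key in mapping:
            out.append(mapping[key])
        elif on_missing == "null":
            out.append(None)       # "null": placeholder for missing values
        elif on_missing == "keep":
            out.append(key)        # "keep": pass the original key through
        elif on_missing == "raise":
            raise ValueError(f"cannot resolve key: {key!r}")
        else:
            raise ValueError(f"invalid on_missing: {on_missing!r}")
    return out

prices = {"p1": 9.5}
print(lookup_values(["p1", "p2"], prices))                     # → [9.5, None]
print(lookup_values(["p1", "p2"], prices, on_missing="keep"))  # → [9.5, 'p2']
```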
- class df_eval.CachedResolver(resolver, ttl_seconds=300.0)[source]
Bases: `Resolver`

Resolver with TTL cache support.
- __init__(resolver, ttl_seconds=300.0)[source]
Initialize a cached resolver.
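A TTL cache in the spirit of `CachedResolver` can be sketched with `time.monotonic`: delegate to an inner resolve callable and remember each result until its expiry. The class name and `resolve` signature below are assumptions for illustration, not df-eval's implementation.

```python
import time

class TTLCachedResolver:
    """Wrap a resolve(key) callable and cache results for ttl_seconds."""

    def __init__(self, resolve, ttl_seconds=300.0, clock=time.monotonic):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (expires_at, value)

    def resolve(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                       # still fresh: serve cached
        value = self._resolve(key)              # miss or expired: refetch
        self._cache[key] = (now + self._ttl, value)
        return value

calls = []
def slow_resolve(key):
    calls.append(key)
    return key.upper()

r = TTLCachedResolver(slow_resolve, ttl_seconds=60.0)
r.resolve("a")
r.resolve("a")
print(calls)  # → ['a'] — second call served from cache
```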
- class df_eval.DictResolver(mapping, default=None)[source]
Bases: `Resolver`

Simple dictionary-based resolver.
- __init__(mapping, default=None)[source]
Initialize a dictionary resolver.
- class df_eval.FileResolver(filepath, key_column, value_column)[source]
Bases: `Resolver`

File-based resolver that reads from CSV/JSON files.
- __init__(filepath, key_column, value_column)[source]
Initialize a file resolver.
- class df_eval.DatabaseResolver(connection_string, table, key_column, value_column)[source]
Bases: `Resolver`

Database resolver (placeholder for SQL database lookups).
- __init__(connection_string, table, key_column, value_column)[source]
Initialize a database resolver.
- class df_eval.HTTPResolver(base_url, key_param='key')[source]
Bases: `Resolver`

HTTP API resolver (placeholder for REST API lookups).
- __init__(base_url, key_param='key')[source]
Initialize an HTTP resolver.
- df_eval.df_eval_schema_from_pandera(schema, meta_key='df-eval', expr_key='expr')[source]
Build a df-eval schema mapping from Pandera per-column metadata.
- df_eval.apply_pandera_schema(df, schema, *, meta_key='df-eval', coerce=True, validate=True, validate_post=True, engine=None, error_on_overwrite=True)[source]
Validate with Pandera, apply df-eval operations, then optionally revalidate.
Columns that define df-eval metadata under
`meta_key` are considered derived and are excluded from pre-validation. This allows input frames that do not yet include derived columns.

The df-eval metadata for each column may currently define exactly one of the following keys:

```
{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "my_fn", "inputs": ["a"], "outputs": ["y"]}}
```

These are translated into an operations mapping consumed by `df_eval.engine.Engine.apply_operations()`.
- Return type:
DataFrame
- df_eval.apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, *, meta_key='df-eval', expr_key='expr', engine=None, chunk_size=100000, compression='snappy')[source]
Apply a Pandera-driven schema to Parquet input and write Parquet output.
The input scan is projected to only required source columns, and output columns are restricted to the Pandera schema order.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Any`) – Pandera SchemaModel/DataFrameModel class or DataFrameSchema.
  - meta_key (`str`) – Metadata section containing df-eval expressions.
  - expr_key (`str`) – Metadata key containing the expression text.
  - chunk_size (`int`) – Maximum rows processed per chunk.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized output path.
- df_eval.load_pandera_schema_yaml(source)[source]
Load a Pandera DataFrameSchema from YAML, preserving column metadata.
This is a thin, public wrapper around df-eval’s temporary fork of
`pandera.io.pandas_io`. It exists to work around unionai-oss/pandera#1301, where column `metadata` is not round-tripped by Pandera’s YAML/JSON IO helpers.
- df_eval.dump_pandera_schema_yaml(schema, stream=None)[source]
Dump a Pandera DataFrameSchema to YAML, preserving column metadata.
This uses df-eval’s temporary fork of
`pandera.io.pandas_io` so that column `metadata` survives a full IO round-trip. Once Pandera fixes unionai-oss/pandera#1301, this helper may be simplified to delegate directly to Pandera’s built-in IO.
- Parameters:
- Return type:
- Returns:
The YAML string if `stream` is `None`, otherwise `None`.
- df_eval.load_pandera_schema_json(source)[source]
Load a Pandera DataFrameSchema from JSON, preserving column metadata.
This mirrors
`load_pandera_schema_yaml()` but for JSON input.
- Return type:
- df_eval.dump_pandera_schema_json(schema, target=None, **kwargs)[source]
Dump a Pandera DataFrameSchema to JSON, preserving column metadata.
- Parameters:
- Return type:
- df_eval.iter_parquet_row_chunks(path, *, chunk_size=100000, columns=None)[source]
Yield Parquet rows as pandas DataFrame chunks.
This treats a Parquet file or directory-backed Parquet dataset as an out-of-memory DataFrame and streams it into manageable in-memory chunks.
- Parameters:
- Yields:
DataFrame chunks with at most
`chunk_size` rows.
- Raises:
  - FileNotFoundError – If `path` does not exist.
  - TypeError – If `path`, `chunk_size`, or `columns` have invalid types.
  - ValueError – If `chunk_size` is less than 1.
  - ImportError – If `pyarrow` is not installed.
- Return type:
Iterator[DataFrame]
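The chunking contract (each yielded chunk holds at most `chunk_size` rows; `chunk_size` below 1 is rejected) can be illustrated with a stdlib generator over any row sequence. Actually scanning Parquet requires `pyarrow`; the list here is a stand-in.

```python
def iter_row_chunks(rows, chunk_size=100000):
    """Yield consecutive slices of at most chunk_size rows, mirroring the
    documented chunking and validation behavior."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

chunks = list(iter_row_chunks(list(range(7)), chunk_size=3))
print([len(c) for c in chunks])  # → [3, 3, 1]
```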
- df_eval.write_parquet_row_chunks(chunks, output_path, *, compression='snappy')[source]
Write DataFrame chunks to a Parquet file.
- Parameters:
- Return type:
- Returns:
The normalized output path.
- Raises:
  - TypeError – If `output_path`, `compression`, or chunk values are invalid.
  - ValueError – If `compression` is empty or no chunks are provided.
  - ImportError – If `pyarrow` is not installed.
Expression Module#
Expression parsing and representation module.
This module provides the Expression class for parsing and representing expressions that can be evaluated on pandas DataFrames.
- class df_eval.expr.Expression(expr_str)[source]#
Bases: `object`

Represents a parsed expression that can be evaluated on a DataFrame.
- expr_str#
The string representation of the expression.
- dependencies#
Set of column names referenced in the expression.
- __init__(expr_str)[source]#
Initialize an Expression.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- static parse(expr_str)[source]#
Parse an expression string into an Expression object.
- Parameters:
  expr_str (`str`) – The expression string to parse.
- Return type:
- Returns:
An Expression object.
- Raises:
ValueError – If the expression is invalid.
Engine Module#
Evaluation engine module.
This module provides the Engine class for evaluating expressions on pandas DataFrames with support for UDF registry, schema-driven derived columns with topological ordering, and provenance tracking.
- exception df_eval.engine.CycleDetectedError[source]#
Bases: `Exception`

Raised when a cycle is detected in column dependencies.
- class df_eval.engine.Engine[source]#
Bases: `object`

Engine for evaluating expressions on pandas DataFrames.
The Engine class provides methods to evaluate expressions, apply transformations, and manage UDF/constant registries.
- register_resolver(name, resolver)[source]#
Register a lookup resolver for use in expressions.
Registered resolvers can be referenced by name from expressions via the `lookup()` helper, for example:

```python
engine.register_resolver("prices", price_resolver)
schema = {"price": "lookup(product, prices)"}
```
- register_pipeline_function(name, func)[source]#
Register a named pipeline function for metadata-driven workflows.
Pipeline functions are invoked by higher-level orchestration layers (for example, Pandera-driven schemas) based on column metadata rather than being called directly from df-eval expression strings. A pipeline function typically accepts a
`pandas.DataFrame` slice and optional keyword arguments, and returns either a `Series` or `DataFrame` aligned with the input index.
- Return type:
- evaluate(df, expr, dtype=None)[source]#
Evaluate an expression on a DataFrame.
- Parameters:
  - df (`DataFrame`) – The DataFrame to evaluate the expression on.
  - expr (`str | Expression`) – The expression to evaluate (string or Expression object).
  - dtype (`Optional[str]`) – Optional dtype to cast the result to.
- Return type:
- Returns:
The result of evaluating the expression.
- Raises:
ValueError – If the expression is invalid.
- evaluate_many(df, expressions)[source]#
Evaluate multiple expressions and add them as columns.
This is an alias for apply_schema for batch evaluation.
- Parameters:
  - df (`DataFrame`) – The input DataFrame.
  - expressions (`Dict[str, str | Expression]`) – A dictionary mapping column names to expressions.
- Return type:
  DataFrame
- Returns:
A new DataFrame with the evaluated columns added.
- apply_schema(df, schema, dtypes=None)[source]#
Apply a schema of derived columns to a DataFrame with topological ordering.
This method automatically handles dependencies between columns and detects cycles in the dependency graph.
- Parameters:
- Return type:
  DataFrame
- Returns:
A new DataFrame with the derived columns added.
- Raises:
CycleDetectedError – If a cycle is detected in dependencies.
- apply_operations(df, operations, dtypes=None)[source]#
Apply a set of operations (expr, lookup, function) to a DataFrame.
`operations` is a mapping from column name to a spec with keys:

```
{
    "kind": "expr" | "lookup" | "function",
    "expr": str | None,
    "lookup": dict | None,
    "function": dict | None,
}
```
This is intended to be used by higher-level integrations such as the Pandera helpers, which translate column metadata into this structure.
- Return type:
DataFrame
- apply_pandera_schema(df, schema, **kwargs)[source]#
Apply a Pandera schema and derive df-eval columns from metadata.
This is a thin convenience wrapper around
`df_eval.pandera.apply_pandera_schema` that forwards the current engine instance so registered functions/constants and provenance settings are honored.
- Return type:
DataFrame
- iter_apply_schema_parquet_chunks(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]#
Yield transformed chunks from a Parquet file or dataset.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to scan and transform per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Yields:
Transformed DataFrame chunks.
- Return type:
Iterator[DataFrame]
- apply_schema_parquet_to_df(input_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None)[source]#
Transform a Parquet dataset chunk-by-chunk and return one DataFrame.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
- Return type:
  DataFrame
- Returns:
A DataFrame containing all transformed rows. Returns an empty DataFrame when the input yields no row chunks.
- apply_schema_parquet_to_parquet(input_path, output_path, schema, *, dtypes=None, chunk_size=100000, input_columns=None, output_columns=None, compression='snappy')[source]#
Transform a Parquet dataset chunk-by-chunk and write Parquet output.
This method is optimized for out-of-memory processing: source data is streamed in row chunks, transformed with the same expression engine used for in-memory DataFrames, and written incrementally to
`output_path`.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Dict[str, str | Expression]`) – Mapping of derived column names to expressions.
  - dtypes (`Optional[Dict[str, str]]`) – Optional mapping of derived column names to pandas dtypes.
  - chunk_size (`int`) – Maximum rows to process per chunk.
  - input_columns (`Sequence[str] | None`) – Optional input column projection for scan efficiency.
  - output_columns (`Sequence[str] | None`) – Optional ordered subset of output columns to keep.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized `output_path`.
Functions Module#
Built-in functions for expression evaluation.
This module provides built-in functions that can be used in expressions. These are safe, vectorized functions that are allow-listed for use in expressions.
- df_eval.functions.safe_divide(a, b)[source]#
Safely divide two values, returning NaN for division by zero.
- df_eval.functions.clip_value(value, min_val=None, max_val=None)[source]#
Clip values to a specified range.
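The behavior of these two helpers can be sketched for scalars with the stdlib; df-eval's versions are vectorized over Series, so this is a semantic illustration only.

```python
import math

def safe_divide(a, b):
    """Return NaN instead of raising on division by zero."""
    return math.nan if b == 0 else a / b

def clip_value(value, min_val=None, max_val=None):
    """Clip a scalar into [min_val, max_val]; either bound may be None."""
    if min_val is not None and value < min_val:
        return min_val
    if max_val is not None and value > max_val:
        return max_val
    return value

print(safe_divide(1, 0))      # → nan
print(clip_value(15, 0, 10))  # → 10
```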
Lookup Module#
Lookup functionality with resolvers for external data sources.
This module provides lookup capabilities with various resolvers (database, HTTP, file) and TTL caching.
- class df_eval.lookup.CachedResolver(resolver, ttl_seconds=300.0)[source]#
Bases: `Resolver`

Resolver with TTL cache support.
- class df_eval.lookup.DictResolver(mapping, default=None)[source]#
Bases: `Resolver`

Simple dictionary-based resolver.
- class df_eval.lookup.FileResolver(filepath, key_column, value_column)[source]#
Bases: `Resolver`

File-based resolver that reads from CSV/JSON files.
- class df_eval.lookup.DatabaseResolver(connection_string, table, key_column, value_column)[source]#
Bases: `Resolver`

Database resolver (placeholder for SQL database lookups).
- class df_eval.lookup.HTTPResolver(base_url, key_param='key')[source]#
Bases: `Resolver`

HTTP API resolver (placeholder for REST API lookups).
- df_eval.lookup.lookup(series, resolver, on_missing='null')[source]#
Lookup values using a resolver.
- Parameters:
  - series (`Series`) – The series containing keys to lookup.
  - resolver (`Resolver`) – The resolver to use for lookups.
  - on_missing (`str`) – How to handle missing values ("null", "raise", "keep").
    - "null": Return None/NaN for missing values
    - "raise": Raise an exception for missing values
    - "keep": Keep the original key value
- Return type:
  Series
- Returns:
A series with resolved values.
- Raises:
ValueError – If on_missing is “raise” and a key cannot be resolved.
Pandera Module#
Pandera integration helpers for df-eval.
This module keeps Pandera support optional and layered on top of the core Engine API by translating Pandera column metadata into a df-eval schema map.
- df_eval.pandera.df_eval_schema_from_pandera(schema, meta_key='df-eval', expr_key='expr')[source]#
Build a df-eval schema mapping from Pandera per-column metadata.
- df_eval.pandera.apply_pandera_schema(df, schema, *, meta_key='df-eval', coerce=True, validate=True, validate_post=True, engine=None, error_on_overwrite=True)[source]#
Validate with Pandera, apply df-eval operations, then optionally revalidate.
Columns that define df-eval metadata under
`meta_key` are considered derived and are excluded from pre-validation. This allows input frames that do not yet include derived columns.

The df-eval metadata for each column may currently define exactly one of the following keys:

```
{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "my_fn", "inputs": ["a"], "outputs": ["y"]}}
```

These are translated into an operations mapping consumed by `df_eval.engine.Engine.apply_operations()`.
- Return type:
DataFrame
- df_eval.pandera.apply_pandera_schema_parquet_to_parquet(input_path, output_path, schema, *, meta_key='df-eval', expr_key='expr', engine=None, chunk_size=100000, compression='snappy')[source]#
Apply a Pandera-driven schema to Parquet input and write Parquet output.
The input scan is projected to only required source columns, and output columns are restricted to the Pandera schema order.
- Parameters:
  - input_path (`str | Path`) – Source Parquet file or directory-backed dataset.
  - schema (`Any`) – Pandera SchemaModel/DataFrameModel class or DataFrameSchema.
  - meta_key (`str`) – Metadata section containing df-eval expressions.
  - expr_key (`str`) – Metadata key containing the expression text.
  - chunk_size (`int`) – Maximum rows processed per chunk.
  - compression (`str`) – Parquet compression codec used for output.
- Return type:
- Returns:
The normalized output path.
- df_eval.pandera.df_eval_operations_from_pandera(schema, meta_key='df-eval')[source]#
Build a rich df-eval operations mapping from Pandera column metadata.
Each column may define one of the following under
`metadata[meta_key]`:

```
{"expr": "a + b"}
{"lookup": {"resolver": "prices", "key": "product"}}
{"function": {"name": "churn_model_v1", "inputs": ["age"]}}
```
The returned mapping has the shape:
```
{
    "column_name": {
        "kind": "expr" | "lookup" | "function",
        "expr": str | None,
        "lookup": dict | None,
        "function": dict | None,
    },
}
```
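The translation step can be sketched over plain metadata dicts: pick the single operation key each column defines and normalize it into the shape above. This stdlib sketch stands in for reading real Pandera `Column.metadata`.

```python
def operations_from_metadata(column_metadata, meta_key="df-eval"):
    """Translate per-column metadata dicts into the documented operations
    mapping; columns without a df-eval section are skipped."""
    operations = {}
    for column, metadata in column_metadata.items():
        spec = (metadata or {}).get(meta_key)
        if not spec:
            continue  # plain (non-derived) column
        kinds = [k for k in ("expr", "lookup", "function") if k in spec]
        if len(kinds) != 1:
            raise ValueError(f"{column}: exactly one operation key required")
        operations[column] = {
            "kind": kinds[0],
            "expr": spec.get("expr"),
            "lookup": spec.get("lookup"),
            "function": spec.get("function"),
        }
    return operations

meta = {
    "a": None,
    "total": {"df-eval": {"expr": "a + b"}},
    "price": {"df-eval": {"lookup": {"resolver": "prices", "key": "product"}}},
}
ops = operations_from_metadata(meta)
print(ops["total"]["kind"], ops["price"]["kind"])  # → expr lookup
```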
- df_eval.pandera.load_pandera_schema_yaml(source)[source]#
Load a Pandera DataFrameSchema from YAML, preserving column metadata.
This is a thin, public wrapper around df-eval’s temporary fork of
`pandera.io.pandas_io`. It exists to work around unionai-oss/pandera#1301, where column `metadata` is not round-tripped by Pandera’s YAML/JSON IO helpers.
- df_eval.pandera.dump_pandera_schema_yaml(schema, stream=None)[source]#
Dump a Pandera DataFrameSchema to YAML, preserving column metadata.
This uses df-eval’s temporary fork of
`pandera.io.pandas_io` so that column `metadata` survives a full IO round-trip. Once Pandera fixes unionai-oss/pandera#1301, this helper may be simplified to delegate directly to Pandera’s built-in IO.
- Parameters:
- Return type:
- Returns:
The YAML string if `stream` is `None`, otherwise `None`.
- df_eval.pandera.load_pandera_schema_json(source)[source]#
Load a Pandera DataFrameSchema from JSON, preserving column metadata.
This mirrors
`load_pandera_schema_yaml()` but for JSON input.
- Return type:
Parquet Module#
- df_eval.parquet.iter_parquet_row_chunks(path, *, chunk_size=100000, columns=None)[source]#
Yield Parquet rows as pandas DataFrame chunks.
This treats a Parquet file or directory-backed Parquet dataset as an out-of-memory DataFrame and streams it into manageable in-memory chunks.
- Parameters:
- Yields:
DataFrame chunks with at most
`chunk_size` rows.
- Raises:
  - FileNotFoundError – If `path` does not exist.
  - TypeError – If `path`, `chunk_size`, or `columns` have invalid types.
  - ValueError – If `chunk_size` is less than 1.
  - ImportError – If `pyarrow` is not installed.
- Return type:
Iterator[DataFrame]
- df_eval.parquet.write_parquet_row_chunks(chunks, output_path, *, compression='snappy')[source]#
Write DataFrame chunks to a Parquet file.
- Parameters:
- Return type:
- Returns:
The normalized output path.
- Raises:
  - TypeError – If `output_path`, `compression`, or chunk values are invalid.
  - ValueError – If `compression` is empty or no chunks are provided.
  - ImportError – If `pyarrow` is not installed.