Basic usage

A simple example demonstrating how to use pandera_utils.

import inspect

import pandas as pd
from pathlib import Path
import yaml
from elphick.pandera_utils.pandera_utils import load_schema_from_yaml, DataFrameMetaProcessor

__file__ = Path(inspect.getfile(inspect.currentframe())).resolve()

Load Schema

Load the schema from the YAML file

yaml_path = __file__.parents[1] / "assets/example_schema.yaml"
schema = load_schema_from_yaml(yaml_path)

# Print the YAML file in a nicely formatted way
with open(yaml_path, "r", encoding="utf-8") as f:
    schema_yaml = yaml.safe_load(f)
    print(yaml.dump(schema_yaml, sort_keys=False, indent=2))
columns:
  column1:
    dtype: int
    checks:
      greater_than: 0
    required: false
    coerce: true
    metadata:
      pandera_utils:
        unit_of_measure: m
        aliases:
        - col1
  column2:
    dtype: float
    checks:
      greater_than_or_equal_to: 0.0
    required: false
    metadata:
      pandera_utils:
        decimals: 1
  column3:
    dtype: float
    checks:
      less_than: 100.0
    required: false
    metadata:
      pandera_utils:
        calculation: column1 + column2
        inputs:
        - column1
        - column2
        decimals: 2

Create a sample DataFrame

dataframe = pd.DataFrame({
    "col1": [1, 2, 3],
    "column2": [0.546, 1.568, 2.578],
})

# preserve a copy for comparison later
dataframe_copy = dataframe.copy(deep=True)

dataframe
   col1  column2
0     1    0.546
1     2    1.568
2     3    2.578


Initialize

Initialize the DataFrameMetaProcessor with the schema

processor = DataFrameMetaProcessor(schema)

Rename Aliases

df_with_alias = processor.apply_rename_from_alias(dataframe)
df_with_alias
   column1  column2
0        1    0.546
1        2    1.568
2        3    2.578
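
The renaming step can be pictured as building an alias-to-column mapping from the schema metadata. The following is a minimal sketch under that assumption, not the library's implementation: build_alias_map is a hypothetical helper, and schema_dict mirrors only the parts of the YAML above that matter for aliases.

```python
# Hypothetical illustration only; not how pandera_utils is implemented.
# schema_dict copies the relevant parts of the example YAML shown earlier.
schema_dict = {
    "columns": {
        "column1": {"metadata": {"pandera_utils": {"aliases": ["col1"]}}},
        "column2": {"metadata": {"pandera_utils": {"decimals": 1}}},
    }
}


def build_alias_map(schema: dict) -> dict[str, str]:
    """Map each alias to the canonical column name it stands for."""
    alias_map: dict[str, str] = {}
    for name, spec in schema.get("columns", {}).items():
        # Columns without metadata or without aliases contribute nothing.
        meta = (spec.get("metadata") or {}).get("pandera_utils") or {}
        for alias in meta.get("aliases", []):
            alias_map[alias] = name
    return alias_map


alias_map = build_alias_map(schema_dict)
print(alias_map)
```

A mapping of this shape can be passed to pandas as dataframe.rename(columns=alias_map) to get the renamed frame.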


Apply calculations

df_with_calculations = processor.apply_calculations(df_with_alias)
df_with_calculations
   column1  column2  column3
0        1    0.546    1.546
1        2    1.568    3.568
2        3    2.578    5.578
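
For intuition, a string expression such as the calculation entry in the schema can be evaluated with pandas' DataFrame.eval. The sketch below assumes that approach purely for illustration; whether pandera_utils evaluates calculations this way is not stated here. The names expression and inputs simply copy the metadata for column3.

```python
import pandas as pd

# Values copied from the column3 metadata in the example schema.
expression = "column1 + column2"
inputs = ["column1", "column2"]

df = pd.DataFrame({"column1": [1, 2, 3], "column2": [0.546, 1.568, 2.578]})

# Fail early if an input column is absent, rather than midway through eval.
missing = [col for col in inputs if col not in df.columns]
if missing:
    raise KeyError(f"Missing input columns: {missing}")

# assign() returns a new frame, so the original df is left untouched.
df_calc = df.assign(column3=df.eval(expression))
print(df_calc)
```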


Apply rounding

df_with_decimals = processor.apply_rounding(df_with_calculations)
df_with_decimals
   column1  column2  column3
0        1      0.5     1.55
1        2      1.6     3.57
2        3      2.6     5.58


One Step Preprocessing

Preprocess the DataFrame in one call: aliases are renamed, calculations are applied, and columns are then rounded. If a column's metadata: decimals is not null, the column is rounded to that number of decimal places after the other preprocessing steps. When the round_before_calc argument is set to True, the DataFrame is rounded before the calculations are applied, as well as after.

processed_df = processor.preprocess(dataframe_copy, round_before_calc=False)
processed_df
   column1  column2  column3
0        1      0.5     1.55
1        2      1.6     3.57
2        3      2.6     5.58


We can check that the individual steps are equivalent to the one-step preprocessing:

assert processed_df.equals(df_with_decimals)
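
The effect of round_before_calc is easiest to see by comparing the two orderings in plain pandas. This is a sketch of the ordering described above, not the library's code: run is a hypothetical helper, the calculation is hard-coded as column1 + column2, and decimals copies the per-column values from the example schema.

```python
import pandas as pd

# Per-column decimal places, copied from the example schema metadata.
decimals = {"column2": 1, "column3": 2}


def run(df: pd.DataFrame, round_before_calc: bool) -> pd.DataFrame:
    """Hypothetical stand-in for the calculate-then-round pipeline."""
    out = df.copy()
    if round_before_calc:
        # DataFrame.round ignores keys that are not columns yet (column3).
        out = out.round(decimals)
    out["column3"] = out["column1"] + out["column2"]
    # Rounding always happens after the calculation.
    return out.round(decimals)


raw = pd.DataFrame({"column1": [1, 2, 3], "column2": [0.546, 1.568, 2.578]})
late = run(raw, round_before_calc=False)
early = run(raw, round_before_calc=True)

print(late["column3"].tolist())
print(early["column3"].tolist())
```

With rounding only at the end, column3 matches the 1.55, 3.57, 5.58 shown above. With rounding first, column2 becomes 0.5, 1.6, 2.6 before the sum, so column3 comes out as 1.5, 3.6, 5.6. The choice matters whenever a calculated column is expected to agree exactly with its displayed inputs.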

Validate

Validate the DataFrame using Pandera

validated_df = processor.validate(processed_df)
validated_df
column1 column2 column3
0 1 0.5 1.55
1 2 1.6 3.57
2 3 2.6 5.58

