Calculations
Rather than storing redundant data, it may make sense to store only the essential data and calculate the rest. This can be done in the schema.
The calculation key stores an expression string that pandas evaluates with its eval function. The expression can refer to other columns by name, so a new column can be calculated from the values of existing columns.
In this example, the moisture content of a sample (on a dry basis) is calculated from its wet and dry mass. The inputs key lists the columns that the calculation uses.
columns:
  wet_mass:
    dtype: float32
    coerce: true
    description: 'The mass of the sample when wet'
    checks:
      greater_than: 0
    metadata:
      pandera_utils:
        unit_of_measure: 'kg'
  dry_mass:
    dtype: float32
    coerce: true
    description: 'The mass of the sample when oven dried'
    checks:
      greater_than: 0
    metadata:
      pandera_utils:
        unit_of_measure: 'kg'
  moisture:
    dtype: float32
    coerce: true
    description: 'The percentage of water in the sample'
    checks:
      greater_than: 0
    metadata:
      pandera_utils:
        unit_of_measure: '%'
        calculation: '(wet_mass - dry_mass) / dry_mass * 100'
        inputs: [ 'wet_mass', 'dry_mass' ]
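To see what the calculation string does, it can help to evaluate it by hand with pandas. The following is a minimal sketch, not the library's own implementation: it assumes the expression is passed to DataFrame.eval, and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names match the schema above.
df = pd.DataFrame(
    {"wet_mass": [12.0, 15.0, 9.0], "dry_mass": [8.0, 12.0, 8.0]},
    dtype="float32",
)

# The same expression string that is stored under the calculation key.
calculation = "(wet_mass - dry_mass) / dry_mass * 100"

# DataFrame.eval resolves bare names in the expression to columns of df.
# The result is cast to float32 to match the dtype declared in the schema.
df["moisture"] = df.eval(calculation).astype("float32")

print(df)  # moisture column: 50.0, 25.0, 12.5
```

Because the expression only refers to wet_mass and dry_mass, those are the two columns named under inputs.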
For more information on the expression syntax, see the pandas documentation for eval.