Categoricals
Categoricals enable compact storage of objects (strings). They come with a
learning curve, but they can be very useful.
A categorical column essentially assigns an integer (called cat_codes) to each distinct string, which lets rows of repeated strings
be stored with less memory. The user sees the strings they expect, but the memory overhead is reduced.
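The integer-to-string mapping can be inspected directly in pandas. The following is a minimal sketch using made-up data (the `colors` series is an assumption for illustration, not part of pandera_utils); it shows the codes, the stored categories, and a memory comparison against a plain object column.

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated many times
colors = pd.Series(["R", "G", "B"] * 10_000, dtype=object)
as_category = colors.astype("category")

# Each row stores a small integer code; each distinct string is stored once
codes = as_category.cat.codes
categories = as_category.cat.categories

# Compare memory footprints (deep=True also counts the string payloads)
object_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and platform, but the categorical column should be markedly smaller when the same few strings repeat many times.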
The pandera_utils package lets you attach additional maps to each category. Typical use cases include adding a label or a description.
columns:
  my_color_column:
    dtype: category
    checks:
      isin: ['R', 'G', 'B']
    metadata:
      pandera_utils:
        category:
          add_all_categories: true
          ordered: false
          label:
            map: {'R': 'Red', 'G': 'Green', 'B': 'Blue'}
            dtype: category
          description:
            map: {'R': 'The color red', 'G': 'The color green', 'B': 'The color blue'}
            dtype: category
          wavelength:
            map: {'R': 700, 'G': 546, 'B': 435}
            dtype: int
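To make the idea of a per-category map concrete, here is a rough sketch in plain pandas. It does not use or reproduce the pandera_utils API; the `colors` series is hypothetical, and the two dictionaries simply copy the label and wavelength maps from the schema above.

```python
import pandas as pd

# Hypothetical column holding the raw category values
colors = pd.Series(["R", "G", "B", "G"], dtype="category")

# Maps copied from the schema's label and wavelength entries
label_map = {"R": "Red", "G": "Green", "B": "Blue"}
wavelength_map = {"R": 700, "G": 546, "B": 435}

# Translate each category through its map, then cast to the declared dtype
labels = colors.map(label_map).astype("category")
wavelengths = colors.map(wavelength_map).astype(int)

print(labels.tolist())
print(wavelengths.tolist())
```

How pandera_utils itself exposes these maps is not covered here; the sketch only illustrates what a value-to-label or value-to-number lookup per category amounts to.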
The isin check defines the categories that are allowed in the column, provided that
the add_all_categories key is not set to false. If the key is set to false, only the categories present in the
loaded data define the allowable categories.
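The difference between the two settings can be approximated in plain pandas. This is a sketch under assumptions: `allowed` stands in for the isin list, `loaded` is made-up data that happens to contain no 'B', and the code mimics the described behavior rather than calling pandera_utils.

```python
import pandas as pd

allowed = ["R", "G", "B"]            # values from the isin check
loaded = pd.Series(["R", "G", "R"])  # hypothetical loaded data without "B"

# Comparable to add_all_categories not set to false:
# the categories come from the isin list, even if unused in the data
all_dtype = pd.CategoricalDtype(categories=allowed, ordered=False)
all_categories = loaded.astype(all_dtype)

# Comparable to add_all_categories: false:
# the categories are inferred from the values actually present
data_categories = loaded.astype("category")

print(list(all_categories.cat.categories))   # ['R', 'G', 'B']
print(list(data_categories.cat.categories))  # ['G', 'R']
```

Declaring every allowed category up front keeps the column's dtype identical across files, which matters when concatenating data sets that each contain only some of the values.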
For more information, see the pandas documentation.