Category Feature Analysis
It is common to model across estimation domains by passing the domain as a categorical feature. This example demonstrates how to use the ModelSelection class to compare the performance of the source model, fitted on all the data with the category as a feature, against models fitted independently on each category value.
import logging
from functools import partial
from typing import Dict
import pandas as pd
import plotly
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from elphick.sklearn_viz.model_selection import ModelSelection, metrics
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')
Load Regression Data
We prepare a group variable (a pd.Series) so that we can test the performance of modelling each category independently.
diabetes = load_diabetes(as_frame=True)
x, y = diabetes.data.copy(), diabetes.target
x['sex'] = pd.Categorical(x['sex'].apply(lambda v: 'M' if v < 0 else 'F'))  # assumed mock classes.
y.name = "progression"
xy: pd.DataFrame = pd.concat([x, y], axis=1)
group: pd.Series = x['sex']
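As a quick optional sanity check (not part of the original example), we can confirm that both mock classes are well represented before modelling them independently:
print(group.value_counts())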
Define the pipeline
numerical_cols = x.select_dtypes(include=[float]).columns.to_list()
categorical_cols = x.select_dtypes(include=[object, 'category']).columns.to_list()
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
numerical_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", categorical_preprocessor, categorical_cols),
        ("standard_scaler", numerical_preprocessor, numerical_cols),
    ]
)
pp: Pipeline = make_pipeline(preprocessor)
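Before handing the pipeline to ModelSelection, a minimal optional check using standard scikit-learn API (not part of the original example) is to fit it and list the transformed feature names, confirming that sex is one-hot encoded alongside the scaled numerical columns:
pp.fit(x)  # fit the preprocessing-only pipeline
print(pp.get_feature_names_out())  # one-hot columns for sex plus the scaled numeric columns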
models_to_test: Dict = {'LR': LinearRegression(),
                        'LASSO': LassoCV()}
# Note: the `squared` argument of mean_squared_error was removed in scikit-learn 1.6;
# on newer versions use sklearn.metrics.root_mean_squared_error for the 'rmse' metric.
ms: ModelSelection = ModelSelection(estimators=models_to_test, datasets=xy, target='progression', pre_processor=pp,
                                    k_folds=10, scorer='r2', group=group,
                                    metrics={'r2_score': metrics.r2_score, 'moe': metrics.moe_95,
                                             'rmse': partial(mean_squared_error, squared=False),
                                             'me': metrics.mean_error},
                                    random_state=123)
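Judging from the mean_squared_error partial above, each entry in the metrics mapping is a callable with a (y_true, y_pred) signature returning a float. For illustration, a hypothetical custom metric (not part of the library) could be supplied the same way:
import numpy as np

def median_abs_error(y_true, y_pred) -> float:
    # Hypothetical example metric: median absolute error.
    return float(np.median(np.abs(np.asarray(y_true) - np.asarray(y_pred))))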
Next we’ll view the plot, but we will not (yet) leverage the group variable.
fig = ms.plot(metrics=['moe', 'me'])
fig.update_layout(height=700)
fig
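If you are running this as a plain script rather than in a notebook, the figure can also be written to a standalone HTML file using plotly's standard API:
fig.write_html('model_selection.html')  # optional; not part of the original example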
Now we will re-plot using the group. This is fast, since the fitting metrics were calculated when the first plot was created and do not need to be recalculated.
Plotting by group can provide evidence of whether metrics are consistent across groups.
fig = ms.plot(metrics=['moe', 'me'], show_group=True)
fig.update_layout(height=700)
fig
Categorical Feature Analysis
This analysis tests whether better performance can be achieved by modelling each value of the specified categorical feature separately, rather than passing the feature to a single model.
fig = ms.plot_category_analysis(algorithm='LR')
fig.update_layout(height=700)
plotly.io.show(fig)
We can view more metrics…
fig = ms.plot_category_analysis(algorithm='LR', dataset=None,
                                metrics=['r2_score', 'moe', 'rmse', 'me'],
                                col_wrap=2)
fig.update_layout(height=800)
fig
Info
We can see from the notch positions of the comparative box plots that modelling by group would offer no benefit for either the F or M class.
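For intuition about what the category analysis compares, the following is a minimal, hand-rolled sketch of the same idea, assuming a 10-fold r2 comparison. It is not the ModelSelection internals, just the concept: a pooled model that receives the category as a feature versus independent per-category models.
from sklearn.model_selection import cross_val_score

# Pooled model: 'sex' enters as a one-hot encoded feature via the pipeline.
pooled = make_pipeline(preprocessor, LinearRegression())
print('pooled r2:', cross_val_score(pooled, x, y, cv=10, scoring='r2').mean())

# Independent models: one per category, numerical features only.
for cat in group.cat.categories:
    mask = group == cat
    scores = cross_val_score(make_pipeline(StandardScaler(), LinearRegression()),
                             x.loc[mask, numerical_cols], y[mask],
                             cv=10, scoring='r2')
    print(f'{cat} r2: {scores.mean():.3f}')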