.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/category_feature_analysis.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_category_feature_analysis.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_category_feature_analysis.py:


Category Feature Analysis
=========================

It is common to model across estimation domains using categorical features.  This example demonstrates how to use
the ModelSelection class to compare the performance of the source model against models fitted
independently on the category values.

.. GENERATED FROM PYTHON SOURCE LINES 10-28

.. code-block:: default


    import logging
    from functools import partial
    from typing import Dict

    import pandas as pd
    import plotly
    import sklearn
    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import load_diabetes
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    from elphick.sklearn_viz.model_selection import ModelSelection, metrics

    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                        datefmt='%Y-%m-%dT%H:%M:%S%z')


.. GENERATED FROM PYTHON SOURCE LINES 29-33

Load Regression Data
--------------------

We prepare a `group` variable (a pd.Series) in order to test the performance of modelling independently.

.. GENERATED FROM PYTHON SOURCE LINES 33-41

.. code-block:: default


    diabetes = load_diabetes(as_frame=True)
    x, y = diabetes.data.copy(), diabetes.target
    x['sex'] = pd.Categorical(x['sex'].apply(lambda x: 'M' if x < 0 else 'F'))  # assumed mock classes.
    y.name = "progression"
    xy: pd.DataFrame = pd.concat([x, y], axis=1)
    group: pd.Series = x['sex']


.. GENERATED FROM PYTHON SOURCE LINES 42-44

Define the pipeline
-------------------

.. GENERATED FROM PYTHON SOURCE LINES 44-67

.. code-block:: default


    numerical_cols = x.select_dtypes(include=[float]).columns.to_list()
    categorical_cols = x.select_dtypes(include=[object, 'category']).columns.to_list()

    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    numerical_preprocessor = StandardScaler()
    preprocessor = ColumnTransformer(
        [
            ("one-hot-encoder", categorical_preprocessor, categorical_cols),
            ("standard_scaler", numerical_preprocessor, numerical_cols),
        ]
    )

    pp: Pipeline = make_pipeline(preprocessor)
    models_to_test: Dict = {'LR': sklearn.linear_model.LinearRegression(),
                            'LASSO': sklearn.linear_model.LassoCV()}

    ms: ModelSelection = ModelSelection(estimators=models_to_test, datasets=xy, target='progression',
                                        pre_processor=pp, k_folds=10, scorer='r2', group=group,
                                        metrics={'r2_score': metrics.r2_score, 'moe': metrics.moe_95,
                                                 'rmse': metrics.rmse, 'me': metrics.mean_error},
                                        random_state=123)


.. GENERATED FROM PYTHON SOURCE LINES 68-69

Next we'll view the plot, but we will not (yet) leverage the group variable.

.. GENERATED FROM PYTHON SOURCE LINES 69-74

.. code-block:: default


    fig = ms.plot(metrics=['moe', 'me'])
    fig.update_layout(height=700)
    fig
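
As a quick check of the preprocessing step defined above, the sketch below fits only the
``ColumnTransformer`` on the features and lists the columns it produces.  It uses plain scikit-learn and
pandas; the names ``transformed`` and ``feature_names`` are illustrative and are not part of the original example.

.. code-block:: python

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import load_diabetes
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Rebuild the features exactly as in the example (mock M/F classes from the sign of 'sex').
    x = load_diabetes(as_frame=True).data.copy()
    x['sex'] = pd.Categorical(x['sex'].apply(lambda v: 'M' if v < 0 else 'F'))

    numerical_cols = x.select_dtypes(include=[float]).columns.to_list()
    categorical_cols = x.select_dtypes(include=['category']).columns.to_list()

    preprocessor = ColumnTransformer(
        [
            ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
            ("standard_scaler", StandardScaler(), numerical_cols),
        ]
    )

    # Fit the transformer on its own and inspect the resulting design matrix.
    transformed = preprocessor.fit_transform(x)
    feature_names = preprocessor.get_feature_names_out()

    print(transformed.shape)
    print(list(feature_names))

The categorical column expands into one indicator column per class, while the nine numerical columns are
standardised, so the transformed matrix has one more column than the input frame.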

.. GENERATED FROM PYTHON SOURCE LINES 75-79

Now we re-plot by group.  This is fast because the fitting metrics were calculated when the first plot
was created and do not need to be calculated again.

Plotting by group can (hopefully) provide evidence that the metrics are consistent across groups.

.. GENERATED FROM PYTHON SOURCE LINES 79-84

.. code-block:: default


    fig = ms.plot(metrics=['moe', 'me'], show_group=True)
    fig.update_layout(height=700)
    fig
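
To illustrate what a by-group view summarises, the following sketch computes metrics per group from
cross-validated predictions using plain scikit-learn.  It is not how ``ModelSelection`` calculates its
metrics internally (that is not shown in this example); pooling the out-of-fold predictions, the
``per_group`` table and the sign convention for mean error (predicted minus observed) are assumptions
made for illustration.

.. code-block:: python

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import KFold, cross_val_predict
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    diabetes = load_diabetes(as_frame=True)
    x, y = diabetes.data.copy(), diabetes.target
    x['sex'] = pd.Categorical(x['sex'].apply(lambda v: 'M' if v < 0 else 'F'))
    group = x['sex']

    numerical_cols = x.select_dtypes(include=[float]).columns.to_list()
    preprocessor = ColumnTransformer(
        [
            ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ['sex']),
            ("standard_scaler", StandardScaler(), numerical_cols),
        ]
    )
    model = make_pipeline(preprocessor, LinearRegression())

    # Out-of-fold predictions: every row is predicted by a model that did not see it.
    cv = KFold(n_splits=10, shuffle=True, random_state=123)
    y_pred = pd.Series(cross_val_predict(model, x, y, cv=cv), index=y.index)

    # Summarise the same predictions separately for each group.
    results = pd.DataFrame({'y_true': y, 'y_pred': y_pred, 'group': group})
    rows = {}
    for name, part in results.groupby('group', observed=True):
        rows[name] = {
            'n': len(part),
            'r2': r2_score(part['y_true'], part['y_pred']),
            'mean_error': (part['y_pred'] - part['y_true']).mean(),  # assumed sign convention
        }
    per_group = pd.DataFrame.from_dict(rows, orient='index')
    print(per_group)

Similar values across the rows of ``per_group`` would suggest that a single model serves both groups
about equally well.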

.. GENERATED FROM PYTHON SOURCE LINES 85-90

Categorical Feature Analysis
----------------------------

This analysis tests whether better performance can be achieved by modelling each class of the specified
categorical feature separately, rather than passing that feature to a single model.

.. GENERATED FROM PYTHON SOURCE LINES 90-95

.. code-block:: default


    fig = ms.plot_category_analysis(algorithm='LR')
    fig.update_layout(height=700)
    plotly.io.show(fig)


.. raw:: html
    :file: images/sphx_glr_category_feature_analysis_001.html


.. GENERATED FROM PYTHON SOURCE LINES 96-97

We can view more metrics...

.. GENERATED FROM PYTHON SOURCE LINES 97-104

.. code-block:: default


    fig = ms.plot_category_analysis(algorithm='LR', dataset=None,
                                    metrics=['r2_score', 'moe', 'rmse', 'me'], col_wrap=2)
    fig.update_layout(height=800)
    fig
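
The idea behind the comparison can be sketched with plain scikit-learn.  The code below is an
illustration only, not the implementation behind ``plot_category_analysis``: for each class it scores
one joint model (trained on all rows, with ``sex`` one-hot encoded) on that class's held-out rows, and
compares this with a model trained and scored on that class alone.  The ``comparison`` table and the
fold handling are assumptions, and the two columns use different folds, so the comparison is rough.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.base import clone
    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    diabetes = load_diabetes(as_frame=True)
    x, y = diabetes.data.copy(), diabetes.target
    x['sex'] = pd.Categorical(x['sex'].apply(lambda v: 'M' if v < 0 else 'F'))
    numeric = [c for c in x.columns if c != 'sex']

    # Joint model: the category is passed to the model as a one-hot encoded feature.
    joint = make_pipeline(
        ColumnTransformer(
            [
                ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ['sex']),
                ("standard_scaler", StandardScaler(), numeric),
            ]
        ),
        LinearRegression(),
    )
    # Separate model: fitted on one class only, so the category is not a feature.
    separate = make_pipeline(StandardScaler(), LinearRegression())

    cv = KFold(n_splits=10, shuffle=True, random_state=123)
    rows = []
    for category in x['sex'].cat.categories:
        mask = (x['sex'] == category).to_numpy()

        joint_scores = []
        for train_idx, test_idx in cv.split(x):
            test_cat = test_idx[mask[test_idx]]  # held-out rows belonging to this class
            fitted = clone(joint).fit(x.iloc[train_idx], y.iloc[train_idx])
            joint_scores.append(r2_score(y.iloc[test_cat], fitted.predict(x.iloc[test_cat])))

        separate_scores = cross_val_score(separate, x.loc[mask, numeric], y[mask], cv=cv, scoring='r2')
        rows.append({'category': category,
                     'joint_r2': np.mean(joint_scores),
                     'separate_r2': separate_scores.mean()})

    comparison = pd.DataFrame(rows).set_index('category')
    print(comparison)

If ``separate_r2`` is not clearly higher than ``joint_r2`` for a class, fitting that class independently
offers little benefit over the single joint model.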

.. GENERATED FROM PYTHON SOURCE LINES 105-109

.. admonition:: Info

   We can see from the notch positions of the comparative boxplots that modelling by group would
   offer no benefit for either the F or M class.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  1.349 seconds)


.. _sphx_glr_download_auto_examples_category_feature_analysis.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: category_feature_analysis.py <category_feature_analysis.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: category_feature_analysis.ipynb <category_feature_analysis.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_