Partition Estimator

Sometimes a modeller or subject-matter expert wants to test whether fitting a separate model per estimation domain (data partition) improves results. Each partition is defined by a criteria string, which is used to filter the incoming feature dataframe before the model for that partition is fitted.

The idea behind partitioning is that each partition, having its own specific structure, can be fitted better by a dedicated model. The trade-off is that each partition's model has less data available for fitting.
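
The filtering idea can be sketched without the library. This is a minimal illustration on synthetic data, not the PartitionRegressor implementation: the `partition_criteria` dictionary, the `rooms` column and the `rmse` helper are hypothetical names, and `np.polyfit` stands in for the regressor to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical partition definitions: one pandas query string per domain.
partition_criteria = {"small": "rooms < 3", "large": "rooms >= 3"}

# Synthetic data: the slope of y against rooms differs between the two domains.
rng = np.random.default_rng(0)
x = pd.DataFrame({"rooms": rng.uniform(0, 6, 400)})
signal = np.where(x["rooms"] < 3, x["rooms"], 4.0 * x["rooms"] - 9.0)
y = pd.Series(signal + rng.normal(0, 0.1, len(x)), index=x.index, name="y")

# Fit one straight line per partition, using only the rows its criteria selects.
coefs = {}
for name, criteria in partition_criteria.items():
    rows = x.query(criteria).index
    coefs[name] = np.polyfit(x.loc[rows, "rooms"], y.loc[rows], deg=1)

# Predict by routing each row to the line fitted for its partition.
partition_est = pd.Series(np.nan, index=x.index, name="partition_est")
for name, criteria in partition_criteria.items():
    rows = x.query(criteria).index
    partition_est.loc[rows] = np.polyval(coefs[name], x.loc[rows, "rooms"])

# Baseline: a single line fitted to all rows.
base_est = np.polyval(np.polyfit(x["rooms"], y, deg=1), x["rooms"])


def rmse(actual, estimate):
    """Root-mean-square error between two aligned arrays."""
    return float(np.sqrt(np.mean((actual - estimate) ** 2)))


print(f"base RMSE: {rmse(y, base_est):.3f}")
print(f"partitioned RMSE: {rmse(y, partition_est):.3f}")
```

The synthetic target was built with a different slope in each domain, so the partitioned fit is expected to win here. On real data the gain depends on whether the domains genuinely differ in structure.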

import pandas as pd
import plotly
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils.validation import check_is_fitted

from elphick.sklearn_viz.components.estimators import PartitionRegressor
from elphick.sklearn_viz.features import plot_feature_importance, OutlierDetection
from elphick.sklearn_viz.model_selection import ModelSelection

Load Regression Data

The California housing dataset will be loaded to demonstrate the regression.

x, y = fetch_california_housing(return_X_y=True, as_frame=True)

Remove Outliers

# We will remove the outliers from the dataset.  This is not a necessary step, but it will help to
# demonstrate the partitioning.

od: OutlierDetection = OutlierDetection(x=x, pca_spec=2)
detected_outliers: pd.Series = od.data['outlier']

x = x.query('@detected_outliers == False')
y = y.loc[x.index]

Split the data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
xy_train: pd.DataFrame = pd.concat([x_train, y_train], axis=1)

Define the pipeline

numerical_cols = x.select_dtypes(include=[float]).columns.to_list()
categorical_cols = x.select_dtypes(include=[object, 'category']).columns.to_list()

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
numerical_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", categorical_preprocessor, categorical_cols),
        ("standard_scaler", numerical_preprocessor, numerical_cols),
    ], verbose_feature_names_out=False
)

pp: Pipeline = make_pipeline(preprocessor).set_output(transform='pandas')

Demo Fit and Predict

Both the baseline model (base_mdl) and the partitioned model (partition_mdl) are fitted on the training data and then used to predict the test set. The final step of partition_mdl exposes the fitted domains_ attribute that is used in the sections that follow.

base_mdl.fit(X=x_train, y=y_train)
check_is_fitted(base_mdl)
est_base = pd.Series(base_mdl.predict(X=x_test), index=x_test.index, name='base_est')

partition_mdl.fit(X=x_train, y=y_train)
check_is_fitted(partition_mdl)
est_partition = partition_mdl.predict(X=x_test)

est: pd.DataFrame = pd.concat([est_base, est_partition], axis=1)
est.columns = ['base_est', 'partition_est']
est.head()

       base_est  partition_est
20086  0.803608       0.703745
16347  0.715652       0.642847
13663  2.148621       2.364534
16714  2.515524       2.298311
2163   1.593148       1.431354

Cross Validation

ms: ModelSelection = ModelSelection(estimators={'base-model': base_mdl,
                                                'partition-model': partition_mdl},
                                    datasets=xy_train,
                                    target='MedHouseVal',
                                    group=partition_mdl[-1].domains_, random_state=123)
fig = ms.plot(show_group=True, metrics=['r2_score'])
fig.update_layout(height=600)
plotly.io.show(fig)

To some extent the error margin (notch width) will be driven by the number of samples. Let’s check the sample counts.

partition_mdl[-1].domains_.value_counts()

domains
medium    8490
large     4145
small     3640
Name: count, dtype: int64

Note

In this case the partitioning did not deliver a statistically significant improvement, only a directional one.
For the small class, the error margins appear wider for the partitioned model, which is likely caused by the lower sample count for that class in the fitted model.
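
The link between sample count and error margin can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the cross-validation above: the data are synthetic, the score is an in-sample R² from a straight-line fit, and the `r2_spread` helper and the sample counts are hypothetical.

```python
import numpy as np


def r2_spread(n_samples, n_repeats=200, seed=0):
    """Standard deviation of R² over repeated draws of n_samples points."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        # Same underlying relationship every time; only the sample changes.
        x = rng.normal(size=n_samples)
        y = 2.0 * x + rng.normal(scale=2.0, size=n_samples)
        slope, intercept = np.polyfit(x, y, deg=1)
        residual = y - (slope * x + intercept)
        scores.append(1.0 - residual.var() / y.var())
    return float(np.std(scores))


# Arbitrary sample counts, chosen only to show the trend.
spreads = {n: r2_spread(n) for n in (500, 2000, 8000)}
for n, spread in spreads.items():
    print(f"n={n:>5}: std of r2 = {spread:.4f}")
```

The spread of the score shrinks as the sample count grows, which is consistent with the smallest partition showing the widest margin.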

Feature Importance

We can check that feature importance works as expected on our new estimator.

fig = plot_feature_importance(partition_mdl, permute=True, x_test=x_train, y_test=y_train)
fig
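
The permute=True argument suggests a permutation-based importance. How plot_feature_importance computes it is not shown here, so the following is only a sketch of the general technique on synthetic data: shuffle one feature at a time and record how much the score drops. The data, the least-squares model and the `r2` helper are all assumptions made for illustration. scikit-learn provides a library counterpart in `sklearn.inspection.permutation_importance`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Synthetic features: the target depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit ordinary least squares with an intercept column.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def r2(features):
    """R² of the fitted model on the given feature matrix."""
    pred = np.column_stack([features, np.ones(len(features))]) @ coef
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


baseline = r2(X)
importance = []
for col in range(X.shape[1]):
    drops = []
    for _ in range(10):
        # Shuffle a single column to break its link with the target.
        X_perm = X.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - r2(X_perm))
    importance.append(float(np.mean(drops)))

for col, value in enumerate(importance):
    print(f"feature {col}: mean drop in r2 = {value:.3f}")
```

A feature the model relies on heavily produces a large drop when shuffled, while an irrelevant feature produces a drop close to zero.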

