Partition Estimator

There are times when the modeller or subject-matter expert wants to test estimation domains (data partitions). Partitions are defined by setting a criteria string per estimation domain; the criteria string filters the incoming feature dataframe before the model for that partition is fitted.

The idea supporting partitioning is that each partition, having a specific structure, can be fitted more accurately. The trade-off, of course, is that once partitioned, less data is available for fitting each model.
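The filtering mechanism can be sketched with plain pandas: each partition name maps to a query string, and the data for that partition is selected before fitting. The dataframe and criteria below are hypothetical, for illustration only.

```python
import pandas as pd

# Illustrative data and criteria strings; each partition is the subset of
# rows matched by its query expression.
df = pd.DataFrame({"x": [1.0, 3.0, 5.0, 7.0], "y": [2.0, 6.0, 10.0, 14.0]})
partition_defs = {"low": "x < 4", "high": "x >= 4"}

# Filter the dataframe once per partition, exactly as a partitioned
# estimator would before fitting each sub-model.
partitions = {name: df.query(expr) for name, expr in partition_defs.items()}
print({name: len(part) for name, part in partitions.items()})  # {'low': 2, 'high': 2}
```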

import pandas as pd
import plotly
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils.validation import check_is_fitted

from elphick.sklearn_viz.components.estimators import PartitionRegressor
from elphick.sklearn_viz.features import plot_feature_importance, OutlierDetection
from elphick.sklearn_viz.model_selection import ModelSelection

Load Regression Data

The California housing dataset will be loaded to demonstrate the regression.

x, y = fetch_california_housing(return_X_y=True, as_frame=True)

Remove Outliers

# We will remove the outliers from the dataset.  This is not a necessary step, but it will help to
# demonstrate the partitioning.

od: OutlierDetection = OutlierDetection(x=x, pca_spec=2)
detected_outliers: pd.Series = od.data['outlier']

x = x.query('@detected_outliers == False')
y = y.loc[x.index]

Split the data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
xy_train: pd.DataFrame = pd.concat([x_train, y_train], axis=1)

Define the pipeline

numerical_cols = x.select_dtypes(include=[float]).columns.to_list()
categorical_cols = x.select_dtypes(include=[object, 'category']).columns.to_list()

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
numerical_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", categorical_preprocessor, categorical_cols),
        ("standard_scaler", numerical_preprocessor, numerical_cols),
    ], verbose_feature_names_out=False
)

pp: Pipeline = make_pipeline(preprocessor).set_output(transform='pandas')

Define the Estimators

Baseline Model

# The baseline model will simply be fitted as normal - no partitions will be applied.

base_mdl: Pipeline = make_pipeline(pp, LinearRegression())
base_mdl
Pipeline(steps=[('pipeline',
                 Pipeline(steps=[('columntransformer',
                                  ColumnTransformer(transformers=[('one-hot-encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False),
                                                                   []),
                                                                  ('standard_scaler',
                                                                   StandardScaler(),
                                                                   ['MedInc',
                                                                    'HouseAge',
                                                                    'AveRooms',
                                                                    'AveBedrms',
                                                                    'Population',
                                                                    'AveOccup',
                                                                    'Latitude',
                                                                    'Longitude'])],
                                                    verbose_feature_names_out=False))])),
                ('linearregression', LinearRegression())])


Partitioned Model

We will create criteria for three arbitrary partitions of average room count, split at the lower and upper quartiles. We’d like to work in the incoming feature space, but the PartitionRegressor needs the criteria in the post-processed space, since that is the only data it sees.

x_train['AveRooms'].describe()
count    16275.000000
mean         5.429862
std          2.514777
min          0.846154
25%          4.440403
50%          5.227139
75%          6.054429
max        141.909091
Name: AveRooms, dtype: float64
partition_criteria: dict = {'small': 'AveRooms < 4.4',
                            'medium': '(AveRooms >= 4.4) and (AveRooms < 6.0)',
                            'large': 'AveRooms >= 6.0'}

For now, this conversion must be done manually by the user.

pp.fit_transform(x_train)['AveRooms'].describe()
count    1.627500e+04
mean     3.173976e-16
std      1.000031e+00
min     -1.822766e+00
25%     -3.934702e-01
50%     -8.061541e-02
75%      2.483664e-01
max      5.427257e+01
Name: AveRooms, dtype: float64
partition_criteria: dict = {'small': 'AveRooms < -0.43',
                            'medium': '(AveRooms >= -0.43) and (AveRooms < 0.24)',
                            'large': 'AveRooms >= 0.24'}
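Rather than reading the cut points off the describe() output, the conversion can be done from the fitted StandardScaler's parameters, since z = (x - mean) / scale. The data below is a small illustrative stand-in, not the housing dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data standing in for AveRooms.
ave_rooms = np.array([[2.0], [4.0], [6.0], [8.0]])
scaler = StandardScaler().fit(ave_rooms)

# Convert raw-space cut points into the standardised space using the
# fitted scaler's mean_ and scale_ attributes.
raw_cuts = [4.4, 6.0]
scaled_cuts = [(c - scaler.mean_[0]) / scaler.scale_[0] for c in raw_cuts]
print([round(c, 2) for c in scaled_cuts])  # [-0.27, 0.45]
```

The resulting values are exactly what `scaler.transform` would produce for the same cut points, so they can be substituted straight into the criteria strings.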

partition_mdl: Pipeline = make_pipeline(pp, PartitionRegressor(LinearRegression(),
                                                               partition_defs=partition_criteria))
partition_mdl
Pipeline(steps=[('pipeline',
                 Pipeline(steps=[('columntransformer',
                                  ColumnTransformer(transformers=[('one-hot-encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False),
                                                                   []),
                                                                  ('standard_scaler',
                                                                   StandardScaler(),
                                                                   ['MedInc',
                                                                    'HouseAge',
                                                                    'AveRooms',
                                                                    'AveBedrms',
                                                                    'Population',
                                                                    'AveOccup',
                                                                    'Latitude',
                                                                    'Longitude'])],
                                                    verbose_feature_names_out=False))])),
                ('partitionregressor',
                 PartitionRegressor(estimator=LinearRegression(),
                                    partition_defs={'large': 'AveRooms >= 0.24',
                                                    'medium': '(AveRooms >= '
                                                              '-0.43) and '
                                                              '(AveRooms < '
                                                              '0.24)',
                                                    'small': 'AveRooms < '
                                                             '-0.43'}))])


In the model visualisation above, expand the arrow next to the partition names (small, medium, large) to see the criteria.

Tip

A trick to avoid transforming the partition criteria values is to embed the preprocessor into every model, rather than having a common preprocessor. This adds computational overhead, but it keeps the criteria in the raw feature space and simplifies the workflow.
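The tip above can be sketched without the PartitionRegressor itself: a hand-rolled loop that fits one pipeline per partition, where each pipeline owns its own scaler, so the criteria stay in raw units. The data and criteria here are small illustrative stand-ins.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative raw-space data and criteria.
X = pd.DataFrame({"AveRooms": [3.0, 3.5, 5.0, 5.5, 7.0, 7.5]})
y = pd.Series([1.0, 1.2, 2.0, 2.2, 3.0, 3.2])
partition_defs = {"small": "AveRooms < 4.4", "large": "AveRooms >= 4.4"}

# Because each partition model embeds its own StandardScaler, the query
# strings are evaluated against untransformed features.
models = {}
for name, expr in partition_defs.items():
    X_part = X.query(expr)
    models[name] = make_pipeline(StandardScaler(), LinearRegression()).fit(
        X_part, y.loc[X_part.index]
    )
print(sorted(models))  # ['large', 'small']
```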

Demo Fit and Predict

base_mdl.fit(X=x_train, y=y_train)
check_is_fitted(base_mdl)
est_base = pd.Series(base_mdl.predict(X=x_test), index=x_test.index, name='base_est')

partition_mdl.fit(X=x_train, y=y_train)
check_is_fitted(partition_mdl)
est_partition = partition_mdl.predict(X=x_test)

est: pd.DataFrame = pd.concat([est_base, est_partition], axis=1)
est.columns = ['base_est', 'partition_est']
est.head()
base_est partition_est
20086 0.803608 0.703745
16347 0.715652 0.642847
13663 2.148621 2.364534
16714 2.515524 2.298311
2163 1.593148 1.431354


Cross Validation

ms: ModelSelection = ModelSelection(estimators={'base-model': base_mdl,
                                                'partition-model': partition_mdl},
                                    datasets=xy_train,
                                    target='MedHouseVal',
                                    group=partition_mdl[-1].domains_, random_state=123)
fig = ms.plot(show_group=True, metrics=['r2_score'])
fig.update_layout(height=600)
plotly.io.show(fig)

To some extent the error margin (notch width) will be driven by the number of samples. Let’s check the sample counts.

partition_mdl[-1].domains_.value_counts()
domains
medium    8490
large     4145
small     3640
Name: count, dtype: int64

Note

  1. In this case the partitioning did not deliver a statistically significant improvement, only a directional one.

  2. It appears that, for the small class, the error margins are wider for the partitioned model, caused by the lower sample count for that class in the fitted model.
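The link between notch width and sample count can be made concrete. A common convention (McGill et al., as used by matplotlib; whether plotly follows it exactly is an assumption here) computes the notch half-width as 1.57 · IQR / √n, so a smaller partition gets a wider notch even when the score spread is similar:

```python
import numpy as np

# Synthetic scores with the same sample counts as the medium and small
# partitions above; the spread (IQR) is essentially identical for both.
rng = np.random.default_rng(42)
scores = rng.normal(size=8490)

def notch_half_width(values: np.ndarray) -> float:
    # Conventional notch half-width: 1.57 * IQR / sqrt(n).
    q1, q3 = np.percentile(values, [25, 75])
    return 1.57 * (q3 - q1) / np.sqrt(len(values))

# Fewer samples -> wider notch, by roughly sqrt(8490 / 3640) ~ 1.5x.
print(notch_half_width(scores) < notch_half_width(scores[:3640]))  # True
```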

Feature Importance

We can check that feature importance works as expected on our new estimator.

fig = plot_feature_importance(partition_mdl, permute=True, x_test=x_train, y_test=y_train)
fig


Total running time of the script: ( 0 minutes 5.780 seconds)

Gallery generated by Sphinx-Gallery