.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/partition_regressor.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_partition_regressor.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_partition_regressor.py:


Partition Estimator
===================

There are times when the modeller or subject-matter expert wants to test estimation domains
(data partitions). Each partition is defined by a criteria string, one per estimation domain.
The criteria string is used to filter the incoming feature dataframe before the model for that
partition is fitted.

The idea behind partitioning is that each partition, having a specific structure, can be fitted better.
The trade-off, of course, is that once the data is partitioned, less data is available for fitting each model.

.. GENERATED FROM PYTHON SOURCE LINES 13-28

.. code-block:: default

    import pandas as pd
    import plotly
    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_california_housing
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.utils.validation import check_is_fitted

    from elphick.sklearn_viz.components.estimators import PartitionRegressor
    from elphick.sklearn_viz.features import plot_feature_importance, OutlierDetection
    from elphick.sklearn_viz.model_selection import ModelSelection

.. GENERATED FROM PYTHON SOURCE LINES 29-33

Load Regression Data
--------------------

The California housing dataset will be loaded to demonstrate the regression.

.. GENERATED FROM PYTHON SOURCE LINES 33-36

.. code-block:: default

    x, y = fetch_california_housing(return_X_y=True, as_frame=True)

.. GENERATED FROM PYTHON SOURCE LINES 37-39

Remove Outliers
---------------

.. GENERATED FROM PYTHON SOURCE LINES 39-49

.. code-block:: default

    # We will remove the outliers from the dataset. This is not a necessary step, but it will help to
    # demonstrate the partitioning.

    od: OutlierDetection = OutlierDetection(x=x, pca_spec=2)
    detected_outliers: pd.Series = od.data['outlier']
    x = x.query('@detected_outliers == False')
    y = y.loc[x.index]

.. GENERATED FROM PYTHON SOURCE LINES 50-52

Split the data
--------------

.. GENERATED FROM PYTHON SOURCE LINES 52-56

.. code-block:: default

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    xy_train: pd.DataFrame = pd.concat([x_train, y_train], axis=1)

.. GENERATED FROM PYTHON SOURCE LINES 57-59

Define the pipeline
-------------------

.. GENERATED FROM PYTHON SOURCE LINES 59-73

.. code-block:: default

    numerical_cols = x.select_dtypes(include=[float]).columns.to_list()
    categorical_cols = x.select_dtypes(include=[object, 'category']).columns.to_list()

    categorical_preprocessor = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    numerical_preprocessor = StandardScaler()
    preprocessor = ColumnTransformer(
        [
            ("one-hot-encoder", categorical_preprocessor, categorical_cols),
            ("standard_scaler", numerical_preprocessor, numerical_cols),
        ], verbose_feature_names_out=False
    )

    pp: Pipeline = make_pipeline(preprocessor).set_output(transform='pandas')

.. GENERATED FROM PYTHON SOURCE LINES 74-76

Define the Estimators
---------------------

.. GENERATED FROM PYTHON SOURCE LINES 78-81

Baseline Model
~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 81-87

.. code-block:: default

    # The baseline model will simply be fitted as normal - no partitions will be applied.

    base_mdl: Pipeline = make_pipeline(pp, LinearRegression())
    base_mdl

.. raw:: html

Pipeline(steps=[('pipeline',
                     Pipeline(steps=[('columntransformer',
                                      ColumnTransformer(transformers=[('one-hot-encoder',
                                                                       OneHotEncoder(handle_unknown='ignore',
                                                                                     sparse_output=False),
                                                                       []),
                                                                      ('standard_scaler',
                                                                       StandardScaler(),
                                                                       ['MedInc',
                                                                        'HouseAge',
                                                                        'AveRooms',
                                                                        'AveBedrms',
                                                                        'Population',
                                                                        'AveOccup',
                                                                        'Latitude',
                                                                        'Longitude'])],
                                                        verbose_feature_names_out=False))])),
                    ('linearregression', LinearRegression())])


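To make the idea of an estimation domain concrete before the partitioned model is defined,
the following is a minimal sketch of per-partition fitting. It is not the ``PartitionRegressor``
implementation: the synthetic data, the column names (``rooms``, ``income``, ``price``) and the two
criteria strings are assumptions made purely for illustration. Each criteria string filters the
dataframe with ``DataFrame.query``, and one ``LinearRegression`` is fitted per filtered subset.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # Synthetic data (illustrative only): the effect of income on price
    # differs between two regimes of the 'rooms' feature.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({'rooms': rng.uniform(2, 8, 400),
                       'income': rng.uniform(1, 10, 400)})
    slope = np.where(df['rooms'] < 5, 1.0, 3.0)
    df['price'] = slope * df['income'] + rng.normal(0, 0.1, 400)

    # Hypothetical criteria strings, one per estimation domain.
    criteria = {'small': 'rooms < 5', 'large': 'rooms >= 5'}
    features = ['rooms', 'income']

    # Fit one estimator per partition, using only the rows its criteria selects.
    models = {}
    for name, expr in criteria.items():
        part = df.query(expr)
        models[name] = LinearRegression().fit(part[features], part['price'])

    # Predict by routing each row to the model of the partition it falls in.
    pred = pd.Series(np.nan, index=df.index, name='partition_est')
    for name, expr in criteria.items():
        idx = df.query(expr).index
        pred.loc[idx] = models[name].predict(df.loc[idx, features])

    # Compare against a single model fitted on all rows.
    global_mdl = LinearRegression().fit(df[features], df['price'])
    r2_global = r2_score(df['price'], global_mdl.predict(df[features]))
    r2_partition = r2_score(df['price'], pred)

Because each partition gets its own least-squares fit, the in-sample R² of the partitioned
predictions cannot be lower than that of the single global fit. Whether the gain survives out of
sample is a separate question, which the Cross Validation section of this example addresses.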
.. GENERATED FROM PYTHON SOURCE LINES 88-94

Partitioned Model
~~~~~~~~~~~~~~~~~

We will create the criteria for three arbitrary partitions of room size, split near the lower and upper quartiles.
We'd like to work in the incoming feature space, but the PartitionRegressor needs the criteria in the
post-processed space, since that is the only data it sees.

.. GENERATED FROM PYTHON SOURCE LINES 94-97

.. code-block:: default

    x_train['AveRooms'].describe().T

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    count    16275.000000
    mean         5.429862
    std          2.514777
    min          0.846154
    25%          4.440403
    50%          5.227139
    75%          6.054429
    max        141.909091
    Name: AveRooms, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 98-102

.. code-block:: default

    partition_criteria: dict = {'small': 'AveRooms < 4.4',
                                'medium': '(AveRooms >= 4.4) and (AveRooms < 6.0)',
                                'large': 'AveRooms >= 6.0'}

.. GENERATED FROM PYTHON SOURCE LINES 103-104

For now, this conversion must be done manually by the user.

.. GENERATED FROM PYTHON SOURCE LINES 104-106

.. code-block:: default

    pp.fit_transform(x_train)['AveRooms'].describe()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    count    1.627500e+04
    mean     3.173976e-16
    std      1.000031e+00
    min     -1.822766e+00
    25%     -3.934702e-01
    50%     -8.061541e-02
    75%      2.483664e-01
    max      5.427257e+01
    Name: AveRooms, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 107-115

.. code-block:: default

    partition_criteria: dict = {'small': 'AveRooms < -0.43',
                                'medium': '(AveRooms >= -0.43) and (AveRooms < 0.24)',
                                'large': 'AveRooms >= 0.24'}

    partition_mdl: Pipeline = make_pipeline(pp, PartitionRegressor(LinearRegression(),
                                                                   partition_defs=partition_criteria))
    partition_mdl

.. raw:: html

Pipeline(steps=[('pipeline',
                     Pipeline(steps=[('columntransformer',
                                      ColumnTransformer(transformers=[('one-hot-encoder',
                                                                       OneHotEncoder(handle_unknown='ignore',
                                                                                     sparse_output=False),
                                                                       []),
                                                                      ('standard_scaler',
                                                                       StandardScaler(),
                                                                       ['MedInc',
                                                                        'HouseAge',
                                                                        'AveRooms',
                                                                        'AveBedrms',
                                                                        'Population',
                                                                        'AveOccup',
                                                                        'Latitude',
                                                                        'Longitude'])],
                                                        verbose_feature_names_out=False))])),
                    ('partitionregressor',
                     PartitionRegressor(estimator=LinearRegression(),
                                        partition_defs={'large': 'AveRooms >= 0.24',
                                                        'medium': '(AveRooms >= '
                                                                  '-0.43) and '
                                                                  '(AveRooms < '
                                                                  '0.24)',
                                                        'small': 'AveRooms < '
                                                                 '-0.43'}))])


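The manual conversion of thresholds from the incoming feature space to the post-processed space can
also be computed rather than read off a ``describe()`` table. The sketch below assumes the feature is
standardised by a ``StandardScaler``, in which case a raw threshold ``t`` maps to
``(t - mean_) / scale_``. The five-row dataframe and the ``to_scaled`` helper are hypothetical and
exist only for illustration.

.. code-block:: python

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Small stand-in for the AveRooms column (values are illustrative only).
    x = pd.DataFrame({'AveRooms': [3.0, 4.0, 5.0, 6.0, 7.0]})

    # The fitted scaler stores the per-column mean and standard deviation.
    scaler = StandardScaler().fit(x)
    mean = scaler.mean_[0]
    scale = scaler.scale_[0]


    def to_scaled(threshold: float) -> float:
        """Map a threshold in the raw feature space to the standardised space."""
        return (threshold - mean) / scale


    # Thresholds chosen in the raw space, then expressed in the scaled space.
    lower, upper = to_scaled(4.4), to_scaled(6.0)
    partition_criteria = {
        'small': f'AveRooms < {lower:.4f}',
        'medium': f'(AveRooms >= {lower:.4f}) and (AveRooms < {upper:.4f})',
        'large': f'AveRooms >= {upper:.4f}',
    }

In the pipeline used in this example, the fitted scaler should be reachable after fitting via
``pp[0].named_transformers_['standard_scaler']``, although the column order within the scaler would
need to be checked before indexing ``mean_`` and ``scale_``.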
.. GENERATED FROM PYTHON SOURCE LINES 116-118

In the model visualisation above, expand the arrow next to the partition names (small, medium, large)
to see the criteria.

.. GENERATED FROM PYTHON SOURCE LINES 120-125

.. tip::

   A trick to avoid transforming the partition criteria values is to embed the preprocessor
   into every model, rather than having a common preprocessor. This will create additional
   computational overhead but is perhaps a nice way of simplifying the workflow.

.. GENERATED FROM PYTHON SOURCE LINES 129-131

Demo Fit and Predict
--------------------

.. GENERATED FROM PYTHON SOURCE LINES 131-144

.. code-block:: default

    base_mdl.fit(X=x_train, y=y_train)
    check_is_fitted(base_mdl)
    est_base = pd.Series(base_mdl.predict(X=x_test), index=x_test.index, name='base_est')

    partition_mdl.fit(X=x_train, y=y_train)
    check_is_fitted(partition_mdl)
    est_partition = partition_mdl.predict(X=x_test)

    est: pd.DataFrame = pd.concat([est_base, est_partition], axis=1)
    est.columns = ['base_est', 'partition_est']
    est.head()

.. raw:: html

base_est partition_est
20086 0.803608 0.703745
16347 0.715652 0.642847
13663 2.148621 2.364534
16714 2.515524 2.298311
2163 1.593148 1.431354


.. GENERATED FROM PYTHON SOURCE LINES 145-147

Cross Validation
----------------

.. GENERATED FROM PYTHON SOURCE LINES 147-156

.. code-block:: default

    ms: ModelSelection = ModelSelection(estimators={'base-model': base_mdl,
                                                    'partition-model': partition_mdl},
                                        datasets=xy_train,
                                        target='MedHouseVal',
                                        group=partition_mdl[-1].domains_, random_state=123)
    fig = ms.plot(show_group=True, metrics=['r2_score'])
    fig.update_layout(height=600)
    plotly.io.show(fig)

.. raw:: html
    :file: images/sphx_glr_partition_regressor_001.html

.. GENERATED FROM PYTHON SOURCE LINES 157-159

To some extent the error margin (notch width) will be driven by the number of samples.
Let's check the sample counts.

.. GENERATED FROM PYTHON SOURCE LINES 159-162

.. code-block:: default

    partition_mdl[-1].domains_.value_counts()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    domains
    medium    8490
    large     4145
    small     3640
    Name: count, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 163-168

.. note::

   1. In this case the partitioning did not deliver a statistically significant improvement,
      only a directional one.
   2. For the small class, the error margins appear wider for the partitioned model, caused by
      the lower sample count for that class in the fitted model.

.. GENERATED FROM PYTHON SOURCE LINES 171-175

Feature Importance
------------------

We can check that feature importance works as expected on our new Estimator.

.. GENERATED FROM PYTHON SOURCE LINES 175-178

.. code-block:: default

    fig = plot_feature_importance(partition_mdl, permute=True, x_test=x_train, y_test=y_train)
    fig

.. raw:: html



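For readers unfamiliar with permutation-based importance, the sketch below shows the general technique
using scikit-learn's ``permutation_importance``. It assumes that ``permute=True`` in the call above
refers to permutation importance in this usual sense, which the example itself does not confirm. The
synthetic data and the plain ``LinearRegression`` are stand-ins, not the partitioned model.

.. code-block:: python

    import numpy as np
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import LinearRegression

    # Synthetic data (illustrative only): the first column matters most,
    # the second a little, and the third is pure noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))
    y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

    mdl = LinearRegression().fit(X, y)

    # Shuffle each column in turn and record the drop in the model's score.
    result = permutation_importance(mdl, X, y, n_repeats=10, random_state=0)
    importances = result.importances_mean

    # Column indices ordered from most to least important.
    ranking = np.argsort(importances)[::-1]

Because the technique only needs ``predict`` (or ``score``) on a fitted estimator, it applies equally
to a composite estimator such as a pipeline.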
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 5.780 seconds)


.. _sphx_glr_download_auto_examples_partition_regressor.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: partition_regressor.py <partition_regressor.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: partition_regressor.ipynb <partition_regressor.ipynb>`


.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_