Learning Curve

This example demonstrates the learning curve, which helps answer two questions:

  1. Is my model over-fitted?

  2. Will my model benefit from more data?

Code has been adapted from the sklearn example.

This machinelearningmastery article is a great resource for interpreting learning curves.

import logging

import plotly
from sklearn.datasets import load_digits, load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler

from elphick.sklearn_viz.model_selection import LearningCurve, plot_learning_curve, metrics

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')

Load Data

X, y = load_digits(return_X_y=True)

Create a Classifier Pipeline

The pipeline will likely include some pre-processing.

pipe: Pipeline = make_pipeline(StandardScaler(), GaussianNB()).set_output(transform='pandas')
pipe
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gaussiannb', GaussianNB())])
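For context, the scores behind a learning curve can be computed directly with scikit-learn's learning_curve function; the plotting helpers used below present this kind of output. A minimal sketch using the pipeline defined above (the cv and n_jobs values are illustrative):

from sklearn.model_selection import learning_curve

# Training and validation scores at a range of increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(pipe, X, y, cv=5, n_jobs=-1)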


Plot using the function

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
fig = plot_learning_curve(pipe, x=X, y=y, cv=cv)
fig.update_layout(height=600)
# noinspection PyTypeChecker
plotly.io.show(fig)
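
If you need the figure outside a notebook, the Plotly figure can also be written to a standalone HTML file (the filename here is illustrative):

# Save the interactive figure as a standalone HTML file.
fig.write_html('learning_curve.html')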

Plot using the object

Plotting using the object allows access to the underlying data.

Tip

You can use n_jobs to parallelize the computation.

lc: LearningCurve = LearningCurve(pipe, x=X, y=y, cv=5, n_jobs=5)
fig = lc.plot(title='Learning Curve').update_layout(height=600)
fig


View the data

lc.results
LearningCurveResult(training_scores=array([[0.97902098, 0.98601399, 0.98601399, 0.98601399, 0.98601399],
       [0.85010707, 0.87366167, 0.91648822, 0.91648822, 0.91648822],
       [0.85063291, 0.8556962 , 0.86582278, 0.81392405, 0.81392405],
       [0.87780773, 0.88679245, 0.87241689, 0.79964061, 0.79245283],
       [0.82602644, 0.84620738, 0.83716075, 0.78705637, 0.84133612]]), validation_scores=array([[0.70277778, 0.59722222, 0.56267409, 0.64345404, 0.55153203],
       [0.69722222, 0.71388889, 0.7632312 , 0.84122563, 0.76044568],
       [0.69722222, 0.73055556, 0.78272981, 0.80779944, 0.76044568],
       [0.76388889, 0.75277778, 0.77715877, 0.80779944, 0.72423398],
       [0.75277778, 0.74444444, 0.77715877, 0.7994429 , 0.76044568]]), training_sizes=array([ 143,  467,  790, 1113, 1437]), metrics=None)

Results as a dataframe

df = lc.results.get_results()
df.head(10)
   train_count_143  train_count_467  train_count_790  train_count_1113  train_count_1437     dataset
0         0.979021         0.850107         0.850633          0.877808          0.826026    training
1         0.986014         0.873662         0.855696          0.886792          0.846207    training
2         0.986014         0.916488         0.865823          0.872417          0.837161    training
3         0.986014         0.916488         0.813924          0.799641          0.787056    training
4         0.986014         0.916488         0.813924          0.792453          0.841336    training
5         0.702778         0.697222         0.697222          0.763889          0.752778  validation
6         0.597222         0.713889         0.730556          0.752778          0.744444  validation
7         0.562674         0.763231         0.782730          0.777159          0.777159  validation
8         0.643454         0.841226         0.807799          0.807799          0.799443  validation
9         0.551532         0.760446         0.760446          0.724234          0.760446  validation
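
With one train_count_* column per training-set size and a dataset column distinguishing training from validation rows, the curves can be summarised directly from this dataframe; a minimal sketch:

# Mean and spread of the cross-validation scores per training-set size,
# split into training and validation rows.
df.groupby('dataset').agg(['mean', 'std'])

A persistent gap between the training and validation means points to over-fitting, while a validation mean that is still rising at the largest training size suggests the model would benefit from more data.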


Regressor Learning Curve

This example uses a regression model.

diabetes = load_diabetes(as_frame=True)
X, y = diabetes.data, diabetes.target
y.name = "progression"

pipe: Pipeline = make_pipeline(StandardScaler(), RidgeCV()).set_output(transform='pandas')
pipe
Pipeline(steps=[('standardscaler', StandardScaler()), ('ridgecv', RidgeCV())])


lc: LearningCurve = LearningCurve(pipe, x=X, y=y, cv=5)
fig = lc.plot(title='Learning Curve').update_layout(height=600)
fig


Learning Curve with Metrics

While a model is fitted and scored using the defined scorer, we may also be interested in other metrics. The metrics parameter allows additional metrics to be calculated.

lc: LearningCurve = LearningCurve(pipe, x=X, y=y,
                                  metrics={'mse': metrics.mean_squared_error, 'moe': metrics.moe_95},
                                  cv=5, n_jobs=5)
fig = lc.plot(title='Learning Curve with Metrics', metrics=['mse', 'moe'], col_wrap=2).update_layout(height=800)
fig
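
The metrics argument maps a display name to a callable. Assuming the callables follow the usual scikit-learn (y_true, y_pred) signature, as mean_squared_error above suggests, another metric such as mean_absolute_error could presumably be supplied in the same way; a hedged sketch:

from sklearn.metrics import mean_absolute_error

# Assumption: metric callables take (y_true, y_pred), like scikit-learn metrics.
lc_extra: LearningCurve = LearningCurve(pipe, x=X, y=y,
                                        metrics={'mae': mean_absolute_error},
                                        cv=5, n_jobs=5)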


Learning Curve for a metric without the scorer

fig = lc.plot(title='Learning Curve - Metric, no scorer', metrics=['moe'], plot_scorer=False).update_layout(height=700)
fig

