Learning Curve

This example demonstrates the learning curve, which helps answer two questions:

  1. Is my model over-fitted?

  2. Will my model benefit from more data?

Code has been adapted from the sklearn example.

This machinelearningmastery article is a great resource for interpreting learning curves.

import logging

import plotly
from sklearn.datasets import load_digits, load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler

from elphick.sklearn_viz.model_selection import LearningCurve, plot_learning_curve, metrics

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')

Load Data

X, y = load_digits(return_X_y=True)

Create a Classifier Pipeline

The pipeline will likely include some pre-processing.

pipe: Pipeline = make_pipeline(StandardScaler(), GaussianNB()).set_output(transform='pandas')
pipe
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gaussiannb', GaussianNB())])
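For context, the scores behind a learning curve can be computed directly with scikit-learn's learning_curve function; the plotting helpers used below present this kind of output. A minimal sketch using the pipeline defined above (the cv and n_jobs values are illustrative):

from sklearn.model_selection import learning_curve

# Training and validation scores at a range of increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(pipe, X, y, cv=5, n_jobs=-1)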


Plot using the function

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
fig = plot_learning_curve(pipe, x=X, y=y, cv=cv)
fig.update_layout(height=600)
# noinspection PyTypeChecker
plotly.io.show(fig)
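
If you need the figure outside a notebook, the Plotly figure can also be written to a standalone HTML file (the filename here is illustrative):

# Save the interactive figure as a standalone HTML file.
fig.write_html('learning_curve.html')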

Plot using the object

Plotting using the object allows access to the underlying data.

Tip

You can use n_jobs to parallelize the computation.

lc: LearningCurve = LearningCurve(pipe, x=X, y=y, cv=5, n_jobs=5)
fig = lc.plot(title='Learning Curve').update_layout(height=600)
fig


View the data

lc.results
LearningCurveResult(training_scores=array([[0.97902098, 0.98601399, 0.98601399, 0.98601399, 0.98601399],
       [0.85010707, 0.87366167, 0.91648822, 0.91648822, 0.91648822],
       [0.85063291, 0.8556962 , 0.86582278, 0.81392405, 0.81392405],
       [0.87780773, 0.88679245, 0.87241689, 0.79964061, 0.79245283],
       [0.82602644, 0.84620738, 0.83716075, 0.78705637, 0.84133612]]), validation_scores=array([[0.70277778, 0.59722222, 0.56267409, 0.64345404, 0.55153203],
       [0.69722222, 0.71388889, 0.7632312 , 0.84122563, 0.76044568],
       [0.69722222, 0.73055556, 0.78272981, 0.80779944, 0.76044568],
       [0.76388889, 0.75277778, 0.77715877, 0.80779944, 0.72423398],
       [0.75277778, 0.74444444, 0.77715877, 0.7994429 , 0.76044568]]), training_sizes=array([ 143,  467,  790, 1113, 1437]), metrics=None)

Results as a dataframe

df = lc.results.get_results()
df.head(10)
   train_count_143  train_count_467  train_count_790  train_count_1113  train_count_1437     dataset
0         0.979021         0.850107         0.850633          0.877808          0.826026    training
1         0.986014         0.873662         0.855696          0.886792          0.846207    training
2         0.986014         0.916488         0.865823          0.872417          0.837161    training
3         0.986014         0.916488         0.813924          0.799641          0.787056    training
4         0.986014         0.916488         0.813924          0.792453          0.841336    training
5         0.702778         0.697222         0.697222          0.763889          0.752778  validation
6         0.597222         0.713889         0.730556          0.752778          0.744444  validation
7         0.562674         0.763231         0.782730          0.777159          0.777159  validation
8         0.643454         0.841226         0.807799          0.807799          0.799443  validation
9         0.551532         0.760446         0.760446          0.724234          0.760446  validation
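
With one train_count_* column per training-set size and a dataset column distinguishing training from validation rows, the curves can be summarised directly from this dataframe; a minimal sketch:

# Mean and spread of the cross-validation scores per training-set size,
# split into training and validation rows.
df.groupby('dataset').agg(['mean', 'std'])

A persistent gap between the training and validation means points to over-fitting, while a validation mean that is still rising at the largest training size suggests the model would benefit from more data.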


Regressor Learning Curve

This example uses a regression model.

diabetes = load_diabetes(as_frame=True)
X, y = diabetes.data, diabetes.target
y.name = "progression"

pipe: Pipeline = make_pipeline(StandardScaler(), RidgeCV()).set_output(transform='pandas')
pipe
Pipeline(steps=[('standardscaler', StandardScaler()), ('ridgecv', RidgeCV())])


lc: LearningCurve = LearningCurve(pipe, x=X, y=y, cv=5)
fig = lc.plot(title='Learning Curve').update_layout(height=600)
fig


Learning Curve with Metrics

While a model is fitted and scored using the defined scorer, we may also be interested in other metrics. The metrics parameter allows additional metrics to be calculated.

lc: LearningCurve = LearningCurve(pipe, x=X, y=y,
                                  metrics={'mse': metrics.mean_squared_error, 'moe': metrics.moe_95},
                                  cv=5, n_jobs=5)
fig = lc.plot(title='Learning Curve with Metrics', metrics=['mse', 'moe'], col_wrap=2).update_layout(height=800)
fig
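
The metrics argument maps a display name to a callable. Assuming the callables follow the usual scikit-learn (y_true, y_pred) signature, as mean_squared_error above suggests, another metric such as mean_absolute_error could presumably be supplied in the same way; a hedged sketch:

from sklearn.metrics import mean_absolute_error

# Assumption: metric callables take (y_true, y_pred), like scikit-learn metrics.
lc_extra: LearningCurve = LearningCurve(pipe, x=X, y=y,
                                        metrics={'mae': mean_absolute_error},
                                        cv=5, n_jobs=5)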


Learning Curve for a metric without the scorer

fig = lc.plot(title='Learning Curve - Metric, no scorer', metrics=['moe'], plot_scorer=False).update_layout(height=700)
fig

