Learning Curve

This example demonstrates the learning curve, which helps answer two questions:

  1. Is my model over-fitted?

  2. Will my model benefit from more data?
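
Both questions can be read off the gap between the training and validation scores as the training set grows. The following is a minimal sketch using scikit-learn's own `learning_curve` function rather than the elphick wrapper used in the rest of this example. The 0.05 gap threshold is an arbitrary value chosen for illustration, not a rule taken from the source.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Score the model on five increasing training-set sizes with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

# Average over the CV folds (axis 1) to get one score per training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Question 1: a large train/validation gap at the largest size suggests over-fitting.
gap = train_mean[-1] - val_mean[-1]
# Question 2: a validation score that is still rising suggests more data may help.
still_improving = val_mean[-1] > val_mean[-2]

GAP_THRESHOLD = 0.05  # hypothetical threshold, chosen only for illustration
print(f"gap={gap:.3f}, possibly over-fitted={gap > GAP_THRESHOLD}, "
      f"still improving={still_improving}")
```

These two checks are only heuristics; the plots below show the full curves, which are usually more informative than a single number.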

Code has been adapted from the sklearn example.

This machinelearningmastery article is a great resource for interpreting learning curves.

import logging

import plotly
from sklearn.datasets import load_digits, load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler

from elphick.sklearn_viz.model_selection import LearningCurve, plot_learning_curve, metrics

logging.basicConfig(format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s')

Load Data

X, y = load_digits(return_X_y=True)

Create a Classifier Pipeline

The pipeline will likely include some pre-processing.

pipe: Pipeline = make_pipeline(StandardScaler(), GaussianNB()).set_output(transform='pandas')
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gaussiannb', GaussianNB())])

Plot using the function

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
fig = plot_learning_curve(pipe, x=X, y=y, cv=cv)
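
`ShuffleSplit` draws independent random train/test partitions, so with `n_splits=50` every training size is scored 50 times. The sketch below only inspects what the splitter yields on the digits data; it does not reproduce how `plot_learning_curve` consumes it.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit

X, y = load_digits(return_X_y=True)
cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

# Each split is a pair of index arrays into X; train and test never overlap.
splits = list(cv.split(X))
train_idx, test_idx = splits[0]

print(f"number of splits: {len(splits)}")
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}")
print(f"overlap: {len(set(train_idx) & set(test_idx))}")
```

More splits give smoother mean curves and tighter spread estimates, at the cost of proportionally more model fits.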

Plot using the object

Plotting using the object allows access to the underlying data.


You can use n_jobs to parallelize the computation.

lc: LearningCurve = LearningCurve(pipe, x=X, y=y, cv=5, n_jobs=5)
fig = lc.plot(title='Learning Curve').update_layout(height=600)

View the data

lc.results

LearningCurveResult(training_scores=array([[0.97902098, 0.98601399, 0.98601399, 0.98601399, 0.98601399],
       [0.85010707, 0.87366167, 0.91648822, 0.91648822, 0.91648822],
       [0.85063291, 0.8556962 , 0.86582278, 0.81392405, 0.81392405],
       [0.87780773, 0.88679245, 0.87241689, 0.79964061, 0.79245283],
       [0.82602644, 0.84620738, 0.83716075, 0.78705637, 0.84133612]]), validation_scores=array([[0.70277778, 0.59722222, 0.56267409, 0.64345404, 0.55153203],
       [0.69722222, 0.71388889, 0.7632312 , 0.84122563, 0.76044568],
       [0.69722222, 0.73055556, 0.78272981, 0.80779944, 0.76044568],
       [0.76388889, 0.75277778, 0.77715877, 0.80779944, 0.72423398],
       [0.75277778, 0.74444444, 0.77715877, 0.7994429 , 0.76044568]]), training_sizes=array([ 143,  467,  790, 1113, 1437]), metrics=None)

Results as a dataframe

df = lc.results.get_results()
   train_count_143  train_count_467  train_count_790  train_count_1113  train_count_1437     dataset
0         0.979021         0.850107         0.850633          0.877808          0.826026    training
1         0.986014         0.873662         0.855696          0.886792          0.846207    training
2         0.986014         0.916488         0.865823          0.872417          0.837161    training
3         0.986014         0.916488         0.813924          0.799641          0.787056    training
4         0.986014         0.916488         0.813924          0.792453          0.841336    training
5         0.702778         0.697222         0.697222          0.763889          0.752778  validation
6         0.597222         0.713889         0.730556          0.752778          0.744444  validation
7         0.562674         0.763231         0.782730          0.777159          0.777159  validation
8         0.643454         0.841226         0.807799          0.807799          0.799443  validation
9         0.551532         0.760446         0.760446          0.724234          0.760446  validation
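
With one row per cross-validation fold and one column per training size, the frame can be summarised with ordinary pandas operations. The sketch below assumes the layout shown above; the frame is rebuilt by hand from the first two training-size columns so that the snippet runs on its own.

```python
import pandas as pd

# Values copied from the first two columns of the results table.
df = pd.DataFrame({
    "train_count_143": [0.979021, 0.986014, 0.986014, 0.986014, 0.986014,
                        0.702778, 0.597222, 0.562674, 0.643454, 0.551532],
    "train_count_467": [0.850107, 0.873662, 0.916488, 0.916488, 0.916488,
                        0.697222, 0.713889, 0.763231, 0.841226, 0.760446],
    "dataset": ["training"] * 5 + ["validation"] * 5,
})

# Mean score per dataset for each training size.
summary = df.groupby("dataset").mean()
print(summary.round(3))

# Train/validation gap per training size.
gap = summary.loc["training"] - summary.loc["validation"]
print(gap.round(3))
```

For these two columns the gap shrinks as the training size grows from 143 to 467 samples, which is the pattern the learning-curve plot makes visible.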

Regressor Learning Curve

This example uses a regression model.

diabetes = load_diabetes(as_frame=True)
X, y = diabetes.data, diabetes.target
y.name = "progression"

pipe: Pipeline = make_pipeline(StandardScaler(), RidgeCV()).set_output(transform='pandas')
Pipeline(steps=[('standardscaler', StandardScaler()), ('ridgecv', RidgeCV())])

lc: LearningCurve = LearningCurve(pipe, x=X, y=y, cv=5)
fig = lc.plot(title='Learning Curve').update_layout(height=600)

Learning Curve with Metrics

While the learning curve is scored with the defined scorer, we may be interested in other metrics. The metrics parameter accepts a dictionary of additional metrics to calculate.

lc: LearningCurve = LearningCurve(pipe, x=X, y=y,
                                  metrics={'mse': metrics.mean_squared_error, 'moe': metrics.moe_95},
                                  cv=5, n_jobs=5)
fig = lc.plot(title='Learning Curve with Metrics', metrics=['mse', 'moe'], col_wrap=2).update_layout(height=800)
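
For comparison, scikit-learn's `learning_curve` accepts a `scoring` argument, so a curve can also be computed for a metric other than the estimator's default score. This sketch covers mean squared error only and is not how the elphick `metrics` parameter is implemented; `moe_95` comes from the elphick package and is left out.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), RidgeCV())

# scikit-learn scorers follow "greater is better", so MSE is exposed negated.
train_sizes, train_scores, val_scores = learning_curve(
    pipe, X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign back and average over the folds for each training size.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mse, val_mse):
    print(f"n={int(n):4d}  train MSE={tr:8.1f}  validation MSE={va:8.1f}")
```

Because lower is better for an error metric, the validation curve is expected to sit above the training curve here, the reverse of an accuracy or R² plot.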

Learning Curve for a metric without the scorer

fig = lc.plot(title='Learning Curve - Metric, no scorer', metrics=['moe'], plot_scorer=False).update_layout(height=700)

Total running time of the script: ( 0 minutes 4.786 seconds)
