Feature importances with a forest of trees

This example shows the use of a forest of trees to evaluate the importance of features on an artificial classification task. The blue bars are the feature importances of the forest, along with their inter-trees variability represented by the error bars.

As expected, the plot suggests that 3 features are informative, while the remaining are not.

The base code has been adapted from the original scikit-learn example.

To learn about the benefits of permutation importance over the impurity-based importance captured when a model is trained, refer to that original example. This example focuses on the interactive feature importance plot.

import pandas as pd
import plotly
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier

from elphick.sklearn_viz.features import plot_feature_importance, FeatureImportance
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')

Data generation and model fitting

We generate a synthetic dataset with only 3 informative features. We explicitly do not shuffle the dataset, to ensure that the informative features correspond to the first three columns of X.

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X = pd.DataFrame(X, columns=[f"Feature {i}" for i in range(1, X.shape[1]+1)])
y = pd.Series(y, name='Class')
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

A random forest classifier will be fitted to compute the feature importances.

Note

To obtain the real feature names in the plot, the following is needed:

  • Pass pd.DataFrames to the fit method

  • Set the transform output to “pandas”

pipe = make_pipeline(SelectKBest(k='all'), RandomForestClassifier(random_state=0))
pipe.set_output(transform="pandas")
pipe.fit(X_train, y_train)
Pipeline(steps=[('selectkbest', SelectKBest(k='all')),
                ('randomforestclassifier',
                 RandomForestClassifier(random_state=0))])


Feature importance based on mean decrease in impurity

In short, this approach is fast, since the importances come as a by-product of model fitting, but it is less accurate than permutation importance.
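For reference, the same values can be obtained directly from the fitted forest with scikit-learn alone. This is a minimal sketch, assuming the step name randomforestclassifier assigned by make_pipeline; the error bars in the plot represent the spread across the individual trees.

import numpy as np

# MDI importances are a by-product of fitting; the inter-tree variability
# can be estimated from the per-tree importances.
forest = pipe.named_steps["randomforestclassifier"]
mdi_importances = pd.Series(forest.feature_importances_, index=X.columns)
mdi_std = pd.Series(
    np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0),
    index=X.columns,
)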

Create an interactive Feature Importance plot

fig = plot_feature_importance(pipe)
fig


fig = plot_feature_importance(pipe, sort=True, top_k=5)
fig


fig = FeatureImportance(pipe).plot(horizontal=True, sort=True, top_k=5)
fig


feature_importance: pd.DataFrame = FeatureImportance(pipe).data
feature_importance
            importance       std
Feature 1     0.209443  0.078813
Feature 2     0.317872  0.054734
Feature 3     0.195190  0.076166
Feature 4     0.040393  0.022759
Feature 5     0.038609  0.020160
Feature 6     0.034066  0.022078
Feature 7     0.040255  0.025062
Feature 8     0.042573  0.027971
Feature 9     0.040018  0.023722
Feature 10    0.041581  0.024136


As expected, the first three features are found important.
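Because the data attribute is a plain DataFrame, it can feed directly into downstream feature selection. A small illustrative sketch (the top_features name is hypothetical):

# Keep only the three most important features by MDI.
top_features = feature_importance["importance"].nlargest(3).index.tolist()
X_train_top = X_train[top_features]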

Feature importance based on feature permutation

This approach takes longer to compute, since the model must be re-scored with each feature permuted in turn, but it gives a more reliable estimate of importance on held-out data.
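For reference, scikit-learn exposes the same technique via permutation_importance; whether FeatureImportance delegates to it is an assumption here, and the n_repeats and random_state values below are illustrative.

from sklearn.inspection import permutation_importance

# Each feature is shuffled in turn and the resulting drop in the test-set
# score is recorded; n_repeats controls how many shuffles are averaged.
result = permutation_importance(pipe, X_test, y_test, n_repeats=10, random_state=42)
perm_importances = pd.Series(result.importances_mean, index=X_test.columns)
perm_std = pd.Series(result.importances_std, index=X_test.columns)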

Create an interactive Feature Importance plot using permutation.

fi = FeatureImportance(pipe, permute=True, x_test=X_test, y_test=y_test)
fig = fi.plot()
fig


fig = fi.plot(horizontal=True, sort=True, top_k=5)
# noinspection PyTypeChecker
plotly.io.show(fig)  # this call to show will set the thumbnail for use in the gallery

The same features are detected as most important using both methods, although the relative importances vary. As seen in the plots, MDI is less likely than permutation importance to fully omit a feature.
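To see the numbers behind that observation, the two sets of importances can be placed side by side. A small sketch, assuming the permutation-based data attribute exposes the same importance column shown earlier:

# Compare MDI against permutation importance, feature by feature.
mdi = FeatureImportance(pipe).data["importance"].rename("mdi")
perm = fi.data["importance"].rename("permutation")
comparison = pd.concat([mdi, perm], axis=1)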

If features are engineered, the reported feature importances will (by default) be those of the engineered features.

pipe2 = make_pipeline(PolynomialFeatures(degree=2), RandomForestClassifier(random_state=0))
pipe2.set_output(transform="pandas")
pipe2.fit(X_train, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('randomforestclassifier',
                 RandomForestClassifier(random_state=0))])


fi = FeatureImportance(pipe2, permute=True, x_test=X_test, y_test=y_test)
fig = fi.plot(sort=True, top_k=10)
fig


To report the importances against the original pipeline input features instead, we set the pipeline_input_features parameter to True.

fi = FeatureImportance(pipe2, permute=True, x_test=X_test, y_test=y_test, pipeline_input_features=True)
fig = fi.plot()
fig


Total running time of the script: ( 0 minutes 5.881 seconds)

Gallery generated by Sphinx-Gallery