.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_forest_importances.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_plot_forest_importances.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_forest_importances.py:

==========================================
Feature importances with a forest of trees
==========================================

This example shows how a forest of trees can be used to evaluate feature
importance on an artificial classification task. The blue bars are the
feature importances of the forest, and the error bars represent the
inter-tree variability. As expected, the plot suggests that 3 features are
informative, while the remaining features are not.

The base code has been adapted from the `original scikit-learn example
<https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html>`_.
To learn about the benefits of permutation importance over the importance
captured when a model is trained, refer to that original example. This
example focuses on the interactive feature importance plot.

.. GENERATED FROM PYTHON SOURCE LINES 21-37

.. code-block:: default

    import pandas as pd
    import plotly
    from sklearn.feature_selection import SelectKBest
    from sklearn.pipeline import make_pipeline
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.ensemble import RandomForestClassifier

    from elphick.sklearn_viz.features import plot_feature_importance, FeatureImportance

    import logging

    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                        datefmt='%Y-%m-%dT%H:%M:%S%z')
.. GENERATED FROM PYTHON SOURCE LINES 38-43

Data generation and model fitting
---------------------------------

We generate a synthetic dataset with only 3 informative features. We
explicitly do not shuffle the dataset, so that the informative features
correspond to the first three columns of X.

.. GENERATED FROM PYTHON SOURCE LINES 43-58

.. code-block:: default

    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=3,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        random_state=0,
        shuffle=False,
    )
    X = pd.DataFrame(X, columns=[f"Feature {i}" for i in range(1, X.shape[1] + 1)])
    y = pd.Series(y, name='Class')
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

.. GENERATED FROM PYTHON SOURCE LINES 59-66

A random forest classifier will be fitted to compute the feature importances.

.. note::

    To obtain the real feature names in the plot, the following is needed:

    - Pass pd.DataFrames to the fit method
    - Set the transform output to "pandas"

.. GENERATED FROM PYTHON SOURCE LINES 66-71

.. code-block:: default

    pipe = make_pipeline(SelectKBest(k='all'), RandomForestClassifier(random_state=0))
    pipe.set_output(transform="pandas")
    pipe.fit(X_train, y_train)
.. code-block:: none

    Pipeline(steps=[('selectkbest', SelectKBest(k='all')),
                    ('randomforestclassifier',
                     RandomForestClassifier(random_state=0))])


.. GENERATED FROM PYTHON SOURCE LINES 72-77

Feature importance based on mean decrease in impurity
-----------------------------------------------------

This approach is fast, since the importances come as a by-product of model
fitting, but it is less accurate than permutation importance. Create an
interactive feature importance plot:

.. GENERATED FROM PYTHON SOURCE LINES 77-81

.. code-block:: default

    fig = plot_feature_importance(pipe)
    fig


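Under the hood, the impurity-based (MDI) importances and the inter-tree
spread shown by the error bars can be recovered directly from the fitted
forest. A minimal standalone sketch using scikit-learn only (the
elphick.sklearn_viz internals may differ):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Re-create the example dataset (same parameters as above)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# MDI importance of each feature in each individual tree; the forest
# importance is the per-tree mean, and the per-tree std gives the error bars
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
importances = pd.Series(per_tree.mean(axis=0),
                        index=[f"Feature {i}" for i in range(1, 11)])
std = pd.Series(per_tree.std(axis=0), index=importances.index)
```

The per-tree mean matches the forest's own `feature_importances_`
attribute, and because each tree's importances are normalised, the mean
sums to one across features.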
.. GENERATED FROM PYTHON SOURCE LINES 82-86

.. code-block:: default

    fig = plot_feature_importance(pipe, sort=True, top_k=5)
    fig


.. GENERATED FROM PYTHON SOURCE LINES 87-91

.. code-block:: default

    fig = FeatureImportance(pipe).plot(horizontal=True, sort=True, top_k=5)
    fig


.. GENERATED FROM PYTHON SOURCE LINES 92-96

.. code-block:: default

    feature_importance: pd.DataFrame = FeatureImportance(pipe).data
    feature_importance
==========  ==========  ========
Feature     importance  std
==========  ==========  ========
Feature 1   0.209443    0.078813
Feature 2   0.317872    0.054734
Feature 3   0.195190    0.076166
Feature 4   0.040393    0.022759
Feature 5   0.038609    0.020160
Feature 6   0.034066    0.022078
Feature 7   0.040255    0.025062
Feature 8   0.042573    0.027971
Feature 9   0.040018    0.023722
Feature 10  0.041581    0.024136
==========  ==========  ========


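A quick sanity check on the tabulated values: random forest MDI
importances are normalised, so they should sum to one across all features.

```python
# MDI importances from the table above
vals = [0.209443, 0.317872, 0.195190, 0.040393, 0.038609,
        0.034066, 0.040255, 0.042573, 0.040018, 0.041581]
total = sum(vals)
print(round(total, 6))  # 1.0
```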
.. GENERATED FROM PYTHON SOURCE LINES 97-104

As expected, the first three features are found to be important.

Feature importance based on feature permutation
-----------------------------------------------

This approach takes longer to compute, but is more reliable. Create an
interactive feature importance plot using permutation:

.. GENERATED FROM PYTHON SOURCE LINES 104-109

.. code-block:: default

    fi = FeatureImportance(pipe, permute=True, x_test=X_test, y_test=y_test)
    fig = fi.plot()
    fig


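The permutation approach can be reproduced with scikit-learn directly.
A minimal standalone sketch, assuming only scikit-learn (the
elphick.sklearn_viz internals may differ):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Re-create the example data and model (same parameters as above)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)
X = pd.DataFrame(X, columns=[f"Feature {i}" for i in range(1, 11)])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each column of the held-out set and measure the drop in score;
# the mean drop over the repeats is the permutation importance
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42)
perm = pd.Series(result.importances_mean, index=X.columns)
```

Because permutation importance is measured on held-out data, it is robust
to the cardinality bias that can inflate MDI importances.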
.. GENERATED FROM PYTHON SOURCE LINES 110-115

.. code-block:: default

    fig = fi.plot(horizontal=True, sort=True, top_k=5)
    # noinspection PyTypeChecker
    plotly.io.show(fig)  # this call to show will set the thumbnail for use in the gallery

.. raw:: html
    :file: images/sphx_glr_plot_forest_importances_001.html

.. GENERATED FROM PYTHON SOURCE LINES 116-119

The same features are detected as most important by both methods, although
the relative importances vary. As seen in the plots, MDI is less likely
than permutation importance to fully omit a feature.

.. GENERATED FROM PYTHON SOURCE LINES 121-122

If features are engineered, the reported feature importances will (by
default) include the engineered features.

.. GENERATED FROM PYTHON SOURCE LINES 122-127

.. code-block:: default

    pipe2 = make_pipeline(PolynomialFeatures(degree=2), RandomForestClassifier(random_state=0))
    pipe2.set_output(transform="pandas")
    pipe2.fit(X_train, y_train)
.. code-block:: none

    Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                    ('randomforestclassifier',
                     RandomForestClassifier(random_state=0))])


.. GENERATED FROM PYTHON SOURCE LINES 128-132

.. code-block:: default

    fi = FeatureImportance(pipe2, permute=True, x_test=X_test, y_test=y_test)
    fig = fi.plot(sort=True, top_k=10)
    fig


.. GENERATED FROM PYTHON SOURCE LINES 133-134

To report importances against the original pipeline input features
instead, set the pipeline_input_features parameter to True.

.. GENERATED FROM PYTHON SOURCE LINES 134-138

.. code-block:: default

    fi = FeatureImportance(pipe2, permute=True, x_test=X_test, y_test=y_test,
                           pipeline_input_features=True)
    fig = fi.plot()
    fig


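One way to think about mapping engineered-feature importances back to the
pipeline inputs is to credit each engineered column to the input features
named in it. The sketch below is purely conceptual with hypothetical
numbers; it is not how elphick.sklearn_viz implements
pipeline_input_features:

```python
import pandas as pd

# Hypothetical engineered-column importances, named as
# PolynomialFeatures.get_feature_names_out names them
engineered = pd.Series({"Feature 1": 0.20, "Feature 2": 0.25,
                        "Feature 1^2": 0.15, "Feature 1 Feature 2": 0.40})
inputs = ["Feature 1", "Feature 2"]

# Credit each engineered column to every input feature in its name
agg = pd.Series({f: sum(v for name, v in engineered.items() if f in name)
                 for f in inputs})
print(agg.round(2).to_dict())  # {'Feature 1': 0.75, 'Feature 2': 0.65}
```

Note the naive substring match here would conflate "Feature 1" with
"Feature 10"; a real implementation needs exact token matching.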
.. rst-class:: sphx-glr-timing

**Total running time of the script:** ( 0 minutes 6.482 seconds)

.. _sphx_glr_download_auto_examples_plot_forest_importances.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_forest_importances.py <plot_forest_importances.py>`

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_forest_importances.ipynb <plot_forest_importances.ipynb>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_