.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/outlier_detection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_outlier_detection.py:


==============================
Multivariate Outlier Detection
==============================

Mahalanobis Distance can be used to detect outliers in multivariate space.
This can be combined with Principal Component Analysis (PCA) to reduce dimensionality prior to outlier detection.

.. GENERATED FROM PYTHON SOURCE LINES 10-24

.. code-block:: default

    import logging

    import pandas as pd
    import plotly.io as pio
    import plotly.express as px
    from sklearn.datasets import load_diabetes

    from elphick.sklearn_viz.features import plot_principal_components, plot_scatter_matrix, \
        plot_explained_variance, PrincipalComponents, plot_parallel_coordinates
    from elphick.sklearn_viz.features.outlier_detection import OutlierDetection

    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                        datefmt='%Y-%m-%dT%H:%M:%S%z')

.. GENERATED FROM PYTHON SOURCE LINES 25-29

Create a dataset
----------------

From the `sklearn example `_

.. GENERATED FROM PYTHON SOURCE LINES 29-52

.. code-block:: default

    import numpy as np

    # for consistent results
    np.random.seed(7)

    n_samples = 125
    n_outliers = 25
    n_features = 4

    # generate Gaussian data of shape (125, 4)
    gen_cov = np.eye(n_features)
    gen_cov[0, 0] = 2.0
    X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
    # add some outliers
    outliers_cov = np.eye(n_features)
    outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.0
    X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)

    x: pd.DataFrame = pd.DataFrame(X, columns=[f"F{n}" for n in range(1, n_features + 1)])
    test_outlier = pd.Series(x['F1'].index > (n_samples - n_outliers - 1), name='test_outlier').astype('category')
    x

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                F1        F2         F3        F4
    0     3.381051 -0.465937   0.032820  0.407516
    1    -1.577846  0.002066  -0.000890 -1.754724
    2     2.035316  0.600499  -0.625429 -0.171548
    3     1.010599 -0.261356  -0.242749 -1.453241
    4     1.109161  0.123881   0.274460 -1.526525
    ...        ...       ...        ...       ...
    120  -0.118956 -3.487569  13.277863  6.290964
    121   1.264443  2.110813  -3.885861  8.212846
    122   1.653950  4.073715  -0.169766  7.566816
    123  -0.371115  5.351784  -4.981090 -6.792561
    124   0.566445  2.365511  13.330145 -1.960749

    125 rows × 4 columns



.. GENERATED FROM PYTHON SOURCE LINES 53-58

Explore the data and principal components
-----------------------------------------

A parallel coordinates plot is a good place to start. The plot is colored by the ``test_outlier`` class variable.
Note that the outliers shown here are the synthetically generated outlier class, not the result of outlier detection.

.. GENERATED FROM PYTHON SOURCE LINES 58-62

.. code-block:: default

    fig = plot_parallel_coordinates(data=pd.concat([x, test_outlier], axis=1), color='test_outlier')
    fig

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 63-64

Explore the explained variance, then the principal components in 2D and 3D.

.. GENERATED FROM PYTHON SOURCE LINES 64-67

.. code-block:: default

    fig = plot_explained_variance(x=x)
    fig

.. raw:: html

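The explained variance plotted above can be approximated by hand. The following is a minimal sketch using only NumPy. It assumes the features are standardised before the decomposition, which may differ from what ``plot_explained_variance`` does internally. The helper name ``explained_variance_ratio`` and the demo data are hypothetical and not part of the library.

```python
import numpy as np


def explained_variance_ratio(data: np.ndarray) -> np.ndarray:
    """Return the fraction of total variance carried by each principal component.

    Assumption: columns are standardised (zero mean, unit variance) first.
    """
    # standardise each feature
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # the eigenvalues of the covariance matrix are the component variances
    eigenvalues = np.linalg.eigvalsh(np.cov(z, rowvar=False))
    # eigvalsh returns ascending order; reverse so the largest comes first
    eigenvalues = eigenvalues[::-1]
    return eigenvalues / eigenvalues.sum()


rng = np.random.default_rng(7)
demo = rng.standard_normal((125, 4))
demo[:, 1] += 2.0 * demo[:, 0]  # correlate two features (illustrative data only)
ratios = explained_variance_ratio(demo)
print(ratios)
print(ratios.cumsum())
```

The cumulative sum is a common guide for choosing how many components to keep before outlier detection.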

.. GENERATED FROM PYTHON SOURCE LINES 68-72

.. code-block:: default

    fig = plot_principal_components(plot_3d=False, x=x, color=test_outlier)
    fig.update_layout(height=800)
    fig

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 73-77

.. code-block:: default

    fig = plot_principal_components(plot_3d=True, x=x, color=test_outlier)
    fig.update_layout(height=800)
    fig

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 78-80

Detect Outliers
---------------

.. GENERATED FROM PYTHON SOURCE LINES 80-84

.. code-block:: default

    od: OutlierDetection = OutlierDetection(x=x, pca_spec=2)
    detected_outliers: pd.Series = od.data['outlier']

.. GENERATED FROM PYTHON SOURCE LINES 85-86

Visualise the detected outliers with the default ``p_val`` of 0.001.

.. GENERATED FROM PYTHON SOURCE LINES 86-91

.. code-block:: default

    fig = plot_principal_components(plot_3d=False, x=x, color=detected_outliers)
    fig.update_layout(height=800)
    fig

.. raw:: html

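To make the role of ``p_val`` concrete, the following is a minimal sketch of Mahalanobis-distance outlier flagging. It assumes the squared distances are compared against a chi-square distribution with one degree of freedom per column. This is a common approach, but the source does not show how ``OutlierDetection`` is implemented, so treat it as an illustration only. The function name ``mahalanobis_outliers`` and the demo data are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_outliers(data: np.ndarray, p_val: float = 0.001) -> np.ndarray:
    """Flag rows whose squared Mahalanobis distance is improbably large.

    Assumption: for roughly Gaussian data the squared distance follows a
    chi-square distribution with one degree of freedom per column.
    """
    centred = data - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    # squared distance for every row i: centred_i @ inv_cov @ centred_i
    d2 = np.einsum('ij,jk,ik->i', centred, inv_cov, centred)
    # survival function gives P(chi2 >= d2); small values are unusual
    p_values = chi2.sf(d2, df=data.shape[1])
    return p_values < p_val


rng = np.random.default_rng(7)
demo = rng.standard_normal((125, 2))
demo[-5:] += 20.0  # five obvious outliers (illustrative data only)
flags = mahalanobis_outliers(demo, p_val=0.001)
print(flags.sum())
```

Under this assumption, a larger ``p_val`` flags more points. Note that the sample mean and covariance used here are themselves influenced by the outliers, so a robust covariance estimate may behave differently.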

.. GENERATED FROM PYTHON SOURCE LINES 92-93

We can adjust the threshold (here ``p_val=0.25``) to align more closely with our expectation.

.. GENERATED FROM PYTHON SOURCE LINES 93-100

.. code-block:: default

    detected_outliers: pd.Series = OutlierDetection(x=x, pca_spec=2, p_val=0.25).data['outlier']
    fig = plot_principal_components(plot_3d=False, x=x, color=detected_outliers)
    fig.update_layout(height=800)
    # noinspection PyTypeChecker
    pio.show(fig)

.. raw:: html
    :file: images/sphx_glr_outlier_detection_001.html

.. GENERATED FROM PYTHON SOURCE LINES 101-105

.. code-block:: default

    fig = od.plot_outlier_matrix()
    fig.update_layout(height=800)
    fig

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 106-108

The parallel plot allows us to explore the difference between our defined outliers (``test_outlier``)
and what was detected as an outlier (``outlier``).

.. GENERATED FROM PYTHON SOURCE LINES 108-112

.. code-block:: default

    fig = plot_parallel_coordinates(data=pd.concat([x, test_outlier, detected_outliers.astype('category')], axis=1),
                                    color='outlier')
    fig

.. raw:: html

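The same comparison can be summarised numerically with a cross-tabulation. The sketch below uses small hypothetical label series rather than the data above, so the counts are illustrative only. In the example itself the two inputs would be ``test_outlier`` and ``detected_outliers``.

```python
import pandas as pd

# hypothetical labels for eight samples (illustrative values only)
known = pd.Series([False, False, False, False, True, True, True, True], name='test_outlier')
detected = pd.Series([False, False, False, True, False, True, True, True], name='outlier')

# rows: known class, columns: detected class
confusion = pd.crosstab(known, detected)
print(confusion)
```

Off-diagonal cells count the disagreements: known outliers that were missed, and ordinary samples that were flagged.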

.. GENERATED FROM PYTHON SOURCE LINES 113-114

We can also detect outliers in the original feature space by setting ``pca_spec=0``.

.. GENERATED FROM PYTHON SOURCE LINES 114-117

.. code-block:: default

    detected_outliers: pd.Series = OutlierDetection(x=x, pca_spec=0, p_val=0.25).data['outlier']
    detected_outliers.sum()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    np.int64(30)

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.399 seconds)

.. _sphx_glr_download_auto_examples_outlier_detection.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: outlier_detection.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: outlier_detection.ipynb `

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_