Multivariate Outlier Detection

Mahalanobis Distance can be used to detect outliers in multivariate space. This can be combined with Principal Component Analysis (PCA) to reduce dimensionality prior to outlier detection.
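For a point x, the squared Mahalanobis distance from a distribution with mean mu and covariance Sigma is (x - mu)' Sigma^-1 (x - mu). As a minimal illustrative sketch in plain numpy (independent of the library used below):

import numpy as np

def mahalanobis_sq(data: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample mean."""
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = data - mu
    # (x - mu)' @ Sigma^-1 @ (x - mu) for every row, without an explicit loop
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

Large distances indicate points that are far from the bulk of the data once feature correlations are accounted for.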

import logging

import pandas as pd
import plotly.io as pio
import plotly.express as px
from sklearn.datasets import load_diabetes

from elphick.sklearn_viz.features import plot_principal_components, plot_scatter_matrix, \
    plot_explained_variance, PrincipalComponents, plot_parallel_coordinates
from elphick.sklearn_viz.features.outlier_detection import OutlierDetection

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')

Create a dataset

The data generation below is adapted from the scikit-learn example on robust covariance estimation.

import numpy as np

# for consistent results
np.random.seed(7)

n_samples = 125
n_outliers = 25
n_features = 4

# generate Gaussian data of shape (125, 4)
gen_cov = np.eye(n_features)
gen_cov[0, 0] = 2.0
X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
# add some outliers
outliers_cov = np.eye(n_features)
outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.0
X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)

x: pd.DataFrame = pd.DataFrame(X, columns=[f"F{n}" for n in range(1, n_features + 1)])
# flag the last n_outliers rows as the known (synthetic) outlier class
test_outlier = pd.Series(x.index > (n_samples - n_outliers - 1), name='test_outlier').astype('category')
x
            F1        F2         F3        F4
0     3.381051 -0.465937   0.032820  0.407516
1    -1.577846  0.002066  -0.000890 -1.754724
2     2.035316  0.600499  -0.625429 -0.171548
3     1.010599 -0.261356  -0.242749 -1.453241
4     1.109161  0.123881   0.274460 -1.526525
..         ...       ...        ...       ...
120  -0.118956 -3.487569  13.277863  6.290964
121   1.264443  2.110813  -3.885861  8.212846
122   1.653950  4.073715  -0.169766  7.566816
123  -0.371115  5.351784  -4.981090 -6.792561
124   0.566445  2.365511  13.330145 -1.960749

[125 rows x 4 columns]



Explore the data and principal components

The parallel coordinate plot is a good place to start. The test_outlier class variable is used to color the plot. Note that the outliers shown here are not yet detected outliers - they are the synthetically generated outlier class.

fig = plot_parallel_coordinates(data=pd.concat([x, test_outlier], axis=1), color='test_outlier')
fig


Explore the explained variance, then the principal components in 2D and 3D.

fig = plot_explained_variance(x=x)
fig
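As a cross-check, the explained variance ratios behind this plot can be computed directly with scikit-learn's PCA (whether the library standardises the data first is an implementation detail; here we standardise explicitly):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca = PCA().fit(StandardScaler().fit_transform(x))
print(pca.explained_variance_ratio_)  # fraction of variance per component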


fig = plot_principal_components(plot_3d=False, x=x, color=test_outlier)
fig.update_layout(height=800)
fig


fig = plot_principal_components(plot_3d=True, x=x, color=test_outlier)
fig.update_layout(height=800)
fig


Detect Outliers

# pca_spec=2 projects onto the first two principal components before detection
od: OutlierDetection = OutlierDetection(x=x, pca_spec=2)
detected_outliers: pd.Series = od.data['outlier']
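Conceptually, this style of detection flags a point as an outlier when its squared Mahalanobis distance in the (optionally PCA-reduced) space exceeds a chi-squared quantile at significance level p_val. A minimal sketch of the general technique (not necessarily the library's exact implementation):

import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

p_val = 0.001
pcs = PCA(n_components=2).fit_transform(x)  # mirror pca_spec=2
mu = pcs.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(pcs, rowvar=False))
diff = pcs - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared Mahalanobis distances
threshold = stats.chi2.ppf(1 - p_val, df=pcs.shape[1])
is_outlier = d2 > threshold

A smaller p_val pushes the chi-squared threshold further out, so fewer points are flagged.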

Visualise the detected outliers with the default p_val of 0.001.

fig = plot_principal_components(plot_3d=False, x=x, color=detected_outliers)
fig.update_layout(height=800)
fig


We can tighten the distance threshold by increasing p_val, so that more points are flagged and the result aligns more closely with our expectation.

od = OutlierDetection(x=x, pca_spec=2, p_val=0.25)
detected_outliers: pd.Series = od.data['outlier']
fig = plot_principal_components(plot_3d=False, x=x, color=detected_outliers)
fig.update_layout(height=800)
# noinspection PyTypeChecker
pio.show(fig)

# the outlier matrix from the same (p_val=0.25) detector
fig = od.plot_outlier_matrix()
fig.update_layout(height=800)
fig


The parallel coordinate plot allows us to compare our known outliers (test_outlier) with those that were detected (outlier).

fig = plot_parallel_coordinates(data=pd.concat([x, test_outlier, detected_outliers.astype('category')], axis=1),
                                color='outlier')
fig


We can detect outliers in the original feature space by setting pca_spec=0.

detected_outliers: pd.Series = OutlierDetection(x=x, pca_spec=0, p_val=0.25).data['outlier']
detected_outliers.sum()
30
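To quantify how well detection recovered the synthetic outliers, the known and detected labels can be cross-tabulated with pandas:

# rows: known outlier class, columns: detected outliers
print(pd.crosstab(test_outlier, detected_outliers))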
