Compare Multiple N-D Datasets

Comparing 1D datasets is readily achieved by overlaying distributions, box plots and the like. For multivariate (N-D) datasets, things get a little more difficult.

This example applies Principal Component Analysis (PCA) by the group variable and colours the loading vectors by group.

import logging

import pandas as pd
import plotly.io

from elphick.sklearn_viz.features import plot_parallel_coordinates
from elphick.sklearn_viz.features.principal_components import plot_loading_vectors, plot_correlation_circle, \
    plot_explained_variance, plot_principal_components

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')

Create a dataset

import numpy as np

# for consistent results
np.random.seed(7)

n_samples = 125
n_features = 4

# generate two sets of correlated Gaussian data, each of shape (125, 4)
cov1 = np.array([[9, -7, -2, -2],
                 [-7, 7, 1.5, 1],
                 [-2, 1.5, 1, 0.5],
                 [-2, 1, 0.5, 0.5]])
cov2 = np.array([[5, -2, -1.5, -3],
                 [-2, 2, 0.5, 0.5],
                 [-1.5, 0.5, 1, 1],
                 [-3, 0.5, 1, 3]])

x1 = np.dot(np.random.randn(n_samples, n_features), cov1)
x2 = np.dot(np.random.randn(n_samples, n_features), cov2)

df_x1: pd.DataFrame = pd.DataFrame(x1, columns=[f"F{n}" for n in range(1, n_features + 1)])
# shift the mean on two features
df_x1['F4'] = df_x1['F4'] - 2.0
df_x1['F2'] = df_x1['F2'] + 1.0
df_x2: pd.DataFrame = pd.DataFrame(x2, columns=[f"F{n}" for n in range(1, n_features + 1)])
x = pd.concat([df_x1.assign(group='one'), df_x2.assign(group='two')], axis=0).reset_index(drop=True)
x['group'] = pd.Categorical(x['group'])
x
            F1          F2         F3         F4  group
0    17.595620  -13.638495  -3.843379  -5.626821    one
1    -3.603537    4.780860   0.702692  -1.297896    one
2     6.549387   -3.029808  -1.845771  -3.833306    one
3     9.769170   -6.183956  -2.372003  -4.119950    one
4     6.628186   -3.129730  -1.412142  -3.611312    one
..         ...         ...        ...        ...    ...
245   7.033214   -2.686300  -2.319988  -4.617525    two
246  -8.605161    5.013764   3.277691   5.560217    two
247  11.574064   -6.242021  -3.464603  -6.354748    two
248   3.751013    0.564987  -1.655431  -3.452477    two
249   6.737896   -2.744247  -2.385954  -4.576597    two

[250 rows x 5 columns]



Explore the data

The parallel coordinate plot is a good place to start. The differences in mean and variance across some features are clear; the group-wise summary sketched below quantifies them.

fig = plot_parallel_coordinates(x, color='group')
fig
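
As a quick numeric check of that claim, the group-wise mean and variance can be tabulated with pandas. This is a minimal sketch using only the x DataFrame built above:

# Group-wise mean and variance of each feature; the shift applied to F2 and F4
# in group 'one' and the differing covariance structures should be visible.
summary = x.groupby('group', observed=True)[['F1', 'F2', 'F3', 'F4']].agg(['mean', 'var'])
print(summary.round(2))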


Explore the Principal Components

Note that in the next plot the data is coloured by group, but the loading vectors are computed for the entire dataset.

fig = plot_principal_components(plot_3d=False, x=x.drop(columns=['group']), color=x['group'])
fig.update_layout(height=800)
fig
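
For reference, loading vectors are conventionally obtained from a fitted PCA as the components scaled by the square root of the explained variance. The sketch below shows that construction with scikit-learn directly; it illustrates the convention only and is not necessarily the exact code behind plot_principal_components.

from sklearn.decomposition import PCA

# Fit PCA on the numeric features only; the 'group' column is dropped.
features = x.drop(columns=['group'])
pca = PCA(n_components=2).fit(features)

# Conventional loading vectors: components scaled by sqrt(explained variance).
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(pd.DataFrame(loadings, index=features.columns, columns=['PC1', 'PC2']))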


The example below visualises the loading vectors for each group without the data points, for clarity.

fig = plot_loading_vectors(x=x.drop(columns=['group']), color=x['group'])
fig
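
Per-group loading vectors can be thought of as the result of fitting a separate PCA to each group's data. A sketch under that assumption follows (plot_loading_vectors may differ in detail):

from sklearn.decomposition import PCA

# Fit one PCA per group and collect the loading vectors for each.
group_loadings = {}
for name, group_data in x.groupby('group', observed=True):
    pca_grp = PCA(n_components=2).fit(group_data.drop(columns=['group']))
    group_loadings[name] = pca_grp.components_.T * np.sqrt(pca_grp.explained_variance_)
print({name: lv.round(2) for name, lv in group_loadings.items()})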


Finally, standardising the input data prior to PCA allows the correlation circle to be shown by group.

fig = plot_correlation_circle(x=x.drop(columns=['group']), color=x['group'])
fig.update_layout(height=800, width=800)
plotly.io.show(fig)
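
When the features are standardised before fitting PCA, each loading approximates the correlation between a feature and a principal component, which is what the correlation circle displays. A minimal sketch of that preprocessing step with scikit-learn is shown below for illustration; the standardisation performed by plot_correlation_circle may differ.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardise, then fit PCA; on standardised data the scaled components
# approximate the feature/component correlations drawn on the circle.
features = x.drop(columns=['group'])
scaled = StandardScaler().fit_transform(features)
pca_std = PCA(n_components=2).fit(scaled)
correlations = pca_std.components_.T * np.sqrt(pca_std.explained_variance_)
print(pd.DataFrame(correlations, index=features.columns, columns=['PC1', 'PC2']))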

Total running time of the script: (0 minutes 0.390 seconds)
