Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction (decomposition) technique that aims to retain as much of the variance as possible in fewer features. It is a tool to help manage the “curse of dimensionality”.
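To make the idea concrete, here is a minimal sketch of PCA computed directly with NumPy on synthetic data. It is not how the plotting functions in this example are implemented. The data, the random seed and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the third feature is almost a copy of the first.
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])

# Centre each feature, then decompose with a singular value decomposition.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance captured by each component, and its share of the total.
explained_variance = S ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the centred data onto the components and keep the first two.
scores = Xc @ Vt.T
scores_2d = scores[:, :2]

print(explained_ratio.round(3))
```

Because the third feature is nearly redundant, the first two components should carry almost all of the variance, which is the situation where dropping components costs little.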

import logging

import pandas as pd
import plotly.io as pio
import plotly.express as px
from sklearn.datasets import load_diabetes

from elphick.sklearn_viz.features import plot_principal_components, plot_scatter_matrix, \
    plot_explained_variance, PrincipalComponents

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')

Load Classification Data

df = px.data.iris().drop(columns=['species_id'])
df['species'] = df['species'].astype('category')
x = df[[col for col in df.columns if col != 'species']]
y = df['species']

Plot Classification Data

SPLOM - Original Feature Space

fig = plot_scatter_matrix(x=x, y=y, original_features=True)
fig.update_layout(height=800)
fig


SPLOM - Principal Components

fig = plot_scatter_matrix(x=x, y=y)
fig.update_layout(height=800)
fig


Scatter - 2D PCA

fig = plot_principal_components(x=x, color=y, plot_3d=False, loading_vectors=False)
fig.update_layout(height=800)
fig


Loading vectors are plotted by default, so omitting loading_vectors=False adds them to the 2D plot.

fig = plot_principal_components(x=x, color=y, plot_3d=False)
fig.update_layout(height=800)
# noinspection PyTypeChecker
pio.show(fig)

Explained Variance

fig = plot_explained_variance(x=x, y=y)
fig


Scatter - 3D PCA

fig = plot_principal_components(x=x, color=y, loading_vectors=False)
fig.update_layout(height=800)
fig


As in the 2D case, loading vectors are plotted by default.

fig = plot_principal_components(x=x, color=y)
fig.update_layout(height=800)
fig


Regression Datasets

The preceding examples used a categorical target variable. Regression problems with a numeric target variable are also supported.

diabetes = load_diabetes(as_frame=True, scaled=False)
x, y = diabetes.data, diabetes.target.rename('target')
df = pd.concat([x, y], axis=1)
df.shape

(442, 11)

fig = plot_principal_components(x=x, color=y, plot_3d=False)
fig.update_layout(height=800)
fig


Compared with the iris dataset, this dataset needs more principal components to retain a reasonable proportion of the total variance, as the explained variance plot in the next section shows.
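One way to check this comparison numerically is sketched below using scikit-learn directly, not the plotting helpers from this example. The helper name components_needed and the 90% threshold are assumptions for illustration, and the features are standardised before fitting. The iris data here comes from sklearn.datasets rather than plotly.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_needed(features, threshold=0.90):
    """Smallest number of components whose cumulative explained
    variance ratio reaches the threshold (hypothetical helper,
    intended for thresholds below 1.0)."""
    scaled = StandardScaler().fit_transform(features)
    cumulative = np.cumsum(PCA().fit(scaled).explained_variance_ratio_)
    # argmax returns the index of the first True value.
    return int(np.argmax(cumulative >= threshold)) + 1


n_iris = components_needed(load_iris().data)
n_diabetes = components_needed(load_diabetes().data)

print(f"iris: {n_iris} of 4 components, diabetes: {n_diabetes} of 10 components")
```

The threshold is a judgment call. A different value changes the counts, but the diabetes data should still need more components than iris.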

Accessing the Data

Plotting with the PrincipalComponents object, rather than the convenience function, gives you access to the underlying data.

pca = PrincipalComponents(x=x, color=y)
fig = pca.plot_explained_variance()
fig


pca.data
{'raw': PCResults(data=           PC1        PC2        PC3  ...       PC8       PC9      PC10
0   -37.015229 -18.660758  -3.516635  ...  0.181757 -0.432592 -0.121130
1   -15.751789  22.835567  13.417056  ...  0.303021  0.616363  0.159894
2   -37.369635 -17.075089  -0.217778  ...  0.348658 -0.251111 -0.096245
3    14.080519  15.486712 -22.652022  ... -0.405063  0.247089 -0.164029
4     8.052537   2.108751  -0.220784  ... -0.353808  0.429188  0.034163
..         ...        ...        ...  ...       ...       ...       ...
437  -1.450109 -21.028323   3.482378  ... -0.101225 -0.601114 -0.105718
438  58.918190  11.872762 -22.349991  ...  0.038697 -0.639989  0.072849
439 -24.233049 -17.265420  -4.102360  ...  0.105526 -0.314848  0.163713
440  13.903409   5.839565 -10.239533  ... -0.419348  0.164559 -0.139823
441  52.914252  53.019841  32.987254  ...  0.928400  0.663407 -0.243303

[442 rows x 10 columns], explained_variance=0    73.249152
1     9.621208
2     7.471087
3     4.316437
4     3.216342
5     1.642907
6     0.468170
7     0.007436
8     0.006300
9     0.000960
Name: explained_variance, dtype: float64, loadings=           PC1        PC2       PC3  ...       PC8       PC9      PC10
age   3.714173  -7.118496  4.898654  ... -0.001158  0.001110  0.000139
sex   0.049703  -0.185346 -0.101937  ...  0.357964 -0.261106 -0.000440
bmi   1.254303  -2.003842 -0.374368  ...  0.008416 -0.002737  0.001047
bp    3.629662 -10.214990  4.524721  ... -0.001625  0.002870  0.000129
s1   33.906939   2.747011  5.006616  ... -0.005617 -0.008839  0.005336
s2   29.319152   0.279329 -6.728494  ... -0.000552  0.002171 -0.004946
s3   -0.979456   7.609406  9.679073  ...  0.026486  0.022667 -0.005935
s4    0.803214  -0.514210 -0.626845  ...  0.282119  0.328417 -0.006791
s5    0.241819  -0.210375  0.048459  ... -0.013738 -0.014107 -0.163766
s6    4.073738  -6.720256  0.820792  ... -0.002170 -0.000065  0.000257

[10 rows x 10 columns]), 'std': PCResults(data=          PC1       PC2       PC3  ...       PC8       PC9      PC10
0    0.587199 -1.946832  0.589205  ...  0.757431 -0.181075 -0.048953
1   -2.831625  1.372082  0.027930  ... -0.188436  0.505128  0.043599
2    0.272129 -1.634901  0.739244  ...  0.843203 -0.025353 -0.054175
3    0.049281  0.382278 -2.013032  ... -0.367871 -0.137857 -0.074558
4   -0.756421  0.811960 -0.057238  ... -1.059751  0.044284 -0.010914
..        ...       ...       ...  ...       ...       ...       ...
437  1.239525 -1.035968  0.928679  ...  0.126490 -0.377893 -0.025229
438  1.264719  0.761319 -1.750191  ...  0.180439 -0.371759  0.033447
439 -0.205206 -1.205487  0.496186  ... -0.491849 -0.113220  0.058875
440  0.692871  0.210127 -0.868724  ...  0.078684 -0.127211 -0.045540
441 -1.903941  3.975777 -0.048338  ...  1.185359  0.730475 -0.154558

[442 rows x 10 columns], explained_variance=0    40.242108
1    14.923197
2    12.059663
3     9.554764
4     6.621814
5     6.027171
6     5.365657
7     4.336820
8     0.783200
9     0.085607
Name: explained_variance, dtype: float64, loadings=          PC1       PC2       PC3  ...       PC8       PC9      PC10
age  0.434662  0.054261  0.543842  ...  0.009848  0.002269  0.000302
sex  0.375489 -0.472743 -0.117488  ...  0.292022 -0.000590  0.000339
bmi  0.608846 -0.191130  0.184181  ...  0.259050  0.011873  0.000764
bp   0.545735 -0.169098  0.564625  ... -0.314720  0.007619 -0.000298
s1   0.689365  0.700806 -0.075396  ...  0.085320 -0.011778  0.065746
s2   0.706648  0.557612 -0.296499  ... -0.126139 -0.100671 -0.052168
s3  -0.567223  0.619125  0.424407  ...  0.214029  0.134833 -0.029405
s4   0.861234 -0.083384 -0.418523  ... -0.119059  0.216804 -0.008392
s5   0.760385 -0.032026  0.069956  ...  0.296473 -0.053082 -0.024497
s6   0.647045 -0.103892  0.304363  ... -0.109843 -0.004279  0.000242

[10 rows x 10 columns])}
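The printed output contains two result sets, 'raw' and 'std', each with an explained_variance series that appears to be expressed in percent. The sketch below copies those printed values into plain pandas Series and counts how many components are needed to reach a 90% target. The dictionary is a stand-in built by hand, not the library object, and the helper name components_for and the target are assumptions. In a live session the same series would presumably be read from pca.data['raw'] and pca.data['std'], judging by the printed representation.

```python
import pandas as pd

# Values copied from the printed output above (percent of total variance).
explained = {
    "raw": pd.Series([73.249152, 9.621208, 7.471087, 4.316437, 3.216342,
                      1.642907, 0.468170, 0.007436, 0.006300, 0.000960]),
    "std": pd.Series([40.242108, 14.923197, 12.059663, 9.554764, 6.621814,
                      6.027171, 5.365657, 4.336820, 0.783200, 0.085607]),
}


def components_for(percent_series, target=90.0):
    """Number of leading components whose cumulative percentage
    reaches the target (hypothetical helper)."""
    cumulative = percent_series.cumsum()
    # idxmax on a boolean Series gives the label of the first True value.
    return int((cumulative >= target).idxmax()) + 1


summary = {name: components_for(series) for name, series in explained.items()}
print(summary)
```

With these values the raw result reaches 90% in three components and the standardised result in seven. A plausible reading is that features with large numeric ranges dominate the raw decomposition, such as s1 and s2 with their large PC1 loadings. Standardising spreads the variance more evenly across components.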
