Principal Component Analysis
Principal Component Analysis (PCA) is a feature-reduction (decomposition) technique that aims to retain as much of the original variance as possible in fewer features. It is a tool to help manage the "curse of dimensionality".
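As a minimal sketch outside this package's API (using plain scikit-learn on the same iris data), PCA can compress the four measured features into two components and report the proportion of variance each component retains:

# Minimal illustration with scikit-learn only (not part of elphick.sklearn_viz).
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = px.data.iris()
features = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

# Standardise, then project onto the first two principal components.
pca_sketch = PCA(n_components=2)
components = pca_sketch.fit_transform(StandardScaler().fit_transform(features))
print(pca_sketch.explained_variance_ratio_)  # variance retained per component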
import logging
import pandas as pd
import plotly.io as pio
import plotly.express as px
from sklearn.datasets import load_diabetes
from elphick.sklearn_viz.features import plot_principal_components, plot_scatter_matrix, \
    plot_explained_variance, PrincipalComponents

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%dT%H:%M:%S%z')
Load Classification Data
df = px.data.iris().drop(columns=['species_id'])
df['species'] = df['species'].astype('category')
x = df[[col for col in df.columns if col != 'species']]
y = df['species']
Plot Classification Data
SPLOM - Original Feature Space
fig = plot_scatter_matrix(x=x, y=y, original_features=True)
fig.update_layout(height=800)
fig
SPLOM - Principal Components
fig = plot_scatter_matrix(x=x, y=y)
fig.update_layout(height=800)
fig
Scatter - 2D PCA
fig = plot_principal_components(x=x, color=y, plot_3d=False, loading_vectors=False)
fig.update_layout(height=800)
fig
Plotting loading vectors is the default.
fig = plot_principal_components(x=x, color=y, plot_3d=False)
fig.update_layout(height=800)
# noinspection PyTypeChecker
pio.show(fig)
Explained Variance
fig = plot_explained_variance(x=x, y=y)
fig
Scatter - 3D PCA
fig = plot_principal_components(x=x, color=y, loading_vectors=False)
fig.update_layout(height=800)
fig
Plotting loading vectors is the default.
fig = plot_principal_components(x=x, color=y)
fig.update_layout(height=800)
fig
Regression Datasets
The preceding examples used a categorical target variable. Regression problems, where the target is numeric, are also supported.
diabetes = load_diabetes(as_frame=True, scaled=False)
x, y = diabetes.data, diabetes.target.rename('target')
df = pd.concat([x, y], axis=1)
df.shape
(442, 11)
fig = plot_principal_components(x=x, color=y, plot_3d=False)
fig.update_layout(height=800)
fig
Compared to the iris dataset, this dataset requires more principal components to retain a reasonable proportion of the total variance, as shown in the section below.
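A quick way to see this, sketched here with plain scikit-learn rather than this package's plotting helpers, is the cumulative explained variance of the standardised diabetes features:

# Minimal sketch: cumulative explained variance with scikit-learn only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pca_check = PCA().fit(StandardScaler().fit_transform(x))
print(np.cumsum(pca_check.explained_variance_ratio_))  # running total of retained variance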
Accessing the Data
By plotting with the PrincipalComponents object rather than the function, you can also access the underlying data.
pca = PrincipalComponents(x=x, color=y)
fig = pca.plot_explained_variance()
fig
pca.data
{'raw': PCResults(data= PC1 PC2 PC3 ... PC8 PC9 PC10
0 -37.015229 -18.660758 -3.516635 ... 0.181757 -0.432592 -0.121130
1 -15.751789 22.835567 13.417056 ... 0.303021 0.616363 0.159894
2 -37.369635 -17.075089 -0.217778 ... 0.348658 -0.251111 -0.096245
3 14.080519 15.486712 -22.652022 ... -0.405063 0.247089 -0.164029
4 8.052537 2.108751 -0.220784 ... -0.353808 0.429188 0.034163
.. ... ... ... ... ... ... ...
437 -1.450109 -21.028323 3.482378 ... -0.101225 -0.601114 -0.105718
438 58.918190 11.872762 -22.349991 ... 0.038697 -0.639989 0.072849
439 -24.233049 -17.265420 -4.102360 ... 0.105526 -0.314848 0.163713
440 13.903409 5.839565 -10.239533 ... -0.419348 0.164559 -0.139823
441 52.914252 53.019841 32.987254 ... 0.928400 0.663407 -0.243303
[442 rows x 10 columns], explained_variance=0 73.249152
1 9.621208
2 7.471087
3 4.316437
4 3.216342
5 1.642907
6 0.468170
7 0.007436
8 0.006300
9 0.000960
Name: explained_variance, dtype: float64, loadings= PC1 PC2 PC3 ... PC8 PC9 PC10
age 3.714173 -7.118496 4.898654 ... -0.001158 0.001110 0.000139
sex 0.049703 -0.185346 -0.101937 ... 0.357964 -0.261106 -0.000440
bmi 1.254303 -2.003842 -0.374368 ... 0.008416 -0.002737 0.001047
bp 3.629662 -10.214990 4.524721 ... -0.001625 0.002870 0.000129
s1 33.906939 2.747011 5.006616 ... -0.005617 -0.008839 0.005336
s2 29.319152 0.279329 -6.728494 ... -0.000552 0.002171 -0.004946
s3 -0.979456 7.609406 9.679073 ... 0.026486 0.022667 -0.005935
s4 0.803214 -0.514210 -0.626845 ... 0.282119 0.328417 -0.006791
s5 0.241819 -0.210375 0.048459 ... -0.013738 -0.014107 -0.163766
s6 4.073738 -6.720256 0.820792 ... -0.002170 -0.000065 0.000257
[10 rows x 10 columns]), 'std': PCResults(data= PC1 PC2 PC3 ... PC8 PC9 PC10
0 0.587199 -1.946832 0.589205 ... 0.757431 -0.181075 -0.048953
1 -2.831625 1.372082 0.027930 ... -0.188436 0.505128 0.043599
2 0.272129 -1.634901 0.739244 ... 0.843203 -0.025353 -0.054175
3 0.049281 0.382278 -2.013032 ... -0.367871 -0.137857 -0.074558
4 -0.756421 0.811960 -0.057238 ... -1.059751 0.044284 -0.010914
.. ... ... ... ... ... ... ...
437 1.239525 -1.035968 0.928679 ... 0.126490 -0.377893 -0.025229
438 1.264719 0.761319 -1.750191 ... 0.180439 -0.371759 0.033447
439 -0.205206 -1.205487 0.496186 ... -0.491849 -0.113220 0.058875
440 0.692871 0.210127 -0.868724 ... 0.078684 -0.127211 -0.045540
441 -1.903941 3.975777 -0.048338 ... 1.185359 0.730475 -0.154558
[442 rows x 10 columns], explained_variance=0 40.242108
1 14.923197
2 12.059663
3 9.554764
4 6.621814
5 6.027171
6 5.365657
7 4.336820
8 0.783200
9 0.085607
Name: explained_variance, dtype: float64, loadings= PC1 PC2 PC3 ... PC8 PC9 PC10
age 0.434662 0.054261 0.543842 ... 0.009848 0.002269 0.000302
sex 0.375489 -0.472743 -0.117488 ... 0.292022 -0.000590 0.000339
bmi 0.608846 -0.191130 0.184181 ... 0.259050 0.011873 0.000764
bp 0.545735 -0.169098 0.564625 ... -0.314720 0.007619 -0.000298
s1 0.689365 0.700806 -0.075396 ... 0.085320 -0.011778 0.065746
s2 0.706648 0.557612 -0.296499 ... -0.126139 -0.100671 -0.052168
s3 -0.567223 0.619125 0.424407 ... 0.214029 0.134833 -0.029405
s4 0.861234 -0.083384 -0.418523 ... -0.119059 0.216804 -0.008392
s5 0.760385 -0.032026 0.069956 ... 0.296473 -0.053082 -0.024497
s6 0.647045 -0.103892 0.304363 ... -0.109843 -0.004279 0.000242
[10 rows x 10 columns])}
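Assuming the PCResults entries expose data, explained_variance and loadings as attributes, as the repr above suggests, the individual pieces can be pulled out directly. This is a hedged sketch, not a documented API guarantee:

# Assumed attribute access based on the repr above.
std_results = pca.data['std']
print(std_results.explained_variance.head())  # variance explained per component (standardised)
print(std_results.loadings['PC1'].sort_values(ascending=False))  # feature loadings on PC1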
Total running time of the script: ( 0 minutes 0.802 seconds)