.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/compare_datasets.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compare_datasets.py:


Compare Multiple N-D datasets
=============================

Comparing 1D datasets is readily achieved by overlaying distributions, boxplots, etc.
For multivariate (N-D) datasets, the comparison is more difficult.
This example applies Principal Component Analysis by group variable and colors the loading vectors by group.

.. GENERATED FROM PYTHON SOURCE LINES 11-23

.. code-block:: default

    import logging

    import pandas as pd
    import plotly.io

    from elphick.sklearn_viz.features import plot_parallel_coordinates
    from elphick.sklearn_viz.features.principal_components import plot_loading_vectors, plot_correlation_circle, \
        plot_explained_variance, plot_principal_components

    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                        datefmt='%Y-%m-%dT%H:%M:%S%z')

.. GENERATED FROM PYTHON SOURCE LINES 24-26

Create a dataset
----------------

.. GENERATED FROM PYTHON SOURCE LINES 26-59

.. code-block:: default

    import numpy as np

    # for consistent results
    np.random.seed(7)

    n_samples = 125
    n_outliers = 25
    n_features = 4

    # generate Gaussian data of shape (125, 4)
    cov1 = np.array([[9, -7, -2, -2], [-7, 7, 1.5, 1], [-2, 1.5, 1, 0.5], [-2, 1, 0.5, 0.5]])
    cov2 = np.array([[5, -2, -1.5, -3], [-2, 2, 0.5, 0.5], [-1.5, 0.5, 1, 1], [-3, 0.5, 1, 3]])
    x1 = np.dot(np.random.randn(n_samples, n_features), cov1)
    x2 = np.dot(np.random.randn(n_samples, n_features), cov2)

    df_x1: pd.DataFrame = pd.DataFrame(x1, columns=[f"F{n}" for n in range(1, n_features + 1)])
    # shift the mean on two features
    df_x1['F4'] = df_x1['F4'] - 2.0
    df_x1['F2'] = df_x1['F2'] + 1.0
    df_x2: pd.DataFrame = pd.DataFrame(x2, columns=[f"F{n}" for n in range(1, n_features + 1)])

    x = pd.concat([df_x1.assign(group='one'), df_x2.assign(group='two')], axis=0).reset_index(drop=True)
    x['group'] = pd.Categorical(x['group'])
    x

.. raw:: html

                F1         F2        F3        F4 group
    0    17.595620 -13.638495 -3.843379 -5.626821   one
    1    -3.603537   4.780860  0.702692 -1.297896   one
    2     6.549387  -3.029808 -1.845771 -3.833306   one
    3     9.769170  -6.183956 -2.372003 -4.119950   one
    4     6.628186  -3.129730 -1.412142 -3.611312   one
    ...        ...        ...       ...       ...   ...
    245   7.033214  -2.686300 -2.319988 -4.617525   two
    246  -8.605161   5.013764  3.277691  5.560217   two
    247  11.574064  -6.242021 -3.464603 -6.354748   two
    248   3.751013   0.564987 -1.655431 -3.452477   two
    249   6.737896  -2.744247 -2.385954 -4.576597   two

    250 rows × 5 columns



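In the dataset code above, the matrices named ``cov1`` and ``cov2`` are multiplied onto standard-normal samples, so they act as mixing matrices: the covariance of the generated data is ``A.T @ A`` rather than ``A`` itself. The sketch below checks this numerically for ``cov1``. The helper name ``empirical_vs_implied_cov``, the sample size and the seed are illustrative assumptions and are not part of the example.

```python
import numpy as np

# Mixing matrix copied from the example above (there it is named cov1).
A = np.array([[9, -7, -2, -2],
              [-7, 7, 1.5, 1],
              [-2, 1.5, 1, 0.5],
              [-2, 1, 0.5, 0.5]])


def empirical_vs_implied_cov(mix, n_samples=200_000, seed=0):
    """Return (empirical, implied) covariance of z @ mix for standard-normal z.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, mix.shape[0]))
    data = z @ mix
    empirical = np.cov(data, rowvar=False)
    implied = mix.T @ mix  # Cov(z @ mix) = mix.T @ I @ mix
    return empirical, implied


empirical, implied = empirical_vs_implied_cov(A)
print(np.round(implied, 2))
print(np.round(empirical, 2))
```

If data with covariance exactly equal to a chosen matrix is wanted, one common alternative is to mix with a Cholesky factor of that matrix, which requires the matrix to be positive definite.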
.. GENERATED FROM PYTHON SOURCE LINES 60-65

Explore the data
----------------

The parallel coordinate plot is a good place to start. The differences in mean and variance across some features are clear.

.. GENERATED FROM PYTHON SOURCE LINES 65-69

.. code-block:: default

    fig = plot_parallel_coordinates(x, color='group')
    fig

.. raw:: html


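The visual impression of differing means and variances can be backed up with a per-group summary table. The following is a minimal sketch using pandas ``groupby``; the stand-in data, the two-feature layout and the variable names are assumptions made so the block runs on its own, and the same ``groupby`` call should apply to a frame shaped like ``x``.

```python
import numpy as np
import pandas as pd

# Small stand-in dataset (an assumption): two groups with different spread,
# and a shifted mean on F2 for group "one".
rng = np.random.default_rng(7)
features = ["F1", "F2"]
one = pd.DataFrame(rng.normal(0.0, 3.0, size=(125, 2)), columns=features)
one["F2"] = one["F2"] + 1.0
two = pd.DataFrame(rng.normal(0.0, 1.0, size=(125, 2)), columns=features)
data = pd.concat([one.assign(group="one"), two.assign(group="two")],
                 ignore_index=True)

# Mean and standard deviation of every feature, per group.
summary = data.groupby("group")[features].agg(["mean", "std"])
print(summary.round(2))
```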
.. GENERATED FROM PYTHON SOURCE LINES 70-74

Explore the Principal Components
--------------------------------

Note that in the next plot, the data is colored by group, but the loading vectors are for the entire dataset.

.. GENERATED FROM PYTHON SOURCE LINES 74-79

.. code-block:: default

    fig = plot_principal_components(plot_3d=False, x=x.drop(columns=['group']), color=x['group'])
    fig.update_layout(height=800)
    fig

.. raw:: html


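For readers who want to see what a PCA fitted on the entire dataset computes, here is a minimal NumPy sketch based on the singular value decomposition of the centred data. It is not the implementation used by ``plot_principal_components``; the function name ``pca_svd`` and the illustrative data are assumptions. Loading vectors are taken here as the unit-length principal axes, and a plotting library may scale them differently.

```python
import numpy as np


def pca_svd(data):
    """PCA of an (n_samples, n_features) array via SVD of the centred data.

    Hypothetical helper for illustration only.
    """
    centred = data - data.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained_variance = singular_values ** 2 / (data.shape[0] - 1)
    scores = centred @ vt.T  # samples projected onto the principal axes
    return vt, explained_variance, scores


rng = np.random.default_rng(7)
# Illustrative correlated data, not the dataset from the example.
mixing = np.array([[3.0, 1.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.2]])
data = rng.standard_normal((250, 3)) @ mixing

components, explained_variance, scores = pca_svd(data)
ratio = explained_variance / explained_variance.sum()
print(np.round(components, 3))  # one principal axis per row
print(np.round(ratio, 3))       # share of variance per component
```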
.. GENERATED FROM PYTHON SOURCE LINES 80-81

The example below shows the loading vectors for each group, omitting the data points for clarity.

.. GENERATED FROM PYTHON SOURCE LINES 81-84

.. code-block:: default

    fig = plot_loading_vectors(x=x.drop(columns=['group']), color=x['group'])
    fig

.. raw:: html


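Loading vectors by group can be understood as fitting one PCA per group rather than a single PCA on the pooled data. The sketch below illustrates that idea with plain NumPy; it is an assumption about the general approach and not a description of how ``plot_loading_vectors`` is implemented. The helper ``principal_axes`` and the two synthetic groups are hypothetical.

```python
import numpy as np


def principal_axes(data):
    """Unit-length principal axes (rows) of an (n_samples, n_features) array.

    Hypothetical helper for illustration only.
    """
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt


rng = np.random.default_rng(7)
# Two illustrative groups whose dominant directions differ (assumed data).
groups = {
    "one": rng.standard_normal((125, 2)) * np.array([3.0, 0.5]),
    "two": rng.standard_normal((125, 2)) * np.array([0.5, 3.0]),
}

# Fit one PCA per group instead of a single PCA on the pooled data.
axes_by_group = {name: principal_axes(values) for name, values in groups.items()}

for name, axes in axes_by_group.items():
    print(name, np.round(axes[0], 3))  # first principal axis of each group

# The sign of a principal axis is arbitrary, so compare directions by |cosine|.
alignment = abs(axes_by_group["one"][0] @ axes_by_group["two"][0])
print(round(float(alignment), 3))
```

A value of ``alignment`` near 1 means the groups share a dominant direction, while a value near 0 means their dominant directions are close to orthogonal.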
.. GENERATED FROM PYTHON SOURCE LINES 85-86

Finally, standardising the input data prior to PCA allows the correlation circle to be shown by group.

.. GENERATED FROM PYTHON SOURCE LINES 86-91

.. code-block:: default

    fig = plot_correlation_circle(x=x.drop(columns=['group']), color=x['group'])
    fig.update_layout(height=800, width=800)
    plotly.io.show(fig)

.. raw:: html
    :file: images/sphx_glr_compare_datasets_001.html

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.344 seconds)

.. _sphx_glr_download_auto_examples_compare_datasets.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: compare_datasets.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: compare_datasets.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_