.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/compare_datasets.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compare_datasets.py:


Compare Multiple N-D datasets
=============================

Comparing 1D datasets is readily achieved by overlaying distributions, boxplots, etc.
For multivariate (N-D) datasets, things get a little more difficult.

This example applies Principal Component Analysis by group variable and colors the loading vectors by group.

.. GENERATED FROM PYTHON SOURCE LINES 11-23

.. code-block:: default


    import logging

    import pandas as pd
    import plotly.io

    from elphick.sklearn_viz.features import plot_parallel_coordinates
    from elphick.sklearn_viz.features.principal_components import plot_loading_vectors, plot_correlation_circle, \
        plot_explained_variance, plot_principal_components

    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(module)s - %(funcName)s: %(message)s',
                        datefmt='%Y-%m-%dT%H:%M:%S%z')

.. GENERATED FROM PYTHON SOURCE LINES 24-26

Create a dataset
----------------

.. GENERATED FROM PYTHON SOURCE LINES 26-59

.. code-block:: default


    import numpy as np

    # for consistent results
    np.random.seed(7)

    n_samples = 125
    n_outliers = 25
    n_features = 4

    # generate Gaussian data of shape (125, 4)
    cov1 = np.array([[9, -7, -2, -2], [-7, 7, 1.5, 1], [-2, 1.5, 1, 0.5], [-2, 1, 0.5, 0.5]])
    cov2 = np.array([[5, -2, -1.5, -3], [-2, 2, 0.5, 0.5], [-1.5, 0.5, 1, 1], [-3, 0.5, 1, 3]])
    x1 = np.dot(np.random.randn(n_samples, n_features), cov1)
    x2 = np.dot(np.random.randn(n_samples, n_features), cov2)

    df_x1: pd.DataFrame = pd.DataFrame(x1, columns=[f"F{n}" for n in range(1, n_features + 1)])
    # shift the mean on two features
    df_x1['F4'] = df_x1['F4'] - 2.0
    df_x1['F2'] = df_x1['F2'] + 1.0
    df_x2: pd.DataFrame = pd.DataFrame(x2, columns=[f"F{n}" for n in range(1, n_features + 1)])
    x = pd.concat([df_x1.assign(group='one'), df_x2.assign(group='two')], axis=0).reset_index(drop=True)
    x['group'] = pd.Categorical(x['group'])
    x

.. raw:: html

	F1	F2	F3	F4	group
0	17.595620	-13.638495	-3.843379	-5.626821	one
1	-3.603537	4.780860	0.702692	-1.297896	one
2	6.549387	-3.029808	-1.845771	-3.833306	one
3	9.769170	-6.183956	-2.372003	-4.119950	one
4	6.628186	-3.129730	-1.412142	-3.611312	one
...	...	...	...	...	...
245	7.033214	-2.686300	-2.319988	-4.617525	two
246	-8.605161	5.013764	3.277691	5.560217	two
247	11.574064	-6.242021	-3.464603	-6.354748	two
248	3.751013	0.564987	-1.655431	-3.452477	two
249	6.737896	-2.744247	-2.385954	-4.576597	two

250 rows × 5 columns

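One detail of the data generation is worth making explicit. Multiplying standard normal samples by a matrix ``A`` yields data whose covariance is ``A.T @ A``, so the matrices named ``cov1`` and ``cov2`` act as mixing matrices and are not themselves the covariance of the result. The following is a minimal sketch of that relationship. The local random generator and the large sample count are assumptions made here so the estimate is stable; they are not part of the example.

.. code-block:: python

    import numpy as np

    # Mixing matrix copied from the example, where it is named ``cov1``.
    cov1 = np.array([[9, -7, -2, -2], [-7, 7, 1.5, 1], [-2, 1.5, 1, 0.5], [-2, 1, 0.5, 0.5]])

    # Assumption: a local generator and many samples, unlike the example's
    # global seed and 125 rows, so the sample covariance is a tight estimate.
    rng = np.random.default_rng(7)
    z = rng.standard_normal((200_000, 4))
    x1 = z @ cov1

    # Covariance implied by the linear map applied to unit-variance noise.
    implied = cov1.T @ cov1
    sample = np.cov(x1, rowvar=False)

    # Largest absolute deviation, scaled by the largest implied entry.
    max_rel_err = np.abs(sample - implied).max() / np.abs(implied).max()
    print(np.round(implied, 2))
    print(f"max relative error: {max_rel_err:.4f}")

The same reasoning applies to ``cov2``. It explains why the spread of each feature in the table above is larger than the diagonal of the input matrix alone would suggest.
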
.. GENERATED FROM PYTHON SOURCE LINES 60-65

Explore the data
----------------

The parallel coordinate plot is a good place to start. The differences in mean and variance
across some features are clear.

.. GENERATED FROM PYTHON SOURCE LINES 65-69

.. code-block:: default


    fig = plot_parallel_coordinates(x, color='group')
    fig

.. raw:: html

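The differences in mean and variance can also be checked numerically. The sketch below rebuilds the dataset with the same seed and matrices as the example and tabulates per-group statistics with pandas. The variable names ``means``, ``variances`` and ``summary`` are introduced here for illustration only.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Rebuild the example dataset (same seed, matrices and mean shifts).
    np.random.seed(7)
    cols = ["F1", "F2", "F3", "F4"]
    cov1 = np.array([[9, -7, -2, -2], [-7, 7, 1.5, 1], [-2, 1.5, 1, 0.5], [-2, 1, 0.5, 0.5]])
    cov2 = np.array([[5, -2, -1.5, -3], [-2, 2, 0.5, 0.5], [-1.5, 0.5, 1, 1], [-3, 0.5, 1, 3]])
    df1 = pd.DataFrame(np.random.randn(125, 4) @ cov1, columns=cols)
    df1["F4"] = df1["F4"] - 2.0
    df1["F2"] = df1["F2"] + 1.0
    df2 = pd.DataFrame(np.random.randn(125, 4) @ cov2, columns=cols)
    x = pd.concat([df1.assign(group="one"), df2.assign(group="two")], ignore_index=True)

    # Per-group mean and variance of every feature.
    means = x.groupby("group")[cols].mean()
    variances = x.groupby("group")[cols].var()

    # Stack both statistics into one table indexed by (statistic, group).
    summary = pd.concat({"mean": means, "var": variances}).round(2)
    print(summary)

A table like this complements the plot: the plot shows where the groups separate, while the numbers show by how much.
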
.. GENERATED FROM PYTHON SOURCE LINES 70-74

Explore the Principal Components
--------------------------------

Note that in the next plot, the data is colored by group, but the loading vectors are computed
from the entire dataset.

.. GENERATED FROM PYTHON SOURCE LINES 74-79

.. code-block:: default


    fig = plot_principal_components(plot_3d=False, x=x.drop(columns=['group']), color=x['group'])
    fig.update_layout(height=800)
    fig

.. raw:: html

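To make "loading vectors for the entire dataset" concrete, here is a minimal NumPy sketch of PCA on the pooled data. It is not the implementation used by ``plot_principal_components``. It assumes centred but unstandardised features, and it uses a local random generator, so the values differ from those in the example.

.. code-block:: python

    import numpy as np

    # Assumption: a local generator; matrices and mean shifts follow the example.
    rng = np.random.default_rng(7)
    mix1 = np.array([[9, -7, -2, -2], [-7, 7, 1.5, 1], [-2, 1.5, 1, 0.5], [-2, 1, 0.5, 0.5]])
    mix2 = np.array([[5, -2, -1.5, -3], [-2, 2, 0.5, 0.5], [-1.5, 0.5, 1, 1], [-3, 0.5, 1, 3]])
    x1 = rng.standard_normal((125, 4)) @ mix1 + np.array([0.0, 1.0, 0.0, -2.0])
    x2 = rng.standard_normal((125, 4)) @ mix2
    data = np.vstack([x1, x2])  # pooled data, group labels ignored

    # PCA via SVD of the centred data matrix.
    centred = data - data.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)

    loadings = vt.T                        # rows: features, columns: components
    scores = centred @ loadings            # sample coordinates in PC space
    explained_ratio = s**2 / np.sum(s**2)  # share of variance per component

    print(np.round(loadings[:, :2], 3))
    print(np.round(explained_ratio, 3))

Because the group label plays no part in the decomposition, both groups share one set of axes. The label would only be used afterwards, to color the rows of ``scores``.
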
.. GENERATED FROM PYTHON SOURCE LINES 80-81

The example below shows the loading vectors for each group, omitting the data for clarity.

.. GENERATED FROM PYTHON SOURCE LINES 81-84

.. code-block:: default


    fig = plot_loading_vectors(x=x.drop(columns=['group']), color=x['group'])
    fig

.. raw:: html

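Per-group loading vectors can be sketched by fitting a separate PCA to each group. The helper ``pca_loadings`` below is hypothetical and written for this illustration. Whether ``plot_loading_vectors`` computes its vectors this way is an assumption, and the local random generator means the values differ from the example.

.. code-block:: python

    import numpy as np


    def pca_loadings(a: np.ndarray) -> np.ndarray:
        """Return PCA loadings (features x components) of centred data."""
        centred = a - a.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return vt.T


    # Assumption: a local generator; matrices and mean shifts follow the example.
    rng = np.random.default_rng(7)
    mix1 = np.array([[9, -7, -2, -2], [-7, 7, 1.5, 1], [-2, 1.5, 1, 0.5], [-2, 1, 0.5, 0.5]])
    mix2 = np.array([[5, -2, -1.5, -3], [-2, 2, 0.5, 0.5], [-1.5, 0.5, 1, 1], [-3, 0.5, 1, 3]])
    groups = {
        "one": rng.standard_normal((125, 4)) @ mix1 + np.array([0.0, 1.0, 0.0, -2.0]),
        "two": rng.standard_normal((125, 4)) @ mix2,
    }

    per_group = {name: pca_loadings(arr) for name, arr in groups.items()}
    pooled = pca_loadings(np.vstack(list(groups.values())))

    # Compare each group's first component with the pooled one. The sign of a
    # principal component is arbitrary, so the absolute cosine is used.
    alignment = {name: abs(float(load[:, 0] @ pooled[:, 0])) for name, load in per_group.items()}
    for name, cos in alignment.items():
        print(f"group {name}: |cos| between group PC1 and pooled PC1 = {cos:.3f}")

An alignment close to 1 means a group's dominant direction matches the pooled one. Lower values indicate that the groups have different covariance structure, which is what the per-group plot is meant to reveal.
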
.. GENERATED FROM PYTHON SOURCE LINES 85-86

Finally, standardising the input data prior to PCA allows the correlation circle to be shown by group.

.. GENERATED FROM PYTHON SOURCE LINES 86-91

.. code-block:: default


    fig = plot_correlation_circle(x=x.drop(columns=['group']), color=x['group'])
    fig.update_layout(height=800, width=800)
    plotly.io.show(fig)

.. raw:: html
    :file: images/sphx_glr_compare_datasets_001.html


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.344 seconds)


.. _sphx_glr_download_auto_examples_compare_datasets.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: compare_datasets.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: compare_datasets.ipynb `


.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_