Parallel Coordinates

Parallel coordinate plots draw each record as a line crossing one vertical axis per variable, which makes them useful for Exploratory Data Analysis (EDA) of multivariate data.

Typically the lines are colored by the target variable, since it is the variable of most interest, though this is optional.

The interactive nature of plotly is a real asset for this particular plot. Records/samples can be highlighted by clicking and dragging the mouse vertically along the axis of a variable (feature or target). Multiple selections are possible. Single-clicking a selection removes it.
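To make the effect of a selection concrete, the sketch below reproduces it as a plain pandas filter. It assumes that intervals brushed on the same axis combine as a union and that selections on different axes combine as an intersection, which is how plotly's parallel coordinates trace is understood to behave. The small frame, the `selections` dictionary and the `apply_brushes` helper are hypothetical and are not part of `elphick.sklearn_viz`.

```python
import pandas as pd

# Small hypothetical frame standing in for the wine data loaded later on.
df = pd.DataFrame({
    "alcohol": [14.2, 13.2, 12.4, 13.7, 12.1],
    "proline": [1065, 1050, 520, 740, 480],
    "target": [0, 0, 1, 2, 1],
})

# Hypothetical brushed ranges: axis name -> list of (low, high) intervals.
selections = {
    "alcohol": [(13.0, 14.5)],
    "proline": [(400, 800), (1060, 1100)],
}


def apply_brushes(frame, brushes):
    """Return the rows that fall inside a brushed interval on every brushed axis."""
    mask = pd.Series(True, index=frame.index)
    for column, intervals in brushes.items():
        # Union of the intervals brushed on this axis.
        in_any = pd.Series(False, index=frame.index)
        for low, high in intervals:
            in_any |= frame[column].between(low, high)
        # Intersection across the brushed axes.
        mask &= in_any
    return frame[mask]


highlighted = apply_brushes(df, selections)
print(highlighted)
```

Only the rows that satisfy every brushed axis remain, which corresponds to the lines that stay highlighted in the interactive plot.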

import pandas as pd
import plotly.io as pio
from sklearn.datasets import load_diabetes, load_wine

from elphick.sklearn_viz.features import plot_parallel_coordinates

Load Classification Data

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target.rename('target')
df = pd.concat([X, y], axis=1)
df
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740.0 2
174 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750.0 2
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835.0 2
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840.0 2
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560.0 2

178 rows × 14 columns



Plot Classification Data

fig = plot_parallel_coordinates(df, color=y.name)
# noinspection PyTypeChecker
pio.show(fig)

Coloring by the target is optional. If the plot is too dense, consider plotting a random sample of the records, as demonstrated here.

fig = plot_parallel_coordinates(df.sample(frac=0.5))
fig
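A plain random sample changes on every run and can under-represent small classes. The following sketch, on a hypothetical imbalanced frame, shows two variations that may help: fixing `random_state` so the sample is repeatable, and sampling within each class with pandas' `groupby(...).sample(...)` so the class proportions are preserved. The frame and column names are illustrative only.

```python
import pandas as pd

# Hypothetical imbalanced frame; 'target' plays the role of the class label.
df = pd.DataFrame({
    "feature": range(12),
    "target": [0] * 8 + [1] * 4,
})

# Plain random sample, made repeatable with a fixed seed.
simple = df.sample(frac=0.5, random_state=42)

# Stratified sample: draw half of the rows from each class separately.
stratified = df.groupby("target").sample(frac=0.5, random_state=42)

print(stratified["target"].value_counts().sort_index())
```

With this frame the stratified sample keeps four rows of class 0 and two rows of class 1, the same 2:1 ratio as the full data. Either sample could then be passed to `plot_parallel_coordinates` in place of `df.sample(frac=0.5)`.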


Load Regression Data

diabetes = load_diabetes(as_frame=True, scaled=False)
X, y = diabetes.data, diabetes.target.rename('target')
df = pd.concat([X, y], axis=1)
df
age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 59.0 2.0 32.1 101.00 157.0 93.2 38.0 4.00 4.8598 87.0 151.0
1 48.0 1.0 21.6 87.00 183.0 103.2 70.0 3.00 3.8918 69.0 75.0
2 72.0 2.0 30.5 93.00 156.0 93.6 41.0 4.00 4.6728 85.0 141.0
3 24.0 1.0 25.3 84.00 198.0 131.4 40.0 5.00 4.8903 89.0 206.0
4 50.0 1.0 23.0 101.00 192.0 125.4 52.0 4.00 4.2905 80.0 135.0
... ... ... ... ... ... ... ... ... ... ... ...
437 60.0 2.0 28.2 112.00 185.0 113.8 42.0 4.00 4.9836 93.0 178.0
438 47.0 2.0 24.9 75.00 225.0 166.0 42.0 5.00 4.4427 102.0 104.0
439 60.0 2.0 24.9 99.67 162.0 106.6 43.0 3.77 4.1271 95.0 132.0
440 36.0 1.0 30.0 95.00 201.0 125.2 42.0 4.79 5.1299 85.0 220.0
441 36.0 1.0 19.6 71.00 250.0 133.2 97.0 3.00 4.5951 92.0 57.0

442 rows × 11 columns



Plot Regression Data

fig = plot_parallel_coordinates(df, color=y.name)
fig


Categorical data is supported

df['sex'] = df['sex'].map({1: 'Male', 2: 'Female'}).astype('category')
fig = plot_parallel_coordinates(df.sample(frac=0.5), color=y.name)
fig
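A parallel coordinates axis is numeric, so a categorical variable has to be mapped to positions on the axis. One common approach, sketched below, uses the integer codes of a pandas categorical and keeps the category names as tick labels. This is an illustration under that assumption and not necessarily how `plot_parallel_coordinates` handles categoricals internally. The `dimension` dictionary follows the `label`, `values`, `tickvals` and `ticktext` keys accepted by the dimensions of plotly's `Parcoords` trace.

```python
import pandas as pd

# Hypothetical column mirroring the 'sex' mapping used above.
sex = pd.Series([1, 2, 2, 1]).map({1: "Male", 2: "Female"}).astype("category")

# Integer codes give each category a position on the axis.
codes = sex.cat.codes
labels = list(sex.cat.categories)

# A dimension specification in the shape a Parcoords trace accepts.
dimension = {
    "label": "sex",
    "values": codes.tolist(),
    "tickvals": list(range(len(labels))),
    "ticktext": labels,
}
print(dimension)
```

Categories are ordered alphabetically by default, so here `Female` maps to 0 and `Male` to 1.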

