Confusion Matrix as Sankey Diagram

In machine learning, a confusion matrix is a table that summarizes how well the predictions of a classification model (typically a supervised one) match the actual labels. It helps a great deal in understanding the model's behavior and interpreting its results.

Usually, a confusion matrix is displayed as raw numbers in an array. Very often we visualize a confusion matrix by plotting it as a heatmap. But there is also another, more elegant and interactive way. In this notebook we will describe step-by-step the process of creating an interactive Sankey confusion matrix using Plotly.

These are the main features of a Sankey confusion matrix:

Nodes: The source nodes are positioned on the left, representing the actual class labels. The target nodes are on the right, representing the predicted class labels.
Node size: The size of each node is proportional to the number of samples belonging to that specific class, offering insights into the class distribution.
Links: The links between nodes show the flow of samples during the classification process. The width of each link is proportional to the number of samples classified correctly or incorrectly between the respective class labels.
Tooltips: Hovering over the nodes and links provides additional information, displaying the numerical and textual representation of the confusion matrix.

Dependencies¶

In [1]:

import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)

import os

# Classification metrics
from sklearn.metrics import confusion_matrix 

Plotly dependencies

In [2]:

from plotly import graph_objects as go

# Set the appropriate renderer in Jupyter Lab so that Plotly displays figures correctly
# Here the default renderer is set explicitly to 'iframe'
import plotly.io as pio
pio.renderers.default = 'iframe' 

# If multiple notebooks are using 'iframe', set different 'html_directory' for each notebook
iframe_renderer = pio.renderers['iframe']
iframe_renderer.html_directory='iframe_figures_n1'

Helper Functions¶

To help us with visualizations, we will import the module metrics_utilities.py. It is a collection of several helper functions for confusion matrix visualization.

The function developed in this notebook will be added to this module.

In [3]:

# Import the script from different folder
import sys  
sys.path.append('./scripts')

import metrics_utilities as mu

Preparing Data¶

The data used in this notebook comes from one of my previous projects, Bank-Churn-Prediction.
I saved the true (actual) labels and the predictions in .npy format, and we will load them in the next two cells.

Load Test True (Actual) Labels¶

In [4]:

# True labels
y_test = np.load('./data/y_test.npy')

Load Predictions¶

In [5]:

# Prepare predictions for our models
pred_dt = np.load('./data/pred_dt.npy')
pred_dl = np.load('./data/pred_dl.npy')
pred_knn = np.load('./data/pred_knn.npy')
pred_lr = np.load('./data/pred_lr.npy')
pred_rf = np.load('./data/pred_rf.npy')
pred_svm = np.load('./data/pred_svm.npy')
pred_xgb = np.load('./data/pred_xgb.npy')

Names of Classes¶

The target_names variable holds names of our classes. It will be used later for displaying evaluation results.

In [6]:

# Names of our classes
target_names = ['Stays', 'Exits']

Confusion Matrix¶

Axes Convention¶

In the literature, we can find two variants for representing the samples in a confusion matrix:

each row of the matrix represents samples in an actual class, and each column represents samples in a predicted class
in the other variant, this arrangement is reversed, with each row representing samples in the predicted class and each column representing samples in the actual class.

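In this notebook we will use the first variant, where rows correspond to actual labels and columns correspond to predicted labels. This is also the convention used by scikit-learn's confusion_matrix. Let us consider a binary classification problem with two classes: 0 (Negative) and 1 (Positive). The confusion matrix would be:

	Predicted 0 (Negative)	Predicted 1 (Positive)
Actual 0 (Negative)	True Negative (TN)	False Positive (FP)
Actual 1 (Positive)	False Negative (FN)	True Positive (TP)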
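The row/column convention can be checked on a tiny example. The sketch below uses made-up toy labels (not the notebook's data) and assumes scikit-learn is installed; it shows that rows follow the actual class and columns follow the predicted class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 0 = Negative, 1 = Positive
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes
cm_toy = confusion_matrix(y_true, y_pred)
print(cm_toy)

# For a binary problem, row-major flattening yields TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Here the single actual-Negative sample predicted as Positive lands in row 0, column 1, which is the False Positive cell.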

Why Normalized Confusion Matrix?¶

As real-life data is frequently imbalanced, utilizing a confusion matrix without normalization can potentially lead to misleading or incorrect conclusions.

In our Sankey diagram we will include values from both the unnormalized and the normalized matrix.
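To make the imbalance argument concrete, here is a minimal sketch that uses the counts of the Decision Tree confusion matrix computed later in this notebook. The variable names are illustrative. It compares the overall accuracy with the per-class rates that row normalization exposes.

```python
import numpy as np

# Counts of the Decision Tree confusion matrix computed later in this notebook
cm_example = np.array([[1979, 410],
                       [198, 413]])

# Overall accuracy: correct predictions (diagonal) over all samples
accuracy = np.trace(cm_example) / cm_example.sum()

# Per-class recall: diagonal divided by row sums (actual class totals)
recall_per_class = np.diag(cm_example) / cm_example.sum(axis=1)

print(f"accuracy: {accuracy:.3f}")
print(f"recall per class: {np.round(recall_per_class, 3)}")
```

The overall accuracy is dominated by the larger Stays class, while the per-class view shows that the smaller Exits class is recognized noticeably less often.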

The simplest way to display a confusion matrix is as raw numbers in an array.

Unnormalized Confusion Matrix

In [7]:

# Confusion matrix
cm = confusion_matrix(y_test, pred_dt)
print(cm)

[[1979  410]
 [ 198  413]]

Normalized Confusion Matrix

A few points to note:

to calculate the normalized version, divide each element of a row by the sum of that row
each row sum is the total number of true (actual) samples for that class label
the normalized matrix shows, for each actual (true) class, the fraction of its samples that the model assigned to each predicted class.

In [8]:

# Normalized confusion matrix
cmn = np.around(cm / cm.sum(axis=1)[:, np.newaxis], 2)
print(cmn)

[[0.83 0.17]
 [0.32 0.68]]
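As an alternative to the manual division, scikit-learn's confusion_matrix accepts a normalize argument. The sketch below uses made-up toy labels and checks that normalize='true' (normalization over the actual labels, i.e. per row) agrees with the manual approach.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Manual row normalization: divide each row by its sum
cm_toy = confusion_matrix(y_true, y_pred)
manual = cm_toy / cm_toy.sum(axis=1, keepdims=True)

# Built-in row normalization (normalize over the true labels)
builtin = confusion_matrix(y_true, y_pred, normalize='true')

print(manual)
print(np.allclose(manual, builtin))
```

In this notebook we keep the manual version because the raw counts are needed as well.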

Sankey Diagram¶

A Sankey Diagram is a visual tool used to illustrate the transfer of energy, money, materials, or the flow of any isolated system or process.
It provides a clear depiction of flows and their quantities, showcasing the proportion of values transferring from one set to another. In a Sankey Diagram, the interconnected elements are referred to as nodes, and the connections between them (flows) are known as links.

To understand Sankey diagrams, it is fundamental to become familiar with the key terminology:

Source: This represents the starting node or the origin of the flow.
Target: It refers to the node to which the source connects, representing the destination of the flow.
Value: This corresponds to the connection flow volume, denoted by a numerical value, and determines the thickness of the lines that connect the nodes in the Sankey diagram.
Label: The label refers to the name or description of the node, helping to identify and understand the elements represented in the diagram.

Create Dataframe for Sankey¶

To create a Sankey diagram, first we have to organize our data. We will use Pandas DataFrame to prepare and store the data.
We will split the process into several steps:

base dataframe - using all data from our confusion matrix
node labels - create node labels list and dictionary of node labels indices
final dataframe - add columns for normalized matrix, color and text to display when hovering over the links
mapping node labels to integers
prepare text that we want to print in bold font

Base Dataframe and Node Labels¶

Confusion Matrix → DataFrame¶

Create the DataFrame from the confusion matrix, using previously defined target_names as row and column names.

In [9]:

# Create dataframe
df = pd.DataFrame(cm, columns=target_names, index=target_names)
df

Out[9]:

	Stays	Exits
Stays	1979	410
Exits	198	413

The Goal¶

A Sankey diagram requires three data columns — one for the "From" column (source nodes), one for the "To" column (target nodes), and one for the values (flow quantity) corresponding to each pairing (link).

We need to transform this base dataframe to the following dataframe:

                 actual	        predicted	  samples
        0	ACTUAL Stays	PREDICTED Stays	    1979
        1	ACTUAL Stays	PREDICTED Exits	     410
        2	ACTUAL Exits	PREDICTED Stays	     198
        3	ACTUAL Exits	PREDICTED Exits	     413

For our Sankey diagram

column actual represents Sankey source nodes
column predicted represents Sankey target nodes
column samples represents flow quantities for Sankey links.

Later we will add more columns to improve interpretability of our Sankey confusion matrix.

Name the Axes¶

Let's name the row axis ACTUAL and the column axis PREDICTED.

In [10]:

# Axes naming
df = df.rename_axis(index='ACTUAL', columns='PREDICTED')
df

Out[10]:

PREDICTED	Stays	Exits
ACTUAL
Stays	1979	410
Exits	198	413

Update Columns¶

Let's prepend the axis names to the row and column labels.

In [11]:

[f'ACTUAL {s}' for s in target_names]

Out[11]:

['ACTUAL Stays', 'ACTUAL Exits']

In [12]:

# Set new labels for rows and columns
df = df.set_axis([f'ACTUAL {s}' for s in target_names], axis=0)
df = df.set_axis([f'PREDICTED {s}' for s in target_names], axis=1)
print(df)

              PREDICTED Stays  PREDICTED Exits
ACTUAL Stays             1979              410
ACTUAL Exits              198              413

We can do the same in one step.
We will create the dataframe directly from the confusion matrix, using the new labels for rows and columns.

In [13]:

df = pd.DataFrame(cm, columns=[f'PREDICTED {s}' for s in target_names], index=[f'ACTUAL {s}' for s in target_names])
print(df)

              PREDICTED Stays  PREDICTED Exits
ACTUAL Stays             1979              410
ACTUAL Exits              198              413

Node Labels¶

IMPORTANT: A Sankey trace only takes integers (node indices) for source and target values.
We will do this transformation a little later. For now, let's prepare the data for it.

First we will create a list of node labels and then a dictionary of their indices.

List of Node Labels¶

In [14]:

# column labels --> Sankey target nodes
cl = df.columns.values.tolist()
cl

Out[14]:

['PREDICTED Stays', 'PREDICTED Exits']

In [15]:

# row labels --> Sankey source nodes
rl = df.index.values.tolist()
rl

Out[15]:

['ACTUAL Stays', 'ACTUAL Exits']

In [16]:

node_labels = rl + cl
node_labels

Out[16]:

['ACTUAL Stays', 'ACTUAL Exits', 'PREDICTED Stays', 'PREDICTED Exits']

Indices for Node Labels¶

In [17]:

# Create dictionary with node labels indices
node_labels_inds = {label:ind for ind, label in enumerate(node_labels)}
node_labels_inds

Out[17]:

{'ACTUAL Stays': 0,
 'ACTUAL Exits': 1,
 'PREDICTED Stays': 2,
 'PREDICTED Exits': 3}

Reshape DataFrame¶

For a Sankey diagram we need to plot flows from source nodes to target nodes. The flows are the numbers of samples being correctly or incorrectly classified.

Our source nodes: 'ACTUAL Stays', 'ACTUAL Exits'
Our target nodes: 'PREDICTED Stays', 'PREDICTED Exits'

The new reshaped dataframe will have 2x2=4 rows, 4 combinations. Each row is one flow:

ACTUAL Stays → PREDICTED Stays = # of Stays correctly classified
ACTUAL Stays → PREDICTED Exits = # of Stays incorrectly classified
ACTUAL Exits → PREDICTED Stays = # of Exits incorrectly classified
ACTUAL Exits → PREDICTED Exits = # of Exits correctly classified

To accomplish this we will use the following Pandas functions:

stack() - stacks the columns into rows and returns a Series with a two-level MultiIndex
reset_index() - resets the multilevel index to the default one, converting the original index levels to columns
rename() - renames the new columns

In [18]:

# Reshape dataframe
df = df.stack().reset_index()
df.rename(columns={0:'samples', 'level_0':'actual', 'level_1':'predicted'}, inplace=True)
df

Out[18]:

	actual	predicted	samples
0	ACTUAL Stays	PREDICTED Stays	1979
1	ACTUAL Stays	PREDICTED Exits	410
2	ACTUAL Exits	PREDICTED Stays	198
3	ACTUAL Exits	PREDICTED Exits	413
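A possible variation, shown here as a sketch rather than the approach used in this notebook: if the axes are named before stacking, the index levels already carry the desired column names, so the 'level_0'/'level_1' renaming is not needed. The names wide and long_df are illustrative.

```python
import numpy as np
import pandas as pd

cm_example = np.array([[1979, 410],
                       [198, 413]])
class_names = ['Stays', 'Exits']

wide = pd.DataFrame(cm_example,
                    index=[f'ACTUAL {s}' for s in class_names],
                    columns=[f'PREDICTED {s}' for s in class_names])

# Naming the axes first means the stacked index levels are already
# called 'actual' and 'predicted'; reset_index(name=...) names the values
long_df = (wide.rename_axis(index='actual', columns='predicted')
               .stack()
               .reset_index(name='samples'))
print(long_df)
```

The result has the same three columns (actual, predicted, samples) as the dataframe above.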

Final DataFrame¶

Normalized Confusion Matrix Column¶

In [19]:

# Normalized confusion matrix
cmn = np.around(cm / cm.sum(axis=1)[:, np.newaxis], 2)
print(cmn)

[[0.83 0.17]
 [0.32 0.68]]

In [20]:

# Flatten normalized confusion matrix (row-major, same order as the stacked rows) and add as a new column
df['norm_samples'] = cmn.ravel()
df

Out[20]:

	actual	predicted	samples	norm_samples
0	ACTUAL Stays	PREDICTED Stays	1979	0.83
1	ACTUAL Stays	PREDICTED Exits	410	0.17
2	ACTUAL Exits	PREDICTED Stays	198	0.32
3	ACTUAL Exits	PREDICTED Exits	413	0.68

Add New Columns `color` and `link_hover_text`¶

The link color is determined by the classification result (correct or incorrect).

In [21]:

incorrect_red = "rgba(205, 92, 92, 0.8)"
correct_green = "rgba(144, 238, 144, 0.8)"

Create a helper function that adds two columns: color, and link_hover_text with the text to be displayed when hovering over the Sankey links.

In [22]:

# 'color' - link color based on classification result (correct or incorrect)
# 'link_hover_text' - text for hovering over connecting links of sankey diagram
def new_columns(row):
    source_1 = ''.join(row.actual.split()[1:])
    target_1 = ''.join(row.predicted.split()[1:])
    # Correct classification
    if source_1 == target_1:
        row['color'] = correct_green
        row['link_hover_text'] = f"{row.samples} ({row.norm_samples:.0%}) {source_1} samples correctly classified as {target_1}"
    # Incorrect classification
    else:
        row['color'] = incorrect_red
        row['link_hover_text'] = f"{row.samples} ({row.norm_samples:.0%}) {source_1} samples incorrectly classified as {target_1}"
    return row

Finalize the DataFrame.

In [23]:

# Apply helper function row by row
df = df.apply(new_columns, axis=1)
df

Out[23]:

	actual	predicted	samples	norm_samples	color	link_hover_text
0	ACTUAL Stays	PREDICTED Stays	1979	0.83	rgba(144, 238, 144, 0.8)	1979 (83%) Stays samples correctly classified as Stays
1	ACTUAL Stays	PREDICTED Exits	410	0.17	rgba(205, 92, 92, 0.8)	410 (17%) Stays samples incorrectly classified as Exits
2	ACTUAL Exits	PREDICTED Stays	198	0.32	rgba(205, 92, 92, 0.8)	198 (32%) Exits samples incorrectly classified as Stays
3	ACTUAL Exits	PREDICTED Exits	413	0.68	rgba(144, 238, 144, 0.8)	413 (68%) Exits samples correctly classified as Exits
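For larger matrices, the color column could also be computed without a row-wise apply. The sketch below is an illustrative alternative, not the notebook's method. It rebuilds a small links dataframe by hand and uses np.where on a boolean mask.

```python
import numpy as np
import pandas as pd

links = pd.DataFrame({
    'actual': ['ACTUAL Stays', 'ACTUAL Stays', 'ACTUAL Exits', 'ACTUAL Exits'],
    'predicted': ['PREDICTED Stays', 'PREDICTED Exits',
                  'PREDICTED Stays', 'PREDICTED Exits'],
    'samples': [1979, 410, 198, 413],
})

# Strip the 'ACTUAL ' / 'PREDICTED ' prefix to recover the class names
actual_cls = links['actual'].str.split(n=1).str[1]
predicted_cls = links['predicted'].str.split(n=1).str[1]

# A link is a correct classification when both class names agree
is_correct = actual_cls == predicted_cls
links['color'] = np.where(is_correct,
                          "rgba(144, 238, 144, 0.8)",   # green
                          "rgba(205, 92, 92, 0.8)")     # red
print(links[['actual', 'predicted', 'color']])
```

The hover text could be built the same way, although the row-wise helper is arguably easier to read for a handful of links.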

Map Node Labels to Integers¶

Map node label columns (actual, predicted) to integers due to Sankey requirements.

In [24]:

node_labels_inds

Out[24]:

{'ACTUAL Stays': 0,
 'ACTUAL Exits': 1,
 'PREDICTED Stays': 2,
 'PREDICTED Exits': 3}

In [25]:

# using replace for multiple columns
df = df.replace({'actual':node_labels_inds, 'predicted':node_labels_inds})
df

Out[25]:

	actual	predicted	samples	norm_samples	color	link_hover_text
0	0	2	1979	0.83	rgba(144, 238, 144, 0.8)	1979 (83%) Stays samples correctly classified as Stays
1	0	3	410	0.17	rgba(205, 92, 92, 0.8)	410 (17%) Stays samples incorrectly classified as Exits
2	1	2	198	0.32	rgba(205, 92, 92, 0.8)	198 (32%) Exits samples incorrectly classified as Stays
3	1	3	413	0.68	rgba(144, 238, 144, 0.8)	413 (68%) Exits samples correctly classified as Exits

In [26]:

# Alternative: assign + apply + lambda
# (dft is assumed to be a copy of df taken before the replace() above)
# dft.assign(actual    = dft.actual.apply(lambda x: node_labels_inds[x]),
#            predicted = dft.predicted.apply(lambda x: node_labels_inds[x]))

In [27]:

# Alternative: assign + map
# dft.assign(actual    = dft.actual.map(node_labels_inds),
#            predicted = dft.predicted.map(node_labels_inds))
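The map-based alternative can be tried in isolation. This is a self-contained sketch with a hand-written dictionary and dataframe (the names label_to_index, links and mapped are illustrative). One difference from replace() worth knowing is that map() turns labels missing from the dictionary into NaN.

```python
import pandas as pd

label_to_index = {'ACTUAL Stays': 0, 'ACTUAL Exits': 1,
                  'PREDICTED Stays': 2, 'PREDICTED Exits': 3}

links = pd.DataFrame({
    'actual': ['ACTUAL Stays', 'ACTUAL Stays', 'ACTUAL Exits', 'ACTUAL Exits'],
    'predicted': ['PREDICTED Stays', 'PREDICTED Exits',
                  'PREDICTED Stays', 'PREDICTED Exits'],
})

# map() looks each label up in the dictionary; unknown labels become NaN
mapped = links.assign(actual=links['actual'].map(label_to_index),
                      predicted=links['predicted'].map(label_to_index))

# Guard against labels that are missing from the dictionary
assert mapped.notna().all().all()
print(mapped)
```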

Bold Printing in Plotly¶

Prepare data for bold printing of some words in Plotly.

Node Labels¶

We want to print class names (2nd word in a string) in bold font.
We will use the HTML <b> tag for that.

In [28]:

node_labels

Out[28]:

['ACTUAL Stays', 'ACTUAL Exits', 'PREDICTED Stays', 'PREDICTED Exits']

In [29]:

node_labels = [f'{ls[0]} <b>{ls[1]}</b>' for ls in [l.split() for l in node_labels]]
print(node_labels)

['ACTUAL <b>Stays</b>', 'ACTUAL <b>Exits</b>', 'PREDICTED <b>Stays</b>', 'PREDICTED <b>Exits</b>']
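The list comprehension above keeps only the second word of each label, which is fine for single-word class names. If class names may contain spaces, a small helper like the following could be used instead (a sketch; bold_class_name is a hypothetical name, not part of metrics_utilities.py).

```python
def bold_class_name(label):
    """Wrap everything after the first word in <b> tags."""
    # maxsplit=1 keeps multi-word class names in one piece
    prefix, class_name = label.split(maxsplit=1)
    return f'{prefix} <b>{class_name}</b>'

print(bold_class_name('ACTUAL Stays'))
print(bold_class_name('PREDICTED no diabetes'))
```

The hover text formatting below relies on fixed word positions as well, so it would need a similar adjustment for multi-word class names.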

Hovering Text¶

Printing class names in bold font.

In [30]:

df['link_hover_text'] = [f'{" ".join(ls[0:2])} <b>{ls[2]}</b> {" ".join(ls[3:-1])} <b>{ls[-1]}</b>' for ls in [l.split() for l in df['link_hover_text']]]
df

Out[30]:

	actual	predicted	samples	norm_samples	color	link_hover_text
0	0	2	1979	0.83	rgba(144, 238, 144, 0.8)	1979 (83%) <b>Stays</b> samples correctly classified as <b>Stays</b>
1	0	3	410	0.17	rgba(205, 92, 92, 0.8)	410 (17%) <b>Stays</b> samples incorrectly classified as <b>Exits</b>
2	1	2	198	0.32	rgba(205, 92, 92, 0.8)	198 (32%) <b>Exits</b> samples incorrectly classified as <b>Stays</b>
3	1	3	413	0.68	rgba(144, 238, 144, 0.8)	413 (68%) <b>Exits</b> samples correctly classified as <b>Exits</b>

Plotting¶

In [31]:

fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 30,
        thickness = 20,
        line = dict(color = "gray", width = 1.0),
        label = node_labels,
        hovertemplate = "%{label} has total %{value:d} samples<extra></extra>"
    ),
    link = dict(
        source = df.actual,
        target = df.predicted,
        value = df.samples,
        color = df.color,
        customdata = df['link_hover_text'],
        hovertemplate = "%{customdata}<extra></extra>"
    )
)])

title = 'Decision Tree'

fig.update_layout(
    # hovermode = 'x',
    title = {
    'text': title,
    'x':0.5,
    },
    # paper_bgcolor = '#51504f',
    font_size = 15,
    # font_color = 'white',
    width = 600,
    height = 500
)

How to interpret our Sankey confusion matrix?¶

To interpret our Sankey confusion matrix, with Stays representing the negative class and Exits representing the positive class, take note of the following key elements:

Nodes: The source nodes ("ACTUAL Stays" and "ACTUAL Exits" ) are positioned on the left, representing the actual class labels. The target nodes ("PREDICTED Stays" and "PREDICTED Exits") are on the right, representing the predicted class labels.
Node Size: The size of each node is proportional to the number of samples belonging to that specific class, offering insights into the class distribution.
Links: The links between nodes show the flow of samples during classification. Wider links indicate a higher number of samples classified correctly (shown in green) or incorrectly (depicted in red) between the respective class labels.
Tooltips: Hovering over the nodes and links provides additional information, displaying the numerical and textual representation of the confusion matrix. This interactive feature enhances the understanding of classification outcomes and allows for a more detailed examination of the model's performance.
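The node sizes can be read directly off the confusion matrix: the ACTUAL nodes correspond to row sums and the PREDICTED nodes to column sums. A minimal sketch with the Decision Tree counts from above (variable names are illustrative):

```python
import numpy as np

cm_example = np.array([[1979, 410],
                       [198, 413]])

# Row sums: number of samples per actual class (left-hand nodes)
actual_totals = cm_example.sum(axis=1)

# Column sums: number of samples per predicted class (right-hand nodes)
predicted_totals = cm_example.sum(axis=0)

print("ACTUAL totals:   ", actual_totals.tolist())
print("PREDICTED totals:", predicted_totals.tolist())
```

For this model there are 2389 actual Stays and 611 actual Exits samples, while 2177 samples are predicted as Stays and 823 as Exits. These are the totals reported when hovering over the nodes.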

Define Function¶

Let's now collect all these together into a function.

In [32]:

def plot_cm_sankey(model_name, y_test, y_pred, target_names=None):
    """ Plot confusion matrix with Sankey diagram 

    Args:
        model_name: name of the model (used as the figure title)
        y_test: true (actual) labels of the test set
        y_pred: predicted labels
        target_names: list of class names (optional)

    Returns:
        Plotly Figure with the Sankey diagram of the confusion matrix
    """ 
    
    # Calculate confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # If class names are not passed (None or empty), create dummy class names
    if target_names is None or len(target_names) == 0:
        target_names = [f'class-{i+1}' for i in range(len(cm))]
    
    # Prepare dataframe with parameters for Sankey
    def prepare_df_for_sankey(cm, target_names):
        # create a dataframe
        df = pd.DataFrame(cm, columns=[f'PREDICTED {s}' for s in target_names], index=[f'ACTUAL {s}' for s in target_names])
        
        # Create list of node labels
        # target nodes = column labels (PREDICTED ...)
        cl = df.columns.values.tolist()
        # source nodes = row (index) labels (ACTUAL ...)
        rl = df.index.values.tolist()
        node_labels = rl + cl
        
        # Create dictionary with indices for node labels
        node_labels_inds = {label:ind for ind, label in enumerate(node_labels)}
        
        # Stack label from column to row, output is Series
        # Reset index to get DataFrame and rename columns
        df = df.stack().reset_index()
        df.rename(columns={0:'samples', 'level_0':'actual', 'level_1':'predicted'}, inplace=True)
        
        """
               actual	       predicted	  samples
        0	ACTUAL Stays	PREDICTED Stays	    1979
        1	ACTUAL Stays	PREDICTED Exits	     410
        2	ACTUAL Exits	PREDICTED Stays	     198
        3	ACTUAL Exits	PREDICTED Exits	     413
        """

        # Normalized confusion matrix
        cmn = np.around(cm / cm.sum(axis=1)[:, np.newaxis], 2)
        # Add a column with normalized values of samples
        df['norm_samples'] = cmn.ravel()
        
        # Helper function to add new columns: color and link_hover_text
        # 'color' - link color based on classification result (correct or incorrect)
        # 'link_hover_text' - text for hovering on connecting links of sankey diagram
        incorrect_red = "rgba(205, 92, 92, 0.8)"
        correct_green = "rgba(144, 238, 144, 0.8)"
        
        def new_columns(row):
            source_1 = ''.join(row.actual.split()[1:])
            target_1 = ''.join(row.predicted.split()[1:])
            # Correct classification
            if source_1 == target_1:
                row['color'] = correct_green
                row['link_hover_text'] = f"{row.samples} ({row.norm_samples:.0%}) {source_1} samples correctly classified as {target_1}"
            # Incorrect classification
            else:
                row['color'] = incorrect_red
                row['link_hover_text'] = f"{row.samples} ({row.norm_samples:.0%}) {source_1} samples incorrectly classified as {target_1}"
            return row

        # Apply "new_columns" function row by row
        df = df.apply(new_columns, axis=1)
        
        # Sankey only takes integers for source and target values,
        # so we need to map node label columns (actual, predicted) to numbers
        # Using replace for multiple columns
        df = df.replace({'actual':node_labels_inds, 'predicted':node_labels_inds})
               
        return df, node_labels
    
    
    # Plotting confusion matrix as Sankey diagram
    # Get dataframe and node labels
    df, node_labels = prepare_df_for_sankey(cm, target_names)
    
    # Prepare for bold printing of some words in Plotly
    node_labels = [f'{ls[0]} <b>{ls[1]}</b>' for ls in [l.split() for l in node_labels]]
    df['link_hover_text'] = [f'{" ".join(ls[0:2])} <b>{ls[2]}</b> {" ".join(ls[3:-1])} <b>{ls[-1]}</b>' for ls in [l.split() for l in df['link_hover_text']]]
    

    fig = go.Figure(data=[go.Sankey(
        node = dict(
            pad = 50,
            thickness = 30,
            line = dict(color = "gray", width = 1.0),
            label = node_labels,
            hovertemplate = "%{label} has total %{value:d} samples<extra></extra>"
        ),
        link = dict(
            source = df.actual,
            target = df.predicted,
            value = df.samples,
            color = df.color,
            customdata = df['link_hover_text'],
            hovertemplate = "%{customdata}<extra></extra>"
        )
    )])
    
    margins = {'l': 25, 'r': 25, 't': 70, 'b': 25}
    
    fig.update_layout(
        title = {
        'text': f'<b>{model_name}</b>',
        'x':0.5,
        },
        font_size = 15,
        width = 625,
        height = 500,
        #paper_bgcolor = '#d3d3d3',
        # paper_bgcolor = 'white',
        # plot_bgcolor = 'black',
        margin = margins,
    )
    
    return fig

Run Function¶

Let's test the function

In [33]:

plot_cm_sankey('Decision Tree', y_test, pred_dt, target_names)

Copy the function to the module metrics_utilities.py and restart the kernel. After re-running all required cells above, run the following cell.

In [34]:

mu.plot_cm_sankey('Decision Tree', y_test, pred_dt, target_names)

Hover over the diagram to get more information about the confusion matrix.

3x3 Confusion Matrix¶

Let's see how this works for a 3x3 confusion matrix.

We will use data from my project T2D-Predictions.

In [35]:

# Actual (True) labels
t2d_y_test = np.load('./data/t2d_y_test.npy')
# Prediction from random forest model
t2d_pred_rf = np.load('./data/t2d_pred_rf.npy')
# Classes
t2d_classes = ['no_diabetes', 'pre_diabetes', 'diabetes']

In [36]:

mu.plot_cm_sankey('Random Forest', t2d_y_test, t2d_pred_rf, t2d_classes)

Hover over the diagram to get more information about the confusion matrix.

NOTE 1:

Depending on the values in the confusion matrix, especially in multi-class classification, Plotly might display the target nodes (PREDICTED labels) in a different order than the source nodes (ACTUAL labels).
Even when nodes are rearranged, every link still connects the correct pair of nodes and its width still matches the corresponding value in the confusion matrix, so the diagram remains an accurate representation of the classification outcomes.

NOTE 2 - for Jupyter Lab users:

Set the appropriate renderer in Jupyter Lab so that Plotly displays figures correctly.
As suggested in the Plotly documentation, you can set the default renderer explicitly to iframe by adding the following lines to your code:

import plotly.io as pio
pio.renderers.default = 'iframe'

This renderer writes figures out as standalone HTML files and then displays iframe elements that reference these files.
They are stored in a subdirectory named iframe_figures.
The files are named after the execution number of the notebook cell that produced the figure.
Storing multiple notebooks that use the iframe renderer in the same directory could result in notebooks overwriting each other's figures.
To avoid this, add the following lines to each notebook to set a different name for the html_directory. In this notebook we named it iframe_figures_n1.

iframe_renderer = pio.renderers['iframe']
iframe_renderer.html_directory='iframe_figures_n1'
