from fairlearn.metrics import (
MetricFrame,
count
)
import numpy as np
import pandas as pd
from jiwer import wer
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import warnings
warnings.simplefilter(action="ignore")
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
In this notebook, we walk through a scenario of assessing a speech-to-text service for fairness-related disparities. For this fairness assessment, we consider various sensitive features, such as native language, sex, and the country where the speaker is located.
For this audit, we will be working with a CSV file stt_testing_data.csv that contains 2138 speech samples. Each row in the dataset represents a person reading a particular passage in English. An automated speech-to-text system is used to generate a transcription from the audio of the person reading the passage.
If you wish to run this notebook with your own speech data, you can use the run_stt.py script provided to query the Microsoft Cognitive Services Speech API. You can also run this notebook with a dataset generated from other speech-to-text systems.
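For reference, a minimal sketch of transcribing a single audio file with the Azure Speech SDK might look like the following. This is only an assumption about how such a script could be structured, not the contents of run_stt.py; speech_key, speech_region, and sample.wav are placeholders.
import azure.cognitiveservices.speech as speechsdk
# Placeholders: supply your own subscription key and region
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")  # hypothetical input file
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once()  # single-shot recognition of the audio file
predicted_text = result.text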
import zipfile
from raiutils.dataset import fetch_dataset
outdirname = 'responsibleai.12.28.21'
zipfilename = outdirname + '.zip'
fetch_dataset('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)
with zipfile.ZipFile(zipfilename, 'r') as unzip:
    unzip.extractall('.')
stt_results_csv = "stt_testing_data.csv"
In this dataset, ground_truth_text
represents the English passage the participant was asked to read. predicted_text
represents the transcription produced by the automated service. In an ideal scenario, the ground_truth_text
would represent the transcription produced by a human transcriber, so we could compare the output of the automated transcription to the one produced by the human transcriber. We will also look at demographic features, such as sex
of the participant and the country
where the participant is located.
df = pd.read_csv(f"{stt_results_csv}",
usecols=["age", "native_language", "sex", "country", "ground_truth_text", "predicted_text"]
)
df.shape
The goal of a fairness assessment is to understand which groups of people may be disproportionately negatively impacted by an AI system and in which ways.
For our fairness assessment, we perform the following tasks:
Identify harms and which groups may be harmed.
Define fairness metrics to quantify harms.
Compare our quantified harms across the relevant groups.
The first step of the fairness assessment is identifying the types of fairness-related harms we expect users of the system to experience. Based on the harms taxonomy in the Fairlearn User Guide, we expect our speech-to-text system to produce quality-of-service harms for its users.
Quality-of-service harms occur when an AI system does not work as well for one person as it does for others, even when no resources or opportunities are withheld.
Several studies have demonstrated that speech-to-text systems achieve different levels of performance depending on the speaker's gender and language dialect. In this assessment, we will explore how the performance of our speech-to-text system differs based on language dialect (proxied by country) and sex for speakers in three English-speaking countries.
sensitive_country = ["country"]
sensitive_country_sex = ["country", "sex"]
countries = ["usa", "uk", "canada"]
filtered_df = df.query(f"country in {countries} and native_language == 'english'")
filtered_df.head()
One challenge for our fairness assessment is the small group sample sizes. Our filtered dataset consists primarily of English speakers in the USA, so we expect higher uncertainty for our metrics on speakers from the other two countries. The smaller sample sizes for UK and Canadian speakers mean we may not be able to find significant differences once we also account for sex.
display(filtered_df.groupby(["country"])["sex"].count())
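To see how thin the intersectional groups get, we can also break the counts down by both country and sex (a quick extra check, not required for the rest of the notebook):
display(filtered_df.groupby(["country", "sex"])["sex"].count())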
To measure differences in performance, we will be looking at the word_error_rate. The word_error_rate represents the fraction of words that are transcribed incorrectly compared to a ground-truth text. A higher word_error_rate indicates that the system performs worse for a particular group.
Ideally, we would compare against a human transcription, since what the speaker actually said may differ from the ground-truth text they were asked to read.
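As a quick illustration of the metric (a toy example, not drawn from the dataset), jiwer computes the fraction of word-level edits needed to turn the predicted text into the reference:
# One substituted word out of four reference words gives a WER of 0.25
wer("the quick brown fox", "the quick brown dog")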
DISPARITY_BASE = 0.5
def word_error_rate(y_true, y_pred):
    # Word error rate between the ground-truth text and the predicted transcription
    return wer(str(y_true), str(y_pred))

def wer_abs_disparity(y_true, y_pred, disparity=DISPARITY_BASE):
    # Absolute difference between the group's WER and the baseline
    return word_error_rate(y_true, y_pred) - disparity

def wer_rel_disparity(y_true, y_pred, disparity=DISPARITY_BASE):
    # Disparity relative to the baseline
    return wer_abs_disparity(y_true, y_pred, disparity) / disparity
These disparity metrics measure the WER as a deviation from a fixed baseline (DISPARITY_BASE); an alternative would be to compute the maximal difference between groups.
fairness_metrics = {
"count": count,
"word_error_rate": word_error_rate,
"word_error_rate_abs_disparity": wer_abs_disparity,
"word_error_rate_rel_disparity": wer_rel_disparity
}
In the final part of our fairness assessment, we use the MetricFrame object in the fairlearn package to compare our system's performance across our sensitive features.
To instantiate a MetricFrame, we pass in four parameters:
metrics: the fairness_metrics to evaluate each group on.
y_true: the ground_truth_text.
y_pred: the predicted_text.
sensitive_features: our groups for the fairness assessment.
For our first analysis, we look at the system's performance with respect to country.
metricframe_country = MetricFrame(
metrics=fairness_metrics,
y_true=filtered_df.loc[:, "ground_truth_text"],
y_pred=filtered_df.loc[:, "predicted_text"],
sensitive_features=filtered_df.loc[:, sensitive_country]
)
Using the MetricFrame, we can easily compute the word_error_rate differences between our three countries.
display(metricframe_country.by_group[["count", "word_error_rate"]])
display(metricframe_country.difference())
We see the maximal word_error_rate difference (between UK and Canada) is 0.05. Since the MetricFrame is built on top of the Pandas DataFrame object, we can take advantage of Pandas's plotting capabilities to visualize the word_error_rate by country.
metricframe_country.by_group.sort_values(by="word_error_rate", ascending=False).plot(kind="bar", y="word_error_rate", ylabel="Word Error Rate", title="Word Error Rate by Country", figsize=[12,8])
Next, let's explore how our system performs with respect to the sex of the speaker. Similar to what we did for country, we create another MetricFrame, this time passing in the sex column as our sensitive_features.
metricframe_sex = MetricFrame(
metrics=fairness_metrics,
y_true=filtered_df.loc[:, "ground_truth_text"],
y_pred=filtered_df.loc[:, "predicted_text"],
sensitive_features=filtered_df.loc[:, "sex"]
)
display(metricframe_sex.by_group[["count", "word_error_rate"]])
display(metricframe_sex.difference())
In our sex-based analysis, we see there is a 0.06 difference in the WER between male and female speakers. If we added uncertainty quantification, such as confidence intervals, to our analysis, we could perform statistical tests to determine whether the difference is statistically significant.
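For instance, a minimal bootstrap sketch along those lines might look like the following. This is an assumption about how such an interval could be estimated, not part of the original analysis; it reuses filtered_df and jiwer's wer from earlier cells and computes the group-level WER over lists of sentences rather than through MetricFrame.
rng = np.random.default_rng(0)

def group_wer(frame):
    # jiwer's wer also accepts lists of reference and hypothesis strings
    return wer(frame["ground_truth_text"].astype(str).tolist(),
               frame["predicted_text"].astype(str).tolist())

def bootstrap_wer_difference(frame, n_boot=1000):
    diffs = []
    for _ in range(n_boot):
        # Resample rows with replacement and recompute the male-female WER gap
        sample = frame.sample(frac=1.0, replace=True,
                              random_state=int(rng.integers(0, 2**31 - 1)))
        diffs.append(group_wer(sample.query("sex == 'male'"))
                     - group_wer(sample.query("sex == 'female'")))
    # 95% percentile interval for the male-female WER difference
    return np.percentile(diffs, [2.5, 97.5])

# Example usage: bootstrap_wer_difference(filtered_df)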
metricframe_sex.by_group.sort_values(by="word_error_rate", ascending=False).plot(kind="bar", y="word_error_rate", ylabel="Word Error Rate", title="Word Error Rate by Sex", figsize=[12,8])
One key aspect to remember when performing a fairness analysis is to explore the intersection of different groups. For this final analysis, we look at groups at the intersection of country and sex.
In particular, we are interested in the word_error_rate difference by sex for each country. That is, we want to compare the WER difference between Canada male and Canada female to the WER difference between males and females in the other two countries.
When we instantiate our MetricFrame this time, we pass in the country column as a control_feature. Now, when we call the difference method on our MetricFrame, it will compute the WER difference between male and female within each country.
# Make country a control feature
metricframe_country_sex_control = MetricFrame(
metrics=fairness_metrics,
y_true=filtered_df.loc[:, "ground_truth_text"],
y_pred=filtered_df.loc[:, "predicted_text"],
sensitive_features=filtered_df.loc[:, "sex"],
control_features=filtered_df.loc[:, "country"]
)
display(metricframe_country_sex_control.difference())
If we call the by_group attribute, we get a MultiIndex DataFrame showing the count and word_error_rate for each intersectional group.
display(metricframe_country_sex_control.by_group[["count", "word_error_rate"]])
Now, let's explore our word_error_rate disparity by sex within each country. Plotting the absolute word_error_rates for each intersectional group shows us that the disparity between UK male and UK female is noticeably larger than in the other countries.
group_metrics = metricframe_country_sex_control.by_group[["count", "word_error_rate"]]
group_metrics["word_error_rate"].unstack(level=-1).plot(
kind="bar",
ylabel="Word Error Rate",
title="Word Error Rate by Country and Sex")
plt.legend(bbox_to_anchor=(1.3,1))
def plot_controlled_features(multiindexframe, title, xaxis, yaxis, order):
    """
    Helper function to plot the word error rate by sex within each country.
    """
    plt.figure(figsize=[12, 8])
    disagg_metrics = multiindexframe["word_error_rate"].unstack(level=0).loc[:, order].to_dict()
    male_scatter = []
    female_scatter = []
    countries = disagg_metrics.keys()
    for country, sex_wer in disagg_metrics.items():
        male_point, female_point = sex_wer.get("male"), sex_wer.get("female")
        # Dashed line connecting the male and female points for each country
        plt.vlines(country, female_point, male_point, linestyles="dashed", alpha=0.45)
        # Pair the x-axis (country) with each point
        male_scatter.append(male_point)
        female_scatter.append(female_point)
    plt.scatter(countries, male_scatter, marker="^", color="b", label="Male")
    plt.scatter(countries, female_scatter, marker="s", color="r", label="Female")
    plt.title(title)
    plt.legend(bbox_to_anchor=(1, 1))
    plt.xlabel(xaxis)
    plt.ylabel(yaxis)
We can also visualize the disparity by sex for each country. From these plots, we see the difference between UK male and UK female is ~0.09. This is larger than the disparity between US male and US female (0.06) and the disparity between Canada male and Canada female (> 0.01).
plot_controlled_features(group_metrics,
"Word Error Rate by Country and Sex",
"Country",
"Word Error Rate",
order=["uk", "usa", "canada"])
metricframe_country_sex_control.difference().sort_values(by="word_error_rate", ascending=False).plot(
kind="bar",
y="word_error_rate",
title="Word Error Rate between Sex by Country",
figsize=[12,8])
With this fairness assessment, we explored how country and sex affect the quality of a speech-to-text transcription in three English-speaking countries. Through an intersectional analysis, we found a higher quality-of-service disparity between UK male and UK female speakers than between males and females of the other countries.