fuzzymatcher examples

In the most basic usage, the user provides fuzzymatcher with two pandas dataframes, indicating which columns to join on.

The central output of fuzzymatcher is the link_table.

For each record in the left table, the link table includes one or more possible matching records from the right table.

The user can then inspect the link table and decide which matches to retain, e.g. by choosing a score threshold ( match_score > chosen_threshold ) or just choosing the best match ( match_rank == 1 )

In [ ]:
import logging
logging.basicConfig(level=logging.DEBUG)
In [ ]:
import fuzzymatcher
import pandas as pd

df_left = pd.read_csv("tests/data/left_1.csv")
df_left
In [ ]:
df_right = pd.read_csv("tests/data/right_1.csv")
df_right
In [ ]:
# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# Note that if left_id_col or right_id_col are admitted a unique id will be autogenerated
fuzzymatcher.link_table(df_left, df_right, left_on, right_on, left_id_col = "id", right_id_col = "id")

Basic usage - fuzzy_left_join

A second option is to use fuzzy_left_join, which automatically links the two dataframes based on the highest-scoring match.

In [ ]:
import fuzzymatcher
import pandas as pd

df_left = pd.read_csv("tests/data/left_1.csv")
df_right = pd.read_csv("tests/data/right_1.csv")
left_on = ["fname", "lname",  "dob"]
right_on = ["name", "surname", "date"]

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

Basic example with real data

Matching the names of Local Authorities provided by Office for National Statistics with the names provided by Ordnance Survey

We would usually join this data on the Local Authority District (LAD) Codes (e.g. E06000001 = Hartlepool), but sometimes these are unavailable. In this example, we fuzzy match on the name, but provide the LAD code to demonstate it has worked.

In [ ]:
ons = pd.read_csv("tests/data/las_ons.csv")
os = pd.read_csv("tests/data/las_os.csv")

df_joined = fuzzymatcher.fuzzy_left_join(ons, os, left_on = "lad16nm", right_on = "name")
rename = {"lad16cd": "ons_code", "code": "os_code", "lad16nm": "ons_name", "name": "os_name"}
df_joined = df_joined.rename(columns=rename)
col_order = ["best_match_score", "ons_name", "os_name", "ons_code", "os_code"]
df_joined[col_order].sample(5)

We can get a sense of match quality by measuring how often the fuzzy matcher got it right:

In [ ]:
num_records = len(df_joined)
correct_binary = (df_joined["ons_code"] == df_joined["os_code"])
perc_correct = correct_binary.sum()/num_records

"The percentage of codes which were correctly matched was {:,.1f}%".format(perc_correct*100)

Advanced usage - configuring the matcher

fuzzymatcher uses a number of components, each one of which can be re-written or adapted by the user:

  • data_preprocessor: Responsible for normalising strings, removing punctuation etc.
  • datagetter: Responsible for finding a list of possible matches for each df_left record in df_right
  • scorer: Responsible for computing a match score, given a record from df_left and df_right respectively

The main link_table and fuzzy_left_join convenience functions use these components under the hood. See here for how this work.

This section provides a few examples of how an advanced user can compose these components to create a custom matcher

Example 1: Replacing the default sqlite datagetter with the cartesian datagetter

In [ ]:
from fuzzymatcher.data_getter_cartesian import DataGetterCartesian
from fuzzymatcher.matcher import Matcher

dg = DataGetterCartesian()

m = Matcher(data_getter = dg)

df_left = pd.read_csv("tests/data/left_3.csv")
df_right = pd.read_csv("tests/data/right_3.csv")

on = ["first_name", "surname", "dob", "city"]

m.add_data(df_left, df_right, on, on)

m.match_all()
lt = m.get_formatted_link_table()
print("Length of Cartesian join table: {:,.0f}".format(len(lt))) # Note, because df_left and df_right are 100 records each, this table is 10,000 records long
lt.head()

The Cartesian matcher considers more potential matches, but its performance is considerably worse

Performance

In [ ]:
df_left = pd.read_csv("tests/data/left_4.csv")
# df_left = df_left[:1000]
df_right = pd.read_csv("tests/data/right_4.csv")
# df_right = df_right[:1000]
on = ["first_name", "surname", "dob", "city"]

lt = fuzzymatcher.link_table(df_left, df_right, on, on)
lt.head(5)
In [ ]:
def link_table_percentage_correct(link_table):
    """
    In this test dataset, we know what the link should be
    Therefore we can compute a measure of performance
    """
    lt = link_table.copy()
    lt = lt[lt["match_rank"] == 1]
    lt["__id_left"] = lt["__id_left"].str.replace("_left", "")
    lt["__id_right"] = lt["__id_right"].str.replace("_right", "")
    lt["link_correct"] = (lt["__id_left"] == lt["__id_right"])

    return lt["link_correct"].sum()/len(lt)

"Percent matches correct: {:,.1f}%".format(link_table_percentage_correct(lt)*100)

Note that in this particular case we can improve the match rate by including initials and allowing inversion of first name and surname

(Within a field, the matcher pays no attention to token order)

In [ ]:
df_left["full_name"] = df_left["first_name"] + " " + df_left["surname"]
df_right["full_name"] = df_right["first_name"] + " " + df_right["surname"]
df_left["initials"] = df_left["first_name"].str[0] + df_left["surname"].str[0]
df_right["initials"] = df_right["first_name"].str[0] + df_right["surname"].str[0]

on = ["full_name", "initials", "dob", "city"]

lt = fuzzymatcher.link_table(df_left, df_right, on, on)
lt.head(5)
In [ ]:
"Percent matches correct: {:,.1f}%".format(link_table_percentage_correct(lt)*100)