fuzzymatcher
examples¶link_table
¶In the most basic usage, the user provides fuzzymatcher
with two pandas dataframes, indicating which columns to join on.
The central output of fuzzymatcher
is the link_table
.
For each record in the left table, the link table includes one or more possible matching records from the right table.
The user can then inspect the link table and decide which matches to retain, e.g. by choosing a score threshold ( match_score > chosen_threshold
) or just choosing the best match ( match_rank == 1
)
import logging
logging.basicConfig(level=logging.DEBUG)
import fuzzymatcher
import pandas as pd
df_left = pd.read_csv("tests/data/left_1.csv")
df_left
df_right = pd.read_csv("tests/data/right_1.csv")
df_right
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# Note that if left_id_col or right_id_col are admitted a unique id will be autogenerated
fuzzymatcher.link_table(df_left, df_right, left_on, right_on, left_id_col = "id", right_id_col = "id")
fuzzy_left_join
¶A second option is to use fuzzy_left_join
, which automatically links the two dataframes based on the highest-scoring match.
import fuzzymatcher
import pandas as pd
df_left = pd.read_csv("tests/data/left_1.csv")
df_right = pd.read_csv("tests/data/right_1.csv")
left_on = ["fname", "lname", "dob"]
right_on = ["name", "surname", "date"]
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
We would usually join this data on the Local Authority District (LAD) Codes (e.g. E06000001 = Hartlepool), but sometimes these are unavailable. In this example, we fuzzy match on the name, but provide the LAD code to demonstate it has worked.
ons = pd.read_csv("tests/data/las_ons.csv")
os = pd.read_csv("tests/data/las_os.csv")
df_joined = fuzzymatcher.fuzzy_left_join(ons, os, left_on = "lad16nm", right_on = "name")
rename = {"lad16cd": "ons_code", "code": "os_code", "lad16nm": "ons_name", "name": "os_name"}
df_joined = df_joined.rename(columns=rename)
col_order = ["best_match_score", "ons_name", "os_name", "ons_code", "os_code"]
df_joined[col_order].sample(5)
We can get a sense of match quality by measuring how often the fuzzy matcher got it right:
num_records = len(df_joined)
correct_binary = (df_joined["ons_code"] == df_joined["os_code"])
perc_correct = correct_binary.sum()/num_records
"The percentage of codes which were correctly matched was {:,.1f}%".format(perc_correct*100)
fuzzymatcher
uses a number of components, each one of which can be re-written or adapted by the user:
data_preprocessor
: Responsible for normalising strings, removing punctuation etc.datagetter
: Responsible for finding a list of possible matches for each df_left record in df_rightscorer
: Responsible for computing a match score, given a record from df_left and df_right respectivelyThe main link_table
and fuzzy_left_join
convenience functions use these components under the hood. See here for how this work.
This section provides a few examples of how an advanced user can compose these components to create a custom matcher
from fuzzymatcher.data_getter_cartesian import DataGetterCartesian
from fuzzymatcher.matcher import Matcher
dg = DataGetterCartesian()
m = Matcher(data_getter = dg)
df_left = pd.read_csv("tests/data/left_3.csv")
df_right = pd.read_csv("tests/data/right_3.csv")
on = ["first_name", "surname", "dob", "city"]
m.add_data(df_left, df_right, on, on)
m.match_all()
lt = m.get_formatted_link_table()
print("Length of Cartesian join table: {:,.0f}".format(len(lt))) # Note, because df_left and df_right are 100 records each, this table is 10,000 records long
lt.head()
The Cartesian matcher considers more potential matches, but its performance is considerably worse
df_left = pd.read_csv("tests/data/left_4.csv")
# df_left = df_left[:1000]
df_right = pd.read_csv("tests/data/right_4.csv")
# df_right = df_right[:1000]
on = ["first_name", "surname", "dob", "city"]
lt = fuzzymatcher.link_table(df_left, df_right, on, on)
lt.head(5)
def link_table_percentage_correct(link_table):
"""
In this test dataset, we know what the link should be
Therefore we can compute a measure of performance
"""
lt = link_table.copy()
lt = lt[lt["match_rank"] == 1]
lt["__id_left"] = lt["__id_left"].str.replace("_left", "")
lt["__id_right"] = lt["__id_right"].str.replace("_right", "")
lt["link_correct"] = (lt["__id_left"] == lt["__id_right"])
return lt["link_correct"].sum()/len(lt)
"Percent matches correct: {:,.1f}%".format(link_table_percentage_correct(lt)*100)
(Within a field, the matcher pays no attention to token order)
df_left["full_name"] = df_left["first_name"] + " " + df_left["surname"]
df_right["full_name"] = df_right["first_name"] + " " + df_right["surname"]
df_left["initials"] = df_left["first_name"].str[0] + df_left["surname"].str[0]
df_right["initials"] = df_right["first_name"].str[0] + df_right["surname"].str[0]
on = ["full_name", "initials", "dob", "city"]
lt = fuzzymatcher.link_table(df_left, df_right, on, on)
lt.head(5)
"Percent matches correct: {:,.1f}%".format(link_table_percentage_correct(lt)*100)