In this demo we de-duplicate a small dataset.
The purpose is to demonstrate core Splink functionality as quickly as possible.
A more comprehensive, end to end example is provided elsewhere
The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options
import pandas as pd
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100
import altair as alt
alt.renderers.enable('mimetype')
RendererRegistry.enable('mimetype')
import logging
logging.basicConfig() # Means logs will print in Jupyter Lab
# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)
from utility_functions.demo_utils import get_spark
spark = get_spark() # See utility_functions/demo_utils.py for how to set up Spark
Note that the group
column is the truth - rows which share the same value refer to the same person. In the real world, we wouldn't have this field!
df = spark.read.parquet("data/fake_1000.parquet")
df.show(5)
+---------+----------+-------+----------+------+--------------------+-----+ |unique_id|first_name|surname| dob| city| email|group| +---------+----------+-------+----------+------+--------------------+-----+ | 0| Julia | null|2015-10-29|London| hannah88@powers.com| 0| | 1| Julia | Taylor|2015-07-31|London| hannah88@powers.com| 0| | 2| Julia | Taylor|2016-01-27|London| hannah88@powers.com| 0| | 3| Julia | Taylor|2015-10-29| null| hannah88opowersc@m| 0| | 4| oNah| Watson|2008-03-23|Bolton|matthew78@ballard...| 1| +---------+----------+-------+----------+------+--------------------+-----+ only showing top 5 rows
settings
object¶Most of splink
configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex.
💥 We provide an tool for helping to author valid settings dictionaries, which includes tooltips and autocomplete, which you can find here.
Customisation overrides default values built into splink. For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.
To help with authoring and validation of the settings dictionary, we have written a json schema, which can be found here.
settings = {
"link_type": "dedupe_only",
"blocking_rules": [
"l.surname = r.surname"
],
"comparison_columns": [
{
"col_name": "first_name",
"num_levels": 3,
"term_frequency_adjustments": True
},
{
"col_name": "dob"
},
{
"col_name": "city"
},
{
"col_name": "email"
}
],
"additional_columns_to_retain": ["group"],
"em_convergence": 0.01
}
In words, this setting dictionary says:
link_only
, or link_and_dedupe
)first_name
, surname
, dob
, city
and email
columns to compute a match score.first_name
and surname
, string comparisons will have three levels:first_name
and surname
columnsgroup
column in the results even though this is not used as part of comparisons. This is a labelled dataset and group
contains the true match - i.e. where group matches, the records pertain to the same personColumns are assumed to be strings by default. See the 'comparison vector settings' notebook for details of configuration options.
from splink import Splink
linker = Splink(settings, df, spark)
df_e = linker.get_scored_comparisons()
INFO:splink.iterate:Iteration 0 complete INFO:splink.model:The maximum change in parameters was 0.17303137779235844 for key first_name, level 0 INFO:splink.iterate:Iteration 1 complete INFO:splink.model:The maximum change in parameters was 0.059999048709869385 for key city, level 1 INFO:splink.iterate:Iteration 2 complete INFO:splink.model:The maximum change in parameters was 0.02698001265525818 for key email, level 0 INFO:splink.iterate:Iteration 3 complete INFO:splink.model:The maximum change in parameters was 0.020812273025512695 for key email, level 1 INFO:splink.iterate:Iteration 4 complete INFO:splink.model:The maximum change in parameters was 0.014915108680725098 for key email, level 0 INFO:splink.iterate:Iteration 5 complete INFO:splink.model:The maximum change in parameters was 0.010272204875946045 for key email, level 0 INFO:splink.iterate:Iteration 6 complete INFO:splink.model:The maximum change in parameters was 0.007032573223114014 for key email, level 1 INFO:splink.iterate:EM algorithm has converged
# Inspect main dataframe that contains the match scores
cols_to_inspect = ["match_probability","unique_id_l","unique_id_r","group_l", "group_r", "first_name_l","first_name_r","dob_l","dob_r","city_l","city_r","email_l","email_r",]
df_e.toPandas()[cols_to_inspect].sort_values(["unique_id_l", "unique_id_r"]).head(10)
match_probability | unique_id_l | unique_id_r | group_l | group_r | first_name_l | first_name_r | dob_l | dob_r | city_l | city_r | email_l | email_r | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | 0.999783 | 1 | 2 | 0 | 0 | Julia | Julia | 2015-07-31 | 2016-01-27 | London | London | hannah88@powers.com | hannah88@powers.com |
38 | 0.500607 | 1 | 3 | 0 | 0 | Julia | Julia | 2015-07-31 | 2015-10-29 | London | None | hannah88@powers.com | hannah88opowersc@m |
37 | 0.035330 | 1 | 89 | 0 | 18 | Julia | Chirla | 2015-07-31 | 2006-06-28 | London | London | hannah88@powers.com | mbrooks@booker.com |
36 | 0.035330 | 1 | 142 | 0 | 26 | Julia | Harry | 2015-07-31 | 2017-11-24 | London | London | hannah88@powers.com | coltonray@lee.com |
35 | 0.035330 | 1 | 148 | 0 | 26 | Julia | Harry | 2015-07-31 | 2017-09-01 | London | London | hannah88@powers.com | coltonray@lee.com |
34 | 0.202650 | 1 | 362 | 0 | 62 | Julia | None | 2015-07-31 | 1989-07-25 | London | London | hannah88@powers.com | wagnershane@landry.com |
33 | 0.035330 | 1 | 363 | 0 | 62 | Julia | Nancy | 2015-07-31 | 1989-07-25 | London | London | hannah88@powers.com | wagnershane@landry.com |
32 | 0.035330 | 1 | 364 | 0 | 62 | Julia | Nancy | 2015-07-31 | 1989-07-25 | London | London | hannah88@powers.com | wagnershane@landry.com |
31 | 0.035330 | 1 | 365 | 0 | 62 | Julia | Nancy | 2015-07-31 | 1989-08-19 | London | London | hannah88@powers.com | wagnershane@landry.com |
30 | 0.035330 | 1 | 367 | 0 | 62 | Julia | Nacy | 2015-07-31 | 1989-07-25 | London | London | hannah88@powers.com | wagnershane@landry.com |
The model
property of the linker
is an object that contains a lot of diagnostic information about how the match probability was computed. The following cells demonstrate some of its functionality
model = linker.model
model.bayes_factor_chart()
# If charts aren't displaying correctly in your notebook, you can write them to a file (by default splink_charts.html)
model.all_charts_write_html_file(filename="splink_charts.html", overwrite=True)
You can also generate a report which explains how the match probability was computed for an individual comparison row.
Note that you need to convert the row to a dictionary for this to work
from graphframes import GraphFrame
df_e.createOrReplaceTempView("df_e")
sql = """
select unique_id_l as id
from df_e
union
select unique_id_r as id
from df_e
"""
nodes = spark.sql(sql)
sql = """
select
unique_id_l as src,
unique_id_r as dst,
tf_adjusted_match_prob
from df_e
where tf_adjusted_match_prob > 0.99
"""
edges = spark.sql(sql)
g = GraphFrame(nodes, edges)
cc = g.connectedComponents()
cc.createOrReplaceTempView("cc")
df.createOrReplaceTempView("df")
sql = """
select cc.component as estimated_group, df.*
from cc
left join
df
on cc.id = df.unique_id
order by group, estimated_group
"""
results = spark.sql(sql)
results.toPandas().head(30)
estimated_group | unique_id | first_name | surname | dob | city | group | ||
---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | Julia | Taylor | 2016-01-27 | London | hannah88@powers.com | 0 |
1 | 1 | 1 | Julia | Taylor | 2015-07-31 | London | hannah88@powers.com | 0 |
2 | 3 | 3 | Julia | Taylor | 2015-10-29 | None | hannah88opowersc@m | 0 |
3 | 4 | 7 | Noah | Watson | 2008-02-05 | tolon | matthew78@ballard-mcdonald.net | 1 |
4 | 4 | 4 | oNah | Watson | 2008-03-23 | Bolton | matthew78@ballard-mcdonald.net | 1 |
5 | 4 | 5 | Noah | Watson | 2008-03-23 | Bolton | matthew78@ballard-mcdonald.net | 1 |
6 | 6 | 8 | Watson | Noah | 2008-06-15 | Bolton | matthew78@ballard-mcdonald.net | 1 |
7 | 6 | 6 | Watson | Noah | 2008-03-23 | None | matthew78@ballard-mcdonald.net | 1 |
8 | 9 | 9 | Noah | Watson | 2008-01-19 | Bolton | None | 1 |
9 | 10 | 10 | Watson | Noah | 2008-03-23 | Bolton | matthbw78eallard-mcdonald.net | 1 |
10 | 11 | 11 | Watson | Noah | 2008-01-21 | Bolno | matthea78@bwllar-mcdonald.net | 1 |
11 | 13 | 14 | Moly | Bell | 2002-01-05 | Peterborough | lricsaunders@sievaadams.info | 2 |
12 | 13 | 13 | Molly | Bell | 2002-01-05 | Peterborough | None | 2 |
13 | 15 | 15 | Alexander | Amelia | 1983-05-19 | Glasgow | ic-mpbell@allealewis.org | 3 |
14 | 16 | 19 | lmeAi | Alexander | 1983-05-19 | Glasgow | icampbell@allen-lewis.org | 3 |
15 | 16 | 16 | Amelia | Alexander | 1983-05-19 | Glaogw | None | 3 |
16 | 16 | 18 | Amelia | Alexander | 1983-05-19 | Glasgow | icampbell@allen-lewis.org | 3 |
17 | 17 | 17 | Amelia | Alexander | 1983-04-30 | Glasgow | icampbeal@lllen-lews.org | 3 |
18 | 21 | 21 | Oliver | Connolly | 1972-03-08 | Plymouth | None | 4 |
19 | 22 | 22 | None | Connolly | 1972-03-08 | Plyohtu | derekwilliams@norris.com | 4 |
20 | 23 | 26 | Ollie | Thompson | 1996-03-22 | Leeds | jefferyduke@brown-alazar.org | 5 |
21 | 23 | 23 | None | Thompson | 1996-03-22 | Leeds | jefferyduke@brown-salazar.org | 5 |
22 | 25 | 25 | None | Thompson | 1995-12-18 | Leeds | None | 5 |
23 | 29 | 29 | Harris | Matilda | 1983-04-30 | London | patricia47@davis.com | 6 |
24 | 30 | 30 | Matilda | Harris | 1983-03-03 | London | patricia47@davis.com | 6 |
25 | 31 | 31 | Harris | Matilda | 1983-05-24 | London | None | 6 |
26 | 32 | 32 | Baxter | Aria | 1992-09-07 | London | christineshepherd@allen.com | 7 |
27 | 32 | 34 | Baxter | Aria | 1992-09-30 | London | christineshepherd@allen.com | 7 |
28 | 33 | 33 | Aria | Baxter | 1992-09-07 | London | christineshepherd@allen.com | 7 |
29 | 33 | 36 | Aria | Baxter | 1992-09-07 | London | christineshepherd@allen.com | 7 |