Quickstart deduplication demo¶

In this demo we de-duplicate a small dataset.

The purpose is to demonstrate core Splink functionality as quickly as possible.

A more comprehensive, end to end example is provided elsewhere

Step 1: Imports and setup¶

The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options

In [1]:

import pandas as pd 
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100
import altair as alt
alt.renderers.enable('mimetype')

Out[1]:

RendererRegistry.enable('mimetype')

In [2]:

import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

In [3]:

from utility_functions.demo_utils import get_spark
spark = get_spark() # See utility_functions/demo_utils.py for how to set up Spark

Step 2: Read in data¶

Note that the group column is the truth - rows which share the same value refer to the same person. In the real world, we wouldn't have this field!

In [4]:

df = spark.read.parquet("data/fake_1000.parquet")
df.show(5)

+---------+----------+-------+----------+------+--------------------+-----+
|unique_id|first_name|surname|       dob|  city|               email|group|
+---------+----------+-------+----------+------+--------------------+-----+
|        0|    Julia |   null|2015-10-29|London| hannah88@powers.com|    0|
|        1|    Julia | Taylor|2015-07-31|London| hannah88@powers.com|    0|
|        2|    Julia | Taylor|2016-01-27|London| hannah88@powers.com|    0|
|        3|    Julia | Taylor|2015-10-29|  null|  hannah88opowersc@m|    0|
|        4|      oNah| Watson|2008-03-23|Bolton|matthew78@ballard...|    1|
+---------+----------+-------+----------+------+--------------------+-----+
only showing top 5 rows

Step 3: Configure splink using the `settings` object¶

Most of splink configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex.

💥 We provide an tool for helping to author valid settings dictionaries, which includes tooltips and autocomplete, which you can find here.

Customisation overrides default values built into splink. For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.

To help with authoring and validation of the settings dictionary, we have written a json schema, which can be found here.

In [5]:

settings = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.surname = r.surname"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city"
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}

In words, this setting dictionary says:

We are performing a deduplication task (the other options are link_only, or link_and_dedupe)
We are going generate comparisons subject to the blocking rules contained in the specified array
When comparing records, we will use information from the first_name, surname, dob, city and email columns to compute a match score.
For first_name and surname, string comparisons will have three levels:
- Level 2: Strings are (almost) exactly the same
- Level 1: Strings are similar
- Level 0: No match
We will make adjustments for term frequencies on the first_name and surname columns
We will retain the group column in the results even though this is not used as part of comparisons. This is a labelled dataset and group contains the true match - i.e. where group matches, the records pertain to the same person
Consider the algorithm to have converged when no parameter changes by more than 0.01 between iterations

Step 4: Estimate match scores using the Expectation Maximisation algorithm¶

Columns are assumed to be strings by default. See the 'comparison vector settings' notebook for details of configuration options.

In [6]:

from splink import Splink

linker = Splink(settings, df, spark)
df_e = linker.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.17303137779235844 for key first_name, level 0
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.059999048709869385 for key city, level 1
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.02698001265525818 for key email, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.020812273025512695 for key email, level 1
INFO:splink.iterate:Iteration 4 complete
INFO:splink.model:The maximum change in parameters was 0.014915108680725098 for key email, level 0
INFO:splink.iterate:Iteration 5 complete
INFO:splink.model:The maximum change in parameters was 0.010272204875946045 for key email, level 0
INFO:splink.iterate:Iteration 6 complete
INFO:splink.model:The maximum change in parameters was 0.007032573223114014 for key email, level 1
INFO:splink.iterate:EM algorithm has converged

Step 5: Inspect results¶

In [7]:

# Inspect main dataframe that contains the match scores
cols_to_inspect = ["match_probability","unique_id_l","unique_id_r","group_l", "group_r", "first_name_l","first_name_r","dob_l","dob_r","city_l","city_r","email_l","email_r",]
df_e.toPandas()[cols_to_inspect].sort_values(["unique_id_l", "unique_id_r"]).head(10)

Out[7]:

	match_probability	unique_id_l	unique_id_r	group_r	first_name_l	first_name_r	dob_l	dob_r	city_l	city_r	email_l	email_r
39	0.999783	1	2	0	Julia	Julia	2015-07-31	2016-01-27	London	London	hannah88@powers.com	hannah88@powers.com
38	0.500607	1	3	0	Julia	Julia	2015-07-31	2015-10-29	London	None	hannah88@powers.com	hannah88opowersc@m
37	0.035330	1	89	18	Julia	Chirla	2015-07-31	2006-06-28	London	London	hannah88@powers.com	mbrooks@booker.com
36	0.035330	1	142	26	Julia	Harry	2015-07-31	2017-11-24	London	London	hannah88@powers.com	coltonray@lee.com
35	0.035330	1	148	26	Julia	Harry	2015-07-31	2017-09-01	London	London	hannah88@powers.com	coltonray@lee.com
34	0.202650	1	362	62	Julia	None	2015-07-31	1989-07-25	London	London	hannah88@powers.com	wagnershane@landry.com
33	0.035330	1	363	62	Julia	Nancy	2015-07-31	1989-07-25	London	London	hannah88@powers.com	wagnershane@landry.com
32	0.035330	1	364	62	Julia	Nancy	2015-07-31	1989-07-25	London	London	hannah88@powers.com	wagnershane@landry.com
31	0.035330	1	365	62	Julia	Nancy	2015-07-31	1989-08-19	London	London	hannah88@powers.com	wagnershane@landry.com
30	0.035330	1	367	62	Julia	Nacy	2015-07-31	1989-07-25	London	London	hannah88@powers.com	wagnershane@landry.com

The model property of the linker is an object that contains a lot of diagnostic information about how the match probability was computed. The following cells demonstrate some of its functionality

In [8]:

model = linker.model
model.bayes_factor_chart()

Out[8]:

In [9]:

# If charts aren't displaying correctly in your notebook, you can write them to a file (by default splink_charts.html)
model.all_charts_write_html_file(filename="splink_charts.html", overwrite=True)

You can also generate a report which explains how the match probability was computed for an individual comparison row.

Note that you need to convert the row to a dictionary for this to work

Use graphframes to resolve links into groups¶

In [10]:

from graphframes import GraphFrame


df_e.createOrReplaceTempView("df_e")
sql = """
select unique_id_l as id
from df_e
union
select unique_id_r as id
from df_e
"""
nodes = spark.sql(sql)

sql = """
select
unique_id_l as src,
unique_id_r as dst,
tf_adjusted_match_prob
from df_e
where tf_adjusted_match_prob > 0.99
"""
edges = spark.sql(sql)

g = GraphFrame(nodes, edges)

cc = g.connectedComponents()

cc.createOrReplaceTempView("cc")
df.createOrReplaceTempView("df")
sql = """
select cc.component as estimated_group, df.*
from cc
left join
df 
on cc.id = df.unique_id
order by group, estimated_group
"""
results = spark.sql(sql)
results.toPandas().head(30)

Out[10]:

	estimated_group	unique_id	first_name	surname	dob	city	email	group
0	1	2	Julia	Taylor	2016-01-27	London	hannah88@powers.com	0
1	1	1	Julia	Taylor	2015-07-31	London	hannah88@powers.com	0
2	3	3	Julia	Taylor	2015-10-29	None	hannah88opowersc@m	0
3	4	7	Noah	Watson	2008-02-05	tolon	matthew78@ballard-mcdonald.net	1
4	4	4	oNah	Watson	2008-03-23	Bolton	matthew78@ballard-mcdonald.net	1
5	4	5	Noah	Watson	2008-03-23	Bolton	matthew78@ballard-mcdonald.net	1
6	6	8	Watson	Noah	2008-06-15	Bolton	matthew78@ballard-mcdonald.net	1
7	6	6	Watson	Noah	2008-03-23	None	matthew78@ballard-mcdonald.net	1
8	9	9	Noah	Watson	2008-01-19	Bolton	None	1
9	10	10	Watson	Noah	2008-03-23	Bolton	matthbw78eallard-mcdonald.net	1
10	11	11	Watson	Noah	2008-01-21	Bolno	matthea78@bwllar-mcdonald.net	1
11	13	14	Moly	Bell	2002-01-05	Peterborough	lricsaunders@sievaadams.info	2
12	13	13	Molly	Bell	2002-01-05	Peterborough	None	2
13	15	15	Alexander	Amelia	1983-05-19	Glasgow	ic-mpbell@allealewis.org	3
14	16	19	lmeAi	Alexander	1983-05-19	Glasgow	icampbell@allen-lewis.org	3
15	16	16	Amelia	Alexander	1983-05-19	Glaogw	None	3
16	16	18	Amelia	Alexander	1983-05-19	Glasgow	icampbell@allen-lewis.org	3
17	17	17	Amelia	Alexander	1983-04-30	Glasgow	icampbeal@lllen-lews.org	3
18	21	21	Oliver	Connolly	1972-03-08	Plymouth	None	4
19	22	22	None	Connolly	1972-03-08	Plyohtu	derekwilliams@norris.com	4
20	23	26	Ollie	Thompson	1996-03-22	Leeds	jefferyduke@brown-alazar.org	5
21	23	23	None	Thompson	1996-03-22	Leeds	jefferyduke@brown-salazar.org	5
22	25	25	None	Thompson	1995-12-18	Leeds	None	5
23	29	29	Harris	Matilda	1983-04-30	London	patricia47@davis.com	6
24	30	30	Matilda	Harris	1983-03-03	London	patricia47@davis.com	6
25	31	31	Harris	Matilda	1983-05-24	London	None	6
26	32	32	Baxter	Aria	1992-09-07	London	christineshepherd@allen.com	7
27	32	34	Baxter	Aria	1992-09-30	London	christineshepherd@allen.com	7
28	33	33	Aria	Baxter	1992-09-07	London	christineshepherd@allen.com	7
29	33	36	Aria	Baxter	1992-09-07	London	christineshepherd@allen.com	7

In [ ]: