Splink data linking and deduplication demo¶

In this demo we link and simultaneously dedupe two datasets which themselves contain duplicates.

Step 1: Imports and setup¶

The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options

In [1]:

import pandas as pd 
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100
import altair as alt
alt.renderers.enable('mimetype')

Out[1]:

RendererRegistry.enable('mimetype')

In [2]:

import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

In [3]:

from utility_functions.demo_utils import get_spark
spark = get_spark()

Step 2: Read in data¶

In this example, we read in two datasets, df_1 and df_2. These will be provided to Splink as a list like [df_1, df_2]. By adding additional elements to this link, you can use Splink to link and deduplicate as many datasets as you like.

⚠️ Note that splink makes the following assumptions about your data:

There is a field containing a unique record identifier in each dataset
There is a field containing a dataset name in each dataset, to disambiguate the unique_id column if the same id values occur in more than one dataset. By default, this column is called source_dataset, but you can change this in the settings.
The two datasets being linked have common column names - e.g. date of birth is represented in both datasets in a field of the same name. In many cases, this means that the user needs to rename columns prior to using splink

In [4]:

from pyspark.sql.functions import lit 
from pyspark.sql import functions as f
df_1 = spark.read.parquet("data/fake_df_l_link_dedupe_test.parquet")
df_1 = df_1.withColumn("source_dataset", lit("df_1"))
df_2 = spark.read.parquet("data/fake_df_r_link_dedupe_test.parquet")
df_2 = df_2.withColumn("source_dataset", lit("df_2"))
df_1.show(5)
df_2.show(5)

+---------+----------+-------+----------+------------+--------------------+-----+--------------+
|unique_id|first_name|surname|       dob|        city|               email|group|source_dataset|
+---------+----------+-------+----------+------------+--------------------+-----+--------------+
|        0|    Julia |   null|2015-10-29|      London| hannah88@powers.com|    0|          df_1|
|        1|    Julia | Taylor|2015-07-31|      London| hannah88@powers.com|    0|          df_1|
|        2|      oNah| Watson|2008-03-23|      Bolton|matthew78@ballard...|    1|          df_1|
|        3|     Noah | Watson|2008-03-23|      Bolton|matthew78@ballard...|    1|          df_1|
|        4|    Molly |   Bell|2002-01-05|Peterborough|                null|    2|          df_1|
+---------+----------+-------+----------+------------+--------------------+-----+--------------+
only showing top 5 rows

+---------+----------+-------+----------+------+--------------------+-----+--------------+
|unique_id|first_name|surname|       dob|  city|               email|group|source_dataset|
+---------+----------+-------+----------+------+--------------------+-----+--------------+
|        0|    Julia | Taylor|2016-01-27|London| hannah88@powers.com|    0|          df_2|
|        1|    Julia | Taylor|2015-10-29|  null|  hannah88opowersc@m|    0|          df_2|
|        2|    Watson|  Noah |2008-03-23|  null|matthew78@ballard...|    1|          df_2|
|        3|     Noah | Watson|2008-02-05| tolon|matthew78@ballard...|    1|          df_2|
|        4|    Watson|  Noah |2008-06-15|Bolton|matthew78@ballard...|    1|          df_2|
+---------+----------+-------+----------+------+--------------------+-----+--------------+
only showing top 5 rows

Step 3: Configure splink using the `settings` object¶

Most of splink configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex.

💥 We provide an tool for helping to author valid settings dictionaries, which includes tooltips and autocomplete, which you can find here.

Customisation overrides default values built into splink. For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.

To help with authoring and validation of the settings dictionary, we have written a json schema, which can be found here.

In [5]:

settings = {
    "link_type": "link_and_dedupe",
    "blocking_rules": [
        "l.city = r.city"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"]
}

In words, this setting dictionary says:

We are performing a link and deduplication task (the other options are link_only, or dedupe_only)
We are going generate comparisons subject to the blocking rule l.city = r.city i.e. all pairwise record comparisons (both within and between datasets) where the city field matches
When comparing records, we will use information from the first_name, surname, dob and email columns to compute a match score.
For first_name and surname, string comparisons will have three levels:
- Level 2: Strings are (almost) exactly the same
- Level 1: Strings are similar
- Level 0: No match
We will make adjustments for term frequencies on the first_name and surname columns
We will retain the group column in the results even though this is not used as part of comparisons. This is a labelled dataset and group contains the true match - i.e. where group matches, the records pertain to the same person

Step 4: Estimate match scores using the Expectation Maximisation algorithm¶

Columns are assumed to be strings by default. See the 'comparison vector settings' notebook for details of configuration options.

In [6]:

from splink import Splink

linker = Splink(settings, [df_1, df_2], spark)
df_e = linker.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.28655259609222417 for key first_name, level 0
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.07504022121429443 for key surname, level 2
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.032610535621643066 for key surname, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.018751978874206543 for key surname, level 0
INFO:splink.iterate:Iteration 4 complete
INFO:splink.model:The maximum change in parameters was 0.01141306757926941 for key surname, level 0
INFO:splink.iterate:Iteration 5 complete
INFO:splink.model:The maximum change in parameters was 0.007352173328399658 for key surname, level 0
INFO:splink.iterate:Iteration 6 complete
INFO:splink.model:The maximum change in parameters was 0.004936426877975464 for key surname, level 0
INFO:splink.iterate:Iteration 7 complete
INFO:splink.model:The maximum change in parameters was 0.003407001495361328 for key surname, level 0
INFO:splink.iterate:Iteration 8 complete
INFO:splink.model:The maximum change in parameters was 0.0023947954177856445 for key surname, level 0
INFO:splink.iterate:Iteration 9 complete
INFO:splink.model:The maximum change in parameters was 0.0017041563987731934 for key surname, level 0
INFO:splink.iterate:Iteration 10 complete
INFO:splink.model:The maximum change in parameters was 0.0012229979038238525 for key surname, level 0
INFO:splink.iterate:Iteration 11 complete
INFO:splink.model:The maximum change in parameters was 0.0008828639984130859 for key surname, level 0
INFO:splink.iterate:Iteration 12 complete
INFO:splink.model:The maximum change in parameters was 0.0006399154663085938 for key surname, level 0
INFO:splink.iterate:Iteration 13 complete
INFO:splink.model:The maximum change in parameters was 0.00046521425247192383 for key surname, level 0
INFO:splink.iterate:Iteration 14 complete
INFO:splink.model:The maximum change in parameters was 0.0003389120101928711 for key surname, level 0
INFO:splink.iterate:Iteration 15 complete
INFO:splink.model:The maximum change in parameters was 0.000247269868850708 for key surname, level 0
INFO:splink.iterate:Iteration 16 complete
INFO:splink.model:The maximum change in parameters was 0.0001805722713470459 for key surname, level 0
INFO:splink.iterate:Iteration 17 complete
INFO:splink.model:The maximum change in parameters was 0.00013199448585510254 for key surname, level 0
INFO:splink.iterate:Iteration 18 complete
INFO:splink.model:The maximum change in parameters was 9.652972221374512e-05 for key surname, level 0
INFO:splink.iterate:EM algorithm has converged

Step 5: Inspect results¶

Each row in the results table represents a comparison of two input records.

Note that since this is a link and dedupe task, there will be record comparisons of three types:

Comparison between two records both originating from df_l.
Comparison between two records both originating from df_r.
Comparison between one record from df_l and one record from df_r

Information about the source table is included in the _source_table_l and _source_table_r fields.

In [7]:

# Inspect main dataframe that contains the match scores
cols_to_inspect = ["match_probability","unique_id_l", "source_dataset_l","unique_id_r","source_dataset_r","group_l", "group_r", "first_name_l","first_name_r","surname_l","surname_r","dob_l","dob_r","email_l","email_r"]

df_e.toPandas()[cols_to_inspect].sort_values(["unique_id_l", "unique_id_r"]).head(10)

Out[7]:

	match_probability	source_dataset_l	unique_id_r	source_dataset_r	group_r	first_name_l	first_name_r	surname_l	surname_r	dob_l	dob_r	email_l	email_r
163	0.996844	df_1	0	df_2	0	Julia	Julia	None	Taylor	2015-10-29	2016-01-27	hannah88@powers.com	hannah88@powers.com
255	0.996844	df_1	1	df_1	0	Julia	Julia	None	Taylor	2015-10-29	2015-07-31	hannah88@powers.com	hannah88@powers.com
254	0.001500	df_1	12	df_1	6	Julia	Matilda	None	Hsrir	2015-10-29	1983-04-30	hannah88@powers.com	patrcio47@davis.cam
253	0.001500	df_1	13	df_1	6	Julia	Matilda	None	None	2015-10-29	1983-04-30	hannah88@powers.com	patricia47@davis.com
252	0.001500	df_1	14	df_1	7	Julia	Baxter	None	Aria	2015-10-29	1992-09-07	hannah88@powers.com	christineshepherd@allen.com
162	0.001500	df_1	15	df_2	6	Julia	Harris	None	Matilda	2015-10-29	1983-04-30	hannah88@powers.com	patricia47@davis.com
251	0.001500	df_1	15	df_1	7	Julia	Aria	None	Baxter	2015-10-29	1992-09-07	hannah88@powers.com	christineshepherd@allen.com
21847	0.000525	df_2	15	df_2	6	Julia	Harris	Taylor	Matilda	2016-01-27	1983-04-30	hannah88@powers.com	patricia47@davis.com
161	0.001500	df_1	16	df_2	6	Julia	Matilda	None	Harris	2015-10-29	1983-03-03	hannah88@powers.com	patricia47@davis.com
250	0.001500	df_1	16	df_1	8	Julia	Wilson	None	Charlie	2015-10-29	1998-09-15	hannah88@powers.com	samantha81@henry.com

The params property of the linker is an object that contains a lot of diagnostic information about how the match probability was computed. The following cells demonstrate some of its functionality

In [8]:

model = linker.model
model.probability_distribution_chart()

Out[8]:

An alternative representation of the parameters displays them in terms of the effect different values in the comparison vectors have on the match probability:

In [9]:

model.bayes_factor_chart()

Out[9]:

In [10]:

# If charts aren't displaying correctly in your notebook, you can write them to a file (by default splink_charts.html)
model.all_charts_write_html_file("splink_charts.html", overwrite=True)

You can also generate a report which explains how the match probability was computed for an individual comparison row.

Note that you need to convert the row to a dictionary for this to work

In [11]:

from splink.intuition import intuition_report
row_dict = df_e.toPandas().sample(1).to_dict(orient="records")[0]
print(intuition_report(row_dict, model))

Initial probability of match (prior) = λ = 0.03728
------
Comparison of first_name.  Values are:             
first_name_l:                                      Gibson
first_name_r:                                      yarH 
Comparison has:                                    3 levels
Level for this comparison:                         gamma_first_name = 0
m probability = P(level|match):                    0.3577
u probability = P(level|non-match):                0.9931
Bayes factor = m/u:                                0.3602
New probability of match (updated belief):         0.01375

------
Comparison of surname.  Values are:                
surname_l:                                         Noah 
surname_r:                                         None
Comparison has:                                    3 levels
Level for this comparison:                         gamma_surname = -1
m probability = P(level|match):                    1
u probability = P(level|non-match):                1
Bayes factor = m/u:                                1
New probability of match (updated belief):         0.01375

------
Comparison of dob.  Values are:                    
dob_l:                                             1987-08-24
dob_r:                                             2017-11-24
Comparison has:                                    2 levels
Level for this comparison:                         gamma_dob = 0
m probability = P(level|match):                    0.3705
u probability = P(level|non-match):                0.9969
Bayes factor = m/u:                                0.3717
New probability of match (updated belief):         0.005157

------
Comparison of email.  Values are:                  
email_l:                                           avazquez@banks.com
email_r:                                           coltonray@lee.com
Comparison has:                                    2 levels
Level for this comparison:                         gamma_email = 0
m probability = P(level|match):                    0.2887
u probability = P(level|non-match):                0.9964
Bayes factor = m/u:                                0.2898
New probability of match (updated belief):         0.0015


Final probability of match = 0.0015

Reminder:

The m probability for a given level is the proportion of matches which are in this level.
We would generally expect the highest similarity level to have the largest proportion of matches.
For example, we would expect first name field to match exactly amongst most matching records, except where nicknames, aliases or typos have occurred.
For a comparison column that changes through time, like address, we may expect a lower proportion of comparisons to be in the highest similarity level.

The u probability for a given level is the proportion of non-matches which are in this level.
We would generally expect the lowest similarity level to have the highest proportion of non-matches, but the magnitude depends on the cardinality of the field.
For example, we would expect that in the vast majority of non-matching records, the date of birth field would not match.  However, we would expect it to be common for gender to match amongst non-matches.

Use graphframes to resolve links into groups¶

In [12]:

from graphframes import GraphFrame
from splink.vertically_concat import vertically_concatenate_datasets

df_1.createOrReplaceTempView("df_1")
df_2.createOrReplaceTempView("df_2")
df_e.createOrReplaceTempView("df_e")

sql = """
select unique_id, source_dataset, concat(unique_id, '__-__', source_dataset) as id 
from df_1
union
select unique_id, source_dataset, concat(unique_id, '__-__', source_dataset) as id 
from df_2
"""
nodes = spark.sql(sql)

sql = """
select
concat(unique_id_l, '__-__', source_dataset_l) as src,
concat(unique_id_r, '__-__', source_dataset_r) as dst,
match_probability
from df_e
where match_probability > 0.99
"""
edges = spark.sql(sql)

g = GraphFrame(nodes, edges)

cc = g.connectedComponents()

cc.createOrReplaceTempView("cc")

df_nodes = vertically_concatenate_datasets([df_1, df_2])
df_nodes.createOrReplaceTempView("df_nodes")

sql = """
select cc.component, df_nodes.*
from cc
left join
df_nodes on 
cc.unique_id = df_nodes.unique_id and cc.source_dataset = df_nodes.source_dataset
order by component
"""
results = spark.sql(sql)

results.toPandas().head(30)

results.show()

+---------+---------+----------+---------+----------+---------------+--------------------+-----+--------------+
|component|unique_id|first_name|  surname|       dob|           city|               email|group|source_dataset|
+---------+---------+----------+---------+----------+---------------+--------------------+-----+--------------+
|        0|      107|  Gabriel |     null|1977-12-21|           Leds|rachelschtitm@geo...|   57|          df_1|
|        1|      112|   Nathan |  lconaMd|1996-09-09|         London|birara84@robbins.baz|   29|          df_2|
|        1|      111|    Ntaan | Mcdonald|1996-09-09|         London|                null|   29|          df_2|
|        1|       56|   Nathan | Mcdonald|1996-09-09|         London|barbara84@robbins...|   29|          df_1|
|        1|       55|   Nathan | Mcdonald|1996-09-09|         London|barbara84@robbins...|   29|          df_1|
|        1|      109|      null| Mcdonald|1996-09-09|         London|barbara84@robbins...|   29|          df_2|
|        2|      247|    Taylor|   Nancy |1989-07-25|         London|wagnershane@landr...|   62|          df_2|
|        2|      118|    Nancy |   Taylor|1989-07-25|         London|wagnershane@landr...|   62|          df_1|
|        2|      245|    Nancy |   Taylor|1989-07-25|         London|wagnershane@landr...|   62|          df_2|
|        2|      248|     Nacy |   Taylor|1989-07-25|         London|wagnershane@landr...|   62|          df_2|
|        2|      117|      null|   Taylor|1989-07-25|         London|wagnershane@landr...|   62|          df_1|
|        2|      246|    Nancy |   Taylor|1989-08-19|         London|wagnershane@landr...|   62|          df_2|
|        3|      119|     Lewis| Freddie |1971-12-19|Southend-on-Sea|tyler28@weaver-al...|   63|          df_1|
|        3|      249|   Freddi |    Lewis|1971-12-19|Southend-on-Sea|tyler28@wecveralv...|   63|          df_2|
|        3|      250|      null|    Lewis|1971-12-19|Southend-on-Sea|tyler28@weaver-al...|   63|          df_2|
|        3|      120|     Lewis| Freddie |1971-09-14|Southend-on-Sea|tyler28@weaver-al...|   63|          df_1|
|        4|       10|   Amelia |Alexander|1983-05-19|        Glasgow|icampbell@allen-l...|    3|          df_2|
|        4|       11|    lmeAi |Alexander|1983-05-19|        Glasgow|icampbell@allen-l...|    3|          df_2|
|        6|      253|      Aad |    Berry|1986-02-08|         London|walterdunn@pham-h...|   64|          df_2|
|        6|      121|     Adam |    Berry|1986-02-08|         London|walterdunn@pham-h...|   64|          df_1|
+---------+---------+----------+---------+----------+---------------+--------------------+-----+--------------+
only showing top 20 rows

In [ ]: