In this demo we link and simultaneously dedupe two datasets which themselves contain duplicates.
The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options
import pandas as pd
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100
import altair as alt
alt.renderers.enable('mimetype')
RendererRegistry.enable('mimetype')
import logging
logging.basicConfig() # Means logs will print in Jupyter Lab
# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)
from utility_functions.demo_utils import get_spark
spark = get_spark()
In this example, we read in two datasets, df_1
and df_2
. These will be provided to Splink
as a list like [df_1, df_2]
. By adding additional elements to this link, you can use Splink
to link and deduplicate as many datasets as you like.
⚠️ Note that splink
makes the following assumptions about your data:
unique_id
column if the same id values occur in more than one dataset. By default, this column is called source_dataset
, but you can change this in the settings.splink
from pyspark.sql.functions import lit
from pyspark.sql import functions as f
df_1 = spark.read.parquet("data/fake_df_l_link_dedupe_test.parquet")
df_1 = df_1.withColumn("source_dataset", lit("df_1"))
df_2 = spark.read.parquet("data/fake_df_r_link_dedupe_test.parquet")
df_2 = df_2.withColumn("source_dataset", lit("df_2"))
df_1.show(5)
df_2.show(5)
+---------+----------+-------+----------+------------+--------------------+-----+--------------+ |unique_id|first_name|surname| dob| city| email|group|source_dataset| +---------+----------+-------+----------+------------+--------------------+-----+--------------+ | 0| Julia | null|2015-10-29| London| hannah88@powers.com| 0| df_1| | 1| Julia | Taylor|2015-07-31| London| hannah88@powers.com| 0| df_1| | 2| oNah| Watson|2008-03-23| Bolton|matthew78@ballard...| 1| df_1| | 3| Noah | Watson|2008-03-23| Bolton|matthew78@ballard...| 1| df_1| | 4| Molly | Bell|2002-01-05|Peterborough| null| 2| df_1| +---------+----------+-------+----------+------------+--------------------+-----+--------------+ only showing top 5 rows +---------+----------+-------+----------+------+--------------------+-----+--------------+ |unique_id|first_name|surname| dob| city| email|group|source_dataset| +---------+----------+-------+----------+------+--------------------+-----+--------------+ | 0| Julia | Taylor|2016-01-27|London| hannah88@powers.com| 0| df_2| | 1| Julia | Taylor|2015-10-29| null| hannah88opowersc@m| 0| df_2| | 2| Watson| Noah |2008-03-23| null|matthew78@ballard...| 1| df_2| | 3| Noah | Watson|2008-02-05| tolon|matthew78@ballard...| 1| df_2| | 4| Watson| Noah |2008-06-15|Bolton|matthew78@ballard...| 1| df_2| +---------+----------+-------+----------+------+--------------------+-----+--------------+ only showing top 5 rows
settings
object¶Most of splink
configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex.
💥 We provide an tool for helping to author valid settings dictionaries, which includes tooltips and autocomplete, which you can find here.
Customisation overrides default values built into splink. For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.
To help with authoring and validation of the settings dictionary, we have written a json schema, which can be found here.
settings = {
"link_type": "link_and_dedupe",
"blocking_rules": [
"l.city = r.city"
],
"comparison_columns": [
{
"col_name": "first_name",
"num_levels": 3,
"term_frequency_adjustments": True
},
{
"col_name": "surname",
"num_levels": 3,
"term_frequency_adjustments": True
},
{
"col_name": "dob"
},
{
"col_name": "email"
}
],
"additional_columns_to_retain": ["group"]
}
In words, this setting dictionary says:
link_only
, or dedupe_only
)l.city = r.city
i.e. all pairwise record comparisons (both within and between datasets) where the city field matchesfirst_name
, surname
, dob
and email
columns to compute a match score.first_name
and surname
, string comparisons will have three levels:first_name
and surname
columnsgroup
column in the results even though this is not used as part of comparisons. This is a labelled dataset and group
contains the true match - i.e. where group matches, the records pertain to the same personColumns are assumed to be strings by default. See the 'comparison vector settings' notebook for details of configuration options.
from splink import Splink
linker = Splink(settings, [df_1, df_2], spark)
df_e = linker.get_scored_comparisons()
INFO:splink.iterate:Iteration 0 complete INFO:splink.model:The maximum change in parameters was 0.28655259609222417 for key first_name, level 0 INFO:splink.iterate:Iteration 1 complete INFO:splink.model:The maximum change in parameters was 0.07504022121429443 for key surname, level 2 INFO:splink.iterate:Iteration 2 complete INFO:splink.model:The maximum change in parameters was 0.032610535621643066 for key surname, level 0 INFO:splink.iterate:Iteration 3 complete INFO:splink.model:The maximum change in parameters was 0.018751978874206543 for key surname, level 0 INFO:splink.iterate:Iteration 4 complete INFO:splink.model:The maximum change in parameters was 0.01141306757926941 for key surname, level 0 INFO:splink.iterate:Iteration 5 complete INFO:splink.model:The maximum change in parameters was 0.007352173328399658 for key surname, level 0 INFO:splink.iterate:Iteration 6 complete INFO:splink.model:The maximum change in parameters was 0.004936426877975464 for key surname, level 0 INFO:splink.iterate:Iteration 7 complete INFO:splink.model:The maximum change in parameters was 0.003407001495361328 for key surname, level 0 INFO:splink.iterate:Iteration 8 complete INFO:splink.model:The maximum change in parameters was 0.0023947954177856445 for key surname, level 0 INFO:splink.iterate:Iteration 9 complete INFO:splink.model:The maximum change in parameters was 0.0017041563987731934 for key surname, level 0 INFO:splink.iterate:Iteration 10 complete INFO:splink.model:The maximum change in parameters was 0.0012229979038238525 for key surname, level 0 INFO:splink.iterate:Iteration 11 complete INFO:splink.model:The maximum change in parameters was 0.0008828639984130859 for key surname, level 0 INFO:splink.iterate:Iteration 12 complete INFO:splink.model:The maximum change in parameters was 0.0006399154663085938 for key surname, level 0 INFO:splink.iterate:Iteration 13 complete INFO:splink.model:The maximum change in parameters was 0.00046521425247192383 for key surname, level 0 INFO:splink.iterate:Iteration 14 complete INFO:splink.model:The maximum change in parameters was 0.0003389120101928711 for key surname, level 0 INFO:splink.iterate:Iteration 15 complete INFO:splink.model:The maximum change in parameters was 0.000247269868850708 for key surname, level 0 INFO:splink.iterate:Iteration 16 complete INFO:splink.model:The maximum change in parameters was 0.0001805722713470459 for key surname, level 0 INFO:splink.iterate:Iteration 17 complete INFO:splink.model:The maximum change in parameters was 0.00013199448585510254 for key surname, level 0 INFO:splink.iterate:Iteration 18 complete INFO:splink.model:The maximum change in parameters was 9.652972221374512e-05 for key surname, level 0 INFO:splink.iterate:EM algorithm has converged
Each row in the results table represents a comparison of two input records.
Note that since this is a link and dedupe task, there will be record comparisons of three types:
df_l
.df_r
.df_l
and one record from df_r
Information about the source table is included in the _source_table_l
and _source_table_r
fields.
# Inspect main dataframe that contains the match scores
cols_to_inspect = ["match_probability","unique_id_l", "source_dataset_l","unique_id_r","source_dataset_r","group_l", "group_r", "first_name_l","first_name_r","surname_l","surname_r","dob_l","dob_r","email_l","email_r"]
df_e.toPandas()[cols_to_inspect].sort_values(["unique_id_l", "unique_id_r"]).head(10)
match_probability | unique_id_l | source_dataset_l | unique_id_r | source_dataset_r | group_l | group_r | first_name_l | first_name_r | surname_l | surname_r | dob_l | dob_r | email_l | email_r | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
163 | 0.996844 | 0 | df_1 | 0 | df_2 | 0 | 0 | Julia | Julia | None | Taylor | 2015-10-29 | 2016-01-27 | hannah88@powers.com | hannah88@powers.com |
255 | 0.996844 | 0 | df_1 | 1 | df_1 | 0 | 0 | Julia | Julia | None | Taylor | 2015-10-29 | 2015-07-31 | hannah88@powers.com | hannah88@powers.com |
254 | 0.001500 | 0 | df_1 | 12 | df_1 | 0 | 6 | Julia | Matilda | None | Hsrir | 2015-10-29 | 1983-04-30 | hannah88@powers.com | patrcio47@davis.cam |
253 | 0.001500 | 0 | df_1 | 13 | df_1 | 0 | 6 | Julia | Matilda | None | None | 2015-10-29 | 1983-04-30 | hannah88@powers.com | patricia47@davis.com |
252 | 0.001500 | 0 | df_1 | 14 | df_1 | 0 | 7 | Julia | Baxter | None | Aria | 2015-10-29 | 1992-09-07 | hannah88@powers.com | christineshepherd@allen.com |
162 | 0.001500 | 0 | df_1 | 15 | df_2 | 0 | 6 | Julia | Harris | None | Matilda | 2015-10-29 | 1983-04-30 | hannah88@powers.com | patricia47@davis.com |
251 | 0.001500 | 0 | df_1 | 15 | df_1 | 0 | 7 | Julia | Aria | None | Baxter | 2015-10-29 | 1992-09-07 | hannah88@powers.com | christineshepherd@allen.com |
21847 | 0.000525 | 0 | df_2 | 15 | df_2 | 0 | 6 | Julia | Harris | Taylor | Matilda | 2016-01-27 | 1983-04-30 | hannah88@powers.com | patricia47@davis.com |
161 | 0.001500 | 0 | df_1 | 16 | df_2 | 0 | 6 | Julia | Matilda | None | Harris | 2015-10-29 | 1983-03-03 | hannah88@powers.com | patricia47@davis.com |
250 | 0.001500 | 0 | df_1 | 16 | df_1 | 0 | 8 | Julia | Wilson | None | Charlie | 2015-10-29 | 1998-09-15 | hannah88@powers.com | samantha81@henry.com |
The params
property of the linker
is an object that contains a lot of diagnostic information about how the match probability was computed. The following cells demonstrate some of its functionality
model = linker.model
model.probability_distribution_chart()
An alternative representation of the parameters displays them in terms of the effect different values in the comparison vectors have on the match probability:
model.bayes_factor_chart()
# If charts aren't displaying correctly in your notebook, you can write them to a file (by default splink_charts.html)
model.all_charts_write_html_file("splink_charts.html", overwrite=True)
You can also generate a report which explains how the match probability was computed for an individual comparison row.
Note that you need to convert the row to a dictionary for this to work
from splink.intuition import intuition_report
row_dict = df_e.toPandas().sample(1).to_dict(orient="records")[0]
print(intuition_report(row_dict, model))
Initial probability of match (prior) = λ = 0.03728 ------ Comparison of first_name. Values are: first_name_l: Gibson first_name_r: yarH Comparison has: 3 levels Level for this comparison: gamma_first_name = 0 m probability = P(level|match): 0.3577 u probability = P(level|non-match): 0.9931 Bayes factor = m/u: 0.3602 New probability of match (updated belief): 0.01375 ------ Comparison of surname. Values are: surname_l: Noah surname_r: None Comparison has: 3 levels Level for this comparison: gamma_surname = -1 m probability = P(level|match): 1 u probability = P(level|non-match): 1 Bayes factor = m/u: 1 New probability of match (updated belief): 0.01375 ------ Comparison of dob. Values are: dob_l: 1987-08-24 dob_r: 2017-11-24 Comparison has: 2 levels Level for this comparison: gamma_dob = 0 m probability = P(level|match): 0.3705 u probability = P(level|non-match): 0.9969 Bayes factor = m/u: 0.3717 New probability of match (updated belief): 0.005157 ------ Comparison of email. Values are: email_l: avazquez@banks.com email_r: coltonray@lee.com Comparison has: 2 levels Level for this comparison: gamma_email = 0 m probability = P(level|match): 0.2887 u probability = P(level|non-match): 0.9964 Bayes factor = m/u: 0.2898 New probability of match (updated belief): 0.0015 Final probability of match = 0.0015 Reminder: The m probability for a given level is the proportion of matches which are in this level. We would generally expect the highest similarity level to have the largest proportion of matches. For example, we would expect first name field to match exactly amongst most matching records, except where nicknames, aliases or typos have occurred. For a comparison column that changes through time, like address, we may expect a lower proportion of comparisons to be in the highest similarity level. The u probability for a given level is the proportion of non-matches which are in this level. We would generally expect the lowest similarity level to have the highest proportion of non-matches, but the magnitude depends on the cardinality of the field. For example, we would expect that in the vast majority of non-matching records, the date of birth field would not match. However, we would expect it to be common for gender to match amongst non-matches.
from graphframes import GraphFrame
from splink.vertically_concat import vertically_concatenate_datasets
df_1.createOrReplaceTempView("df_1")
df_2.createOrReplaceTempView("df_2")
df_e.createOrReplaceTempView("df_e")
sql = """
select unique_id, source_dataset, concat(unique_id, '__-__', source_dataset) as id
from df_1
union
select unique_id, source_dataset, concat(unique_id, '__-__', source_dataset) as id
from df_2
"""
nodes = spark.sql(sql)
sql = """
select
concat(unique_id_l, '__-__', source_dataset_l) as src,
concat(unique_id_r, '__-__', source_dataset_r) as dst,
match_probability
from df_e
where match_probability > 0.99
"""
edges = spark.sql(sql)
g = GraphFrame(nodes, edges)
cc = g.connectedComponents()
cc.createOrReplaceTempView("cc")
df_nodes = vertically_concatenate_datasets([df_1, df_2])
df_nodes.createOrReplaceTempView("df_nodes")
sql = """
select cc.component, df_nodes.*
from cc
left join
df_nodes on
cc.unique_id = df_nodes.unique_id and cc.source_dataset = df_nodes.source_dataset
order by component
"""
results = spark.sql(sql)
results.toPandas().head(30)
results.show()
+---------+---------+----------+---------+----------+---------------+--------------------+-----+--------------+ |component|unique_id|first_name| surname| dob| city| email|group|source_dataset| +---------+---------+----------+---------+----------+---------------+--------------------+-----+--------------+ | 0| 107| Gabriel | null|1977-12-21| Leds|rachelschtitm@geo...| 57| df_1| | 1| 112| Nathan | lconaMd|1996-09-09| London|birara84@robbins.baz| 29| df_2| | 1| 111| Ntaan | Mcdonald|1996-09-09| London| null| 29| df_2| | 1| 56| Nathan | Mcdonald|1996-09-09| London|barbara84@robbins...| 29| df_1| | 1| 55| Nathan | Mcdonald|1996-09-09| London|barbara84@robbins...| 29| df_1| | 1| 109| null| Mcdonald|1996-09-09| London|barbara84@robbins...| 29| df_2| | 2| 247| Taylor| Nancy |1989-07-25| London|wagnershane@landr...| 62| df_2| | 2| 118| Nancy | Taylor|1989-07-25| London|wagnershane@landr...| 62| df_1| | 2| 245| Nancy | Taylor|1989-07-25| London|wagnershane@landr...| 62| df_2| | 2| 248| Nacy | Taylor|1989-07-25| London|wagnershane@landr...| 62| df_2| | 2| 117| null| Taylor|1989-07-25| London|wagnershane@landr...| 62| df_1| | 2| 246| Nancy | Taylor|1989-08-19| London|wagnershane@landr...| 62| df_2| | 3| 119| Lewis| Freddie |1971-12-19|Southend-on-Sea|tyler28@weaver-al...| 63| df_1| | 3| 249| Freddi | Lewis|1971-12-19|Southend-on-Sea|tyler28@wecveralv...| 63| df_2| | 3| 250| null| Lewis|1971-12-19|Southend-on-Sea|tyler28@weaver-al...| 63| df_2| | 3| 120| Lewis| Freddie |1971-09-14|Southend-on-Sea|tyler28@weaver-al...| 63| df_1| | 4| 10| Amelia |Alexander|1983-05-19| Glasgow|icampbell@allen-l...| 3| df_2| | 4| 11| lmeAi |Alexander|1983-05-19| Glasgow|icampbell@allen-l...| 3| df_2| | 6| 253| Aad | Berry|1986-02-08| London|walterdunn@pham-h...| 64| df_2| | 6| 121| Adam | Berry|1986-02-08| London|walterdunn@pham-h...| 64| df_1| +---------+---------+----------+---------+----------+---------------+--------------------+-----+--------------+ only showing top 20 rows