In this demo we link two small datasets.
The larger table contains duplicates, but in this notebook we use the link_only setting, so splink makes no attempt to deduplicate these records. Note that it is possible to simultaneously link and dedupe using the link_and_dedupe setting.
Important: where deduplication is not required, link_only can provide a significant performance boost by dramatically reducing the number of records which need to be compared.
For example, if you wanted to link 10 records to 1,000, then the maximum number of comparisons that need to be made (i.e. with no blocking rules) is 10 × 1,000 = 10,000. If you need to dedupe as well, the pooled n = 1,010 records mean that number rises to n(n-1)/2 = 509,545.
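This arithmetic is easy to verify. A quick sketch in plain Python (not part of splink):

```python
def max_comparisons_link_only(n_l, n_r):
    """Link only: each record in the left table is compared
    against each record in the right table."""
    return n_l * n_r

def max_comparisons_link_and_dedupe(n_l, n_r):
    """Link and dedupe: every distinct pair drawn from the
    pooled records must be compared."""
    n = n_l + n_r
    return n * (n - 1) // 2

print(max_comparisons_link_only(10, 1000))        # 10000
print(max_comparisons_link_and_dedupe(10, 1000))  # 509545
```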
The following is just boilerplate code that sets up the Spark session and sets some other non-essential configuration options.
import pandas as pd
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100
import altair as alt
alt.renderers.enable('mimetype')
import logging
logging.basicConfig() # Means logs will print in Jupyter Lab
# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)
from utility_functions.demo_utils import get_spark
spark = get_spark()
In this example, we link two datasets, but you can link as many as you like.
⚠️ Note that splink makes the following assumptions about your data:

- Each input dataset must contain a unique id column. By default, this column is called unique_id, but you can change this in the settings.
- Each record needs a column identifying which dataset it came from, to disambiguate the unique_id column if the same id values occur in more than one dataset. By default, this column is called source_dataset, but you can change this in the settings.
- splink makes no attempt to clean your data; it assumes the input datasets have already been standardised.
from pyspark.sql.functions import lit
df_1 = spark.read.parquet("data/fake_df_l.parquet")
df_1 = df_1.withColumn("source_dataset", lit("df_1"))
df_2 = spark.read.parquet("data/fake_df_r.parquet")
df_2 = df_2.withColumn("source_dataset", lit("df_2"))
print(f"The count of rows in `df_1` is {df_1.count()}")
df_1.show(5)
print(f"The count of rows in `df_2` is {df_2.count()}")
df_2.show(5)
The count of rows in `df_1` is 181

+---------+----------+-------+----------+------------+--------------------+-----+--------------+
|unique_id|first_name|surname|       dob|        city|               email|group|source_dataset|
+---------+----------+-------+----------+------------+--------------------+-----+--------------+
|        0|    Julia |   null|2015-10-29|      London| hannah88@powers.com|    0|          df_1|
|        4|      oNah| Watson|2008-03-23|      Bolton|matthew78@ballard...|    1|          df_1|
|       13|    Molly |   Bell|2002-01-05|Peterborough|                null|    2|          df_1|
|       15| Alexander|Amelia |1983-05-19|     Glasgow|ic-mpbell@alleale...|    3|          df_1|
|       20|    Ol vri|ynnollC|1972-03-08|    Plymouth|derekwilliams@nor...|    4|          df_1|
+---------+----------+-------+----------+------------+--------------------+-----+--------------+
only showing top 5 rows

The count of rows in `df_2` is 819

+---------+----------+-------+----------+------+--------------------+-----+--------------+
|unique_id|first_name|surname|       dob|  city|               email|group|source_dataset|
+---------+----------+-------+----------+------+--------------------+-----+--------------+
|        1|    Julia | Taylor|2015-07-31|London| hannah88@powers.com|    0|          df_2|
|        2|    Julia | Taylor|2016-01-27|London| hannah88@powers.com|    0|          df_2|
|        3|    Julia | Taylor|2015-10-29|  null|  hannah88opowersc@m|    0|          df_2|
|        5|     Noah | Watson|2008-03-23|Bolton|matthew78@ballard...|    1|          df_2|
|        6|    Watson|  Noah |2008-03-23|  null|matthew78@ballard...|    1|          df_2|
+---------+----------+-------+----------+------+--------------------+-----+--------------+
only showing top 5 rows
The settings object

Most of splink's configuration options are stored in a settings dictionary. This dictionary allows significant customisation, and can therefore get quite complex.

💥 We provide a tool to help author valid settings dictionaries, with tooltips and autocomplete, which you can find here.

Customisation overrides the default values built into splink. For the purposes of this demo, we will specify a simple settings dictionary, which means we will be relying on these sensible defaults.

To help with authoring and validation of the settings dictionary, we have written a json schema, which can be found here.
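Because the settings dictionary is plain JSON-serialisable data, you can also sanity-check it yourself before handing it to splink. The function below is a minimal hand-rolled check, not the official json schema; the required keys are inferred from this demo:

```python
def check_settings(settings):
    """Rough structural check on a splink-style settings dict.
    Illustrative sketch only - the real json schema is far more thorough."""
    errors = []
    # link_type is required and must be one of the three documented values
    if settings.get("link_type") not in ("link_only", "dedupe_only", "link_and_dedupe"):
        errors.append("link_type must be link_only, dedupe_only or link_and_dedupe")
    # each comparison column needs either a col_name or a custom_name
    cols = settings.get("comparison_columns")
    if not isinstance(cols, list):
        errors.append("comparison_columns must be a list")
    else:
        for i, col in enumerate(cols):
            if "col_name" not in col and "custom_name" not in col:
                errors.append(f"comparison_columns[{i}] needs col_name or custom_name")
    return errors

print(check_settings({"link_type": "link_only",
                      "comparison_columns": [{"col_name": "city"}]}))  # []
```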
# The comparison expression allows for the case where a first name and surname have been inverted
sql_case_expression = """
CASE
WHEN first_name_l = first_name_r AND surname_l = surname_r THEN 4
WHEN first_name_l = surname_r AND surname_l = first_name_r THEN 3
WHEN first_name_l = first_name_r THEN 2
WHEN surname_l = surname_r THEN 1
ELSE 0
END
"""
settings = {
"link_type": "link_only",
"max_iterations": 20,
"blocking_rules": [
],
"comparison_columns": [
{
"custom_name": "name_inversion",
"custom_columns_used": ["first_name", "surname"],
"case_expression": sql_case_expression,
"num_levels": 5
},
{
"col_name": "city",
"num_levels": 3
},
{
"col_name": "email",
"num_levels": 3
},
{
"col_name": "dob"
}
],
"additional_columns_to_retain": ["group"]
}
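The branching of the custom CASE expression can be traced in plain Python. This is an illustrative mirror of the SQL logic, not how splink actually evaluates it:

```python
def name_inversion_level(first_name_l, surname_l, first_name_r, surname_r):
    """Mirror of the SQL CASE expression: level 4 is an exact match on
    both names, level 3 catches an inverted first name/surname pair."""
    if first_name_l == first_name_r and surname_l == surname_r:
        return 4
    if first_name_l == surname_r and surname_l == first_name_r:
        return 3
    if first_name_l == first_name_r:
        return 2
    if surname_l == surname_r:
        return 1
    return 0

print(name_inversion_level("Emily", "Brown", "Brown", "Emily"))  # 3: names inverted
print(name_inversion_level("Emily", "Brown", "Emily", "Brown"))  # 4: exact match
```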
In words, this settings dictionary says:

- We are linking only, not deduplicating (the other link_type options are dedupe_only, or link_and_dedupe).
- We will use information in the first_name, surname, city and email columns to compute a match score.
- When comparing the first_name and surname columns, we allow the possibility that the names have been inputted in the wrong order.
- The custom name_inversion comparison has five levels, with the highest level reserved for the case where first_name and surname both match.
- We retain the group column in the results even though this is not used as part of comparisons. This is a labelled dataset and group contains the true match - i.e. where group matches, the records pertain to the same person.

from splink import Splink
linker = Splink(settings, [df_1, df_2], spark)
df_e = linker.get_scored_comparisons()
# Later, we will make term frequency adjustments.
# persist() caches these results in memory, preventing them from having to be recomputed when we make these adjustments.
df_e.persist()
/Users/robinlinacre/anaconda3/lib/python3.8/site-packages/splink/default_settings.py:199: UserWarning: You have not specified any blocking rules, meaning all comparisons between the input dataset(s) will be generated and blocking will not be used. For large input datasets, this will generally be computationally intractable because it will generate comparisons equal to the number of rows squared.
  warnings.warn(
INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.40568520724773405 for key name_inversion, level 4
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.06933289766311646 for key email, level 1
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.02503591775894165 for key dob, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.009511321783065796 for key dob, level 0
INFO:splink.iterate:Iteration 4 complete
INFO:splink.model:The maximum change in parameters was 0.004227638244628906 for key dob, level 0
INFO:splink.iterate:Iteration 5 complete
INFO:splink.model:The maximum change in parameters was 0.0022344589233398438 for key dob, level 0
INFO:splink.iterate:Iteration 6 complete
INFO:splink.model:The maximum change in parameters was 0.001312553882598877 for key dob, level 1
INFO:splink.iterate:Iteration 7 complete
INFO:splink.model:The maximum change in parameters was 0.0008212625980377197 for key dob, level 0
INFO:splink.iterate:Iteration 8 complete
INFO:splink.model:The maximum change in parameters was 0.0005371570587158203 for key dob, level 0
INFO:splink.iterate:Iteration 9 complete
INFO:splink.model:The maximum change in parameters was 0.0003641173243522644 for key city, level 0
INFO:splink.iterate:Iteration 10 complete
INFO:splink.model:The maximum change in parameters was 0.0002571418881416321 for key city, level 0
INFO:splink.iterate:Iteration 11 complete
INFO:splink.model:The maximum change in parameters was 0.0001854151487350464 for key city, level 0
INFO:splink.iterate:Iteration 12 complete
INFO:splink.model:The maximum change in parameters was 0.0001360774040222168 for key city, level 0
INFO:splink.iterate:Iteration 13 complete
INFO:splink.model:The maximum change in parameters was 0.0001013725996017456 for key city, level 0
INFO:splink.iterate:Iteration 14 complete
INFO:splink.model:The maximum change in parameters was 7.649511098861694e-05 for key city, level 0
INFO:splink.iterate:EM algorithm has converged
DataFrame[match_probability: double, source_dataset_l: string, unique_id_l: bigint, source_dataset_r: string, unique_id_r: bigint, first_name_l: string, first_name_r: string, surname_l: string, surname_r: string, gamma_name_inversion: int, city_l: string, city_r: string, gamma_city: int, email_l: string, email_r: string, gamma_email: int, dob_l: string, dob_r: string, gamma_dob: int, group_l: bigint, group_r: bigint]
# Inspect main dataframe that contains the match scores
df_e.toPandas().sort_values("match_probability", ascending=False).head(20)
|   | match_probability | source_dataset_l | unique_id_l | source_dataset_r | unique_id_r | first_name_l | first_name_r | surname_l | surname_r | gamma_name_inversion | city_l | city_r | gamma_city | email_l | email_r | gamma_email | dob_l | dob_r | gamma_dob | group_l | group_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58499 | 1.0 | df_1 | 419 | df_2 | 422 | Emily | Brown | Brown | Emily | 3 | Lndon | London | 1 | sarahbrown@mckinney.com | sarahnron@mckinbey.com | 1 | 2005-07-15 | 2005-07-15 | 1 | 71 | 71 |
79930 | 1.0 | df_1 | 581 | df_2 | 585 | Eleanor | Shaw | Shaw | Eleanor | 3 | Birmingham | Birmingha | 1 | stephaniewebbhart.net | stephaniewebb@hart.net | 1 | 1979-03-31 | 1979-03-31 | 1 | 97 | 97 |
93101 | 1.0 | df_1 | 664 | df_2 | 668 | Ivy | Taylor | Taylor | Ivy | 3 | Lonon | London | 1 | jonesjennmfer@pitt.coi | jonesjennifer@pitts.com | 1 | 1980-01-13 | 1980-01-13 | 1 | 113 | 113 |
93106 | 1.0 | df_1 | 664 | df_2 | 673 | Ivy | Taylor | Taylor | Ivy | 3 | Lonon | London | 1 | jonesjennmfer@pitt.coi | jonesjennifer@pitts.com | 1 | 1980-01-13 | 1980-01-13 | 1 | 113 | 113 |
2471 | 1.0 | df_1 | 15 | df_2 | 18 | Alexander | Amelia | Amelia | Alexander | 3 | Glasgow | Glasgow | 2 | ic-mpbell@allealewis.org | icampbell@allen-lewis.org | 1 | 1983-05-19 | 1983-05-19 | 1 | 3 | 3 |
137531 | 1.0 | df_1 | 924 | df_2 | 926 | Mills | Thomas | Thomas | Mills | 3 | London | London | 2 | hensondebbie@garcia.com | hensondrbbie@gaeia.com | 1 | 1970-03-09 | 1970-03-09 | 1 | 167 | 167 |
79105 | 1.0 | df_1 | 574 | df_2 | 578 | George | Williams | Williams | George | 3 | London | London | 2 | desek58gibbr.biz | derek58@gibbs.biz | 1 | 1981-08-06 | 1981-08-06 | 1 | 96 | 96 |
79104 | 1.0 | df_1 | 574 | df_2 | 577 | George | Williams | Williams | George | 3 | London | London | 2 | desek58gibbr.biz | derek58@gibbs.biz | 1 | 1981-08-06 | 1981-08-06 | 1 | 96 | 96 |
142479 | 1.0 | df_1 | 960 | df_2 | 966 | Gabriel | Bartlett | Bartlett | Gabriel | 3 | Wolverhampton | Wolverhampton | 2 | ogomez@robinson-mckinney.com | ogomez@rob-nsonimcknney.com | 1 | 1973-12-09 | 1973-12-09 | 1 | 173 | 173 |
29657 | 1.0 | df_1 | 209 | df_2 | 210 | Thompson | Freddie | Freddie | Thompson | 3 | Peterborough | Peterborough | 2 | scottsalinas@hughes-lopez.com | scottsalinah@ughes-lopez.com | 1 | 1999-07-23 | 1999-07-23 | 1 | 36 | 36 |
73322 | 1.0 | df_1 | 517 | df_2 | 521 | Brown | Martha | Martha | Brown | 3 | Southend-on-Sea | Southend-on-Sea | 2 | watsonthomas@jones-stuart.biz | watsonthomas@onesistuart.b-z | 1 | 2002-09-01 | 2002-09-01 | 1 | 89 | 89 |
73327 | 1.0 | df_1 | 517 | df_2 | 526 | Brown | Martha | Martha | Brown | 3 | Southend-on-Sea | Southend-on-Sea | 2 | watsonthomas@jones-stuart.biz | watsonthomas@jones-s.urttbiz | 1 | 2002-09-01 | 2002-09-01 | 1 | 89 | 89 |
102976 | 1.0 | df_1 | 726 | df_2 | 727 | Harry | Lawrence | Lawrence | Harry | 3 | Stoke-on-Trent | Stoke-on-Trent | 2 | aarbarpace@mbnning.org | barbarapace@manning.org | 1 | 2016-12-25 | 2016-12-25 | 1 | 125 | 125 |
93102 | 1.0 | df_1 | 664 | df_2 | 669 | Ivy | Ivy | Taylor | Taylor | 4 | Lonon | Lodno | 1 | jonesjennmfer@pitt.coi | jonesjennifer@pitts.com | 1 | 1980-01-13 | 1980-01-13 | 1 | 113 | 113 |
79102 | 1.0 | df_1 | 574 | df_2 | 575 | George | George | Williams | Williams | 4 | London | Lndon | 1 | desek58gibbr.biz | derek58@gibbs.biz | 1 | 1981-08-06 | 1981-08-06 | 1 | 96 | 96 |
93100 | 1.0 | df_1 | 664 | df_2 | 667 | Ivy | Ivy | Taylor | Taylor | 4 | Lonon | London | 1 | jonesjennmfer@pitt.coi | jonesjennifer@pitts.com | 1 | 1980-01-13 | 1980-01-13 | 1 | 113 | 113 |
102979 | 1.0 | df_1 | 726 | df_2 | 730 | Harry | Harry | Lawrence | Lawrence | 4 | Stoke-on-Trent | Stoke-on-ernt | 1 | aarbarpace@mbnning.org | barbarapace@manning.org | 1 | 2016-12-25 | 2016-12-25 | 1 | 125 | 125 |
128490 | 1.0 | df_1 | 879 | df_2 | 883 | Leo | Leo | Jones | Jones | 4 | Ldnon | London | 1 | tcarr@lewis-kline.com | tcarr@lweis-kine.com | 1 | 2019-06-15 | 2019-06-15 | 1 | 156 | 156 |
79934 | 1.0 | df_1 | 581 | df_2 | 589 | Eleanor | Eleanor | Shaw | Shaw | 4 | Birmingham | Birmingham | 2 | stephaniewebbhart.net | stephaniewebb@hart.net | 1 | 1979-03-31 | 1979-03-31 | 1 | 97 | 97 |
79103 | 1.0 | df_1 | 574 | df_2 | 576 | George | George | Williams | Williams | 4 | London | London | 2 | desek58gibbr.biz | derek58@gibbs.biz | 1 | 1981-08-06 | 1981-08-06 | 1 | 96 | 96 |
The model property of the linker is an object that contains a lot of diagnostic information about how the match probability was computed. The following cells demonstrate some of its functionality.
model = linker.model
model.probability_distribution_chart()
An alternative representation of the parameters displays them in terms of the effect different values in the comparison vectors have on the match probability:
model.bayes_factor_chart()
# If charts aren't displaying correctly in your notebook, you can write them to a file (by default splink_charts.html)
model.all_charts_write_html_file("splink_charts.html", overwrite=True)
You can also generate a report which explains how the match probability was computed for an individual comparison row.
Note that you need to convert the row to a dictionary for this to work.
from splink.intuition import intuition_report
row_dict = df_e.toPandas().sample(1).to_dict(orient="records")[0]
print(intuition_report(row_dict, model))
Initial probability of match (prior) = λ = 0.00563
------
Comparison of name_inversion. Values are:
name_inversion_l: Ja k, Kirk
name_inversion_r: Leo , Jones
Comparison has: 5 levels
Level for this comparison: gamma_name_inversion = 0
m probability = P(level|match): 0.3353
u probability = P(level|non-match): 0.9937
Bayes factor = m/u: 0.3375
New probability of match (updated belief): 0.001907
------
Comparison of city. Values are:
city_l: London
city_r: Manchester
Comparison has: 3 levels
Level for this comparison: gamma_city = 0
m probability = P(level|match): 0.08482
u probability = P(level|non-match): 0.8792
Bayes factor = m/u: 0.09647
New probability of match (updated belief): 0.0001843
------
Comparison of email. Values are:
email_l: fphillips@young-trner.info
email_r: None
Comparison has: 3 levels
Level for this comparison: gamma_email = -1
m probability = P(level|match): 1
u probability = P(level|non-match): 1
Bayes factor = m/u: 1
New probability of match (updated belief): 0.0001843
------
Comparison of dob. Values are:
dob_l: 2008-02-17
dob_r: 1983-07-01
Comparison has: 2 levels
Level for this comparison: gamma_dob = 0
m probability = P(level|match): 0.4109
u probability = P(level|non-match): 1
Bayes factor = m/u: 0.4109
New probability of match (updated belief): 7.573e-05

Final probability of match = 7.573e-05

Reminder:

The m probability for a given level is the proportion of matches which are in this level. We would generally expect the highest similarity level to have the largest proportion of matches. For example, we would expect the first name field to match exactly amongst most matching records, except where nicknames, aliases or typos have occurred. For a comparison column that changes through time, like address, we may expect a lower proportion of comparisons to be in the highest similarity level.

The u probability for a given level is the proportion of non-matches which are in this level. We would generally expect the lowest similarity level to have the highest proportion of non-matches, but the magnitude depends on the cardinality of the field. For example, we would expect that in the vast majority of non-matching records, the date of birth field would not match. However, we would expect it to be common for gender to match amongst non-matches.
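The arithmetic behind the report is a chain of Bayes factor updates applied to the prior λ in odds space. It can be reproduced by hand with the m and u values printed above (these are rounded in the report, so the final digit may differ slightly):

```python
def update(prob, bayes_factor):
    """Apply one Bayes factor update to a match probability by
    converting to odds, multiplying, and converting back."""
    odds = prob / (1 - prob)
    odds *= bayes_factor
    return odds / (1 + odds)

p = 0.00563  # prior, lambda
for m, u in [(0.3353, 0.9937),   # name_inversion, level 0
             (0.08482, 0.8792),  # city, level 0
             (1.0, 1.0),         # email is null, so no information
             (0.4109, 1.0)]:     # dob, level 0
    p = update(p, m / u)

print(f"{p:.4g}")  # 7.573e-05, matching the report's final probability
```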
from splink.diagnostics import splink_score_histogram
from pyspark.sql.functions import expr
splink_score_histogram(df_e.filter(expr('match_probability > 0.001')), spark)