Splink assumes that you have cleaned your data and assigned a unique id to each row before linking. The following describes in more detail the preparation that should be performed before loading data into Splink.
Each input dataset must have a unique id column. By default, Splink expects this column to be named unique_id, but this can be changed with the unique_id_column_name key in your Splink settings. This unique id column identifies each entry when records are matched across datasets, so it is essential that each value in this column is unique within its respective dataset.
Splink works best when data has been cleaned and standardised prior to linking. Here are some examples of data cleaning rules that can improve the accuracy of data matching:
Trimming leading and trailing whitespace from string values. For example, if a dataset contains the value " john smith " in a name column, it should be trimmed to "john smith" to avoid mismatches with the value "john smith" in another dataset.
Removing special characters from string values. For example, if a dataset contains the value "O'Hara" in a name column, it could be cleaned to "Ohara" to ensure consistency.
Standardizing date formats. Generally we recommend formatting all dates as strings in the format "yyyy-mm-dd".
Replacing abbreviations with full words. For example, if a dataset contains the values "St." and "Street" in an address column, they should be standardized to the full word (e.g. "Street") to avoid mismatches.
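The rules above, together with the unique id requirement, can be sketched in plain Python. This is a minimal illustration rather than part of Splink: the function names, the ABBREVIATIONS map, the accepted input date formats, and the person_id column name are all hypothetical, and lower-casing names is an extra normalisation step that is assumed here rather than listed above.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical abbreviation map; extend it to suit your own address data.
# Note that "St" can also mean "Saint", so real data may need more context.
ABBREVIATIONS = {"st": "Street", "rd": "Road", "ave": "Avenue"}


def clean_name(value):
    """Trim whitespace, remove special characters, and lower-case a name."""
    value = value.strip()
    # Drop anything that is not a word character or whitespace,
    # so "O'Hara" becomes "OHara".
    value = re.sub(r"[^\w\s]", "", value)
    # Collapse runs of internal whitespace into a single space.
    value = re.sub(r"\s+", " ", value)
    # Assumption: lower-casing is not one of the rules listed above.
    return value.lower()


def standardise_date(value, input_formats=("%d/%m/%Y", "%Y-%m-%d")):
    """Return the date as a 'yyyy-mm-dd' string, or None if no format matches."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def expand_abbreviations(value):
    """Replace known abbreviations (with or without a trailing '.') by full words."""
    words = value.strip().split()
    expanded = [ABBREVIATIONS.get(w.lower().rstrip("."), w) for w in words]
    return " ".join(expanded)


def duplicate_ids(records, id_column="unique_id"):
    """Return the id values that appear more than once in a list of dict records."""
    counts = Counter(record[id_column] for record in records)
    return [value for value, n in counts.items() if n > 1]


# Settings fragment only: a real Splink settings dictionary needs further keys.
# "person_id" is a hypothetical column name.
settings_fragment = {"unique_id_column_name": "person_id"}
```

For example, clean_name(" john smith ") gives "john smith" and standardise_date("31/01/1990") gives "1990-01-31". Running duplicate_ids over each dataset before linking is a cheap way to confirm that the id column really is unique; an empty list means no duplicates were found. In practice the same transformations are often applied column by column in a dataframe library or in SQL.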