Splink assumes that you have cleaned up your data and assigned unique ids to rows prior to linking. The following describes the data cleaning that should be performed prior to loading data into Splink in more detail.

Unique IDs¶

Each input dataset must have a unique id column that's unique within the dataset. By default, Splink assumes this column will be called unique_id, but this can be changed with the unique_id_column_name key in your SPlink settings. This unique id column is used to match entries across datasets, so it is essential that each entry in this column is unique within its respective dataset.

Conformant input datasets¶

Input datasets must be conformant. This means that they should have the same column names, which correspond to data in the same format. For example, if one dataset has a column called "date of birth" and another has a column called "dob", these columns should be renamed to match in order to ensure that Splink can match entries across the datasets and the data type/number formatting should be the same in both columns. It is not necessary for the columns in the input dataframes to be in the same order.

Cleaning¶

Data should be cleaned to ensure consistency. This includes things like ensuring that dates are in the same format, that text is in the same case, and that any invalid data has been dealt with. For example, if one dataset has dates in the format "yyyy-mm-dd" and another has dates in the format "mm/dd/yyyy", these should be converted to the same format before using Splink. If a name column has been incorrected parsed and sometimes contains titles (e.g. Lord, Mrs, etc), this should be cleaned up prior to linkage.

Ensure nulls are consistently and correctly represented¶

Any null values (or other values representing 'not known') in the datasets must be true nulls, not zero length strings. This is because Splink treats null values differently than zero length strings, and using true nulls will ensure that Splink functions properly when matching entries across datasets. For example, if a cell in a dataset contains an empty string, it should be replaced with a true null value in order for Splink to handle it correctly.

Further details on data cleaning and standardisation¶

Splink works best when data has been cleaned and standardised prior to linking. Here are some examples of data cleaning rules that can improve the accuracy of data matching:

Trimming leading and trailing whitespace from string values. For example, if a dataset contains the value " john smith " in a name column, it should be trimmed to "john smith" to avoid mismatches with the value "john smith" in another dataset.
Removing special characters from string values. For example, if a dataset contains the value "O'Hara" in a name column, it could be cleaned to "Ohara" to ensure consistency.
Standardizing date formats. Generally we recommend formatting all dates as strings in the format "yyyy-mm-dd".
Replacing abbreviations with full words. For example, if a dataset contains the values "St." and "Street" in an address column, they should be standardized to the full word (e.g. "Street") to avoid mismatches.

Data Prerequisites¶

Unique IDs¶

Conformant input datasets¶

Cleaning¶

Ensure nulls are consistently and correctly represented¶

Further details on data cleaning and standardisation¶