Introductory tutorial

This is the introduction to a six-part tutorial that demonstrates how to de-duplicate a small dataset using simple settings.

The aim of the tutorial is to demonstrate core Splink functionality succinctly, rather than to document all configuration options comprehensively.

The six parts are:

1. Exploratory analysis
2. Choosing blocking rules to optimise runtimes
3. Estimating model parameters
4. Predicting results
5. Visualising predictions
6. Quality assurance
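
To give a feel for why blocking rules matter for runtime, here is a minimal plain-Python sketch. It does not use Splink's API, and the records and the surname-based rule are hypothetical, chosen only for illustration. It counts how many pairwise comparisons are needed with and without a simple blocking rule.

```python
from itertools import combinations

# Hypothetical toy records; these are NOT the tutorial's dataset.
records = [
    {"id": 1, "first_name": "john", "surname": "smith"},
    {"id": 2, "first_name": "jon", "surname": "smith"},
    {"id": 3, "first_name": "mary", "surname": "jones"},
    {"id": 4, "first_name": "marie", "surname": "jones"},
    {"id": 5, "first_name": "john", "surname": "brown"},
]

# Without blocking, every record is compared with every other record,
# so the number of comparisons grows quadratically with the dataset size.
all_pairs = list(combinations(records, 2))

# With an illustrative blocking rule ("surnames must match"), only pairs
# that share a surname are kept as candidate comparisons.
blocked_pairs = [(a, b) for a, b in all_pairs if a["surname"] == b["surname"]]

print(f"Comparisons without blocking: {len(all_pairs)}")
print(f"Comparisons with blocking:    {len(blocked_pairs)}")
```

The trade-off is that a rule which is too strict can exclude true matches (for example, records whose surnames are misspelled), which is why choosing blocking rules gets its own part of the tutorial.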

Throughout the tutorial, we use the DuckDB backend, which is the recommended option for smaller datasets of up to around 1 million records on a normal laptop.

You can find these tutorial notebooks in the splink_demos repo, and you can run them live in your web browser by clicking the following link: