Integration with the Splink package

In this example, we demonstrate how to integrate BlockingPy with the Splink package for probabilistic record linkage. Splink provides a powerful framework for entity resolution, and BlockingPy can complement it by supplying an additional blocking approach. We deduplicate the fake_1000 dataset included with Splink, using BlockingPy to improve the blocking phase and Splink for the matching phase. We follow the example available in the Splink documentation and modify only the blocking procedure. The original can be found here.

Setup

First, we need to install BlockingPy and Splink:

pip install blockingpy splink

Import necessary components:

from splink import splink_datasets, SettingsCreator, Linker, block_on, DuckDBAPI
import splink.comparison_library as cl
from blockingpy import Blocker
import pandas as pd
import numpy as np
np.random.seed(42)

Data preparation

The fake_1000 dataset contains 1000 records with personal information like names, dates of birth, and email addresses. The dataset consists of 251 unique entities (clusters), with each entity having one original record and various duplicates.

df = splink_datasets.fake_1000
print(df.head(5))
#    unique_id first_name surname         dob    city                    email    cluster  
# 0          0     Robert    Alan  1971-06-24     NaN      robert255@smith.net          0
# 1          1     Robert   Allen  1971-05-24     NaN      roberta25@smith.net          0
# 2          2        Rob   Allen  1971-06-24  London      roberta25@smith.net          0
# 3          3     Robert    Alen  1971-06-24   Lonon                      NaN          0
# 4          4      Grace     NaN  1997-04-26    Hull  grace.kelly52@jones.com          1

For BlockingPy, we’ll create a text field combining multiple columns to allow blocking on overall record similarity:

df['txt'] = df['first_name'].fillna('') + ' ' + \
            df['surname'].fillna('') + \
            df['dob'].fillna('') + ' ' + \
            df['city'].fillna('') + ' ' + \
            df['email'].fillna('')   
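To see why a single combined text field supports blocking on overall similarity, the sketch below compares records by the character bigrams they share. It assumes a character-shingle representation with n = 2 and uses Jaccard similarity purely for illustration; the helper names `shingles` and `jaccard` are hypothetical and are not part of BlockingPy.

```python
def shingles(text: str, n: int = 2) -> set[str]:
    """Return the set of lowercase character n-grams in text, ignoring spaces."""
    cleaned = text.lower().replace(" ", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# 'txt' values rebuilt by hand from rows 0, 1 and 4 of the df.head() output.
rec_0 = "Robert Alan1971-06-24  robert255@smith.net"
rec_1 = "Robert Allen1971-05-24  roberta25@smith.net"
rec_4 = "Grace 1997-04-26 Hull grace.kelly52@jones.com"

same_cluster = jaccard(shingles(rec_0), shingles(rec_1))
different_cluster = jaccard(shingles(rec_0), shingles(rec_4))
print(f"same cluster:      {same_cluster:.2f}")
print(f"different cluster: {different_cluster:.2f}")
```

Records 0 and 1 differ by typos in the surname, date and email, yet they still share most of their bigrams, so they score well above the unrelated pair.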

Blocking

Now we can obtain blocks from BlockingPy:

blocker = Blocker()

res = blocker.block(
        x = df['txt'],
        ann='hnsw',
        random_seed=42,
)

print(res)
# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 252
# Number of columns created for blocking: 906
# Reduction ratio: 0.996306
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 62             
#          3 | 53             
#          4 | 51             
#          5 | 36             
#          6 | 26             
#          7 | 16             
#          8 | 7              
#          9 | 1

print(res.result.head())
#      x  y  block      dist
# 0    1  0      0  0.142391
# 1    1  2      0  0.208361
# 2    2  3      0  0.230678
# 3    5  4      1  0.145114
# 4  814  6      2  0.584251
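The reported reduction ratio can be reproduced from the block-size distribution alone. The sketch below assumes the usual definition, one minus the share of all possible record pairs that remain as candidate pairs inside blocks; the numbers are copied from the printed summary.

```python
from math import comb

# Block-size distribution from the printed summary: {block size: number of blocks}
block_sizes = {2: 62, 3: 53, 4: 51, 5: 36, 6: 26, 7: 16, 8: 7, 9: 1}

n_blocks = sum(block_sizes.values())
n_records = sum(size * count for size, count in block_sizes.items())

# Pairs that still have to be compared: every pair of records within a block.
candidate_pairs = sum(comb(size, 2) * count for size, count in block_sizes.items())
# Pairs that would be compared without any blocking.
all_pairs = comb(n_records, 2)

reduction_ratio = 1 - candidate_pairs / all_pairs

print(n_blocks, n_records, candidate_pairs)  # 252 1000 1845
print(round(reduction_ratio, 6))             # 0.996306
```

Only 1,845 of the 499,500 possible pairs remain, which matches the reduction ratio of 0.996306 shown above.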

Results integration

To integrate our results, we add a block column to the original dataframe using the add_block_column method.

df = res.add_block_column(df)
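Conceptually, the block column assigns each record the block id of the pairs it appears in. The sketch below illustrates that idea in plain Python using the first rows of res.result; it assumes x and y are row positions in df, and it is not BlockingPy's actual implementation.

```python
# First rows of res.result as printed above: (x, y, block).
pairs = [(1, 0, 0), (1, 2, 0), (2, 3, 0), (5, 4, 1), (814, 6, 2)]

# Every record that appears in a pair inherits that pair's block id.
block_of = {}
for x, y, block in pairs:
    block_of[x] = block
    block_of[y] = block

for record_id in sorted(block_of):
    print(record_id, block_of[record_id])
```

Records 0 to 3 end up in block 0, which agrees with the cluster column in the df.head() output.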

Splink settings

Now we can configure and run Splink using our BlockingPy results. The following steps are adapted from the Splink documentation example:

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("block"), # BlockingPy integration
        # block_on("first_name"),
        # block_on("surname"),
        # block_on("dob"),
        # block_on("email"),
    ],
    comparisons=[
        cl.ForenameSurnameComparison("first_name", "surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    retain_intermediate_calculation_columns=True,
)

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api=db_api)

Training the Splink model

Let’s train the Splink model to learn the parameters for record comparison:

deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)

session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob"), estimate_without_term_frequencies=True
)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("email"), estimate_without_term_frequencies=True
)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname"), estimate_without_term_frequencies=True
)
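The recall argument in the first training call adjusts for matches that the deterministic rules miss. The sketch below shows the arithmetic as we understand it: the number of pairs found by the rules is divided by the assumed recall and then by the total number of possible pairs. The pair count is a hypothetical value chosen for illustration, and this is not Splink's code.

```python
from math import comb

n_records = 1000
total_pairs = comb(n_records, 2)   # 499500 possible pairwise comparisons

pairs_found_by_rules = 1_050       # HYPOTHETICAL count, for illustration only
recall = 0.7                       # same value as passed in the call above

# If the rules find only 70% of true matches, scale the count up accordingly.
estimated_true_matches = pairs_found_by_rules / recall
prob_two_random_records_match = estimated_true_matches / total_pairs

print(f"estimated matching pairs: {estimated_true_matches:.0f}")
print(f"probability two random records match: {prob_two_random_records_match:.5f}")
```

A lower recall value tells Splink that the rules miss more true matches, which raises the estimated prior.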

The steps so far show how to integrate BlockingPy with Splink. In the following section, we compare several blocking approaches using this dataset.

Comparing Different Blocking Strategies

We can compare three ways to handle blocking:

Using only Splink (from the original example)
Using only BlockingPy
Combining both approaches

To test these approaches, we modify only the blocking_rules_to_generate_predictions list in SettingsCreator and keep everything else the same. This lets us see how each blocking strategy affects match quality.

# 1. BlockingPy only
blocking_rules_to_generate_predictions=[
        block_on("block"),
],
# 2. Splink only
blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("email"),
],
# 3. Splink + BlockingPy
blocking_rules_to_generate_predictions=[
        block_on("block"),
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("email"),
],    
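When several blocking rules are listed, a pair becomes a candidate if it satisfies any one of them. The sketch below mimics this on a handful of toy records to show why the combined configuration can only add candidate pairs. All records and block ids here are hypothetical, and `pairs_for` is an illustrative helper, not a Splink function.

```python
from itertools import combinations

# HYPOTHETICAL toy records; 'block' stands in for the BlockingPy block id.
records = {
    0: {"first_name": "Robert", "surname": "Alan", "dob": "1971-06-24", "email": "robert255@smith.net", "block": 0},
    1: {"first_name": "Robert", "surname": "Allen", "dob": "1971-05-24", "email": "roberta25@smith.net", "block": 0},
    2: {"first_name": "Grace", "surname": "Kelly", "dob": "1997-04-26", "email": "grace.kelly52@jones.com", "block": 1},
    3: {"first_name": "Grce", "surname": "Kely", "dob": "1997-04-25", "email": None, "block": 1},
    4: {"first_name": "Rob", "surname": "Allen", "dob": "1971-06-24", "email": None, "block": 2},
}


def pairs_for(column: str) -> set[tuple[int, int]]:
    """Pairs whose values in `column` are equal; missing values never match."""
    return {
        (a, b)
        for a, b in combinations(sorted(records), 2)
        if records[a][column] is not None and records[a][column] == records[b][column]
    }


blockingpy_only = pairs_for("block")
splink_only = set().union(*(pairs_for(c) for c in ["first_name", "surname", "dob", "email"]))
combined = blockingpy_only | splink_only

print(sorted(blockingpy_only))  # [(0, 1), (2, 3)]
print(sorted(splink_only))      # [(0, 1), (0, 4), (1, 4)]
print(sorted(combined))         # [(0, 1), (0, 4), (1, 4), (2, 3)]
```

In this toy case the exact-match rules miss the pair (2, 3), where every field contains a typo or is missing, while the block rule misses pairs involving record 4. The union covers both.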

After training each model, we can evaluate the results using the accuracy_analysis_from_labels_column method from Splink, which visualizes the results. Below we present the results of the three models:

BlockingPy only

[Figure: accuracy analysis, BlockingPy only]

Splink only

[Figure: accuracy analysis, Splink only]

Splink + BlockingPy

[Figure: accuracy analysis, Splink + BlockingPy]
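The link between blocking and these metrics can be sketched with pairwise precision and recall. The example below assumes that a true match never generated by the blocking rules counts as a false negative. The pair sets are hypothetical, and `pairwise_metrics` is an illustrative helper rather than part of Splink.

```python
def pairwise_metrics(true_pairs: set, predicted_pairs: set) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted match pairs against labels."""
    tp = len(true_pairs & predicted_pairs)
    precision = tp / len(predicted_pairs) if predicted_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# HYPOTHETICAL ground truth: four true matching pairs.
true_pairs = {(0, 1), (0, 4), (1, 4), (2, 3)}

# Predictions when blocking never generated the pair (2, 3).
predicted_pairs = {(0, 1), (0, 4), (1, 4)}

precision, recall, f1 = pairwise_metrics(true_pairs, predicted_pairs)
print(precision, recall)  # 1.0 0.75
```

Every predicted pair is correct, but recall drops to 0.75 because one true match was never compared. The model cannot recover a pair that blocking excluded.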

Conclusion

In this example, we demonstrated how to integrate BlockingPy with Splink for probabilistic record linkage. The comparison between Splink's exact-match blocking rules, BlockingPy, and the combination of both shows that using both approaches together improved the performance metrics, because the combined rules capture comparison pairs that would otherwise be missed. The integration pairs efficient blocking with accurate matching, which makes it a useful combination for entity resolution tasks.