# Integration with splink package In this example, we demonstrate how to integrate `BlockingPy` with the `Splink` package for probabilistic record linkage. `Splink` provides a powerful framework for entity resolution, and `BlockingPy` can enhance its performance by providing another blocking approach. This example will show how to deduplicate the `fake_1000` dataset included with `Splink` using `BlockingPy` for the blocking phase improvement and `Splink` for the matching phase. We aim to follow the example available in the `Splink` documentation and modify the blocking procedure. The original can be foud [here](https://moj-analytical-services.github.io/splink/demos/examples/duckdb/accuracy_analysis_from_labels_column.html). ## Setup First, we need to install `BlockingPy` and `Splink`: ```bash pip install blockingpy splink ``` Import necessary components: ```python from splink import splink_datasets, SettingsCreator, Linker, block_on, DuckDBAPI import splink.comparison_library as cl from blockingpy import Blocker import pandas as pd import numpy as np np.random.seed(42) ``` ## Data preparation The `fake_1000` dataset contains 1000 records with personal information like names, dates of birth, and email addresses. The dataset consists of 251 unique entities (clusters), with each entity having one original record and various duplicates. ```python df = splink_datasets.fake_1000 print(df.head(5)) # unique_id first_name surname dob city email cluster # 0 0 Robert Alan 1971-06-24 NaN robert255@smith.net 0 # 1 1 Robert Allen 1971-05-24 NaN roberta25@smith.net 0 # 2 2 Rob Allen 1971-06-24 London roberta25@smith.net 0 # 3 3 Robert Alen 1971-06-24 Lonon NaN 0 # 4 4 Grace NaN 1997-04-26 Hull grace.kelly52@jones.com 1 ``` For BlockingPy, we'll create a text field combining multiple columns to allow blocking on overall record similarity: ```python df['txt'] = df['first_name'].fillna('') + ' ' + \ df['surname'].fillna('') + \ df['dob'].fillna('') + ' ' + \ df['city'].fillna('') + ' ' + \ df['email'].fillna('') ``` ## Blocking Now we can obtain blocks from `BlockingPy`: ```python blocker = Blocker() res = blocker.block( x = df['txt'], ann='hnsw', random_seed=42, ) print(res) # ======================================================== # Blocking based on the hnsw method. # Number of blocks: 252 # Number of columns created for blocking: 906 # Reduction ratio: 0.996306 # ======================================================== # Distribution of the size of the blocks: # Block Size | Number of Blocks # 2 | 62 # 3 | 53 # 4 | 51 # 5 | 36 # 6 | 26 # 7 | 16 # 8 | 7 # 9 | 1 print(res.result.head()) # x y block dist # 0 1 0 0 0.142391 # 1 1 2 0 0.208361 # 2 2 3 0 0.230678 # 3 5 4 1 0.145114 # 4 814 6 2 0.584251 ``` ## Results integration To integrate our results, we can add a `block` column to the original dataframe, which we can with the help of `add_block_column` method. ```python df = res.add_block_column(df) ``` ## Splink settings Now we can configure and run `Splink` using our `BlockingPy` results. The following steps are adapted from the `Splink` documentation example: ```python settings = SettingsCreator( link_type="dedupe_only", blocking_rules_to_generate_predictions=[ block_on("block"), # BlockingPy integration # block_on("first_name"), # block_on("surname"), # block_on("dob"), # block_on("email"), ], comparisons=[ cl.ForenameSurnameComparison("first_name", "surname"), cl.DateOfBirthComparison( "dob", input_is_string=True, ), cl.ExactMatch("city").configure(term_frequency_adjustments=True), cl.EmailComparison("email"), ], retain_intermediate_calculation_columns=True, ) db_api = DuckDBAPI() linker = Linker(df, settings, db_api=db_api) ``` ## Training the Splink model Let's train the `Splink` model to learn the parameters for record comparison: ```python deterministic_rules = [ "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1", "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1", "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2", "l.email = r.email", ] linker.training.estimate_probability_two_random_records_match( deterministic_rules, recall=0.7 ) linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=5) session_dob = linker.training.estimate_parameters_using_expectation_maximisation( block_on("dob"), estimate_without_term_frequencies=True ) linker.training.estimate_parameters_using_expectation_maximisation( block_on("email"), estimate_without_term_frequencies=True ) linker.training.estimate_parameters_using_expectation_maximisation( block_on("first_name", "surname"), estimate_without_term_frequencies=True ) ``` The above example shows how to inegrate `BlockingPy` with `Splink`. In the following section, we will compare several blocking approaches using this dataset. ## Comparing Different Blocking Strategies We can compare three ways to handle blocking: 1. **Using only Splink** (from the original example) 2. **Using only BlockingPy** 3. **Combining both approaches** To test these approaches, we simply modify the `block_on` parameters in `SettingsCreator` while keeping everything else the same. This lets us see how each blocking strategy affects match quality. ```python # 1. BlockingPy only blocking_rules_to_generate_predictions=[ block_on("block"), ], # 2. Splink only blocking_rules_to_generate_predictions=[ block_on("first_name"), block_on("surname"), block_on("dob"), block_on("email"), ], # 3. Splink + BlockingPy blocking_rules_to_generate_predictions=[ block_on("block"), block_on("first_name"), block_on("surname"), block_on("dob"), block_on("email"), ], ``` Ater training each model, we can evaluate the results using the `accuracy_analysis_from_labels_column` method from `Splink`, which will visialize the results. Below we present the results of the three models: ### BlockingPy only ![BlockingPy only](./voyager.svg "BlockingPy only") ### Splink only ![Splink only](./splink_only_2.svg "Splink only") ### Splink + BlockingPy ![Splink + BlockingPy](./combined.svg "Splink + BlockingPy") ## Conclusion In this example, we demonstrated how to integrate `BlockingPy` with `Splink` for probabilistic record linkage. The comparsion between traditional methods, `BlockingPy` and the combination of both shows that when using both approaches we were able to significantly improve the performance metrics by capturing comparison pairs that would otherwise be missed. The integration allows for efficient blocking and accurate matching, making it a powerful combination for entity resolution tasks.