Integration with recordlinkage package
In this example we show how to take blocking results produced by BlockingPy and use them with the recordlinkage package. recordlinkage provides its own blocking as well as one-to-one record linkage and deduplication, but its blocking step can be replaced with results from BlockingPy, which then feed into the rest of the entity resolution pipeline.
This example deduplicates the febrl1 dataset, which comes built in with recordlinkage.
We aim to follow the Data deduplication example available on the recordlinkage documentation website and substitute the blocking procedure with our own.
Setup
First, install BlockingPy and recordlinkage:
pip install blockingpy recordlinkage
Import necessary components:
import recordlinkage
from recordlinkage.datasets import load_febrl1
from blockingpy import Blocker
import pandas as pd
import numpy as np
np.random.seed(42)
Data preparation
The febrl1 dataset contains 1000 records, of which 500 are originals and 500 are duplicates. It contains fictitious personal information, e.g. name, surname, address.
df = load_febrl1()
print(df.head(2))
# given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id
# rec_id
# rec-223-org NaN waller 6 tullaroop street willaroo st james 4011 wa 19081209 6988048
# rec-122-org lachlan berry 69 giblin street killarney bittern 4814 qld 19990219 7364009
Prepare the data in a format suitable for BlockingPy. To do this, fill the missing values and concatenate the fields into a single txt column:
df = df.fillna('')
df['txt'] = df['given_name'] + df['surname'] + \
            df['street_number'] + df['address_1'] + \
            df['address_2'] + df['suburb'] + \
            df['postcode'] + df['state'] + \
            df['date_of_birth'] + df['soc_sec_id']
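The reason for filling missing values first can be seen on a small, made-up frame: in pandas, adding a missing value to a string yields a missing result, so a single empty field would wipe out the whole txt entry. The sketch below uses hypothetical toy values and only three of the columns; it is an illustration, not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Toy records (hypothetical values); the first one has no given name.
toy = pd.DataFrame(
    {
        "given_name": [np.nan, "lachlan"],
        "surname": ["waller", "berry"],
        "postcode": ["4011", "4814"],
    }
)

# Without fillna, one missing field makes the whole concatenation missing.
raw = toy["given_name"] + toy["surname"] + toy["postcode"]
print(raw.isna().tolist())  # [True, False]

# After fillna(''), the remaining fields are still concatenated.
filled = toy.fillna("")
txt = filled["given_name"] + filled["surname"] + filled["postcode"]
print(txt.tolist())  # ['waller4011', 'lachlanberry4814']
```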
Blocking
Now we can obtain blocks from BlockingPy:
blocker = Blocker()
blocking_result = blocker.block(
    x=df['txt'],
    ann='hnsw',
    random_seed=42
)
print(blocking_result)
# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 500
# Number of columns created for blocking: 1023
# Reduction ratio: 0.998999
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 2 | 500
print(blocking_result.result.head())
# x y block dist
# 0 474 0 0 0.048375
# 1 330 1 1 0.038961
# 2 351 2 2 0.086690
# 3 290 3 3 0.024617
# 4 333 4 4 0.105662
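The reduction ratio printed in the summary can be checked by hand. The sketch below assumes the usual definition for deduplication, one minus the number of within-block pairs divided by the number of all possible pairs; BlockingPy's internal computation is not shown in this example, and `reduction_ratio` is a hypothetical helper written for illustration.

```python
from math import comb


def reduction_ratio(block_sizes, n_records):
    """Share of all possible record pairs removed by blocking (deduplication)."""
    candidate_pairs = sum(comb(size, 2) for size in block_sizes)
    all_pairs = comb(n_records, 2)
    return 1 - candidate_pairs / all_pairs


# 500 blocks of size 2 over 1000 records, as in the summary above.
rr = reduction_ratio([2] * 500, 1000)
print(round(rr, 6))  # 0.998999
```

Under this definition, only 500 of the 499500 possible pairs remain to be compared, which agrees with the reported value.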
Integration
To integrate our results, we can add a block column to the original dataframe.
BlockingPy provides an add_block_column method for this step. Since the index of the original dataframe (rec_id) is not the same as the positional index used in the blocking result, we first need to add an id column that holds each record's position.
df['id'] = range(len(df))
df_final = blocking_result.add_block_column(df, id_col_left='id')
print(df_final['block'].head(5))
# block
# rec_id
# rec-223-org 0
# rec-122-org 1
# rec-373-org 2
# rec-10-dup-0 3
# rec-227-org 4
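To make the role of the id column concrete, here is a rough pandas-only sketch of what attaching block labels by position amounts to. It is not BlockingPy's implementation; the toy result frame only mimics the x, y and block columns shown earlier, and all values and record ids are hypothetical.

```python
import pandas as pd

# Toy blocking result with the same column names as blocking_result.result.
result = pd.DataFrame({"x": [2, 3], "y": [0, 1], "block": [0, 1]})

# Toy dataset indexed by a string id rather than by position.
records = pd.DataFrame(
    {"surname": ["waller", "berry", "wallar", "bery"]},
    index=pd.Index(
        ["rec-a-org", "rec-b-org", "rec-a-dup", "rec-b-dup"], name="rec_id"
    ),
)
records["id"] = range(len(records))  # positional id, as in the step above

# Collect position -> block from both sides of every pair.
position_to_block = pd.concat(
    [result.set_index("x")["block"], result.set_index("y")["block"]]
)
position_to_block = position_to_block[~position_to_block.index.duplicated()]

# Look up each record's block through its positional id.
records["block"] = records["id"].map(position_to_block)
print(records["block"].tolist())  # [0, 1, 0, 1]
```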
Now we can use the Index object from recordlinkage with the block column to integrate BlockingPy results with recordlinkage:
indexer = recordlinkage.Index()
indexer.block('block')
pairs = indexer.index(df_final)
print(pairs)
# MultiIndex([('rec-344-dup-0', 'rec-344-org'),
# ( 'rec-251-org', 'rec-251-dup-0'),
# ('rec-335-dup-0', 'rec-335-org'),
# ( 'rec-23-dup-0', 'rec-23-org'),
# ( 'rec-382-org', 'rec-382-dup-0'),
# ....
NOTE: This example covers deduplication. For record linkage between two datasets, this step needs to be modified.
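For the record linkage case, candidate pairs have to be formed across two datasets rather than within one. The following pandas-only sketch shows the idea under the assumption that both frames already carry a shared block column; the frames, ids and block labels are hypothetical, and the code does not use the recordlinkage API.

```python
import pandas as pd

# Two toy datasets that share block labels (hypothetical values).
left = pd.DataFrame(
    {"block": [0, 1]},
    index=pd.Index(["a-1", "a-2"], name="rec_id_left"),
)
right = pd.DataFrame(
    {"block": [1, 0, 0]},
    index=pd.Index(["b-1", "b-2", "b-3"], name="rec_id_right"),
)

# Pair every left record with every right record from the same block.
merged = left.reset_index().merge(right.reset_index(), on="block")
cross_pairs = pd.MultiIndex.from_frame(merged[["rec_id_left", "rec_id_right"]])

print(sorted(cross_pairs.tolist()))
# [('a-1', 'b-2'), ('a-1', 'b-3'), ('a-2', 'b-1')]
```

The result has the same shape as the pairs object above, a two-level MultiIndex, except that the first level refers to one dataset and the second level to the other.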
Finally, we can compare the candidate pairs and identify matches with the recordlinkage package. We will use the same comparison rules as in the original example:
dfA = load_febrl1() # load original dataset once again for clean data
compare_cl = recordlinkage.Compare()
compare_cl.exact("given_name", "given_name", label="given_name")
compare_cl.string(
    "surname", "surname", method="jarowinkler", threshold=0.85, label="surname"
)
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")
features = compare_cl.compute(pairs, dfA)
matches = features[features.sum(axis=1) > 3]
print(len(matches))
# 458
# vs. 317 when blocking traditionally on 'given_name'
The comparison rules were adopted from the original example.
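The final classification step is a simple rule: each comparison contributes 1 when the two records agree on a field and 0 otherwise, and a pair counts as a match when more than 3 of the 6 fields agree. A minimal sketch with a hand-made feature table (hypothetical record ids and values, not actual output) shows the effect of that rule:

```python
import pandas as pd

# Hand-made comparison vectors for three candidate pairs (hypothetical).
toy_pairs = pd.MultiIndex.from_tuples(
    [
        ("rec-1-dup-0", "rec-1-org"),
        ("rec-2-org", "rec-2-dup-0"),
        ("rec-3-org", "rec-4-org"),
    ]
)
toy_features = pd.DataFrame(
    {
        "given_name": [1, 1, 0],
        "surname": [1.0, 1.0, 0.0],
        "date_of_birth": [1, 0, 0],
        "suburb": [1, 1, 1],
        "state": [1, 1, 1],
        "address_1": [1.0, 0.0, 0.0],
    },
    index=toy_pairs,
)

# Number of agreeing fields per pair.
scores = toy_features.sum(axis=1)
print(scores.tolist())  # [6.0, 4.0, 2.0]

# Strictly more than 3 agreeing fields, i.e. at least 4 of the 6.
toy_matches = toy_features[scores > 3]
print(len(toy_matches))  # 2
```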