Integration with recordlinkage package

In this example we aim to show how users can utilize blocking results achieved with BlockingPy and use them with the recordlinkage package. The recordlinkage allows for both blocking and one-to-one record linkage and deduplication. However, it is possible to transfer blocking results from BlockingPy and incorporate them in the full entity resolution pipeline.

This example will show deduplication of febrl1 dataset which comes buillt-in with recordlinkage.

We aim to follow the Data deduplication example available on the recordlinkage documentation website and substitute the blocking procedure with our own.

Setup

Firstly, we need to install BlockingPy and recordlinkage:

pip install blockingpy recordlinkage

Import necessary components:

import recordlinkage
from recordlinkage.datasets import load_febrl1
from blockingpy import Blocker
import pandas as pd
import numpy as np
np.random.seed(42)

Data preparation

febrl1 dataset contains 1000 records of which 500 are original and 500 are duplicates. It containts fictitious personal information e.g. name, surname, adress.

df = load_febrl1()
print(df.head(2))

#               given_name	 surnam     street_number   address_1         address_2	suburb	    postcode	state	date_of_birth	soc_sec_id
# rec_id										
# rec-223-org	NaN	         waller	    6	            tullaroop street  willaroo	st james    4011        wa	    19081209	    6988048
# rec-122-org	lachlan	         berry	    69	            giblin street     killarney	bittern	    4814        qld	    19990219	    7364009

Prepare data in a suitable format for blockingpy. For this we need to fill missing values and concat fields to the txt column:

df = df.fillna('')
df['txt'] = df['given_name'] + df['surname'] + \
            df['street_number'] + df['address_1'] + \
            df['address_2'] + df['suburb'] + \
            df['postcode'] + df['state'] + \
            df['date_of_birth'] + df['soc_sec_id']

Blocking

Now we can obtain blocks from BlockingPy:

blocker = Blocker()
blocking_result = blocker.block(
    x=df['txt'],
    ann='hnsw',
    random_seed=42
)

print(blocking_result)
# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 500
# Number of columns created for blocking: 1023
# Reduction ratio: 0.998999
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 500  
print(blocking_result.result.head())
#      x  y  block      dist
# 0  474  0      0  0.048375
# 1  330  1      1  0.038961
# 2  351  2      2  0.086690
# 3  290  3      3  0.024617
# 4  333  4      4  0.105662

Integration

To integrate our results, we can add a block column to the original dataframe. Blockingpy provides a add_block_column method for this step. Since the index of the original dataframe is not the same as the positional index in the blocking result, we need to add an id column to the original dataframe.

df['id'] = range(len(df))
df_final = blocking_result.add_block_column(df, id_col_left='id')

print(df_final['block'].head(5))
# 	         block
# rec_id		
# rec-223-org	0
# rec-122-org	1
# rec-373-org	2
# rec-10-dup-0	3
# rec-227-org	4

Now we can use the Index object from recordlinkage with the block column to integrate BlockingPy results with recordlinkage:

indexer = recordlinkage.Index()
indexer.block('block')
pairs = indexer.index(df_final)
print(pairs)
# MultiIndex([('rec-344-dup-0',   'rec-344-org'),
#             (  'rec-251-org', 'rec-251-dup-0'),
#             ('rec-335-dup-0',   'rec-335-org'),
#             ( 'rec-23-dup-0',    'rec-23-org'),
#             (  'rec-382-org', 'rec-382-dup-0'),
#               ....

NOTE : This is the example for deduplication. Keep in mind that for record linkage this step needs to be modified.

Finally, we can use the execute one-to-one record linkage with the recordlinkage package. We will use the same comparison rules as in the original example:

dfA = load_febrl1() # load original dataset once again for clean data
compare_cl = recordlinkage.Compare()

compare_cl.exact("given_name", "given_name", label="given_name")
compare_cl.string(
    "surname", "surname", method="jarowinkler", threshold=0.85, label="surname"
)
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")

features = compare_cl.compute(pairs, dfA)

matches = features[features.sum(axis=1) > 3]
print(len(matches))
# 458 
# vs. 317 when blocking traditionally on 'given_name'

Comparison rules were adopted from the orignal example.