Abt–Buy record linkage

This example shows how to use BlockingPy for record linkage on the Abt–Buy product datasets. We:

load Abt, Buy and ground truth files,
build a simple text field,
run embedding-based blocking with HNSW,

The datasets can be found in the PyJedAI repository.

Setup

pip install blockingpy

Load data

from blockingpy import Blocker
import pandas as pd

abt = pd.read_csv("abt.csv", sep="|")
buy = pd.read_csv("buy.csv", sep='|')
gt = pd.read_csv("gt.csv", sep="|")

print(abt.head())

print(abt.shape)
print(buy.shape)
print(gt.shape)

#    id                                               name  \
# 0   0                          Sony Turntable - PSLX350H   
# 1   1  Bose Acoustimass 5 Series III Speaker System -...   
# 2   2                             Sony Switcher - SBV40S   
# 3   3                   Sony 5 Disc CD Player - CDPCE375   
# 4   4  Bose 27028 161 Bookshelf Pair Speakers In Whit...   

#                                          description  price  
# 0  Sony Turntable - PSLX350H/ Belt Drive System/ ...    NaN  
# 1  Bose Acoustimass 5 Series III Speaker System -...  399.0  
# 2  Sony Switcher - SBV40S/ Eliminates Disconnecti...   49.0  
# 3  Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change...    NaN  
# 4  Bose 161 Bookshelf Speakers In White - 161WH/ ...  158.0


# (1076, 4)
# (1076, 4)
# (1076, 2)

Creating “True Blocks”

We need to adjust the gt dataframe to match the expected format.

gt['block'] = range(len(gt))
gt = gt.rename(columns={"D1": 'x', "D2": 'y'})

Data preprocessing

We will convert all string columns to the string dtype and fill missing values. Then, we will create a new text field name_price by concatenating the name and price columns.

You can experiment with different combinations of fields to see how they affect blocking performance.

str_cols = [col for col in abt.columns if col != 'id']
abt = abt.astype({col: 'string' for col in str_cols})
buy = buy.astype({col: 'string' for col in str_cols})

abt = abt.fillna('')
buy = buy.fillna('')

abt['name_price'] = abt['name'] + abt['price']
buy['name_price'] = buy['name'] + buy['price']

Blocking with HNSW

We will use embedding-based blocking with HNSW.

blocker = Blocker()

control_txt = {
        "encoder": "embedding",
        "embedding": {
            "model": "minishlab/potion-base-32M",
            "normalize": True,
            "max_length": 512,
            "emb_batch_size": 1024,
            "show_progress_bar": True,
            "use_multiprocessing": True,
            "multiprocessing_threshold": 10000,
        },
}

res = blocker.block(
    x=abt['name_price'],
    y=buy['name_price'],
    true_blocks=gt,
    verbose=1,
    random_seed=42,
    ann='hnsw',
    control_txt=control_txt,

)

print(res)
# INFO - ===== creating tokens =====
# 100%|██████████| 2/2 [00:00<00:00, 69.95it/s]
# 100%|██████████| 2/2 [00:00<00:00, 64.36it/s]
# INFO - ===== starting search (hnsw, x, y: 1076,1076, t: 512) =====
# INFO - ===== creating graph =====
# INFO - ===== evaluating =====
# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 902
# Number of columns created for blocking: 512
# Reduction ratio: 0.999071
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 787            
#          3 | 85             
#          4 | 16             
#          5 | 8              
#          6 | 3              
#          8 | 1              
#          9 | 1              
#         10 | 1              
# ========================================================
# Evaluation metrics (standard):
# recall : 82.342
# precision : 82.342
# fpr : 0.0164
# fnr : 17.658
# accuracy : 99.9672
# specificity : 99.9836
# f1_score : 82.342
print(res.confusion)
#                  Predicted Positive  Predicted Negative
# Actual Positive                 886                 190
# Actual Negative                 190             1156510