Deduplication with Embeddings

This tutorial demonstrates how to use the BlockingPy library for deduplication using embeddings instead of n-gram shingles. It is based on the Deduplication No. 2 tutorial, but adapted to showcase the use of embeddings.

Once again, we will use the RLdata10000 dataset from the RecordLinkage R package developed by Murat Sariyar and Andreas Borg. It contains 10 000 records in total, some of which are duplicates with randomly generated errors: 9000 original records and 1000 duplicates.

Data Preparation

Let’s install blockingpy:

pip install blockingpy

Import necessary packages and functions:

import pandas as pd
from blockingpy import Blocker
from blockingpy.datasets import load_deduplication_data

Let’s load the data and take a look at the first 5 rows:

data = load_deduplication_data()
data.head()

#   fname_c1 fname_c2    lname_c1 lname_c2    by  bm  bd  id  true_id
# 0    FRANK      NaN     MUELLER      NaN  1967   9  27   1     3606
# 1   MARTIN      NaN     SCHWARZ      NaN  1967   2  17   2     2560
# 2  HERBERT      NaN  ZIMMERMANN      NaN  1961  11   6   3     3892
# 3     HANS      NaN     SCHMITT      NaN  1945   8  14   4      329
# 4      UWE      NaN      KELLER      NaN  2000   7   5   5     1994

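Now we need to prepare the txt column by filling missing values, casting the date parts to strings, and joining all fields with a space:

data = data.fillna('')
data[['by', 'bm', 'bd']] = data[['by', 'bm', 'bd']].astype('str')
# join the fields with a space so the result matches the output shown below
data['txt'] = (
    data["fname_c1"] + " " +
    data["fname_c2"] + " " +
    data['lname_c1'] + " " +
    data['lname_c2'] + " " +
    data['by'] + " " +
    data['bm'] + " " +
    data['bd']
)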
data['txt'].head()

# 0         FRANK  MUELLER  1967 9 27
# 1        MARTIN  SCHWARZ  1967 2 17
# 2    HERBERT  ZIMMERMANN  1961 11 6
# 3          HANS  SCHMITT  1945 8 14
# 4             UWE  KELLER  2000 7 5
# Name: txt, dtype: object
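
The separator between fields matters. As a small plain-Python sketch (the helper `join_fields` is hypothetical and not part of BlockingPy), joining the date parts without a separator can make two different birth dates collapse into the same string, while a space keeps them distinct:

```python
def join_fields(fields, sep=" "):
    """Join record fields into one string (hypothetical helper for illustration)."""
    return sep.join(str(f) for f in fields)

# Two different birth dates written without zero padding:
# year 1961, month 11, day 6 versus year 1961, month 1, day 16
date_a = [1961, 11, 6]
date_b = [1961, 1, 16]

no_sep_a = join_fields(date_a, sep="")
no_sep_b = join_fields(date_b, sep="")
with_sep_a = join_fields(date_a)
with_sep_b = join_fields(date_b)

print(no_sep_a == no_sep_b)      # True: both become "1961116"
print(with_sep_a == with_sep_b)  # False: "1961 11 6" vs "1961 1 16"
```

Keeping the fields separated avoids this kind of accidental collision before the text is encoded.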

Basic Deduplication

We’ll now perform basic deduplication with the hnsw algorithm. Instead of character-level n-grams, the text is encoded into dense embeddings before the approximate nearest neighbor search.

blocker = Blocker()

control_txt = {
    "encoder": "embedding",
    "embedding": {
        "model": "minishlab/potion-base-32M",
        # for other customization options see 
        # configuration in User Guide
    }
}

dedup_result = blocker.block(
    x=data['txt'],
    ann='hnsw',
    verbose=1,
    random_seed=42,
    control_txt=control_txt,
)
# ===== creating tokens: embedding =====
# ===== starting search (hnsw, x, y: 10000,10000, t: 512) =====
# ===== creating graph =====

We can now take a look at the results:

print(dedup_result)

# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 2656
# Number of columns created for blocking: 512
# Reduction ratio: 0.999600
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 906            
#          3 | 631            
#          4 | 424            
#          5 | 273            
#          6 | 155            
#          7 | 121            
#          8 | 48             
#          9 | 34             
#         10 | 23             
#         11 | 14             
#         12 | 13             
#         13 | 4              
#         14 | 3              
#         15 | 2              
#         16 | 1              
#         18 | 1              
#         20 | 1              
#         23 | 1              
#         26 | 1     
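
The reported reduction ratio can be cross-checked from the block size distribution. The sketch below assumes the ratio is one minus the share of all possible record pairs that remain inside blocks. This is the usual definition, but treat it as an assumption about how the summary is computed rather than a description of BlockingPy internals.

```python
from math import comb

# Block size -> number of blocks, copied from the summary above
block_size_counts = {
    2: 906, 3: 631, 4: 424, 5: 273, 6: 155, 7: 121, 8: 48, 9: 34,
    10: 23, 11: 14, 12: 13, 13: 4, 14: 3, 15: 2, 16: 1, 18: 1,
    20: 1, 23: 1, 26: 1,
}

n_blocks = sum(block_size_counts.values())
n_records = sum(size * count for size, count in block_size_counts.items())

# Pairs that still need comparing: every pair of records sharing a block
pairs_within_blocks = sum(
    comb(size, 2) * count for size, count in block_size_counts.items()
)
# Pairs that a naive all-against-all comparison would need
all_pairs = comb(n_records, 2)

reduction_ratio = 1 - pairs_within_blocks / all_pairs
print(n_blocks, n_records, pairs_within_blocks)
print(f"{reduction_ratio:.6f}")
```

Under this assumption the distribution accounts for all 2656 blocks and 10 000 records, and the computed value agrees with the reported reduction ratio of 0.999600.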

and at the resulting pairs of records:

print(dedup_result.result)
#          x     y  block      dist
# 0     2337     0      0  0.227015
# 1     4504     1      1  0.373196
# 2      233     2      2  0.294851
# 3     1956     3      3  0.261316
# 4     4040     4      4  0.216883
# ...    ...   ...    ...       ...
# 7339  6692  9984   2328  0.338963
# 7340  5725  9986   1532  0.243514
# 7341  8521  9993   1915  0.324314
# 7342  7312  9997    774  0.235769
# 7343  5897  9999   1558  0.217153
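
The block column groups records that are linked through neighbor pairs. One common way to derive such groups is to treat each (x, y) pair as a graph edge and take connected components. The sketch below does this with a small union-find structure on hypothetical pairs; it illustrates the idea and is not BlockingPy's actual implementation.

```python
def blocks_from_pairs(pairs):
    """Assign a block id to every record in the pairs (connected components)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for x, y in pairs:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x  # merge the two components

    # Relabel component roots as consecutive block ids in order of first appearance
    block_ids = {}
    return {node: block_ids.setdefault(find(node), len(block_ids)) for node in parent}

# Hypothetical neighbor pairs: 10-11-12 chain together, 20-21 form a second block
toy_pairs = [(10, 11), (11, 12), (20, 21)]
toy_blocks = blocks_from_pairs(toy_pairs)
print(toy_blocks)
```

Chained pairs end up in one block even when two of their records were never directly paired, which is one way blocks larger than two can arise.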

Let’s look at the pair in block no. 3:

print(data.iloc[[1956, 3], :])
#      fname_c1 fname_c2 lname_c1  ...    id true_id                      txt
# 1956    HRANS           SCHMITT  ...  1957     329  HRANS SCHMITT 1945 8 14
# 3        HANS           SCHMITT  ...     4     329   HANS SCHMITT 1945 8 14

True Blocks Preparation

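To evaluate the blocking, we need the ground truth as one row per record, holding the record index x and its true block identifier. Here the true_id column serves as the block identifier:

df_eval = data.copy()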
df_eval['block'] = df_eval['true_id']
df_eval['x'] = range(len(df_eval))
print(df_eval.head())
#   fname_c1 fname_c2    lname_c1  ...                       txt block  x
# 0    FRANK              MUELLER  ...       FRANK MUELLER 1967 9 27  3606  0
# 1   MARTIN              SCHWARZ  ...      MARTIN SCHWARZ 1967 2 17  2560  1
# 2  HERBERT           ZIMMERMANN  ...  HERBERT ZIMMERMANN 1961 11 6  3892  2
# 3     HANS              SCHMITT  ...        HANS SCHMITT 1945 8 14   329  3
# 4      UWE               KELLER  ...           UWE KELLER 2000 7 5  1994  4

Let’s create the final true_blocks_dedup:

true_blocks_dedup = df_eval[['x', 'block']]
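
Records that share a true_id are true duplicates of one another, so a group of size k contributes k(k-1)/2 true pairs. A minimal plain-Python sketch, using the five rows printed above plus row 1956 (which shares true_id 329 with row 3, as seen earlier):

```python
from collections import Counter
from math import comb

# Row position -> true_id, taken from the rows printed in this tutorial
true_ids = {0: 3606, 1: 2560, 2: 3892, 3: 329, 4: 1994, 1956: 329}

# Size of each true block
group_sizes = Counter(true_ids.values())
# Number of true duplicate pairs across all true blocks
true_pairs = sum(comb(size, 2) for size in group_sizes.values())

# Rows that belong to a true block with more than one member
duplicated_rows = sorted(x for x, tid in true_ids.items() if group_sizes[tid] > 1)
print(true_pairs, duplicated_rows)
```

Counted over the whole dataset, this quantity corresponds to the actual positives in the confusion matrix of the Evaluation section (957 + 43 = 1000 pairs).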

Evaluation

Finally, we can evaluate the blocking performance when using embeddings. Note that this run uses the voyager algorithm, whereas the basic example above used hnsw:

blocker = Blocker()
eval_result = blocker.block(
    x=df_eval['txt'], 
    ann='voyager',
    true_blocks=true_blocks_dedup, 
    verbose=1, 
    random_seed=42,
    control_txt=control_txt, # Using the same config
)
# ===== creating tokens: embedding =====
# ===== starting search (voyager, x, y: 10000,10000, t: 512) =====
# ===== creating graph =====
# ===== evaluating =====

We can now inspect the evaluation metrics and the confusion matrix:

print(eval_result.metrics)
# recall         0.957000
# precision      0.047266
# fpr            0.000386
# fnr            0.043000
# accuracy       0.999613
# specificity    0.999614
# f1_score       0.090083
# dtype: float64
print(eval_result.confusion)
#                  Predicted Positive  Predicted Negative
# Actual Positive                 957                  43
# Actual Negative               19290            49974710
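
The metrics follow directly from the confusion matrix, whose entries count pairs of records. As a sketch using the standard definitions (assumed here to be the ones behind the reported values), they can be recomputed as follows:

```python
# Confusion matrix entries copied from the output above (counts of record pairs)
tp, fn = 957, 43
fp, tn = 19290, 49974710

metrics = {
    "recall": tp / (tp + fn),
    "precision": tp / (tp + fp),
    "fpr": fp / (fp + tn),
    "fnr": fn / (fn + tp),
    "accuracy": (tp + tn) / (tp + fn + fp + tn),
    "specificity": tn / (tn + fp),
}
# F1 is the harmonic mean of precision and recall
metrics["f1_score"] = (
    2 * metrics["precision"] * metrics["recall"]
    / (metrics["precision"] + metrics["recall"])
)

for name, value in metrics.items():
    print(f"{name:<12} {value:.6f}")
```

The low precision alongside the high recall reflects that most pairs placed in the same block are not true duplicates, while almost all true duplicate pairs do share a block.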

Summary

Comparing both methods, embeddings performed slightly worse than the traditional shingle-based approach in this example (95.7% recall vs. 100% with shingles). Embeddings nevertheless remain a viable and effective option for deduplication, and on some datasets or under some conditions they may even outperform shingle-based methods.