Deduplication with Embeddings
This tutorial demonstrates how to use the BlockingPy library for deduplication using embeddings instead of n-gram shingles. It is based on the Deduplication No. 2 tutorial, but adapted to showcase the use of embeddings.
Once again, we will use the RLdata10000 dataset taken from RecordLinkage R package developed by Murat Sariyar
and Andreas Borg. It contains 10 000 records in total where some have been duplicated with randomly generated errors. There are 9000 original records and 1000 duplicates.
Data Preparation
Let’s install blockingpy:
pip install blockingpy
Import necessary packages and functions:
import pandas as pd
from blockingpy import Blocker
from blockingpy.datasets import load_deduplication_data
Let’s load the data and take a look at first 5 rows:
data = load_deduplication_data()
data.head()
# fname_c1 fname_c2 lname_c1 lname_c2 by bm bd id true_id
# 0 FRANK NaN MUELLER NaN 1967 9 27 1 3606
# 1 MARTIN NaN SCHWARZ NaN 1967 2 17 2 2560
# 2 HERBERT NaN ZIMMERMANN NaN 1961 11 6 3 3892
# 3 HANS NaN SCHMITT NaN 1945 8 14 4 329
# 4 UWE NaN KELLER NaN 2000 7 5 5 1994
Now we need to prepare the txt column:
data = data.fillna('')
data[['by', 'bm', 'bd']] = data[['by', 'bm', 'bd']].astype('str')
data['txt'] = (
data["fname_c1"] +
data["fname_c2"] +
data['lname_c1'] +
data['lname_c2'] +
data['by'] +
data['bm'] +
data['bd']
)
data['txt'].head()
# 0 FRANK MUELLER 1967 9 27
# 1 MARTIN SCHWARZ 1967 2 17
# 2 HERBERT ZIMMERMANN 1961 11 6
# 3 HANS SCHMITT 1945 8 14
# 4 UWE KELLER 2000 7 5
# Name: txt, dtype: object
Basic Deduplication
We’ll now perform basic deduplication with hnsw algorithm, but instead of character-level n-grams, the text will be encoded into dense embeddings before approximate nearest neighbor search.
blocker = Blocker()
control_txt = {
"encoder": "embedding",
"embedding": {
"model": "minishlab/potion-base-32M",
# for other customization options see
# configuration in User Guide
}
}
dedup_result = blocker.block(
x=data['txt'],
ann='hnsw',
verbose=1,
random_seed=42,
control_txt=control_txt,
)
# ===== creating tokens: embedding =====
# ===== starting search (hnsw, x, y: 10000,10000, t: 512) =====
# ===== creating graph =====
We can now take a look at the results:
print(dedup_result)
# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 2656
# Number of columns created for blocking: 512
# Reduction ratio: 0.999600
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 2 | 906
# 3 | 631
# 4 | 424
# 5 | 273
# 6 | 155
# 7 | 121
# 8 | 48
# 9 | 34
# 10 | 23
# 11 | 14
# 12 | 13
# 13 | 4
# 14 | 3
# 15 | 2
# 16 | 1
# 18 | 1
# 20 | 1
# 23 | 1
# 26 | 1
and:
print(dedup_result.result)
# x y block dist
# 0 2337 0 0 0.227015
# 1 4504 1 1 0.373196
# 2 233 2 2 0.294851
# 3 1956 3 3 0.261316
# 4 4040 4 4 0.216883
# ... ... ... ... ...
# 7339 6692 9984 2328 0.338963
# 7340 5725 9986 1532 0.243514
# 7341 8521 9993 1915 0.324314
# 7342 7312 9997 774 0.235769
# 7343 5897 9999 1558 0.217153
Let’s see the pair in the block no. 3
print(data.iloc[[1956, 3], : ])
# fname_c1 fname_c2 lname_c1 ... id true_id txt
# 1956 HRANS SCHMITT ... 1957 329 HRANS SCHMITT 1945 8 14
# 3 HANS SCHMITT ... 4 329 HANS SCHMITT 1945 8 14
True Blocks Preparation
df_eval = data.copy()
df_eval['block'] = df_eval['true_id']
df_eval['x'] = range(len(df_eval))
print(df_eval.head())
# fname_c1 fname_c2 lname_c1 ... txt block x
# 0 FRANK MUELLER ... FRANK MUELLER 1967 9 27 3606 0
# 1 MARTIN SCHWARZ ... MARTIN SCHWARZ 1967 2 17 2560 1
# 2 HERBERT ZIMMERMANN ... HERBERT ZIMMERMANN 1961 1 16 3892 2
# 3 HANS SCHMITT ... HANS SCHMITT 1945 8 14 329 3
# 4 UWE KELLER ... UWE KELLER 2000 7 5 1994 4
Let’s create the final true_blocks_dedup:
true_blocks_dedup = df_eval[['x', 'block']]
Evaluation
Finally, we can evaluate the blocking performance when using embeddings:
blocker = Blocker()
eval_result = blocker.block(
x=df_eval['txt'],
ann='voyager',
true_blocks=true_blocks_dedup,
verbose=1,
random_seed=42,
control_txt=control_txt, # Using the same config
)
# ===== creating tokens: embedding =====
# ===== starting search (voyager, x, y: 10000,10000, t: 512) =====
# ===== creating graph =====
# ===== evaluating =====
You can also inspect:
print(eval_result.metrics)
# recall 0.957000
# precision 0.047266
# fpr 0.000386
# fnr 0.043000
# accuracy 0.999613
# specificity 0.999614
# f1_score 0.090083
# dtype: float64
print(eval_result.confusion)
# Predicted Positive Predicted Negative
# Actual Positive 957 43
# Actual Negative 19290 49974710
Summary
Comparing both methods, we can see that using embeddings performed slightly worse than the traditional shingle-based approach in this example (95.7% recall vs. 100% with shingles).
However, embeddings still provide a viable and effective solution for deduplication.
In certain datasets or conditions embeddings may even outperform shingle-based methods.