Deduplication
This example demonstrates how to use BlockingPy for deduplication of a dataset containing duplicate records. We’ll use example data generated with geco3 package which allows for generating data from lookup files or functions and then modifying part of records to create “corrupted” duplicates. This dataset contains 10,000 records, 4,000 of which are duplicates. Original records have 0-2 “corrupted” duplicates and those have 3 modified attributes.
Setup
First, install BlockingPy:
pip install blockingpy
Import required packages:
from blockingpy import Blocker
import pandas as pd
Data Preparation
Load the example dataset:
data = pd.read_csv('geco_2_dup_per_rec_3_mod.csv')
Let’s take a look at the data:
data.iloc[40:50, :]
# rec-id first_name second_name last_name region \
# 40 rec-024-org MAJA OLGA LEWICKA ZACHODNIOPOMORSKIE
# 41 rec-025-org POLA LEOKADIA RUTKOWSKA ZACHODNIOPOMORSKIE
# 42 rec-026-dup-0 ALEKSANDRA RYBAK ZÓFIA KUJAWSKO-POMORSKIE
# 43 rec-026-dup-1 ALEKSANDRA RYBAK ZÓFIA KUJAWSKO-POMORSKIE
# 44 rec-026-org ZOFIA ALEKSANDRA RYBAK KUJAWSKO-POMORSKIE
# 45 rec-027-dup-0 LAÓRA JAGYEŁŁO JOANNA WIELKOPOLSKIE
# 46 rec-027-org LAURA JOANNA JAGIEŁŁO WIELKOPOLSKIE
# 47 rec-028-dup-0 MARIA KOZA WIKTÓRIA DOLNOŚLĄSKIE
# 48 rec-028-org WIKTORIA MARIA KOZA DOLNOŚLĄSKIE
# 49 rec-029-org NIKOLA BRONISŁAWA WIĘCKOWSKA ŚLĄSKIE
# birth_date personal_id
# 40 22/10/1935 DKK423341
# 41 29/11/1956 LJL907920
# 42 NaN DAT77p499
# 43 NaN NaN
# 44 24/03/1982 DAT770499
# 45 10/11/1984 LNRt57399
# 46 10/11/1984 LNR657399
# 47 NaN HEH671979
# 48 09/09/1982 HEH671989
# 49 09/11/1992 JKR103426
Preprocess data by concatenating all fields into a single text column:
data['txt'] = (
data['first_name'].fillna('') +
data['second_name'].fillna('') +
data['last_name'].fillna('') +
data['region'].fillna('') +
data['birth_date'].fillna('') +
data['personal_id'].fillna('')
)
print(data['txt'].head())
# 0 GÓRKAKARÓLINAMELANIIAŚWIĘTOKRZYSKIE25/07/2010S...
# 1 MELANIAKAROLINAGÓRKAŚWIĘTOKRZYSKIE25/07/2001SG...
# 2 MARTAMARTYNAMUSIAŁPODKARPACKIE23/04/1944TLS812403
# 3 KAJAPATRYCJADROZDDOLNOŚLĄSKIE05/12/1950TJH243280
# 4 HANNAKLARALIPSKAMAŁOPOLSKIE28/05/1991MTN763673
Basic Deduplication
Initialize blocker instance and perform deduplication using the Voyager algorithm:
control_ann = {
'voyager': {
'distance': 'cosine',
'random_seed': 42,
'M': 16,
'ef_construction': 300,
}
}
blocker = Blocker()
dedup_result = blocker.block(
x=data['txt'],
ann='voyager',
verbose=1,
control_ann=control_ann,
random_seed=42
)
# ===== creating tokens: shingle =====
# ===== starting search (voyager, x, y: 10000,10000, t: 1169) =====
# ===== creating graph =====
Let’s examine the results:
print(dedup_result)
# ========================================================
# Blocking based on the voyager method.
# Number of blocks: 2723
# Number of columns created for blocking: 1169
# Reduction ratio: 0.999564
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 2 | 926
# 3 | 883
# 4 | 363
# 5 | 211
# 6 | 100
# 7 | 78
# 8 | 41
# 9 | 26
# 10 | 21
# 11 | 15
# 12 | 13
# 13 | 7
# 14 | 7
# 15 | 9
# 16 | 9
# 17 | 5
# 18 | 2
# 19 | 2
# 20 | 1
# 23 | 1
# 24 | 1
# 27 | 1
# 32 | 1
and:
print(dedup_result.result)
# x y block dist
# 0 1 0 0 0.102041
# 1 5974 2 1 0.390295
# 2 7378 3 2 0.425410
# 3 5562 4 3 0.396494
# 4 1389 5 4 0.461184
# ... ... ... ... ...
# 7281 9995 9993 2722 0.241895
# 7282 9995 9994 2722 0.135667
# 7283 4029 9996 1561 0.386845
# 7284 9998 9997 67 0.128579
# 7285 9998 9999 67 0.128579
Let’s take a look at the pair in block 66:
print(data.iloc[[9998,9999], : ])
# rec-id first_name second_name last_name region birth_date personal_id txt
# 9998 rec-999-dup-1 RESŻKA LILIANNA MAŁGÓRZATA WARMIŃSKO-MAZURSKIE 12/01/1978 NaN RESŻKALILIANNAMAŁGÓRZATAWARMIŃSKO-MAZURSKIE12/...
# 9999 rec-999-org LILIANNA MAŁGORZATA RESZKA WARMIŃSKO-MAZURSKIE 12/01/1978 TCX847483 LILIANNAMAŁGORZATARESZKAWARMIŃSKO-MAZURSKIE12/...
Even though records differ a lot, our package managed to get this pair right.
Evaluation with True Blocks
Since our dataset contains known duplicate information in the rec-id field, we can evaluate the blocking performance. First, we’ll prepare the true blocks information:
df_eval = data.copy()
# Extract block numbers from rec-id
df_eval['block'] = df_eval['rec-id'].str.extract(r'rec-(\d+)-')
df_eval['block'] = df_eval['block'].astype('int')
# Add sequential index
df_eval = df_eval.sort_values(by=['block'], axis=0).reset_index()
df_eval['x'] = range(len(df_eval))
# Prepare true blocks dataframe
true_blocks_dedup = df_eval[['x', 'block']]
Print true_blocks_dedup:
print(true_blocks_dedup.head(10))
# x block
# 0 0 0
# 1 1 0
# 2 2 1
# 3 3 2
# 4 4 3
# 5 5 4
# 6 6 5
# 7 7 6
# 8 8 6
# 9 9 7
Now we can perform blocking with evaluation using the HNSW algorithm:
control_ann = {
"hnsw": {
'distance': "cosine",
'M': 40,
'ef_c': 500,
'ef_s': 500
}
}
blocker = Blocker()
eval_result = blocker.block(
x=df_eval['txt'],
ann='hnsw',
true_blocks=true_blocks_dedup,
verbose=1,
control_ann=control_ann,
random_seed=42
)
# We can also evaluate separately with `eval` method:
# result = blocker.block(
# x=df_eval['txt'],
# ann='hnsw',
# verbose=1,
# control_ann=control_ann,
# random_seed=42
# )
# eval_result = blocker.eval(
# blocking_result=result,
# true_blocks=true_blocs_dedup
# )
# The rest stays the same in both cases
print(eval_result)
print(eval_result.metrics)
# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 2972
# Number of columns created for blocking: 1169
# Reduction ratio: 0.999649
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 2 | 1113
# 3 | 997
# 4 | 391
# 5 | 200
# 6 | 88
# 7 | 65
# 8 | 39
# 9 | 19
# 10 | 16
# 11 | 13
# 12 | 9
# 13 | 8
# 14 | 4
# 15 | 1
# 16 | 3
# 17 | 1
# 18 | 2
# 19 | 1
# 22 | 1
# 25 | 1
# ========================================================
# Evaluation metrics (standard):
# recall : 99.0151
# precision : 29.2353
# fpr : 0.0248
# fnr : 0.9849
# accuracy : 99.9751
# specificity : 99.9752
# f1_score : 45.142
The results show:
High reduction ratio (
0.9996) indicating significant reduction in comparison spaceHigh recall (
99.02%) showing most true duplicates are found
The block size distribution shows most blocks contain 2-4 records, with a few larger blocks which could occur due to the fact that even records without duplicates will be grouped it to one of the blocks. This is not a problem since those pairs would not be matched when performing one-to-one comparison afterwards. This demonstrates BlockingPy’s effectiveness at identifying potential duplicates while drastically reducing the number of required comparisons.