Deduplication No. 2

In this example we’ll use the RLdata10000 dataset from the RecordLinkage R package, developed by Murat Sariyar and Andreas Borg. It contains 10 000 records in total, some of which are duplicates with randomly generated errors: there are 9000 original records and 1000 duplicates.

Data Preparation

Let’s install blockingpy:

pip install blockingpy

Import necessary packages and functions:

import pandas as pd
from blockingpy import Blocker
from blockingpy.datasets import load_deduplication_data

Let’s load the data and take a look at the first 5 rows:

data = load_deduplication_data()
data.head()

#   fname_c1 fname_c2    lname_c1 lname_c2    by  bm  bd  id  true_id
# 0    FRANK      NaN     MUELLER      NaN  1967   9  27   1     3606
# 1   MARTIN      NaN     SCHWARZ      NaN  1967   2  17   2     2560
# 2  HERBERT      NaN  ZIMMERMANN      NaN  1961  11   6   3     3892
# 3     HANS      NaN     SCHMITT      NaN  1945   8  14   4      329
# 4      UWE      NaN      KELLER      NaN  2000   7   5   5     1994

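Now we need to prepare the txt column by replacing missing values with empty strings and concatenating the name and date-of-birth fields into a single string: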

data = data.fillna('')
data[['by', 'bm', 'bd']] = data[['by', 'bm', 'bd']].astype('str')
data['txt'] = (
    data['fname_c1'] +
    data['fname_c2'] +
    data['lname_c1'] +
    data['lname_c2'] +
    data['by'] +
    data['bm'] +
    data['bd']
)
data['txt'].head()

# 0         FRANKMUELLER1967927
# 1        MARTINSCHWARZ1967217
# 2    HERBERTZIMMERMANN1961116
# 3          HANSSCHMITT1945814
# 4             UWEKELLER200075
# Name: txt, dtype: object

Basic Deduplication

Let’s perform basic deduplication using the hnsw algorithm:

blocker = Blocker()
dedup_result = blocker.block(
    x=data['txt'],
    ann='hnsw',
    verbose=1,
    random_seed=42,
)

# ===== creating tokens: shingle =====
# ===== starting search (hnsw, x, y: 10000,10000, t: 674) =====
# ===== creating graph =====

We can now take a look at the results:

print(dedup_result)

# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 2736
# Number of columns created for blocking: 674
# Reduction ratio: 0.999586
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 962            
#          3 | 725            
#          4 | 409            
#          5 | 263            
#          6 | 139            
#          7 | 89             
#          8 | 52             
#          9 | 37             
#         10 | 24             
#         11 | 14             
#         12 | 9              
#         13 | 5              
#         14 | 2              
#         15 | 1              
#         16 | 1              
#         17 | 2              
#         20 | 1              
#         64 | 1   

The candidate pairs, their block labels and their distances are stored in the result attribute:

print(dedup_result.result)
#          x     y  block      dist
# 0     3402     0      0  0.256839
# 1     1179     1      1  0.331352
# 2     2457     2      2  0.209737
# 3     1956     3      3  0.085341
# 4     4448     4      4  0.375000
# ...    ...   ...    ...       ...
# 7259  9206  9994   1981  0.390912
# 7260  6309  9995   1899  0.268436
# 7261  5162  9996   1742  0.188893
# 7262  6501  9997   1293  0.245406
# 7263  5183  9999   1273  0.209088
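The reduction ratio in the summary can be cross-checked from the block-size distribution alone. The sketch below assumes the reduction ratio is one minus the share of within-block pairs among all possible record pairs; that definition is an assumption rather than something taken from the blockingpy source, but it reproduces the value reported for the hnsw run.

```python
from math import comb

# Block-size distribution copied from the hnsw summary:
# {block size: number of blocks}
block_sizes = {
    2: 962, 3: 725, 4: 409, 5: 263, 6: 139, 7: 89, 8: 52, 9: 37, 10: 24,
    11: 14, 12: 9, 13: 5, 14: 2, 15: 1, 16: 1, 17: 2, 20: 1, 64: 1,
}
n_records = 10_000

n_blocks = sum(block_sizes.values())

# Pairs that remain to be compared: every pair inside every block.
candidate_pairs = sum(comb(size, 2) * count for size, count in block_sizes.items())

# Pairs a naive all-vs-all comparison would need.
all_pairs = comb(n_records, 2)

# Assumed definition: 1 - (candidate pairs / all pairs).
reduction_ratio = 1 - candidate_pairs / all_pairs

print(n_blocks)                   # 2736
print(candidate_pairs)            # 20682
print(round(reduction_ratio, 6))  # 0.999586
```

In other words, blocking leaves 20 682 pairs to compare instead of 49 995 000.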

Let’s see the pair in block no. 3:

print(data.iloc[[1956, 3], : ])
#      fname_c1 fname_c2 lname_c1  ...    id true_id                  txt
# 1956    HRANS           SCHMITT  ...  1957     329  HRANSSCHMITT1945814
# 3        HANS           SCHMITT  ...     4     329   HANSSCHMITT1945814
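To see why these two records end up together, it helps to compare their txt values shingle by shingle. The sketch below is a plain-Python illustration, not blockingpy’s implementation: it assumes character 2-grams as shingles and cosine distance between the shingle count vectors. The helper names shingles and cosine_distance are made up for this example. Under these assumptions the result agrees with the dist value reported for this pair.

```python
from collections import Counter
from math import sqrt


def shingles(text, n=2):
    """Count the overlapping character n-grams (shingles) of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two shingle count vectors."""
    dot = sum(a[s] * b[s] for s in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (norm_a * norm_b)


# txt values of rows 1956 and 3
x = shingles("HRANSSCHMITT1945814")
y = shingles("HANSSCHMITT1945814")

shared = x.keys() & y.keys()
print(len(x), len(y), len(shared))      # 18 17 16
print(round(cosine_distance(x, y), 6))  # 0.085341
```

The single inserted letter changes only three shingles (HR and RA replace HA), so the two vectors stay almost identical.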

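True Blocks Preparation

To evaluate the blocking we need the true blocks. We use the true_id column as the block label and add an x column holding each record’s positional index: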

df_eval = data.copy()
df_eval['block'] = df_eval['true_id']
df_eval['x'] = range(len(df_eval))

print(df_eval.head())
#   fname_c1 fname_c2    lname_c1  ...                       txt block  x
# 0    FRANK              MUELLER  ...       FRANKMUELLER1967927  3606  0
# 1   MARTIN              SCHWARZ  ...      MARTINSCHWARZ1967217  2560  1
# 2  HERBERT           ZIMMERMANN  ...  HERBERTZIMMERMANN1961116  3892  2
# 3     HANS              SCHMITT  ...        HANSSCHMITT1945814   329  3
# 4      UWE               KELLER  ...           UWEKELLER200075  1994  4

Let’s create the final true_blocks_dedup:

true_blocks_dedup = df_eval[['x', 'block']]
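The x and block columns encode which records truly belong together. As a minimal sketch, assuming that two records count as a true match when they share a block label, the true pairs can be enumerated in plain Python. The six (x, true_id) records are copied from the rows printed earlier in this example; the variable names are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

# (positional index x, true_id) for six rows shown earlier in this example
records = [(0, 3606), (1, 2560), (2, 3892), (3, 329), (4, 1994), (1956, 329)]

# Group positional indices by their true block label.
members = defaultdict(list)
for x, true_id in records:
    members[true_id].append(x)

# Every pair of records inside the same true block is a true match.
true_pairs = [
    pair
    for xs in members.values()
    for pair in combinations(sorted(xs), 2)
]

print(true_pairs)  # [(3, 1956)]
```

Only rows 3 and 1956 share a true_id here, which is the same pair that appeared in block no. 3.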

Evaluation

Now we can evaluate the blocking, this time using the voyager algorithm, by passing the true blocks to the block method:

blocker = Blocker()
eval_result = blocker.block(
    x=df_eval['txt'], 
    ann='voyager',
    true_blocks=true_blocks_dedup, 
    verbose=1, 
    random_seed=42,
)
# ===== creating tokens: shingle =====
# ===== starting search (voyager, x, y: 10000,10000, t: 674) =====
# ===== creating graph =====

And the results:

print(eval_result)
print(eval_result.metrics)
# ========================================================
# Blocking based on the voyager method.
# Number of blocks: 2726
# Number of columns created for blocking: 674
# Reduction ratio: 0.999581
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 956            
#          3 | 712            
#          4 | 412            
#          5 | 267            
#          6 | 141            
#          7 | 91             
#          8 | 51             
#          9 | 36             
#         10 | 22             
#         11 | 14             
#         12 | 9              
#         13 | 5              
#         14 | 4              
#         15 | 2              
#         17 | 1              
#         18 | 1              
#         20 | 1              
#         66 | 1              
# ========================================================
# Evaluation metrics (standard):
# recall : 100.0
# precision : 4.7787
# fpr : 0.0399
# fnr : 0.0
# accuracy : 99.9601
# specificity : 99.9601
# f1_score : 9.1216

print(eval_result.confusion)
#                  Predicted Positive  Predicted Negative
# Actual Positive                1000                   0
# Actual Negative               19926            49974074

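The results show a high reduction ratio (0.9996) alongside perfect recall (100%), indicating that the package handled this dataset very well: no true duplicate pair was missed, while only 20 926 of the 49 995 000 possible pairs were kept as candidates. Precision is low (4.78%), which means that most candidate pairs are not true duplicates.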
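The evaluation metrics can be recomputed by hand from the confusion matrix. The sketch below uses the standard textbook definitions, which are assumed to match those used by the package, with values expressed as percentages. The four counts are copied from the printed confusion matrix.

```python
# Counts copied from the confusion matrix
tp, fn = 1000, 0
fp, tn = 19926, 49974074

recall = tp / (tp + fn)
precision = tp / (tp + fp)

metrics = {
    "recall": recall,
    "precision": precision,
    "fpr": fp / (fp + tn),
    "fnr": fn / (fn + tp),
    "accuracy": (tp + tn) / (tp + fp + fn + tn),
    "specificity": tn / (tn + fp),
    "f1_score": 2 * precision * recall / (precision + recall),
}

# Express every metric as a percentage rounded to four decimals.
percentages = {name: round(100 * value, 4) for name, value in metrics.items()}

for name, value in percentages.items():
    print(f"{name} : {value}")
# recall : 100.0
# precision : 4.7787
# fpr : 0.0399
# fnr : 0.0
# accuracy : 99.9601
# specificity : 99.9601
# f1_score : 9.1216
```

The four counts add up to 49 995 000, the number of distinct pairs among 10 000 records, so the evaluation is carried out over pairs of records rather than over single records.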