# Deduplication

This example demonstrates how to use BlockingPy to deduplicate a dataset containing duplicate records. We'll use example data generated with the [geco3](https://github.com/T-Strojny/geco3) package, which generates data from lookup files or functions and then modifies some of the records to create "corrupted" duplicates. The dataset contains 10,000 records, 4,000 of which are duplicates. Each original record has 0-2 "corrupted" duplicates, and each duplicate has 3 modified attributes.

## Setup

First, install BlockingPy:

```bash
pip install blockingpy
```

Import required packages:

```python
from blockingpy import Blocker
import pandas as pd
```

## Data Preparation

Load the example dataset:

```python
data = pd.read_csv('geco_2_dup_per_rec_3_mod.csv')
```

Let's take a look at the data:

```python
data.iloc[40:50, :]

#            rec-id  first_name second_name   last_name              region  \
# 40    rec-024-org        MAJA        OLGA     LEWICKA  ZACHODNIOPOMORSKIE   
# 41    rec-025-org        POLA    LEOKADIA   RUTKOWSKA  ZACHODNIOPOMORSKIE   
# 42  rec-026-dup-0  ALEKSANDRA       RYBAK       ZÓFIA  KUJAWSKO-POMORSKIE   
# 43  rec-026-dup-1  ALEKSANDRA       RYBAK       ZÓFIA  KUJAWSKO-POMORSKIE   
# 44    rec-026-org       ZOFIA  ALEKSANDRA       RYBAK  KUJAWSKO-POMORSKIE   
# 45  rec-027-dup-0       LAÓRA    JAGYEŁŁO      JOANNA       WIELKOPOLSKIE   
# 46    rec-027-org       LAURA      JOANNA    JAGIEŁŁO       WIELKOPOLSKIE   
# 47  rec-028-dup-0       MARIA        KOZA    WIKTÓRIA        DOLNOŚLĄSKIE   
# 48    rec-028-org    WIKTORIA       MARIA        KOZA        DOLNOŚLĄSKIE   
# 49    rec-029-org      NIKOLA  BRONISŁAWA  WIĘCKOWSKA             ŚLĄSKIE   

#     birth_date personal_id  
# 40  22/10/1935   DKK423341  
# 41  29/11/1956   LJL907920  
# 42         NaN   DAT77p499  
# 43         NaN         NaN  
# 44  24/03/1982   DAT770499  
# 45  10/11/1984   LNRt57399  
# 46  10/11/1984   LNR657399  
# 47         NaN   HEH671979  
# 48  09/09/1982   HEH671989  
# 49  09/11/1992   JKR103426  
```

Preprocess data by concatenating all fields into a single text column:

```python
data['txt'] = (
    data['first_name'].fillna('') +
    data['second_name'].fillna('') +
    data['last_name'].fillna('') + 
    data['region'].fillna('') +
    data['birth_date'].fillna('') +
    data['personal_id'].fillna('')
)

print(data['txt'].head())

# 0	GÓRKAKARÓLINAMELANIIAŚWIĘTOKRZYSKIE25/07/2010S...
# 1	MELANIAKAROLINAGÓRKAŚWIĘTOKRZYSKIE25/07/2001SG...
# 2	MARTAMARTYNAMUSIAŁPODKARPACKIE23/04/1944TLS812403
# 3	KAJAPATRYCJADROZDDOLNOŚLĄSKIE05/12/1950TJH243280
# 4	HANNAKLARALIPSKAMAŁOPOLSKIE28/05/1991MTN763673
```
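The verbose log in the next section reports a "shingle" tokenization step. As a rough illustration of what character shingling means, the sketch below splits a string into overlapping character n-grams. The helper `char_shingles` and the shingle length of 2 are assumptions made for this illustration; BlockingPy's own tokenizer and its defaults may differ.

```python
def char_shingles(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams (shingles) of `text`.

    Hypothetical helper for illustration only; not part of BlockingPy.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# First ten characters of one of the concatenated records shown above
shingles = char_shingles("HANNAKLARA")
print(shingles)
```

Because shingles overlap, a single typo or a swapped field changes only a few of them, which is why concatenating all fields into one string still works for "corrupted" duplicates.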

## Basic Deduplication

Initialize a `Blocker` instance and perform deduplication using the Voyager algorithm:

```python
control_ann = {
    'voyager': {
        'distance': 'cosine',
        'random_seed': 42,
        'M': 16,
        'ef_construction': 300,
    }
}

blocker = Blocker()
dedup_result = blocker.block(
    x=data['txt'],
    ann='voyager',
    verbose=1,
    control_ann=control_ann,
    random_seed=42
)

# ===== creating tokens: shingle =====
# ===== starting search (voyager, x, y: 10000,10000, t: 1169) =====
# ===== creating graph =====
```

Let's examine the results:

```python
print(dedup_result)

# ========================================================
# Blocking based on the voyager method.
# Number of blocks: 2723
# Number of columns created for blocking: 1169
# Reduction ratio: 0.999564
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 926            
#          3 | 883            
#          4 | 363            
#          5 | 211            
#          6 | 100            
#          7 | 78             
#          8 | 41             
#          9 | 26             
#         10 | 21             
#         11 | 15             
#         12 | 13             
#         13 | 7              
#         14 | 7              
#         15 | 9              
#         16 | 9              
#         17 | 5              
#         18 | 2              
#         19 | 2              
#         20 | 1              
#         23 | 1              
#         24 | 1              
#         27 | 1              
#         32 | 1         
```
The `result` attribute lists the candidate pairs together with their block and distance:

```python
print(dedup_result.result)

#          x     y  block      dist
# 0        1     0      0  0.102041
# 1     5974     2      1  0.390295
# 2     7378     3      2  0.425410
# 3     5562     4      3  0.396494
# 4     1389     5      4  0.461184
# ...    ...   ...    ...       ...
# 7281  9995  9993   2722  0.241895
# 7282  9995  9994   2722  0.135667
# 7283  4029  9996   1561  0.386845
# 7284  9998  9997     67  0.128579
# 7285  9998  9999     67  0.128579
```
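To make the reduction ratio more concrete, the sketch below recomputes it on a small toy table with the same `x`, `y` and `block` columns. It assumes the ratio is defined as one minus the number of within-block pairs divided by the number of all possible pairs; the toy data and variable names are made up for illustration, and BlockingPy's internal computation may differ in detail.

```python
from math import comb

import pandas as pd

# Toy candidate pairs in the same layout as `dedup_result.result` (illustrative data)
pairs = pd.DataFrame({
    "x": [1, 3, 4, 6],
    "y": [0, 2, 2, 5],
    "block": [0, 1, 1, 2],
})
n_records = 8  # total number of records in the toy dataset

# Collect every record that appears in a block, whether as x or as y
members = pd.concat([
    pairs[["x", "block"]].rename(columns={"x": "rec"}),
    pairs[["y", "block"]].rename(columns={"y": "rec"}),
]).drop_duplicates()

# Number of distinct records per block
block_sizes = members.groupby("block")["rec"].nunique()

# Pairs that still need comparing after blocking vs. all possible pairs
candidate_pairs = sum(comb(int(size), 2) for size in block_sizes)
reduction_ratio = 1 - candidate_pairs / comb(n_records, 2)

print(block_sizes.to_dict(), candidate_pairs, round(reduction_ratio, 4))
```

Under this definition, a reduction ratio close to 1 means that only a small fraction of all possible record pairs remains to be compared.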
Let's take a look at the pair in block `67`:

```python
print(data.iloc[[9998,9999], : ])

#              rec-id first_name second_name   last_name               region        birth_date personal_id                                                       txt
# 9998  rec-999-dup-1     RESŻKA    LILIANNA  MAŁGÓRZATA  WARMIŃSKO-MAZURSKIE         12/01/1978        NaN         RESŻKALILIANNAMAŁGÓRZATAWARMIŃSKO-MAZURSKIE12/...
# 9999    rec-999-org   LILIANNA  MAŁGORZATA      RESZKA  WARMIŃSKO-MAZURSKIE         12/01/1978   TCX847483        LILIANNAMAŁGORZATARESZKAWARMIŃSKO-MAZURSKIE12/...
```
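Even though the two records differ substantially (the name fields are permuted and misspelled, and `personal_id` is missing from the duplicate), BlockingPy placed them in the same block.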
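One way to see why such a pair can still end up close together is to compare shingle counts directly. The sketch below computes a cosine distance between 2-character shingle count vectors for the name parts of the two records and for an unrelated name. The helpers `shingle_counts` and `cosine_distance` are hypothetical, written only for this illustration; the distances reported by BlockingPy are computed on the full `txt` strings by the ANN backend and will not match these values.

```python
from collections import Counter
from math import sqrt


def shingle_counts(text: str, n: int = 2) -> Counter:
    """Count the overlapping character n-grams of `text` (illustrative helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a: str, b: str, n: int = 2) -> float:
    """Return 1 - cosine similarity of the shingle count vectors of `a` and `b`."""
    va, vb = shingle_counts(a, n), shingle_counts(b, n)
    dot = sum(va[s] * vb[s] for s in va.keys() & vb.keys())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Name parts of the duplicate, the original, and an unrelated record
dup = "RESŻKALILIANNAMAŁGÓRZATA"
org = "LILIANNAMAŁGORZATARESZKA"
unrelated = "HANNAKLARALIPSKA"

d_dup = cosine_distance(dup, org)
d_other = cosine_distance(unrelated, org)
print(f"duplicate vs original: {d_dup:.3f}")
print(f"unrelated vs original: {d_other:.3f}")
```

Permuting the fields leaves most shingles intact, so the duplicate stays much closer to its original than the unrelated record does.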

## Evaluation with True Blocks

Since our dataset contains known duplicate information in the `rec-id` field, we can evaluate the blocking performance. First, we'll prepare the true blocks information:

```python
df_eval = data.copy()

# Extract block numbers from rec-id
df_eval['block'] = df_eval['rec-id'].str.extract(r'rec-(\d+)-')
df_eval['block'] = df_eval['block'].astype('int')

# Add sequential index
df_eval = df_eval.sort_values(by=['block'], axis=0).reset_index()
df_eval['x'] = range(len(df_eval))

# Prepare true blocks dataframe
true_blocks_dedup = df_eval[['x', 'block']]
```
Print `true_blocks_dedup`:

```python
print(true_blocks_dedup.head(10))

#    x  block
# 0  0      0
# 1  1      0
# 2  2      1
# 3  3      2
# 4  4      3
# 5  5      4
# 6  6      5
# 7  7      6
# 8  8      6
# 9  9      7
```
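As a quick sanity check on the true-blocks preparation, the same steps can be run on a handful of made-up `rec-id` values to count how many true duplicate pairs the blocks imply. The toy identifiers below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up record identifiers following the rec-<number>-<org|dup-k> pattern
toy = pd.DataFrame({
    "rec-id": ["rec-1-org", "rec-0-dup-0", "rec-0-org", "rec-1-dup-0", "rec-2-org"]
})

# Same extraction as above: the number after "rec-" identifies the true block
toy["block"] = toy["rec-id"].str.extract(r"rec-(\d+)-", expand=False).astype(int)

# Sort by block and assign the positional index `x`
toy = toy.sort_values("block").reset_index(drop=True)
toy["x"] = range(len(toy))

# A true block with s records contains s * (s - 1) / 2 duplicate pairs
sizes = toy.groupby("block").size()
true_pairs = int((sizes * (sizes - 1) // 2).sum())

print(toy[["x", "block"]])
print("true duplicate pairs:", true_pairs)
```

Records without duplicates form true blocks of size one and contribute no pairs.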

Now we can perform blocking with evaluation using the HNSW algorithm:

```python
control_ann = {
    "hnsw": {
        'distance': "cosine",
        'M': 40,
        'ef_c': 500,
        'ef_s': 500
    }
}

blocker = Blocker()
eval_result = blocker.block(
    x=df_eval['txt'], 
    ann='hnsw',
    true_blocks=true_blocks_dedup, 
    verbose=1, 
    control_ann=control_ann,
    random_seed=42
)
# We can also evaluate separately with `eval` method:
# result = blocker.block(
#     x=df_eval['txt'], 
#     ann='hnsw', 
#     verbose=1, 
#     control_ann=control_ann,
#     random_seed=42
# )
# eval_result = blocker.eval(
#     blocking_result=result,
#     true_blocks=true_blocks_dedup
# ) 
# The rest stays the same in both cases

print(eval_result)
print(eval_result.metrics)
# ========================================================
# Blocking based on the hnsw method.
# Number of blocks: 2972
# Number of columns created for blocking: 1169
# Reduction ratio: 0.999649
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 1113           
#          3 | 997            
#          4 | 391            
#          5 | 200            
#          6 | 88             
#          7 | 65             
#          8 | 39             
#          9 | 19             
#         10 | 16             
#         11 | 13             
#         12 | 9              
#         13 | 8              
#         14 | 4              
#         15 | 1              
#         16 | 3              
#         17 | 1              
#         18 | 2              
#         19 | 1              
#         22 | 1              
#         25 | 1              
# ========================================================
# Evaluation metrics (standard):
# recall : 99.0151
# precision : 29.2353
# fpr : 0.0248
# fnr : 0.9849
# accuracy : 99.9751
# specificity : 99.9752
# f1_score : 45.142
```
The results show:

- High reduction ratio (`0.9996`) indicating significant reduction in comparison space
- High recall (`99.02%`) showing most true duplicates are found

The block size distribution shows that most blocks contain 2-4 records, with a few larger blocks. Larger blocks can arise because even records without duplicates are grouped into one of the blocks. This is not a problem, since those pairs would not be matched in the subsequent one-to-one comparison. Overall, the example demonstrates BlockingPy's effectiveness at identifying potential duplicates while drastically reducing the number of required comparisons.
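The combination of high recall and low precision in the metrics above can be reproduced on a toy example. The sketch below assumes a pair-based evaluation, where every pair of records sharing a block counts as a predicted match; the helper `block_pairs` and the toy assignments are hypothetical, and BlockingPy's own evaluation may be implemented differently.

```python
from itertools import combinations


def block_pairs(assignment: dict[int, int]) -> set[tuple[int, int]]:
    """Return all unordered record pairs that share a block (illustrative helper)."""
    by_block: dict[int, list[int]] = {}
    for rec, blk in assignment.items():
        by_block.setdefault(blk, []).append(rec)
    return {
        tuple(sorted(pair))
        for members in by_block.values()
        for pair in combinations(members, 2)
    }


# record -> block; record 4 has no true duplicate but is blocked with records 2 and 3
true_assignment = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
pred_assignment = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

true_pairs = block_pairs(true_assignment)
pred_pairs = block_pairs(pred_assignment)
true_positives = true_pairs & pred_pairs

recall = len(true_positives) / len(true_pairs)
precision = len(true_positives) / len(pred_pairs)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case every true pair is recovered, but the singleton record adds two extra candidate pairs, which lowers precision without affecting recall.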