# Integration with recordlinkage package In this example we aim to show how users can utilize blocking results achieved with BlockingPy and use them with the [recordlinkage](https://github.com/J535D165/recordlinkage) package. The [recordlinkage](https://github.com/J535D165/recordlinkage) allows for both blocking and one-to-one record linkage and deduplication. However, it is possible to transfer blocking results from BlockingPy and incorporate them in the full entity resolution pipeline. This example will show deduplication of febrl1 dataset which comes buillt-in with [recordlinkage](https://github.com/J535D165/recordlinkage). We aim to follow the [Data deduplication](https://recordlinkage.readthedocs.io/en/latest/guides/data_deduplication.html#Introduction) example available on the recordlinkage documentation website and substitute the blocking procedure with our own. ## Setup Firstly, we need to install `BlockingPy` and `recordlinkage`: ```bash pip install blockingpy recordlinkage ``` Import necessary components: ```python import recordlinkage from recordlinkage.datasets import load_febrl1 from blockingpy import Blocker import pandas as pd import numpy as np np.random.seed(42) ``` ## Data preparation `febrl1` dataset contains 1000 records of which 500 are original and 500 are duplicates. It containts fictitious personal information e.g. name, surname, adress. ```python df = load_febrl1() print(df.head(2)) # given_name surnam street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id # rec_id # rec-223-org NaN waller 6 tullaroop street willaroo st james 4011 wa 19081209 6988048 # rec-122-org lachlan berry 69 giblin street killarney bittern 4814 qld 19990219 7364009 ``` Prepare data in a suitable format for blockingpy. For this we need to fill missing values and concat fields to the `txt` column: ```python df = df.fillna('') df['txt'] = df['given_name'] + df['surname'] + \ df['street_number'] + df['address_1'] + \ df['address_2'] + df['suburb'] + \ df['postcode'] + df['state'] + \ df['date_of_birth'] + df['soc_sec_id'] ``` ## Blocking Now we can obtain blocks from `BlockingPy`: ```python blocker = Blocker() blocking_result = blocker.block( x=df['txt'], ann='hnsw', random_seed=42 ) print(blocking_result) # ======================================================== # Blocking based on the hnsw method. # Number of blocks: 500 # Number of columns created for blocking: 1023 # Reduction ratio: 0.998999 # ======================================================== # Distribution of the size of the blocks: # Block Size | Number of Blocks # 2 | 500 print(blocking_result.result.head()) # x y block dist # 0 474 0 0 0.048375 # 1 330 1 1 0.038961 # 2 351 2 2 0.086690 # 3 290 3 3 0.024617 # 4 333 4 4 0.105662 ``` ## Integration To integrate our results, we can add a `block` column to the original dataframe. `Blockingpy` provides a `add_block_column` method for this step. Since the index of the original dataframe is not the same as the positional index in the blocking result, we need to add an `id` column to the original dataframe. ```python df['id'] = range(len(df)) df_final = blocking_result.add_block_column(df, id_col_left='id') print(df_final['block'].head(5)) # block # rec_id # rec-223-org 0 # rec-122-org 1 # rec-373-org 2 # rec-10-dup-0 3 # rec-227-org 4 ``` Now we can use the `Index` object from `recordlinkage` with the `block` column to integrate `BlockingPy` results with `recordlinkage`: ```python indexer = recordlinkage.Index() indexer.block('block') pairs = indexer.index(df_final) print(pairs) # MultiIndex([('rec-344-dup-0', 'rec-344-org'), # ( 'rec-251-org', 'rec-251-dup-0'), # ('rec-335-dup-0', 'rec-335-org'), # ( 'rec-23-dup-0', 'rec-23-org'), # ( 'rec-382-org', 'rec-382-dup-0'), # .... ``` ***NOTE*** : This is the example for deduplication. Keep in mind that for record linkage this step needs to be modified. Finally, we can use the execute one-to-one record linkage with the `recordlinkage` package. We will use the same comparison rules as in the original example: ```python dfA = load_febrl1() # load original dataset once again for clean data compare_cl = recordlinkage.Compare() compare_cl.exact("given_name", "given_name", label="given_name") compare_cl.string( "surname", "surname", method="jarowinkler", threshold=0.85, label="surname" ) compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth") compare_cl.exact("suburb", "suburb", label="suburb") compare_cl.exact("state", "state", label="state") compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1") features = compare_cl.compute(pairs, dfA) matches = features[features.sum(axis=1) > 3] print(len(matches)) # 458 # vs. 317 when blocking traditionally on 'given_name' ``` Comparison rules were adopted from the [orignal example](https://recordlinkage.readthedocs.io/en/latest/guides/data_deduplication.html#Introduction).