Quick Start

This guide will help you get started with BlockingPy by walking through some basic examples. We’ll cover both record linkage (matching records between two datasets) and deduplication (finding duplicates within a single dataset).

Basic Record Linkage

Let’s start with a simple example of matching records between two datasets. We’ll use names that have slight variations to demonstrate how BlockingPy handles approximate matching.

First, import Blocker, the main BlockingPy class used for blocking, along with pandas:

from blockingpy import Blocker
import pandas as pd

Now let’s create simple datasets for our example:

dataset1 = pd.DataFrame({
    "txt": [
        "johnsmith",
        "smithjohn",
        "smiithhjohn",
        "smithjohnny",
        "montypython",
        "pythonmonty",
        "errmontypython",
        "monty",
    ]
})

dataset2 = pd.DataFrame({
    "txt": [
        "montypython",
        "smithjohn",
        "other",
    ]
})

We initialize the Blocker instance and perform blocking:

blocker = Blocker()
blocking_result = blocker.block(x=dataset1['txt'], y=dataset2['txt'])

Let’s print blocking_result and see the output:

print(blocking_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 3
# Number of columns used for blocking: 17
# Reduction ratio: 0.8750
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 3

Our output contains:

Algorithm used for blocking (the default is faiss with an HNSW index)
Number of blocks created
Number of columns used for blocking (obtained by creating document-term matrices, DTMs, from the datasets)
Reduction ratio, i.e. the share of comparison pairs that blocking eliminates (here 0.8750, which means blocking removes 87.5% of the possible comparisons)

We can print blocking_result.result to get the detailed matching results:

print(blocking_result.result)
#    x  y  block  dist
# 0  4  0      0   0.0
# 1  1  1      1   0.0
# 2  7  2      2   6.0

Here we have:

x: Index from the first dataset (dataset1)
y: Index from the second dataset (dataset2)
block: The block ID these records were grouped into
dist: The distance between the records (smaller means more similar)
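
Because x and y are positional indices, it can help to map them back to the original strings. The sketch below is one way to do that with plain pandas. To keep it runnable without BlockingPy installed, the pairs table is copied by hand from the printed output above rather than taken from blocking_result.result, and the column names txt_x and txt_y are made up for illustration.

```python
import pandas as pd

dataset1 = pd.DataFrame({"txt": [
    "johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
    "montypython", "pythonmonty", "errmontypython", "monty",
]})
dataset2 = pd.DataFrame({"txt": ["montypython", "smithjohn", "other"]})

# Assumption: these rows mirror the printed blocking_result.result above.
pairs = pd.DataFrame({
    "x": [4, 1, 7],
    "y": [0, 1, 2],
    "block": [0, 1, 2],
    "dist": [0.0, 0.0, 6.0],
})

# x indexes dataset1 and y indexes dataset2, so look each one up by position.
readable = pairs.assign(
    txt_x=dataset1["txt"].iloc[pairs["x"].to_numpy()].to_numpy(),
    txt_y=dataset2["txt"].iloc[pairs["y"].to_numpy()].to_numpy(),
)
print(readable[["txt_x", "txt_y", "block", "dist"]])
```

The first two pairs are exact matches with a distance of 0.0. The third pairs "monty" with "other" at a distance of 6.0, which suggests that blocking produces candidate pairs to be checked in a later comparison step rather than confirmed matches.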

Basic Deduplication

Now let’s try finding duplicates within a single dataset:

dedup_result = blocker.block(x=dataset1['txt'])

print(dedup_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 2
# Number of columns created for blocking: 25
# Reduction ratio: 0.5714
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          4 | 2

The output contains the same kind of information as in the record linkage example:

faiss algorithm used
2 blocks created
25 columns (features) created for blocking from text representation
0.5714 reduction ratio (meaning we get about 57.14% reduction in comparison pairs)

Let’s take a look at the detailed information:

print(dedup_result.result)
#    x  y  block  dist
# 0  0  1      0   2.0
# 1  1  2      0   2.0
# 2  1  3      0   2.0
# 3  4  5      1   2.0
# 4  4  6      1   3.0
# 5  4  7      1   6.0
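
The pair table lists which records were linked, but it is often easier to read as a list of members per block. The following is a minimal sketch of that regrouping using pandas only. The pairs are again copied by hand from the output above (an assumption made so the snippet runs on its own), and the column name record is chosen for illustration.

```python
import pandas as pd

# Assumption: these rows mirror the printed dedup_result.result above.
pairs = pd.DataFrame({
    "x": [0, 1, 1, 4, 4, 4],
    "y": [1, 2, 3, 5, 6, 7],
    "block": [0, 0, 0, 1, 1, 1],
    "dist": [2.0, 2.0, 2.0, 2.0, 3.0, 6.0],
})

# Stack the x and y columns into a single "record" column,
# then collect the unique record indices that appear in each block.
members = (
    pairs.melt(id_vars="block", value_vars=["x", "y"], value_name="record")
    .groupby("block")["record"]
    .apply(lambda s: sorted(int(v) for v in set(s)))
)

for block_id, records in members.items():
    print(block_id, records)
```

Each block ends up with four records, which agrees with the block size distribution in the summary (two blocks of size 4).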

Understanding the Results

BlockingPy uses character n-grams and approximate nearest neighbor algorithms to group similar records together. By default, it uses the FAISS algorithm with sensible default parameters.
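
To make the idea of character n-grams concrete, here is a small pure-Python sketch. It is not BlockingPy's implementation: the helper names char_ngrams and shared_ngrams are hypothetical, and n=2 is an assumption picked for illustration, since the n-gram size the library uses by default is not stated here.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shared_ngrams(a: str, b: str, n: int = 2) -> set[str]:
    """Return the n-grams that occur in both strings."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


print(char_ngrams("monty"))  # ['mo', 'on', 'nt', 'ty']

# A misspelled variant still shares every bigram of the clean string.
print(len(shared_ngrams("smithjohn", "smiithhjohn")))  # 8

# Unrelated strings share none.
print(len(shared_ngrams("monty", "other")))  # 0
```

Strings that share many n-grams end up close together once they are turned into vectors, which is what lets a nearest neighbor search group spelling variants into the same block.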

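The reduction ratio shows how much blocking reduces the number of required comparisons. For example, a ratio of 0.8750 means that blocking eliminates 87.5% of all possible comparisons, so only the remaining 12.5% of pairs need to be compared in detail. Whether the true matches survive this reduction depends on the data and settings, and can be checked with the evaluation metrics when the true blocks are known.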
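
Both reduction ratios reported in this guide can be reproduced by hand. The sketch below assumes the usual definition, one minus the number of pairs left inside blocks divided by the number of all possible pairs. That definition matches the printed values, but the library's internal computation is not shown here, and the function name reduction_ratio is made up for this example.

```python
from math import comb


def reduction_ratio(pairs_after_blocking: int, pairs_total: int) -> float:
    """Share of candidate pairs removed by blocking (assumed definition)."""
    return 1 - pairs_after_blocking / pairs_total


# Record linkage: 8 x 3 = 24 possible cross-dataset pairs.
# Three blocks, each holding one record from each dataset, leave 3 pairs.
linkage = reduction_ratio(3 * 1, 8 * 3)

# Deduplication: C(8, 2) = 28 possible pairs within one dataset.
# Two blocks of four records leave 2 * C(4, 2) = 12 pairs.
dedup = reduction_ratio(2 * comb(4, 2), comb(8, 2))

print(f"{linkage:.4f}")  # 0.8750
print(f"{dedup:.4f}")    # 0.5714
```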

Next Steps

This quick start covered the basics using default settings. BlockingPy offers several additional features:

Multiple ANN algorithms (Faiss, HNSW, Voyager, Annoy, MLPack, NND)
Various distance metrics
Custom text processing options (embeddings or n-grams)
Performance tuning parameters
Evaluation metrics when true blocks are known

Check out the User Guide for more detailed examples and configuration options.