Quick Start
This guide will help you get started with BlockingPy by walking through some basic examples. We’ll cover both record linkage (matching records between two datasets) and deduplication (finding duplicates within a single dataset).
Basic Record Linkage
Let’s start with a simple example of matching records between two datasets. We’ll use names that have slight variations to demonstrate how BlockingPy handles approximate matching.
First, we import the main Blocker class from BlockingPy, along with Pandas:
from blockingpy import Blocker
import pandas as pd
Now let’s create simple datasets for our example:
dataset1 = pd.DataFrame({
    "txt": [
        "johnsmith",
        "smithjohn",
        "smiithhjohn",
        "smithjohnny",
        "montypython",
        "pythonmonty",
        "errmontypython",
        "monty",
    ]
})
dataset2 = pd.DataFrame({
    "txt": [
        "montypython",
        "smithjohn",
        "other",
    ]
})
We initialize the Blocker instance and perform blocking:
blocker = Blocker()
blocking_result = blocker.block(x=dataset1['txt'], y=dataset2['txt'])
Let’s print blocking_result and see the output:
print(blocking_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 3
# Number of columns used for blocking: 17
# Reduction ratio: 0.8750
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 2 | 3
Our output contains:
Algorithm used for blocking (default: faiss with an HNSW index)
Number of blocks created
Number of columns used for blocking (obtained by creating DTMs, i.e. document-term matrices, from the datasets)
Reduction ratio, i.e. how much blocking reduces the number of comparison pairs (here 0.8750, which means blocking removes 87.5% of the comparisons)
We can print blocking_result.result to get the detailed matching results:
print(blocking_result.result)
# x y block dist
# 0 4 0 0 0.0
# 1 1 1 1 0.0
# 2 7 2 2 6.0
Here we have:
x: Index from the first dataset (dataset1)
y: Index from the second dataset (dataset2)
block: The block ID these records were grouped into
dist: The distance between the records (smaller means more similar)
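To read the pairs as strings rather than indices, the x and y columns can be joined back to the source data. The sketch below is only an illustration: it rebuilds the result table printed above by hand as a stand-in (so it runs without BlockingPy installed) and assumes that blocking_result.result is an ordinary Pandas DataFrame with these four columns. The names pairs, txt_x and txt_y are made up for this example.

```python
import pandas as pd

dataset1 = pd.DataFrame({"txt": [
    "johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
    "montypython", "pythonmonty", "errmontypython", "monty",
]})
dataset2 = pd.DataFrame({"txt": ["montypython", "smithjohn", "other"]})

# Hand-made stand-in for blocking_result.result, copied from the output above.
pairs = pd.DataFrame({
    "x": [4, 1, 7],
    "y": [0, 1, 2],
    "block": [0, 1, 2],
    "dist": [0.0, 0.0, 6.0],
})

# x indexes dataset1 and y indexes dataset2, so map each index to its text.
pairs["txt_x"] = pairs["x"].map(dataset1["txt"])
pairs["txt_y"] = pairs["y"].map(dataset2["txt"])

print(pairs[["txt_x", "txt_y", "block", "dist"]])
```

In this table the two exact matches have a distance of 0.0, while the pair of "monty" and "other" has a distance of 6.0, so it is a much weaker candidate.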
Basic Deduplication
Now let’s try finding duplicates within a single dataset:
dedup_result = blocker.block(x=dataset1['txt'])
print(dedup_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 2
# Number of columns created for blocking: 25
# Reduction ratio: 0.5714
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 4 | 2
The output contains the same kind of information as in the record linkage case:
faiss algorithm used
2 blocks created
25 columns (features) created for blocking from the text representation
0.5714 reduction ratio (meaning we get about a 57.14% reduction in comparison pairs)
Let’s take a look at the detailed information:
print(dedup_result.result)
# x y block dist
# 0 0 1 0 2.0
# 1 1 2 0 2.0
# 2 1 3 0 2.0
# 3 4 5 1 2.0
# 4 4 6 1 3.0
# 5 4 7 1 6.0
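Each row above is a pair of record indices, so the members of a block are spread over several rows. As a rough sketch, the pairs can be collected into one set of records per block. The table is again rebuilt by hand as a stand-in for dedup_result.result (assumed to be a Pandas DataFrame), and the names pairs, names and blocks are hypothetical.

```python
import pandas as pd

names = [
    "johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
    "montypython", "pythonmonty", "errmontypython", "monty",
]

# Hand-made stand-in for dedup_result.result, copied from the output above.
pairs = pd.DataFrame({
    "x": [0, 1, 1, 4, 4, 4],
    "y": [1, 2, 3, 5, 6, 7],
    "block": [0, 0, 0, 1, 1, 1],
    "dist": [2.0, 2.0, 2.0, 2.0, 3.0, 6.0],
})

# Gather every record index seen in a block, from either side of a pair.
blocks = {}
for row in pairs.itertuples(index=False):
    blocks.setdefault(int(row.block), set()).update((int(row.x), int(row.y)))

for block_id, members in sorted(blocks.items()):
    print(block_id, [names[i] for i in sorted(members)])
```

Each of the two blocks ends up with four records, which agrees with the block size distribution printed in the summary.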
Understanding the Results
BlockingPy uses character n-grams and approximate nearest neighbor algorithms to group similar records together. By default, it uses the FAISS algorithm with sensible default parameters.
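To make the n-gram representation concrete, here is a small pure-Python sketch that does not call BlockingPy. It rests on three assumptions that this guide does not state: the n-grams are overlapping 2-character shingles, record linkage keeps only the n-grams present in both datasets, and dist is a squared Euclidean distance between n-gram count vectors. These choices reproduce the numbers shown earlier (25 columns for deduplication, 17 for record linkage, and a distance of 6.0), but the helper names shingles, vocabulary and squared_distance are made up for illustration.

```python
from collections import Counter

dataset1 = [
    "johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
    "montypython", "pythonmonty", "errmontypython", "monty",
]
dataset2 = ["montypython", "smithjohn", "other"]


def shingles(text, n=2):
    """Count the overlapping character n-grams of `text` (n=2 is an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def vocabulary(texts, n=2):
    """Return the set of distinct n-grams found in a list of strings."""
    return set().union(*(shingles(t, n) for t in texts))


def squared_distance(a, b, columns):
    """Squared Euclidean distance between two n-gram count vectors."""
    count_a, count_b = shingles(a), shingles(b)
    return sum((count_a[g] - count_b[g]) ** 2 for g in columns)


cols1 = vocabulary(dataset1)          # columns for deduplication
shared = cols1 & vocabulary(dataset2)  # columns common to both datasets

print(len(cols1))   # 25
print(len(shared))  # 17
print(squared_distance("montypython", "monty", cols1))  # 6
print(squared_distance("monty", "other", shared))       # 6
```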
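The reduction ratio shows how much blocking reduces the number of required comparisons. For example, a ratio of 0.8750 means that blocking eliminates 87.5% of the possible comparisons. The ratio measures only the saving in work; whether true matches end up in the same block is a separate question, which can be checked with evaluation metrics when the true blocks are known.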
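As a worked check, both ratios reported in this guide can be recomputed by hand. The sketch assumes the usual definition, one minus the number of candidate pairs inside blocks divided by the number of all possible pairs; this matches the printed values but is not necessarily how BlockingPy computes it internally. The function name reduction_ratio is hypothetical.

```python
from math import comb


def reduction_ratio(candidate_pairs, total_pairs):
    """Share of all possible comparisons that blocking removes."""
    return 1 - candidate_pairs / total_pairs


# Record linkage: 8 x 3 possible cross-dataset pairs; each of the 3 blocks
# holds one record from each dataset, i.e. one candidate pair per block.
rr_linkage = reduction_ratio(candidate_pairs=3 * 1, total_pairs=8 * 3)

# Deduplication: C(8, 2) possible pairs within one dataset; each of the
# 2 blocks holds 4 records, i.e. C(4, 2) candidate pairs per block.
rr_dedup = reduction_ratio(candidate_pairs=2 * comb(4, 2), total_pairs=comb(8, 2))

print(round(rr_linkage, 4))  # 0.875
print(round(rr_dedup, 4))    # 0.5714
```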
Next Steps
This quick start covered the basics using default settings. BlockingPy offers several additional features:
Multiple ANN algorithms (Faiss, HNSW, Voyager, Annoy, MLPack, NND)
Various distance metrics
Custom text processing options (Embeddings or Ngrams)
Performance tuning parameters
Evaluation metrics when true blocks are known
Check out the User Guide for more detailed examples and configuration options.