Core Concepts
What is Blocking?
Blocking is a technique to reduce the number of comparisons needed by:
Grouping similar records into “blocks”
Only comparing records within the same block
Excluding comparisons between records in different blocks
A good blocking strategy should:
Drastically reduce the number of comparisons needed
Maintain high recall (not miss true matches)
Be computationally efficient
The ANN Solution
BlockingPy uses Approximate Nearest Neighbor (ANN) algorithms to implement efficient blocking. The process works by:
Converting records into vector representations
Using ANN algorithms to efficiently find similar vectors
Grouping similar vectors into blocks
This approach has several advantages:
Scales well to large datasets
Works with both structured and text data
Provides tunable trade-offs between speed and accuracy
Key Components
The main components of BlockingPy are:
Blocker: The main class that handles:
Data preprocessing
ANN algorithm selection and configuration
Block generation
BlockingResult: Contains the blocking results, including:
Block assignments
Distance metrics
Quality measures
Controls: Configuration systems for:
Text processing parameters
ANN algorithm parameters