Welcome to BlockingPy’s Documentation
BlockingPy is a Python package that implements efficient blocking methods for record linkage and data deduplication using Approximate Nearest Neighbor (ANN) algorithms. It is based on the R blocking package.
Additionally, GPU acceleration is available via blockingpy-gpu (FAISS-GPU).
Purpose
When performing record linkage or deduplication on large datasets, comparing all possible record pairs becomes computationally infeasible. Blocking helps reduce the comparison space by identifying candidate record pairs that are likely to match, using efficient approximate nearest neighbor search algorithms.
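To make the scale concrete, the following back-of-the-envelope sketch counts the pairs compared in an exhaustive deduplication and the pairs left once records are split into blocks. The block sizes are made-up illustrative numbers, not BlockingPy output, and the helper names are hypothetical.

```python
from math import comb


def all_pairs(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return comb(n, 2)


def pairs_within_blocks(block_sizes: list[int]) -> int:
    """Pairs left to compare when only records in the same block are compared."""
    return sum(comb(size, 2) for size in block_sizes)


n_records = 10_000
# Hypothetical blocking result: 2,500 blocks of 4 records each (illustrative only).
block_sizes = [4] * 2_500

full = all_pairs(n_records)                 # 49,995,000
blocked = pairs_within_blocks(block_sizes)  # 15,000
reduction_ratio = 1 - blocked / full

print(f"all pairs:       {full:,}")
print(f"pairs in blocks: {blocked:,}")
print(f"reduction ratio: {reduction_ratio:.4%}")
```

In practice block sizes vary and some true matches may land in different blocks, so the reduction in comparisons has to be weighed against the matches that blocking misses.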
Key Features
Multiple ANN Algorithms: Supports FAISS, HNSW, Voyager, Annoy, MLPack, and NND
Flexible Input: Works with text data, sparse matrices, or dense feature vectors
Customizable Processing: Configurable text processing and algorithm parameters
Performance Focused: Optimized for both accuracy and computational efficiency
Easy Integration: Simple API that works with pandas DataFrames
Quality Assessment: Built-in evaluation metrics when true matches are known
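As a rough illustration of what evaluation means when true matches are known, the sketch below computes recall and precision over candidate pairs using plain Python sets. The helper names and the toy data are hypothetical; they do not reflect BlockingPy's own API or the exact set of metrics it reports.

```python
from itertools import combinations


def pairs_from_blocks(blocks: dict[int, list[int]]) -> set[tuple[int, int]]:
    """Expand a block -> record-ids mapping into the set of candidate pairs."""
    pairs: set[tuple[int, int]] = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def evaluate(candidates: set, true_matches: set) -> tuple[float, float]:
    """Return (recall, precision) of candidate pairs against known matches."""
    hits = len(candidates & true_matches)
    recall = hits / len(true_matches) if true_matches else 0.0
    precision = hits / len(candidates) if candidates else 0.0
    return recall, precision


# Toy data (hypothetical): six records in three blocks, three known duplicate pairs.
blocks = {0: [0, 1, 2], 1: [3, 4], 2: [5]}
true_matches = {(0, 1), (3, 4), (2, 5)}

candidates = pairs_from_blocks(blocks)
recall, precision = evaluate(candidates, true_matches)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Here the pair (2, 5) is missed because the two records fall into different blocks, which lowers recall, while the non-matching pairs inside block 0 lower precision.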
If you’re new to BlockingPy, we recommend following these steps:
Start with the Getting Started guide to set up BlockingPy
Try the Quick Start guide to see basic usage examples
Work through the Examples to see BlockingPy in more depth
Explore the User Guide for detailed usage instructions
Consult the BlockingPy API reference for further details
Example Datasets
BlockingPy comes with built-in example datasets:
Census-Cis dataset created by Paula McLeod, Dick Heasman and Ian Forbes, ONS, for the ESSnet DI on-the-job training course, Southampton, 25-28 January 2011
Deduplication dataset (also known as RLdata10000), taken from the RecordLinkage R package developed by Murat Sariyar and Andreas Borg. The package is licensed under GPL-3.
License
BlockingPy is released under the MIT license.
Issues
Feel free to report issues, bugs, or suggestions via GitHub Issues here.
Contributing
Please see CONTRIBUTING.md for more information.
Code of Conduct
You can find it here.
Acknowledgements
This package is based on the R blocking package developed by BERENZ.
Funding
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941 (Towards census-like statistics for foreign-born populations – quality, data integration and estimation).