Welcome to BlockingPy’s Documentation

Project Status: Active. The project has reached a stable, usable state and is being actively developed. The package is pyOpenSci peer-reviewed. GPU variant: CUDA ≥12.4.

BlockingPy is a Python package that implements efficient blocking methods for record linkage and data deduplication using Approximate Nearest Neighbor (ANN) algorithms. It is based on the R blocking package.

Additionally, GPU acceleration is available via blockingpy-gpu (FAISS-GPU).

Purpose

When performing record linkage or deduplication on large datasets, comparing all possible record pairs becomes computationally infeasible. Blocking helps reduce the comparison space by identifying candidate record pairs that are likely to match, using efficient approximate nearest neighbor search algorithms.
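To make the idea concrete, here is a minimal standard-library sketch of nearest-neighbour blocking on a toy deduplication problem. It is an illustration only, not BlockingPy's implementation: the helper names (`shingles`, `cosine`, `nearest_neighbour_pairs`) are hypothetical, the records are made up, and the search is exact brute force, whereas BlockingPy relies on approximate nearest neighbor libraries to scale to large datasets.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def shingles(text, n=2):
    """Represent a string as counts of its character n-grams."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_neighbour_pairs(records):
    """Pair each record with its single most similar other record.

    Brute-force search: fine for a toy example, too slow for real data.
    """
    vecs = [shingles(r) for r in records]
    pairs = set()
    for i, v in enumerate(vecs):
        others = (k for k in range(len(vecs)) if k != i)
        j = max(others, key=lambda k: cosine(v, vecs[k]))
        pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical records: three people, each entered twice with a typo.
records = [
    "john smith", "jon smith",
    "maria garcia", "maria garcai",
    "peter jones", "petr jones",
]

all_pairs = len(list(combinations(range(len(records)), 2)))
candidates = nearest_neighbour_pairs(records)

print(f"all pairs: {all_pairs}, candidate pairs: {len(candidates)}")
print(sorted(candidates))
```

With n records, exhaustive comparison needs n(n-1)/2 pairs, so the full comparison space grows quadratically, while the number of candidate pairs produced by a nearest-neighbour search grows roughly linearly with n.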

Key Features

  • Multiple ANN Algorithms: Supports FAISS, HNSW, Voyager, Annoy, MLPack, and NND

  • Flexible Input: Works with text data, sparse matrices, or dense feature vectors

  • Customizable Processing: Configurable text processing and algorithm parameters

  • Performance Focused: Optimized for both accuracy and computational efficiency

  • Easy Integration: Simple API that works with pandas DataFrames

  • Quality Assessment: Built-in evaluation metrics when true matches are known
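As an illustration of what quality assessment can look like when true matches are known, the sketch below computes three commonly used blocking measures: reduction ratio, recall, and precision over record pairs. The function `blocking_quality` and its inputs are hypothetical and written only for this example; the metrics BlockingPy reports may be named or defined differently, so consult the API reference for the authoritative definitions.

```python
def blocking_quality(candidate_pairs, true_pairs, n_records):
    """Compare candidate pairs from blocking against known true matches.

    Pairs are unordered, so (a, b) and (b, a) are treated as the same pair.
    """
    total = n_records * (n_records - 1) // 2  # size of the full comparison space
    candidates = {tuple(sorted(p)) for p in candidate_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    found = candidates & truth
    return {
        # Share of the full comparison space that blocking removed.
        "reduction_ratio": 1 - len(candidates) / total,
        # Share of true matches that survived blocking.
        "recall": len(found) / len(truth) if truth else 0.0,
        # Share of candidate pairs that are true matches.
        "precision": len(found) / len(candidates) if candidates else 0.0,
    }


# Hypothetical example: 6 records, 4 candidate pairs, 3 true matching pairs.
metrics = blocking_quality(
    candidate_pairs={(0, 1), (2, 3), (4, 5), (0, 4)},
    true_pairs={(0, 1), (2, 3), (1, 5)},
    n_records=6,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A high reduction ratio with low recall means blocking is cheap but discards true matches, so the two are usually read together.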

If you’re new to BlockingPy, we recommend following these steps:

  1. Start with the Getting Started guide to set up BlockingPy

  2. Try the Quick Start guide to see basic usage examples

  3. Work through the Examples to see BlockingPy applied to concrete tasks

  4. Explore the User Guide for detailed usage instructions

  5. Consult the BlockingPy API reference for full details

Example Datasets

BlockingPy comes with built-in example datasets:

  • Census-Cis dataset created by Paula McLeod, Dick Heasman and Ian Forbes, ONS, for the ESSnet DI on-the-job training course, Southampton, 25-28 January 2011

  • Deduplication dataset (also known as RLdata10000) taken from the RecordLinkage R package developed by Murat Sariyar and Andreas Borg. The package is licensed under GPL-3.

License

BlockingPy is released under the MIT license.

Issues

Feel free to report any issues, bugs, or suggestions via GitHub issues.

Contributing

Please see CONTRIBUTING.md for more information.

Code of Conduct

You can find it here.

Acknowledgements

This package is based on the R blocking package developed by BERENZ.

Funding

Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941 (Towards census-like statistics for foreign-born populations – quality, data integration and estimation).