BlockingPy vs blocklib - comparison
Below we compare BlockingPy with blocklib, a similar library for blocking. We present results obtained by running algorithms from both libraries on 3 generated datasets across 10 runs. The datasets were generated using the geco3 tool, which allows for controlled generation of datasets with duplicates. The datasets resemble real-world personal information data with the fields such as name, 2nd name, surname, dob, municipality, and country of origin. There are 1,5k, 15k and 150k records respectively, with 500, 5k and 50k duplicates in each dataset. For each original record, there are 0, 1, or 2 duplicates. The datasets and code to reproduce the results can be found here. The results were obtained on 6 cores Intel i5 CPU with 16GB RAM (py 3.12).
Algorithm |
Size |
Time [s] |
Recall |
Reduction Ratio |
Pairs (M) |
|---|---|---|---|---|---|
BlockingPy (faiss_hnsw) |
1500 |
0.164 ± 0.049 |
0.959938 ± 0.000000 |
0.997533 ± 0.000000 |
0.003 ± 0.000 |
BlockingPy (faiss_lsh) |
1500 |
0.185 ± 0.015 |
0.955470 ± 0.008391 |
0.997371 ± 0.000112 |
0.003 ± 0.000 |
BlockingPy (voyager) |
1500 |
0.293 ± 0.029 |
0.951772 ± 0.006036 |
0.997382 ± 0.000025 |
0.003 ± 0.000 |
P-Sig |
1500 |
0.051 ± 0.010 |
0.599384 ± 0.000000 |
0.996124 ± 0.000000 |
0.004 ± 0.000 |
λ-fold LSH |
1500 |
0.193 ± 0.009 |
0.465794 ± 0.033199 |
0.990865 ± 0.002847 |
0.010 ± 0.003 |
BlockingPy (faiss_hnsw) |
15000 |
12.029 ± 1.681 |
0.913009 ± 0.000108 |
0.999726 ± 0.000000 |
0.031 ± 0.000 |
BlockingPy (faiss_lsh) |
15000 |
1.398 ± 0.148 |
0.899706 ± 0.003506 |
0.999708 ± 0.000007 |
0.033 ± 0.001 |
BlockingPy (voyager) |
15000 |
6.777 ± 0.903 |
0.875978 ± 0.004538 |
0.999635 ± 0.000008 |
0.041 ± 0.001 |
P-Sig |
15000 |
0.388 ± 0.087 |
0.616241 ± 0.000000 |
0.996380 ± 0.000000 |
0.407 ± 0.000 |
λ-fold LSH |
15000 |
2.016 ± 0.251 |
0.453983 ± 0.027622 |
0.991644 ± 0.002993 |
0.940 ± 0.337 |
BlockingPy (faiss_hnsw) |
150000 |
250.012 ± 4.232 |
0.832092 ± 0.000507 |
0.999967 ± 0.000000 |
0.375 ± 0.001 |
BlockingPy (faiss_lsh) |
150000 |
56.544 ± 1.315 |
0.818256 ± 0.001381 |
0.999964 ± 0.000000 |
0.400 ± 0.005 |
BlockingPy (voyager) |
150000 |
110.341 ± 3.815 |
0.715393 ± 0.005450 |
0.999940 ± 0.000002 |
0.675 ± 0.021 |
P-Sig |
150000 |
3.444 ± 0.091 |
0.608723 ± 0.000000 |
0.996424 ± 0.000000 |
40.231 ± 0.000 |
λ-fold LSH |
150000 |
19.848 ± 0.476 |
0.450190 ± 0.026086 |
0.991550 ± 0.003045 |
95.064 ± 34.252 |
Why BlockingPy outperforms blocklib
Much higher recall
Across all datasets, BlockingPy achieves higher recall then blocklib algorithms. (~0.52 for blocklib vs ~0.88 for BlockingPy).
Better reduction ratio
BlockingPy achieves better reduction ratio than blocklib algorithms, while maintaining higher recall. For instance on a dataset of size 150_000 records the difference in number of pairs between RR of 0.992 (λ-fold LSH) and RR of 0.99994 (voyager) is a difference of 95 milion pairs vs. 0.67 milion pairs requiring comparison.
Minimal setup versus manual tuning
Results shown for BlockingPy can be obtained with just a few lines of code, e.g., blocklib’s p-sig algorithm requires manual setup of blocking features, filters, bloom-filter parameters and signature specifications, which could require significant time and effort to tune.
Scalability
BlockingPy algorithms allow for n_threads selection and most algorithms allow for on-disk index building, where blocklib is missing both of these fetures.
Where is blocklib better
Privacy preserving blocking
blocklib implements privacy preserving blocking algorithms, which are not available in BlockingPy.
Time
blocklib finishes the blocking phase sooner, but the extra minutes that BlockingPy spends are quickly repaid in the matching phase.
In our benchmark (150k dataset) blocklib left ≈ 95 million candidate pairs, whereas BlockingPy left ≈ 0.67 million, that’s a ~140 × reduction.
Even though BlockingPy’s blocking step is ~5 × slower, the downstream classifier now has 140 × less work, so the end-to-end pipeline could still be faster, while achieving much higher recall (0.72 vs. 0.45).
Additionally, we can tune the voyager algorithm to achieve similar recall as blocklib’s algorithms. On those settings the time difference is only ~1.8x, while still getting ~47x less candidate pairs (95 million vs. 2.0 million) compared to λ-fold LSH.
Algorithm |
Size |
Time [s] |
Recall |
Reduction Ratio |
Pairs (M) |
|
|---|---|---|---|---|---|---|
BlockingPy (voyager) - fast |
1500 |
0.215 ± 0.015 |
0.921726 ± 0.010369 |
0.996613 ± 0.000181 |
0.004 ± 0.000 |
|
BlockingPy (voyager) - fast |
15000 |
2.495 ± 0.173 |
0.713395 ± 0.017146 |
0.999297 ± 0.000041 |
0.079 ± 0.005 |
|
BlockingPy (voyager) - fast |
150000 |
34.395 ± 0.722 |
0.450416 ± 0.016316 |
0.999820 ± 0.000009 |
2.025 ± 0.102 |
Blockingpy with GPU acceleration
We also ran BlockingPy on GPU (ann="gpu_faiss"; index types: flat, ivf, cagra). On these datasets the GPU variants match the CPU back-ends on recall and reduction ratio, and the blocking step is faster. For presented dataset sizes, GPU gains are not clearly visible, however on larger datasets our GPU version might surpass blocklib’s algorithms in terms of speed, while still achieving much higher recall.
Algorithm |
Size |
Time [s] |
Recall |
Reduction Ratio |
Pairs (M) |
|---|---|---|---|---|---|
BlockingPy (gpu_faiss cagra) |
1500 |
2.905 ± 0.327 |
0.959938 ± 0.000000 |
0.997499 ± 0.000000 |
0.003 ± 0.000 |
BlockingPy (gpu_faiss flat) |
1500 |
0.898 ± 0.528 |
0.963020 ± 0.000000 |
0.997514 ± 0.000000 |
0.003 ± 0.000 |
BlockingPy (gpu_faiss ivf) |
1500 |
0.908 ± 0.062 |
0.944992 ± 0.004238 |
0.997176 ± 0.000104 |
0.003 ± 0.000 |
BlockingPy (gpu_faiss cagra) |
15000 |
6.835 ± 0.629 |
0.912838 ± 0.000294 |
0.999724 ± 0.000000 |
0.031 ± 0.000 |
BlockingPy (gpu_faiss flat) |
15000 |
1.662 ± 0.165 |
0.913380 ± 0.000000 |
0.999724 ± 0.000000 |
0.031 ± 0.000 |
BlockingPy (gpu_faiss ivf) |
15000 |
2.376 ± 0.244 |
0.866605 ± 0.003575 |
0.999627 ± 0.000006 |
0.042 ± 0.001 |
BlockingPy (gpu_faiss cagra) |
150000 |
51.084 ± 0.802 |
0.826633 ± 0.000155 |
0.999966 ± 0.000000 |
0.380 ± 0.000 |
BlockingPy (gpu_faiss flat) |
150000 |
22.511 ± 0.508 |
0.839217 ± 0.000000 |
0.999967 ± 0.000000 |
0.367 ± 0.000 |
BlockingPy (gpu_faiss ivf) |
150000 |
71.458 ± 2.185 |
0.801427 ± 0.003905 |
0.999957 ± 0.000002 |
0.487 ± 0.021 |