BlockingPy vs blocklib - comparison

Below we compare BlockingPy with blocklib, a similar library for blocking. We present results obtained by running algorithms from both libraries on 3 generated datasets across 10 runs. The datasets were generated using the geco3 tool, which allows for controlled generation of datasets with duplicates. The datasets resemble real-world personal information data with the fields such as name, 2nd name, surname, dob, municipality, and country of origin. There are 1,5k, 15k and 150k records respectively, with 500, 5k and 50k duplicates in each dataset. For each original record, there are 0, 1, or 2 duplicates. The datasets and code to reproduce the results can be found here. The results were obtained on 6 cores Intel i5 CPU with 16GB RAM (py 3.12).

Algorithm

Size

Time [s]

Recall

Reduction Ratio

Pairs (M)

BlockingPy (faiss_hnsw)

1500

0.164 ± 0.049

0.959938 ± 0.000000

0.997533 ± 0.000000

0.003 ± 0.000

BlockingPy (faiss_lsh)

1500

0.185 ± 0.015

0.955470 ± 0.008391

0.997371 ± 0.000112

0.003 ± 0.000

BlockingPy (voyager)

1500

0.293 ± 0.029

0.951772 ± 0.006036

0.997382 ± 0.000025

0.003 ± 0.000

P-Sig

1500

0.051 ± 0.010

0.599384 ± 0.000000

0.996124 ± 0.000000

0.004 ± 0.000

λ-fold LSH

1500

0.193 ± 0.009

0.465794 ± 0.033199

0.990865 ± 0.002847

0.010 ± 0.003

BlockingPy (faiss_hnsw)

15000

12.029 ± 1.681

0.913009 ± 0.000108

0.999726 ± 0.000000

0.031 ± 0.000

BlockingPy (faiss_lsh)

15000

1.398 ± 0.148

0.899706 ± 0.003506

0.999708 ± 0.000007

0.033 ± 0.001

BlockingPy (voyager)

15000

6.777 ± 0.903

0.875978 ± 0.004538

0.999635 ± 0.000008

0.041 ± 0.001

P-Sig

15000

0.388 ± 0.087

0.616241 ± 0.000000

0.996380 ± 0.000000

0.407 ± 0.000

λ-fold LSH

15000

2.016 ± 0.251

0.453983 ± 0.027622

0.991644 ± 0.002993

0.940 ± 0.337

BlockingPy (faiss_hnsw)

150000

250.012 ± 4.232

0.832092 ± 0.000507

0.999967 ± 0.000000

0.375 ± 0.001

BlockingPy (faiss_lsh)

150000

56.544 ± 1.315

0.818256 ± 0.001381

0.999964 ± 0.000000

0.400 ± 0.005

BlockingPy (voyager)

150000

110.341 ± 3.815

0.715393 ± 0.005450

0.999940 ± 0.000002

0.675 ± 0.021

P-Sig

150000

3.444 ± 0.091

0.608723 ± 0.000000

0.996424 ± 0.000000

40.231 ± 0.000

λ-fold LSH

150000

19.848 ± 0.476

0.450190 ± 0.026086

0.991550 ± 0.003045

95.064 ± 34.252

Why BlockingPy outperforms blocklib

  1. Much higher recall

Across all datasets, BlockingPy achieves higher recall then blocklib algorithms. (~0.52 for blocklib vs ~0.88 for BlockingPy).

  1. Better reduction ratio

BlockingPy achieves better reduction ratio than blocklib algorithms, while maintaining higher recall. For instance on a dataset of size 150_000 records the difference in number of pairs between RR of 0.992 (λ-fold LSH) and RR of 0.99994 (voyager) is a difference of 95 milion pairs vs. 0.67 milion pairs requiring comparison.

  1. Minimal setup versus manual tuning

Results shown for BlockingPy can be obtained with just a few lines of code, e.g., blocklib’s p-sig algorithm requires manual setup of blocking features, filters, bloom-filter parameters and signature specifications, which could require significant time and effort to tune.

  1. Scalability

BlockingPy algorithms allow for n_threads selection and most algorithms allow for on-disk index building, where blocklib is missing both of these fetures.

Where is blocklib better

  1. Privacy preserving blocking

blocklib implements privacy preserving blocking algorithms, which are not available in BlockingPy.

  1. Time

blocklib finishes the blocking phase sooner, but the extra minutes that BlockingPy spends are quickly repaid in the matching phase.
In our benchmark (150k dataset) blocklib left ≈ 95 million candidate pairs, whereas BlockingPy left ≈ 0.67 million, that’s a ~140 × reduction.
Even though BlockingPy’s blocking step is ~5 × slower, the downstream classifier now has 140 × less work, so the end-to-end pipeline could still be faster, while achieving much higher recall (0.72 vs. 0.45).

Additionally, we can tune the voyager algorithm to achieve similar recall as blocklib’s algorithms. On those settings the time difference is only ~1.8x, while still getting ~47x less candidate pairs (95 million vs. 2.0 million) compared to λ-fold LSH.

Algorithm

Size

Time [s]

Recall

Reduction Ratio

Pairs (M)

BlockingPy (voyager) - fast

1500

0.215 ± 0.015

0.921726 ± 0.010369

0.996613 ± 0.000181

0.004 ± 0.000

BlockingPy (voyager) - fast

15000

2.495 ± 0.173

0.713395 ± 0.017146

0.999297 ± 0.000041

0.079 ± 0.005

BlockingPy (voyager) - fast

150000

34.395 ± 0.722

0.450416 ± 0.016316

0.999820 ± 0.000009

2.025 ± 0.102

Blockingpy with GPU acceleration

We also ran BlockingPy on GPU (ann="gpu_faiss"; index types: flat, ivf, cagra). On these datasets the GPU variants match the CPU back-ends on recall and reduction ratio, and the blocking step is faster. For presented dataset sizes, GPU gains are not clearly visible, however on larger datasets our GPU version might surpass blocklib’s algorithms in terms of speed, while still achieving much higher recall.

Algorithm

Size

Time [s]

Recall

Reduction Ratio

Pairs (M)

BlockingPy (gpu_faiss cagra)

1500

2.905 ± 0.327

0.959938 ± 0.000000

0.997499 ± 0.000000

0.003 ± 0.000

BlockingPy (gpu_faiss flat)

1500

0.898 ± 0.528

0.963020 ± 0.000000

0.997514 ± 0.000000

0.003 ± 0.000

BlockingPy (gpu_faiss ivf)

1500

0.908 ± 0.062

0.944992 ± 0.004238

0.997176 ± 0.000104

0.003 ± 0.000

BlockingPy (gpu_faiss cagra)

15000

6.835 ± 0.629

0.912838 ± 0.000294

0.999724 ± 0.000000

0.031 ± 0.000

BlockingPy (gpu_faiss flat)

15000

1.662 ± 0.165

0.913380 ± 0.000000

0.999724 ± 0.000000

0.031 ± 0.000

BlockingPy (gpu_faiss ivf)

15000

2.376 ± 0.244

0.866605 ± 0.003575

0.999627 ± 0.000006

0.042 ± 0.001

BlockingPy (gpu_faiss cagra)

150000

51.084 ± 0.802

0.826633 ± 0.000155

0.999966 ± 0.000000

0.380 ± 0.000

BlockingPy (gpu_faiss flat)

150000

22.511 ± 0.508

0.839217 ± 0.000000

0.999967 ± 0.000000

0.367 ± 0.000

BlockingPy (gpu_faiss ivf)

150000

71.458 ± 2.185

0.801427 ± 0.003905

0.999957 ± 0.000002

0.487 ± 0.021