Deduplication on GPU (adaptation of Example No. 2)
This example reproduces the Deduplication No. 2 walkthrough using the GPU build of BlockingPy.
The GPU version (blockingpy-gpu) accelerates blocking with FAISS-GPU, which can give significant speedups on large datasets.
Installation
FAISS-GPU cannot be installed directly from PyPI wheels, so the installation uses conda/mamba for FAISS and pip for BlockingPy:
# 1) Create environment
mamba create -n blockingpy-gpu python=3.10 -y
conda activate blockingpy-gpu
# 2) Install FAISS GPU (nightly cuVS build); this exact build is the one that was tested
mamba install -c pytorch/label/nightly \
    faiss-gpu-cuvs=1.11.0=py3.10_ha3bacd1_55_cuda12.4.0_nightly -y
# 3) Install BlockingPy GPU package
pip install blockingpy-gpu
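Before running the example, it can help to confirm that the FAISS build in the new environment actually sees a GPU. The check below is a minimal sketch, not part of BlockingPy: `count_faiss_gpus` is a hypothetical helper name, and it assumes the installed FAISS build exposes `faiss.get_num_gpus()` (it falls back to `None` when the module or the function is missing).

```python
try:
    import faiss
except ImportError:
    faiss = None  # FAISS is not installed in the active environment


def count_faiss_gpus():
    """Return the number of GPUs FAISS reports, or None if it cannot be determined."""
    if faiss is None:
        return None
    # CPU-only builds may not ship the GPU helpers, so look the function up defensively.
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    if get_num_gpus is None:
        return None
    return get_num_gpus()


n_gpus = count_faiss_gpus()
if n_gpus:
    print(f"FAISS sees {n_gpus} GPU(s)")
else:
    print("FAISS GPU support not detected; check the conda environment")
```

If the second message is printed, the most likely causes are that the wrong environment is active or that a CPU-only FAISS package was installed.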
Data preparation
First, we prepare the dataset:
import pandas as pd
from blockingpy import Blocker
from blockingpy.datasets import load_deduplication_data
data = load_deduplication_data()
data = data.fillna('')
data[['by', 'bm', 'bd']] = data[['by', 'bm', 'bd']].astype(str)
data['txt'] = (
    data['fname_c1'] +
    data['fname_c2'] +
    data['lname_c1'] +
    data['lname_c2'] +
    data['by'] +
    data['bm'] +
    data['bd']
)
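The concatenated txt string is the only input passed to the blocker, so it is worth seeing how such a string can become numeric features. The sketch below assumes that text is encoded as overlapping character 2-grams (shingles), which is our understanding of BlockingPy's default text encoder rather than something stated in this example; `char_shingles` and the sample record are made up for illustration. Under that assumption, the "Number of columns created for blocking" reported in the summary further down would be the number of distinct shingles across all records.

```python
from collections import Counter


def char_shingles(text, n=2):
    """Count overlapping character n-grams (shingles) in a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Made-up value in the shape of the 'txt' column (names + birth year/month/day).
record = "johnsmith19850312"
shingles = char_shingles(record)

print(sorted(shingles))        # the distinct 2-grams found in the record
print(sum(shingles.values()))  # total number of 2-grams in the record
```

Two records that differ by a single typo still share most of their shingles, which is why this kind of encoding suits approximate nearest neighbour search on noisy names.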
Deduplication
Now, we can deduplicate the dataset using ann='gpu_faiss':
blocker = Blocker()
dedup_result = blocker.block(
    x=data['txt'],
    ann='gpu_faiss',
    verbose=1,
    random_seed=42,
)
print(dedup_result)
# ========================================================
# Blocking based on the gpu_faiss method.
# Number of blocks: 2737
# Number of columns created for blocking: 674
# Reduction ratio: 0.999583
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
# 2 | 965
# 3 | 722
# 4 | 421
# 5 | 247
# 6 | 140
# 7 | 98
# 8 | 49
# 9 | 35
# 10 | 26
# 11 | 13
# 12 | 7
# 13 | 3
# 14 | 4
# 15 | 1
# 16 | 1
# 17 | 1
# 18 | 2
# 20 | 1
# 66 | 1
print(dedup_result.result.head())
# x y block dist
# 0 3402 0 0 0.128420
# 1 1179 1 1 0.165676
# 2 2457 2 2 0.104868
# 3 1956 3 3 0.042670
# 4 4448 4 4 0.187500
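The reported reduction ratio can be checked by hand from the block-size distribution printed above. The sketch below assumes the usual definition, one minus the number of within-block candidate pairs divided by the number of all possible record pairs; that definition is our assumption, but it reproduces the printed value.

```python
from math import comb

# Block size -> number of blocks, copied from the printed summary above.
block_sizes = {
    2: 965, 3: 722, 4: 421, 5: 247, 6: 140, 7: 98, 8: 49, 9: 35, 10: 26,
    11: 13, 12: 7, 13: 3, 14: 4, 15: 1, 16: 1, 17: 1, 18: 2, 20: 1, 66: 1,
}

n_blocks = sum(block_sizes.values())
n_records = sum(size * count for size, count in block_sizes.items())

# Pairs that still have to be compared: every pair inside each block.
candidate_pairs = sum(comb(size, 2) * count for size, count in block_sizes.items())
# Pairs a naive all-against-all comparison would need.
total_pairs = comb(n_records, 2)

reduction_ratio = 1 - candidate_pairs / total_pairs

print(n_blocks, n_records, candidate_pairs)
# 2737 10000 20864
print(round(reduction_ratio, 6))
# 0.999583
```

In other words, blocking leaves 20,864 candidate pairs to compare instead of the 49,995,000 pairs among 10,000 records.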
Customizing GPU Index with control_ann
We can customize gpu_faiss through the control_ann dict. Here we set the index type to cagra:
gpu_controls = {
    "gpu_faiss": {
        "index_type": "cagra",  # flat, ivf, ivfpq, cagra
        "distance": "cosine",
        # here you can tweak the parameters of CAGRA and others.
    }
}
blocker = Blocker()
dedup_result = blocker.block(
    x=data['txt'],
    ann='gpu_faiss',
    control_ann=gpu_controls,
    verbose=1,
    random_seed=42,
)
Evaluation
Now we can evaluate the algorithm. For that, we first prepare the true_blocks ground-truth DataFrame:
df_eval = data.copy()
df_eval['block'] = df_eval['true_id']
df_eval['x'] = range(len(df_eval))
true_blocks_dedup = df_eval[['x', 'block']]
Then we run the blocker again, passing true_blocks, to evaluate it:
blocker = Blocker()
eval_result = blocker.block(
    x=df_eval['txt'],
    ann='gpu_faiss',
    true_blocks=true_blocks_dedup,
    control_ann=gpu_controls,  # evaluation with chosen GPU index
    verbose=1,
    random_seed=42,
)
print(eval_result.reduction_ratio)
# 0.9995822182218221
print(eval_result.metrics)
# recall 1.000000
# precision 0.047895
# fpr 0.000398
# fnr 0.000000
# accuracy 0.999602
# specificity 0.999602
# f1_score 0.091412
# dtype: float64
print(eval_result.confusion)
# Predicted Positive Predicted Negative
# Actual Positive 1000 0
# Actual Negative 19879 49974121
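The values in eval_result.metrics follow directly from the four counts in the confusion matrix. The snippet below recomputes them with the standard textbook formulas; it is a check on the printed numbers, not BlockingPy's internal code.

```python
# Counts copied from the confusion matrix above (pairs of records).
tp, fn = 1000, 0
fp, tn = 19879, 49974121

recall = tp / (tp + fn)
precision = tp / (tp + fp)
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)
f1_score = 2 * precision * recall / (precision + recall)

metrics = {
    "recall": recall,
    "precision": precision,
    "fpr": fpr,
    "fnr": fnr,
    "accuracy": accuracy,
    "specificity": specificity,
    "f1_score": f1_score,
}
for name, value in metrics.items():
    print(f"{name:<12}{value:.6f}")
```

The four counts add up to 49,995,000, the number of distinct pairs among 10,000 records, so every possible pair is classified exactly once.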
When evaluated against the ground truth, CAGRA achieves very high recall (100% on this dataset), meaning no true duplicate pairs are lost, while still reaching an excellent reduction ratio (about 99.96%).
CAGRA is conceptually similar to HNSW, since both are graph-based ANN algorithms, but unlike HNSW, which runs on the CPU, CAGRA is fully GPU-optimized, allowing much higher throughput on large, high-dimensional datasets.
Note: this dataset is too small to demonstrate the speed advantage; the benefits of CAGRA become clear on larger inputs where GPU parallelism matters.
We encourage you to try blockingpy-gpu yourself!