blockingpy.faiss_blocker.FaissBlocker
- class blockingpy.faiss_blocker.FaissBlocker
A class for performing blocking with the FAISS (Facebook AI Similarity Search) library.
This class implements blocking on top of FAISS nearest neighbor search. It supports several distance metrics (see METRIC_MAP) and three index types: 'flat', 'hnsw', and 'lsh'.
- Parameters:
None
- index
The FAISS index used for nearest neighbor search
- Type:
faiss.Index
- x_columns
Column names of the reference dataset
- Type:
array-like or None
- METRIC_MAP
Mapping of distance metric names to FAISS metric types
- Type:
dict
See also
BlockingMethod: Abstract base class defining the blocking interface
faiss.Index: The underlying FAISS index implementation
Notes
The available FAISS index types are 'flat', 'hnsw', and 'lsh':
- 'flat' is a brute-force exact search (most accurate but slowest)
- 'hnsw' is a Hierarchical Navigable Small World graph algorithm (good balance of speed and accuracy)
- 'lsh' is a Locality Sensitive Hashing algorithm (fastest but approximate results)
For more details about the FAISS library and implementation, see: https://github.com/facebookresearch/faiss
Some distance metrics require special handling:
- Cosine similarity is implemented through L2 normalization
- Jensen-Shannon and Canberra metrics require smoothing to handle zero values
- The selected distance metric has no effect when the 'lsh' index is used
FAISS does not support a random_seed parameter. Instead, it handles reproducibility inside the algorithm. For more details, see: https://gist.github.com/mdouze/1892178b5663b80e85ab076966c59c28
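The cosine handling noted above relies on a standard identity: once two vectors are scaled to unit L2 length, their inner product equals their cosine similarity. This is consistent with 'cosine' mapping to faiss.METRIC_INNER_PRODUCT in METRIC_MAP. The following is a minimal pure-Python sketch of that identity only. The helper names (l2_normalize, inner_product, cosine_similarity) are hypothetical and are not part of blockingpy or FAISS.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def cosine_similarity(a, b):
    """Reference definition: dot(a, b) / (|a| * |b|)."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product of the normalized vectors ...
ip_after_norm = inner_product(l2_normalize(a), l2_normalize(b))
# ... matches the cosine similarity of the original vectors.
reference = cosine_similarity(a, b)
```

Because of this identity, an inner-product index built on normalized vectors ranks neighbors by cosine similarity.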
- __init__()
Initialize the FaissBlocker instance.
Creates a new FaissBlocker with empty index.
Methods
__init__(): Initialize the FaissBlocker instance.
block(x, y, k, verbose, controls): Perform blocking using the FAISS algorithm.
Attributes
- METRIC_MAP = {'bray_curtis': faiss.METRIC_BrayCurtis, 'canberra': faiss.METRIC_Canberra, 'cosine': faiss.METRIC_INNER_PRODUCT, 'euclidean': faiss.METRIC_L2, 'inner_product': faiss.METRIC_INNER_PRODUCT, 'jensen_shannon': faiss.METRIC_JensenShannon, 'l1': faiss.METRIC_L1, 'l2': faiss.METRIC_L2, 'linf': faiss.METRIC_Linf, 'manhattan': faiss.METRIC_L1}
- block(x, y, k, verbose, controls)
Perform blocking using the FAISS algorithm.
- Parameters:
x (DataHandler) – Reference dataset containing features for indexing
y (DataHandler) – Query dataset to find nearest neighbors for
k (int) – Number of nearest neighbors to find
verbose (bool, optional) – If True, print detailed progress information
controls (dict) –
Algorithm control parameters with the following structure:
{
    'faiss': {
        'index_type': ['flat', 'hnsw', 'lsh'],
        'distance': str,
        'k_search': int,
        'path': str,
        'hnsw_M': int,
        'hnsw_ef_construction': int,
        'hnsw_ef_search': int,
        'lsh_nbits': int,  (gets multiplied by dimensions)
        'lsh_rotate_data': bool,
    }
}
- Returns:
DataFrame containing the blocking results with columns:
- 'y': indices from query dataset
- 'x': indices of matched items from reference dataset
- 'dist': distances to matched items
- Return type:
pandas.DataFrame
Notes
Special preprocessing is applied for certain metrics:
- For cosine similarity, vectors are L2-normalized
- For Jensen-Shannon and Canberra metrics, a small constant is added to prevent undefined values
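To see why smoothing matters, consider the Canberra distance, whose per-coordinate term is |x_i - y_i| / (|x_i| + |y_i|). When both coordinates are zero the term is 0/0, which is undefined. The sketch below shows the effect in pure Python. The constant 1e-10 and the helper names are assumptions chosen for illustration, not the values or functions blockingpy uses, and the example assumes non-negative features.

```python
def canberra(x, y):
    """Canberra distance; the i-th term is 0/0 when x[i] == y[i] == 0."""
    return sum(abs(p - q) / (abs(p) + abs(q)) for p, q in zip(x, y))


def smooth(vec, eps=1e-10):
    """Add a small constant to every coordinate (eps is an assumed value)."""
    return [v + eps for v in vec]


x = [0.0, 2.0, 1.0]
y = [0.0, 1.0, 1.0]

# Without smoothing, the first coordinate yields 0/0.
try:
    canberra(x, y)
    raw_is_defined = True
except ZeroDivisionError:
    raw_is_defined = False

# With smoothing, every denominator is strictly positive.
smoothed_distance = canberra(smooth(x), smooth(y))
```

In plain Python the undefined term surfaces as a ZeroDivisionError. In floating-point array code the same 0/0 would typically appear as NaN.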
For the LSH index, the distance calculation is determined by the hash function, not directly by the selected distance metric.
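The controls structure documented for block() might be written as an ordinary nested dict. The sketch below uses illustrative values that are not the library's defaults. It omits 'path' because its semantics are not described here, and whether omitted keys fall back to defaults is an assumption. check_faiss_controls and effective_lsh_bits are hypothetical helpers, not part of blockingpy. The latter only restates the note that 'lsh_nbits' is multiplied by the number of dimensions.

```python
# Illustrative values only; these are NOT documented defaults.
controls = {
    "faiss": {
        "index_type": "hnsw",
        "distance": "cosine",
        "k_search": 30,
        "hnsw_M": 32,
        "hnsw_ef_construction": 200,
        "hnsw_ef_search": 64,
        "lsh_nbits": 2,
        "lsh_rotate_data": True,
    }
}

VALID_INDEX_TYPES = ("flat", "hnsw", "lsh")
# Keys of METRIC_MAP as listed in the class attributes.
VALID_DISTANCES = (
    "bray_curtis", "canberra", "cosine", "euclidean", "inner_product",
    "jensen_shannon", "l1", "l2", "linf", "manhattan",
)


def check_faiss_controls(ctrl):
    """Hypothetical helper: return a list of problems found in the dict."""
    problems = []
    faiss_ctrl = ctrl.get("faiss", {})
    if faiss_ctrl.get("index_type") not in VALID_INDEX_TYPES:
        problems.append("index_type must be one of %s" % (VALID_INDEX_TYPES,))
    if faiss_ctrl.get("distance") not in VALID_DISTANCES:
        problems.append("distance must be a key of METRIC_MAP")
    return problems


def effective_lsh_bits(lsh_nbits, n_dims):
    """Hash length implied by 'lsh_nbits' being multiplied by dimensions."""
    return lsh_nbits * n_dims


problems = check_faiss_controls(controls)
bits_for_16_dims = effective_lsh_bits(controls["faiss"]["lsh_nbits"], 16)
```

Validating the dict before calling block() keeps typos in 'index_type' or 'distance' from surfacing later as less obvious errors.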