blockingpy.faiss_blocker.FaissBlocker

class blockingpy.faiss_blocker.FaissBlocker[source]

A class for performing blocking using the FAISS (Facebook AI Similarity Search) library.

This class implements blocking functionality using Facebook’s FAISS library for efficient similarity search and nearest neighbor queries. It supports multiple distance metrics and is optimized for high-performance computing.

Parameters:

None

index

The FAISS index used for nearest neighbor search

Type:

faiss.Index

x_columns

Column names of the reference dataset

Type:

array-like or None

METRIC_MAP

Mapping of distance metric names to FAISS metric types

Type:

dict

See also

BlockingMethod

Abstract base class defining the blocking interface

faiss.Index

The underlying FAISS index implementation

Notes

The index types available from FAISS are ‘flat’, ‘hnsw’, and ‘lsh’:

  • ‘flat’ is a brute-force exact search (most accurate but slowest)

  • ‘hnsw’ is a Hierarchical Navigable Small World graph algorithm (good balance of speed and accuracy)

  • ‘lsh’ is a Locality Sensitive Hashing algorithm (fastest but approximate results)

For more details about the FAISS library and implementation, see: https://github.com/facebookresearch/faiss

Some distance metrics require special handling:

  • Cosine similarity is implemented through L2 normalization

  • Jensen-Shannon and Canberra metrics require smoothing to handle zero values

  • The selected distance metric has no effect when the ‘lsh’ index type is used
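The cosine case can be checked by hand: once both vectors are scaled to unit length, their inner product equals their cosine similarity, which is why ‘cosine’ can be served by an inner-product metric. The sketch below uses only the Python standard library; the helper names are hypothetical and are not part of blockingpy or FAISS.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from its definition."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Inner product of the normalized vectors ...
ip_after_normalizing = inner_product(l2_normalize(a), l2_normalize(b))
# ... matches the cosine similarity of the original vectors.
direct_cosine = cosine_similarity(a, b)

print(ip_after_normalizing, direct_cosine)
```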

FAISS does not support a random_seed parameter. Instead, reproducibility is handled internally by the library. For more details, see: https://gist.github.com/mdouze/1892178b5663b80e85ab076966c59c28

__init__()[source]

Initialize the FaissBlocker instance.

Creates a new FaissBlocker with an empty index.

Methods

__init__()

Initialize the FaissBlocker instance.

block(x, y, k, verbose, controls)

Perform blocking using the FAISS algorithm.

Attributes

METRIC_MAP

METRIC_MAP = {'bray_curtis': faiss.METRIC_BrayCurtis, 'canberra': faiss.METRIC_Canberra, 'cosine': faiss.METRIC_INNER_PRODUCT, 'euclidean': faiss.METRIC_L2, 'inner_product': faiss.METRIC_INNER_PRODUCT, 'jensen_shannon': faiss.METRIC_JensenShannon, 'l1': faiss.METRIC_L1, 'l2': faiss.METRIC_L2, 'linf': faiss.METRIC_Linf, 'manhattan': faiss.METRIC_L1}
block(x, y, k, verbose, controls)[source]

Perform blocking using the FAISS algorithm.

Parameters:
  • x (DataHandler) – Reference dataset containing features for indexing

  • y (DataHandler) – Query dataset to find nearest neighbors for

  • k (int) – Number of nearest neighbors to find

  • verbose (bool, optional) – If True, print detailed progress information

  • controls (dict) –

    Algorithm control parameters with the following structure:

    {
        'faiss': {
            'index_type': ['flat', 'hnsw', 'lsh'],
            'distance': str,
            'k_search': int,
            'path': str,
            'hnsw_M': int,
            'hnsw_ef_construction': int,
            'hnsw_ef_search': int,
            'lsh_nbits': int,  (gets multiplied by dimensions)
            'lsh_rotate_data': bool,
        }
    }
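As an illustration of the controls structure, a dictionary for an HNSW index might look like the sketch below. All values are made up for the example and are not documented defaults. The ‘path’ key is left out because its semantics are not described here, and whether omitted keys fall back to defaults is an assumption. The last lines show the arithmetic implied by the note that ‘lsh_nbits’ is multiplied by the number of dimensions, using a hypothetical feature count.

```python
# Illustrative values only; none of these are documented defaults.
controls = {
    "faiss": {
        "index_type": "hnsw",         # one of "flat", "hnsw", "lsh"
        "distance": "cosine",         # a key of FaissBlocker.METRIC_MAP
        "k_search": 30,
        "hnsw_M": 32,
        "hnsw_ef_construction": 200,
        "hnsw_ef_search": 64,
        "lsh_nbits": 2,               # only relevant when index_type is "lsh"
        "lsh_rotate_data": False,     # only relevant when index_type is "lsh"
    }
}

# Hypothetical dimensionality of the feature matrix.
n_features = 128

# Per the note on lsh_nbits, the value is multiplied by the number of dimensions.
effective_lsh_bits = controls["faiss"]["lsh_nbits"] * n_features

print(effective_lsh_bits)
```

This dictionary would be passed as the controls argument of block().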

Returns:

DataFrame containing the blocking results with columns:

  • ‘y’: indices from query dataset

  • ‘x’: indices of matched items from reference dataset

  • ‘dist’: distances to matched items

Return type:

pandas.DataFrame
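To make the shape of the result concrete, the following sketch runs an exact brute-force search (the idea behind the ‘flat’ index type) in pure Python and emits one record per match with the same three fields: ‘y’, ‘x’, and ‘dist’. It is not the FAISS implementation. The function name and data are hypothetical, plain Euclidean distance is assumed, and the distance values FAISS reports for a given metric may be scaled differently.

```python
import math


def flat_search(x_rows, y_rows, k):
    """Exact k-nearest-neighbour search: compare every y row to every x row."""
    results = []
    for y_idx, y_vec in enumerate(y_rows):
        # Distance from this query row to every reference row, closest first.
        scored = sorted(
            (math.dist(y_vec, x_vec), x_idx) for x_idx, x_vec in enumerate(x_rows)
        )
        for dist, x_idx in scored[:k]:
            results.append({"y": y_idx, "x": x_idx, "dist": dist})
    return results


# Hypothetical reference (x) and query (y) feature rows.
x_rows = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
y_rows = [[0.9, 0.1], [4.0, 4.0]]

rows = flat_search(x_rows, y_rows, k=1)
for row in rows:
    print(row)
```

Each record corresponds to one row of the returned DataFrame: a query index, the index of its matched reference item, and the distance between them.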

Notes

Special preprocessing is applied for certain metrics:

  • For cosine similarity, vectors are L2-normalized

  • For Jensen-Shannon and Canberra metrics, a small constant is added to prevent undefined values

  • For the LSH index, the distance calculation is determined by the hash function, not directly by the selected distance metric
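The need for smoothing can be seen with a small pure-Python sketch of the Jensen-Shannon divergence: a zero entry leads to log(0), which is undefined, while adding a small constant to every entry keeps the computation finite. The constant EPS and the helper names are assumptions made for illustration; the value and exact procedure used by blockingpy are not stated here.

```python
import math

# Hypothetical smoothing constant; the value blockingpy uses is not stated here.
EPS = 1e-10


def jensen_shannon(p, q):
    """Jensen-Shannon divergence with natural logarithms, no zero handling."""
    total = 0.0
    for pi, qi in zip(p, q):
        mi = 0.5 * (pi + qi)
        total += 0.5 * pi * math.log(pi / mi) + 0.5 * qi * math.log(qi / mi)
    return total


def smooth(v, eps=EPS):
    """Add a small constant to every entry so no entry is exactly zero."""
    return [c + eps for c in v]


p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

# Without smoothing, the zero entries trigger log(0).
try:
    jensen_shannon(p, q)
    failed_without_smoothing = False
except (ValueError, ZeroDivisionError):
    failed_without_smoothing = True

# With smoothing, the divergence is finite.
smoothed = jensen_shannon(smooth(p), smooth(q))

print(failed_without_smoothing, smoothed)
```

The Canberra distance has an analogous problem: a term whose two entries are both zero becomes 0/0.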