blockingpy.hnsw_blocker.HNSWBlocker

class blockingpy.hnsw_blocker.HNSWBlocker

A class for performing blocking using the Hierarchical Navigable Small World (HNSW) algorithm.

This class builds an HNSW index over a reference dataset and queries it for approximate nearest neighbors, which keeps similarity search efficient on large inputs.

Parameters:

None

index

The HNSW index used for nearest neighbor search

Type:

hnswlib.Index or None

x_columns

Column names of the reference dataset

Type:

array-like or None

SPACE_MAP

Mapping of distance metric names to their HNSW implementations

Type:

dict

See also

BlockingMethod

Abstract base class defining the blocking interface

Notes

For more details about the HNSW algorithm, see: https://github.com/nmslib/hnswlib

__init__()

Initialize the HNSWBlocker instance.

Creates a new HNSWBlocker with an empty index.

Methods

__init__()

Initialize the HNSWBlocker instance.

block(x, y, k, verbose, controls)

Perform blocking using the HNSW algorithm.

Attributes

SPACE_MAP

SPACE_MAP = {'cosine': 'cosine', 'euclidean': 'l2', 'ip': 'ip', 'l2': 'l2'}
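
A minimal sketch of how a mapping like this can translate a user-facing metric name into the space name passed to hnswlib. The helper `resolve_space` is hypothetical and not part of blockingpy; only the dictionary contents are taken from the attribute above.

```python
# Copy of the mapping documented above.
SPACE_MAP = {"cosine": "cosine", "euclidean": "l2", "ip": "ip", "l2": "l2"}


def resolve_space(distance: str) -> str:
    """Hypothetical helper: translate a metric name into an hnswlib space name."""
    try:
        return SPACE_MAP[distance]
    except KeyError:
        valid = ", ".join(sorted(SPACE_MAP))
        raise ValueError(
            f"Unknown distance {distance!r}; expected one of: {valid}"
        ) from None


print(resolve_space("euclidean"))  # prints: l2
```

Note that 'euclidean' and 'l2' resolve to the same space, so either name selects the same metric.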

block(x, y, k, verbose, controls)

Perform blocking using the HNSW algorithm.

Parameters:
  • x (pandas.DataFrame) – Reference dataset containing features for indexing

  • y (pandas.DataFrame) – Query dataset to find nearest neighbors for

  • k (int) – Number of nearest neighbors to find. If k is larger than the number of reference points, it will be automatically adjusted

  • verbose (bool, optional) – If True, print detailed progress information

  • controls (dict) –

    Algorithm control parameters with the following structure:

    {
        'random_seed': int,
        'hnsw': {
            'k_search': int,
            'distance': str,
            'n_threads': int,
            'path': str,
            'ef_c': int,
            'ef_s': int,
            'M': int,
        }
    }
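
As an illustration of the structure above, the sketch below builds a controls dictionary and checks it against the documented key names and types. The helper `check_controls` and the constant `HNSW_CONTROL_TYPES` are hypothetical and not part of blockingpy. The values are illustrative, not the library's defaults. Whether keys may be omitted is not stated above, so the helper only checks the keys that are present.

```python
# Key names and types copied from the documented structure.
HNSW_CONTROL_TYPES = {
    "k_search": int,
    "distance": str,
    "n_threads": int,
    "path": str,
    "ef_c": int,
    "ef_s": int,
    "M": int,
}


def check_controls(controls: dict) -> None:
    """Hypothetical helper: raise if a present key is unknown or has the wrong type."""
    seed = controls.get("random_seed")
    if seed is not None and not isinstance(seed, int):
        raise TypeError("'random_seed' must be int")
    for key, value in controls.get("hnsw", {}).items():
        expected = HNSW_CONTROL_TYPES.get(key)
        if expected is None:
            raise KeyError(f"Unexpected hnsw control: {key!r}")
        if not isinstance(value, expected):
            raise TypeError(f"'{key}' must be {expected.__name__}")


# Illustrative values only; they are not the library's defaults.
controls = {
    "random_seed": 42,
    "hnsw": {
        "distance": "cosine",  # assumed to be one of the SPACE_MAP keys
        "n_threads": 1,
        "ef_c": 200,  # construction-time parameter
        "ef_s": 200,  # search-time parameter
        "M": 25,
    },
}
check_controls(controls)  # passes silently when the structure matches
```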

Returns:

DataFrame containing the blocking results with columns:

  • 'y' – indices from the query dataset

  • 'x' – indices of matched items from the reference dataset

  • 'dist' – distances to matched items

Return type:

pandas.DataFrame
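
To make the result layout concrete, here is a small sketch that produces rows with the same three fields using an exact brute-force search. It is a stand-in for illustration, not the HNSW algorithm and not the library's implementation. The function `brute_force_block` is hypothetical, capping k at the number of reference points is an assumption about what "automatically adjusted" means, and the number of rows the real method emits per query is not specified here. The real distance values also depend on the configured metric.

```python
import math


def brute_force_block(x_rows, y_rows, k):
    """Exact k-NN stand-in (not HNSW): one row per (query, neighbor) pair."""
    # Assumed adjustment: cap k at the number of reference points.
    k = min(k, len(x_rows))
    results = []
    for y_idx, y_vec in enumerate(y_rows):
        # Euclidean distance from this query point to every reference point.
        dists = sorted(
            (math.dist(y_vec, x_vec), x_idx) for x_idx, x_vec in enumerate(x_rows)
        )
        for dist, x_idx in dists[:k]:
            results.append({"y": y_idx, "x": x_idx, "dist": dist})
    return results


x_rows = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]  # reference points
y_rows = [(0.9, 0.1), (4.0, 4.0)]  # query points

for row in brute_force_block(x_rows, y_rows, k=1):
    print(row)
```

With these inputs, query 0 is matched to reference 1 and query 1 to reference 2. An approximate index such as HNSW aims to return the same neighbors without comparing every pair.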

Notes

The method builds an HNSW index from the reference dataset and finds the k nearest neighbors for each point in the query dataset. The index parameters ef_c (used during index construction) and ef_s (used during search) control the trade-off between accuracy and speed: larger values generally improve recall at the cost of longer build or query times.