blockingpy.hnsw_blocker.HNSWBlocker

class blockingpy.hnsw_blocker.HNSWBlocker[source]

A class for performing blocking using the Hierarchical Navigable Small World (HNSW) algorithm.

It builds an HNSW index over a reference dataset and queries that index to find the nearest neighbors of each record in a query dataset.

Parameters: None

index

The HNSW index used for nearest neighbor search

Type: hnswlib.Index or None

x_columns

Column names of the reference dataset

Type: array-like or None

SPACE_MAP

Mapping of distance metric names to their HNSW implementations

Type: dict

See also

BlockingMethod: Abstract base class defining the blocking interface

Notes

For more details about the HNSW algorithm, see: https://github.com/nmslib/hnswlib

__init__()[source]

Initialize the HNSWBlocker instance.

Creates a new HNSWBlocker with an empty index.

Methods

`__init__`()	Initialize the HNSWBlocker instance.
`block`(x, y, k, verbose, controls)	Perform blocking using the HNSW algorithm.

Attributes

SPACE_MAP

SPACE_MAP = {'cosine': 'cosine', 'euclidean': 'l2', 'ip': 'ip', 'l2': 'l2'}
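The mapping above translates user-facing metric names into the space names passed to hnswlib. A minimal sketch of such a lookup might look like the following. The dictionary is copied from the attribute value shown above; `resolve_space` is a hypothetical helper written for illustration and is not part of blockingpy.

```python
# Copied from the documented SPACE_MAP attribute value.
SPACE_MAP = {"cosine": "cosine", "euclidean": "l2", "ip": "ip", "l2": "l2"}


def resolve_space(distance: str) -> str:
    """Hypothetical helper: translate a metric name into an hnswlib space name."""
    try:
        return SPACE_MAP[distance]
    except KeyError:
        supported = ", ".join(sorted(SPACE_MAP))
        raise ValueError(
            f"Unsupported distance {distance!r}; expected one of: {supported}"
        ) from None


# 'euclidean' and 'l2' are two names for the same hnswlib space.
print(resolve_space("euclidean"))
print(resolve_space("l2"))
```

Failing early on an unknown metric name, as this sketch does, keeps the error close to the user's input instead of surfacing later inside the index constructor.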

block(x, y, k, verbose, controls)[source]

Perform blocking using the HNSW algorithm.

Parameters:

x (pandas.DataFrame) – Reference dataset containing features for indexing
y (pandas.DataFrame) – Query dataset to find nearest neighbors for
k (int) – Number of nearest neighbors to find. If k is larger than the number of reference points, it will be automatically adjusted
verbose (bool, optional) – If True, print detailed progress information
controls (dict) – Algorithm control parameters with the following structure:

    {
        'random_seed': int,
        'hnsw': {
            'k_search': int,
            'distance': str,
            'n_threads': int,
            'path': str,
            'ef_c': int,
            'ef_s': int,
            'M': int,
        }
    }
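As an illustration of the controls structure, the sketch below builds one such dictionary and checks it against the documented key types. All values are illustrative and are not the library's defaults; `check_controls` and `HNSW_TYPES` are hypothetical names introduced here, not part of blockingpy.

```python
# Illustrative values only; these are NOT the library's defaults.
controls = {
    "random_seed": 42,
    "hnsw": {
        "k_search": 30,
        "distance": "cosine",
        "n_threads": 1,
        "path": "hnsw_index",  # hypothetical value
        "ef_c": 200,
        "ef_s": 200,
        "M": 25,
    },
}

# Expected types, taken from the structure documented above.
HNSW_TYPES = {
    "k_search": int,
    "distance": str,
    "n_threads": int,
    "path": str,
    "ef_c": int,
    "ef_s": int,
    "M": int,
}


def check_controls(controls: dict) -> list:
    """Hypothetical helper: return a list of problems found in `controls`."""
    problems = []
    if "random_seed" in controls and not isinstance(controls["random_seed"], int):
        problems.append("random_seed must be int")
    for key, value in controls.get("hnsw", {}).items():
        expected = HNSW_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown hnsw option: {key}")
        elif not isinstance(value, expected):
            problems.append(f"hnsw.{key} must be {expected.__name__}")
    return problems


print(check_controls(controls))
```

A dictionary like this would then be passed as the `controls` argument of `block`.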

Returns:

DataFrame containing the blocking results with columns:

- 'y': indices from query dataset
- 'x': indices of matched items from reference dataset
- 'dist': distances to matched items

Return type:

pandas.DataFrame

Notes

The method builds an HNSW index from the reference dataset and finds the k nearest neighbors of each point in the query dataset. The index parameters ef_c (construction) and ef_s (search) control the trade-off between search accuracy and speed: larger values generally improve recall at the cost of slower index construction or querying.
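To make the shape of the result concrete, here is a small exact brute-force k-nearest-neighbor baseline in plain Python. It is deliberately not HNSW and does not use blockingpy or hnswlib: it only sketches what "find the k nearest reference rows for each query row" produces, using the documented 'y', 'x' and 'dist' fields. The function name `brute_force_block`, the use of plain Euclidean distance, and the rule of clamping k to the number of reference rows are assumptions made for illustration.

```python
import math


def brute_force_block(x_rows, y_rows, k):
    """Exact k-NN baseline (not HNSW): k closest reference rows per query row."""
    # Assumed adjustment rule: k cannot exceed the number of reference rows.
    k = min(k, len(x_rows))
    results = []
    for y_idx, query in enumerate(y_rows):
        # Euclidean distance from this query row to every reference row.
        dists = [(math.dist(query, ref), x_idx) for x_idx, ref in enumerate(x_rows)]
        # Keep the k smallest distances, closest first.
        for dist, x_idx in sorted(dists)[:k]:
            results.append({"y": y_idx, "x": x_idx, "dist": dist})
    return results


reference = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
queries = [(0.9, 0.1), (4.0, 4.0)]

for row in brute_force_block(reference, queries, k=2):
    print(row)
```

This baseline compares every query row with every reference row, so its cost grows with the product of the two dataset sizes. An HNSW index avoids that exhaustive scan by searching a graph of neighbors, returning approximate results whose quality is governed by parameters such as ef_c and ef_s.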