blockingpy.hnsw_blocker.HNSWBlocker
- class blockingpy.hnsw_blocker.HNSWBlocker
A class for performing blocking using the Hierarchical Navigable Small World (HNSW) algorithm.
It builds an HNSW index to run efficient similarity search and nearest neighbor queries.
- Parameters:
None
- index
The HNSW index used for nearest neighbor search
- Type:
hnswlib.Index or None
- x_columns
Column names of the reference dataset
- Type:
array-like or None
- SPACE_MAP
Mapping of distance metric names to their HNSW implementations
- Type:
dict
See also
BlockingMethod: Abstract base class defining the blocking interface
Notes
For more details about the HNSW algorithm, see: https://github.com/nmslib/hnswlib
- __init__()
Initialize the HNSWBlocker instance.
Creates a new HNSWBlocker with an empty index.
Methods
__init__(): Initialize the HNSWBlocker instance.
block(x, y, k, verbose, controls): Perform blocking using the HNSW algorithm.
Attributes
- SPACE_MAP = {'cosine': 'cosine', 'euclidean': 'l2', 'ip': 'ip', 'l2': 'l2'}
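The mapping above suggests how a user-facing distance name could be translated into an hnswlib space name. The following is a minimal sketch of that lookup, not the library's own code: `resolve_space` is a hypothetical helper introduced here for illustration, and the dictionary is copied from the attribute shown above.

```python
# Copy of the SPACE_MAP attribute documented above.
SPACE_MAP = {"cosine": "cosine", "euclidean": "l2", "ip": "ip", "l2": "l2"}


def resolve_space(distance):
    """Hypothetical helper (not part of blockingpy): map a distance name to an hnswlib space name."""
    try:
        return SPACE_MAP[distance]
    except KeyError:
        valid = ", ".join(sorted(SPACE_MAP))
        raise ValueError(f"Unknown distance {distance!r}; expected one of: {valid}") from None


# 'euclidean' and 'l2' are two names for the same hnswlib space.
print(resolve_space("euclidean"))  # l2
print(resolve_space("cosine"))     # cosine
```

Failing early on an unknown name keeps the error close to the user's input rather than surfacing later, when the index is created.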
- block(x, y, k, verbose, controls)
Perform blocking using the HNSW algorithm.
- Parameters:
x (pandas.DataFrame) – Reference dataset containing features for indexing
y (pandas.DataFrame) – Query dataset to find nearest neighbors for
k (int) – Number of nearest neighbors to find. If k is larger than the number of reference points, it is automatically reduced
verbose (bool, optional) – If True, print detailed progress information
controls (dict) – Algorithm control parameters with the following structure:
    {
        'random_seed': int,
        'hnsw': {
            'k_search': int,
            'distance': str,
            'n_threads': int,
            'path': str,
            'ef_c': int,
            'ef_s': int,
            'M': int,
        }
    }
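As an illustration of the structure above, the sketch below builds a controls dictionary and checks it against the documented types. Several things here are assumptions: the values are illustrative rather than the library's defaults, the path string is a placeholder, `check_controls` is a hypothetical helper that is not part of blockingpy, and the sketch treats every key as optional.

```python
# Expected types for the 'hnsw' entries, transcribed from the structure documented above.
HNSW_TYPES = {
    "k_search": int,
    "distance": str,
    "n_threads": int,
    "path": str,
    "ef_c": int,
    "ef_s": int,
    "M": int,
}

# Illustrative values only; these are NOT the library's documented defaults.
controls = {
    "random_seed": 42,
    "hnsw": {
        "k_search": 30,
        "distance": "cosine",   # one of the SPACE_MAP keys
        "n_threads": 1,
        "path": "hnsw_index",   # placeholder path, chosen for this example
        "ef_c": 200,
        "ef_s": 200,
        "M": 16,
    },
}


def check_controls(controls):
    """Hypothetical helper: return a list of problems found in a controls dict."""
    problems = []
    if "random_seed" in controls and not isinstance(controls["random_seed"], int):
        problems.append("random_seed should be int")
    for key, value in controls.get("hnsw", {}).items():
        expected = HNSW_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown hnsw key: {key}")
        elif not isinstance(value, expected):
            problems.append(f"hnsw.{key} should be {expected.__name__}")
    return problems


print(check_controls(controls))  # an empty list means no problems were found
```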
- Returns:
DataFrame containing the blocking results with columns:
- 'y': indices from query dataset
- 'x': indices of matched items from reference dataset
- 'dist': distances to matched items
- Return type:
pandas.DataFrame
Notes
The method builds an HNSW index from the reference dataset and finds the k nearest neighbors for each point in the query dataset. The index parameters ef_c (construction) and ef_s (search) control the trade-off between accuracy and speed: larger values generally improve recall but slow down index construction and search, respectively.
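To make the shape of the result concrete, the sketch below produces rows with the same 'y', 'x' and 'dist' fields using an exact brute-force search. This is deliberately a stand-in for the HNSW index, not the library's implementation: `brute_force_block` is a hypothetical function, it works on plain tuples rather than DataFrames, it assumes that an oversized k is capped at the number of reference points, and its plain Euclidean distances need not match the values the library reports.

```python
import math


def brute_force_block(x, y, k):
    """Exact k-NN stand-in for the HNSW search; rows mirror the documented output columns."""
    # Assumption: an oversized k is capped at the number of reference points.
    k = min(k, len(x))
    rows = []
    for y_idx, query in enumerate(y):
        # Distance from this query point to every reference point, nearest first.
        ranked = sorted((math.dist(query, ref), x_idx) for x_idx, ref in enumerate(x))
        for dist, x_idx in ranked[:k]:
            rows.append({"y": y_idx, "x": x_idx, "dist": dist})
    return rows


x = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]  # reference points
y = [(0.9, 0.1), (4.0, 4.0)]              # query points

for row in brute_force_block(x, y, k=1):
    print(row)
```

An approximate index such as HNSW returns the same kind of table but may occasionally miss a true nearest neighbor, which is the accuracy side of the trade-off that ef_c and ef_s tune.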