blockingpy.voyager_blocker.VoyagerBlocker

class blockingpy.voyager_blocker.VoyagerBlocker[source]

A class for performing blocking using the Voyager algorithm from Spotify.

This class implements blocking using Spotify's Voyager algorithm for efficient approximate nearest neighbor search. It supports the cosine, Euclidean, and inner-product distance metrics (see METRIC_MAP) and is designed for high-dimensional data.

Parameters:

None

index

The Voyager index used for nearest neighbor search

Type:

voyager.Index or None

x_columns

Column names of the reference dataset

Type:

array-like or None

METRIC_MAP

Mapping of distance metric names to Voyager Space types

Type:

dict

See also

BlockingMethod

Abstract base class defining the blocking interface

voyager.Index

The underlying Voyager index implementation

Raises:

ValueError – If the path control parameter is provided but is not valid

Notes

For more details about the Voyager algorithm and implementation, see: https://github.com/spotify/voyager

__init__()[source]

Initialize the VoyagerBlocker instance.

Creates a new VoyagerBlocker with an empty index.

Methods

__init__()

Initialize the VoyagerBlocker instance.

block(x, y, k, verbose, controls)

Perform blocking using the Voyager algorithm.

Attributes

METRIC_MAP

METRIC_MAP = {'cosine': voyager.Space.Cosine, 'euclidean': voyager.Space.Euclidean, 'inner_product': voyager.Space.InnerProduct}

block(x, y, k, verbose, controls)[source]

Perform blocking using the Voyager algorithm.

Parameters:
  • x (DataHandler) – Reference dataset containing features for indexing

  • y (DataHandler) – Query dataset to find nearest neighbors for

  • k (int) – Number of nearest neighbors to find

  • verbose (bool, optional) – If True, print detailed progress information

  • controls (dict) –

    Algorithm control parameters with the following structure:

        {
            'random_seed': int,
            'voyager': {
                'distance': str,
                'k_search': int,
                'path': str,
                'M': int,
                'ef_construction': int,
                'max_elements': int,
                'num_threads': int,
                'query_ef': int
            }
        }
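
As a minimal sketch of the structure above, a controls dictionary could be assembled as follows. The values are illustrative only and are not the library's defaults, the 'path' key is left out on the assumption that it is optional, and check_distance is a hypothetical helper written for this example, not part of blockingpy. The allowed distance names are taken from the keys of METRIC_MAP.

```python
# Distance names documented as the keys of METRIC_MAP.
ALLOWED_DISTANCES = {"cosine", "euclidean", "inner_product"}

# Illustrative values only; they are not taken from the library's defaults.
controls = {
    "random_seed": 2025,
    "voyager": {
        "distance": "cosine",
        "k_search": 30,
        "M": 12,
        "ef_construction": 200,
        "max_elements": 1,
        "num_threads": -1,
        "query_ef": -1,
        # "path" (str) is omitted here, assuming it is optional.
    },
}


def check_distance(controls: dict) -> str:
    """Hypothetical helper: return the distance name or raise ValueError."""
    name = controls["voyager"]["distance"]
    if name not in ALLOWED_DISTANCES:
        raise ValueError(
            f"Unknown distance {name!r}; expected one of {sorted(ALLOWED_DISTANCES)}"
        )
    return name


print(check_distance(controls))
```

Checking the distance name before building an index gives an early, readable error instead of a failed lookup later on.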

Returns:

DataFrame containing the blocking results with columns:

  • 'y': indices from query dataset

  • 'x': indices of matched items from reference dataset

  • 'dist': distances to matched items

Return type:

pandas.DataFrame
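
To make the column layout concrete, the sketch below flattens per-query neighbor lists into one record per (query, match) pair with the keys 'y', 'x', and 'dist'. It uses plain Python lists with made-up ids and distances; to_long_format is a hypothetical helper for illustration and is not how the library necessarily builds its DataFrame.

```python
def to_long_format(neighbor_ids, distances):
    """Hypothetical helper: one record per (query, matched reference) pair."""
    rows = []
    for y_idx, (ids, dists) in enumerate(zip(neighbor_ids, distances)):
        for x_idx, dist in zip(ids, dists):
            rows.append({"y": y_idx, "x": x_idx, "dist": dist})
    return rows


# Made-up search output for two query records with two neighbors each.
neighbor_ids = [[4, 7], [2, 4]]
distances = [[0.1, 0.35], [0.0, 0.5]]

rows = to_long_format(neighbor_ids, distances)
for row in rows:
    print(row)
```

A list of records in this shape can be passed directly to pandas.DataFrame if a tabular result is needed.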

Notes

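The algorithm uses a graph-based approach for approximate nearest neighbor search. The quality of the approximation can be controlled through parameters such as ef_construction (used while building the index) and query_ef (used at query time); larger values generally improve recall at the cost of longer build or search times.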