blockingpy.mlpack_blocker.MLPackBlocker

class blockingpy.mlpack_blocker.MLPackBlocker

A class for performing blocking using MLPack algorithms (LSH or k-d tree).

This class implements blocking functionality using either Locality-Sensitive Hashing (LSH) or k-d tree algorithms from the MLPack library for efficient similarity search and nearest neighbor queries.

Parameters:

None

algo

The selected algorithm (‘lsh’ or ‘kd’)

Type:

str or None

ALGO_MAP

Mapping of algorithm names to their MLPack implementations

Type:

dict

See also

BlockingMethod

Abstract base class defining the blocking interface

Notes

For more details about the MLPack library and its algorithms, see: https://github.com/mlpack

__init__()

Initialize the MLPackBlocker instance.

Creates a new MLPackBlocker with no algorithm selected.

Methods

__init__()

Initialize the MLPackBlocker instance.

block(x, y, k, verbose, controls)

Perform blocking using an MLPack algorithm (LSH or k-d tree).

block(x, y, k, verbose, controls)

Perform blocking using an MLPack algorithm (LSH or k-d tree).

Parameters:
  • x (DataHandler) – Reference dataset containing features for indexing

  • y (DataHandler) – Query dataset to find nearest neighbors for

  • k (int) – Number of nearest neighbors to find

  • verbose (bool, optional) – If True, print detailed progress information

  • controls (dict) –

    Algorithm control parameters with the following structure:

    {
        'random_seed': int,
        'algo': str,  # 'lsh' or 'kd'
        'lsh': {  # if using LSH
            'k_search': int,
            'bucket_size': int,
            'hash_width': float,
            'num_probes': int,
            'projections': int,
            'tables': int
        },
        'kd': {  # if using k-d tree
            'k_search': int,
            'algorithm': str,
            'leaf_size': int,
            'tree_type': str,
            'epsilon': float,
            'rho': float,
            'tau': float,
            'random_basis': bool
        }
    }
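As a rough illustration of the structure above, a controls dictionary for an LSH run might be written as follows. Every value is a placeholder chosen for the example, not a documented default. The example supplies only the 'lsh' sub-dictionary on the assumption that the 'kd' one is not needed when 'algo' is 'lsh', which this page does not confirm.

```python
# Illustrative controls dictionary for an LSH run.
# All values are placeholders, not documented defaults of MLPackBlocker.
controls = {
    "random_seed": 42,
    "algo": "lsh",  # 'lsh' or 'kd'
    "lsh": {
        "k_search": 5,
        "bucket_size": 500,
        "hash_width": 10.0,
        "num_probes": 0,
        "projections": 10,
        "tables": 30,
    },
}

# Sanity check: the selected algorithm has a matching parameter group.
assert controls["algo"] in controls
```

A k-d tree run would instead set 'algo' to 'kd' and supply the 'kd' sub-dictionary with the keys listed above.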

Returns:

DataFrame containing the blocking results with columns:

  • ‘y’: indices from query dataset

  • ‘x’: indices of matched items from reference dataset

  • ‘dist’: distances to matched items

Return type:

pandas.DataFrame

Notes

The function supports two different algorithms:

  • LSH (Locality-Sensitive Hashing): Better for high-dimensional data

  • k-d tree: Better for low-dimensional data
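The guidance above can be turned into a small selection helper. The sketch below is hypothetical and not part of blockingpy: the function name `pick_algo` and the cut-off of 20 features are assumptions made for illustration, and a suitable cut-off would need to be determined empirically for a given dataset.

```python
def pick_algo(n_features: int, high_dim_threshold: int = 20) -> str:
    """Return 'kd' for low-dimensional data and 'lsh' otherwise.

    Hypothetical helper: the threshold is an assumed value, not a
    recommendation from blockingpy or MLPack.
    """
    if n_features < 1:
        raise ValueError("n_features must be a positive integer")
    return "lsh" if n_features >= high_dim_threshold else "kd"


# Use the result as the 'algo' entry of the controls dictionary.
controls = {"algo": pick_algo(n_features=3)}
```

With three features the helper selects 'kd'; raising `n_features` to the threshold or beyond switches the choice to 'lsh'.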