blockingpy.mlpack_blocker.MLPackBlocker

class blockingpy.mlpack_blocker.MLPackBlocker

A class for performing blocking using MLPack algorithms (LSH or k-d tree).

This class implements blocking functionality using either Locality-Sensitive Hashing (LSH) or k-d tree algorithms from the MLPack library for efficient similarity search and nearest neighbor queries.

Parameters: None

algo

The selected algorithm ('lsh' or 'kd'), or None if no algorithm has been selected yet

Type: str or None

ALGO_MAP

Mapping of algorithm names to their MLPack implementations

Type: dict

See also

BlockingMethod: Abstract base class defining the blocking interface

Notes

For more details about the MLPack library and its algorithms, see: https://github.com/mlpack

__init__()

Initialize the MLPackBlocker instance.

Creates a new MLPackBlocker with no algorithm selected.

Methods

`__init__`()	Initialize the MLPackBlocker instance.
`block`(x, y, k, verbose, controls)	Perform blocking using an MLPack algorithm (LSH or k-d tree).

block(x, y, k, verbose, controls)

Perform blocking using an MLPack algorithm (LSH or k-d tree).

Parameters:

x (DataHandler) – Reference dataset containing features for indexing
y (DataHandler) – Query dataset to find nearest neighbors for
k (int) – Number of nearest neighbors to find
verbose (bool, optional) – If True, print detailed progress information
controls (dict) – Algorithm control parameters with the following structure:

    {
        'random_seed': int,
        'algo': str,  # 'lsh' or 'kd'
        'lsh': {  # if using LSH
            'k_search': int,
            'bucket_size': int,
            'hash_width': float,
            'num_probes': int,
            'projections': int,
            'tables': int
        },
        'kd': {  # if using k-d tree
            'k_search': int,
            'algorithm': str,
            'leaf_size': int,
            'tree_type': str,
            'epsilon': float,
            'rho': float,
            'tau': float,
            'random_basis': bool
        }
    }
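As a minimal sketch of the structure above, the following builds one controls dictionary per algorithm. The helper names `lsh_controls` and `kd_controls` are hypothetical, and every value is an illustrative assumption rather than a documented default; in particular, the strings used for 'algorithm' and 'tree_type' are assumed to be accepted by MLPack, since this page does not list the valid options.

```python
def lsh_controls(seed=2025):
    """Return a controls dict that selects LSH (all values are illustrative)."""
    return {
        "random_seed": seed,
        "algo": "lsh",
        "lsh": {
            "k_search": 10,
            "bucket_size": 500,
            "hash_width": 10.0,
            "num_probes": 0,
            "projections": 10,
            "tables": 30,
        },
    }


def kd_controls(seed=2025):
    """Return a controls dict that selects the k-d tree (all values are illustrative)."""
    return {
        "random_seed": seed,
        "algo": "kd",
        "kd": {
            "k_search": 10,
            "algorithm": "dual_tree",  # assumed option name, not confirmed by this page
            "leaf_size": 20,
            "tree_type": "kd",         # assumed option name, not confirmed by this page
            "epsilon": 0.0,
            "rho": 0.7,
            "tau": 0.0,
            "random_basis": False,
        },
    }


controls = lsh_controls(seed=42)
```

Either dictionary would then be passed as the `controls` argument of `block(x, y, k, verbose, controls)`. Whether the sub-dictionary for the unused algorithm may be omitted, as it is here, is an assumption.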

Returns:

DataFrame containing the blocking results, with columns:

- 'y': indices from the query dataset
- 'x': indices of matched items from the reference dataset
- 'dist': distances to matched items

Return type:

pandas.DataFrame
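To make the layout of the result concrete, here is a rough sketch that produces rows with the same three columns using an exact brute-force search. It is only a stand-in: MLPackBlocker relies on MLPack's LSH or k-d tree search, and the function name `brute_force_block`, the Euclidean metric, and the choice of one row per (query, neighbour) pair are all assumptions made for illustration.

```python
import math


def brute_force_block(x_rows, y_rows, k):
    """Exact nearest-neighbour search returning rows with keys 'y', 'x', 'dist'.

    Illustration only: a quadratic scan, not the MLPack-backed search.
    """
    rows = []
    for y_idx, y_vec in enumerate(y_rows):
        # Euclidean distance from this query point to every reference point.
        dists = [
            (math.dist(y_vec, x_vec), x_idx)
            for x_idx, x_vec in enumerate(x_rows)
        ]
        # Keep the k closest reference points (ties broken by lower index).
        for dist, x_idx in sorted(dists)[:k]:
            rows.append({"y": y_idx, "x": x_idx, "dist": dist})
    return rows


x = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]  # reference dataset
y = [(0.0, 1.0), (4.0, 5.0)]              # query dataset
result = brute_force_block(x, y, k=1)
```

The list of dictionaries keeps the sketch dependency-free; `pandas.DataFrame(result)` would give the same rows in tabular form.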

Notes

The function supports two algorithms:

- LSH (Locality-Sensitive Hashing): better for high-dimensional data
- k-d tree: better for low-dimensional data
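One way to apply this rule of thumb is to choose the 'algo' value from the number of features. The sketch below is an assumption-laden illustration: `pick_algo` is a hypothetical helper, and the cutoff of 20 features is an arbitrary value chosen for the example, not a recommendation from BlockingPy or MLPack.

```python
def pick_algo(n_features, cutoff=20):
    """Return 'kd' for low-dimensional data and 'lsh' otherwise.

    The default cutoff is arbitrary and purely illustrative; a suitable
    value depends on the data and should be checked empirically.
    """
    if n_features < 1:
        raise ValueError("n_features must be a positive integer")
    return "kd" if n_features <= cutoff else "lsh"


low_dim_choice = pick_algo(3)     # few features: k-d tree
high_dim_choice = pick_algo(300)  # many features: LSH
```

The returned string can be used directly as the 'algo' entry of the controls dictionary.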