blockingpy.mlpack_blocker.MLPackBlocker
- class blockingpy.mlpack_blocker.MLPackBlocker
A class for performing blocking using MLPack algorithms (LSH or k-d tree).
This class implements blocking functionality using either Locality-Sensitive Hashing (LSH) or k-d tree algorithms from the MLPack library for efficient similarity search and nearest neighbor queries.
- Parameters:
None
- algo
The selected algorithm (‘lsh’ or ‘kd’)
- Type:
str or None
- ALGO_MAP
Mapping of algorithm names to their MLPack implementations
- Type:
dict
See also
BlockingMethod – Abstract base class defining the blocking interface
Notes
For more details about the MLPack library and its algorithms, see: https://github.com/mlpack
- __init__()
Initialize the MLPackBlocker instance.
Creates a new MLPackBlocker with no algorithm selected.
Methods
__init__() – Initialize the MLPackBlocker instance.
block(x, y, k, verbose, controls) – Perform blocking using an MLPack algorithm (LSH or k-d tree).
- block(x, y, k, verbose, controls)
Perform blocking using an MLPack algorithm (LSH or k-d tree).
- Parameters:
x (DataHandler) – Reference dataset containing features for indexing
y (DataHandler) – Query dataset to find nearest neighbors for
k (int) – Number of nearest neighbors to find
verbose (bool, optional) – If True, print detailed progress information
controls (dict) –
Algorithm control parameters with the following structure:
{
    'random_seed': int,
    'algo': str,  # 'lsh' or 'kd'
    'lsh': {  # if using LSH
        'k_search': int,
        'bucket_size': int,
        'hash_width': float,
        'num_probes': int,
        'projections': int,
        'tables': int
    },
    'kd': {  # if using k-d tree
        'k_search': int,
        'algorithm': str,
        'leaf_size': int,
        'tree_type': str,
        'epsilon': float,
        'rho': float,
        'tau': float,
        'random_basis': bool
    }
}
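As a rough illustration of the structure above, the following sketch builds a controls dictionary in plain Python. The helper name make_controls is hypothetical and not part of blockingpy. The key names come from the parameter description, while every value is an illustrative placeholder (some loosely mirror what appear to be MLPack defaults, which is an assumption) and should be checked against the library before use.

```python
def make_controls(algo="kd", random_seed=2024):
    """Build a controls dict with the documented keys (hypothetical helper)."""
    if algo not in ("lsh", "kd"):
        raise ValueError("algo must be 'lsh' or 'kd'")
    return {
        "random_seed": random_seed,
        "algo": algo,
        # Used when algo == "lsh"; values are placeholders, not recommendations.
        "lsh": {
            "k_search": 30,
            "bucket_size": 500,
            "hash_width": 10.0,
            "num_probes": 0,
            "projections": 10,
            "tables": 30,
        },
        # Used when algo == "kd"; the string values are assumed to be valid
        # MLPack options and should be verified.
        "kd": {
            "k_search": 30,
            "algorithm": "dual_tree",
            "leaf_size": 20,
            "tree_type": "kd",
            "epsilon": 0.0,
            "rho": 0.7,
            "tau": 0.0,
            "random_basis": False,
        },
    }


controls = make_controls(algo="lsh")
print(controls["algo"], sorted(controls["lsh"]))
```

A dictionary built this way would be passed as the controls argument of block(). Only the sub-dictionary matching 'algo' is described as relevant for a given call.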
- Returns:
DataFrame containing the blocking results with columns:
- 'y': indices from query dataset
- 'x': indices of matched items from reference dataset
- 'dist': distances to matched items
- Return type:
pandas.DataFrame
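To make the shape of the result concrete, here is a minimal sketch that produces rows with the same three fields ('y', 'x', 'dist') using an exact brute-force search over small tuples. This is not how MLPackBlocker computes its result: it substitutes exhaustive Euclidean search for LSH or a k-d tree and returns a list of dictionaries instead of a pandas.DataFrame. The function name brute_force_block is hypothetical, and whether block() returns one or several rows per query is an assumption not confirmed by this page.

```python
import math


def brute_force_block(x_rows, y_rows, k):
    """Exact k-nearest-neighbour search returning rows with keys y, x, dist."""
    results = []
    for y_idx, y_vec in enumerate(y_rows):
        # Euclidean distance from this query row to every reference row.
        dists = [
            (math.dist(y_vec, x_vec), x_idx)
            for x_idx, x_vec in enumerate(x_rows)
        ]
        # Keep the k closest reference rows for this query row.
        for dist, x_idx in sorted(dists)[:k]:
            results.append({"y": y_idx, "x": x_idx, "dist": dist})
    return results


x_rows = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]  # reference dataset
y_rows = [(0.9, 0.1), (4.0, 5.0)]              # query dataset
rows = brute_force_block(x_rows, y_rows, k=1)
for row in rows:
    print(row)
```

Each row pairs a query index with the index of a matched reference item and the distance between them, which is the layout the returned DataFrame is documented to have.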
Notes
The method supports two algorithms:
- LSH (Locality-Sensitive Hashing): Better for high-dimensional data
- k-d tree: Better for low-dimensional data