Blocker

Contains the main Blocker class for record linkage and deduplication blocking.

class blockingpy.blocker.Blocker

Bases: object

A class implementing various blocking methods for record linkage and deduplication.

block(x, y=None, deduplication=True, ann='faiss', true_blocks=None, verbose=0, control_txt=None, control_ann=None, x_colnames=None, y_colnames=None, random_seed=None)

Perform blocking using the specified algorithm.

Parameters:
  • x (pandas.Series or scipy.sparse.csr_matrix or numpy.ndarray) – Reference dataset for blocking

  • y (numpy.ndarray or pandas.Series or scipy.sparse.csr_matrix, optional) – Query dataset (defaults to x for deduplication)

  • deduplication (bool, default True) – Whether to perform deduplication instead of record linkage

  • ann (str, default "faiss") – Approximate Nearest Neighbor algorithm to use

  • true_blocks (pandas.DataFrame, optional) – True blocking information for evaluation

  • verbose (int, default 0) – Verbosity level (0-3). Controls the logging level:
      - 0: WARNING level
      - 1-3: INFO level with increasing detail

  • control_txt (dict, default {}) – Text processing parameters

  • control_ann (dict, default {}) – ANN algorithm parameters

  • x_colnames (list of str, optional) – Column names for reference dataset used with csr_matrix or np.ndarray

  • y_colnames (list of str, optional) – Column names for query dataset used with csr_matrix or np.ndarray

  • random_seed (int, optional) – Random seed for reproducibility (default is None)

Raises:

ValueError – If one of the input validations fails

Returns:

Object containing blocking results and evaluation metrics

Return type:

BlockingResult

Notes

The method supports three input types:

  1. Text data (pandas.Series)
  2. Sparse matrices (scipy.sparse.csr_matrix) as a Document-Term Matrix (DTM)
  3. Dense matrices (numpy.ndarray) as a Document-Term Matrix (DTM)

Evaluation of larger datasets can be done separately using the eval method.

For text data, additional preprocessing is performed using the parameters in control_txt.
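To make the DTM input types more concrete, the sketch below builds a small dense document-term matrix from text records in plain Python. It assumes character 2-gram (shingle) tokens with lowercasing and whitespace removal; these choices, and the helper names `shingles` and `build_dtm`, are illustrative assumptions and not necessarily what control_txt configures.

```python
from collections import Counter


def shingles(text, n=2):
    """Split text into overlapping character n-grams.

    Lowercasing and whitespace removal are assumptions made for this sketch.
    """
    cleaned = "".join(text.lower().split())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def build_dtm(records, n=2):
    """Return (vocabulary, rows), where rows[i][j] counts term j in record i."""
    counts = [Counter(shingles(record, n)) for record in records]
    vocabulary = sorted(set().union(*counts))
    rows = [[c.get(term, 0) for term in vocabulary] for c in counts]
    return vocabulary, rows


def dot(u, v):
    """Dot product of two count vectors, used here as a crude overlap score."""
    return sum(a * b for a, b in zip(u, v))


records = ["Anna Kowalska", "Ana Kowalska", "John Smith"]
vocabulary, dtm = build_dtm(records)

# Near-duplicate records share many shingles; unrelated records share few.
overlap_similar = dot(dtm[0], dtm[1])
overlap_different = dot(dtm[0], dtm[2])
```

Each row of `dtm` corresponds to one record and each column to one term in `vocabulary`, which is the general shape a numpy.ndarray or csr_matrix input would take. When passing such a matrix, x_colnames and y_colnames carry the column (term) names.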

See also

BlockingResult

Class containing blocking results

controls_ann

Function to create ANN control parameters

controls_txt

Function to create text control parameters

eval(blocking_result, true_blocks)

Evaluate blocking results against true block assignments and return a new BlockingResult.

This method calculates evaluation metrics and a confusion matrix by comparing predicted blocks with known true blocks. It returns a new BlockingResult instance containing the evaluation results along with the original blocking results.

Parameters:
  • blocking_result (BlockingResult) – Original blocking result to evaluate

  • true_blocks (pandas.DataFrame) – DataFrame with true block assignments:
      - For deduplication: columns ['x', 'block']
      - For record linkage: columns ['x', 'y', 'block']

Returns:

A new BlockingResult instance with added evaluation results and original blocking results

Return type:

BlockingResult

Examples

>>> blocker = Blocker()
>>> result = blocker.block(x, y)
>>> evaluated = blocker.eval(result, true_blocks)
>>> print(evaluated.metrics)
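To illustrate the kind of comparison involved, the following sketch counts a pair-level confusion matrix for a deduplication case in plain Python. The rows mimic the ['x', 'block'] layout of true_blocks; the predicted rows, the `pairwise_confusion` helper, and the pair-based definitions of precision and recall are assumptions for illustration, and the metrics the library reports may be defined differently.

```python
from itertools import combinations


def pairwise_confusion(true_rows, predicted_rows):
    """Classify every record pair by whether it shares a block in truth and in prediction."""
    true_block = {row["x"]: row["block"] for row in true_rows}
    pred_block = {row["x"]: row["block"] for row in predicted_rows}

    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, b in combinations(sorted(true_block), 2):
        same_true = true_block[a] == true_block[b]
        same_pred = pred_block[a] == pred_block[b]
        if same_true and same_pred:
            counts["tp"] += 1  # true match kept together
        elif same_pred:
            counts["fp"] += 1  # non-match placed in the same block
        elif same_true:
            counts["fn"] += 1  # true match split across blocks
        else:
            counts["tn"] += 1  # non-match correctly separated
    return counts


# Rows shaped like the deduplication true_blocks columns ['x', 'block'].
true_rows = [
    {"x": 0, "block": 0}, {"x": 1, "block": 0},
    {"x": 2, "block": 1}, {"x": 3, "block": 1},
    {"x": 4, "block": 2},
]
# Hypothetical predicted assignments for the same five records.
predicted_rows = [
    {"x": 0, "block": 0}, {"x": 1, "block": 0},
    {"x": 2, "block": 1},
    {"x": 3, "block": 2}, {"x": 4, "block": 2},
]

confusion = pairwise_confusion(true_rows, predicted_rows)
precision = confusion["tp"] / (confusion["tp"] + confusion["fp"])
recall = confusion["tp"] / (confusion["tp"] + confusion["fn"])
```

Counting over pairs rather than individual records reflects that blocking quality is about which record pairs end up as candidates for comparison.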

See also

block

Main blocking method that includes evaluation

BlockingResult

Class for analyzing blocking results
