Blocker
Contains the main Blocker class for record linkage and deduplication blocking.
- class blockingpy.blocker.Blocker
Bases: object
A class implementing various blocking methods for record linkage and deduplication.
- block(x, y=None, deduplication=True, ann='faiss', true_blocks=None, verbose=0, control_txt=None, control_ann=None, x_colnames=None, y_colnames=None, random_seed=None)
Perform blocking using the specified algorithm.
- Parameters:
x (pandas.Series or scipy.sparse.csr_matrix or numpy.ndarray) – Reference dataset for blocking
y (numpy.ndarray or pandas.Series or scipy.sparse.csr_matrix, optional) – Query dataset (defaults to x for deduplication)
deduplication (bool, default True) – Whether to perform deduplication instead of record linkage
ann (str, default "faiss") – Approximate Nearest Neighbor algorithm to use
true_blocks (pandas.DataFrame, optional) – True blocking information for evaluation
verbose (int, default 0) – Verbosity level (0-3). Controls the logging level:
  - 0: WARNING level
  - 1-3: INFO level with increasing detail
control_txt (dict, default {}) – Text processing parameters
control_ann (dict, default {}) – ANN algorithm parameters
x_colnames (list of str, optional) – Column names for reference dataset used with csr_matrix or np.ndarray
y_colnames (list of str, optional) – Column names for query dataset used with csr_matrix or np.ndarray
random_seed (int, optional) – Random seed for reproducibility (default is None)
- Raises:
ValueError – If one of the input validations fails
- Returns:
Object containing blocking results and evaluation metrics
- Return type:
BlockingResult
Notes
The function supports three input types:
1. Text data (pandas.Series)
2. Sparse matrices (scipy.sparse.csr_matrix) as a Document-Term Matrix (DTM)
3. Dense matrices (numpy.ndarray) as a Document-Term Matrix (DTM)
Evaluation of larger datasets can be done separately using the eval method.
For text data, additional preprocessing is performed using the parameters in control_txt.
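To make the relationship between text input and a Document-Term Matrix concrete, here is a minimal standard-library sketch that turns short strings into a dense count matrix. The character-bigram tokenization, the cleaning steps, and the helper names `char_ngrams` and `build_dtm` are illustrative assumptions; they are not taken from blockingpy and do not describe what control_txt actually does.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Lowercase the text, drop non-alphanumerics, and split into n-grams.

    Hypothetical helper for illustration; not part of blockingpy.
    """
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return [cleaned[i : i + n] for i in range(len(cleaned) - n + 1)]


def build_dtm(docs: list[str], n: int = 2):
    """Return (vocabulary, rows) where rows[i][j] counts term j in doc i."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Sorted vocabulary gives every document the same column order.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


vocabulary, dtm = build_dtm(["Anna", "Ana"])
```

A matrix of this shape corresponds to the dense (numpy.ndarray) input type once converted to an array, with the vocabulary playing the role of x_colnames.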
See also
BlockingResult: Class containing blocking results
controls_ann: Function to create ANN control parameters
controls_txt: Function to create text control parameters
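The verbose parameter of block is documented as mapping 0 to the WARNING level and 1-3 to the INFO level. A minimal sketch of such a mapping with the standard logging module might look like the following. The helper name `verbosity_to_log_level` and the logger name are hypothetical, and this is not the library's own implementation.

```python
import logging


def verbosity_to_log_level(verbose: int) -> int:
    """Map a verbosity value (0-3) to a standard logging level.

    Hypothetical helper for illustration; not part of blockingpy.
    """
    if verbose not in (0, 1, 2, 3):
        raise ValueError("verbose must be an integer between 0 and 3")
    # 0 keeps output quiet (warnings only); 1-3 all enable INFO messages.
    return logging.WARNING if verbose == 0 else logging.INFO


# Example: configure an illustrative logger for verbose=1.
logger = logging.getLogger("blocking_sketch")
logger.setLevel(verbosity_to_log_level(1))
```

How much extra detail levels 2 and 3 add beyond level 1 is decided by the library and is not modeled here.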
- eval(blocking_result, true_blocks)
Evaluate blocking results against true block assignments and return a new BlockingResult.
This method calculates evaluation metrics and a confusion matrix by comparing predicted blocks with known true blocks. It returns a new BlockingResult instance containing the evaluation results along with the original blocking results.
- Parameters:
blocking_result (BlockingResult) – Original blocking result to evaluate
true_blocks (pandas.DataFrame) – DataFrame with true block assignments.
  - For deduplication: columns ['x', 'block']
  - For record linkage: columns ['x', 'y', 'block']
- Returns:
A new BlockingResult instance with added evaluation results and original blocking results
- Return type:
BlockingResult
Examples
>>> blocker = Blocker()
>>> result = blocker.block(x, y)
>>> evaluated = blocker.eval(result, true_blocks)
>>> print(evaluated.metrics)
See also
block: Main blocking method that includes evaluation
BlockingResult: Class for analyzing blocking results
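One common way to compare predicted blocks with true blocks is to look at every pair of records and ask whether the two agree on "same block" versus "different block". The sketch below does this for a toy deduplication case using only the standard library, with rows shaped like the ['x', 'block'] columns described for true_blocks. The function name `pairwise_confusion` and the toy data are hypothetical, and the metrics and confusion-matrix layout that eval actually produces may differ.

```python
from itertools import combinations

# Toy deduplication data: each row pairs a record index 'x' with a block id.
true_rows = [
    {"x": 0, "block": 0},
    {"x": 1, "block": 0},
    {"x": 2, "block": 1},
    {"x": 3, "block": 1},
]
predicted_rows = [
    {"x": 0, "block": 0},
    {"x": 1, "block": 0},
    {"x": 2, "block": 0},
    {"x": 3, "block": 1},
]


def pairwise_confusion(predicted, truth):
    """Count record pairs by whether predicted and true blocks agree."""
    pred = {row["x"]: row["block"] for row in predicted}
    true = {row["x"]: row["block"] for row in truth}
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, b in combinations(sorted(true), 2):
        same_pred = pred[a] == pred[b]
        same_true = true[a] == true[b]
        if same_pred and same_true:
            counts["tp"] += 1  # correctly placed together
        elif same_pred:
            counts["fp"] += 1  # placed together, but truly separate
        elif same_true:
            counts["fn"] += 1  # truly together, but placed apart
        else:
            counts["tn"] += 1  # correctly kept apart
    return counts


counts = pairwise_confusion(predicted_rows, true_rows)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
```

For this toy data the counts are 1 true positive, 2 false positives, 1 false negative, and 2 true negatives, giving a precision of 1/3 and a recall of 1/2.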
More information
For details about control_ann and control_txt, see Configuration and Tuning. For other details, see the User Guide.