BlockingResult
Contains the BlockingResult class for analyzing and printing blocking results.
- class blockingpy.blocking_result.BlockingResult(x_df, ann, deduplication, n_original_records, true_blocks, eval_metrics, confusion, colnames_xy, reduction_ratio=None)
Bases: object
A class to represent and analyze the results of a blocking operation.
This class provides functionality to analyze and evaluate blocking results, including calculating the reduction ratio and evaluating quality metrics against true blocks.
- Parameters:
x_df (pandas.DataFrame) – DataFrame containing blocking results with columns [‘x’, ‘y’, ‘block’, ‘dist’]
ann (str) – The blocking method used (e.g., ‘nnd’, ‘hnsw’, ‘annoy’, etc.)
deduplication (bool) – Whether the blocking was performed for deduplication
true_blocks (pandas.DataFrame, optional) – DataFrame with true blocks to calculate evaluation metrics
n_original_records (tuple[int, int]) – Number of records in the original dataset(s)
eval_metrics (pandas.Series, optional) – Evaluation metrics if true blocks were provided
confusion (pandas.DataFrame, optional) – Confusion matrix if true blocks were provided
colnames_xy (numpy.ndarray) – Column names used in the blocking process
reduction_ratio (float, optional) – Pre-calculated reduction ratio (default None)
- result
The blocking results containing [‘x’, ‘y’, ‘block’, ‘dist’] columns
- Type:
pandas.DataFrame
- method
Name of the blocking method used
- Type:
str
- deduplication
Indicates if this was a deduplication operation
- Type:
bool
- metrics
Evaluation metrics if true blocks were provided
- Type:
pandas.Series or None
- confusion
Confusion matrix if true blocks were provided
- Type:
pandas.DataFrame or None
- colnames
Names of columns used in blocking
- Type:
numpy.ndarray
- n_original_records
Number of records in the original dataset(s)
- Type:
tuple[int, int]
- reduction_ratio
Reduction ratio calculated for the blocking method
- Type:
float
Notes
The class provides methods for calculating reduction ratio and formatting evaluation metrics for blocking quality assessment.
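As a rough illustration of what a reduction ratio measures in the deduplication case, the sketch below uses a common definition: one minus the share of all possible record pairs that still need comparing once records are grouped into blocks. The function name `reduction_ratio_dedup` is hypothetical, and this is not necessarily the exact formula used by BlockingResult.

```python
from collections import Counter
from math import comb


def reduction_ratio_dedup(block_ids, n_records):
    """Hypothetical helper: 1 - (pairs inside blocks) / (all possible pairs)."""
    total_pairs = comb(n_records, 2)
    if total_pairs == 0:
        # With fewer than two records there is nothing to reduce.
        return 0.0
    # Only records sharing a block are compared with each other.
    block_sizes = Counter(block_ids).values()
    candidate_pairs = sum(comb(size, 2) for size in block_sizes)
    return 1.0 - candidate_pairs / total_pairs


# Six records split into blocks of sizes 3, 2 and 1:
# 3 + 1 + 0 = 4 candidate pairs out of 15 possible pairs.
rr = reduction_ratio_dedup([0, 0, 0, 1, 1, 2], n_records=6)
print(round(rr, 4))  # 0.7333
```

A value close to 1 means blocking removed most of the pairwise comparisons; a value close to 0 means almost every pair is still compared.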
- add_block_column(df_left, df_right=None, id_col_left=None, id_col_right=None, block_col='block')
Attach block IDs back onto the original DataFrame(s), placing any record with no block assignment into its own singleton block.
Deduplication: pass only df_left; returns one DataFrame.
Record-linkage: pass both df_left and df_right; returns a tuple (left_with_blocks, right_with_blocks).
- Parameters:
df_left – If dedup: your input DataFrame. If rec-lin: the “x” DataFrame.
df_right – If rec-lin: the “y” DataFrame. Otherwise None.
id_col_left – Column in df_left matching integer index into self.result.x; if None, uses the DataFrame’s positional index.
id_col_right – Column in df_right matching integer index into self.result.y; if None, uses that DataFrame’s positional index.
block_col – Name of the new block-ID column.
- Return type:
Single DataFrame (dedup) or tuple of two DataFrames (rec-lin).
Examples
>>> x = blocking_result.add_block_column(org_x_df)  # dedup
>>> x, y = blocking_result.add_block_column(org_x_df, org_y_df)  # rec-lin
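To make the deduplication behaviour more concrete, here is a minimal pandas sketch of the idea described above: map each record position to a block ID, then give unassigned records fresh singleton blocks. It is not the library's implementation. The helper name `attach_blocks_dedup` is hypothetical, and the sketch assumes that in the deduplication case both `x` and `y` hold positional indices into the same DataFrame.

```python
import pandas as pd


def attach_blocks_dedup(df, result, block_col="block"):
    """Hypothetical sketch of the dedup case; not blockingpy's own code."""
    # Collect (position, block) pairs from both the 'x' and 'y' columns.
    pairs = pd.concat([
        result[["x", "block"]].rename(columns={"x": "pos"}),
        result[["y", "block"]].rename(columns={"y": "pos"}),
    ])
    pos_to_block = pairs.drop_duplicates("pos").set_index("pos")["block"]

    out = df.copy()
    # Use positional indices (0..n-1), as when no id column is given.
    positions = pd.Series(range(len(out)), index=out.index)
    blocks = positions.map(pos_to_block)  # NaN where no block was assigned

    # Unassigned records each get a new singleton block ID.
    missing = blocks.isna()
    if missing.any():
        next_id = int(pos_to_block.max()) + 1 if len(pos_to_block) else 0
        blocks.loc[missing] = list(range(next_id, next_id + int(missing.sum())))

    out[block_col] = blocks.astype("int64")
    return out


# Toy blocking result: records 0 and 1 share block 0, records 2 and 3 share block 1.
result = pd.DataFrame({"x": [0, 2], "y": [1, 3], "block": [0, 1], "dist": [0.1, 0.2]})
df = pd.DataFrame({"name": ["ann", "anne", "bob", "bobb", "carl"]})

out = attach_blocks_dedup(df, result)
print(out["block"].tolist())  # [0, 0, 1, 1, 2]
```

Record 4 ("carl") appears in no pair, so it ends up alone in block 2. In practice, calling `add_block_column` on the BlockingResult object is the intended route; the sketch only illustrates the singleton-filling idea.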