BlockingResult

Contains the BlockingResult class for analyzing and printing blocking results.

class blockingpy.blocking_result.BlockingResult(x_df, ann, deduplication, n_original_records, true_blocks, eval_metrics, confusion, colnames_xy, reduction_ratio=None)

Bases: object

A class to represent and analyze the results of a blocking operation.

This class provides functionality to analyze and evaluate blocking results, including calculating the reduction ratio and reporting evaluation metrics.

Parameters:

x_df (pandas.DataFrame) – DataFrame containing blocking results with columns [‘x’, ‘y’, ‘block’, ‘dist’]
ann (str) – The blocking method used (e.g., ‘nnd’, ‘hnsw’, ‘annoy’)
deduplication (bool) – Whether the blocking was performed for deduplication
true_blocks (pandas.DataFrame, optional) – DataFrame with true blocks to calculate evaluation metrics
n_original_records (tuple[int, int]) – Number of records in the original dataset(s)
eval_metrics (pandas.Series, optional) – Evaluation metrics if true blocks were provided
confusion (pandas.DataFrame, optional) – Confusion matrix if true blocks were provided
colnames_xy (numpy.ndarray) – Column names used in the blocking process
reduction_ratio (float, optional) – Pre-calculated reduction ratio (default None)

result

The blocking results containing [‘x’, ‘y’, ‘block’, ‘dist’] columns

Type: pandas.DataFrame

method

Name of the blocking method used

Type: str

deduplication

Indicates if this was a deduplication operation

Type: bool

metrics

Evaluation metrics if true blocks were provided

Type: pandas.Series or None

confusion

Confusion matrix if true blocks were provided

Type: pandas.DataFrame or None

colnames

Names of columns used in blocking

Type: numpy.ndarray

n_original_records

Number of records in the original dataset(s)

Type: tuple[int, int]

reduction_ratio

Reduction ratio calculated for the blocking method

Type: float

Notes

The class provides methods for calculating reduction ratio and formatting evaluation metrics for blocking quality assessment.
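To make the reduction ratio concrete, here is a minimal sketch using the common definition: one minus the number of candidate pairs left after blocking, divided by the number of all possible pairs. This is not taken from the library's source. The helper name reduction_ratio_sketch is hypothetical, and the sketch assumes that n_original_records is ordered as (n_x, n_y), that for deduplication the first element is the dataset size, and that 'x' and 'y' share one index space in the deduplication case.

```python
import pandas as pd


def reduction_ratio_sketch(result, n_original_records, deduplication):
    """Hypothetical helper: 1 - candidate_pairs / all_possible_pairs.

    `result` is assumed to have the columns ['x', 'y', 'block'].
    """
    if deduplication:
        # Assumption: the first element is the size of the single dataset.
        n = n_original_records[0]
        total_pairs = n * (n - 1) // 2
        # Collect every record that appears in a block, from either column.
        members = pd.concat(
            [
                result[["block", "x"]].rename(columns={"x": "rec"}),
                result[["block", "y"]].rename(columns={"y": "rec"}),
            ]
        )
        sizes = members.groupby("block")["rec"].nunique()
        # Within a block of size k there are k * (k - 1) / 2 pairs.
        candidate_pairs = int((sizes * (sizes - 1) // 2).sum())
    else:
        # Assumption: the tuple is ordered as (n_x, n_y).
        n_x, n_y = n_original_records
        total_pairs = n_x * n_y
        per_block = result.groupby("block").agg(
            nx=("x", "nunique"), ny=("y", "nunique")
        )
        # Each block contributes (x records) * (y records) cross pairs.
        candidate_pairs = int((per_block["nx"] * per_block["ny"]).sum())
    return 1.0 - candidate_pairs / total_pairs


# Deduplication: 6 records, blocks {0, 1} and {2, 3, 4}; record 5 is unassigned.
dedup_result = pd.DataFrame({"x": [0, 2, 2], "y": [1, 3, 4], "block": [0, 1, 1]})
rr_dedup = reduction_ratio_sketch(dedup_result, (6, 6), deduplication=True)

# Record linkage: 3 "x" records and 4 "y" records.
reclin_result = pd.DataFrame({"x": [0, 1, 2], "y": [0, 0, 3], "block": [0, 0, 1]})
rr_reclin = reduction_ratio_sketch(reclin_result, (3, 4), deduplication=False)
```

In the deduplication example, 4 of the 15 possible pairs remain as candidates; in the record-linkage example, 3 of 12 remain. The value stored in reduction_ratio may be computed differently by the library.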

add_block_column(df_left, df_right=None, id_col_left=None, id_col_right=None, block_col='block')

Attach block IDs back onto the original DataFrame(s). Any record without a block assignment is placed in its own singleton block.

Deduplication: pass only df_left; returns one DataFrame.
Record-linkage: pass both df_left and df_right; returns a tuple (left_with_blocks, right_with_blocks).

Parameters:

df_left – If dedup: your input DataFrame. If rec-lin: the “x” DataFrame.
df_right – If rec-lin: the “y” DataFrame. Otherwise None.
id_col_left – Column in df_left whose values match the integer indices in self.result.x; if None, uses the DataFrame’s positional index.
id_col_right – Column in df_right whose values match the integer indices in self.result.y; if None, uses that DataFrame’s positional index.
block_col – Name of the new block-ID column.

Return type:

Single DataFrame (dedup) or tuple of two DataFrames (rec-lin).

Examples

>>> x = blocking_result.add_block_column(org_x_df)  # dedup
>>> x, y = blocking_result.add_block_column(org_x_df, org_y_df)  # rec-lin
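The deduplication behaviour described above can be illustrated with plain pandas. The sketch below is not the library's implementation. The helper name attach_blocks_dedup_sketch is hypothetical, records are matched by positional index (the id_col_left=None case), and the choice to number singleton blocks from one past the largest existing block ID is an assumption.

```python
import pandas as pd


def attach_blocks_dedup_sketch(df, result, block_col="block"):
    """Hypothetical helper mimicking the dedup case with positional indices."""
    # Build a record -> block lookup from both the 'x' and 'y' columns.
    pairs = pd.concat(
        [
            result[["x", "block"]].rename(columns={"x": "rec"}),
            result[["y", "block"]].rename(columns={"y": "rec"}),
        ]
    ).drop_duplicates(subset="rec")
    rec_to_block = pairs.set_index("rec")["block"]

    out = df.copy()
    # Positional index of every row, aligned with the DataFrame's own index.
    positions = pd.Series(range(len(out)), index=out.index)
    blocks = positions.map(rec_to_block)  # NaN where a record has no block

    missing = blocks.isna()
    if missing.any():
        # Assumption: singleton IDs continue after the largest existing ID.
        next_id = int(rec_to_block.max()) + 1 if len(rec_to_block) else 0
        blocks.loc[missing] = list(range(next_id, next_id + int(missing.sum())))

    out[block_col] = blocks.astype("int64")
    return out


org_df = pd.DataFrame({"name": ["ann", "anne", "bob", "bobb", "carl"]})
pairs_df = pd.DataFrame({"x": [0, 2], "y": [1, 3], "block": [0, 1], "dist": [0.1, 0.2]})
with_blocks = attach_blocks_dedup_sketch(org_df, pairs_df)
block_ids = with_blocks["block"].tolist()
```

Here rows 0 and 1 share block 0, rows 2 and 3 share block 1, and the unmatched row 4 gets a new singleton block. The record-linkage case would apply the same lookup separately to the 'x' and 'y' DataFrames.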