BlockingResult

Contains the BlockingResult class for analyzing and printing blocking results.

class blockingpy.blocking_result.BlockingResult(x_df, ann, deduplication, n_original_records, true_blocks, eval_metrics, confusion, colnames_xy, reduction_ratio=None)

Bases: object

A class to represent and analyze the results of a blocking operation.

This class provides functionality to analyze and evaluate blocking results, including calculating the reduction ratio and reporting evaluation metrics.

Parameters:
  • x_df (pandas.DataFrame) – DataFrame containing blocking results with columns [‘x’, ‘y’, ‘block’, ‘dist’]

  • ann (str) – The blocking method used (e.g., ‘nnd’, ‘hnsw’, ‘annoy’)

  • deduplication (bool) – Whether the blocking was performed for deduplication

  • true_blocks (pandas.DataFrame, optional) – DataFrame with true blocks to calculate evaluation metrics

  • n_original_records (tuple[int, int]) – Number of records in the original dataset(s)

  • eval_metrics (pandas.Series, optional) – Evaluation metrics if true blocks were provided

  • confusion (pandas.DataFrame, optional) – Confusion matrix if true blocks were provided

  • colnames_xy (numpy.ndarray) – Column names used in the blocking process

  • reduction_ratio (float, optional) – Pre-calculated reduction ratio (default None)

result

The blocking results containing [‘x’, ‘y’, ‘block’, ‘dist’] columns

Type:

pandas.DataFrame

method

Name of the blocking method used

Type:

str

deduplication

Indicates if this was a deduplication operation

Type:

bool

metrics

Evaluation metrics if true blocks were provided

Type:

pandas.Series or None

confusion

Confusion matrix if true blocks were provided

Type:

pandas.DataFrame or None

colnames

Names of columns used in blocking

Type:

numpy.ndarray

n_original_records

Number of records in the original dataset(s)

Type:

tuple[int, int]

reduction_ratio

Reduction ratio calculated for the blocking method

Type:

float

Notes

The class provides methods for calculating reduction ratio and formatting evaluation metrics for blocking quality assessment.
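The reduction ratio mentioned above is commonly defined as the share of candidate pairs that blocking removes from the full comparison space. The following is a minimal sketch of that common definition, not the library's own code; the function names `reduction_ratio_dedup` and `reduction_ratio_reclin` are hypothetical, and the exact formula used by BlockingResult is not stated in this section.

```python
def reduction_ratio_dedup(block_sizes, n_records):
    """Deduplication case: pairs are drawn from a single dataset.

    block_sizes: number of records in each block (hypothetical input).
    n_records:   number of records in the original dataset.
    """
    total_pairs = n_records * (n_records - 1) // 2
    if total_pairs == 0:
        return 0.0  # nothing to compare, so nothing to reduce
    # Only pairs inside the same block are still compared.
    kept_pairs = sum(s * (s - 1) // 2 for s in block_sizes)
    return 1.0 - kept_pairs / total_pairs


def reduction_ratio_reclin(block_sizes_xy, n_x, n_y):
    """Record-linkage case: pairs are drawn across two datasets.

    block_sizes_xy: (records from x, records from y) for each block.
    n_x, n_y:       sizes of the two original datasets.
    """
    total_pairs = n_x * n_y
    if total_pairs == 0:
        return 0.0
    kept_pairs = sum(sx * sy for sx, sy in block_sizes_xy)
    return 1.0 - kept_pairs / total_pairs


# Five records split into blocks of sizes 2, 2 and 1:
# 10 possible pairs, 2 remain after blocking.
rr_dedup = reduction_ratio_dedup([2, 2, 1], 5)

# Two datasets of 3 and 4 records, two blocks holding (1, 2) and (2, 2):
# 12 possible pairs, 6 remain after blocking.
rr_reclin = reduction_ratio_reclin([(1, 2), (2, 2)], 3, 4)
```

A value close to 1 means blocking discards most of the comparison space; a value of 0 means no pairs were eliminated.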

add_block_column(df_left, df_right=None, id_col_left=None, id_col_right=None, block_col='block')

Attach block IDs back onto the original DataFrame(s). Any record that was not assigned to a block is placed in its own singleton block.

  • Deduplication: pass only df_left; returns one DataFrame.

  • Record-linkage: pass both df_left and df_right; returns a tuple (left_with_blocks, right_with_blocks).

Parameters:
  • df_left – For deduplication: the input DataFrame. For record linkage: the “x” DataFrame.

  • df_right – For record linkage: the “y” DataFrame. Otherwise None.

  • id_col_left – Column in df_left whose values match the integer indices in self.result.x; if None, the DataFrame’s positional index is used.

  • id_col_right – Column in df_right whose values match the integer indices in self.result.y; if None, that DataFrame’s positional index is used.

  • block_col – Name of the new block-ID column.

Return type:

A single DataFrame (deduplication) or a tuple of two DataFrames (record linkage).

Examples

>>> x = blocking_result.add_block_column(org_x_df)  # dedup
>>> x, y = blocking_result.add_block_column(org_x_df, org_y_df)  # rec-lin
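The singleton-filling behaviour described for this method can be illustrated without the library. The sketch below is only an assumption about the general idea: the helper `assign_blocks` is hypothetical, it works on plain Python containers rather than DataFrames, and the way BlockingResult actually numbers the new singleton blocks is not stated in this section.

```python
def assign_blocks(n_records, block_of):
    """Return one block ID per record position.

    n_records: number of records in the original dataset.
    block_of:  mapping from record position to block ID for the records
               that the blocking step did assign (hypothetical input).
    """
    # Assumption: fresh singleton IDs continue after the largest existing ID.
    next_id = max(block_of.values(), default=-1) + 1
    blocks = []
    for position in range(n_records):
        if position in block_of:
            blocks.append(block_of[position])
        else:
            # Unassigned record: give it a block of its own.
            blocks.append(next_id)
            next_id += 1
    return blocks


# Records 0 and 1 share block 0, record 3 is in block 1,
# records 2 and 4 were left unassigned.
filled = assign_blocks(5, {0: 0, 1: 0, 3: 1})
print(filled)  # [0, 0, 2, 1, 3]
```

Giving every unassigned record its own block keeps the output the same length as the input, so the block column can be attached to the original data without dropping rows.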