Input Data Handling

Supported Input Formats

BlockingPy is flexible in terms of input data formats. The package accepts three main types of input:

  • Text Data: pandas.Series containing raw text

  • Sparse Matrices: scipy.sparse.csr_matrix for pre-computed document-term matrices

  • Dense Arrays: numpy.ndarray for numeric feature vectors

Text Processing Options

When working with text data, Blockingpy provides two main options for processing:

1. Character shingle encoding (default)

This method creates features based on character n-grams. Futher options can be set in the control_txt dictionary.

import pandas as pd
from blockingpy import Blocker

texts = pd.Series([
    "john smith",
    "smith john",
    "jane doe"
])

control_txt = {
    'encoder': 'shingle',
    'shingle': {
        'n_shingles': 2,
        'max_features': 5000,
        'lowercase': True,
        'strip_non_alphanum': True
    }
}

blocker = Blocker()
result = blocker.block(x=texts, control_txt=control_txt)

2. Embedding encoding

You can also utilize pre-trained embeddings for more semantically meaningful blocking via model2vec library:

control_txt = {
    'encoder': 'embedding',
    'embedding': {
        'model': 'minishlab/potion-base-8M',
        'normalize': True,
        'max_length': 512,
        'emb_batch_size': 1024
    }
}

result = blocker.block(x=texts, control_txt=control_txt)

For more details on the embedding options, refer to the model2vec documentation

Dataframes

If you have a DataFrame with multiple columns (like name, address, etc.), we recommend combining these columns into a single text column before passing it to the blocker:

import pandas as pd
from blockingpy import Blocker

# Example DataFrame with multiple columns
df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Smith John'],
    'city': ['New York', 'Boston', 'NYC'],
    'occupation': ['engineer', 'doctor', 'engineer']
})

# Combine relevant columns into a single text field
# You can adjust the separator and columns based on your needs (and also with control_txt to a degree)
df['blocking_key'] = df['name'] + ' ' + df['city'] + ' ' + df['occupation']

# Pass the combined text column to the blocker
blocker = Blocker()
result = blocker.block(x=df['blocking_key'])

Pre-computed Document-Term Matrices

If you have already vectorized your text data or are working with numeric features, you can pass a sparse document-term matrix:

from scipy import sparse

# Example sparse DTMs
dtm_1 = sparse.csr_matrix((n_docs, n_features))
dtm_2 = sparse.csr_matrix((n_docs_2, n_features_2))

# Column names are required for sparse matrices
feature_names_1 = [f'feature_{i}' for i in range(n_features)]
feature_names_2 = [f'feature_{i}' for i in range(n_features_2)]

result = blocker.block(
    x=dtm_1,
    y=dtm_2, 
    x_colnames=feature_names_1,
    y_colnames=feature_names_2,
)

Dense Numeric Arrays

For dense feature vectors, use numpy arrays:

import numpy as np

# Example feature matrix
features = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 0.0, 0.0],
    [2.0, 1.0, 1.0]
])

# Column names are required for numpy arrays
feature_names = ['feat_1', 'feat_2', 'feat_3']

result = blocker.block(
    x=features, 
    x_colnames=feature_names
)

Input Validation

BlockingPy performs several validations on input data:

  • Format Checking: Ensures inputs are in supported formats

  • Compatibility: Verifies feature compatibility between datasets

  • Column Names: Validates presence of required column names

  • Dimensions: Checks for appropriate matrix dimensions

If validation fails, clear error messages are provided indicating the issue.