Input Data Handling
Supported Input Formats
BlockingPy is flexible in terms of input data formats. The package accepts three main types of input:
Text Data: pandas.Series containing raw text
Sparse Matrices: scipy.sparse.csr_matrix for pre-computed document-term matrices
Dense Arrays: numpy.ndarray for numeric feature vectors
Text Processing Options
When working with text data, BlockingPy provides two main options for processing:
1. Character shingle encoding (default)
This method creates features based on character n-grams. Further options can be set in the control_txt dictionary.
import pandas as pd
from blockingpy import Blocker
texts = pd.Series([
"john smith",
"smith john",
"jane doe"
])
control_txt = {
'encoder': 'shingle',
'shingle': {
'n_shingles': 2,
'max_features': 5000,
'lowercase': True,
'strip_non_alphanum': True
}
}
blocker = Blocker()
result = blocker.block(x=texts, control_txt=control_txt)
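To make the effect of these options more concrete, here is a minimal pure-Python sketch of character shingling. The helper char_shingles is hypothetical and is not BlockingPy's internal implementation. In particular, treating spaces as non-alphanumeric characters to be stripped is an assumption made for illustration.

```python
import re

def char_shingles(text, n=2, lowercase=True, strip_non_alphanum=True):
    """Hypothetical helper: split text into overlapping character n-grams."""
    if lowercase:
        text = text.lower()
    if strip_non_alphanum:
        # Assumption: everything except ASCII letters and digits is removed,
        # spaces included.
        text = re.sub(r"[^A-Za-z0-9]+", "", text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def jaccard(s, t):
    """Share of distinct shingles that two shingle sets have in common."""
    return len(s & t) / len(s | t)

a = set(char_shingles("john smith"))
b = set(char_shingles("smith john"))
c = set(char_shingles("jane doe"))

print(sorted(a & b))                  # shingles shared by the reordered names
print(jaccard(a, b), jaccard(a, c))   # overlap scores for the two pairs
```

Under these assumptions, "john smith" and "smith john" share 7 of their 9 distinct 2-character shingles, while "jane doe" shares none with either. This is why shingle features tend to place reordered or slightly misspelled records close together.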
2. Embedding encoding
You can also use pre-trained embeddings for more semantically meaningful blocking via the model2vec library:
control_txt = {
'encoder': 'embedding',
'embedding': {
'model': 'minishlab/potion-base-8M',
'normalize': True,
'max_length': 512,
'emb_batch_size': 1024
}
}
result = blocker.block(x=texts, control_txt=control_txt)
For more details on the embedding options, refer to the model2vec documentation.
Dataframes
If you have a DataFrame with multiple columns (like name, address, etc.), we recommend combining these columns into a single text column before passing it to the blocker:
import pandas as pd
from blockingpy import Blocker
# Example DataFrame with multiple columns
df = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Smith John'],
'city': ['New York', 'Boston', 'NYC'],
'occupation': ['engineer', 'doctor', 'engineer']
})
# Combine relevant columns into a single text field
# Adjust the separator and the selected columns to your needs
# (control_txt also gives some control over how the text is processed)
df['blocking_key'] = df['name'] + ' ' + df['city'] + ' ' + df['occupation']
# Pass the combined text column to the blocker
blocker = Blocker()
result = blocker.block(x=df['blocking_key'])
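One practical caveat with the + concatenation above: in pandas, a missing value in any column makes the whole combined string missing. The sketch below shows one possible way to build the key more defensively. The helper build_blocking_key is hypothetical (it is not part of BlockingPy), and replacing missing values with empty strings is an assumption that may not suit every dataset.

```python
import pandas as pd

def build_blocking_key(frame, columns, sep=" "):
    """Hypothetical helper: join text columns row-wise, ignoring missing values."""
    # Replace missing values with empty strings so they do not propagate
    parts = frame[columns].fillna("").astype(str)
    joined = parts.apply(lambda row: sep.join(row.tolist()), axis=1)
    # Collapse repeated whitespace left behind by empty fields
    return joined.str.split().str.join(" ")

df = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", None],
    "city": ["New York", None, "NYC"],
})

df["blocking_key"] = build_blocking_key(df, ["name", "city"])
print(df["blocking_key"].tolist())
```

The resulting column is an ordinary pandas.Series of strings, so it can be passed to the blocker in the same way as df['blocking_key'] above.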
Pre-computed Document-Term Matrices
If you have already vectorized your text data or are working with numeric features, you can pass a sparse document-term matrix:
from scipy import sparse
from blockingpy import Blocker
# Placeholder dimensions for illustration
n_docs, n_features = 100, 50
n_docs_2, n_features_2 = 80, 50
# Example sparse DTMs (empty placeholders; substitute your own matrices)
dtm_1 = sparse.csr_matrix((n_docs, n_features))
dtm_2 = sparse.csr_matrix((n_docs_2, n_features_2))
# Column names are required for sparse matrices
feature_names_1 = [f'feature_{i}' for i in range(n_features)]
feature_names_2 = [f'feature_{i}' for i in range(n_features_2)]
blocker = Blocker()
result = blocker.block(
x=dtm_1,
y=dtm_2,
x_colnames=feature_names_1,
y_colnames=feature_names_2,
)
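For readers who do not yet have a document-term matrix, the following sketch shows one way such a matrix and its column names could be produced with scipy alone. The whitespace tokenization and the variable names are assumptions made for illustration. Any vectorizer that yields a csr_matrix together with a matching list of feature names would serve the same purpose.

```python
from collections import Counter
from scipy import sparse

docs = ["john smith", "smith john", "jane doe"]

# Vocabulary: one column per distinct whitespace-separated token
vocab = sorted({tok for doc in docs for tok in doc.split()})
col_index = {tok: j for j, tok in enumerate(vocab)}

# Collect (row, column, count) triplets for the non-zero entries
rows, cols, vals = [], [], []
for i, doc in enumerate(docs):
    for tok, count in Counter(doc.split()).items():
        rows.append(i)
        cols.append(col_index[tok])
        vals.append(count)

dtm = sparse.csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))
feature_names = vocab  # one name per column, in column order

print(feature_names)
print(dtm.toarray())
```

Here dtm and feature_names play the roles of x and x_colnames in the call above.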
Dense Numeric Arrays
For dense feature vectors, use numpy arrays:
import numpy as np
# Example feature matrix
features = np.array([
[1.0, 2.0, 0.0],
[2.0, 0.0, 0.0],
[2.0, 1.0, 1.0]
])
# Column names are required for numpy arrays
feature_names = ['feat_1', 'feat_2', 'feat_3']
result = blocker.block(
x=features,
x_colnames=feature_names
)
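If the numeric features live in a DataFrame rather than an array, both the array and its column names can be derived from it. This is a small sketch with made-up example columns, not a required workflow.

```python
import numpy as np
import pandas as pd

# Illustrative numeric data; the column names are invented for this example
df_num = pd.DataFrame({
    "age": [34, 35, 61],
    "height_cm": [180.0, 181.0, 165.0],
    "n_children": [2, 2, 0],
})

# Convert to a dense float array and keep the column order for the names
features = df_num.to_numpy(dtype=float)
feature_names = df_num.columns.tolist()

print(features.shape, feature_names)
```

The two objects can then be supplied as x and x_colnames, as in the example above.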
Input Validation
BlockingPy performs several validations on input data:
Format Checking: Ensures inputs are in supported formats
Compatibility: Verifies feature compatibility between datasets
Column Names: Validates presence of required column names
Dimensions: Checks for appropriate matrix dimensions
If validation fails, clear error messages are provided indicating the issue.
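To catch such problems before calling the blocker, a pre-check along the following lines may be useful. The function check_blocking_input is a hypothetical helper written for this example. It mirrors the checks listed above only loosely, and the exact rules and error types that BlockingPy applies may differ.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def check_blocking_input(x, colnames=None):
    """Hypothetical pre-check; not part of BlockingPy."""
    # Format: raw text needs no column names
    if isinstance(x, pd.Series):
        return "text"
    if isinstance(x, (np.ndarray, sparse.csr_matrix)):
        # Dimensions: matrices must be two-dimensional
        if x.ndim != 2:
            raise ValueError(f"expected a 2-D matrix, got {x.ndim} dimension(s)")
        # Column names: required and must match the number of columns
        if colnames is None:
            raise ValueError("column names are required for matrix input")
        if len(colnames) != x.shape[1]:
            raise ValueError(
                f"{len(colnames)} column names given for {x.shape[1]} columns"
            )
        return "sparse" if sparse.issparse(x) else "dense"
    raise TypeError(f"unsupported input type: {type(x).__name__}")

print(check_blocking_input(pd.Series(["john smith"])))
print(check_blocking_input(np.zeros((2, 3)), ["a", "b", "c"]))
```

Calling the helper with a matrix but no column names, or with a list of strings instead of a pandas.Series, raises an error before any blocking work starts.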