Configuration and Tuning
Overview
BlockingPy provides two main configuration interfaces:
control_txt: Text processing parameters
control_ann: ANN algorithm parameters
Text Processing Configuration (control_txt)
The control_txt dictionary controls how text data is processed before blocking:
control_txt = {
    "encoder": "shingle",
    "embedding": {
        "model": "minishlab/potion-base-8M",
        "normalize": True,
        "max_length": 512,
        "emb_batch_size": 1024,
        "show_progress_bar": False,
        "use_multiprocessing": True,
        "multiprocessing_threshold": 10000,
    },
    "shingle": {
        "n_shingles": 2,
        "lowercase": True,
        "strip_non_alphanum": True,
        "max_features": 5000,
    },
}
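As a hedged illustration of how the dictionary above might be customized, the sketch below switches the encoder to embeddings by overriding only the keys that differ and keeping every other default. The `deep_update` helper is hypothetical and not part of BlockingPy, and using "embedding" as the value of "encoder" is an assumption based on the section names in the listing.

```python
import copy

# Defaults copied from the listing above.
DEFAULT_CONTROL_TXT = {
    "encoder": "shingle",
    "embedding": {
        "model": "minishlab/potion-base-8M",
        "normalize": True,
        "max_length": 512,
        "emb_batch_size": 1024,
        "show_progress_bar": False,
        "use_multiprocessing": True,
        "multiprocessing_threshold": 10000,
    },
    "shingle": {
        "n_shingles": 2,
        "lowercase": True,
        "strip_non_alphanum": True,
        "max_features": 5000,
    },
}


def deep_update(defaults, overrides):
    """Hypothetical helper: return a copy of `defaults` with `overrides` merged in."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged


# Assumption: "embedding" is the encoder name that selects the "embedding" section.
control_txt = deep_update(
    DEFAULT_CONTROL_TXT,
    {"encoder": "embedding", "embedding": {"show_progress_bar": True}},
)
```

Because the result is a complete dictionary, it does not rely on how the library fills in missing keys.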
Parameter Details
n_shingles (default: 2)
Controls the size of character n-grams
Larger values capture more context but increase dimensionality
Common values: 2-4
Impact: higher values are more precise but slower
max_features (default: 5000)
Maximum number of features in the document-term matrix
Controls memory usage and processing speed
Higher values may improve accuracy but increase memory usage
Adjust based on your dataset size and available memory
lowercase (default: True)
Whether to convert text to lowercase
Usually keep True for better matching
Set to False if case is meaningful for your data
strip_non_alphanum (default: True)
Whether to remove non-alphanumeric characters
Usually keep True for cleaner matching
Set to False if special characters are important
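To make the shingle parameters concrete, here is a minimal pure-Python sketch of character shingling. It is not BlockingPy's implementation: the function names are hypothetical, dropping whitespace together with punctuation is an assumption, and treating `max_features` as "keep the most frequent shingles" is likewise an assumption.

```python
import re
from collections import Counter


def char_shingles(text, n_shingles=2, lowercase=True, strip_non_alphanum=True):
    """Return the overlapping character n-grams of `text` (illustrative only)."""
    if lowercase:
        text = text.lower()
    if strip_non_alphanum:
        # Assumption: whitespace is dropped along with punctuation.
        text = re.sub(r"[^0-9A-Za-z]", "", text)
    return [text[i:i + n_shingles] for i in range(len(text) - n_shingles + 1)]


def top_features(texts, max_features=5000, **shingle_kwargs):
    """Keep the `max_features` most frequent shingles across all texts."""
    counts = Counter(s for t in texts for s in char_shingles(t, **shingle_kwargs))
    return [shingle for shingle, _ in counts.most_common(max_features)]


print(char_shingles("Mon-te"))                # ['mo', 'on', 'nt', 'te']
print(char_shingles("Mon-te", n_shingles=3))  # ['mon', 'ont', 'nte']
```

Raising `n_shingles` yields fewer, longer and more specific tokens per string, which is why the number of distinct features tends to grow across a corpus.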
NOTE: control_txt is used only when the input is a pd.Series, because the other input types are expected to be already processed.
ANN Algorithm Configuration (control_ann)
Each algorithm has its own set of parameters in the control_ann dictionary, which should have the following overall structure:
control_ann = {
    "random_seed": None,  # Alternative to setting the seed directly in the `Blocker`
    "faiss": {
        # parameters here
    },
    "voyager": {},
    "annoy": {},
    "lsh": {},
    "kd": {},
    "hnsw": {},
    # you can specify only the dict of the algorithm you are using
    "algo": "lsh",  # "lsh" or "kd"; specify only when using lsh or kd
}
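The structure above can be sketched as a small builder that emits only the section for the algorithm in use. `make_control_ann` is a hypothetical helper, not a BlockingPy function; the set of section names is taken from this page.

```python
# Section names that appear on this page.
KNOWN_SECTIONS = {"faiss", "voyager", "annoy", "lsh", "kd", "hnsw", "nnd", "gpu_faiss"}


def make_control_ann(algo, params=None, random_seed=None):
    """Hypothetical helper: build a control_ann dict for a single algorithm."""
    if algo not in KNOWN_SECTIONS:
        raise ValueError(f"unknown algorithm section: {algo!r}")
    control_ann = {"random_seed": random_seed, algo: dict(params or {})}
    if algo in ("lsh", "kd"):
        # Per the structure above, "algo" is specified only for lsh and kd.
        control_ann["algo"] = algo
    return control_ann


hnsw_config = make_control_ann("hnsw", {"M": 25, "ef_c": 200})
kd_config = make_control_ann("kd", {"leaf_size": 20}, random_seed=42)
```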
FAISS Configuration
control_ann = {
    'faiss': {
        'index_type': 'hnsw',  # Index type: 'flat', 'hnsw' or 'lsh' (default: 'hnsw')
        'distance': 'cosine',  # Distance metric
        'k_search': 30,  # Number of neighbors to search
        'path': None,  # Optional path to save index
        'hnsw_M': 32,  # Number of connections per element
        'hnsw_ef_construction': 200,  # Size of dynamic candidate list (construction)
        'hnsw_ef_search': 200,  # Size of dynamic candidate list (search)
        'lsh_nbits': 2,  # Number of bits for LSH (gets multiplied by dimensions)
        'lsh_rotate_data': True,  # Rotate data for LSH
    }
}
Supported distance metrics:
euclidean
cosine (default)
inner_product
l1
manhattan
linf
canberra
bray_curtis
jensen_shannon
NOTE: Distance metrics do not apply to the lsh index type.
For more information about faiss see here.
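The comment on `lsh_nbits` says the value is multiplied by the number of dimensions. A quick worked calculation shows what that means for code size. This is a sketch under two assumptions: `n_dims` is the number of columns in the matrix passed to the index, and the bits are packed into whole bytes. The function name is hypothetical.

```python
def faiss_lsh_code_size(n_dims, lsh_nbits=2):
    """Return (total_bits, bytes_per_vector) for the effective LSH code."""
    total_bits = lsh_nbits * n_dims           # lsh_nbits is multiplied by the dimensionality
    bytes_per_vector = (total_bits + 7) // 8  # bits packed into whole bytes
    return total_bits, bytes_per_vector


print(faiss_lsh_code_size(5000))  # (10000, 1250)
```

With 5000 shingle features and the default `lsh_nbits` of 2, each record would be hashed to 10000 bits under these assumptions.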
Voyager Configuration
control_ann = {
    'random_seed': None,  # Random seed
    'voyager': {
        'distance': 'cosine',  # Distance metric
        'k_search': 30,  # Number of neighbors to search
        'path': None,  # Optional path to save index
        'M': 12,  # Number of connections per element
        'ef_construction': 200,  # Size of dynamic candidate list (construction)
        'max_elements': 1,  # Maximum number of elements
        'num_threads': -1,  # Number of threads (-1 for auto)
        'query_ef': -1,  # Query expansion factor (-1 for auto)
    }
}
Supported distance metrics:
cosine
inner_product
euclidean (default)
For more information about voyager see here.
HNSW Configuration
control_ann = {
    'random_seed': None,  # Random seed
    'hnsw': {
        'distance': 'cosine',  # Distance metric
        'k_search': 30,  # Number of neighbors to search
        'n_threads': 1,  # Number of threads
        'path': None,  # Optional path to save index
        'M': 25,  # Number of connections per element
        'ef_c': 200,  # Size of dynamic candidate list (construction)
        'ef_s': 200,  # Size of dynamic candidate list (search)
    }
}
Supported distance metrics:
cosine (default)
l2
euclidean (same as l2)
ip (inner product)
For more information about hnsw configuration see here.
Annoy Configuration
control_ann = {
    'random_seed': None,  # Random seed
    'annoy': {
        'distance': 'angular',  # Distance metric
        'k_search': 30,  # Number of neighbors to search
        'path': None,  # Optional path to save index
        'n_trees': 250,  # Number of trees
        'build_on_disk': False,  # Build index on disk
    }
}
Supported distance metrics:
angular (default)
dot
hamming
manhattan
euclidean
For more information about annoy configuration see here.
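Because the metric names differ between backends (for example, `angular` for annoy versus `cosine` elsewhere), a small pre-flight check can catch typos before an index is built. The sketch below only encodes the lists given on this page; `check_distance` is a hypothetical helper, not part of BlockingPy.

```python
# Supported metric names, copied from the lists on this page.
SUPPORTED_DISTANCES = {
    "faiss": {"euclidean", "cosine", "inner_product", "l1", "manhattan",
              "linf", "canberra", "bray_curtis", "jensen_shannon"},
    "voyager": {"cosine", "inner_product", "euclidean"},
    "hnsw": {"cosine", "l2", "euclidean", "ip"},
    "annoy": {"angular", "dot", "hamming", "manhattan", "euclidean"},
}


def check_distance(control_ann, algo):
    """Return the configured distance, or raise ValueError if it is not listed."""
    params = control_ann.get(algo, {})
    distance = params.get("distance")
    if distance is None:
        return None  # nothing configured, so the library default applies
    if algo == "faiss" and params.get("index_type") == "lsh":
        return None  # distance metrics do not apply to the lsh index type
    if distance not in SUPPORTED_DISTANCES[algo]:
        allowed = ", ".join(sorted(SUPPORTED_DISTANCES[algo]))
        raise ValueError(f"{distance!r} is not listed for {algo}; expected one of: {allowed}")
    return distance


check_distance({"annoy": {"distance": "angular"}}, "annoy")  # returns 'angular'
```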
LSH Configuration
control_ann = {
    'random_seed': None,  # Random seed
    'lsh': {
        'k_search': 30,  # Number of neighbors to search
        'bucket_size': 500,  # Hash bucket size
        'hash_width': 10.0,  # Hash function width
        'num_probes': 0,  # Number of probes
        'projections': 10,  # Number of projections
        'tables': 30,  # Number of hash tables
    }
}
For more information about lsh see here.
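To give some intuition for `projections`, `hash_width` and `tables`, the following is a minimal sketch of a standard random-projection (p-stable) LSH scheme, in which each hash is floor((a·x + b) / w). It assumes the `lsh` backend follows this general scheme and that the parameters map onto it as named here; it is not the backend's actual code, and all function names are hypothetical.

```python
import math
import random


def make_hash_table(n_dims, projections, hash_width, rng):
    """Build one table: `projections` random hash functions combined into one key."""
    funcs = []
    for _ in range(projections):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]  # random direction
        b = rng.uniform(0.0, hash_width)                   # random offset
        funcs.append((a, b))

    def key(x):
        # Each projection is cut into slices of width `hash_width`.
        return tuple(
            math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / hash_width)
            for a, b in funcs
        )

    return key


def build_tables(n_dims, tables=30, projections=10, hash_width=10.0, seed=0):
    """Create `tables` independent hash tables."""
    rng = random.Random(seed)
    return [make_hash_table(n_dims, projections, hash_width, rng) for _ in range(tables)]


def share_bucket(hash_tables, x, y):
    """Two points are candidate neighbors if any table gives them the same key."""
    return any(h(x) == h(y) for h in hash_tables)
```

In this scheme, more projections make each bucket more selective, a larger hash width makes buckets coarser, and more tables give nearby points more chances to collide.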
K-d Tree Configuration
control_ann = {
    'random_seed': None,  # Random seed
    'kd': {
        'k_search': 30,  # Number of neighbors to search
        'algorithm': 'dual_tree',  # Algorithm type
        'leaf_size': 20,  # Leaf size for tree
        'random_basis': False,  # Use random basis
        'rho': 0.7,  # Overlapping size
        'tau': 0.0,  # Early termination parameter
        'tree_type': 'kd',  # Type of tree to use
        'epsilon': 0.0,  # Search approximation parameter
    }
}
For more information about kd see here.
NND Configuration
control_ann = {
    'random_seed': None,  # Random seed
    'nnd': {
        'metric': 'euclidean',  # Distance metric
        'k_search': 30,  # Number of neighbors to search
        'n_threads': None,  # Number of threads
        'leaf_size': None,  # Leaf size for tree building
        'n_trees': None,  # Number of trees
        'diversify_prob': 1.0,  # Probability of including diverse neighbors
        'low_memory': True,  # Use low memory mode
    }
}
For more information about nnd see here.
GPU Faiss Configuration
This applies only if you use blockingpy-gpu. The structure is similar to that of the other ANN algorithms.
control_ann = {
    "gpu_faiss": {
        "index_type": "flat",  # or "ivf", "ivfpq", "cagra"
        "k_search": 30,
        "distance": "cosine",
        "path": None,
        "ivf_nlist": 100,
        "ivf_nprobe": 10,
        "ivfpq_nlist": 100,
        "ivfpq_m": 8,
        "ivfpq_nbits": 8,
        "ivfpq_nprobe": 10,
        "ivfpq_useFloat16": False,
        "ivfpq_usePrecomputed": False,
        "ivfpq_reserveVecs": 0,
        "ivfpq_use_cuvs": False,
        "cagra": {
            "graph_degree": 64,
            "intermediate_graph_degree": 128,
            "build_algo": "ivf_pq",
            "nn_descent_niter": 20,
            "itopk_size": 64,
            "max_queries": 0,
            "algo": "auto",
            "team_size": 0,
            "search_width": 1,
            "min_iterations": 0,
            "max_iterations": 0,
            "thread_block_size": 0,
            "hashmap_mode": "auto",
            "hashmap_min_bitlen": 0,
            "hashmap_max_fill_rate": 0.5,
            "num_random_samplings": 1,
            "seed": 0x128394,
        },
    },
}
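The GPU section mixes parameters for several index types in one dictionary. As a reading aid, the sketch below picks out the keys that appear to belong to a chosen `index_type`. It assumes that the `ivf_` and `ivfpq_` prefixes and the nested `cagra` dictionary indicate which index type each parameter applies to; that mapping is inferred from the naming only, and `gpu_faiss_params_for` is a hypothetical helper.

```python
def gpu_faiss_params_for(config, index_type):
    """Hypothetical helper: select the keys that appear to belong to `index_type`."""
    common = {k: config[k] for k in ("k_search", "distance", "path") if k in config}
    if index_type == "flat":
        specific = {}
    elif index_type in ("ivf", "ivfpq"):
        # Assumption: the key prefix names the index type the parameter belongs to.
        prefix = index_type + "_"
        specific = {k: v for k, v in config.items() if k.startswith(prefix)}
    elif index_type == "cagra":
        specific = dict(config.get("cagra", {}))
    else:
        raise ValueError(f"unknown index_type: {index_type!r}")
    return {**common, **specific}


example = {
    "index_type": "ivf",
    "k_search": 30,
    "distance": "cosine",
    "ivf_nlist": 100,
    "ivf_nprobe": 10,
    "ivfpq_m": 8,
    "cagra": {"graph_degree": 64},
}
print(gpu_faiss_params_for(example, "ivf"))
# {'k_search': 30, 'distance': 'cosine', 'ivf_nlist': 100, 'ivf_nprobe': 10}
```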