Configuration and Tuning

Overview

BlockingPy provides two main configuration interfaces:

  • control_txt: Text processing parameters

  • control_ann: ANN algorithm parameters

Text Processing Configuration (control_txt)

The control_txt dictionary controls how text data is processed before blocking:

control_txt = {
        "encoder": "shingle",
        "embedding": {
            "model": "minishlab/potion-base-8M",
            "normalize": True,
            "max_length": 512,
            "emb_batch_size": 1024,
            "show_progress_bar": False,
            "use_multiprocessing": True,
            "multiprocessing_threshold": 10000,
        },
        "shingle": {
            "n_shingles": 2,
            "lowercase": True,
            "strip_non_alphanum": True,
            "max_features": 5000,
        },
    }

Parameter Details

n_shingles (default: 2)

  • Controls the size of character n-grams

  • Larger values capture more context but increase dimensionality

  • Common values: 2-4

  • Impact: Higher values more precise but slower

max_features (default: 5000)

  • Maximum number of features in the document-term matrix

  • Controls memory usage and processing speed

  • Higher values may improve accuracy but increase memory usage

  • Adjust based on your dataset size and available memory

lowercase (default: True)

  • Whether to convert text to lowercase

  • Usually keep True for better matching

  • Set to False if case is meaningful for your data

strip_non_alphanum (default: True)

  • Remove non-alphanumeric characters

  • Usually keep True for cleaner matching

  • Set to False if special characters are important

NOTE: control_txt is used only if the input is pd.Series as the other options were already processed.

ANN Algorithm Configuration (control_ann)

Each algorithm has its own set of parameters in the control_ann dictionary. Overall control_ann should be in the following structure:

control_ann = {
    "random_seed" : None, # Alternative to setting the seed directly in the `Blocker` 
    "faiss" : {
        # parameters here
    },
    "voyager" : {},
    "annoy" : {},
    "lsh" : {},
    "kd" : {},
    "hnsw": {},
    # you can specify only the dict of the algorithm you are using

    "algo" : "lsh" or "kd" # specify if using lsh or kd

}

FAISS Configuration

control_ann = {
    'faiss': {
        'index_type': ['flat', 'hnsw', 'lsh'], # Index type (default: 'hnsw')
        'distance': 'cosine', # Distance metric
        'k_search': 30, # Number of neighbors to search
        'path': None,     # Optional path to save index

        'hnsw_M': 32,               # Number of connections per element
        'hnsw_ef_construction': 200, # Size of dynamic candidate list (construction)
        'hnsw_ef_search': 200,       # Size of dynamic candidate list (search)

        'lsh_nbits': 2,        # (gets multiplied by dimensions) Number of bits for LSH
        'lsh_rotate_data': True, # Rotate data for LSH
    }
}

Supported distance metrics:

  • euclidean

  • cosine (default)

  • inner_product

  • l1

  • manhattan

  • linf

  • canberra

  • bray_curtis

  • jensen_shannon

NOTE : Distance metrics do not apply to lsh index type.

For more information about faiss see here.

Voyager Configuration

control_ann = {
    'random_seed': None, # Random seed
    'voyager': {
        'distance': 'cosine',   # Distance metric
        'k_search': 30,         # Number of neighbors to search
        'path': None,           # Optional path to save index
        'M': 12,                # Number of connections per element
        'ef_construction': 200, # Size of dynamic candidate list (construction)
        'max_elements': 1,      # Maximum number of elements
        'num_threads': -1,      # Number of threads (-1 for auto)
        'query_ef': -1          # Query expansion factor (-1 for auto)
    }
}

Supported distance metrics:

  • cosine

  • inner_product

  • euclidean (default)

For more information about voyager see here.

HNSW Configuration

control_ann = {
    'random_seed': None, # Random seed
    'hnsw': {
        'distance': 'cosine', # Distance metric
        'k_search': 30,       # Number of neighbors to search
        'n_threads': 1,       # Number of threads
        'path': None,         # Optional path to save index
        'M': 25,              # Number of connections per element
        'ef_c': 200,          # Size of dynamic candidate list (construction)
        'ef_s': 200,          # Size of dynamic candidate list (search)
    }
}

Supported distance metrics:

  • cosine (default)

  • l2

  • euclidean (same as l2)

  • ip (Inner Product)

For more information about hnsw configuration see here.

Annoy Configuration

control_ann = {
    'random_seed': None, # Random seed
    'annoy': {
        'distance': 'angular', # Distance metric
        'k_search': 30,        # Number of neighbors to search
        'path': None,          # Optional path to save index
        'n_trees': 250,        # Number of trees
        'build_on_disk': False # Build index on disk
    }
}

Supported distance metrics:

  • angular(default)

  • dot

  • hamming

  • manhattan

  • euclidean

For more information about annoy configuratino see here.

LSH Configuration

control_ann = {
    'random_seed': None, # Random seed
    'lsh': {
        'k_search': 30,        # Number of neighbors to search
        'bucket_size': 500,    # Hash bucket size
        'hash_width': 10.0,    # Hash function width
        'num_probes': 0,       # Number of probes
        'projections': 10,     # Number of projections
        'tables': 30           # Number of hash tables
    }
}

For more information about lsh see here.

K-d Tree Configuration

control_ann = {
    'random_seed': None, # Random seed
    'kd': {
        'k_search': 30,           # Number of neighbors to search
        'algorithm': 'dual_tree', # Algorithm type
        'leaf_size': 20,          # Leaf size for tree
        'random_basis': False,    # Use random basis
        'rho': 0.7,               # Overlapping size
        'tau': 0.0,               # Early termination parameter
        'tree_type': 'kd',        # Type of tree to use
        'epsilon': 0.0            # Search approximation parameter
    }
}

For more information about kd see here.

NND Configuration

control_ann = {
    'random_seed': None, # Random seed
    'nnd': {
        'metric': 'euclidean',  # Distance metric
        'k_search': 30,         # Number of neighbors to search
        'n_threads': None,      # Number of threads
        'leaf_size': None,      # Leaf size for tree building
        'n_trees': None,        # Number of trees
        'diversify_prob': 1.0,  # Probability of including diverse neighbors
        'low_memory': True,     # Use low memory mode
    }
}

For more information about nnd see here.

GPU Faiss Configuration

This applies only if you use blockingpy-gpu. The structure is similar to the rest of ANNs.

control_ann = {
    "gpu_faiss": {
            "index_type": "flat", #ivf, ivfpq, cagra
            "k_search": 30,
            "distance": "cosine",
            "path": None,

            "ivf_nlist": 100,
            "ivf_nprobe": 10,

            "ivfpq_nlist": 100,
            "ivfpq_m": 8,
            "ivfpq_nbits": 8,
            "ivfpq_nprobe": 10,
            "ivfpq_useFloat16": False,
            "ivfpq_usePrecomputed": False,
            "ivfpq_reserveVecs": 0,
            "ivfpq_use_cuvs": False,

            "cagra": {
                "graph_degree": 64,
                "intermediate_graph_degree": 128,
                "build_algo": "ivf_pq",
                "nn_descent_niter": 20,
                "itopk_size": 64,
                "max_queries": 0,
                "algo": "auto",
                "team_size": 0,
                "search_width": 1,
                "min_iterations": 0,
                "max_iterations": 0,
                "thread_block_size": 0,
                "hashmap_mode": "auto",
                "hashmap_min_bitlen": 0,
                "hashmap_max_fill_rate": 0.5,
                "num_random_samplings": 1,
                "seed": 0x128394,
            },
        },
}