(configuration_tuning)=
# Configuration and Tuning

## Overview

BlockingPy provides two main configuration interfaces:

- `control_txt`: text processing parameters
- `control_ann`: ANN algorithm parameters

## Text Processing Configuration (`control_txt`)

The `control_txt` dictionary controls how text data is processed before blocking:

```python
control_txt = {
    "encoder": "shingle",  # selects the "shingle" or "embedding" settings below
    "embedding": {
        "model": "minishlab/potion-base-8M",
        "normalize": True,
        "max_length": 512,
        "emb_batch_size": 1024,
        "show_progress_bar": False,
        "use_multiprocessing": True,
        "multiprocessing_threshold": 10000,
    },
    "shingle": {
        "n_shingles": 2,
        "lowercase": True,
        "strip_non_alphanum": True,
        "max_features": 5000,
    },
}
```

### Parameter Details

`n_shingles` (default: `2`)

- Sets the length of the character n-grams (shingles)
- Larger values capture more context but increase dimensionality
- Common values: 2-4
- Impact: higher values are more precise but slower
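
To make the effect of `n_shingles` concrete, here is a minimal sketch of character shingling. The helper `char_shingles` is hypothetical and is not BlockingPy's tokenizer; it only illustrates how the n-gram length changes the tokens that are produced.

```python
def char_shingles(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text` (illustrative helper)."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


name = "montypython"
for n in (2, 3, 4):
    grams = char_shingles(name, n)
    # Print the n-gram length, the number of shingles, and the first few shingles.
    print(n, len(grams), grams[:3])
```

Longer shingles give fewer tokens per string, but the number of possible distinct shingles grows quickly with `n`, which is what drives the increase in dimensionality.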


`max_features` (default: `5000`)

- Maximum number of features in the document-term matrix
- Controls memory usage and processing speed
- Higher values may improve accuracy but increase memory usage
- Adjust based on your dataset size and available memory
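
As a rough illustration of what a feature cap does, the sketch below keeps only the most frequent tokens across a corpus. This assumes frequency-based truncation in the style of scikit-learn's `CountVectorizer`; the helper `capped_vocabulary` is hypothetical and BlockingPy's own selection rule may differ.

```python
from collections import Counter


def capped_vocabulary(docs: list[list[str]], max_features: int) -> list[str]:
    """Keep the `max_features` most frequent tokens across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    return [token for token, _ in counts.most_common(max_features)]


# Each document is already split into 2-character shingles.
docs = [["mo", "on", "nt", "ty"], ["mo", "on", "te"], ["mo", "nt"]]
vocab = capped_vocabulary(docs, max_features=3)
print(vocab)
```

Tokens outside the capped vocabulary are dropped from the document-term matrix, which bounds its width at `max_features` columns.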


`lowercase` (default: `True`)

- Whether to convert text to lowercase
- Usually keep `True` for better matching
- Set to `False` if case is meaningful for your data


`strip_non_alphanum` (default: `True`)

- Whether to remove non-alphanumeric characters
- Usually keep `True` for cleaner matching
- Set to `False` if special characters are important
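
The two cleaning flags can be pictured with the following sketch. The function `normalize` is a hypothetical stand-in, not BlockingPy's implementation; in particular, keeping whitespace and treating only ASCII letters and digits as alphanumeric are assumptions made for this example.

```python
import re


def normalize(text: str, lowercase: bool = True, strip_non_alphanum: bool = True) -> str:
    """Apply the two cleaning steps in the order: lowercase, then strip."""
    if lowercase:
        text = text.lower()
    if strip_non_alphanum:
        # Assumption: whitespace is kept; everything else that is not an
        # ASCII letter or digit is removed.
        text = re.sub(r"[^0-9A-Za-z\s]", "", text)
    return text


print(normalize("O'Brien, J."))
print(normalize("O'Brien, J.", lowercase=False))
print(normalize("A-1", strip_non_alphanum=False))
```

With both flags enabled, variants such as `O'Brien, J.` and `OBRIEN J` collapse to the same string, which is why the defaults usually help matching.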

NOTE: `control_txt` is used only when the input is a `pd.Series`; the other supported input types are expected to be processed already.

## ANN Algorithm Configuration (`control_ann`)

Each algorithm has its own set of parameters in the `control_ann` dictionary. Overall, `control_ann` should have the following structure:

```python
control_ann = {
    "random_seed": None,  # Alternative to setting the seed directly in the `Blocker`
    "faiss": {
        # parameters here
    },
    "voyager": {},
    "annoy": {},
    "lsh": {},
    "kd": {},
    "hnsw": {},
    "nnd": {},
    # you can specify only the dict of the algorithm you are using

    "algo": "lsh",  # "lsh" or "kd"; specify only if using lsh or kd
}
```
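
Since only the dictionary of the algorithm in use needs to be specified, it can help to think of user settings as being layered over defaults. The following is a minimal sketch of such a recursive merge. The helper `merge_controls` is hypothetical and is not BlockingPy's internal code; the default values are an abbreviated copy of the HNSW values shown on this page.

```python
from copy import deepcopy


def merge_controls(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dictionaries are merged key by key.
            merged[key] = merge_controls(merged[key], value)
        else:
            # Scalars (and new keys) simply replace the default.
            merged[key] = value
    return merged


defaults = {
    "random_seed": None,
    "hnsw": {"distance": "cosine", "k_search": 30, "M": 25, "ef_c": 200, "ef_s": 200},
}
user = {"hnsw": {"M": 40}}  # override a single parameter
control_ann = merge_controls(defaults, user)
print(control_ann["hnsw"])
```

In this sketch only `M` changes; every other HNSW parameter keeps its default value and the `defaults` dictionary itself is left unmodified.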


### FAISS Configuration
```python
control_ann = {
    'faiss': {
        'index_type': 'hnsw', # Index type: 'flat', 'hnsw' or 'lsh' (default: 'hnsw')
        'distance': 'cosine', # Distance metric
        'k_search': 30, # Number of neighbors to search
        'path': None,     # Optional path to save index

        'hnsw_M': 32,               # Number of connections per element
        'hnsw_ef_construction': 200, # Size of dynamic candidate list (construction)
        'hnsw_ef_search': 200,       # Size of dynamic candidate list (search)

        'lsh_nbits': 2,        # Number of bits for LSH (multiplied by the number of dimensions)
        'lsh_rotate_data': True, # Rotate data for LSH
    }
}
```

**Supported distance metrics**:

- `euclidean` 
- `cosine` (default)
- `inner_product`
- `l1`
- `manhattan`
- `linf`
- `canberra`
- `bray_curtis`
- `jensen_shannon`

***NOTE***: Distance metrics do not apply to the `lsh` index type.
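
Because `lsh_nbits` is multiplied by the number of dimensions, the resulting hash length depends on the width of the input matrix. The arithmetic can be sketched as follows; `total_lsh_bits` is a hypothetical helper written for this illustration only.

```python
def total_lsh_bits(lsh_nbits: int, n_dims: int) -> int:
    """Hash length implied by `lsh_nbits` for data with `n_dims` columns."""
    return lsh_nbits * n_dims


# Example widths: a small embedding, a larger embedding, and a shingle
# matrix capped at max_features=5000.
for n_dims in (64, 512, 5000):
    print(n_dims, total_lsh_bits(2, n_dims))
```

For wide shingle matrices even the default of `2` implies a long hash, so it is worth checking this product before raising `lsh_nbits`.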

For more information about `faiss` see [here](https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances).

### Voyager Configuration

```python
control_ann = {
    'random_seed': None, # Random seed
    'voyager': {
        'distance': 'cosine',   # Distance metric
        'k_search': 30,         # Number of neighbors to search
        'path': None,           # Optional path to save index
        'M': 12,                # Number of connections per element
        'ef_construction': 200, # Size of dynamic candidate list (construction)
        'max_elements': 1,      # Maximum number of elements
        'num_threads': -1,      # Number of threads (-1 for auto)
        'query_ef': -1          # Query expansion factor (-1 for auto)
    }
}
```

**Supported distance metrics**:

- `cosine`
- `inner_product`
- `euclidean` (default)

For more information about `voyager` see [here](https://github.com/spotify/voyager).

### HNSW Configuration

```python
control_ann = {
    'random_seed': None, # Random seed
    'hnsw': {
        'distance': 'cosine', # Distance metric
        'k_search': 30,       # Number of neighbors to search
        'n_threads': 1,       # Number of threads
        'path': None,         # Optional path to save index
        'M': 25,              # Number of connections per element
        'ef_c': 200,          # Size of dynamic candidate list (construction)
        'ef_s': 200,          # Size of dynamic candidate list (search)
    }
}
```
**Supported distance metrics**:

- `cosine` (default)
- `l2`
- `euclidean` (same as l2)
- `ip` (Inner Product)

For more information about `hnsw` configuration see [here](https://github.com/nmslib/hnswlib).

### Annoy Configuration

```python
control_ann = {
    'random_seed': None, # Random seed
    'annoy': {
        'distance': 'angular', # Distance metric
        'k_search': 30,        # Number of neighbors to search
        'path': None,          # Optional path to save index
        'n_trees': 250,        # Number of trees
        'build_on_disk': False # Build index on disk
    }
}
```
**Supported distance metrics**:

- `angular` (default)
- `dot`
- `hamming`
- `manhattan`
- `euclidean`

For more information about `annoy` configuration see [here](https://github.com/spotify/annoy).
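
FAISS, Voyager, HNSW and Annoy each accept a different set of metric names (see the lists above), so a name that is valid for one backend, such as `cosine`, is not valid for Annoy, which uses `angular`. A small pre-flight check along the following lines can catch such mismatches early. The function `check_distance` and the lookup table are hypothetical; the table is copied from the lists on this page rather than taken from BlockingPy itself.

```python
SUPPORTED_METRICS = {
    "faiss": {
        "euclidean", "cosine", "inner_product", "l1", "manhattan",
        "linf", "canberra", "bray_curtis", "jensen_shannon",
    },
    "voyager": {"cosine", "inner_product", "euclidean"},
    "hnsw": {"cosine", "l2", "euclidean", "ip"},
    "annoy": {"angular", "dot", "hamming", "manhattan", "euclidean"},
}


def check_distance(algo: str, control_ann: dict) -> None:
    """Raise ValueError if the configured distance is not listed for `algo`."""
    distance = control_ann.get(algo, {}).get("distance")
    if distance is None:
        return  # nothing configured, so the backend default applies
    allowed = SUPPORTED_METRICS.get(algo)
    if allowed is not None and distance not in allowed:
        raise ValueError(
            f"'{distance}' is not listed for '{algo}'; expected one of {sorted(allowed)}"
        )


check_distance("annoy", {"annoy": {"distance": "angular"}})  # passes silently
```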

### LSH Configuration

```python
control_ann = {
    'random_seed': None, # Random seed
    'lsh': {
        'k_search': 30,        # Number of neighbors to search
        'bucket_size': 500,    # Hash bucket size
        'hash_width': 10.0,    # Hash function width
        'num_probes': 0,       # Number of probes
        'projections': 10,     # Number of projections
        'tables': 30           # Number of hash tables
    }
}
```
For more information about `lsh` see [here](https://github.com/mlpack).

### K-d Tree Configuration

```python
control_ann = {
    'random_seed': None, # Random seed
    'kd': {
        'k_search': 30,           # Number of neighbors to search
        'algorithm': 'dual_tree', # Algorithm type
        'leaf_size': 20,          # Leaf size for tree
        'random_basis': False,    # Use random basis
        'rho': 0.7,               # Overlapping size
        'tau': 0.0,               # Early termination parameter
        'tree_type': 'kd',        # Type of tree to use
        'epsilon': 0.0            # Search approximation parameter
    }
}
```

For more information about `kd` see [here](https://github.com/mlpack).

### NND Configuration

```python
control_ann = {
    'random_seed': None, # Random seed
    'nnd': {
        'metric': 'euclidean',  # Distance metric
        'k_search': 30,         # Number of neighbors to search
        'n_threads': None,      # Number of threads
        'leaf_size': None,      # Leaf size for tree building
        'n_trees': None,        # Number of trees
        'diversify_prob': 1.0,  # Probability of including diverse neighbors
        'low_memory': True,     # Use low memory mode
    }
}
```
For more information about `nnd` see [here](https://pynndescent.readthedocs.io/en/latest/api.html).

### GPU Faiss Configuration

This applies only if you use `blockingpy-gpu`. The structure mirrors that of the other ANN algorithms.

```python
control_ann = {
    "gpu_faiss": {
        "index_type": "flat",  # Other options: "ivf", "ivfpq", "cagra"
        "k_search": 30,
        "distance": "cosine",
        "path": None,

        "ivf_nlist": 100,
        "ivf_nprobe": 10,

        "ivfpq_nlist": 100,
        "ivfpq_m": 8,
        "ivfpq_nbits": 8,
        "ivfpq_nprobe": 10,
        "ivfpq_useFloat16": False,
        "ivfpq_usePrecomputed": False,
        "ivfpq_reserveVecs": 0,
        "ivfpq_use_cuvs": False,

        "cagra": {
            "graph_degree": 64,
            "intermediate_graph_degree": 128,
            "build_algo": "ivf_pq",
            "nn_descent_niter": 20,
            "itopk_size": 64,
            "max_queries": 0,
            "algo": "auto",
            "team_size": 0,
            "search_width": 1,
            "min_iterations": 0,
            "max_iterations": 0,
            "thread_block_size": 0,
            "hashmap_mode": "auto",
            "hashmap_min_bitlen": 0,
            "hashmap_max_fill_rate": 0.5,
            "num_random_samplings": 1,
            "seed": 0x128394,
        },
    },
}
```