(basic_operations)=
# Basic Operations

## Overview

BlockingPy provides three main operations:

- Record Linkage: Finding matching records between two datasets
- Deduplication: Finding duplicate records within a single dataset
- Evaluation: Measuring blocking quality when the true blocks are known (for both record linkage and deduplication), either inside the `block` method or with the separate `eval` method

This guide covers the basic usage patterns for these operations.

## Record Linkage

### Basic usage

```python
from blockingpy import Blocker
import pandas as pd

# Example datasets
dataset1 = pd.Series([
    "john smith new york",
    "janee doe Boston",
    "robert brow chicagoo"
])

dataset2 = pd.Series([
    "smith john ny",
    "jane doe boston",
    "rob brown chicago"
])

# Initialize blocker
blocker = Blocker()

# Perform blocking
blocking_result = blocker.block(
    x=dataset1,  # Reference dataset
    y=dataset2,  # Query dataset
    ann="hnsw"   # Choose ANN algorithm (`hnsw` here)
)
```

### Results

The blocking operation returns a BlockingResult object with several useful attributes:

```python
# print blocking results
print(blocking_result)
# Shows:
# - Number of blocks created
# - Number of features created for blocking from text representation
# - Reduction ratio (how much the comparison space was reduced)
# - Distribution of block sizes

# Access detailed results
blocking_result.result  # DataFrame with columns: x, y, block, dist
blocking_result.method  # ANN algorithm used
blocking_result.colnames  # Features used for blocking
```
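
To make the reduction ratio in the summary easier to read, here is a minimal sketch of the usual textbook definition for deduplication: one minus the share of record pairs that still need comparing after blocking. It is plain Python and not BlockingPy's implementation; the function name `reduction_ratio_dedup` is made up for illustration, and the value BlockingPy reports may be computed differently.

```python
from math import comb


def reduction_ratio_dedup(block_sizes):
    """Reduction ratio for deduplication, given the number of records per block.

    Assumes every record belongs to exactly one block.
    """
    n_records = sum(block_sizes)
    total_pairs = comb(n_records, 2)  # pairs to compare without blocking
    if total_pairs == 0:
        raise ValueError("need at least two records")
    # With blocking, only pairs inside the same block are compared
    blocked_pairs = sum(comb(size, 2) for size in block_sizes)
    return 1 - blocked_pairs / total_pairs


# Five records split into one block of 2 and one block of 3:
# 10 possible pairs in total, 1 + 3 = 4 pairs left after blocking
print(reduction_ratio_dedup([2, 3]))
```

A value close to 1 means blocking removed most of the candidate pairs; a value close to 0 means almost nothing was saved.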

## Deduplication

### Basic usage

```python
data = pd.Series([
    "john smith new york",
    "smith john ny",
    "jane doe boston",
    "j smith new york",
    "jane doe boston ma"
])

# Perform deduplication
result = blocker.block(
    x=data,
    ann="voyager"
)
```

Printing `result` shows the same kind of summary as in the record linkage example.

## Evaluating Blocking Quality

If you have ground truth data, you can evaluate blocking quality: 

### Example ground truth for deduplication

```python
data = ...  # your data

true_blocks = pd.DataFrame({
    'x': [0, 1, 2, 3, 4],      # Record indices
    'block': [0, 0, 1, 1, 1]   # True block assignments
})

result = blocker.block(
    x=data,
    true_blocks=true_blocks
)

# Access evaluation metrics
print(result.metrics)    # Shows precision, recall, F1-score, etc.
print(result.confusion)  # Confusion matrix
```

Alternatively, use the separate `eval` method:

```python
data = ...  # your data

true_blocks = pd.DataFrame({
    'x': [0, 1, 2, 3, 4],  
    'block': [0, 0, 1, 1, 1]   
})

result = blocker.block(
    x=data,
)
evals = blocker.eval(
    blocking_result=result,
    true_blocks=true_blocks,
)
print(evals.metrics)
print(evals.confusion) 
```
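
To help interpret the metrics and the confusion matrix, the sketch below shows how pairwise precision, recall and F1 are commonly computed for a deduplication blocking: every pair of records is classified by whether it shares a block in the ground truth and in the predicted blocking. This is an illustration in plain Python, not BlockingPy's implementation; `pairwise_metrics` and the `pred_block` values are hypothetical, and the library's own definitions may differ in detail.

```python
from itertools import combinations


def pairwise_metrics(true_block, pred_block):
    """Compare two block assignments (lists indexed by record) pair by pair."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_block)), 2):
        same_true = true_block[i] == true_block[j]
        same_pred = pred_block[i] == pred_block[j]
        if same_true and same_pred:
            tp += 1  # true match kept in the same block
        elif same_pred:
            fp += 1  # non-match placed in the same block
        elif same_true:
            fn += 1  # true match split across blocks
        else:
            tn += 1  # non-match correctly separated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


true_block = [0, 0, 1, 1, 1]  # same assignments as in true_blocks above
pred_block = [0, 0, 0, 1, 1]  # hypothetical blocking output
print(pairwise_metrics(true_block, pred_block))
```

Counting pairs rather than records is what makes these numbers comparable across blockings with different numbers of blocks.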

### Example ground truth for record linkage

```python
data_1 = ...  # your data
data_2 = ...  # your data

true_blocks = pd.DataFrame({
    'x': [0, 1, 2, 3, 4],     # Record indices (reference)
    'y': [3, 1, 4, 0, 2],     # Record indices (query)
    'block': [0, 1, 2, 0, 2]  # True block assignments
})

result = blocker.block(
    x=data_1,
    y=data_2,
    true_blocks=true_blocks
)

# Access evaluation metrics
print(result.metrics)    # Shows precision, recall, F1-score, etc.
print(result.confusion)  # Confusion matrix
```

The same evaluation with the separate `eval` method:

```python
data_1 = ...  # your data
data_2 = ...  # your data

true_blocks = pd.DataFrame({
    'x': [0, 1, 2, 3, 4],
    'y': [3, 1, 4, 0, 2],
    'block': [0, 1, 2, 0, 2]
})

result = blocker.block(
    x=data_1,
    y=data_2,
)
evals = blocker.eval(
    blocking_result=result,
    true_blocks=true_blocks
)
print(evals.metrics) 
print(evals.confusion) 
```

## Choosing an ANN Algorithm

BlockingPy supports multiple ANN algorithms, each with its strengths:

```python
# FAISS (default) - Supports LSH, HNSW and Flat Index
result = blocker.block(x=data, ann="faiss")

# Annoy
result = blocker.block(x=data, ann="annoy")

# HNSW
result = blocker.block(x=data, ann="hnsw")

# Other options: "voyager", "lsh", "kd", "nnd"
```

## Working with the lsh or kd algorithm

When `ann` is set to `lsh` or `kd`, also name the chosen algorithm under the `algo` key of `control_ann`:

```python
control_ann = {
    "algo" : "lsh",
    "lsh" : {
        # ...
        # your parameters for lsh here
        # ...
    }
}

result = blocker.block(
    x=data,
    ann="lsh",
    control_ann=control_ann,
)
```

## Working with the faiss implementation

When `ann` is set to `faiss`, specify which index to use in `control_ann`:

```python
control_ann = {
    "faiss" : {
        "index_type": "flat"  # or "hnsw" or "lsh"
    }
}
```
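
Because a mistyped index name is easy to miss in a nested dictionary, a small helper can build the `control_ann` entry and reject anything other than the three index types listed above. This is only a convenience sketch: `make_faiss_control` and `FAISS_INDEX_TYPES` are hypothetical names, not part of BlockingPy.

```python
# The three index types named in this guide
FAISS_INDEX_TYPES = ("flat", "hnsw", "lsh")


def make_faiss_control(index_type):
    """Hypothetical helper: build a control_ann dict for the faiss backend."""
    if index_type not in FAISS_INDEX_TYPES:
        raise ValueError(
            f"index_type must be one of {FAISS_INDEX_TYPES}, got {index_type!r}"
        )
    return {"faiss": {"index_type": index_type}}


control_ann = make_faiss_control("hnsw")
print(control_ann)

# The dict can then be passed on, for example:
# result = blocker.block(x=data, ann="faiss", control_ann=control_ann)
```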