blockingpy.controls.controls_txt

blockingpy.controls.controls_txt(controls, **kwargs)

Create configuration dictionary for text processing operations.

Parameters:
  • controls (dict) – Dictionary of control parameters to override defaults

  • **kwargs (dict) – Additional keyword arguments for direct parameter updates

Returns:

Configuration dictionary with the following structure:

    {
        'encoder': str,
        'embedding': {
            'model': str,
            'normalize': bool,
            'max_length': int,
            'emb_batch_size': int,
            'show_progress_bar': bool,
            'use_multiprocessing': bool,
            'multiprocessing_threshold': int,
        },
        'shingle': {
            'n_shingles': int,
            'lowercase': bool,
            'strip_non_alphanum': bool,
            'max_features': int,
        },
    }

Return type:

dict

Notes

Configuration options:

  • encoder: Type of text encoder (‘shingle’ or ‘embedding’)

For ‘embedding’, the following additional parameters apply:

  • model: Pretrained model identifier or path

  • normalize: Normalize output vectors if True

  • max_length: Maximum sequence length for encoding

  • emb_batch_size: Batch size for encoding

  • show_progress_bar: Show progress bar if True

  • use_multiprocessing: Use multiprocessing if True

  • multiprocessing_threshold: Threshold for multiprocessing

For ‘shingle’, the following additional parameters apply:

  • n_shingles: Number of consecutive characters to combine

  • max_features: Maximum number of features to keep

  • lowercase: Convert text to lowercase if True

  • strip_non_alphanum: Remove non-alphanumeric characters if True
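
To make the shingle options concrete, here is a rough sketch of character shingling. The function char_shingles is hypothetical and not part of BlockingPy. The exact characters the library strips and the order of its preprocessing steps are assumptions: here every character other than ASCII letters and digits is removed, whitespace included. max_features is not modelled, since it limits the vocabulary built across a whole corpus rather than a single string.

```python
import re


def char_shingles(text, n_shingles=2, lowercase=True, strip_non_alphanum=True):
    """Return overlapping character n-grams of `text` (illustrative only)."""
    if lowercase:
        text = text.lower()
    if strip_non_alphanum:
        # Assumption: drop everything except ASCII letters and digits.
        text = re.sub(r"[^0-9A-Za-z]+", "", text)
    # Slide a window of n_shingles characters over the cleaned text.
    return [text[i:i + n_shingles] for i in range(len(text) - n_shingles + 1)]


shingles = char_shingles("Ab-c")  # ["ab", "bc"]
```

With lowercase and strip_non_alphanum both disabled, the same input yields "Ab", "b-" and "-c", which shows how the two flags change the shingles that end up being compared.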