blockingpy.text_encoders.embedding_encoder.EmbeddingEncoder

class blockingpy.text_encoders.embedding_encoder.EmbeddingEncoder(model='minishlab/potion-base-8M', normalize=None, max_length=512, emb_batch_size=1024, show_progress_bar=False, use_multiprocessing=True, multiprocessing_threshold=10000)

Dense-vector encoder that wraps model2vec.StaticModel.

The encoder converts a pandas.Series of text strings into a DataHandler whose data attribute is a C-contiguous np.ndarray of shape (n_samples, embedding_dim) and whose cols attribute holds the synthetic column names emb_0, ..., emb_{d-1}, where d is embedding_dim.
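The column-naming scheme described above can be sketched in a few lines. This is only an illustration of the emb_0, ..., emb_{d-1} convention; the helper name embedding_column_names is hypothetical and is not part of blockingpy.

```python
def embedding_column_names(dim: int) -> list[str]:
    """Return the synthetic names emb_0 ... emb_{dim-1}.

    Hypothetical helper used only to illustrate the naming scheme;
    it is not a blockingpy function.
    """
    if dim < 1:
        raise ValueError("dim must be a positive integer")
    return [f"emb_{i}" for i in range(dim)]


# One name per embedding dimension, in column order.
cols = embedding_column_names(4)
print(cols)  # ['emb_0', 'emb_1', 'emb_2', 'emb_3']
```

With n_samples input strings and an embedding of dimension d, the data array would have n_samples rows and one column per name in this list.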

__init__(model='minishlab/potion-base-8M', normalize=None, max_length=512, emb_batch_size=1024, show_progress_bar=False, use_multiprocessing=True, multiprocessing_threshold=10000)

Methods

__init__([model, normalize, max_length, ...])

fit(X[, y])

No-op fit for scikit-learn compatibility.

fit_transform(X[, y])

Fit the encoder on X and return the transformed matrix.

transform(X)

Encode X into dense numeric vectors.

fit(X, y=None)

No-op fit for scikit-learn compatibility.

fit_transform(X, y=None)

Fit the encoder on X and return the transformed matrix.

Equivalent to calling fit() followed by transform().

Parameters:
  • X – Series of input strings.

  • y – Ignored.

Returns:

The encoded feature matrix together with its column names.

Return type:

DataHandler
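The relationship between fit(), transform(), and fit_transform() can be illustrated with a toy stand-in. The sketch below is not blockingpy's implementation: ToyEncoder is a hypothetical class, it uses a hash instead of a model2vec model, it returns plain lists rather than a DataHandler, and having fit() return self is an assumption based on the usual scikit-learn convention.

```python
import hashlib


class ToyEncoder:
    """Hypothetical stand-in for a stateless text encoder (NOT EmbeddingEncoder)."""

    def __init__(self, dim: int = 4):
        # A SHA-256 digest has 32 bytes, so the toy supports at most 32 dimensions.
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def fit(self, X, y=None):
        # Nothing is learned: the mapping from text to vector is fixed.
        # Returning self is assumed here, following scikit-learn convention.
        return self

    def transform(self, X):
        # Map each string to `dim` floats in [0, 1] derived from its hash.
        rows = []
        for text in X:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            rows.append([digest[i] / 255.0 for i in range(self.dim)])
        return rows

    def fit_transform(self, X, y=None):
        # Same result as calling fit() and then transform().
        return self.fit(X, y).transform(X)


texts = ["john smith", "jon smyth"]
encoder = ToyEncoder(dim=4)
vectors = encoder.fit_transform(texts)
print(len(vectors), len(vectors[0]))  # 2 4
```

Because fit() does no work, calling transform() directly on a fresh encoder gives the same vectors as fit_transform() in this sketch.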

transform(X)

Encode X into dense numeric vectors.

Parameters:

X – Series of raw text strings.

Returns:

A DataHandler whose data attribute is an np.ndarray of shape (n_samples, d) with dtype float32 and whose cols attribute holds the synthetic names emb_0, ..., emb_{d-1}.

Return type:

DataHandler