blockingpy.text_encoders.shingle_encoder.NgramEncoder

class blockingpy.text_encoders.shingle_encoder.NgramEncoder(n_shingles=2, lowercase=True, strip_non_alphanum=True, max_features=5000)

Encoder that converts text strings into a sparse document-term matrix of character n-gram counts, packaged in a DataHandler.

__init__(n_shingles=2, lowercase=True, strip_non_alphanum=True, max_features=5000)

Create a character n-gram encoder.

Parameters:
  • n_shingles – Number of characters per shingle.

  • lowercase – If True, convert text to lowercase before shingling.

  • strip_non_alphanum – If True, remove non-alphanumeric characters before shingling.

  • max_features – Maximum number of unique shingles to keep.
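
The sketch below illustrates how these four parameters could interact. It is a standard-library approximation, not the NgramEncoder source: the helper names char_shingles and top_shingles are hypothetical, and both the treatment of whitespace and the "most frequent first" rule for max_features are assumptions.

```python
from collections import Counter


def char_shingles(text, n_shingles=2, lowercase=True, strip_non_alphanum=True):
    """Return the overlapping character n-grams of `text` (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if strip_non_alphanum:
        # Assumption: every character that is not a letter or digit is
        # dropped, including whitespace.
        text = "".join(ch for ch in text if ch.isalnum())
    # Slide a window of n_shingles characters over the cleaned string.
    return [text[i:i + n_shingles] for i in range(len(text) - n_shingles + 1)]


def top_shingles(texts, max_features=5000, **kwargs):
    """Keep at most `max_features` shingles, most frequent first (assumed rule)."""
    counts = Counter()
    for text in texts:
        counts.update(char_shingles(text, **kwargs))
    return [shingle for shingle, _ in counts.most_common(max_features)]


print(char_shingles("Monty P."))  # ['mo', 'on', 'nt', 'ty', 'yp']
```

A string shorter than n_shingles yields no shingles in this sketch; whether the library behaves the same way is not stated here.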

Methods

  • __init__([n_shingles, lowercase, ...]) – Create a character n-gram encoder.

  • fit(X[, y]) – No-op; the encoder is stateless and the method exists only for API parity.

  • fit_transform(X[, y]) – Fit the encoder on X and return the transformed matrix.

  • transform(X) – Transform a series of strings into a sparse shingle count matrix.

fit(X, y=None)

No-op. The encoder is stateless, so nothing is learned from X; the method exists only for API parity.

fit_transform(X, y=None)

Fit the encoder on X and return the transformed matrix.

Equivalent to calling fit() followed by transform().

Parameters:
  • X – Series of input strings.

  • y – Ignored.

Returns:

The encoded feature matrix together with its column names.

Return type:

DataHandler
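
The relationship between fit(), transform(), and fit_transform() described above can be sketched with a toy class. StatelessEncoderSketch is a hypothetical stand-in, not the library's code: returning self from fit() is an assumption borrowed from the scikit-learn convention, and the placeholder transform() only measures string lengths.

```python
class StatelessEncoderSketch:
    """Hypothetical stand-in that mirrors the fit/transform contract."""

    def fit(self, X, y=None):
        # Nothing is learned from X. Returning self is an assumption that
        # follows the scikit-learn convention and allows method chaining.
        return self

    def transform(self, X):
        # Placeholder transformation: the length of each input string.
        return [len(s) for s in X]

    def fit_transform(self, X, y=None):
        # Equivalent to calling fit() followed by transform().
        return self.fit(X, y).transform(X)


encoder = StatelessEncoderSketch()
print(encoder.fit_transform(["ab", "c"]))  # [2, 1]
```

Because no state is stored, calling transform() without a prior fit() gives the same result in this sketch.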

transform(X)

Transform a series of strings into a sparse shingle count matrix.

Parameters:
  • X – Series of text strings.

Returns:

A DataHandler with two fields: data, a csr_matrix of shape (n_samples, n_features) holding the shingle counts, and cols, the list of shingle strings that name the columns.

Return type:

DataHandler
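
To make the returned structure concrete, the following standard-library sketch builds a shingle count matrix in CSR form (row pointers, column indices, values) together with its column names. HandlerSketch and count_matrix are hypothetical names, the alphabetical column order is an assumption, and no preprocessing or max_features cap is applied.

```python
from collections import Counter
from typing import NamedTuple


class HandlerSketch(NamedTuple):
    """Hypothetical stand-in for DataHandler: CSR arrays plus column names."""
    indptr: list   # row i occupies positions indptr[i]:indptr[i + 1]
    indices: list  # column index of each stored count
    values: list   # the stored counts themselves
    cols: list     # shingle string for each column


def count_matrix(texts, n=2):
    # Split each string into overlapping n-character shingles.
    docs = [[t[i:i + n] for i in range(len(t) - n + 1)] for t in texts]
    # Assumption: columns are ordered alphabetically by shingle.
    cols = sorted({s for doc in docs for s in doc})
    col_index = {s: j for j, s in enumerate(cols)}
    indptr, indices, values = [0], [], []
    for doc in docs:
        # Only shingles that occur in the row are stored (sparse layout).
        for shingle, count in sorted(Counter(doc).items()):
            indices.append(col_index[shingle])
            values.append(count)
        indptr.append(len(indices))
    return HandlerSketch(indptr, indices, values, cols)


result = count_matrix(["anan", "ann"])
print(result.cols)    # ['an', 'na', 'nn']
print(result.values)  # [2, 1, 1, 1]
```

The three arrays use the same layout that scipy.sparse.csr_matrix accepts as (data, indices, indptr), so the sketch's output corresponds to a matrix of shape (2, 3) here.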