blockingpy.text_encoders.shingle_encoder.NgramEncoder
- class blockingpy.text_encoders.shingle_encoder.NgramEncoder(n_shingles=2, lowercase=True, strip_non_alphanum=True, max_features=5000)
Encoder that converts text strings into a sparse document-term matrix of character n-gram counts, packaged in a DataHandler.
- __init__(n_shingles=2, lowercase=True, strip_non_alphanum=True, max_features=5000)
Create a character n-gram encoder.
- Parameters:
n_shingles – Number of characters per shingle.
lowercase – If True, convert text to lowercase before tokenisation.
strip_non_alphanum – If True, remove non-alphanumeric characters before shingling.
max_features – Maximum number of unique shingles to keep.
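As a rough illustration of what these parameters control, the sketch below builds character shingles with the standard library only. It is not the library's implementation: `char_shingles` is a hypothetical helper, and the exact preprocessing (for example, whether whitespace counts as non-alphanumeric) is an assumption.

```python
from collections import Counter


def char_shingles(text, n_shingles=2, lowercase=True, strip_non_alphanum=True):
    """Hypothetical helper: split one string into overlapping character shingles."""
    if lowercase:
        text = text.lower()
    if strip_non_alphanum:
        # Assumption: every character that is not a letter or digit is dropped,
        # including whitespace and punctuation.
        text = "".join(ch for ch in text if ch.isalnum())
    # Slide a window of n_shingles characters over the cleaned string.
    # Strings shorter than n_shingles yield an empty list.
    return [text[i:i + n_shingles] for i in range(len(text) - n_shingles + 1)]


# Count how often each shingle occurs in a single string.
counts = Counter(char_shingles("Banana!"))
print(counts)
```

With the defaults, "Banana!" is reduced to "banana" before it is split into overlapping two-character shingles. Per the parameter description, `max_features` would then cap how many distinct shingles become columns of the resulting matrix.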
Methods
__init__([n_shingles, lowercase, ...]) – Create a character n-gram encoder.
fit(X[, y]) – Stateless encoder; fitting is a no-op, provided for API parity.
fit_transform(X[, y]) – Fit the encoder on X and return the transformed matrix.
transform(X) – Transform a series of strings into a sparse shingle count matrix.
- fit_transform(X, y=None)
Fit the encoder on X and return the transformed matrix.
Equivalent to calling fit() followed by transform().
- Parameters:
X – Series of input strings.
y – Ignored.
- Returns:
The encoded feature matrix together with its column names.
- Return type: