StemmingLemmatizationRefiner


2025-10-09

📘 Overview

The StemmingLemmatizationRefiner operator performs stemming or lemmatization on text, converting words to their base or root forms. This standardizes text and reduces word variation, which can improve the performance of downstream processing tasks. The operator supports the Porter stemming algorithm and WordNet lemmatization.

init function

def __init__(self, method: str = "stemming"):

init parameter description

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| method | str | "stemming" | Specifies the processing method. Options are 'stemming' (word stemming) or 'lemmatization' (word lemmatization). |
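The choice between the two methods can be pictured as a small dispatch on `method`. The following is a minimal sketch, not the operator's actual implementation: `build_word_transform` and `VALID_METHODS` are hypothetical names, and raising `ValueError` for an unknown value is an assumption, since this page does not say how the operator handles invalid input. Only the NLTK classes `PorterStemmer` and `WordNetLemmatizer` are real.

```python
from typing import Callable

# Hypothetical constant: the two option strings documented for `method`.
VALID_METHODS = ("stemming", "lemmatization")


def build_word_transform(method: str = "stemming") -> Callable[[str], str]:
    """Return a word -> base-form function for the chosen method (illustrative)."""
    # Validate first so a typo fails before any NLTK import or data lookup.
    # Assumption: the real operator may treat invalid values differently.
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")

    if method == "stemming":
        # Porter stemming is rule-based and needs no corpus data.
        from nltk.stem import PorterStemmer
        return PorterStemmer().stem

    # Lemmatization looks words up in WordNet, so the WordNet data must be available.
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize
```

With NLTK installed, `build_word_transform("stemming")("running")` returns "run". Note that NLTK's `lemmatize` treats words as nouns unless a part of speech is passed, so verb forms can come back unchanged; whether the operator passes a part of speech is not stated here.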

run function

def run(self, storage: DataFlowStorage, input_key: str):

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| storage | DataFlowStorage | Required | Data flow storage instance for reading and writing data. |
| input_key | str | Required | Input column name specifying the text field in the DataFrame to process. |
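Conceptually, `run` reads the rows, rewrites the column named by `input_key`, and writes the result back. The sketch below illustrates only that middle step, under stated assumptions: it uses a plain list of dicts instead of the DataFrame held by `DataFlowStorage`, tokenizes on whitespace, and `refine_rows` and the toy transform are hypothetical names, not part of the library.

```python
from typing import Callable, Dict, List


def refine_rows(
    rows: List[Dict[str, object]],
    input_key: str,
    word_transform: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Apply word_transform to every whitespace-separated word in rows[i][input_key]."""
    refined = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are left untouched
        text = new_row.get(input_key)
        if isinstance(text, str):
            # Assumption: simple whitespace tokenization; the operator may differ.
            new_row[input_key] = " ".join(word_transform(w) for w in text.split())
        refined.append(new_row)
    return refined


# Toy stand-in for a stemmer: strips trailing "s" characters only.
rows = [{"text": "cats dogs", "id": 1}]
print(refine_rows(rows, "text", lambda w: w.rstrip("s")))
```

Other columns pass through unchanged, which matches the output format on this page, where only the `text` field is rewritten.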

📦 Dependency Configuration

This operator depends on NLTK's WordNet data.

Method 1: Use Pre-downloaded NLTK Data (Recommended)

Download the NLTK data packages from https://github.com/nltk/nltk_data and make sure they include:
- wordnet/
- omw-1.4/
Set the NLTK_DATA environment variable to the data path:
```
export NLTK_DATA=/path/to/nltk_data
```
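Before running the operator against pre-downloaded data, it can help to confirm that the directory actually contains both packages. The check below is a sketch using only the standard library; `missing_corpora` and `check_nltk_data_env` are hypothetical helpers, and the layout it assumes (`corpora/<name>/` or `corpora/<name>.zip` under each data directory) is the one NLTK uses when it looks up corpora.

```python
import os
from pathlib import Path

# The two packages this page lists as required.
REQUIRED_CORPORA = ("wordnet", "omw-1.4")


def missing_corpora(data_dir, required=REQUIRED_CORPORA):
    """Return the required corpus names not found under data_dir/corpora."""
    corpora = Path(data_dir) / "corpora"
    missing = []
    for name in required:
        unpacked = corpora / name          # e.g. corpora/wordnet/
        zipped = corpora / f"{name}.zip"   # e.g. corpora/wordnet.zip
        if not (unpacked.is_dir() or zipped.is_file()):
            missing.append(name)
    return missing


def check_nltk_data_env(env=os.environ):
    """Map each directory listed in NLTK_DATA to its missing corpora."""
    raw = env.get("NLTK_DATA", "")
    # NLTK_DATA may hold several directories separated by os.pathsep.
    dirs = [d for d in raw.split(os.pathsep) if d]
    return {d: missing_corpora(d) for d in dirs}


if __name__ == "__main__":
    for directory, missing in check_nltk_data_env().items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{directory}: {status}")
```

An empty result from `check_nltk_data_env()` means `NLTK_DATA` is unset or empty in the current environment.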

Method 2: Automatic Download

On first use, the operator automatically downloads the required data to the default location (~/nltk_data or ./dataflow_cache/nltk_data).

🧠 Example Usage

from dataflow.operators.general_text import StemmingLemmatizationRefiner
from dataflow.utils.storage import FileStorage


class StemmingLemmatizationRefinerTest:
    def __init__(self):
        # Input file and cache settings for the pipeline
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/stemming_lemmatization_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # method defaults to "stemming"; pass method="lemmatization" to lemmatize instead
        self.refiner = StemmingLemmatizationRefiner()

    def forward(self):
        # Process the 'text' column of the current step's data
        self.refiner.run(
            storage=self.storage.step(),
            input_key="text",
        )


if __name__ == "__main__":
    test = StemmingLemmatizationRefinerTest()
    test.forward()

🧾 Default Output Format

| Field | Type | Description |
| --- | --- | --- |
| text | str | Text after stemming or lemmatization |

📋 Sample Input

{"text":"running jumps quickly"}
{"text":"cats dogs playing"}
{"text":"studied studying studies"}

📤 Sample Output (method="stemming")

{"text":"run jump quickli"}
{"text":"cat dog play"}
{"text":"studi studi studi"}

📊 Results Analysis

Sample 1: "running" → "run", "jumps" → "jump", "quickly" → "quickli"
Sample 2: "cats" → "cat", "dogs" → "dog", "playing" → "play"
Sample 3: All three forms "studied", "studying", and "studies" become "studi"

Use Cases:

- Text standardization and normalization
- Word matching in information retrieval
- Feature extraction for text classification
- Vocabulary size reduction
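The effect on vocabulary size can be checked directly on the sample data from this page. The snippet below is a small illustration using only the standard library; `vocabulary` is a hypothetical helper, and whitespace tokenization is an assumption made for the example.

```python
def vocabulary(texts):
    """Collect the set of distinct whitespace-separated tokens."""
    return {token for text in texts for token in text.split()}


# Sample input and stemmed sample output, copied from this page.
sample_input = [
    "running jumps quickly",
    "cats dogs playing",
    "studied studying studies",
]
sample_output = [
    "run jump quickli",
    "cat dog play",
    "studi studi studi",
]

before = vocabulary(sample_input)
after = vocabulary(sample_output)
print(len(before), len(after))  # prints: 9 7
```

The drop from 9 to 7 distinct tokens comes entirely from the third sample, where three surface forms collapse into the single stem "studi".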

Notes:

- Stemming: Fast, but may produce non-real words (like "quickli")
- Lemmatization: Accurate but slower; requires WordNet data
- Only applicable to English text
