RemoveStopwordsRefiner

2025-10-09

📘 Overview

RemoveStopwordsRefiner is a text refinement operator that removes English stopwords (high-frequency words such as "the", "is", and "in" that carry little meaning on their own) from input text. It filters the text in the specified field against the stopwords corpus from the NLTK library, which increases the density of meaningful tokens and prepares the text for subsequent natural language processing tasks.
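
The filtering idea can be illustrated with a minimal sketch. It is not the operator's actual implementation: it assumes whitespace tokenization and case-insensitive matching, and `EXAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, with a small hand-picked subset standing in for NLTK's full English list.

```python
# Hand-picked subset for illustration only (an assumption); the operator
# itself relies on NLTK's full English stopwords list.
EXAMPLE_STOPWORDS = frozenset({"a", "am", "i", "in", "is", "the", "this", "to"})


def remove_stopwords(text: str, stopwords=EXAMPLE_STOPWORDS) -> str:
    """Drop every whitespace-separated token whose lowercase form is a stopword."""
    kept = [token for token in text.split() if token.lower() not in stopwords]
    return " ".join(kept)


print(remove_stopwords("This is a simple test"))
```

Because matching is done on the lowercase form, capitalized stopwords such as "This" or "The" are removed too, while the casing of the remaining words is left untouched.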

`init` function

def __init__(self, model_cache_dir: str = './dataflow_cache')

Parameters

Name	Type	Default	Description
model_cache_dir	str	'./dataflow_cache'	Cache directory path for storing NLTK stopwords data.

`run` function

def run(self, storage: DataFlowStorage, input_key: str)

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	Data flow storage instance for reading and writing data.
input_key	str	Required	Name of input column containing text to remove stopwords from.

📦 NLTK Data Configuration

This operator depends on NLTK's stopwords corpus.

Recommended Method: Use Pre-downloaded Data (Avoid Network Issues)

Download required packages from https://github.com/nltk/nltk_data:
- stopwords/
Set the `NLTK_DATA` environment variable to point to the data path:
```
export NLTK_DATA=/path/to/nltk_data
```

Automatic Download Method:

On first use, the operator automatically checks for the required data and downloads it if it is missing. If network issues cause the download to hang, use the pre-downloaded data method above instead.
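
To confirm that pre-downloaded data is in place before running the operator, a check along the following lines may help. This is a sketch under assumptions: `stopwords_available` is a hypothetical helper, not part of the operator or of NLTK, and it assumes the usual NLTK layout in which the corpus lives under `corpora/stopwords` (or `corpora/stopwords.zip`) inside a data directory.

```python
import os
from pathlib import Path


def stopwords_available(search_dirs=None) -> bool:
    """Return True if a directory holds corpora/stopwords or corpora/stopwords.zip."""
    if search_dirs is None:
        # NLTK_DATA may list several directories separated by os.pathsep.
        raw = os.environ.get("NLTK_DATA", "")
        search_dirs = [d for d in raw.split(os.pathsep) if d]
    for base in search_dirs:
        corpora = Path(base) / "corpora"
        # Assumed layout: either an unpacked folder or the zipped package.
        if (corpora / "stopwords").is_dir() or (corpora / "stopwords.zip").is_file():
            return True
    return False


print("stopwords data found:", stopwords_available())
```

If the check prints `False` after `NLTK_DATA` has been exported, the directory layout under the data path is the first thing to inspect.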

🧠 Example Usage

from dataflow.operators.general_text import RemoveStopwordsRefiner
from dataflow.utils.storage import FileStorage

class RemoveStopwordsRefinerTest:
    def __init__(self):
        # Storage that reads the input JSONL file and caches step results under cache_path.
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/remove_stopwords_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # Default model_cache_dir ('./dataflow_cache') is used for the NLTK stopwords data.
        self.refiner = RemoveStopwordsRefiner()

    def forward(self):
        # Remove stopwords from the 'text' column.
        self.refiner.run(
            storage=self.storage.step(),
            input_key='text'
        )

if __name__ == "__main__":
    test = RemoveStopwordsRefinerTest()
    test.forward()

🧾 Default Output Format

Field	Type	Description
text	str	Text with stopwords removed

📋 Sample Input

{"text":"This is a simple test"}
{"text":"The quick brown fox jumps"}
{"text":"I am going to the store"}

📤 Sample Output

{"text":"simple test"}
{"text":"quick brown fox jumps"}
{"text":"going store"}

📊 Results Analysis

Sample 1: Removed "This", "is", "a"
Sample 2: Removed "The"
Sample 3: Removed "I", "am", "to", "the"

Use Cases:

NLP text preprocessing
Keyword extraction
Feature extraction before text classification

Notes:

Uses NLTK English stopwords list
Only applicable to English text
