UniqueWordsFilter

About 491 wordsAbout 2 min

2025-10-09

📘 Overview

UniqueWordsFilter is a text filtering operator that filters data based on whether the ratio of unique words in text reaches a preset threshold.

init Function

def __init__(self, threshold: float=0.1)

Init Parameters

Parameter	Type	Default	Description
threshold	float	0.1	Threshold for unique word ratio; text below this threshold will be filtered out.

run Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str='unique_words_filter')

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field to check.
output_key	str	'unique_words_filter'	Output column name for storing filter result flag (value of 1).

🧠 Example Usage

from dataflow.operators.general_text import UniqueWordsFilter
from dataflow.utils.storage import FileStorage

class UniqueWordsFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/unique_words_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        
        self.filter = UniqueWordsFilter(
            threshold=0.1
        )
        
    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='unique_words_filter'
        )

if __name__ == "__main__":
    test = UniqueWordsFilterTest()
    test.forward()

The operator returns a filtered DataFrame containing only rows where the unique word ratio is greater than threshold. The DataFrame will have a new column specified by output_key, with a constant value of 1.

Field	Type	Description
`<input_key>`	str	Original input text field (retained).
`<output_key>`	int	Filter result flag; value is always 1 in output DataFrame.

📋 Example Input

{"text": "The quick brown fox jumps over the lazy dog"}
{"text": "good good good good good good good good good good"}
{"text": "This is a simple test with various different words"}

📤 Example Output

{"text": "The quick brown fox jumps over the lazy dog", "unique_words_filter": 1}
{"text": "This is a simple test with various different words", "unique_words_filter": 1}

📊 Result Analysis

In this test, 2 texts passed the filter and 1 was filtered out:

Sample 1 (Passed) - High uniqueness text:

Total word count: 9
Unique word count: 8 ("the" appears twice)
Unique word ratio: 8 / 9 ≈ 0.889 (88.9%)
Result: Passes filter ✓ (0.889 > 0.1 threshold)

Sample 2 (Filtered) - Extremely low uniqueness text:

Total word count: 10
Unique word count: 1 (only "good")
Unique word ratio: 1 / 10 = 0.1 (10%)
Result: Filtered out ✗ (0.1 ≤ 0.1 threshold, must be strictly greater than)

Sample 3 (Passed) - Fully unique text:

Total word count: 9
Unique word count: 9 (all words are distinct)
Unique word ratio: 9 / 9 = 1.0 (100%)
Result: Passes filter ✓ (1.0 > 0.1 threshold)

How It Works:

Convert text to lowercase
Split into word list using spaces
Calculate unique word count using set
Calculate unique word ratio = unique words / total words
Retain if ratio > threshold

Use Cases:

Filter text with excessive repetition
Detect low-quality machine-generated text
Identify language monotony issues
Dataset diversity quality control

Notes:

Case-insensitive (converted to lowercase for comparison)
Uses space tokenization
Higher threshold means stricter filtering
Default threshold=0.1 is very lenient, only filters extremely repetitive text

eval

generate

eval

generate

eval

filter

generate

eval

filter

generate

generate

eval

filter

generate

refine

generate

generate

eval

filter

refine

generate

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

refine

UniqueWordsFilter