NoPuncFilter

About 467 wordsAbout 2 min

2025-10-09

📘 Overview

NoPuncFilter detects and filters text lacking proper punctuation. It splits text by punctuation marks and checks the word count of each sentence fragment. If any sentence fragment exceeds the threshold word count, it indicates the text lacks necessary punctuation, and such text will be filtered out.

init Function

def __init__(self, threshold: int=112)

Init Parameters

Parameter	Type	Default	Description
threshold	int	112	Threshold for maximum sentence word count.

run Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str='no_punc_filter_label')

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field to check.
output_key	str	'no_punc_filter_label'	Output column name for storing filter result labels.

🧠 Example Usage

from dataflow.operators.general_text import NoPuncFilter
from dataflow.utils.storage import FileStorage

class NoPuncFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/no_punc_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        
        self.filter = NoPuncFilter(
            threshold=112
        )
        
    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='no_punc_filter_label'
        )

if __name__ == "__main__":
    test = NoPuncFilterTest()
    test.forward()

🧾 Default Output Format

After execution, the operator adds a column specified by output_key (default no_punc_filter_label) to the original data, with a value of 1 indicating that the data passed the filter check. Finally, the operator filters and writes back all rows that pass the check.

Field	Type	Description
no_punc_filter_label	int	Filter result label. Value of 1 indicates the text contains proper punctuation (all sentence fragments have word counts within the threshold).

📋 Example Input

{"text": "This is a normal sentence. It has proper punctuation."}
{"text": "Thisisaverylongsentencewithoutanyspacesorpunctuationwhichwillexceedthethresholdbecauseithasmanymanywordsthatcannotbecountedproperlywithoutspacesandthiswillcauseittobefiltered"}
{"text": "Short text. Another sentence. Good punctuation throughout the entire document which is very helpful."}

📤 Example Output

{"text": "This is a normal sentence. It has proper punctuation.", "no_punc_filter_label": 1}
{"text": "Thisisaverylongsentencewithoutanyspacesorpunctuationwhichwillexceedthethresholdbecauseithasmanymanywordsthatcannotbecountedproperlywithoutspacesandthiswillcauseittobefiltered", "no_punc_filter_label": 1}
{"text": "Short text. Another sentence. Good punctuation throughout the entire document which is very helpful.", "no_punc_filter_label": 1}

📊 Result Analysis

Sample 1 (Normal sentences):

Split by punctuation: ["This is a normal sentence", " It has proper punctuation", ""]
Max word count: About 5-6 words
Threshold: 112
Passes filter (5-6 ≤ 112)

Sample 2 (Long sentence without spaces):

Split by punctuation: Entire segment as one sentence
Tokenized with split(): 1 "word" (no spaces)
Max word count: 1
Threshold: 112
Passes filter (1 ≤ 112)

Sample 3 (Multiple short sentences):

Split into multiple sentences by punctuation
Max word count per sentence < 112
Passes filter (all sentences within threshold)

Use Cases:

Filter text lacking proper punctuation
Ensure text has good readability and structure
Data quality control
Filter abnormal machine-generated text or unstructured text

Notes:

Uses regex [–.!?,;•/|…] to split sentence fragments
Uses space split() to count words
Identifies text lacking punctuation by detecting overly long sentence fragments
Default threshold is 112 words
Text without spaces is treated as a single word

eval

generate

eval

generate

eval

filter

generate

eval

filter

generate

generate

eval

filter

generate

refine

generate

generate

eval

filter

refine

generate

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

refine

NoPuncFilter