SentenceNumberFilter

2025-10-09

📘 Overview

SentenceNumberFilter is a text filtering operator that checks whether the number of sentences in the input text falls within a specified minimum and maximum range. It counts sentences with a regular expression and retains only the rows that meet the criterion.

__init__ Function

def __init__(self, min_sentences: int = 3, max_sentences: int = 7500)

Init Parameters

Parameter	Type	Default	Description
min_sentences	int	3	Minimum number of sentences the text must contain.
max_sentences	int	7500	Maximum number of sentences the text may contain.

run Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'sentence_number_filter_label')

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field to filter.
output_key	str	'sentence_number_filter_label'	Output column name for storing filter result labels.

🧠 Example Usage

from dataflow.operators.general_text import SentenceNumberFilter
from dataflow.utils.storage import FileStorage

class SentenceNumberFilterTest:
    def __init__(self):
        # Read the input JSONL file; cached step outputs go under ./cache.
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/sentence_number_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # Keep rows whose sentence count falls within [3, 7500] (the defaults).
        self.filter = SentenceNumberFilter(
            min_sentences=3,
            max_sentences=7500
        )

    def forward(self):
        # Count sentences in the 'text' column and label the rows that pass.
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='sentence_number_filter_label'
        )

if __name__ == "__main__":
    test = SentenceNumberFilterTest()
    test.forward()

🧾 Default Output Format

After execution, the operator keeps only the rows that pass the filter and adds a new field (default: sentence_number_filter_label) with a value of 1 to each of them.

Field	Type	Description
sentence_number_filter_label	int	Filter result label; value of 1 indicates this data row passed filtering.
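
The output format described above can be sketched in plain Python. This is an illustration only, not the operator's implementation: the helper name filter_jsonl_lines is hypothetical, the regex is the one quoted in the Notes section, and treating both bounds as inclusive is an assumption.

```python
import json
import re

# Pattern quoted in the Notes section; how the operator applies it is assumed.
SENTENCE_PATTERN = re.compile(r"\b[^.!?\n]+[.!?]*")


def filter_jsonl_lines(lines, input_key="text",
                       output_key="sentence_number_filter_label",
                       min_sentences=3, max_sentences=7500):
    """Hypothetical helper: keep passing rows and label them with 1."""
    kept = []
    for line in lines:
        row = json.loads(line)
        count = len(SENTENCE_PATTERN.findall(row.get(input_key) or ""))
        # Assumption: both ends of the range are inclusive.
        if min_sentences <= count <= max_sentences:
            row[output_key] = 1
            kept.append(json.dumps(row))
    return kept


if __name__ == "__main__":
    sample_lines = [
        '{"text": "Hi"}',
        '{"text": "Hello world. This is a test. It has three sentences."}',
    ]
    for out_line in filter_jsonl_lines(sample_lines):
        print(out_line)
```

Rows that fail the check are dropped rather than labeled 0, which matches the description that only passing rows are retained.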

📋 Example Input

{"text": "Hi"}
{"text": "Hello world. This is a test. It has three sentences."}
{"text": "First sentence. Second sentence. Third sentence. Fourth sentence. Fifth sentence. Sixth sentence."}

📤 Example Output

{"text": "Hello world. This is a test. It has three sentences.", "sentence_number_filter_label": 1}
{"text": "First sentence. Second sentence. Third sentence. Fourth sentence. Fifth sentence. Sixth sentence.", "sentence_number_filter_label": 1}

📊 Result Analysis

Sample 1 ("Hi"):

Matched sentence count: 1
Sentence count range: [3, 7500]
Filtered out (1 < 3)

Sample 2 ("Hello world. This is a test. It has three sentences."):

Matched sentence count: 3 (using regex \b[^.!?\n]+[.!?]*)
Sentence count range: [3, 7500]
Passes filter (within range)

Sample 3 (6 complete sentences):

Matched sentence count: 6
Sentence count range: [3, 7500]
Passes filter (within range)
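
The counts in this analysis can be checked by applying the quoted regex with Python's re module. The sketch below assumes the operator counts the matches returned by findall; the helper name split_sentences is hypothetical.

```python
import re

# Pattern quoted in this document; counting findall matches is an assumption.
SENTENCE_PATTERN = re.compile(r"\b[^.!?\n]+[.!?]*")


def split_sentences(text: str) -> list:
    """Hypothetical helper: return every substring the pattern matches."""
    return SENTENCE_PATTERN.findall(text)


if __name__ == "__main__":
    samples = [
        "Hi",
        "Hello world. This is a test. It has three sentences.",
        "First sentence. Second sentence. Third sentence. "
        "Fourth sentence. Fifth sentence. Sixth sentence.",
    ]
    for text in samples:
        matches = split_sentences(text)
        # Print the count next to the matched pieces for inspection.
        print(len(matches), matches)
```

Each match starts at a word boundary and runs up to the next ., ! or ? (or a newline), so the three samples yield 1, 3, and 6 matches respectively.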

Use Cases:

Filter overly short text fragments
Filter overly long documents
Dataset quality control
Ensure text has sufficient content depth

Notes:

Uses regex \b[^.!?\n]+[.!?]* to match sentences
Chinese and English sentence-ending punctuation (。！？.!?) is listed as supported, but the pattern above contains only the ASCII marks .!?
Empty text is filtered out
Sentence segmentation rules may differ across languages
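
A few edge cases from these notes, sketched with Python's re module. The pattern is applied exactly as quoted; the operator itself may preprocess text or use a different pattern, so treat this as an illustration of the regex and not of the operator. The helper name count_sentences is hypothetical.

```python
import re

# Pattern as quoted in the notes; the real operator may differ.
SENTENCE_PATTERN = re.compile(r"\b[^.!?\n]+[.!?]*")


def count_sentences(text: str) -> int:
    """Hypothetical helper: number of pattern matches in the text."""
    return len(SENTENCE_PATTERN.findall(text))


if __name__ == "__main__":
    # Empty text has no matches, so it falls below any min_sentences >= 1.
    print(count_sentences(""))
    # Newlines end a match even without punctuation.
    print(count_sentences("Line one\nLine two\nLine three"))
    # Full-width marks are not in the character class, so nothing splits here.
    print(count_sentences("你好。今天天气很好！"))
```

With the pattern as written, the empty string gives 0, the three-line text gives 3, and the Chinese text is matched as a single piece because 。 and ！ do not appear in the character class.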