LineEndWithEllipsisFilter

About 448 wordsAbout 1 min

2025-10-09

📘 Overview

LineEndWithEllipsisFilter is a data filtering operator that detects the proportion of lines ending with ellipsis (... or …) in text. When this proportion is below the preset threshold, the text is considered valid and retained; otherwise, it is filtered out. This operator is commonly used in data cleaning to identify and remove incomplete expressions.

`init` Function

def __init__(self, threshold: float=0.3)

Init Parameters

Parameter	Type	Default	Description
threshold	float	0.3	Floating-point threshold. If the proportion of lines ending with ellipsis is less than this threshold, the text passes filtering.

`run` Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'line_end_with_ellipsis_filter_label')

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field to detect.
output_key	str	'line_end_with_ellipsis_filter_label'	Output column name for storing filter result labels (1 for pass, 0 for fail).

🧠 Example Usage

from dataflow.operators.general_text import LineEndWithEllipsisFilter
from dataflow.utils.storage import FileStorage

class LineEndWithEllipsisFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/line_end_with_ellipsis_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        
        self.filter = LineEndWithEllipsisFilter(
            threshold=0.3
        )
        
    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='line_end_with_ellipsis_filter_label'
        )

if __name__ == "__main__":
    test = LineEndWithEllipsisFilterTest()
    test.forward()

🧾 Default Output Format

The operator adds a new column specified by output_key to the DataFrame to mark filter results (1 for pass, 0 for fail), and ultimately writes back only rows that pass filtering (i.e., marked as 1).

Field	Type	Description
text	str	Original input text
line_end_with_ellipsis_filter_label	int	Filter label (1 for pass, 0 for fail)

📋 Example Input

{"text": "This is a complete sentence without any issues."}
{"text": "This is incomplete...\nAnother line that ends with...\nAnd one more..."}
{"text": "First line is fine.\nSecond line is also good.\nThird line is complete too."}

📤 Example Output

{"text": "This is a complete sentence without any issues.", "line_end_with_ellipsis_filter_label": 1}
{"text": "First line is fine.\nSecond line is also good.\nThird line is complete too.", "line_end_with_ellipsis_filter_label": 1}

📊 Result Analysis

Sample 1 (Complete text):

Total lines: 1
Lines ending with ellipsis: 0
Ellipsis line ratio: 0/1 = 0.0 (0%)
Passes filter (< 0.3 threshold)

Sample 2 (Many ellipses):

Total lines: 3
Lines ending with ellipsis: 3
Ellipsis line ratio: 3/3 = 1.0 (100%)
Filtered out (≥ 0.3 threshold)

Sample 3 (No ellipsis):

Total lines: 3
Lines ending with ellipsis: 0
Ellipsis line ratio: 0/3 = 0.0 (0%)
Passes filter (< 0.3 threshold)

Use Cases:

Filter incomplete expressions or unfinished text
Remove low-quality content containing excessive ellipses
Ensure text completeness and expression integrity

Notes:

The operator detects two ellipsis formats: ... and …
Text is split by lines for detection
Default threshold is 0.3, adjustable based on actual needs

eval

generate

eval

generate

eval

filter

generate

eval

filter

generate

generate

eval

filter

generate

refine

generate

generate

eval

filter

refine

generate

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

refine

LineEndWithEllipsisFilter