CharNumberFilter

2025-10-09

📘 Overview

The CharNumberFilter operator filters text data based on character count. It calculates the total number of characters in the specified text field after removing whitespace characters, and only retains records where the character count is greater than or equal to a preset threshold.

__init__ Function

def __init__(self, threshold: int=100)

Initialization Parameters

Parameter Name	Type	Default	Description
threshold	int	100	Minimum character count threshold. After removing whitespace characters (spaces, newlines, tabs), the character count of the text must be greater than or equal to this value to be retained.
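The counting rule described above can be sketched in a few lines. This illustrates the documented behavior and is not the operator's actual source; the helper names `count_non_whitespace` and `passes` are hypothetical.

```python
def count_non_whitespace(text: str) -> int:
    """Count characters after dropping spaces, newlines, and tabs."""
    # Hypothetical helper: mirrors the documented rule, not the library's code.
    for ch in (" ", "\n", "\t"):
        text = text.replace(ch, "")
    return len(text)


def passes(text: str, threshold: int = 100) -> bool:
    """Return True when the stripped length is at least the threshold."""
    return count_non_whitespace(text) >= threshold


print(count_non_whitespace("a b\tc\nd"))  # 4
print(passes("x" * 100))                  # True: exactly at the threshold
print(passes("x" * 99))                   # False: one character short
```

The comparison is inclusive, so a text with exactly `threshold` counted characters is retained.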

run Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str='char_number_filter_label')

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance responsible for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field to be filtered.
output_key	str	'char_number_filter_label'	Output column name for storing the filter result label (1 means passed, 0 means failed).

🧠 Example Usage

from dataflow.operators.general_text import CharNumberFilter
from dataflow.utils.storage import FileStorage

class CharNumberFilterTest:
    def __init__(self):
        # Storage that reads the input JSONL and caches each step's output
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/char_number_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # Keep records with at least 100 non-whitespace characters
        self.filter = CharNumberFilter(
            threshold=100
        )

    def forward(self):
        # Label each record on the 'text' field and write the passing rows back
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='char_number_filter_label'
        )

if __name__ == "__main__":
    test = CharNumberFilterTest()
    test.forward()

🧾 Default Output Format

This operator adds a new column, named by output_key, to the input DataFrame, with a value of 1 (passed filter) or 0 (failed filter). It then writes only the rows labeled 1 back to storage, so every record in the stored output carries the label 1.

Field	Type	Description
text	str	Original input text
char_number_filter_label	int	Filter result label: `1` means the character count meets the threshold, `0` means it does not.
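The label-then-filter behavior described above can be sketched as follows. To stay dependency-free, this sketch uses a list of dicts in place of the DataFrame, and the function `label_and_filter` is a hypothetical stand-in for the operator's `run` method, not its real implementation.

```python
def label_and_filter(rows, input_key, output_key="char_number_filter_label", threshold=100):
    """Label each row 1 or 0, then keep only the rows labeled 1."""
    labeled = []
    for row in rows:
        # Drop spaces, newlines, and tabs before counting (documented rule).
        stripped = row[input_key].replace(" ", "").replace("\n", "").replace("\t", "")
        label = 1 if len(stripped) >= threshold else 0
        labeled.append({**row, output_key: label})
    # Only passing rows are written back, so the stored label is always 1.
    return [row for row in labeled if row[output_key] == 1]


rows = [{"text": "too short"}, {"text": "long enough " * 10}]
kept = label_and_filter(rows, input_key="text")
print(len(kept))                            # 1
print(kept[0]["char_number_filter_label"])  # 1
```

The second record passes because "longenough" is 10 characters, repeated 10 times, which lands exactly on the default threshold of 100.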

📋 Sample Input

{"text": "Short"}
{"text": "This is a medium length text that should pass the character count filter with enough characters to meet the threshold."}
{"text": "A"}
{"text": "The quick brown fox jumps over the lazy dog. This sentence contains enough characters to pass the minimum threshold for the character number filter."}
{"text": "x"}

📤 Sample Output

{"text": "The quick brown fox jumps over the lazy dog. This sentence contains enough characters to pass the minimum threshold for the character number filter.", "char_number_filter_label": 1}

📊 Result Analysis

Sample 1 ("Short"):

Character count after removing whitespace: 5
Failed filter (< 100 threshold)

Sample 2 (Medium Length Text):

Original text: 118 characters
Character count after removing whitespace: 99
Failed filter (< 100 threshold, one character short)

Sample 3 ("A"):

Character count after removing whitespace: 1
Failed filter (< 100 threshold)

Sample 4 (Long Text):

Original text: 148 characters
Character count after removing whitespace: 125
Passed filter (≥ 100 threshold)

Sample 5 ("x"):

Character count after removing whitespace: 1
Failed filter (< 100 threshold)
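The per-sample counts can be reproduced directly from the Sample Input. The sketch below assumes the documented rule (remove spaces, newlines, and tabs, then compare with `>=`); `stripped_length` is a hypothetical helper. Sample 2 comes to 99 counted characters, one short of the threshold, which is why only Sample 4 appears in the Sample Output.

```python
samples = [
    "Short",
    "This is a medium length text that should pass the character count filter with enough characters to meet the threshold.",
    "A",
    "The quick brown fox jumps over the lazy dog. This sentence contains enough characters to pass the minimum threshold for the character number filter.",
    "x",
]


def stripped_length(text: str) -> int:
    """Length after removing spaces, newlines, and tabs."""
    return len(text.replace(" ", "").replace("\n", "").replace("\t", ""))


for i, text in enumerate(samples, start=1):
    n = stripped_length(text)
    verdict = "passed" if n >= 100 else "failed"
    print(f"Sample {i}: {len(text)} raw, {n} counted, {verdict}")
# Sample 1: 5 raw, 5 counted, failed
# Sample 2: 118 raw, 99 counted, failed
# Sample 3: 1 raw, 1 counted, failed
# Sample 4: 148 raw, 125 counted, passed
# Sample 5: 1 raw, 1 counted, failed
```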

Use Cases:

Filter out text that is too short, to ensure data quality
Remove invalid or incomplete entries
Enforce a minimum text length requirement
Clean noisy data from datasets

Notes:

Spaces, newlines, and tabs are removed before counting
Chinese characters, English letters, digits, and punctuation are all counted
Adjust the threshold to your use case (the default of 100 suits paragraph-level text)
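A few quick checks illustrate the notes above. This sketch assumes counting is done with Python's `len` on the stripped string, so each Unicode code point counts as one character; `stripped_length` is a hypothetical helper, not part of the library.

```python
def stripped_length(text: str) -> int:
    """Length after removing spaces, newlines, and tabs."""
    return len(text.replace(" ", "").replace("\n", "").replace("\t", ""))


print(stripped_length("数据 清洗"))   # 4: each Chinese character counts as one
print(stripped_length("Hi, 2025!"))   # 8: letters, digits, and punctuation all count
print(stripped_length(" \n\t "))      # 0: whitespace-only text has nothing to count
```

Because a Chinese character typically carries more information than a Latin letter, a lower threshold may be appropriate for Chinese-language datasets than for English ones.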
