SpecialCharacterFilter


2025-10-09

📘 Overview

SpecialCharacterFilter is a filtering operator that removes entries whose text contains specific special characters or non-standard Unicode characters. It checks each text against a set of predefined regular expression patterns and drops the rows that match, which helps keep the data standardized and clean.

`__init__` Function

def __init__(self)

`__init__` Parameters

This operator requires no parameters during initialization.

`run` Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str='special_character_filter_label')

Executes the operator's main logic: reads the input DataFrame from storage, checks the specified column for special characters, drops non-compliant rows, and writes the result back to storage.

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field for special character detection.
output_key	str	"special_character_filter_label"	Output label column name for marking whether text passes detection (1 for pass, 0 for fail).
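
The flow that `run` describes can be approximated without DataFlow. The following is a minimal sketch under stated assumptions, not the operator's actual implementation: plain dictionaries stand in for the DataFrame, and `PATTERNS` and `label_and_filter` are hypothetical names. The patterns are the ones quoted in the Result Analysis section of this page; the fourth is truncated there, so only its visible first alternative is used.

```python
import re

# Patterns as quoted in this page's Result Analysis section. The fourth one is
# truncated on the page, so only its visible first alternative appears here.
PATTERNS = [
    r"u200e",
    r"÷|\? :",
    r"[�□]|\{\/U\}",
    r"U\+26[0-F][0-D]",
]


def label_and_filter(rows, input_key, output_key="special_character_filter_label"):
    """Label each row 1 (pass) or 0 (fail), then keep only the passing rows."""
    kept = []
    for row in rows:
        text = row.get(input_key) or ""
        has_special = any(re.search(p, text) for p in PATTERNS)
        # Assumption: empty text fails, as stated in the Notes section.
        label = int(bool(text) and not has_special)
        if label == 1:
            kept.append({**row, output_key: label})
    return kept


rows = [
    {"text": "This is a normal text without special characters."},
    {"text": "This text contains special char u200e which should be filtered."},
    {"text": "Another normal text with standard punctuation!"},
]
for row in label_and_filter(rows, input_key="text"):
    print(row)
```

Applied to the three example rows from this page, the sketch keeps the first and third rows and drops the second.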

🧠 Example Usage

from dataflow.operators.general_text import SpecialCharacterFilter
from dataflow.utils.storage import FileStorage

class SpecialCharacterFilterTest:
    def __init__(self):
        # Storage backed by a JSONL input file; intermediate results go to ./cache
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/special_char_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # The operator takes no initialization parameters
        self.filter = SpecialCharacterFilter()

    def forward(self):
        # Check the 'text' column and write the label to the default output column
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='special_character_filter_label'
        )

if __name__ == "__main__":
    test = SpecialCharacterFilterTest()
    test.forward()

🧾 Default Output Format

Field	Type	Description
[input_key]	str	Original input text.
special_character_filter_label	int	Special character detection label; 1 indicates the text contains no special characters and passes detection. Failing rows are removed, so every row in the output carries 1.

📋 Example Input

{"text": "This is a normal text without special characters."}
{"text": "This text contains special char u200e which should be filtered."}
{"text": "Another normal text with standard punctuation!"}

📤 Example Output

{"text": "This is a normal text without special characters.", "special_character_filter_label": 1}
{"text": "Another normal text with standard punctuation!", "special_character_filter_label": 1}

📊 Result Analysis

Sample 1 (Normal text):

Detection patterns: r"u200e", r"÷|\? :", r"[�□]|\{\/U\}", r"U\+26[0-F][0-D]|..."
Match result: No match
Passes filter (no special characters)

Sample 2 (Contains u200e):

Pattern match detected: "u200e" found
Filtered out (contains special characters)

Sample 3 (Standard punctuation):

Match result: No match
Passes filter (standard punctuation is not considered special characters)
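
The per-sample analysis above can be reproduced with a small diagnostic. This is a sketch only: `matched_fragments` is a hypothetical helper, not part of DataFlow, and the fourth pattern is limited to the part that is visible before the `...` in the list quoted for Sample 1.

```python
import re

# Patterns as quoted for Sample 1; the truncated tail of the fourth is left out.
PATTERNS = [r"u200e", r"÷|\? :", r"[�□]|\{\/U\}", r"U\+26[0-F][0-D]"]


def matched_fragments(text):
    """Return the first substring hit by each pattern, in pattern order."""
    hits = []
    for pattern in PATTERNS:
        match = re.search(pattern, text)
        if match:
            hits.append(match.group(0))
    return hits


samples = [
    "This is a normal text without special characters.",
    "This text contains special char u200e which should be filtered.",
    "Another normal text with standard punctuation!",
]
for number, text in enumerate(samples, start=1):
    hits = matched_fragments(text)
    verdict = "filtered out" if hits else "passes"
    print(f"Sample {number}: {verdict}, matches={hits}")
```

Listing the matched fragments, rather than returning a bare pass/fail flag, makes it easier to see which pattern is responsible when a row is dropped unexpectedly.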

Detected Special Character Types:

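Unicode control-character residue (e.g., the literal string u200e, presumably left behind by an escaped U+200E left-to-right mark)
HTML entities (e.g., ÷)
Broken characters (e.g., �, □)
Unicode emoji code points written out as text (e.g., U+2600-U+273F)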

Use Cases:

Clean web-scraped data
Filter text containing abnormal Unicode characters
Data normalization processing
Ensure text compatibility

Notes:

Uses predefined regex patterns for detection
Empty text will be filtered
Standard punctuation marks are not considered special characters
Special character patterns can be customized as needed
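
This page does not say how the patterns are customized, and `__init__` takes no parameters. One option that stays outside the operator is a small standalone checker built from a caller-supplied pattern list. The sketch below assumes that approach; `make_checker` and the zero-width character class are hypothetical illustrations, not DataFlow APIs or defaults.

```python
import re


def make_checker(patterns):
    """Compile the patterns into one regex and return a pass/fail function."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))

    def passes(text):
        # Empty or missing text fails, mirroring the note above.
        if not text:
            return False
        return combined.search(text) is None

    return passes


# Hypothetical customization: the page's first pattern plus real zero-width
# and directional characters (U+200B, U+200E, U+200F).
passes = make_checker([r"u200e", "[\u200b\u200e\u200f]"])

print(passes("plain text"))
print(passes("has u200e literal"))
```

Wrapping each pattern in a non-capturing group before joining keeps alternations inside one pattern from merging with its neighbors.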
