AlphaWordsFilter
2025-10-09
📘 Overview
The AlphaWordsFilter operator checks whether the ratio of alphabetic words (words containing at least one letter) in a text exceeds a specified threshold. It supports two tokenization modes: tokenization with the NLTK library, or simple whitespace splitting. Rows whose text does not meet the ratio condition are filtered out.
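The ratio check can be sketched in a few lines of Python. This is a minimal illustration, not the operator's actual source: the helper names `alpha_word_ratio` and `passes` are hypothetical, the sketch uses whitespace splitting only, and it assumes that "alphabetic" means "contains at least one ASCII letter" and that empty text gets a ratio of 0.

```python
import re

# Assumption: a word is "alphabetic" if it contains at least one ASCII letter.
_HAS_LETTER = re.compile(r"[A-Za-z]")


def alpha_word_ratio(text: str) -> float:
    """Return the share of whitespace-separated words that contain a letter."""
    words = text.split()
    if not words:
        return 0.0  # assumption: empty text is treated as ratio 0
    alpha = sum(1 for word in words if _HAS_LETTER.search(word))
    return alpha / len(words)


def passes(text: str, threshold: float) -> bool:
    """The ratio must strictly exceed the threshold to pass."""
    return alpha_word_ratio(text) > threshold


print(alpha_word_ratio("Hello123 World456 Test789 ABC xyz 123"))  # 5 of 6 words contain a letter
print(passes("123456 789 !!!", 0.5))  # False
```

Because the comparison is strict, a text whose ratio equals the threshold exactly does not pass.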
📦 Dependencies
This operator depends on the NLTK (Natural Language Toolkit) library for tokenization. During initialization, the operator automatically downloads the required punkt_tab data package.
NLTK Data Download Issues
If you encounter slow or stuck NLTK data downloads during initialization, you can use the following solutions:
Method 1: Manual Download
- Visit the NLTK data repository: https://github.com/nltk/nltk_data
- Download the punkt_tab data package
- Place the data package in NLTK's data directory (typically ~/nltk_data/, or check via nltk.data.path)
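After placing the package manually, it can help to confirm that NLTK actually finds it. The sketch below assumes the resource is registered under the name `tokenizers/punkt_tab`; the helper `punkt_tab_available` is a hypothetical name introduced here for illustration. It relies only on `nltk.data.find` and `nltk.data.path`.

```python
def punkt_tab_available() -> bool:
    """Report whether NLTK can locate the punkt_tab tokenizer data."""
    try:
        import nltk
    except ImportError:
        return False  # NLTK itself is not installed

    try:
        # Assumption: punkt_tab lives under the "tokenizers" resource folder.
        nltk.data.find("tokenizers/punkt_tab")
        return True
    except LookupError:
        # Show the directories NLTK searched, to see where the package belongs.
        for directory in nltk.data.path:
            print("searched:", directory)
        return False


print("punkt_tab found:", punkt_tab_available())
```

If the function reports False, the printed directories are the candidate locations for the data package.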
Method 2: Use Custom Download Directory
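```python
import nltk
nltk.download('punkt_tab', download_dir='./nltk_data/')
# Make sure NLTK also searches the custom directory
nltk.data.path.append('./nltk_data/')
```
Method 3: Use Non-Tokenizer Mode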
If you don't need the NLTK tokenizer, you can set use_tokenizer=False during initialization. This will use simple whitespace splitting and won't require downloading NLTK data.
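To illustrate why whitespace mode needs no NLTK data, here is a small sketch of how the two modes could be dispatched. The `tokenize` helper is hypothetical and is not the operator's real code; it simply imports NLTK lazily so that the whitespace path never touches the library.

```python
from typing import List


def tokenize(text: str, use_tokenizer: bool) -> List[str]:
    """Hypothetical helper mirroring the two tokenization modes."""
    if use_tokenizer:
        # Imported lazily so whitespace mode never needs NLTK or its data.
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    # Simple whitespace splitting: punctuation stays attached to words.
    return text.split()


sentence = "The lazy dog sleeps in the garden."
print(tokenize(sentence, use_tokenizer=False))
# ['The', 'lazy', 'dog', 'sleeps', 'in', 'the', 'garden.']
```

With use_tokenizer=True, NLTK would typically split the trailing period into its own token, so the total word count (the denominator of the ratio) can differ between the two modes.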
__init__ Function
```python
def __init__(self, threshold: float, use_tokenizer: bool)
```
Initialization Parameters
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| threshold | float | Required | Threshold for the alphabetic word ratio (between 0 and 1). The ratio of words containing letters to total words must exceed this value to pass the filter. |
| use_tokenizer | bool | Required | Whether to use the NLTK tokenizer. If False, uses simple whitespace splitting. |
run Function
```python
def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'alpha_words_filter_label')
```
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance responsible for reading and writing data. |
| input_key | str | Required | Input column name corresponding to the text field to be filtered. |
| output_key | str | 'alpha_words_filter_label' | Output column name for storing the filter result label (1 means passed, 0 means failed). |
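The effect of run on the data can be approximated as "label every row, keep the rows labelled 1". The sketch below works on plain dictionaries instead of a DataFlowStorage instance. The `label_rows` function is a hypothetical stand-in, and both the ASCII-letter rule and the drop-failed-rows behavior are assumptions inferred from the sample output on this page.

```python
import json
import re


def label_rows(rows, input_key, output_key="alpha_words_filter_label", threshold=0.5):
    """Attach a 1/0 label to each row and return only the rows labelled 1."""
    kept = []
    for row in rows:
        words = str(row[input_key]).split()
        # Assumption: a word counts if it contains at least one ASCII letter.
        alpha = sum(1 for word in words if re.search(r"[A-Za-z]", word))
        ratio = alpha / len(words) if words else 0.0
        labelled = {**row, output_key: int(ratio > threshold)}
        if labelled[output_key] == 1:
            kept.append(labelled)
    return kept


rows = [
    {"text": "Hello123 World456 Test789 ABC xyz 123"},
    {"text": "123456 789 !!!###"},
]
for row in label_rows(rows, input_key="text"):
    print(json.dumps(row, ensure_ascii=False))
```

Only the first row survives here, and it carries the new label column alongside the original text.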
🧠 Example Usage
```python
from dataflow.operators.general_text import AlphaWordsFilter
from dataflow.utils.storage import FileStorage


class AlphaWordsFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/alpha_words_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        self.filter = AlphaWordsFilter(
            threshold=0.5,
            use_tokenizer=False
        )

    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='alpha_words_filter_label'
        )


if __name__ == "__main__":
    test = AlphaWordsFilterTest()
    test.forward()
```
🧾 Default Output Format
| Field | Type | Description |
|---|---|---|
| text | str | Original input text |
| alpha_words_filter_label | int | Filter label (1 means passed, 0 means failed) |
📋 Sample Input
```json
{"text": "The quick brown fox jumps over the lazy dog in the beautiful garden."}
{"text": "123456 789 !!!### @@@ $$$ %%% ^^^ &&& *** ((( )))"}
{"text": "Hello123 World456 Test789 ABC xyz 123"}
{"text": "纯中文文本没有任何英文字母内容全部都是中文"}
{"text": "Mixed 混合 content with 50% English and 50% Chinese 中文"}
```
📤 Sample Output
```json
{"text": "The quick brown fox jumps over the lazy dog in the beautiful garden.", "alpha_words_filter_label": 1}
{"text": "Hello123 World456 Test789 ABC xyz 123", "alpha_words_filter_label": 1}
{"text": "Mixed 混合 content with 50% English and 50% Chinese 中文", "alpha_words_filter_label": 1}
```
📊 Result Analysis
Sample 1 (Pure English Text):
- All words contain letters
- Alphabetic word ratio: 13/13 = 1.0 (100%)
- Passed filter (> 0.5 threshold)
Sample 2 (Pure Numbers and Symbols):
- No words contain letters
- Alphabetic word ratio: 0/11 = 0.0 (0%)
- Failed filter (≤ 0.5 threshold)
Sample 3 (Alphanumeric Mix):
- 5 of the 6 words contain letters (Hello123, World456, Test789, ABC, xyz); the last one, "123", does not
- Alphabetic word ratio: 5/6 ≈ 0.83 (83%)
- Passed filter (> 0.5 threshold)
Sample 4 (Pure Chinese):
- The text contains no English letters and, having no whitespace, is treated as a single word
- Alphabetic word ratio: 0/1 = 0.0 (0%)
- Failed filter (≤ 0.5 threshold)
Sample 5 (Chinese-English Mix):
- Words with letters: Mixed, content, with, English, and, Chinese
- Alphabetic word ratio: 6/10 = 0.6 (60%)
- Passed filter (> 0.5 threshold)
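The word counts in this analysis can be re-derived with a short script. It is a sketch under the same conditions as the example (whitespace splitting, threshold 0.5) and assumes that a word counts as alphabetic when it contains at least one ASCII letter. The function `count_alpha_words` is a hypothetical helper, not part of the operator.

```python
import re

samples = [
    "The quick brown fox jumps over the lazy dog in the beautiful garden.",
    "123456 789 !!!### @@@ $$$ %%% ^^^ &&& *** ((( )))",
    "Hello123 World456 Test789 ABC xyz 123",
    "纯中文文本没有任何英文字母内容全部都是中文",
    "Mixed 混合 content with 50% English and 50% Chinese 中文",
]


def count_alpha_words(text):
    """Return (words containing an ASCII letter, total whitespace-separated words)."""
    words = text.split()
    alpha = sum(1 for word in words if re.search(r"[A-Za-z]", word))
    return alpha, len(words)


counts = [count_alpha_words(sample) for sample in samples]
for index, (alpha, total) in enumerate(counts, start=1):
    verdict = "passed" if alpha / total > 0.5 else "failed"
    print(f"Sample {index}: {alpha}/{total} -> {verdict}")
```

Running it prints one line per sample with the numerator, the denominator, and whether the ratio exceeds the 0.5 threshold.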
Use Cases:
- Filter non-English or primarily numeric/symbolic text
- Ensure datasets contain sufficient English content
- Clean low-quality text mixed with many non-alphabetic characters

