LangkitFilter
2025-10-09
📘 Overview
LangkitFilter is an operator that filters data based on text quality metrics. It uses the Langkit toolkit to compute a set of readability and complexity scores for each text and keeps only the rows whose scores fall within user-defined minimum and maximum thresholds.
__init__ Function
def __init__(self,
             min_scores = {
                 "flesch_reading_ease": 0,
                 "automated_readability_index": 0,
                 "aggregate_reading_level": 0,
                 "syllable_count": 32.0,
                 "lexicon_count": 23.0,
                 "sentence_count": 1.0,
                 "character_count": 118.0,
                 "letter_count": 109.0,
                 "polysyllable_count": 0.0,
                 "monosyllable_count": 13.0,
                 "difficult_words": 4.0,
             },
             max_scores = {
                 "flesch_reading_ease": 100,
                 "automated_readability_index": 100,
                 "aggregate_reading_level": 100,
                 "syllable_count": 2331.9,
                 "lexicon_count": 1554.0,
                 "sentence_count": 89.1,
                 "character_count": 7466.3,
                 "letter_count": 7193.0,
                 "polysyllable_count": 216.4,
                 "monosyllable_count": 1044.1,
                 "difficult_words": 213.4,
             },
             metrics_to_keep: list = [
                 "flesch_reading_ease",
                 "automated_readability_index",
                 "aggregate_reading_level",
                 "syllable_count",
                 "lexicon_count",
                 "sentence_count",
                 "character_count",
                 "letter_count",
                 "polysyllable_count",
                 "monosyllable_count",
                 "difficult_words",
             ]):

Init Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_scores | dict | See code | Minimum threshold for each text quality metric score. |
| max_scores | dict | See code | Maximum threshold for each text quality metric score. |
| metrics_to_keep | list | See code | List of evaluation metrics to retain and use for filtering. |
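If only a few metrics matter for a dataset, the thresholds and metrics_to_keep list can be narrowed accordingly. The snippet below is a minimal sketch, assuming that metrics_to_keep restricts which Langkit metrics are computed and checked, as the parameter description suggests; the threshold values are illustrative, not recommendations.

from dataflow.operators.general_text import LangkitFilter

# Keep only the three readability-oriented metrics and ignore the count-based ones.
# Assumption: metrics not listed in metrics_to_keep are neither computed nor checked.
readability_filter = LangkitFilter(
    min_scores={
        "flesch_reading_ease": 30,          # drop extremely difficult text
        "automated_readability_index": 0,
        "aggregate_reading_level": 0,
    },
    max_scores={
        "flesch_reading_ease": 100,
        "automated_readability_index": 20,  # roughly high-school / early-college level
        "aggregate_reading_level": 20,
    },
    metrics_to_keep=[
        "flesch_reading_ease",
        "automated_readability_index",
        "aggregate_reading_level",
    ],
)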
run Function
def run(self, storage: DataFlowStorage, input_key: str, output_keys: list = ["flesch_reading_ease", "automated_readability_index", ...]):

Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
| input_key | str | Required | Input column name corresponding to the text field to evaluate and filter. |
| output_keys | list | See code | List of output metric names specifying which metrics to calculate and filter. |
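For a quick pass over a dataset with the default thresholds, output_keys can be limited to the metrics of interest; per the description above, only those metrics are calculated and used for filtering. A minimal sketch, where the input path is a hypothetical placeholder for your own data:

from dataflow.operators.general_text import LangkitFilter
from dataflow.utils.storage import FileStorage

# Hypothetical input path; replace with your own dataset.
storage = FileStorage(
    first_entry_file_name="./my_dataset.jsonl",
    cache_path="./cache",
    file_name_prefix="langkit_filter",
    cache_type="jsonl",
)

op = LangkitFilter()  # default thresholds
op.run(
    storage=storage.step(),
    input_key="text",
    output_keys=["flesch_reading_ease", "sentence_count"],  # assumed to compute/filter only these two
)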
🧠 Example Usage
from dataflow.operators.general_text import LangkitFilter
from dataflow.utils.storage import FileStorage

class LangkitFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/langkit_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        # Use custom thresholds for filtering
        self.filter = LangkitFilter(
            min_scores={
                "flesch_reading_ease": 0,
                "automated_readability_index": 0,
                "aggregate_reading_level": 0,
                "syllable_count": 10.0,
                "lexicon_count": 10.0,
                "sentence_count": 1.0,
                "character_count": 50.0,
                "letter_count": 40.0,
                "polysyllable_count": 0.0,
                "monosyllable_count": 5.0,
                "difficult_words": 0.0,
            },
            max_scores={
                "flesch_reading_ease": 100,
                "automated_readability_index": 50,
                "aggregate_reading_level": 50,
                "syllable_count": 500.0,
                "lexicon_count": 300.0,
                "sentence_count": 20.0,
                "character_count": 2000.0,
                "letter_count": 1800.0,
                "polysyllable_count": 100.0,
                "monosyllable_count": 200.0,
                "difficult_words": 50.0,
            }
        )

    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_keys=[
                "flesch_reading_ease",
                "automated_readability_index",
                "aggregate_reading_level",
                "syllable_count",
                "lexicon_count",
                "sentence_count",
                "character_count",
                "letter_count",
                "polysyllable_count",
                "monosyllable_count",
                "difficult_words",
            ]
        )

if __name__ == "__main__":
    test = LangkitFilterTest()
    test.forward()

🧾 Default Output Format
For each metric the operator adds two columns: a score column (name ending in Score) and a label column (name ending in Score_label, where 1 means pass and 0 means fail). Only rows where every metric passes are retained; the filtering logic is sketched after the table below.
| Field | Type | Description |
|---|---|---|
| text | str | Original input text |
| LangkitFleschReadingEaseScore | float | Flesch readability score (0-100, higher is easier) |
| LangkitAutomatedReadabilityIndexScore | float | Automated Readability Index |
| LangkitAggregateReadingLevelScore | float | Aggregate reading level |
| LangkitSyllableCountScore | float | Total syllable count |
| LangkitLexiconCountScore | float | Lexicon count |
| LangkitSentenceCountScore | float | Sentence count |
| LangkitCharacterCountScore | float | Total character count |
| LangkitLetterCountScore | float | Total letter count |
| LangkitPolysyllableCountScore | float | Polysyllable word count |
| LangkitMonosyllableCountScore | float | Monosyllable word count |
| LangkitDifficultWordsScore | float | Difficult word count |
| *_label | int | Filter label for each metric (1 for pass, 0 for fail) |
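Per the output format above, each metric's label is a simple range check, and a row is kept only if every label is 1. The following is a minimal sketch of that behavior written against plain dictionaries rather than the operator's internal dataframe handling; the inclusive bounds and column naming follow the table above but are otherwise assumptions.

def langkit_pass(scores: dict, min_scores: dict, max_scores: dict) -> dict:
    """Return per-metric labels (1 = pass, 0 = fail) and an overall keep flag."""
    labels = {}
    for metric, value in scores.items():
        lo = min_scores.get(metric, float("-inf"))
        hi = max_scores.get(metric, float("inf"))
        labels[f"{metric}_label"] = int(lo <= value <= hi)
    labels["keep"] = int(all(v == 1 for k, v in labels.items() if k.endswith("_label")))
    return labels

# Example: a very short text like Sample 2 ("A") fails several minimum thresholds.
print(langkit_pass(
    {"LangkitSyllableCountScore": 1, "LangkitLexiconCountScore": 1},
    {"LangkitSyllableCountScore": 10.0, "LangkitLexiconCountScore": 10.0},
    {"LangkitSyllableCountScore": 500.0, "LangkitLexiconCountScore": 300.0},
))
# -> {'LangkitSyllableCountScore_label': 0, 'LangkitLexiconCountScore_label': 0, 'keep': 0}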
📋 Example Input
{"text": "The quick brown fox jumps over the lazy dog. This is a simple sentence for testing."}
{"text": "A"}
{"text": "In the field of natural language processing, various algorithms and methodologies have been developed to analyze, understand, and generate human language in a computationally efficient manner. These sophisticated techniques enable computers to perform complex linguistic tasks such as machine translation, sentiment analysis, named entity recognition, and text summarization with remarkable accuracy and efficiency."}📤 Example Output
{"text": "The quick brown fox jumps over the lazy dog. This is a simple sentence for testing.", "LangkitFleschReadingEaseScore": 88.74, "LangkitAutomatedReadabilityIndexScore": 2.6, "LangkitAggregateReadingLevelScore": 3.0, "LangkitSyllableCountScore": 20, "LangkitLexiconCountScore": 16, "LangkitSentenceCountScore": 2, "LangkitCharacterCountScore": 68, "LangkitLetterCountScore": 66, "LangkitPolysyllableCountScore": 0, "LangkitMonosyllableCountScore": 12, "LangkitDifficultWordsScore": 0, "LangkitFleschReadingEaseScore_label": 1, "LangkitAutomatedReadabilityIndexScore_label": 1, "LangkitAggregateReadingLevelScore_label": 1, "LangkitSyllableCountScore_label": 1, "LangkitLexiconCountScore_label": 1, "LangkitSentenceCountScore_label": 1, "LangkitCharacterCountScore_label": 1, "LangkitLetterCountScore_label": 1, "LangkitPolysyllableCountScore_label": 1, "LangkitMonosyllableCountScore_label": 1, "LangkitDifficultWordsScore_label": 1}📊 Result Analysis
Sample 1 (Simple English text):
- Text: Standard test sentence
- Flesch reading ease score: 88.74 (very easy to read)
- Automated readability index: 2.6 (early elementary school level)
- Syllables: 20, Lexicon: 16, Sentences: 2
- Characters: 68, Letters: 66
- Monosyllable words: 12, Polysyllable words: 0
- Difficult words: 0
- Passes filter (all metrics within set ranges)
Sample 2 (Very short text "A"):
- Flesch reading ease score: 121.22 (extremely high readability)
- Automated readability index: -16.3 (abnormal value)
- Syllables: 1 (< minimum 10)
- Lexicon: 1 (< minimum 10)
- Characters: 1 (< minimum 50)
- Letters: 1 (< minimum 40)
- Monosyllable words: 1 (< minimum 5)
- Filtered out (multiple metrics below minimum thresholds)
Sample 3 (Complex academic text):
- Text: Long paragraph of academic NLP description
- Flesch reading ease score: -14.65 (extremely difficult, < minimum 0)
- Automated readability index: 24.1 (college and above level)
- Aggregate reading level: 22.0 (graduate level)
- Syllables: 123, Lexicon: 53, Sentences: 2
- Characters: 363, Letters: 355
- Polysyllable words: 24, Monosyllable words: 21
- Difficult words: 28
- Filtered out (readability score is negative, below minimum threshold)
📈 Metric Descriptions
Flesch Reading Ease (readability score; see the formula sketch below):
- Range: Usually 0-100, higher scores mean easier reading
- 90-100: Elementary school level
- 60-70: Middle school level
- 0-30: College and above level
- Negative values: Extremely complex text
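For reference, the standard Flesch Reading Ease formula is sketched below. Langkit derives its scores from its own word, sentence, and syllable counting, so hand calculations will not always match the operator's output exactly.

def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease: higher means easier to read."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# A one-word text like Sample 2 ("A") has 1 word, 1 sentence, 1 syllable:
print(round(flesch_reading_ease(1, 1, 1), 2))  # 121.22, matching the score reported above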
Automated Readability Index (see the formula sketch below):
- Based on character count, word count, and sentence count
- The value roughly corresponds to the US school grade level needed to read the text
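The standard ARI formula can likewise be sketched as follows; as with the Flesch score, treat it as a reference, since the operator relies on its own counting heuristics.

def automated_readability_index(chars: int, words: int, sentences: int) -> float:
    """Standard ARI; the value maps roughly to a US school grade level."""
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

# Sample 1: 68 characters, 16 words, 2 sentences -> about 2.6, consistent with the reported score.
print(round(automated_readability_index(68, 16, 2), 1))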
Syllable and lexicon statistics:
- syllable_count: Total syllable count
- lexicon_count: Word (token) count, without deduplication
- monosyllable_count: Monosyllable word count
- polysyllable_count: Polysyllable (≥3 syllables) word count
Difficult Words:
- Count of words not in common word lists
Use Cases:
- Filter training text of appropriate complexity
- Filter content that is too simple or too complex
- Ensure text quality meets specific education levels
- Build datasets suitable for specific user groups
- Remove low-quality text with statistical anomalies
Notes:
- Thresholds need to be adjusted based on specific application scenarios
- Academic or technical text may require more lenient complexity thresholds
- Very short text may produce statistical anomalies
- It is recommended to run a statistical analysis on a sample of the data first and set threshold ranges accordingly (see the sketch below)
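As a starting point for thresholds, the same statistics can be computed on a sample with the textstat library (whose function names match the metric names listed above) and examined at chosen percentiles. This is a minimal sketch, assuming textstat is installed and a hypothetical sample.jsonl file with a text field; the 5th/95th percentile cutoffs are illustrative.

import json
import textstat

metrics = ["flesch_reading_ease", "syllable_count", "lexicon_count", "sentence_count"]
scores = {m: [] for m in metrics}

with open("sample.jsonl", "r", encoding="utf-8") as f:  # hypothetical sample file
    for line in f:
        text = json.loads(line)["text"]
        for m in metrics:
            scores[m].append(getattr(textstat, m)(text))

# Use e.g. the 5th/95th percentiles as candidate min/max thresholds.
for m in metrics:
    values = sorted(scores[m])
    lo = values[int(0.05 * (len(values) - 1))]
    hi = values[int(0.95 * (len(values) - 1))]
    print(f"{m}: candidate min ≈ {lo}, candidate max ≈ {hi}")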

