LangkitSampleEvaluator


2025-10-09

📘 Overview

LangkitSampleEvaluator is a text quality assessment operator that uses the Langkit toolkit to compute statistical metrics for each text sample, helping evaluate structural complexity and readability. The default output consists of readability scores (Flesch Reading Ease and Automated Readability Index) together with counts of syllables, words, sentences, characters, and letters.

__init__

def __init__(self)

This operator requires no parameters during initialization.

run

def run(self, storage: DataFlowStorage, input_key: str)

Executes the operator's main logic. It reads the input DataFrame from storage, runs the Langkit evaluation on the text in the specified column, adds each resulting score to the DataFrame as a new column, and writes the DataFrame back to storage.

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name specifying the column containing the text to be evaluated.
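The read, evaluate, and write-back flow described above can be sketched with plain Python records. This is a minimal illustration, not the operator's implementation: `simple_counts` and `evaluate_records` are hypothetical helpers, the regular expressions are rough stand-ins for the Langkit metrics, and the list of dictionaries stands in for the DataFrame held by `DataFlowStorage`.

```python
import json
import re


def simple_counts(text: str) -> dict:
    """Hypothetical stand-in for the Langkit metrics: rough word and sentence counts."""
    words = re.findall(r"[A-Za-z']+", text)
    # Each run of sentence-ending punctuation is treated as one sentence boundary.
    sentences = re.findall(r"[.!?]+", text)
    return {
        "lexicon_count": len(words),
        # Non-empty text without punctuation still counts as one sentence.
        "sentence_count": max(len(sentences), 1) if text.strip() else 0,
    }


def evaluate_records(records: list[dict], input_key: str) -> list[dict]:
    """Return copies of the records with one new field per metric."""
    evaluated = []
    for record in records:
        scores = simple_counts(record[input_key])
        # The original fields are kept; the scores are added alongside them.
        evaluated.append({**record, **scores})
    return evaluated


# One JSONL line, in the same shape as the example input for this operator.
jsonl_input = '{"text": "The cat sat on the mat. The cat sat on the mat."}'
records = [json.loads(line) for line in jsonl_input.splitlines()]
result = evaluate_records(records, input_key="text")
print(json.dumps(result[0]))
```

The point of the sketch is the shape of the transformation: `input_key` selects the column to score, and the output keeps the input column while gaining one column per metric.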

🧠 Example Usage

from dataflow.operators.general_text import LangkitSampleEvaluator
from dataflow.utils.storage import FileStorage

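class LangkitSampleEvaluatorTest:
    def __init__(self):
        # File-backed storage: reads the input JSONL and caches each step's output.
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/eval_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # The evaluator takes no initialization parameters.
        self.evaluator = LangkitSampleEvaluator()

    def forward(self):
        # Score the 'text' column and write the results back to storage.
        self.evaluator.run(
            storage=self.storage.step(),
            input_key='text'
        )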

if __name__ == "__main__":
    test = LangkitSampleEvaluatorTest()
    test.forward()

🧾 Default Output Format

Field	Type	Description
text	str	The original input text
flesch_reading_ease	float	Flesch Reading Ease score (typically 0-100, but values above 100 or below 0 are possible; higher indicates easier to read)
automated_readability_index	float	Automated Readability Index
syllable_count	int	Total number of syllables
lexicon_count	int	Total number of words
sentence_count	int	Total number of sentences
character_count	int	Total number of characters
letter_count	int	Total number of letters
polysyllable_count	int	Number of polysyllabic words
monosyllable_count	int	Number of monosyllabic words
difficult_words	int	Number of difficult words
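For reference, the two readability scores in the table are conventionally defined by the formulas below. This is a sketch of the standard definitions only; how the operator tokenizes text, counts syllables, and rounds results is not described here, so its values may differ from a direct calculation. The function names and the sample counts are illustrative.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """Standard Automated Readability Index; approximates a US school grade level."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Illustrative counts for one sentence of six one-syllable words
# containing 17 letters in total (e.g. "The cat sat on the mat.").
fre = flesch_reading_ease(words=6, sentences=1, syllables=6)
ari = automated_readability_index(characters=17, words=6, sentences=1)
print(f"Flesch Reading Ease: {fre:.3f}")
print(f"Automated Readability Index: {ari:.3f}")
```

Neither formula is clamped, which is why very short, simple sentences can push Flesch Reading Ease above 100 and the Automated Readability Index below 0, while long sentences full of polysyllabic words can drive Flesch Reading Ease negative.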

📋 Example Input

{"text": "The quick brown fox jumps over the lazy dog. The sun is shining brightly in the clear blue sky. Birds are singing melodiously in the tall green trees. Children are playing happily in the beautiful park. Flowers are blooming magnificently everywhere you look. Nature displays its wonder through colorful butterflies dancing among fragrant roses. People enjoy peaceful walks along winding pathways surrounded by lush vegetation."}
{"text": "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
{"text": "In contemporary discourse surrounding technological advancement, one must acknowledge the multifaceted ramifications of artificial intelligence implementation. The epistemological considerations necessitate comprehensive analysis of socioeconomic implications. Furthermore, the paradigmatic shift toward automation requires meticulous examination of ethical frameworks governing algorithmic decision-making processes. Subsequently, organizational infrastructures must accommodate transformative methodologies while simultaneously addressing unprecedented complexities inherent within technological ecosystems."}

📤 Example Output

{"text": "The quick brown fox...", "flesch_reading_ease": 72.53, "automated_readability_index": 6.94, "syllable_count": 128, "lexicon_count": 68, "sentence_count": 7, "character_count": 396, "letter_count": 325, "polysyllable_count": 6, "monosyllable_count": 47, "difficult_words": 8}
{"text": "The cat sat on the mat...", "flesch_reading_ease": 116.14, "automated_readability_index": -2.15, "syllable_count": 70, "lexicon_count": 84, "sentence_count": 14, "character_count": 348, "letter_count": 288, "polysyllable_count": 0, "monosyllable_count": 84, "difficult_words": 0}
{"text": "In contemporary discourse...", "flesch_reading_ease": -23.94, "automated_readability_index": 27.63, "syllable_count": 167, "lexicon_count": 53, "sentence_count": 4, "character_count": 497, "letter_count": 420, "polysyllable_count": 30, "monosyllable_count": 11, "difficult_words": 32}

📊 Result Analysis

Sample 1 (Normal Descriptive Text):

Flesch Reading Ease: 72.53 (appropriate difficulty, suitable for general readers)
Contains diverse vocabulary with 8 difficult words
Readability level suitable for middle school students

Sample 2 (Highly Repetitive Text):

Flesch Reading Ease: 116.14 (very easy to read)
All words are monosyllabic with 0 difficult words
However, the text repeats a single sentence, which lowers its quality; the readability metrics alone do not capture this

Sample 3 (Complex Academic Text):

Flesch Reading Ease: -23.94 (very difficult to read)
Contains 32 difficult words and 30 polysyllabic words
Automated Readability Index: 27.63 (the index approximates a US grade level, so this is far beyond college level and implies professional-level education)
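The qualitative labels used in this analysis can be approximated by mapping a score to the commonly cited Flesch Reading Ease bands. This is a minimal sketch: `flesch_band` is a hypothetical helper that is not part of the operator, and the thresholds follow the conventional interpretation table rather than anything the operator emits.

```python
def flesch_band(score: float) -> str:
    """Map a Flesch Reading Ease score to a conventional difficulty label."""
    # Lower bounds of the commonly cited bands, checked from easiest to hardest.
    bands = [
        (90.0, "very easy"),
        (80.0, "easy"),
        (70.0, "fairly easy"),
        (60.0, "standard"),
        (50.0, "fairly difficult"),
        (30.0, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    # Anything below 30, including negative scores, falls in the hardest band.
    return "very difficult"


# Scores taken from the three example outputs above.
for sample_score in (72.53, 116.14, -23.94):
    print(sample_score, flesch_band(sample_score))
```

Because the bands only look at the score, sample 2 still lands in the easiest band despite its repetition, so a separate check would be needed to catch duplicated content.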
