LLMLanguageFilter

About 519 wordsAbout 2 min

2025-10-09

📘 Overview

LLMLanguageFilter is a large language model (LLM)-based language filtering operator. It identifies the language of input text by calling an LLM and filters data according to a preset list of allowed languages. This operator is mainly used in data cleaning, multilingual data classification, and other scenarios to ensure that text in the data flow meets specific language requirements.

`init` Function

def __init__(self, llm_serving: LLMServingABC = None, allowed_languages: list[str] = ['en'])

Parameter	Type	Default	Description
llm_serving	LLMServingABC	None	Large language model service instance for performing language identification.
allowed_languages	list[str]	['en']	List of allowed languages, using ISO 639-1 two-letter language codes (e.g., 'en' for English, 'zh' for Chinese).

`run` Function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'language_label')

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field for language identification.
output_key	str	'language_label'	Output column name corresponding to the LLM-generated language label field (1 for pass, 0 for fail).

🧠 Example Usage

from dataflow.operators.general_text import LLMLanguageFilter
from dataflow.serving import LocalModelLLMServing
from dataflow.utils.storage import FileStorage

class LLMLanguageFilterTest():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/llm_language_filter_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )
        
        # Initialize LLM service
        llm_serving = LocalModelLLMServing(
            model_name="Qwen/Qwen2.5-7B-Instruct",
            device="cuda"
        )
        
        # Allow only English and Chinese
        self.filter = LLMLanguageFilter(
            llm_serving=llm_serving,
            allowed_languages=['en', 'zh']
        )
        
    def forward(self):
        self.filter.run(
            storage=self.storage.step(),
            input_key='text',
            output_key='language_label'
        )

if __name__ == "__main__":
    test = LLMLanguageFilterTest()
    test.forward()

🧾 Default Output Format

After execution, the operator adds a output_key column (default language_label) to the DataFrame to store the identified language labels. Subsequently, it filters rows where this column's value is in the allowed_languages list and writes them back to storage.

Field	Type	Description
text	str	Original input text field.
language_label	str	Language label identified by LLM (e.g., 'en', 'zh', 'fr', etc.).

📋 Example Input

{"text": "Hello, this is a test message in English."}
{"text": "你好，这是一条中文测试消息。"}
{"text": "Bonjour, ceci est un message de test en français."}

📤 Example Output

{"text": "Hello, this is a test message in English.", "language_label": "en"}
{"text": "你好，这是一条中文测试消息。", "language_label": "zh"}

📊 Result Analysis

Sample 1 (English text):

LLM identified language: en
Allowed languages list: ['en', 'zh']
Passes filter (en is in allowed list)

Sample 2 (Chinese text):

LLM identified language: zh
Allowed languages list: ['en', 'zh']
Passes filter (zh is in allowed list)

Sample 3 (French text):

LLM identified language: fr
Allowed languages list: ['en', 'zh']
Filtered out (fr is not in allowed list)

Use Cases:

Use LLM for more accurate language identification
Process complex, mixed-language text
Build high-quality multilingual datasets
Filter training data for specific languages

Notes:

Requires configured LLM service (e.g., LocalModelLLMServing, APILLMServing, etc.)
Uses ISO 639-1 two-letter language codes (e.g., 'en', 'zh', 'fr', 'es', 'ja', etc.)
LLM identification accuracy depends on model quality
Compared to FastText, LLM identification is more accurate but slower
Suitable for smaller-scale datasets requiring high accuracy

eval

generate

eval

generate

eval

filter

generate

eval

filter

generate

generate

eval

filter

generate

refine

generate

generate

eval

filter

refine

generate

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

refine

LLMLanguageFilter