PairQualFilter


2025-10-09

📘 Overview

The PairQualFilter is an operator that filters data by the quality scores the PairQualScorer produces. The scorer is a bilingual (English and Chinese) text quality evaluator built on a BGE model and trained on GPT-based pairwise comparison annotations. A higher score indicates better text quality. Use this operator to clean a dataset by keeping only the text samples whose scores fall within a chosen range.

`__init__` function

def __init__(self, min_score=0, max_score=10000, model_cache_dir='./dataflow_cache', lang='en'):

Parameter	Type	Default Value	Description
min_score	int	0	The lower bound of the accepted score range; rows scoring below it are filtered out.
max_score	int	10000	The upper bound of the accepted score range; rows scoring above it are filtered out.
model_cache_dir	str	'./dataflow_cache'	The directory path for caching the scoring model.
lang	str	'en'	The language of the text to be evaluated ('en' or 'zh').

`run` function

def run(self, storage: DataFlowStorage, input_key: str, output_key: str='PairQualScore'):

Executes the main filtering logic. It reads a DataFrame from storage, calculates a quality score for the text in the `input_key` column, and stores that score in a new `output_key` column. It then writes back a new DataFrame containing only the rows whose score falls between `min_score` and `max_score`.
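
The score-then-filter behaviour described above can be sketched in plain Python. This is an illustration only, not the operator's implementation: a list of dicts stands in for the DataFrame, `filter_by_score` and `toy_score` are hypothetical names, `toy_score` is a word count rather than the real PairQualScorer, and treating both bounds as inclusive is an assumption.

```python
from typing import Callable, Dict, List


def filter_by_score(
    rows: List[Dict],
    input_key: str,
    output_key: str,
    score_fn: Callable[[str], float],
    min_score: float = 0,
    max_score: float = 10000,
) -> List[Dict]:
    """Score each row, attach the score, and keep rows inside the range.

    Hypothetical helper: inclusive bounds are an assumption, not confirmed
    behaviour of PairQualFilter.
    """
    kept = []
    for row in rows:
        scored = dict(row)  # copy so the input rows are left unchanged
        scored[output_key] = score_fn(row[input_key])
        if min_score <= scored[output_key] <= max_score:
            kept.append(scored)
    return kept


def toy_score(text: str) -> float:
    """Stand-in scorer (word count); the real scorer is a trained model."""
    return float(len(text.split()))


rows = [
    {"raw_content": "one two three"},
    {"raw_content": "one"},
    {"raw_content": "a b c d e f"},
]
result = filter_by_score(
    rows, "raw_content", "PairQualScore", toy_score, min_score=2, max_score=5
)
print(result)
```

With these toy inputs only the first row survives, because its stand-in score is the only one inside the 2 to 5 range.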

Parameters

Name	Type	Default Value	Description
storage	DataFlowStorage	Required	The DataFlowStorage instance for reading the input DataFrame and writing the filtered output.
input_key	str	Required	The name of the input column containing the text to be scored.
output_key	str	'PairQualScore'	The name of the output column where the generated quality score will be stored.

🧠 Example Usage

from dataflow.operators.text_pt.filter import PairQualFilter
from dataflow.utils.storage import FileStorage

# Prepare data and storage
storage = FileStorage(first_entry_file_name="pt_input.jsonl")

# Initialize and run the filter
pairqual_filter = PairQualFilter(
    min_score=0,
    max_score=10000,
    model_cache_dir='./dataflow_cache',
    lang='en'
)
pairqual_filter.run(
    storage.step(),
    input_key='raw_content',
    output_key='PairQualScore'
)
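
The example above expects an input file named pt_input.jsonl. A minimal sketch of preparing such a file follows, assuming the storage reads one JSON object per line with the text under `raw_content`. The helpers `write_jsonl` and `read_jsonl` are hypothetical, the second record is placeholder text, and the file is written to a temporary directory so the sketch has no side effects on the working directory.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_records = [
    {"raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)..."},
    {"raw_content": "Placeholder text for a second sample."},
]

# Write to a temporary directory; in a real run the file would sit where
# FileStorage looks for first_entry_file_name.
tmp_dir = Path(tempfile.mkdtemp())
input_path = write_jsonl(sample_records, tmp_dir / "pt_input.jsonl")
print(read_jsonl(input_path))
```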

🧾 Output Format

Field	Type	Description
...	-	Original columns from the input data.
PairQualScore	float	The quality score calculated by the PairQualScorer. (Column name is determined by `output_key`)

Example Input:

{
    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)..."
}

Example Output:

{
    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...",
    "PairQualScore": 3.2509903908
}
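
Once rows carry a score column, a quick summary of the scores can help when choosing `min_score` and `max_score`. The sketch below is one way to do that with the standard library. `summarize_scores` is a hypothetical helper, and apart from the value taken from the example output above, the scores are made up for illustration.

```python
import statistics


def summarize_scores(rows, score_key="PairQualScore"):
    """Return count, min, median, and max of the score column."""
    scores = sorted(
        float(row[score_key]) for row in rows if row.get(score_key) is not None
    )
    if not scores:
        raise ValueError(f"no rows contain a value for {score_key!r}")
    return {
        "count": len(scores),
        "min": scores[0],
        "median": statistics.median(scores),
        "max": scores[-1],
    }


scored_rows = [
    {"raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...", "PairQualScore": 3.2509903908},
    {"raw_content": "Placeholder sample A.", "PairQualScore": 1.5},  # made-up score
    {"raw_content": "Placeholder sample B.", "PairQualScore": 4.0},  # made-up score
]
summary = summarize_scores(scored_rows)
print(summary)
```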
