PairQualFilter
2025-10-09
📘 Overview
PairQualFilter is an operator that filters data by the quality scores produced by PairQualScorer, a bilingual (English and Chinese) text quality evaluator built on a BGE model and trained on GPT-based pairwise comparison annotations. A higher score indicates better text quality. The operator is useful for cleaning datasets by retaining only the samples whose score falls within a configured range.
__init__ function
def __init__(self, min_score=0, max_score=10000, model_cache_dir='./dataflow_cache', lang='en'):

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| min_score | int | 0 | The minimum quality score threshold for filtering. |
| max_score | int | 10000 | The maximum quality score threshold for filtering. |
| model_cache_dir | str | './dataflow_cache' | The directory path for caching the scoring model. |
| lang | str | 'en' | The language of the text to be evaluated ('en' or 'zh'). |
run function
def run(self, storage: DataFlowStorage, input_key: str, output_key: str='PairQualScore'):

Executes the main filtering logic. It reads a DataFrame from storage, calculates a quality score for the text in the input_key column, adds this score to a new output_key column, and writes back a new DataFrame containing only the rows that fall within the specified score range.
Parameters
| Name | Type | Default Value | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | The DataFlowStorage instance for reading the input DataFrame and writing the filtered output. |
| input_key | str | Required | The name of the input column containing the text to be scored. |
| output_key | str | 'PairQualScore' | The name of the output column where the generated quality score will be stored. |
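The score-then-filter flow described above can be sketched in plain Python. This is a minimal illustration rather than the operator's actual implementation: `toy_score` is a hypothetical stand-in for the real model, `score_and_filter` is a hypothetical helper name, and treating both bounds as inclusive is an assumption the documentation does not confirm.

```python
from typing import Callable, Dict, List


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the real scoring model: uses the word count."""
    return float(len(text.split()))


def score_and_filter(
    rows: List[Dict],
    input_key: str,
    output_key: str = "PairQualScore",
    min_score: float = 0,
    max_score: float = 10000,
    score_fn: Callable[[str], float] = toy_score,
) -> List[Dict]:
    """Score each row's text, attach the score, and keep rows inside the range."""
    kept = []
    for row in rows:
        score = score_fn(row[input_key])
        # Assumption: both bounds are inclusive.
        if min_score <= score <= max_score:
            # Copy the row so the input records are left unchanged.
            kept.append({**row, output_key: score})
    return kept


rows = [
    {"raw_content": "short"},
    {"raw_content": "a somewhat longer piece of text"},
]
filtered = score_and_filter(rows, "raw_content", min_score=2)
print(filtered)
```

With the toy scorer, the one-word row scores below `min_score=2` and is dropped, while the longer row is kept with its score attached under the `output_key` name.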
🧠 Example Usage
from dataflow.operators.text_pt.filter import PairQualFilter
from dataflow.utils.storage import FileStorage
# Prepare data and storage
storage = FileStorage(first_entry_file_name="pt_input.jsonl")
# Initialize and run the filter
pairqual_filter = PairQualFilter(
    min_score=0,
    max_score=10000,
    model_cache_dir='./dataflow_cache',
    lang='en'
)
pairqual_filter.run(
    storage.step(),
    input_key='raw_content',
    output_key='PairQualScore'
)

🧾 Output Format
| Field | Type | Description |
|---|---|---|
| ... | - | Original columns from the input data. |
| PairQualScore | float | The quality score calculated by the PairQualScorer. (Column name is determined by output_key) |
Example Input:
{
  "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)..."
}

Example Output:
{
  "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...",
  "PairQualScore": 3.2509903908
}
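
When choosing values for min_score and max_score, it can help to look at the score distribution first. The sketch below assumes the scored records are available as JSON Lines with the fields shown above; the in-memory sample and its values are hypothetical, and `load_scores` is an illustrative helper, not part of the library.

```python
import io
import json
import statistics

# Hypothetical records standing in for a scored JSONL file;
# in practice, replace this with open("<your output file>").
sample_output = io.StringIO(
    '{"raw_content": "doc a", "PairQualScore": 3.25}\n'
    '{"raw_content": "doc b", "PairQualScore": 1.10}\n'
    '{"raw_content": "doc c", "PairQualScore": 4.80}\n'
    '{"raw_content": "doc d", "PairQualScore": 2.05}\n'
)


def load_scores(lines, score_key="PairQualScore"):
    """Collect the score field from each non-empty JSONL line."""
    scores = []
    for line in lines:
        line = line.strip()
        if line:
            scores.append(json.loads(line)[score_key])
    return scores


scores = load_scores(sample_output)
summary = {
    "count": len(scores),
    "min": min(scores),
    "median": statistics.median(scores),
    "max": max(scores),
}
print(summary)
```

A summary like this gives a rough sense of where a threshold would cut the data before rerunning the filter with a tighter range.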
