PerplexityFilter

2025-10-09

📘 Overview

The PerplexityFilter is an operator that filters records by perplexity score. It computes the perplexity of each text with a Hugging Face model (`gpt2` by default); lower perplexity generally indicates more fluent, higher-quality text. Only records whose score lies between `min_score` and `max_score` are kept, which makes the operator useful for cleaning datasets by removing low-quality or nonsensical text entries.
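To make the score concrete, here is a minimal sketch of the standard perplexity definition: the exponential of the average negative log-likelihood per token. The function name `perplexity_from_logprobs` and the toy log-probabilities are assumptions for illustration only; the operator obtains token probabilities from the configured Hugging Face model, and its internal implementation may differ in detail.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Return exp(-mean(log p)) for a sequence of per-token natural-log probabilities.

    Hypothetical helper: it illustrates the textbook definition only.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token is as uncertain as a coin flip.
uniform = perplexity_from_logprobs([math.log(0.5)] * 4)

# A model that assigns probability 0.9 to every token is more confident,
# so its perplexity is lower.
confident = perplexity_from_logprobs([math.log(0.9)] * 4)
```

Under this definition, text that the model finds predictable scores close to 1, while text it finds surprising scores much higher.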

`__init__`

def __init__(self, min_score: float = 10.0, max_score: float = 500.0, model_name: str = 'gpt2', device='cuda')

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `min_score` | float | 10.0 | The minimum perplexity score for a record to be kept. |
| `max_score` | float | 500.0 | The maximum perplexity score for a record to be kept. |
| `model_name` | str | 'gpt2' | The name or path of the Hugging Face model to use for scoring. |
| `device` | str | 'cuda' | The device on which the model will run (e.g., 'cuda' or 'cpu'). |

`run`

def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'PerplexityScore')

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `storage` | DataFlowStorage | Required | The DataFlow storage instance for reading and writing data. |
| `input_key` | str | Required | The name of the input column containing the text to be scored. |
| `output_key` | str | 'PerplexityScore' | The name of the new column that will store the calculated perplexity score. |

🧠 Example Usage

from dataflow.operators.text_pt.filter import PerplexityFilter
from dataflow.utils.storage import FileStorage

# Prepare data and storage
storage = FileStorage(first_entry_file_name="pt_input.jsonl")

# Initialize and run the filter
perplexity_filter = PerplexityFilter(
    min_score=10.0,
    max_score=500.0,
    model_name='gpt2',
    device='cuda'
)
perplexity_filter.run(
    storage.step(),
    input_key='raw_content',
    output_key='PerplexityScore'
)
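The example above reads from `pt_input.jsonl`. As a hedged sketch, the snippet below shows one way such a file could be prepared, assuming the storage expects JSON Lines (one JSON object per line) with the text stored under the `raw_content` key. The helper `write_jsonl` is hypothetical and is not part of DataFlow.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (JSON Lines). Hypothetical helper."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Each record carries the text under the key later passed as input_key.
records = [
    {"raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)..."},
]
```

Calling `write_jsonl("pt_input.jsonl", records)` before constructing the storage would create the input file used in the example.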

🧾 Default Output Format

The operator adds a new column (named by `output_key`) containing the perplexity score of each record, then keeps only the rows whose score lies between `min_score` and `max_score`. Rows outside that range are dropped from the output.

| Field | Type | Description |
| --- | --- | --- |
| original_fields | any | The original fields from the input data. |
| PerplexityScore | float | The calculated perplexity score for the text in the `input_key` column. |

Example Input:

{
  "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)..."
}

Example Output (if it passes the filter):

{
  "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...",
  "PerplexityScore": 49.2016410828
}
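The filtering step can be pictured as a simple range check on the score column. The sketch below is an illustration under stated assumptions, not the operator's actual code: `filter_by_score` is a hypothetical helper, the bounds are assumed to be inclusive, and the second and third rows (with their scores) are invented to show rejected cases.

```python
def filter_by_score(rows, min_score=10.0, max_score=500.0, output_key="PerplexityScore"):
    """Keep rows whose score lies in [min_score, max_score].

    Hypothetical helper; inclusive bounds are an assumption.
    """
    return [row for row in rows if min_score <= row[output_key] <= max_score]


rows = [
    # Score taken from the example output above.
    {"raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...", "PerplexityScore": 49.2016410828},
    # Invented rows: one below min_score, one above max_score.
    {"raw_content": "hypothetical highly repetitive text", "PerplexityScore": 3.1},
    {"raw_content": "hypothetical noisy text", "PerplexityScore": 1250.0},
]

kept = filter_by_score(rows)
```

With the default thresholds, only the first row survives. Note that the lower bound also removes text with very low perplexity.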
