TextbookFilter


2025-10-09

Overview

TextbookFilter is an operator that filters data based on scores from the TextbookScorer. The scorer uses a FastText classifier to assess the educational value of a text, that is, whether the text is suitable as educational material. The classifier is trained to identify text with educational significance, a clear structure, and accurate knowledge, which makes the operator useful for building educational datasets.

__init__

def __init__(self, min_score=0.99, max_score=1, model_cache_dir:str='./dataflow_cache')

__init__ Parameters

Parameter	Type	Default	Description
min_score	float	0.99	The minimum educational value score threshold for retaining samples.
max_score	int	1	The maximum educational value score threshold for retaining samples.
model_cache_dir	str	'./dataflow_cache'	The directory for caching the scoring model.

run

def run(self, storage: DataFlowStorage, input_key: str, output_key: str='TextbookScore')

Executes the main logic of the operator. It reads an input DataFrame from storage, calculates the educational value score for the specified text column, adds the score as a new column, and writes back a new DataFrame containing only the rows that fall within the [min_score, max_score] range.

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	The DataFlow storage instance, responsible for reading and writing data.
input_key	str	Required	The column name of the input text to be evaluated.
output_key	str	"TextbookScore"	The column name for the generated educational value score.
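
The score-then-filter behavior described for run can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the operator's implementation: `filter_by_score` and `toy_score` are hypothetical names, rows are plain dicts rather than a DataFrame, the bounds are treated as inclusive (as the [min_score, max_score] notation suggests), and `toy_score` is a stand-in that has nothing to do with the real FastText model.

```python
from typing import Callable, Dict, List


def filter_by_score(
    rows: List[Dict],
    input_key: str,
    score_fn: Callable[[str], float],
    output_key: str = "TextbookScore",
    min_score: float = 0.99,
    max_score: float = 1.0,
) -> List[Dict]:
    """Score each row's text, store the score under output_key,
    and keep only rows whose score lies in [min_score, max_score]."""
    kept = []
    for row in rows:
        scored = dict(row)  # copy so the input rows are not mutated
        scored[output_key] = score_fn(row[input_key])
        if min_score <= scored[output_key] <= max_score:
            kept.append(scored)
    return kept


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer (assumption): NOT the FastText classifier."""
    return 1.0 if "theorem" in text.lower() else 0.2


rows = [
    {"raw_content": "A theorem and its proof."},
    {"raw_content": "Buy now!"},
]
result = filter_by_score(rows, "raw_content", toy_score)
print(result)
```

Only the first row survives here, and it carries the new score column alongside its original fields, which mirrors the output shape described for the operator.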

🧠 Example Usage

from dataflow.operators.text_pt.filter import TextbookFilter
from dataflow.utils.storage import FileStorage

# Prepare data and storage
storage = FileStorage(first_entry_file_name="pt_input.jsonl")

# Initialize and run the filter
textbook_filter = TextbookFilter(
    min_score=0.99,
    max_score=1,
    model_cache_dir='./dataflow_cache'
)
textbook_filter.run(
    storage.step(),
    input_key='raw_content',
    output_key='TextbookScore'
)

🧾 Default Output Format

The operator filters the input data and adds a new column containing the educational score. Only rows where the score is between min_score and max_score are kept.

Field	Type	Description
...	...	Original columns from the input data.
TextbookScore	float	The educational value score calculated for the text in `input_key`. The name of this field is determined by the `output_key` parameter.

Example Input:

{
    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)..."
}

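Example Output (assuming the score passes the filter; note that a score of about 2.96 lies outside the default [0.99, 1] range, so this example presumes a larger max_score):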

{
    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...",
    "TextbookScore": 2.9629482031
}
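
To pick sensible thresholds it can help to look at the scores that were actually written. The sketch below assumes the step output is stored as JSON Lines (one record per line, as the `.jsonl` input file suggests); `load_scores` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the real output file, whose path is not specified here.

```python
import io
import json


def load_scores(lines, output_key="TextbookScore"):
    """Parse JSON Lines records and return the scores stored under output_key."""
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        scores.append(float(record[output_key]))
    return scores


# Stand-in for an output file. The first record is the example above;
# the second is made up purely for illustration.
sample = io.StringIO(
    '{"raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...", "TextbookScore": 2.9629482031}\n'
    '{"raw_content": "hypothetical second row", "TextbookScore": 0.995}\n'
)
scores = load_scores(sample)
print(len(scores), min(scores), max(scores))
```

With a real file, passing the open file handle to `load_scores` works the same way, since the function only iterates over lines.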
