DeitaComplexityFilter


2025-10-09

📘 Overview

DeitaComplexityFilter is an operator that filters data by instruction complexity. It uses the Llama-based DeitaComplexitySampleEvaluator to score the complexity of each instruction and keeps only the entries whose scores fall within the specified range [min_score, max_score].

`__init__` function

__init__(self, min_score=3.0, max_score=5.0, device='cuda', model_cache_dir='./dataflow_cache', max_length=512)

Parameter	Type	Default	Description
min_score	float	`3.0`	The minimum complexity score threshold. Data with scores below this value will be filtered out.
max_score	float	`5.0`	The maximum complexity score threshold. Data with scores above this value will be filtered out.
device	str	`'cuda'`	The device on which the complexity scoring model will run (e.g., 'cuda', 'cpu').
model_cache_dir	str	`'./dataflow_cache'`	The directory to store and cache the downloaded scoring model.
max_length	int	`512`	The maximum sequence length that the model can process.

Prompt Template Descriptions

`run` function

run(self, storage: DataFlowStorage, input_instruction_key: str = 'instruction', input_output_key: str = 'output', output_key: str = 'DeitaComplexityScore')

Parameter	Type	Default	Description
storage	DataFlowStorage	Required	The DataFlow storage instance used for reading the input DataFrame and writing the filtered output.
input_instruction_key	str	`'instruction'`	The name of the column in the input data containing the instruction text to be evaluated.
input_output_key	str	`'output'`	The name of the column in the input data containing the corresponding output/response text.
output_key	str	`'DeitaComplexityScore'`	The name of the new column that will be added to the DataFrame to store the calculated complexity score.
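The steps that `run` performs (read rows, score each instruction, store the score under `output_key`, keep rows whose score is in range) can be sketched in plain Python. This is a minimal illustration, not the operator's implementation: `score_and_filter` and `toy_scorer` are hypothetical names, the scorer is a word-count stand-in for the real model, and inclusive bounds are an assumption based on the [min_score, max_score] notation.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def score_and_filter(
    rows: List[Row],
    scorer: Callable[[str, str], float],
    input_instruction_key: str = "instruction",
    input_output_key: str = "output",
    output_key: str = "DeitaComplexityScore",
    min_score: float = 3.0,
    max_score: float = 5.0,
) -> List[Row]:
    """Score every row, attach the score, and keep rows inside the range."""
    kept = []
    for row in rows:
        score = scorer(str(row[input_instruction_key]), str(row[input_output_key]))
        # Copy the row so the caller's input is left untouched.
        scored = {**row, output_key: score}
        # Assumption: both bounds are inclusive.
        if min_score <= score <= max_score:
            kept.append(scored)
    return kept


def toy_scorer(instruction: str, output: str) -> float:
    # Hypothetical stand-in for the model: one point per word, capped at 6.0.
    return min(6.0, float(len(instruction.split())))


rows = [
    {"instruction": "Say hi.", "output": "Hi!"},
    {"instruction": "Compare lists and tuples briefly.", "output": "Lists are mutable."},
]
kept = score_and_filter(rows, toy_scorer)
```

With the toy scorer, the first row scores 2.0 and is dropped, while the second scores 5.0 and is kept because the upper bound is treated as inclusive.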

🧠 Example Usage

from dataflow.operators.text_sft.filter import DeitaComplexityFilter
from dataflow.utils.storage import FileStorage

# Prepare storage with instruction-output pairs
storage = FileStorage(first_entry_file_name="sft_data.jsonl")

# Initialize and run the filter
complexity_filter = DeitaComplexityFilter(
    min_score=2.0,
    max_score=5.0,
    device="cuda",
    model_cache_dir="./dataflow_cache",
)
complexity_filter.run(
    storage.step(),
    input_instruction_key="instruction",
    input_output_key="output",
    output_key="DeitaComplexityScore",
)
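The usage example reads from `sft_data.jsonl`. A minimal sketch of preparing such a file with the standard library follows, assuming the JSON Lines layout implied by the file extension (one JSON object per line) with `instruction` and `output` fields. The sample records are invented for illustration.

```python
import json
from pathlib import Path

# Illustrative records; the keys match the default input column names.
records = [
    {"instruction": "Explain what a Python tuple is.", "output": "A tuple is an immutable sequence."},
    {"instruction": "Name one mutable built-in type.", "output": "list"},
]

input_path = Path("sft_data.jsonl")

# Write one JSON object per line.
with input_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm every line parses and has both fields.
with input_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

missing = [r for r in loaded if "instruction" not in r or "output" not in r]
```

If the data uses different column names, pass them to `run` through `input_instruction_key` and `input_output_key` instead of renaming the fields.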

🧾 Output Format

The operator adds a new column (default name DeitaComplexityScore) to the input data and filters the rows based on the score.

Field	Type	Description
instruction	str	The input instruction text.
output	str	The input response text.
DeitaComplexityScore	float	The calculated complexity score for the instruction.

Example Input:

{
  "instruction":"Provide a detailed comparison between the 'list' and 'tuple' data structures in Python, focusing on mutability, performance, and common use cases.",
  "output":"Certainly. The primary distinction between lists and tuples in Python lies in their mutability. Lists are mutable, meaning their elements can be added, removed, or modified after creation. Tuples are immutable; once created, their contents cannot be altered. This immutability makes tuples slightly more memory-efficient and faster to access."
}

Example Output:

{
  "instruction":"Provide a detailed comparison between the 'list' and 'tuple' data structures in Python, focusing on mutability, performance, and common use cases.",
  "output":"Certainly. The primary distinction between lists and tuples in Python lies in their mutability. Lists are mutable, meaning their elements can be added, removed, or modified after creation. Tuples are immutable; once created, their contents cannot be altered. This immutability makes tuples slightly more memory-efficient and faster to access.",
  "DeitaComplexityScore":2.9713823783
}
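The example record scores about 2.97, so whether it is kept depends on the thresholds: it passes with `min_score=2.0` from the usage example but would be removed with the default `min_score=3.0`. The check below is a small sketch of that comparison; `in_range` is a hypothetical helper, and inclusive bounds are an assumption based on the [min_score, max_score] notation.

```python
def in_range(score: float, min_score: float, max_score: float) -> bool:
    # Assumption: both bounds are inclusive.
    return min_score <= score <= max_score


example_score = 2.9713823783  # score from the example output

# Thresholds used in the usage example.
kept_with_example_config = in_range(example_score, min_score=2.0, max_score=5.0)

# Default thresholds from the __init__ signature.
kept_with_defaults = in_range(example_score, min_score=3.0, max_score=5.0)

print(kept_with_example_config)  # True
print(kept_with_defaults)  # False
```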
