DeitaQualityFilter
2025-10-09
📘 Overview
The DeitaQualityFilter operator filters data by quality score. It uses the DeitaQualitySampleEvaluator, which is based on a Llama model, to score instruction-output pairs. Entries whose score falls within the configured range (min_score to max_score) are retained; all others are dropped.
__init__ function
def __init__(self, min_score=2.5, max_score=10000.0, device='cuda', model_cache_dir='./dataflow_cache', max_length=512)
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| min_score | float | 2.5 | The minimum score threshold for an entry to be kept. |
| max_score | float | 10000.0 | The maximum score threshold for an entry to be kept. |
| device | str | 'cuda' | The device on which the scoring model will run (e.g., 'cuda', 'cpu'). |
| model_cache_dir | str | './dataflow_cache' | The directory to store and load the cached scoring model. |
| max_length | int | 512 | The maximum sequence length for the scoring model's input. |
run function
def run(self, storage, input_instruction_key="instruction", input_output_key="output", output_key="DeitaQualityScore")
Executes the main logic of the operator. It reads an input DataFrame from storage, calculates a quality score for each entry, and writes back the filtered DataFrame to storage.
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | The data flow storage instance used for reading and writing data. |
| input_instruction_key | str | "instruction" | The column name in the input data that contains the instruction text. |
| input_output_key | str | "output" | The column name in the input data that contains the corresponding output text. |
| output_key | str | "DeitaQualityScore" | The column name where the generated quality score will be stored. |
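The score-then-filter flow described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: `filter_by_score` and `toy_score` are hypothetical names, the scorer is a toy heuristic standing in for the model-based evaluator, rows are plain dicts rather than a DataFrame read from storage, and treating both bounds as inclusive is an assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def filter_by_score(
    rows: List[Row],
    score_fn: Callable[[str, str], float],  # hypothetical stand-in for the evaluator
    min_score: float = 2.5,
    max_score: float = 10000.0,
    input_instruction_key: str = "instruction",
    input_output_key: str = "output",
    output_key: str = "DeitaQualityScore",
) -> List[Row]:
    """Attach a score to every row, then keep rows inside the score range."""
    kept: List[Row] = []
    for row in rows:
        score = score_fn(str(row[input_instruction_key]), str(row[input_output_key]))
        scored = {**row, output_key: score}  # copy the row and add the score column
        # Assumption: both bounds are inclusive.
        if min_score <= score <= max_score:
            kept.append(scored)
    return kept


def toy_score(instruction: str, output: str) -> float:
    """Toy heuristic, NOT the Deita model: longer outputs score higher, capped at 6.0."""
    return min(6.0, 1.0 + len(output.split()) / 2)


rows = [
    {"instruction": "Say hi", "output": "Hi"},
    {
        "instruction": "Explain recursion",
        "output": "A function that calls itself on a smaller input until a base case",
    },
]
kept = filter_by_score(rows, toy_score)
for row in kept:
    print(row["instruction"], row["DeitaQualityScore"])
```

With the default min_score of 2.5, the one-word answer is dropped and only the second row survives, carrying its score in the column named by output_key.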
🧠 Example Usage
from dataflow.operators.text_sft.filter import DeitaQualityFilter
from dataflow.utils.storage import FileStorage
# Prepare storage with instruction-output pairs
storage = FileStorage(first_entry_file_name="sft_data.jsonl")
# Initialize and run the filter
quality_filter = DeitaQualityFilter(
    min_score=2.5,
    max_score=6.0,
    device="cuda",
    model_cache_dir="./dataflow_cache",
)
quality_filter.run(
    storage.step(),
    input_instruction_key="instruction",
    input_output_key="output",
    output_key="DeitaQualityScore",
)
🧾 Default Output Format
The operator adds a new column (specified by output_key) to the input data, containing the calculated quality score. It then filters the rows based on min_score and max_score.
| Field | Type | Description |
|---|---|---|
| instruction | str | The input instruction text. |
| output | str | The output (response) text paired with the instruction. |
| DeitaQualityScore | float | The quality score generated by the Deita model. |
Example Input:
{
"instruction": "Can you provide a list of healthy habits to maintain a healthy lifestyle? Please format your response as an HTML page with bullet points.",
"output": "Here's an HTML page with bullet points for healthy habits:\n<html>\n <body>\n <h3>Healthy Habits:</h3>\n <ul>\n <li>Eating a balanced diet with plenty of fruits and vegetables.</li>\n <li>Engaging in regular physical activity, such as walking, running, or cycling.</li>\n <li>Getting enough sleep each night, ideally 7-8 hours.</li>\n <li>Staying hydrated by drinking plenty of water throughout the day.</li>\n <li>Limiting alcohol consumption and avoiding smoking.</li>\n <li>Managing stress through relaxation techniques like meditation or yoga.</li>\n <li>Regularly visiting a healthcare provider for check-ups and preventative care.</li>\n </ul>\n </body>\n</html>"
}
Example Output (if it passes the filter):
{
"instruction": "Can you provide a list of healthy habits to maintain a healthy lifestyle? Please format your response as an HTML page with bullet points.",
"output": "Here's an HTML page with bullet points for healthy habits:\n<html>\n <body>\n <h3>Healthy Habits:</h3>\n <ul>\n <li>Eating a balanced diet with plenty of fruits and vegetables.</li>\n <li>Engaging in regular physical activity, such as walking, running, or cycling.</li>\n <li>Getting enough sleep each night, ideally 7-8 hours.</li>\n <li>Staying hydrated by drinking plenty of water throughout the day.</li>\n <li>Limiting alcohol consumption and avoiding smoking.</li>\n <li>Managing stress through relaxation techniques like meditation or yoga.</li>\n <li>Regularly visiting a healthcare provider for check-ups and preventative care.</li>\n </ul>\n </body>\n</html>",
"DeitaQualityScore": 4.0573154859
}
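A downstream sanity check on the filtered records might look like the sketch below. It assumes the records are available as JSON Lines strings, one object per line as in the example above; `summarize_scores` is a hypothetical helper, not part of DataFlow, and the inclusive range check is an assumption.

```python
import json
from statistics import mean


def summarize_scores(lines, output_key="DeitaQualityScore", min_score=2.5, max_score=10000.0):
    """Parse JSON Lines records and summarize the score column.

    Raises ValueError if a record's score lies outside [min_score, max_score],
    which would suggest the data was not filtered with these thresholds.
    """
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        score = float(record[output_key])
        if not (min_score <= score <= max_score):
            raise ValueError(f"score {score} outside [{min_score}, {max_score}]")
        scores.append(score)
    if not scores:
        return {"count": 0}
    return {"count": len(scores), "min": min(scores), "mean": mean(scores), "max": max(scores)}


# Two illustrative records; the second score is made up for the example.
sample = [
    '{"instruction": "a", "output": "b", "DeitaQualityScore": 4.0573154859}',
    '{"instruction": "c", "output": "d", "DeitaQualityScore": 3.25}',
]
summary = summarize_scores(sample, max_score=6.0)
print(summary)
```

To check a real file, pass an open file handle in place of `sample`, since iterating a text file yields one line at a time.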
