TreeinstructFilter


2025-10-09

📘 Overview

The TreeinstructFilter operator scores each instruction with the TreeinstructScore evaluator and keeps only the samples whose score falls within a configured range. The score is the number of nodes in the syntax tree generated for the instruction; more nodes indicate a more complex instruction. This makes the operator suitable for selecting instruction data within a specific complexity range and for balancing the difficulty distribution of a dataset, both of which can improve model training.
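To make the scoring idea concrete, the sketch below counts the nodes of a small tree. The nested-dict layout (`label`, `children`) and the `count_nodes` helper are hypothetical and chosen only for illustration; the real tree is produced by the LLM inside TreeinstructScore, and its format is not described on this page.

```python
def count_nodes(tree: dict) -> int:
    """Count a node plus all of its descendants (hypothetical tree layout)."""
    return 1 + sum(count_nodes(child) for child in tree.get("children", []))


# Hypothetical tree for a short instruction; not actual evaluator output.
example_tree = {
    "label": "Generate",
    "children": [
        {"label": "list", "children": [{"label": "ten items"}]},
        {"label": "for", "children": [{"label": "camping trip"}]},
    ],
}

print(count_nodes(example_tree))  # prints 5
```

A longer instruction with more clauses and constraints would yield a larger tree, and therefore a higher score.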

`__init__`

def __init__(self, min_score: int = 7, max_score: int = 100, llm_serving: LLMServingABC = None)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_score | int | 7 | The minimum score threshold for retaining samples. |
| max_score | int | 100 | The maximum score threshold for retaining samples. |
| llm_serving | LLMServingABC | None | The Large Language Model serving instance. |

Prompt Template Descriptions

| Prompt Template Name | Primary Purpose | Applicable Scenarios | Feature Description |
| --- | --- | --- | --- |

`run`

def run(self, storage: DataFlowStorage, input_key: str, output_key: str = 'TreeinstructScore')

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| storage | DataFlowStorage | Required | The DataFlow storage instance, responsible for reading and writing data. |
| input_key | str | Required | The input column name, corresponding to the instruction field. |
| output_key | str | "TreeinstructScore" | The output column name, corresponding to the generated score field. |

🧠 Example Usage

from dataflow.operators.text_sft.filter import TreeinstructFilter
from dataflow.utils.storage import FileStorage
from dataflow.utils.llm_serving import APILLMServing_request

# Prepare storage with instruction data
storage = FileStorage(first_entry_file_name="sft_data.jsonl")

# Initialize LLM serving
llm_serving = APILLMServing_request(
    api_url="http://<your_llm_api_endpoint>",
    model_name="<your_model_name>"
)

# Initialize and run the filter
treeinstruct_filter = TreeinstructFilter(
    min_score=7,
    max_score=100,
    llm_serving=llm_serving,
)
treeinstruct_filter.run(
    storage.step(),
    input_key="instruction",
    output_key="TreeinstructScore",
)
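
The example above reads `sft_data.jsonl`. A minimal sketch for producing such a file with the standard library follows, assuming one JSON object per line with an `instruction` field that matches `input_key`. The `write_jsonl` helper and the first sample instruction are made up for illustration and are not part of DataFlow.

```python
import json


def write_jsonl(path, rows):
    """Write one JSON object per line (hypothetical helper, not a DataFlow API)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


samples = [
    {"instruction": "Translate 'good morning' into French."},
    {
        "instruction": "Generate a list of ten essential items a person might "
        "need for a camping trip in a specific region, taking into "
        "consideration the weather, terrain, and local wildlife."
    },
]

write_jsonl("sft_data.jsonl", samples)
```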

🧾 Output Format

The operator adds a new column (specified by output_key) with the complexity score to the input data and then filters the rows based on the min_score and max_score thresholds.

| Field | Type | Description |
| --- | --- | --- |
| ... | ... | Original fields from the input data. |
| TreeinstructScore | int | The calculated complexity score based on the syntax tree node count. |
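
The filtering step can be pictured as a plain range check on the score column. The sketch below is not the operator's implementation: `filter_by_score` is a hypothetical helper, the scored rows are invented, and treating both thresholds as inclusive is an assumption.

```python
def filter_by_score(rows, score_key="TreeinstructScore", min_score=7, max_score=100):
    """Keep rows whose score lies in [min_score, max_score].

    Inclusive bounds are an assumption made for this sketch.
    """
    return [row for row in rows if min_score <= row[score_key] <= max_score]


# Invented rows that already carry a score column.
scored_rows = [
    {"instruction": "Say hi.", "TreeinstructScore": 3},
    {"instruction": "Plan a camping packing list for a rainy region.", "TreeinstructScore": 10},
    {"instruction": "Summarize the text in two sentences.", "TreeinstructScore": 7},
    {"instruction": "An extremely long, deeply nested task.", "TreeinstructScore": 150},
]

kept = filter_by_score(scored_rows)
for row in kept:
    print(row["TreeinstructScore"], row["instruction"])
```

With the default thresholds, the rows scored 3 and 150 are dropped and the rows scored 10 and 7 remain.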

Example Input:

{
  "instruction": "Generate a list of ten essential items a person might need for a camping trip in a specific region, taking into consideration the weather, terrain, and local wildlife."
}

Example Output (if it passes the filter):

{
  "instruction": "Generate a list of ten essential items a person might need for a camping trip in a specific region, taking into consideration the weather, terrain, and local wildlife.",
  "TreeinstructScore": 10.0
}
