KCenterGreedyFilter

About 274 wordsLess than 1 minute

2025-10-09

📘 Overview

The KCenterGreedyFilter is an operator designed to select a representative subset of document fragments from a large dataset. It utilizes the K-Center Greedy algorithm, based on text embeddings, to ensure the selected samples are diverse and effectively cover the feature space of the original data. This operator is particularly useful for creating a smaller, high-quality dataset for subsequent tasks, such as generating seed Question-Answer pairs.

`init` function

def __init__(self, num_samples: int, embedding_serving : LLMServingABC = None)

Parameter	Type	Default	Description
num_samples	int	Required	The target number of samples to select from the dataset.
embedding_serving	LLMServingABC	None	An embedding service instance used to generate vector representations of the input text.

Prompt Template Descriptions

(This operator does not use prompt templates.)

`run` function

def run(self, storage: DataFlowStorage, input_key: str = "content")

Parameter	Type	Default	Description
storage	DataFlowStorage	Required	The data flow storage instance for reading the input DataFrame and writing the filtered output.
input_key	str	"content"	The column name in the input DataFrame that contains the text content to be filtered.

🧠 Example Usage

# No example usage provided

🧾 Output Format

The operator filters the rows of the input DataFrame and writes the resulting subset back to storage. The schema of the output data is identical to the input, but it contains only the num_samples rows selected by the algorithm.

Example Input Data: A DataFrame with multiple rows and a column specified by input_key (e.g., "content").

Example Output Data: A new DataFrame stored to a file, containing a subset of the original rows. All original columns are preserved for the selected rows.

eval

generate

eval

generate

eval

filter

generate

eval

filter

generate

generate

eval

filter

generate

refine

generate

generate

eval

filter

refine

generate

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

eval

filter

generate

refine

KCenterGreedyFilter

📘 Overview

__init__ function

Prompt Template Descriptions

run function

🧠 Example Usage

🧾 Output Format

`init` function

`run` function