VendiDatasetEvaluator

2025-10-09

📘 Overview

VendiDatasetEvaluator is an operator that evaluates dataset diversity. It generates text embeddings with pre-trained BERT and SimCSE models and computes a VendiScore from those embeddings as the diversity measure.

Key Features:

Uses BERT and SimCSE models to generate text embedding vectors
Calculates Vendi diversity scores based on embedding vectors
Returns dataset-level diversity evaluation results
Supports GPU-accelerated computation

Use Cases: Evaluating the semantic diversity of a dataset as a whole. The operator does not score individual samples.

__init__

def __init__(self, device='cuda')

__init__ Parameters

Parameter	Type	Default	Description
device	str	`'cuda'`	Computing device.

run

def run(self, storage: DataFlowStorage, input_key: str, use_simcse: bool = True)

Parameters

Name	Type	Default	Description
storage	DataFlowStorage	Required	DataFlow storage instance for reading and writing data.
input_key	str	Required	Input column name corresponding to the text field to be evaluated.
use_simcse	bool	True	Whether to use SimCSE model for score calculation.

🧠 Example Usage

from dataflow.operators.general_text import VendiDatasetEvaluator
from dataflow.utils.storage import FileStorage

class VendiDatasetEvaluatorTest:
    def __init__(self):
        # Read the example input from a JSONL file
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/GeneralTextPipeline/vendi_test_input.jsonl",
            cache_path="./cache",
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        # Run the embedding models on the GPU
        self.evaluator = VendiDatasetEvaluator(
            device='cuda'
        )

    def forward(self):
        # Evaluate the 'text' column and return the dataset-level scores
        result = self.evaluator.run(
            storage=self.storage.step(),
            input_key='text'
        )
        return result

if __name__ == "__main__":
    test = VendiDatasetEvaluatorTest()
    result = test.forward()
    print(f"Vendi Result: {result}")

🧾 Default Output Format

Field	Type	Description
BERTVendiScore	float	Diversity score based on BERT.
SimCSEVendiScore	float	Diversity score based on SimCSE.

📋 Example Input

{"text": "The stock market showed significant gains today as investors responded positively to the Federal Reserve's latest policy announcement."}
{"text": "Scientists discovered a new species of deep-sea fish in the Mariana Trench during a recent expedition."}
{"text": "The championship game ended in a thrilling overtime victory for the home team."}
{"text": "A new study reveals that regular exercise can significantly improve cognitive function in older adults."}
{"text": "The tech company announced plans to launch its innovative smartphone model next quarter."}
{"text": "Climate change activists organized a massive protest in the capital city demanding immediate action."}
{"text": "The award-winning chef opened a new restaurant featuring fusion cuisine from around the world."}
{"text": "Researchers developed a breakthrough treatment that shows promise for treating rare genetic disorders."}
{"text": "The museum unveiled a rare collection of ancient artifacts discovered in Egypt."}
{"text": "Economic analysts predict steady growth in the renewable energy sector over the next decade."}

📤 Example Output

{
  "BERTVendiScore": 1.25,
  "SimCSEVendiScore": 8.72
}

📊 Result Analysis

The input dataset contains 10 texts on different topics: financial markets, marine biology, sports, medical research, technology products, climate change, cuisine, genetics, archaeology, and energy economics.

Output Interpretation:

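BERTVendiScore: 1.25 - Diversity score calculated from BERT embeddings. BERT is a general-purpose language model that is not specifically trained to separate sentences by meaning, so its embeddings for these texts remain highly similar to one another, resulting in a score close to the minimum of 1.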
SimCSEVendiScore: 8.72 - Diversity score calculated based on SimCSE embeddings. SimCSE is specifically optimized for sentence semantic similarity and can better distinguish texts with different topics, resulting in a higher score that better reflects the actual diversity of the dataset.

Score Interpretation: VendiScore can be read as the effective number of distinct samples in the dataset. It theoretically ranges from 1 to the number of samples (10 in this example). Scores closer to the number of samples indicate more diverse datasets; scores closer to 1 indicate more homogeneous datasets. A SimCSEVendiScore of 8.72 indicates that these 10 texts have very high semantic diversity, covering multiple different topic areas.
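
The range described above can be checked with a minimal sketch of the score itself. It assumes the common formulation: a cosine-similarity kernel K over L2-normalized embeddings, with the score taken as the exponential of the Shannon entropy of the eigenvalues of K/n. The operator's actual kernel and implementation may differ, and `vendi_score` is a hypothetical helper written for illustration, not DataFlow's API.

```python
import numpy as np


def vendi_score(embeddings):
    """Vendi Score of row-wise embeddings under a cosine-similarity kernel.

    Assumes every row is non-zero so that it can be L2-normalized.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    # After L2 normalization the dot product is the cosine similarity
    # and every diagonal entry of the kernel equals 1.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    n = x.shape[0]
    kernel = x @ x.T
    # The eigenvalues of K/n are non-negative and sum to 1.
    eigenvalues = np.linalg.eigvalsh(kernel / n)
    # Drop numerically zero eigenvalues, since 0 * log(0) is taken as 0.
    eigenvalues = eigenvalues[eigenvalues > 1e-12]
    entropy = -np.sum(eigenvalues * np.log(eigenvalues))
    return float(np.exp(entropy))


# Ten identical embeddings: fully homogeneous, score 1 (up to rounding).
identical_score = vendi_score(np.ones((10, 4)))

# Ten mutually orthogonal embeddings: fully diverse, score 10 (up to rounding).
orthogonal_score = vendi_score(np.eye(10))

# The same ten directions plus a large shared component: every pair now has
# high cosine similarity, so the score falls far below 10.
shared_offset_score = vendi_score(np.eye(10) + 1.0)

print(identical_score, orthogonal_score, shared_offset_score)
```

The third case mirrors the gap between the two reported scores: when all embeddings share a dominant common component, pairwise similarities are high and the score stays near 1 even though the underlying texts differ.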

Application Value: This operator can be used to evaluate the semantic coverage and diversity of training datasets, helping determine whether the dataset contains sufficiently rich semantic information to train models with stronger generalization capabilities. Compared to Task2Vec, VendiScore focuses more on direct semantic embedding diversity assessment.
