easy_evaluation
2025-10-17
This guide explains how to use the DataFlow evaluation pipeline to assess model-generated answers against ground-truth answers using either semantic or exact match comparison.
Two evaluation modes are supported:
- Direct Comparison Mode: Compare existing model outputs with ground truth answers.
- Generate-and-Evaluate Mode: First generate model answers, then compare them with ground truth answers (see the example records sketched after this list).
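To make the difference concrete, the sketch below shows what an input record might look like in each mode. The answer and label keys follow the defaults used in Step 5; the question field name is an assumption and may differ in your dataset.

```python
# Illustrative records only; adjust field names to match your dataset.

# Direct Comparison Mode: the model's answer already exists in the data.
direct_record = {
    "question": "What is the molar mass of H2O?",  # optional in this mode
    "model_answer": "18.02 g/mol",                 # existing model output
    "golden_label": "18.02 g/mol",                 # ground-truth answer
}

# Generate-and-Evaluate Mode: only the question and ground truth are given;
# the pipeline first generates the model answer, then compares it.
generate_record = {
    "question": "What is the molar mass of H2O?",
    "golden_label": "18.02 g/mol",
}
```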
🧩 Step 1: Install the Evaluation Environment
```bash
cd DataFlow
pip install -e .
```

This installs DataFlow in editable mode, making it easier for local development and debugging.
📁 Step 2: Create and Enter the Workspace
```bash
mkdir workspace
cd workspace
```

All configuration files and cached evaluation data will be stored in this workspace directory.
⚙️ Step 3: Initialize the Evaluation Configuration
Run the following command to initialize the evaluation configuration:
```bash
dataflow init
```

After initialization, the directory structure will look like this:
```text
api_pipelines/
├── core_text_bencheval_semantic_pipeline.py                        # Evaluator for API models
├── core_text_bencheval_semantic_pipeline_question.py               # Evaluator for local models (requires question)
└── core_text_bencheval_semantic_pipeline_question_single_step.py   # Evaluator for local models (generate + evaluate)
```

🚀 Step 4: Run the Evaluation
Navigate to the api_pipelines folder:
```bash
cd api_pipelines
```

Select the corresponding script based on your evaluation mode:
| 🧩 Task Type | ❓ Requires Question | 🧠 Generates Answers | ▶️ Script to Run |
|---|---|---|---|
| Compare existing answers (no Question required) | ❌ | ❌ | core_text_bencheval_semantic_pipeline.py |
| Compare existing answers (requires Question) | ✅ | ❌ | core_text_bencheval_semantic_pipeline_question.py |
| Generate answers then compare (requires Question) | ✅ | ✅ | core_text_bencheval_semantic_pipeline_question_single_step.py |
Example:
```bash
python core_text_bencheval_semantic_pipeline_question_single_step.py
```

🗂️ Data Storage Configuration
Evaluation data paths are managed by FileStorage, which can be customized in the script:
```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/chemistry/matched_sample_10.json",
    cache_path="./cache_all_17_24_gpt_5",
    file_name_prefix="math_QA",
    cache_type="json",
)
```

- first_entry_file_name — Path to the evaluation dataset (e.g., the bundled example data)
- cache_path — Directory for caching intermediate evaluation results
- file_name_prefix — Prefix for cached files
- cache_type — File type for the cache (typically json)
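To run the evaluation on your own data, point first_entry_file_name at your dataset and use a fresh cache directory. The snippet below is a minimal sketch; the dataset path, cache directory, and prefix are placeholders:

```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/my_eval_set.json",  # placeholder path to your dataset
    cache_path="./cache_my_eval",                              # fresh directory for intermediate results
    file_name_prefix="my_eval",                                # placeholder prefix for cached files
    cache_type="json",
)
```

Using a separate cache_path per dataset keeps intermediate results from different runs from mixing.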
🧠 Step 5: Define Evaluation Keys
Specify the field mappings between model outputs and ground-truth labels:
```python
self.evaluator_step.run(
    storage=self.storage.step(),
    input_test_answer_key="model_answer",
    input_gt_answer_key="golden_label",
)
```

- input_test_answer_key — Key name for model-generated answers
- input_gt_answer_key — Key name for ground-truth answers
Make sure the field names match the corresponding keys in your dataset.
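For example, if your dataset stores answers under different field names, you can change these two keys instead of renaming the data. The names below ("prediction" and "reference") are purely illustrative:

```python
self.evaluator_step.run(
    storage=self.storage.step(),
    input_test_answer_key="prediction",  # field holding the model-generated answer in your data
    input_gt_answer_key="reference",     # field holding the ground-truth answer
)
```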

