easy_evaluation
2025-10-17
This guide explains how to use the DataFlow evaluation pipeline to assess model-generated answers against ground-truth answers using either semantic or exact match comparison.
Two evaluation modes are supported:
- Direct Comparison Mode: Compare existing model outputs with ground truth answers.
- Generate-and-Evaluate Mode: First generate model answers, then compare them with ground truth answers (see the example records sketched after this list).
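To make the difference concrete, the sketch below shows what an input record might look like in each mode. The answer and label keys follow the defaults used in Step 5; the question field name is an assumption and may differ in your dataset.

```python
# Illustrative records only; adjust field names to match your dataset.

# Direct Comparison Mode: the model's answer already exists in the data.
direct_record = {
    "question": "What is the molar mass of H2O?",  # optional in this mode
    "model_answer": "18.02 g/mol",                 # existing model output
    "golden_label": "18.02 g/mol",                 # ground-truth answer
}

# Generate-and-Evaluate Mode: only the question and ground truth are given;
# the pipeline first generates the model answer, then compares it.
generate_record = {
    "question": "What is the molar mass of H2O?",
    "golden_label": "18.02 g/mol",
}
```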
🧩 Step 1: Install the Evaluation Environment
```bash
cd DataFlow
pip install -e .
```

This installs DataFlow in editable mode, making it easier for local development and debugging.
📁 Step 2: Create and Enter the Workspace
```bash
mkdir workspace
cd workspace
```

All configuration files and cached evaluation data will be stored in this workspace directory.
⚙️ Step 3: Initialize the Evaluation Configuration
Run the following command to initialize the evaluation configuration:
```bash
dataflow init
```

After initialization, the directory structure will look like this:
```text
api_pipelines/
├── core_text_bencheval_semantic_pipeline.py                        # Evaluator for API models
├── core_text_bencheval_semantic_pipeline_question.py               # Evaluator for local models (requires question)
└── core_text_bencheval_semantic_pipeline_question_single_step.py   # Evaluator for local models (generate + evaluate)
```

🚀 Step 4: Run the Evaluation
Navigate to the api_pipelines folder:
```bash
cd api_pipelines
```

Select the corresponding script based on your evaluation mode:
| 🧩 Task Type | ❓ Requires Question | 🧠 Generates Answers | ▶️ Script to Run |
|---|---|---|---|
| Compare existing answers (no Question required) | ❌ | ❌ | core_text_bencheval_semantic_pipeline.py |
| Compare existing answers (requires Question) | ✅ | ❌ | core_text_bencheval_semantic_pipeline_question.py |
| Generate answers then compare (requires Question) | ✅ | ✅ | core_text_bencheval_semantic_pipeline_question_single_step.py |
Example:
```bash
python core_text_bencheval_semantic_pipeline_question_single_step.py
```

🗂️ Data Storage Configuration
Evaluation data paths are managed by FileStorage, which can be customized in the script:
```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/chemistry/matched_sample_10.json",
    cache_path="./cache_all_17_24_gpt_5",
    file_name_prefix="math_QA",
    cache_type="json",
)
```

- first_entry_file_name — Path to the evaluation dataset (e.g., the bundled example data)
- cache_path — Directory for caching intermediate evaluation results
- file_name_prefix — Prefix for cached files
- cache_type — File type for the cache (typically json)
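To run the evaluation on your own data, point first_entry_file_name at your dataset and use a fresh cache directory. The snippet below is a minimal sketch; the dataset path, cache directory, and prefix are placeholders:

```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/my_eval_set.json",  # placeholder path to your dataset
    cache_path="./cache_my_eval",                              # fresh directory for intermediate results
    file_name_prefix="my_eval",                                # placeholder prefix for cached files
    cache_type="json",
)
```

Using a separate cache_path per dataset keeps intermediate results from different runs from mixing.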
🧠 Step 5: Define Evaluation Keys
Specify the field mappings between model outputs and ground-truth labels:
```python
self.evaluator_step.run(
    storage=self.storage.step(),
    input_test_answer_key="model_answer",
    input_gt_answer_key="golden_label",
)
```

- input_test_answer_key — Key name for model-generated answers
- input_gt_answer_key — Key name for ground-truth answers
Make sure the field names match the corresponding keys in your dataset.
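For example, if your dataset stores answers under different field names, you can change these two keys instead of renaming the data. The names below ("prediction" and "reference") are purely illustrative:

```python
self.evaluator_step.run(
    storage=self.storage.step(),
    input_test_answer_key="prediction",  # field holding the model-generated answer in your data
    input_gt_answer_key="reference",     # field holding the ground-truth answer
)
```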

