Model QA Capability Assessment Pipeline
Note: only evaluation of data in QA-pair format is supported.
Quick Start
cd DataFlow
pip install -e .[eval]
cd ..
mkdir workspace
cd workspace
# Place the files you want to evaluate in the working directory
# Initialize evaluation configuration files
dataflow eval init
# IMPORTANT: You must modify the configuration files eval_api.py or eval_local.py
# By default, it finds the latest fine-tuned model and compares it with its base model
# Default evaluation method is semantic evaluation
# Evaluation metric is accuracy
dataflow eval api / dataflow eval local

Step 1: Install Evaluation Environment
Download evaluation environment
cd DataFlow
pip install -e .[eval]
cd ..

Step 2: Create and Enter DataFlow Working Directory
mkdir workspace
cd workspace

Step 3: Prepare Evaluation Data and Initialize Configuration Files

Initialize configuration files:

dataflow eval init

After initialization, the project directory structure becomes:
Project Root/
├── eval_api.py # Configuration file for API model evaluator
└── eval_local.py # Configuration file for local model evaluator

Step 4: Prepare Evaluation Data
Method 1: JSON Format
Please prepare a JSON file whose structure is similar to the example below:
[
{
"input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?",
"output": "Material PI-1 has high tensile strength between 85-105 MPa.\nPI-1 exhibits low melt viscosity below 300 Pa·s indicating good flowability.\n\nThe combination of its high tensile strength and low melt viscosity indicates that it can be easily processed without breaking during manufacturing."
}
]

In this example data:

- input is the question (it can also be the question and the answer choices merged into one input)
- output is the standard answer
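If your data is not yet in this shape, a few lines of Python are enough to produce it. The sketch below is only an illustration: the source list raw_records and the output file name qa_eval_data.json are hypothetical, and only the input/output key names are required by the pipeline.

# Minimal sketch: convert existing question/answer records into the expected JSON format.
# raw_records and qa_eval_data.json are hypothetical names, not required by DataFlow.
import json

raw_records = [
    {"question": "What properties indicate good processability of material PI-1?",
     "answer": "High tensile strength combined with low melt viscosity."},
]

qa_pairs = [{"input": r["question"], "output": r["answer"]} for r in raw_records]

with open("qa_eval_data.json", "w", encoding="utf-8") as f:
    json.dump(qa_pairs, f, ensure_ascii=False, indent=2)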
Method 2: Custom Field Mapping
You can also skip this preprocessing step, as long as your data already contains clearly identifiable question and standard-answer fields, and instead configure the field-name mapping in eval_api.py and eval_local.py:
EVALUATOR_RUN_CONFIG = {
"input_test_answer_key": "model_generated_answer", # Field name for model-generated answers
"input_gt_answer_key": "output", # Field name for standard answers (from original data)
"input_question_key": "input" # Field name for questions (from original data)
}
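As an illustration of how this mapping is applied, the sketch below shows a record with these field names; the values are hypothetical and only the key names matter. The keys mirror the comments above: the question and standard answer come from the original data, and the model-generated answer is stored under the field named by input_test_answer_key.

# Hypothetical data record matching the field mapping above (values are illustrative only).
example_record = {
    "input": "What properties indicate good processability of material PI-1?",   # question (from original data)
    "output": "High tensile strength combined with low melt viscosity.",         # standard answer (from original data)
    "model_generated_answer": "PI-1 processes well due to its low melt viscosity.",  # model-generated answer
}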
Step 5: Configure Parameters

If you want to use a local model as the evaluator, please modify the parameters in eval_local.py.
If you want to use an API model as the evaluator, please modify the parameters in eval_api.py.
# Target Models Configuration (same as API mode)
TARGET_MODELS = [
# Demonstrating all usage methods
# The following methods can be used in combination
# 1. Local path
# "./Qwen2.5-3B-Instruct",
# 2. HuggingFace path
# "Qwen/Qwen2.5-7B-Instruct"
# 3. Custom configuration
# Add more models...
# {
# "name": "llama_8b",
# "path": "meta-llama/Llama-3-8B-Instruct",
# "tensor_parallel_size": 2,
# "max_tokens": 2048,
# "gpu_memory_utilization": 0.9,
# # You can customize prompts for each model. If not specified, defaults to the template in build_prompt function.
# # Default prompt for evaluated models
# # IMPORTANT: This is the prompt for models being evaluated, NOT for the judge model!!!
# "answer_prompt": """please answer the questions:
# question:{question}
# answer:"""
# }
]
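For reference, a filled-in TARGET_MODELS that combines the three styles from the commented template might look like the sketch below; the specific models and GPU settings are examples only, taken from the comments above rather than required values.

# Example sketch: the three entry styles (local path, HuggingFace path, custom dict) can be mixed.
TARGET_MODELS = [
    # 1. Local path to a fine-tuned checkpoint
    "./Qwen2.5-3B-Instruct",
    # 2. HuggingFace path
    "Qwen/Qwen2.5-7B-Instruct",
    # 3. Custom configuration with explicit runtime settings
    {
        "name": "llama_8b",
        "path": "meta-llama/Llama-3-8B-Instruct",
        "tensor_parallel_size": 2,
        "max_tokens": 2048,
        "gpu_memory_utilization": 0.9,
    },
]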
Step 6: Run Evaluation

Run local evaluation:
dataflow eval local

Run API evaluation:
dataflow eval api
