Model Capability Assessment Pipeline
2025-08-30
⚠️ Only QA-pair format evaluation is supported.
Quick Start
cd DataFlow
pip install -e .[eval]
cd ..
mkdir workspace
cd workspace
# Place the files you want to evaluate in the working directory
# Initialize evaluation configuration files
dataflow eval init
# IMPORTANT: You must modify the configuration files eval_api.py or eval_local.py
# By default, it finds the latest fine-tuned model and compares it with its base model
# Default evaluation method is semantic evaluation
# Evaluation metric is accuracy
dataflow eval api   # or: dataflow eval local

Step 1: Install Evaluation Environment
Install the evaluation environment:
cd DataFlow
pip install -e .[eval]
cd ..

Step 2: Create and Enter the DataFlow Working Directory
mkdir workspace
cd workspace

Step 3: Prepare Evaluation Data and Initialize Configuration Files
Initialize the configuration files:
dataflow eval init

💡 After initialization, the project directory structure becomes:
Project Root/
├── eval_api.py # Configuration file for API model evaluator
└── eval_local.py   # Configuration file for local model evaluator

Step 4: Prepare Evaluation Data
Method 1: JSON Format
Please prepare a JSON file with a data structure similar to the example below:
[
    {
        "input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?",
        "output": "Material PI-1 has high tensile strength between 85-105 MPa.\nPI-1 exhibits low melt viscosity below 300 Pa·s indicating good flowability.\n\nThe combination of its high tensile strength and low melt viscosity indicates that it can be easily processed without breaking during manufacturing."
    }
]

💡 In this example data:

- input is the question (it can also be the question plus answer choices merged into one input)
- output is the standard answer
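If your raw data comes from another source, a short conversion script can produce this structure. Below is a minimal sketch (a hypothetical helper, not part of DataFlow) that writes a list of QA pairs to qa.json using only the Python standard library:

import json

# Hypothetical conversion sketch: collect {"input", "output"} records
# and write them as a JSON array, matching the format shown above.
records = [
    {
        "input": "What properties indicate that material PI-1 has excellent processing characteristics during manufacturing processes?",
        "output": "Material PI-1 has high tensile strength between 85-105 MPa. ..."
    },
    # ...append more QA pairs here...
]

with open("qa.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)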
Method 2: Custom Field Mapping
You can also skip data preprocessing (as long as your data has clear question and standard-answer fields) and configure the field-name mapping in eval_api.py and eval_local.py:
EVALUATOR_RUN_CONFIG = {
    "input_test_answer_key": "model_generated_answer",  # Field name for model-generated answers
    "input_gt_answer_key": "output",                     # Field name for standard answers (from original data)
    "input_question_key": "input"                        # Field name for questions (from original data)
}
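For example, if your file uses different field names such as question and reference_answer (purely illustrative names, not required by DataFlow), the mapping simply points the evaluator at those fields:

# A record in your data file might look like (illustrative field names):
# {"question": "What is ...?", "reference_answer": "..."}

# Corresponding mapping in eval_api.py / eval_local.py:
EVALUATOR_RUN_CONFIG = {
    "input_test_answer_key": "model_generated_answer",  # Field name for model-generated answers
    "input_gt_answer_key": "reference_answer",           # Your standard-answer field
    "input_question_key": "question"                     # Your question field
}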
Step 5: Configure Parameters

Model Parameter Configuration
If you want to use a local model as the evaluator, please modify the parameters in the eval_local.py file.
If you want to use an API model as the evaluator, please modify the parameters in the eval_api.py file.
# Target Models Configuration (same as API mode)
TARGET_MODELS = [
    # Demonstrating all usage methods
    # The following methods can be used in combination
    # 1. Local path
    # "./Qwen2.5-3B-Instruct",
    # 2. HuggingFace path
    # "Qwen/Qwen2.5-7B-Instruct",
    # 3. Custom configuration
    {
        "name": "qwen_7b",                       # Model name
        "path": "./Qwen2.5-7B-Instruct",         # Model path
        # Large language models can use different parameters
        "vllm_tensor_parallel_size": 4,          # Number of GPUs
        "vllm_temperature": 0.1,                 # Randomness
        "vllm_top_p": 0.9,                       # Top-p sampling
        "vllm_max_tokens": 2048,                 # Maximum number of tokens
        "vllm_repetition_penalty": 1.0,          # Repetition penalty
        "vllm_seed": None,                       # Random seed
        "vllm_gpu_memory_utilization": 0.9,      # Maximum GPU memory utilization
        # A custom prompt can be defined for each model
        "answer_prompt": """please answer the following question:"""
    }
    # Add more models...
]
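As the commented examples above suggest, list entries can be plain path strings or full configuration dicts, and the styles can be mixed. A minimal sketch (paths and names are placeholders, not required models):

TARGET_MODELS = [
    "./Qwen2.5-3B-Instruct",        # 1. Local path
    "Qwen/Qwen2.5-7B-Instruct",     # 2. HuggingFace path
    {                               # 3. Custom configuration
        "name": "qwen_7b",
        "path": "./Qwen2.5-7B-Instruct",
        "vllm_tensor_parallel_size": 2,
        "vllm_temperature": 0.1
    }
]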
Bench Parameter Configuration

Benchmarks can be configured in batches:
BENCH_CONFIG = [
    {
        "name": "bench_name",                          # Benchmark name
        "input_file": "path_to_your_qa/qa.json",       # Data file
        "question_key": "input",                       # Question field name
        "reference_answer_key": "output",              # Reference answer field name
        "output_dir": "path/bench_name",               # Output directory
    },
    {
        "name": "other_bench_name",
        "input_file": "path_to_your_qa/other_qa.json",
        "question_key": "input",
        "reference_answer_key": "output",
        "output_dir": "path/other_bench_name",
    }
]
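Before launching a run, it can help to sanity-check that every benchmark file exists and actually contains the configured keys. The helper below is a hypothetical convenience script (not part of DataFlow), using only the Python standard library and the BENCH_CONFIG defined above:

import json
import os

def check_benchmarks(bench_config):
    # Warn about missing files or missing question/answer fields.
    for bench in bench_config:
        path = bench["input_file"]
        if not os.path.isfile(path):
            print(f"[{bench['name']}] missing file: {path}")
            continue
        with open(path, "r", encoding="utf-8") as f:
            records = json.load(f)
        for i, record in enumerate(records):
            for key in (bench["question_key"], bench["reference_answer_key"]):
                if key not in record:
                    print(f"[{bench['name']}] record {i} is missing field '{key}'")

check_benchmarks(BENCH_CONFIG)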
Step 6: Run Evaluation

Run local evaluation:
dataflow eval local

Run API evaluation:
dataflow eval api
