Rollout Inference & Evaluation
2025-02-04
Overview
The Rollout module is the inference and evaluation engine of AgentFlow. It batch-executes Agent tasks on benchmark datasets and evaluates the results automatically. It exposes a layered API, from one-line quick calls to fully configurable pipelines, and supports multiple evaluation metrics, parallel execution, and complete conversation trajectory recording.
Quick Start
One-Liner with rollout()
Use the rollout() function to complete the entire inference and evaluation workflow in a single call:
from rollout import rollout
result = rollout(
    config_path="configs/rollout/rag_benchmark.json",
    data_path="benchmark/benchmark.jsonl"
)
Quick Single-Question Test with quick_rollout()
For a quick inference test on a single question, use quick_rollout():
from rollout import quick_rollout
result = quick_rollout(
    "What is the capital of France?",
    tools=["web:search", "web:browse"],
    model_name="gpt-4.1-2025-04-14",
    api_key="sk-xxx",
    base_url="https://api.openai.com/v1",
    max_turns=10,
    sandbox_url="http://127.0.0.1:18890"
)
print(result["answer"])      # The model's answer
print(result["success"])     # Whether the task completed successfully
print(result["trajectory"])  # Full conversation trajectory
Core API
rollout()
All-in-one inference and evaluation entry point. Loads configuration, executes tasks, evaluates results, and returns a summary dictionary.
def rollout(
    *,
    config_path: str,                      # Path to configuration file (JSON/YAML)
    data_path: Optional[str] = None,       # Override benchmark data path
    output_dir: Optional[str] = None,      # Override output directory
    model_name: Optional[str] = None,      # Override model name
    max_tasks: Optional[int] = None,       # Limit the number of tasks
    task_ids: Optional[List[str]] = None,  # Specific task IDs to run
    evaluate: bool = True,                 # Whether to evaluate results
    metric: Optional[str] = None,          # Override evaluation metric
) -> Dict[str, Any]
Returns: A RolloutSummary dictionary containing benchmark_name, total_tasks, successful_tasks, failed_tasks, average_score, metric, total_time_seconds, and other fields.
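For example, to re-run only a few specific tasks with a different metric (the config path and task IDs below are placeholders), the override parameters can be combined:
from rollout import rollout

# Re-run two tasks from an existing benchmark config and score them with exact_match
summary = rollout(
    config_path="configs/rollout/rag_benchmark.json",
    task_ids=["task_001", "task_002"],
    metric="exact_match",
)
print(summary["average_score"], summary["total_time_seconds"])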
quick_rollout()
Quick inference for a single question without a configuration file. Ideal for debugging and interactive testing.
def quick_rollout(
    question: str,                                # The question to answer
    *,
    tools: Optional[List[str]] = None,            # List of available tools
    model_name: str = "gpt-4.1-2025-04-14",       # Model name
    api_key: str = "",                            # API key
    base_url: str = "",                           # API base URL
    max_turns: int = 10,                          # Maximum conversation turns
    sandbox_url: str = "http://127.0.0.1:18890",  # Sandbox server URL
) -> Dict[str, Any]
Returns: A dictionary with question, answer, success, error, and trajectory fields.
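A minimal sketch of inspecting the returned dictionary; the error field is only meaningful on failure, and the trajectory structure is assumed here to mirror the trajectory object described under Output Format below:
result = quick_rollout(
    "What is the capital of France?",
    tools=["web:search"],
    api_key="sk-xxx",
    base_url="https://api.openai.com/v1",
)
if not result["success"]:
    print("Run failed:", result["error"])
else:
    print("Answer:", result["answer"])
    # Assumed here: the trajectory dict mirrors the trajectory object under Output Format
    print("Turns used:", result["trajectory"].get("total_turns"))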
load_config()
Load configuration from a JSON or YAML file.
def load_config(config_path: str) -> RolloutConfig
Supports .json, .yaml, and .yml formats.
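A short sketch of loading a config and tweaking it in code before a run; the attribute names follow the RolloutConfig parameter reference below, and attribute-style access is assumed because RolloutConfig is a dataclass:
from rollout import load_config

config = load_config("configs/rollout/rag_benchmark.json")
config.max_turns = 20                    # shorten runs while debugging
config.output_dir = "results/debug_run"  # write to a throwaway directory
print(config.model_name, config.evaluation_metric)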
load_tasks()
Load benchmark tasks from a file or list.
def load_tasks(
    tasks_or_path: Union[str, List[Dict[str, Any]]]
) -> List[BenchmarkItem]
Accepts a file path (JSONL/JSON) or a list of task dictionaries directly.
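For instance, a small in-memory task list can be passed directly instead of a file path (field names follow the benchmark data format described later on this page):
from rollout import load_tasks

tasks = load_tasks([
    {"id": "t1", "question": "What is the capital of France?", "answer": "Paris"},
    {"id": "t2", "question": "What is 2 + 2?", "answer": "4"},
])
print(len(tasks))  # 2 BenchmarkItem objects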
Full RolloutConfig Parameter Reference
RolloutConfig is the core configuration dataclass of the Rollout module. It supports loading from JSON/YAML files or creation from dictionaries.
I/O Paths
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | Optional[str] | None | Benchmark data file path (JSONL format) |
| output_dir | Optional[str] | None | Output directory for results |
Model Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "gpt-4.1-2025-04-14" | Name of the model to use |
| api_key | str | "" | Model API key |
| base_url | str | "" | Model API base URL |
Agent Execution Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_turns | int | 100 | Maximum conversation turns per task |
| max_retries | int | 3 | Maximum retries per LLM call |
| max_workers | int | 1 | Maximum number of parallel workers |
| available_tools | List[str] | [] | List of available tools (e.g., ["vm:screenshot", "web:search"]) |
System Prompt
| Parameter | Type | Default | Description |
|---|---|---|---|
| system_prompt | str | "" | Custom system prompt (uses built-in default if empty) |
| system_prompt_file | Optional[str] | None | Load system prompt from a file |
Evaluation Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| evaluate_results | bool | True | Whether to evaluate results |
| evaluation_metric | str | "exact_match" | Evaluation metric name |
| evaluator_model_name | Optional[str] | None | Model name for the evaluator (used with llm_judgement) |
| evaluator_api_key | Optional[str] | None | API key for the evaluator model |
| evaluator_base_url | Optional[str] | None | API base URL for the evaluator model |
| evaluator_temperature | float | 0.0 | Temperature for the evaluator model (range 0.0 - 2.0) |
| evaluator_max_retries | int | 3 | Maximum retries for evaluator calls |
| evaluator_extra_params | Dict[str, Any] | {} | Extra parameters for the evaluator |
Resource Configuration (Sandbox)
| Parameter | Type | Default | Description |
|---|---|---|---|
| resource_types | List[str] | [] | List of required resource types |
| resource_init_configs | Dict[str, Dict[str, Any]] | {} | Initialization configs for each resource type |
Sandbox Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| sandbox_server_url | str | "http://127.0.0.1:18890" | Sandbox server URL |
| sandbox_auto_start | bool | False | Whether to auto-start the sandbox server |
| sandbox_config_path | Optional[str] | None | Path to sandbox configuration file |
| sandbox_timeout | int | 120 | Sandbox operation timeout in seconds |
Benchmark Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| benchmark_name | Optional[str] | None | Name of the benchmark |
| number_of_tasks | Optional[int] | None | Limit the number of tasks to run (for testing) |
| task_ids | Optional[List[str]] | None | Specific task IDs to run |
| parallel | bool | False | Whether to enable parallel execution |
Result Saving
| Parameter | Type | Default | Description |
|---|---|---|---|
| save_results | bool | True | Whether to save inference results |
| save_trajectories | bool | True | Whether to save full conversation trajectories |
| trajectory_only | bool | False | Whether to save only the minimal trajectory payload |
| save_summary | bool | True | Whether to save the summary file (summary_<benchmark>_<timestamp>.json) |
Configuration File Example
Below is a complete JSON configuration file example:
{
  "data_path": "benchmark/rag_benchmark.jsonl",
  "output_dir": "results/rag_eval",
  "model_name": "gpt-4.1-2025-04-14",
  "api_key": "sk-your-api-key",
  "base_url": "https://api.openai.com/v1",
  "max_turns": 50,
  "max_retries": 3,
  "max_workers": 4,
  "available_tools": ["web:search", "web:browse", "doc:read"],
  "system_prompt": "",
  "system_prompt_file": "prompts/rag_system.txt",
  "evaluate_results": true,
  "evaluation_metric": "f1_score",
  "evaluator_model_name": "gpt-4.1-2025-04-14",
  "evaluator_api_key": "sk-your-api-key",
  "evaluator_base_url": "https://api.openai.com/v1",
  "evaluator_temperature": 0.0,
  "evaluator_max_retries": 3,
  "resource_types": ["web", "doc"],
  "resource_init_configs": {
    "doc": {
      "content": {
        "seed_path": "/data/documents"
      }
    }
  },
  "sandbox_server_url": "http://127.0.0.1:18890",
  "sandbox_auto_start": false,
  "sandbox_timeout": 120,
  "benchmark_name": "rag_benchmark_v1",
  "number_of_tasks": null,
  "task_ids": null,
  "parallel": true,
  "save_results": true,
  "save_trajectories": true,
  "trajectory_only": false,
  "save_summary": true
}
Evaluation Metrics
Rollout supports the following evaluation metrics, specified via the evaluation_metric configuration parameter; a short illustration of how several of the string-based metrics behave follows the table:
| Metric | Description | Use Case |
|---|---|---|
| exact_match | Exact match: scores 1.0 if the predicted answer is identical to the ground truth, 0.0 otherwise | Tasks with a single definitive answer, such as factual QA |
| f1_score | F1 score based on token-level precision and recall | Tasks where the answer may contain partially correct content |
| contains_answer | Checks whether the predicted answer contains the ground truth | Generative answers that need to include key information |
| numeric_match | Extracts and compares numeric values | Math calculations, statistical questions |
| llm_judgement | Uses an LLM as a judge to evaluate answer quality, returning a score from 0.0 to 1.0 | Open-ended QA, evaluations requiring semantic understanding |
| similarity | Text similarity based on string similarity algorithms | Scenarios requiring fuzzy matching |
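As a rough illustration of how the string-based metrics differ (a standalone sketch, not the module's own evaluator implementation), consider scoring the same prediction against one ground truth:
# Standalone illustration of metric behaviour; not the Rollout evaluator itself
def exact_match(pred: str, truth: str) -> float:
    return 1.0 if pred.strip() == truth.strip() else 0.0

def contains_answer(pred: str, truth: str) -> float:
    return 1.0 if truth.strip().lower() in pred.lower() else 0.0

def token_f1(pred: str, truth: str) -> float:
    p, t = pred.lower().split(), truth.lower().split()
    common = sum(min(p.count(w), t.count(w)) for w in set(p) & set(t))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

pred, truth = "The capital of France is Paris", "Paris"
print(exact_match(pred, truth))      # 0.0 -- not an identical string
print(contains_answer(pred, truth))  # 1.0 -- ground truth appears in the prediction
print(token_f1(pred, truth))         # ~0.29 -- one overlapping token out of six predicted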
LLM Judgement Configuration Example
When using the llm_judgement metric, you need to configure the evaluator model:
{
  "evaluation_metric": "llm_judgement",
  "evaluator_model_name": "gpt-4.1-2025-04-14",
  "evaluator_api_key": "sk-your-evaluator-key",
  "evaluator_base_url": "https://api.openai.com/v1",
  "evaluator_temperature": 0.0,
  "evaluator_max_retries": 3,
  "evaluator_extra_params": {
    "max_tokens": 1024
  }
}
Benchmark Data Format
Benchmark data uses JSONL (JSON Lines) format, with one task per line. Multiple field name aliases are supported for compatibility with different datasets.
Standard Format
{"id": "task_001", "question": "What is the capital of France?", "answer": "Paris"}
{"id": "task_002", "question": "What is the chemical formula for water?", "answer": "H2O"}
{"id": "task_003", "question": "Calculate 15 * 27", "answer": "405"}Field Description
| Field | Aliases | Required | Description |
|---|---|---|---|
| id | task_id | Yes | Unique task identifier |
| question | query, input | Yes | Question / input text |
| answer | ground_truth, expected | No | Ground truth answer (used for evaluation) |
| kwargs | - | No | Additional parameters passed to tools (e.g., seed_path) |
Any other fields are automatically placed into metadata.
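Benchmark files in this format are easy to generate programmatically; here is a small sketch that writes a few tasks to a JSONL file (the file path and task contents are placeholders):
import json

tasks = [
    {"id": "task_001", "question": "What is the capital of France?", "answer": "Paris"},
    {"id": "task_002", "question": "What is the chemical formula for water?", "answer": "H2O"},
]
with open("benchmark/my_benchmark.jsonl", "w", encoding="utf-8") as f:
    for task in tasks:
        f.write(json.dumps(task, ensure_ascii=False) + "\n")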
Example with Extra Parameters
{"id": "doc_001", "question": "What is the project deadline mentioned in the document?", "answer": "March 15, 2025", "kwargs": {"seed_path": "/data/docs/project_plan.pdf"}}Output Format
Inference Results File (JSONL)
Each task's inference result is saved as a single JSON line with the following fields:
{"task_id": "task_001", "question": "What is the capital of France?", "predicted_answer": "Paris", "success": true, "ground_truth": "Paris", "score": 1.0, "trajectory": {"task_id": "task_001", "question": "...", "messages": [...], "tool_calls": [...], "final_answer": "Paris", "total_turns": 3, "success": true, "start_time": "2025-02-04T10:00:00", "end_time": "2025-02-04T10:00:15", "execution_time_ms": 15000}}Result Field Description:
| Field | Type | Description |
|---|---|---|
| task_id | str | Task ID |
| question | str | Original question |
| predicted_answer | str | Model's predicted answer |
| success | bool | Whether the task completed successfully |
| ground_truth | str | Ground truth answer |
| score | float | Evaluation score |
| error | str | Error message (on failure) |
| trajectory | object | Full conversation trajectory (messages, tool calls, etc.) |
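The results file can be post-processed with a few lines of Python, for example to list failed tasks and compute the mean score (field names as in the table above; the file path is a placeholder):
import json

with open("results/rag_eval/results.jsonl", encoding="utf-8") as f:
    results = [json.loads(line) for line in f]

failed = [r["task_id"] for r in results if not r["success"]]
scores = [r["score"] for r in results if r.get("score") is not None]
print("Failed tasks:", failed)
print("Mean score:", sum(scores) / len(scores) if scores else 0.0)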
Evaluation Summary File (JSON)
After evaluation, a summary file summary_<benchmark>_<timestamp>.json is generated:
{
  "benchmark_name": "rag_benchmark_v1",
  "total_tasks": 100,
  "successful_tasks": 95,
  "failed_tasks": 5,
  "average_score": 0.82,
  "metric": "f1_score",
  "total_time_seconds": 3600.5,
  "results_file": "results/rag_eval/results.jsonl",
  "evaluation_file": "results/rag_eval/evaluation.json",
  "timestamp": "2025-02-04T10:30:00"
}
Complete Usage Workflow
Step 1: Start the Sandbox Service
Rollout relies on the sandbox service for tool execution. Before running inference, start the sandbox:
# Start the sandbox service (default port 18890)
python -m sandbox.server --port 18890
Alternatively, set sandbox_auto_start: true in the configuration to let Rollout start it automatically.
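If you are unsure whether the sandbox is already running, a generic TCP check against the configured host and port can confirm it before starting a run (this is plain Python, not part of the Rollout API):
import socket

# Returns True if something is listening on the sandbox port (default 18890)
def sandbox_reachable(host: str = "127.0.0.1", port: int = 18890, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(sandbox_reachable())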
Step 2: Prepare the Configuration File
Create a JSON configuration file (e.g., configs/rollout/my_benchmark.json):
{
  "data_path": "benchmark/my_benchmark.jsonl",
  "output_dir": "results/my_eval",
  "model_name": "gpt-4.1-2025-04-14",
  "api_key": "sk-your-api-key",
  "base_url": "https://api.openai.com/v1",
  "max_turns": 50,
  "available_tools": ["web:search", "web:browse"],
  "evaluate_results": true,
  "evaluation_metric": "f1_score",
  "resource_types": ["web"],
  "sandbox_server_url": "http://127.0.0.1:18890",
  "save_results": true,
  "save_trajectories": true,
  "save_summary": true
}
Step 3: Prepare Benchmark Data
Create a benchmark data file in JSONL format:
{"id": "q001", "question": "What tool types does AgentFlow support?", "answer": "web, doc, vm, code"}
{"id": "q002", "question": "How do you configure the sandbox timeout?", "answer": "Set the sandbox_timeout parameter"}Step 4: Run Inference and Evaluation
from rollout import rollout
# Option 1: Use configuration file (recommended)
summary = rollout(config_path="configs/rollout/my_benchmark.json")
# Option 2: Override specific parameters
summary = rollout(
    config_path="configs/rollout/my_benchmark.json",
    data_path="benchmark/another_dataset.jsonl",
    model_name="gpt-4.1-2025-04-14",
    max_tasks=10,  # Run 10 tasks first as a test
    evaluate=True,
    metric="exact_match"
)
# View results
print(f"Total tasks: {summary['total_tasks']}")
print(f"Successful tasks: {summary['successful_tasks']}")
print(f"Average score: {summary['average_score']}")
print(f"Results file: {summary['results_file']}")
Step 5: Review Results
After inference completes, the output directory contains the following files:
results/my_eval/
├── results.jsonl # Detailed inference results for each task
├── evaluation.json # Evaluation details
└── summary_my_benchmark_20250204T103000.json # Run summary
Advanced Usage: Programmatic Control
For finer-grained control, you can use the lower-level API directly:
from rollout import load_config, load_tasks
from rollout.core.runner import AgentRunner
import asyncio
async def custom_run():
    # Load config and tasks
    config = load_config("configs/rollout/my_benchmark.json")
    tasks = load_tasks("benchmark/my_benchmark.jsonl")

    # Create and start the Runner
    runner = AgentRunner(config, worker_id="custom_runner")
    await runner.start()
    try:
        for task in tasks:
            result = await runner.run_task(task)
            print(f"Task {result.task_id}: {result.predicted_answer}")
            print(f"  Score: {result.score}, Success: {result.success}")
    finally:
        await runner.stop()

asyncio.run(custom_run())
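Building on the loop above, per-task results can also be collected and written out manually. The following is a hand-rolled sketch that serializes the result fields documented under Output Format; for standard runs, the built-in save_results option already does this:
import json

def save_results_jsonl(results, path="results/custom_run.jsonl"):
    # Serialize only the documented scalar fields of each result returned by runner.run_task()
    with open(path, "w", encoding="utf-8") as f:
        for r in results:
            record = {
                "task_id": r.task_id,
                "predicted_answer": r.predicted_answer,
                "success": r.success,
                "score": r.score,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")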