Vision MCTS Reasoning Pipeline (API version)
2026-01-11
1. Overview
The Vision MCTS Reasoning Pipeline builds high-quality process supervision data for multimodal large models. It handles two kinds of input: samples that already carry Monte Carlo Tree Search (MCTS) trajectory data, and samples without a search tree, for which new reasoning chains are generated directly by a vision-language model (VLM).
This pipeline is a core tool for Grounded-RL and SFT Data Construction. It "linearizes" complex tree-like search processes into a <think>...</think><answer>...</answer> format that the model can learn from.
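As a rough illustration of what "linearizing" means here, the sketch below joins an ordered list of reasoning steps and a final answer into the tagged string format. The function name `linearize_chain` and the way steps are joined are assumptions made for illustration; the pipeline's own operators may format chains differently.

```python
from typing import List


def linearize_chain(steps: List[str], answer: str) -> str:
    """Join ordered reasoning steps and a final answer into one training string.

    Hypothetical helper: not part of DataFlow-MM.
    """
    # Drop empty steps and surrounding whitespace so the output stays compact.
    thought = " ".join(s.strip() for s in steps if s.strip())
    return f"<think>{thought}</think><answer>{answer.strip()}</answer>"


example = linearize_chain(
    ["Locate the red block.", "It must move right to clear the path."],
    "Move Red Block",
)
print(example)
```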
We support the following application scenarios:
- Data Extraction from MCTS Trees: Converts high-value paths (Rollouts) in the search tree into linear training data.
- Hybrid Data Construction: Automatically falls back to using the VLM for CoT generation for samples without a search tree.
- Spatial Reasoning Enhancement: Supports generating spatial reasoning chains that include explicit coordinates (Bounding Boxes).
The main process of the pipeline includes:
- MCTS Tree Parsing: Parses the search tree structure in the input data and extracts successful reasoning paths.
- Visual Reasoning Generation (Fallback): For samples with missing tree structures or failed parsing, the VLM is used to regenerate the reasoning chain.
- Data Standardization: Outputs reasoning chain data in a unified format.
2. Quick Start
Step 1: Create a New DataFlow Working Directory
mkdir run_mcts_reasoning
cd run_mcts_reasoning
Step 2: Initialize DataFlow-MM
dataflowmm init
You will then see:
api_pipelines/vision_mcts_api_pipeline.py
Step 3: Download Sample Data
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir ./example_data
Step 4: Configure API Key
Set your API Key environment variable in api_pipelines/vision_mcts_api_pipeline.py:
import os
os.environ["DF_API_KEY"] = "your_api_key"
Step 5: Configure Parameters
Configure the API service and input data paths in api_pipelines/vision_mcts_api_pipeline.py. Ensure the input file (jsonl) contains a tree field (for extraction) or just question/image (for generation).
pipe = VisionMCTSReasoningPipeline(
    first_entry_file="../example_data/capsbench_images/visual_mct_reasoning_demo.jsonl",
    prompt_type="spatial",
)
Step 6: Run with One Command
cd api_pipelines
python vision_mcts_api_pipeline.py
3. Data Flow & Logic
1. Input Data
Input data typically originates from MCTS search process logs, or unannotated image-text pairs:
- image: Path to the image.
- question: The visual question.
- tree (Optional): JSON structure of the MCTS search tree, containing node Values, Visits, and Actions.
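Before running the pipeline, it can help to check which route each record is likely to take. The sketch below is a hypothetical pre-flight check, not a DataFlow-MM utility: it assumes a non-empty `tree` object signals extraction, that `question` plus `image` is enough for generation, and that anything else is unusable.

```python
import json
from typing import Iterable, List, Tuple


def classify_record(record: dict) -> str:
    """Return 'extract', 'generate', or 'invalid' for one input record."""
    tree = record.get("tree")
    # Assumption: a non-empty dict under "tree" means extraction can be attempted.
    if isinstance(tree, dict) and tree:
        return "extract"
    # Without a tree, generation needs both the question and the image path.
    if record.get("question") and record.get("image"):
        return "generate"
    return "invalid"


def classify_jsonl(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Classify each non-blank JSONL line, reporting 1-based line numbers."""
    results = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            results.append((number, "invalid"))
            continue
        label = classify_record(record) if isinstance(record, dict) else "invalid"
        results.append((number, label))
    return results
```

To check a real file, pass an open file handle: `with open(path, encoding="utf-8") as f: print(classify_jsonl(f))`.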
Input Data Example:
{
  "image": "./images/puzzle.jpg",
  "question": "What is the next step to solve this?",
  "tree": { "root": { "children": [...], "value": 1.0, "text": "Step 1..." } }
}
2. Core Operator Logic
This pipeline uses a hybrid strategy of "Extraction First, Generation as Fallback":
A. MCTSTreeRefiner (Tree Structure Parser)
This operator handles the tree field. It traverses the tree structure and selects the best path from the root node to a leaf node based on each node's Q-value.
- Input: tree object.
- Function: Linearizes tree paths, filtering out low-value or incomplete search branches.
- Output: A list of extracted reasoning chains (mcts_chains).
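To make the idea of value-based path selection concrete, here is a minimal sketch that greedily follows the highest-value child from the root to a leaf. It is not the MCTSTreeRefiner implementation: the greedy rule, the `min_value` threshold, and the helper name `best_path` are assumptions, and the node fields (`children`, `value`, `text`) are taken from the input example above.

```python
from typing import List, Optional


def best_path(node: dict, min_value: float = 0.0) -> Optional[List[str]]:
    """Greedily follow the highest-value child from `node` down to a leaf.

    Returns the step texts along the path, or None if any node on the
    path has a value below `min_value`.
    """
    steps: List[str] = []
    current = node
    while True:
        if current.get("value", 0.0) < min_value:
            return None  # low-value branch: discard the whole path
        if current.get("text"):
            steps.append(current["text"])
        children = current.get("children") or []
        if not children:
            return steps  # reached a leaf
        # Pick the child with the highest value (missing values count as 0).
        current = max(children, key=lambda c: c.get("value", 0.0))


tree = {
    "root": {
        "text": "Step 1: find the red block.",
        "value": 1.0,
        "children": [
            {"text": "Step 2a: move it left.", "value": 0.2, "children": []},
            {"text": "Step 2b: move it right.", "value": 0.9, "children": []},
        ],
    }
}
print(best_path(tree["root"]))
```

A real refiner could also weigh visit counts or return several paths per sample; the greedy single-path version only shows the basic traversal.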
B. VisualReasoningGenerator (Visual Reasoning Generator)
This operator is the "generation engine" of the pipeline. It receives the extraction result from the previous step as input.
- Mechanism: Checks input_existing_chains_key (i.e., mcts_chains).
  - If MCTS parsing is successful (chain exists), it is reused directly without inference (saving computational resources).
  - If the MCTS chain is empty (tree does not exist or parsing failed), it calls the VLM to generate the reasoning chain from scratch based on prompt_type (e.g., spatial).
- Prompt Types: Supports modes like spatial (spatial coordinate reasoning) and logical (logical reasoning).
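The reuse-or-generate decision can be sketched in a few lines. This is an illustration of the control flow only, under the assumption that an empty list or a missing value marks a failed extraction; `resolve_chains` and `fake_vlm` are hypothetical names, and the stand-in generator replaces the real VLM call.

```python
from typing import Callable, List, Optional


def resolve_chains(
    existing: Optional[List[str]],
    generate: Callable[[], List[str]],
) -> List[str]:
    """Reuse extracted chains when present; otherwise call the generator."""
    if existing:  # non-empty list: extraction succeeded, no inference needed
        return existing
    return generate()  # None or []: fall back to the model


calls = []


def fake_vlm() -> List[str]:
    """Stand-in for a VLM request; records that it was invoked."""
    calls.append(1)
    return ["<think>generated</think><answer>A</answer>"]


reused = resolve_chains(["<think>extracted</think><answer>B</answer>"], fake_vlm)
generated = resolve_chains([], fake_vlm)
print(reused, generated, len(calls))
```

Passing the generator as a callable keeps the model call lazy, so samples with a usable tree never trigger a request.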
3. Output Data
The final output field (final_reasoning_chains) contains the chains of thought, whether extracted or generated, in a form that can be used directly for SFT training.
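When preparing SFT data, it is often useful to confirm that each chain follows the tagged format and to separate the reasoning from the answer. The sketch below is one way to do that with a regular expression; the helper name `split_chain` and the strict whole-string match are assumptions, not pipeline behavior.

```python
import re
from typing import Optional, Tuple

# Matches one <think>...</think><answer>...</answer> chain, across newlines.
CHAIN_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def split_chain(chain: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer), or None if the chain is malformed."""
    match = CHAIN_PATTERN.fullmatch(chain.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


print(split_chain("<think>Locate the red block.</think><answer>Move Red Block</answer>"))
print(split_chain("no tags here"))
```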
Output Example:
{
  "image": "./images/puzzle.jpg",
  "final_reasoning_chains": [
    "<think>First, locate the red block at [100, 200]. To solve the puzzle, it needs to move right...</think><answer>Move Red Block</answer>"
  ]
}
4. Pipeline Example
Below is the complete VisionMCTSReasoningPipeline code implementation (API Version).
import os

os.environ["DF_API_KEY"] = "sk-xxxx"

from dataflow.utils.storage import FileStorage
from dataflow.serving.local_model_vlm_serving import LocalModelVLMServing_vllm

# Import atomic operators
from dataflow.operators.core_text import MCTSTreeRefiner
from dataflow.operators.core_vision import VisualReasoningGenerator
from dataflow.serving.api_vlm_serving_openai import APIVLMServing_openai


class VisionMCTSReasoningPipeline:
    def __init__(
        self,
        first_entry_file: str,
        cache_path: str = "../cache/cache_mcts",
        file_name_prefix: str = "mcts_reason",
        # Config
        prompt_type: str = "spatial",
        max_samples_per_file: int = 10000,
        # Keys
        input_question_key: str = "question",
        input_image_key: str = "image",
        input_tree_key: str = "tree",
        output_key: str = "final_reasoning_chains",
    ):
        self.storage = FileStorage(
            first_entry_file_name=first_entry_file,
            cache_path=cache_path,
            file_name_prefix=file_name_prefix,
            cache_type="jsonl"
        )

        self.vlm_serving = APIVLMServing_openai(
            api_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # Any API platform compatible with OpenAI format
            model_name="gpt-4o-mini",
            image_io=None,
            send_request_stream=False,
            max_workers=10,
            timeout=1800
        )

        self.keys = {
            "q": input_question_key,
            "img": input_image_key,
            "tree": input_tree_key,
            "mcts_chains": "mcts_extracted_chains",
            "final": output_key
        }

        # ================== Operators ==================
        # 1. Refiner: MCTS -> Chains
        self.op_mcts_refine = MCTSTreeRefiner(
            max_chains_per_sample=max_samples_per_file
        )

        # 2. Generator: VLM -> Chains (Fallback)
        self.op_vlm_gen = VisualReasoningGenerator(
            serving=self.vlm_serving,
            prompt_type=prompt_type
        )

    def forward(self):
        print(">>> [Pipeline] Step 1: Extracting Chains from MCTS Trees...")
        self.op_mcts_refine.run(
            self.storage.step(),
            input_tree_key=self.keys["tree"],
            output_key=self.keys["mcts_chains"]
        )

        print(">>> [Pipeline] Step 2: Generating Chains via VLM (Fallback)...")
        # Pass mcts_chains as input_existing_chains_key.
        # If MCTS parsing succeeded, the chains are reused; otherwise the VLM generates them.
        self.op_vlm_gen.run(
            self.storage.step(),
            input_question_key=self.keys["q"],
            input_image_key=self.keys["img"],
            input_existing_chains_key=self.keys["mcts_chains"],
            output_key=self.keys["final"]
        )


if __name__ == "__main__":
    pipe = VisionMCTSReasoningPipeline(
        first_entry_file="../example_data/capsbench_images/visual_mct_reasoning_demo.jsonl",
        prompt_type="spatial",
    )
    pipe.forward()
