ContextVQA Multimodal QA Data Generation Pipeline
2026-01-24
1. Overview
The ContextVQA Multimodal QA Data Generation Pipeline automatically generates context-based visual question answering (ContextVQA) data with external knowledge contexts, starting from images. The pipeline uses a Vision-Language Model (VLM) to generate a Wikipedia-style article related to each image, together with corresponding QA pairs, which are then parsed into structured data.
We support the following application scenarios:
- Knowledge-based VQA Data Synthesis: Building QA datasets that require external knowledge reasoning.
- Multimodal RAG Data Construction: Generating high-quality data for training Retrieval-Augmented Generation (RAG) systems.
- Visual Reasoning Training: Generating questions that point to the image but require answers reasoned from the textual context.
The main flow includes:
- Data Loading: Reading data files containing image paths.
- Context and QA Generation: Utilizing a locally deployed VLM to generate Wikipedia-style articles and raw QA pairs based on the image.
- Data Cleaning and Structuring: Parsing raw text to extract a structured {context, qas} format.
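The target of step 3 can be sketched as a plain Python dict; the field names below are taken from this pipeline's own output examples:

```python
# Illustrative shape of one parsed record (field names from this pipeline's output format).
record = {
    "image": ["./example_data/image_contextvqa/person.png"],
    "context_vqa": {
        "context": "Nightmare Alley is a 2021 American psychological thriller film...",
        "qas": [
            {
                "question": "What genre does this film belong to?",
                "answer": "Psychological thriller",
            },
        ],
    },
}

# Every QA pair should be answerable from the context alone.
assert all({"question", "answer"} <= set(qa) for qa in record["context_vqa"]["qas"])
```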
2. Quick Start
Step 1: Create a New DataFlow Work Folder
mkdir run_dataflow_mm
cd run_dataflow_mm
Step 2: Initialize DataFlow-MM
dataflowmm init
You will now see:
gpu_pipelines/context_vqa.py
Step 3: Download Example Data
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir example_data
Step 4: Configure Model and Data Paths
Modify the class initialization parameters directly in context_vqa.py (no longer passed via command line arguments):
# Model Serving Configuration
self.vlm_serving = LocalModelVLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-VL-3B-Instruct",
    hf_cache_dir="~/.cache/huggingface",
    hf_local_dir="./ckpt",
    vllm_tensor_parallel_size=1,
    vllm_max_tokens=512,
)

# Data Storage Configuration
self.storage = FileStorage(
    first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
    cache_path="./cache_local",
    file_name_prefix="context_vqa",
    cache_type="json",
)
Step 5: One-Click Run
python gpu_pipelines/context_vqa.py
3. Data Flow and Pipeline Logic
1. Input Data
Input data is managed through FileStorage, which supports resuming an interrupted run from cached intermediate results (breakpoint resumption).
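FileStorage's resumption mechanism is internal to DataFlow and not shown in this document. As a rough illustration only (a hypothetical helper, not the library's API), cache-based resumption amounts to skipping any step whose output file already exists in the cache directory:

```python
import json
import os

def run_step(step_name: str, cache_path: str, compute):
    """Hypothetical sketch: reuse a cached step output if present,
    otherwise compute it and write it to the cache."""
    out_file = os.path.join(cache_path, f"context_vqa_{step_name}.json")
    if os.path.exists(out_file):  # resume: this step already completed
        with open(out_file) as f:
            return json.load(f)
    result = compute()  # first run: do the expensive work
    os.makedirs(cache_path, exist_ok=True)
    with open(out_file, "w") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    return result
```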
Input Data Example (sample_data.json):
[
  {
    "image": ["./example_data/image_contextvqa/person.png"],
    "conversation": [
      {
        "from": "human",
        "value": "Write a Wikipedia article related to this image without directly referring to the image. Then write question answer pairs..."
      }
    ]
  }
]
2. Core Operator Logic
A. PromptedVQAGenerator (Context Generation)
This operator calls the local VLM model to generate raw text based on built-in Wikipedia-style prompt templates.
Operator Execution:
self.vqa_generator.run(
    storage=self.storage.step(),
    input_conversation_key="conversation",
    input_image_key=input_image_key,
    output_answer_key=output_answer_key,
)
B. WikiQARefiner (Result Parsing)
This operator cleans the unstructured text generated by the VLM and converts it into a standard format, separating the article content (Context) from the question-answer pairs (QAs).
Operator Execution:
self.refiner.run(
    storage=self.storage.step(),
    input_key="vqa",          # Raw text from the previous step
    output_key="context_vqa"  # Final structured data
)
3. Output Data
The final structured data includes context (article) and qas (list of questions and answers).
Output Data Example:
{
  "id": 1,
  "image": ["./example_data/image_contextvqa/person.png"],
  "context_vqa": {
    "context": "Nightmare Alley is a 2021 American psychological thriller film...",
    "qas": [
      {
        "question": "What genre does this film belong to?",
        "answer": "Psychological thriller"
      }
    ]
  }
}
4. Pipeline Example
Below is the complete ContextVQAPipeline code implementation.
from dataflow.utils.storage import FileStorage
from dataflow.core import LLMServingABC
from dataflow.serving.local_model_vlm_serving import LocalModelVLMServing_vllm
from dataflow.operators.core_vision import PromptedVQAGenerator, WikiQARefiner


class ContextVQAPipeline:
    """
    Batch generate ContextVQA data for images with a single command.
    """

    def __init__(self, llm_serving: LLMServingABC = None):
        # ---------- 1. Storage ----------
        self.storage = FileStorage(
            first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
            cache_path="./cache_local",
            file_name_prefix="context_vqa",
            cache_type="json",
        )
        # ---------- 2. Serving ----------
        self.vlm_serving = LocalModelVLMServing_vllm(
            hf_model_name_or_path="Qwen/Qwen2.5-VL-3B-Instruct",
            hf_cache_dir="~/.cache/huggingface",
            hf_local_dir="./ckpt",
            vllm_tensor_parallel_size=1,
            vllm_temperature=0.7,
            vllm_top_p=0.9,
            vllm_max_tokens=512,
        )
        # ---------- 3. Operator ----------
        self.vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are a helpful assistant.",
        )
        self.refiner = WikiQARefiner()

    # ------------------------------------------------------------------ #
    def forward(self):
        input_image_key = "image"
        output_answer_key = "vqa"
        output_wiki_key = "context_vqa"
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_conversation_key="conversation",
            input_image_key=input_image_key,
            output_answer_key=output_answer_key,
        )
        self.refiner.run(
            storage=self.storage.step(),
            input_key=output_answer_key,
            output_key=output_wiki_key,
        )


# ---------------------------- CLI Entry ------------------------------- #
if __name__ == "__main__":
    pipe = ContextVQAPipeline()
    pipe.forward()
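The internals of WikiQARefiner are not shown above. As an illustration only (a hypothetical parser, not the library's implementation), splitting raw VLM text of the form "article text... Question: ... Answer: ..." into the {context, qas} structure could look like:

```python
import re

def parse_wiki_qa(raw_text: str) -> dict:
    """Hypothetical sketch: split raw VLM output into an article ("context")
    and a list of question/answer pairs ("qas")."""
    # Everything before the first "Question:" marker is treated as the article.
    first_q = raw_text.find("Question:")
    context = raw_text[:first_q].strip() if first_q != -1 else raw_text.strip()
    # Each "Question: ... Answer: ..." span becomes one QA pair.
    qas = [
        {"question": q.strip(), "answer": a.strip()}
        for q, a in re.findall(
            r"Question:\s*(.*?)\s*Answer:\s*(.*?)(?=Question:|$)", raw_text, re.S
        )
    ]
    return {"context": context, "qas": qas}
```

A robust refiner would also need to handle model outputs that deviate from this template (missing markers, numbered questions, markdown headers), which is exactly the cleaning work the operator encapsulates.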
