ContextVQA Multimodal QA Data Generation Pipeline (API Version)
2026-01-24
1. Overview
The ContextVQA Multimodal QA Data Generation Pipeline (API Version) is designed to automatically generate visual question answering data with external knowledge context (Context-based VQA) starting from an image. This pipeline uses a Vision-Language Model (VLM) via API to generate Wikipedia-style articles and QA pairs, which are then parsed into structured data. This is ideal for building knowledge-based VQA and multimodal RAG (Retrieval-Augmented Generation) datasets.
We support the following application scenarios:
- Knowledge-based VQA Data Synthesis: Constructing QA datasets that require external knowledge reasoning.
- Multimodal RAG Data Construction: Generating high-quality data for training retrieval-augmented generation models.
- Visual Reasoning Training: Generating questions that point to an image but require answers derived from textual context reasoning.
The main flow of the pipeline includes:
- Data Loading: Reading data files containing image paths.
- Context and QA Generation: Using a VLM API to generate Wikipedia-style articles and raw QA pairs based on images.
- Data Cleaning and Structuring: Parsing raw text to extract a structured `{context, qas}` format.
2. Quick Start
Step 1: Configure API Key
Set the API Key environment variable in your script:
```python
import os
os.environ["DF_API_KEY"] = "sk-xxx"
```

Step 2: Create a New DataFlow Work Folder

```shell
mkdir run_dataflow
cd run_dataflow
```

Step 3: Initialize DataFlow-MM

```shell
dataflowmm init
```

You will see the following file created:

```
api_pipelines/image_contextvqa.py
```

Step 4: Download Example Data

```shell
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir example_data
```
In `image_contextvqa.py`, configure the API service and the input data paths (no argparse is required; modify the default paths directly in the code):
```python
self.vlm_serving = APIVLMServing_openai(
    api_url="http://172.96.141.132:3001/v1",  # Any OpenAI-compatible API platform
    key_name_of_api_key="DF_API_KEY",         # Name of the API key env var set in Step 1
    model_name="gpt-5-nano-2025-08-07",
    image_io=None,
    send_request_stream=False,
    max_workers=10,
    timeout=1800
)
self.storage = FileStorage(
    first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
    cache_path="./cache_local",
    file_name_prefix="context_vqa",
    cache_type="json",
)
```

Step 6: One-Click Run

```shell
python api_pipelines/image_contextvqa.py
```

3. Data Flow and Pipeline Logic
1. Input Data
The input data for this process mainly includes the following fields:
- image: Path to the image file (local path or URL).
- id (Optional): Unique identifier for the data.
- conversation (Optional): Text in dialogue format used to supplement context generation.
Data is managed through FileStorage, which supports breakpoint resumption.
Input Data Example:
```json
[
    {
        "image": ["./example_data/image_contextvqa/person.png"],
        "conversation": [
            {
                "from": "human",
                "value": "Write a Wikipedia article related to this image without directly referring to the image..."
            }
        ]
    }
]
```

2. Core Operator Logic
This pipeline completes the task by chaining two core operators:
A. PromptedVQAGenerator (Context Generation)
This operator is responsible for calling the VLM API to generate raw text based on a prompt template.
Features:
- Generates a Wikipedia-style popular science article based on the image.
- Generates QA pairs based on the article.
- Prompt Constraints: Questions refer to the image but avoid naming the objects in it; answers come from the article rather than from objects visible in the image; answers are concise.
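As a concrete illustration, a prompt implementing these constraints might look like the following. The wording here is hypothetical; the pipeline's actual prompt is supplied via the `conversation` field of the input data.

```python
# Hypothetical prompt text illustrating the constraints above; the real
# prompt ships inside the example data's "conversation" field.
CONTEXT_VQA_PROMPT = (
    "Write a Wikipedia-style article related to this image without directly "
    "referring to the image. Then write 3 question-answer pairs such that: "
    "(1) each question points to the image but never names the objects in it; "
    "(2) each answer comes from the article, not from objects visible in the image; "
    "(3) each answer is concise (a few words at most)."
)
```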
Operator Execution:
```python
self.vqa_generator.run(
    storage=self.storage.step(),
    input_conversation_key="conversation",
    input_image_key="image",
    output_answer_key="vqa"
)
```

B. WikiQARefiner (Result Parsing)
This operator cleans the raw text generated by the VLM and converts it into a standard format.
Features:
- Cleans Markdown formatting and extra whitespace.
- Separates the article content (Context) from the QA pairs (QAs).
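For intuition, the cleaning and splitting step could be sketched as follows. This is a simplified stand-in; the actual `WikiQARefiner` implementation may use different heuristics.

```python
import re

def parse_wiki_qa(raw: str) -> dict:
    """Sketch of the refining step: strip Markdown markers, then split
    the article (context) from 'Question:'/'Answer:' pairs (qas)."""
    text = re.sub(r"[*#]+", "", raw).strip()  # drop Markdown emphasis/headers
    first_q = text.find("Question:")
    context = text[:first_q].strip() if first_q != -1 else text
    qas = [
        {"question": q.strip(), "answer": a.strip()}
        for q, a in re.findall(
            r"Question:\s*(.+?)\s*Answer:\s*(.+?)(?=Question:|$)", text, re.S
        )
    ]
    return {"context": context, "qas": qas}
```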
Operator Execution:
```python
self.refiner.run(
    storage=self.storage.step(),
    input_key="vqa",
    output_key="context_vqa"
)
```

3. Output Data
The final output data generated by the pipeline will contain:
- image: Original image path.
- vqa: Raw text generated by the VLM (intermediate result).
- context_vqa: Final structured result containing `context` (article) and `qas` (QA list).
Output Data Example:
```json
[
    {
        "image": ["./example_data/image_contextvqa/person.png"],
        "context_vqa": {
            "context": "**Wikipedia Article:** *Nightmare Alley* is a 2021 American psychological thriller...",
            "qas": [
                {
                    "question": "What genre does this film belong to?",
                    "answer": "Psychological thriller"
                }
            ]
        }
    }
]
```

4. Pipeline Example
Below is the complete ContextVQAPipeline implementation.
```python
import os
# Set API Key environment variable
os.environ["DF_API_KEY"] = "sk-xxx"

from dataflow.utils.storage import FileStorage
from dataflow.core import LLMServingABC
from dataflow.serving.api_vlm_serving_openai import APIVLMServing_openai
from dataflow.operators.core_vision import PromptedVQAGenerator
from dataflow.operators.core_vision import WikiQARefiner


class ContextVQAPipeline:
    """
    Generate batch ContextVQA data for images with a single command.
    """

    def __init__(self, llm_serving: LLMServingABC = None):
        # ---------- 1. Storage ----------
        self.storage = FileStorage(
            first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
            cache_path="./cache_local",
            file_name_prefix="context_vqa",
            cache_type="json",
        )
        # ---------- 2. Serving ----------
        self.vlm_serving = APIVLMServing_openai(
            api_url="http://172.96.141.132:3001/v1",
            key_name_of_api_key="DF_API_KEY",
            model_name="gpt-5-nano-2025-08-07",
            image_io=None,
            send_request_stream=False,
            max_workers=10,
            timeout=1800
        )
        # ---------- 3. Operators ----------
        self.vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are a helpful assistant."
        )
        self.refiner = WikiQARefiner()

    def forward(self):
        input_image_key = "image"
        output_answer_key = "vqa"
        output_wiki_key = "context_vqa"
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_conversation_key="conversation",
            input_image_key=input_image_key,
            output_answer_key=output_answer_key,
        )
        self.refiner.run(
            storage=self.storage.step(),
            input_key=output_answer_key,
            output_key=output_wiki_key
        )


if __name__ == "__main__":
    pipe = ContextVQAPipeline()
    pipe.forward()
```
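Once the pipeline finishes, each record nests a QA list inside `context_vqa`. A minimal post-processing sketch, assuming the JSON structure shown in the output example (`flatten_context_vqa` is a hypothetical helper, not part of DataFlow-MM), flattens it into one training row per QA pair:

```python
def flatten_context_vqa(records: list) -> list:
    """Flatten pipeline output into one row per QA pair, e.g. for
    training a retrieval-augmented VQA model."""
    rows = []
    for rec in records:
        ctx = rec["context_vqa"]
        for qa in ctx["qas"]:
            rows.append({
                "image": rec["image"][0],       # first image path of the record
                "context": ctx["context"],      # the generated article
                "question": qa["question"],
                "answer": qa["answer"],
            })
    return rows
```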
